Overview
Discover all structured data on a site by crawling it. The crawler follows internal links to find pages that might not be in the sitemap, validates each page’s schema, and produces a comprehensive site-wide report.
How It Works
1. User submits a starting URL and optional crawl depth
2. ValidGraph starts an asynchronous crawl job
3. The crawler:
   - Follows internal links up to the configured depth
   - Extracts and validates JSON-LD on each page
   - Respects robots.txt directives
4. Results are aggregated into a site-wide report
5. User is notified when the crawl completes
Tier Availability
| Tier | Available |
|------|-----------|
| Free | No |
| Pro | No |
| Agency | Yes |
| Enterprise | Yes |
Related Features
- Sitemap Auto-Import: Alternative URL discovery method
- Bulk CSV Validation: Manual URL list validation
- Site-Wide Score: Score computed from crawl results
Mini-Tutorial
Step 1: Start a Crawl
Navigate to Discovery > Site Crawler and enter your starting URL (e.g., https://example.com).
Step 2: Configure Crawl Settings
- Max Depth: How many levels of internal links to follow (1-5). Deeper crawls take longer.
- Max Pages: Limits crawl size (Agency: up to 200 pages; Enterprise: up to 1,000 pages).
- Respect Robots.txt: Always enabled; the crawler honors disallow rules.
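Client-side, the settings above reduce to a small validation step before submitting the job. A minimal sketch, assuming you want to pre-check limits locally (the function name and tier keys are illustrative, not part of the ValidGraph API; only the numeric caps come from this page):

```python
# Hypothetical helper: clamp requested crawl settings to a plan's limits.
# Caps come from the docs above (Agency: 200 pages, Enterprise: 1,000).
TIER_MAX_PAGES = {"agency": 200, "enterprise": 1000}
MAX_DEPTH_RANGE = range(1, 6)  # Max Depth accepts 1-5

def build_crawl_settings(tier: str, max_depth: int, max_pages: int) -> dict:
    if tier not in TIER_MAX_PAGES:
        raise ValueError(f"site crawling is not available on the {tier!r} tier")
    if max_depth not in MAX_DEPTH_RANGE:
        raise ValueError("max_depth must be between 1 and 5")
    return {
        "max_depth": max_depth,
        # Clamp to the plan cap rather than failing the request.
        "max_pages": min(max_pages, TIER_MAX_PAGES[tier]),
        "respect_robots_txt": True,  # always on; cannot be disabled
    }
```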
Step 3: Submit
Click “Start Crawl.” The job enters a queue and begins asynchronously.
Step 4: Monitor Progress
The status page updates in real time, showing pages discovered and validated so far.
Step 5: Review Results
When complete, see a site-wide report with total pages, schema types found, and common errors.
Technical Details
Start Crawl Request
POST /api/v1/crawl-site
```json
{
  "url": "https://example.com",
  "max_depth": 2,
  "max_pages": 200,
  "respect_robots_txt": true
}
```
Response:

```json
{
  "crawl_id": "crawl_xyz789",
  "status": "queued",
  "created_at": "2025-03-22T14:30:00Z",
  "estimated_pages": 150
}
```
Check Crawl Status
GET /api/v1/crawl-site/crawl_xyz789/status
Response (in progress):
```json
{
  "crawl_id": "crawl_xyz789",
  "status": "in_progress",
  "progress": {
    "pages_crawled": 45,
    "pages_validating": 12,
    "pages_queue": 93
  },
  "elapsed_time_seconds": 180,
  "estimated_remaining": 300
}
```
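Because the crawl runs asynchronously, clients typically poll this status endpoint until the job leaves `in_progress`. A minimal polling loop, sketched with an injected `fetch_status` callable so it works with any HTTP client (the timeout values and the `"failed"` terminal state are assumptions, not documented here):

```python
import time
from typing import Callable

def wait_for_crawl(fetch_status: Callable[[], dict],
                   poll_interval: float = 5.0,
                   max_wait: float = 1800.0) -> dict:
    """Poll GET /api/v1/crawl-site/{crawl_id}/status until the job finishes.

    fetch_status should return the parsed JSON status document shown above.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        status = fetch_status()
        if status["status"] in ("completed", "failed"):
            return status
        done = status["progress"]["pages_crawled"]
        pending = status["progress"]["pages_queue"]
        print(f"crawled {done} pages, {pending} still queued")
        time.sleep(poll_interval)
    raise TimeoutError("crawl did not finish within max_wait seconds")
```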
Response (completed):
```json
{
  "crawl_id": "crawl_xyz789",
  "status": "completed",
  "summary": {
    "total_pages": 150,
    "pages_with_schema": 128,
    "pages_with_errors": 22,
    "avg_score": 81.3
  },
  "type_distribution": {
    "Article": 85,
    "Product": 28,
    "Organization": 15
  },
  "common_errors": [
    {
      "error": "missing aggregateRating",
      "count": 18,
      "affected_pages": 18
    }
  ],
  "completed_at": "2025-03-22T14:50:00Z"
}
```
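The completed report can be reduced to a few headline numbers, e.g. schema coverage and error rate as percentages of the pages crawled. A small sketch of that derivation (the helper and output keys are illustrative, not part of the API):

```python
def summarize_report(report: dict) -> dict:
    """Derive headline percentages from a completed crawl report."""
    s = report["summary"]
    total = s["total_pages"]
    return {
        # e.g. 128 of 150 pages carrying schema -> 85.3% coverage
        "schema_coverage_pct": round(100 * s["pages_with_schema"] / total, 1),
        "error_rate_pct": round(100 * s["pages_with_errors"] / total, 1),
        # Most common schema type found during the crawl.
        "top_type": max(report["type_distribution"],
                        key=report["type_distribution"].get),
    }
```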
References
- Robots.txt Specification
- Web Crawling Best Practices
- Schema.org HTML Embedding
- ValidGraph Crawler Documentation