Overview
Discover all structured data on a site by crawling it. The crawler follows internal links to find pages that might not be in the sitemap, validates each page’s schema, and produces a comprehensive site-wide report.
How It Works
1. User submits a starting URL and optional crawl depth
2. ValidGraph starts an asynchronous crawl job
3. The crawler:
   - Follows internal links up to the configured depth
   - Extracts and validates JSON-LD on each page
   - Respects robots.txt directives
4. Results are aggregated into a site-wide report
5. User is notified when the crawl completes
Tier Availability
| Tier | Available |
|------|-----------|
| Free | No |
| Pro | No |
| Agency | Yes |
| Enterprise | Yes |
Related Features
- Sitemap Auto-Import: Alternative URL discovery method
- Bulk CSV Validation: Manual URL list validation
- Site-Wide Score: Score computed from crawl results
Mini-Tutorial
Step 1: Start a Crawl
Navigate to Discovery > Site Crawler and enter your starting URL (e.g., https://example.com).
Step 2: Configure Crawl Settings
- Max Depth: How many levels of internal links to follow (1-5). Deeper crawls take longer.
- Max Pages: Limits crawl size (Agency: up to 200 pages; Enterprise: up to 1,000 pages).
- Respect Robots.txt: Always enabled; the crawler honors disallow rules.
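Client-side, the settings above reduce to a small validation step before submitting the job. A minimal sketch, assuming you want to pre-check limits locally (the function name and tier keys are illustrative, not part of the ValidGraph API; only the numeric caps come from this page):

```python
# Hypothetical helper: clamp requested crawl settings to a plan's limits.
# Caps come from the docs above (Agency: 200 pages, Enterprise: 1,000).
TIER_MAX_PAGES = {"agency": 200, "enterprise": 1000}
MAX_DEPTH_RANGE = range(1, 6)  # Max Depth accepts 1-5

def build_crawl_settings(tier: str, max_depth: int, max_pages: int) -> dict:
    if tier not in TIER_MAX_PAGES:
        raise ValueError(f"site crawling is not available on the {tier!r} tier")
    if max_depth not in MAX_DEPTH_RANGE:
        raise ValueError("max_depth must be between 1 and 5")
    return {
        "max_depth": max_depth,
        # Clamp to the plan cap rather than failing the request.
        "max_pages": min(max_pages, TIER_MAX_PAGES[tier]),
        "respect_robots_txt": True,  # always on; cannot be disabled
    }
```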
Step 3: Submit
Click “Start Crawl.” The job enters a queue and begins asynchronously.
Step 4: Monitor Progress
The status page updates in real time, showing pages discovered and validated so far.
Step 5: Review Results
When complete, see a site-wide report with total pages, schema types found, and common errors.
Technical Details
Start Crawl Request
POST /api/v1/crawl-site
```json
{
  "url": "https://example.com",
  "max_depth": 2,
  "max_pages": 200,
  "respect_robots_txt": true
}
```
Response:

```json
{
  "crawl_id": "crawl_xyz789",
  "status": "queued",
  "created_at": "2025-03-22T14:30:00Z",
  "estimated_pages": 150
}
```
Check Crawl Status
GET /api/v1/crawl-site/crawl_xyz789/status
Response (in progress):
```json
{
  "crawl_id": "crawl_xyz789",
  "status": "in_progress",
  "progress": {
    "pages_crawled": 45,
    "pages_validating": 12,
    "pages_queue": 93
  },
  "elapsed_time_seconds": 180,
  "estimated_remaining": 300
}
```
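Because the crawl runs asynchronously, clients typically poll this status endpoint until the job leaves `in_progress`. A minimal polling loop, sketched with an injected `fetch_status` callable so it works with any HTTP client (the timeout values and the `"failed"` terminal state are assumptions, not documented here):

```python
import time
from typing import Callable

def wait_for_crawl(fetch_status: Callable[[], dict],
                   poll_interval: float = 5.0,
                   max_wait: float = 1800.0) -> dict:
    """Poll GET /api/v1/crawl-site/{crawl_id}/status until the job finishes.

    fetch_status should return the parsed JSON status document shown above.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        status = fetch_status()
        if status["status"] in ("completed", "failed"):
            return status
        done = status["progress"]["pages_crawled"]
        pending = status["progress"]["pages_queue"]
        print(f"crawled {done} pages, {pending} still queued")
        time.sleep(poll_interval)
    raise TimeoutError("crawl did not finish within max_wait seconds")
```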
Response (completed):
```json
{
  "crawl_id": "crawl_xyz789",
  "status": "completed",
  "summary": {
    "total_pages": 150,
    "pages_with_schema": 128,
    "pages_with_errors": 22,
    "avg_score": 81.3
  },
  "type_distribution": {
    "Article": 85,
    "Product": 28,
    "Organization": 15
  },
  "common_errors": [
    {
      "error": "missing aggregateRating",
      "count": 18,
      "affected_pages": 18
    }
  ],
  "completed_at": "2025-03-22T14:50:00Z"
}
```
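The completed report can be reduced to a few headline numbers, e.g. schema coverage and error rate as percentages of the pages crawled. A small sketch of that derivation (the helper and output keys are illustrative, not part of the API):

```python
def summarize_report(report: dict) -> dict:
    """Derive headline percentages from a completed crawl report."""
    s = report["summary"]
    total = s["total_pages"]
    return {
        # e.g. 128 of 150 pages carrying schema -> 85.3% coverage
        "schema_coverage_pct": round(100 * s["pages_with_schema"] / total, 1),
        "error_rate_pct": round(100 * s["pages_with_errors"] / total, 1),
        # Most common schema type found during the crawl.
        "top_type": max(report["type_distribution"],
                        key=report["type_distribution"].get),
    }
```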
References
- Robots.txt Specification
- Web Crawling Best Practices
- Schema.org HTML Embedding
- ValidGraph Crawler Documentation