POST /v1/crawl

Start SmartCrawler
curl --request POST \
  --url https://api.example.com/v1/crawl \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://scrapegraphai.com/",
    "prompt": "What does the company do?"
  }'

Start Crawl

Start a new crawl job using SmartCrawler. Choose between AI-powered extraction and cost-effective markdown conversion.

Request Body

Content-Type: application/json

Schema

{
  "url": "string",
  "prompt": "string",
  "extraction_mode": "boolean",
  "cache_website": "boolean",
  "depth": "integer",
  "breadth": "integer",
  "max_pages": "integer",
  "same_domain_only": "boolean",
  "batch_size": "integer",
  "schema": { /* JSON Schema object */ },
  "rules": {
    "exclude": ["string"],
    "include_paths": ["string"],
    "exclude_paths": ["string"],
    "same_domain": "boolean"
  },
  "sitemap": "boolean",
  "render_heavy_js": "boolean",
  "stealth": "boolean"
  "webhook_url": str
}

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | - | The starting URL for the crawl |
| prompt | string | No* | - | Instructions for data extraction (*required when extraction_mode=true) |
| extraction_mode | boolean | No | true | When false, enables markdown conversion mode (no AI/LLM processing, 2 credits per page) |
| cache_website | boolean | No | false | Whether to cache the website content |
| depth | integer | No | 1 | Maximum crawl depth |
| breadth | integer | No | null | Maximum number of links to crawl per depth level; null or omitted means unlimited. Controls the "width" of exploration at each depth, useful for limiting crawl scope on large sites. Note: max_pages always takes priority. Ignored when sitemap=true. |
| max_pages | integer | No | 10 | Maximum number of pages to crawl |
| same_domain_only | boolean | No | true | Whether to crawl only the same domain |
| batch_size | integer | No | 1 | Number of pages to process in each batch |
| schema | object | No | - | JSON Schema object for structured output |
| rules | object | No | - | Crawl rules for filtering URLs, with optional fields: exclude (array of regex URL patterns), include_paths (array of path patterns to include; supports wildcards * and **), exclude_paths (array of path patterns to exclude; takes precedence over include_paths), and same_domain (boolean, default true). See the Rules section below for details. |
| sitemap | boolean | No | false | Use sitemap.xml for URL discovery |
| render_heavy_js | boolean | No | false | Enable heavy JavaScript rendering |
| stealth | boolean | No | false | Enable stealth mode to bypass bot protection using advanced anti-detection techniques. Adds 4 credits to the request cost. |
| webhook_url | string | No | null | Webhook URL to send the job result to. When provided, a signed webhook notification is sent upon job completion. See Webhook Signature Verification below. |

Example

{
  "url": "https://scrapegraphai.com/",
  "prompt": "What does the company do? and I need text content from there privacy and terms",
  "cache_website": true,
  "depth": 2,
  "breadth": null,
  "max_pages": 2,
  "same_domain_only": true,
  "batch_size": 1,
  "stealth": true,
  "rules": {
    "include_paths": ["/privacy", "/terms", "/about"],
    "exclude_paths": ["/admin/*", "/api/**"],
    "same_domain": true
  },
  "webhook_url":"https://api.example.com/handle/webhook"
  "schema": {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "ScrapeGraphAI Website Content",
    "type": "object",
    "properties": {
      "company": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "description": { "type": "string" },
          "features": {
            "type": "array",
            "items": { "type": "string" }
          },
          "contact_email": { "type": "string", "format": "email" },
          "social_links": {
            "type": "object",
            "properties": {
              "github": { "type": "string", "format": "uri" },
              "linkedin": { "type": "string", "format": "uri" },
              "twitter": { "type": "string", "format": "uri" }
            },
            "additionalProperties": false
          }
        },
        "required": ["name", "description"]
      },
      "services": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "service_name": { "type": "string" },
            "description": { "type": "string" },
            "features": {
              "type": "array",
              "items": { "type": "string" }
            }
          },
          "required": ["service_name", "description"]
        }
      },
      "legal": {
        "type": "object",
        "properties": {
          "privacy_policy": { "type": "string" },
          "terms_of_service": { "type": "string" }
        },
        "required": ["privacy_policy", "terms_of_service"]
      }
    },
    "required": ["company", "services", "legal"]
  }
}

Markdown Conversion Example (No AI/LLM)

For cost-effective HTML to markdown conversion without AI processing:
{
  "url": "https://scrapegraphai.com/",
  "extraction_mode": false,
  "depth": 2,
  "breadth": null,
  "max_pages": 5,
  "same_domain_only": true,
  "stealth": true,
  "rules": {
    "include_paths": ["/docs/*", "/blog/**"],
    "exclude_paths": ["/admin/*", "/api/**"],
    "same_domain": true
  },
  "webhook_url":"https://api.example.com/handle/webhook"
}

Rules Example

Control which URLs to crawl using the rules object:
{
  "url": "https://example.com/",
  "prompt": "Extract product information",
  "depth": 3,
  "breadth": null,
  "max_pages": 50,
  "rules": {
    "exclude": ["https://example.com/logout", "https://example.com/admin/*"],
    "include_paths": ["/products/*", "/blog/**"],
    "exclude_paths": ["/admin/*", "/api/**", "/private/*"],
    "same_domain": true
  }
}

Rules Parameters

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| exclude | array | [] | List of URL patterns (regex) to exclude from crawling. Matched against the full URL. |
| include_paths | array | [] | Optional list of path patterns to include (e.g., ["/products/*", "/blog/**"]). Supports wildcards: * matches any characters, ** matches any path segments. If empty or not specified, all paths are included. |
| exclude_paths | array | [] | Optional list of path patterns to exclude (e.g., ["/admin/*", "/api/**"]). Supports the same wildcards and takes precedence over include_paths. If empty or not specified, no paths are excluded. |
| same_domain | boolean | true | Restrict crawling to the same domain |
Both include_paths and exclude_paths are optional: if include_paths is empty or unset, all paths are included; if exclude_paths is empty or unset, no paths are excluded; and exclude_paths patterns always take precedence over include_paths patterns.
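The actual matching happens server-side, but a minimal Python sketch of the documented semantics may help. It assumes the common convention that * stays within a single path segment while ** spans segments, and that exclude entries are regexes tested against the full URL:

import re
from urllib.parse import urlparse

def _to_regex(pattern: str) -> re.Pattern:
    # Assumption: '**' spans path segments, '*' matches within one segment.
    escaped = re.escape(pattern).replace(r"\*\*", ".*").replace(r"\*", "[^/]*")
    return re.compile(rf"^{escaped}$")

def should_crawl(url: str, rules: dict) -> bool:
    # 'exclude' entries are regexes matched against the full URL.
    if any(re.search(p, url) for p in rules.get("exclude", [])):
        return False
    path = urlparse(url).path or "/"
    # exclude_paths takes precedence over include_paths.
    if any(_to_regex(p).match(path) for p in rules.get("exclude_paths", [])):
        return False
    # Empty or missing include_paths means every path is allowed.
    include = rules.get("include_paths", [])
    return not include or any(_to_regex(p).match(path) for p in include)

For example, should_crawl("https://example.com/products/shoes", {"include_paths": ["/products/*"]}) returns True, while any /admin/ URL is rejected once "/admin/*" appears in exclude_paths.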
When extraction_mode is false, the prompt parameter is not required. This mode converts HTML to clean markdown with metadata extraction at only 2 credits per page (an 80% saving compared to AI mode).

Response

  • 200 OK: Crawl started successfully. Returns { "task_id": "<task_id>" }. Use this task_id to retrieve the crawl result from the Get Crawl Result endpoint.
  • 422 Unprocessable Entity: Validation error.
See the Get Crawl Result endpoint for the full response structure.
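As a sketch of the end-to-end flow in Python: the result path /v1/crawl/{task_id} and the job's status field are assumptions here (check the Get Crawl Result docs), and you should add whatever auth header your account requires:

import json
import time
import urllib.request

API = "https://api.example.com/v1"

def start_crawl(body: dict) -> str:
    # POST /v1/crawl returns {"task_id": "<task_id>"} on 200 OK.
    req = urllib.request.Request(
        f"{API}/crawl",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},  # plus your auth header
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["task_id"]

def wait_for_result(task_id: str, interval: float = 5.0) -> dict:
    # Assumed result endpoint: GET /v1/crawl/{task_id}.
    # Assumes the job object exposes a status of "success" or "failed",
    # mirroring the webhook payload below.
    while True:
        with urllib.request.urlopen(f"{API}/crawl/{task_id}") as resp:
            job = json.load(resp)
        if job.get("status") in ("success", "failed"):
            return job
        time.sleep(interval)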

Webhook Signature Verification

When you provide a webhook_url, SmartCrawler will send a POST request to your endpoint upon job completion. Each webhook request includes a signature header for verification.

Webhook Headers

| Header | Description |
| --- | --- |
| Content-Type | application/json |
| X-Webhook-Signature | Signature for verifying the webhook's authenticity |

Signature Format

The X-Webhook-Signature header contains your webhook secret in the format:
sha256={your_webhook_secret}

Verifying Webhooks

To verify that a webhook request is authentic:
  1. Retrieve your webhook secret from the dashboard
  2. Compare the X-Webhook-Signature header value with sha256={your_secret}
For example, a minimal Flask handler:

import hmac

from flask import Flask, request, abort

app = Flask(__name__)
WEBHOOK_SECRET = "your_webhook_secret"  # from the dashboard

@app.route("/webhook", methods=["POST"])
def handle_webhook():
    signature = request.headers.get("X-Webhook-Signature", "")
    expected_signature = f"sha256={WEBHOOK_SECRET}"

    # Constant-time comparison avoids leaking the secret via timing.
    if not hmac.compare_digest(signature, expected_signature):
        abort(401, "Invalid signature")

    # Process the webhook payload.
    data = request.json
    print(f"Received webhook for job: {data['request_id']}")
    return {"status": "ok"}

Webhook Payload

The webhook POST request contains the following JSON payload:
{
  "event": "crawler_completed",
  "request_id": "uuid-of-the-crawl-job",
  "webhook_id": "uuid-of-this-webhook-delivery",
  "status": "success | failed",
  "timestamp": "2024-01-28T12:00:00Z",
  "duration_seconds": 45.2,
  "errors": [],
  "result": "..."
}
| Field | Type | Description |
| --- | --- | --- |
| event | string | Always "crawler_completed" |
| request_id | string | The crawl job ID (same as task_id) |
| webhook_id | string | Unique ID for this webhook delivery |
| status | string | "success" or "failed" |
| timestamp | string | ISO 8601 timestamp of completion |
| duration_seconds | float | Total job duration in seconds |
| errors | array | List of error messages (empty if successful) |
| result | string | The crawl result data (null if failed) |
Make sure to configure your webhook secret in the dashboard before using webhooks. Each user has a unique webhook secret for secure verification.
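To exercise a handler like the one above locally, you can replay the sample payload yourself. This is only a sketch: it assumes the Flask app is running on the default dev server at localhost:5000 and that the signature value matches your WEBHOOK_SECRET:

import json
import urllib.request

payload = {
    "event": "crawler_completed",
    "request_id": "uuid-of-the-crawl-job",
    "webhook_id": "uuid-of-this-webhook-delivery",
    "status": "success",
    "timestamp": "2024-01-28T12:00:00Z",
    "duration_seconds": 45.2,
    "errors": [],
    "result": "...",
}
req = urllib.request.Request(
    "http://localhost:5000/webhook",  # the local Flask dev server above
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        # Must equal f"sha256={WEBHOOK_SECRET}" on the receiving side.
        "X-Webhook-Signature": "sha256=your_webhook_secret",
    },
    method="POST",
)
print(urllib.request.urlopen(req).read().decode())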