Start a new web crawl request with AI extraction or markdown conversion
POST /v1/crawl
Start a new crawl job using SmartCrawler. Choose between AI-powered extraction or cost-effective markdown conversion.
Request body: application/json
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | The starting URL for the crawl |
| prompt | string | No* | - | Instructions for data extraction (*required when extraction_mode=true) |
| extraction_mode | boolean | No | true | When false, enables markdown conversion mode (NO AI/LLM processing, 2 credits per page) |
| cache_website | boolean | No | false | Whether to cache the website content |
| depth | integer | No | 1 | Maximum crawl depth |
| breadth | integer | No | null | Maximum number of links to crawl per depth level; controls the 'width' of exploration at each depth. If omitted or null, the number of links per level is unlimited (the default). Useful for limiting crawl scope on large sites. Note: max_pages always takes priority. Ignored when sitemap=true. |
| max_pages | integer | No | 10 | Maximum number of pages to crawl |
| same_domain_only | boolean | No | true | Whether to crawl only the same domain |
| batch_size | integer | No | 1 | Number of pages to process in each batch |
| schema | object | No | - | JSON Schema object for structured output |
| rules | object | No | - | Crawl rules for filtering URLs. Object with optional fields: exclude (array of regex URL patterns), include_paths (array of path patterns to include, supports wildcards * and **), exclude_paths (array of path patterns to exclude, takes precedence over include_paths), same_domain (boolean, default: true). See Rules section below for details. |
| sitemap | boolean | No | false | Use sitemap.xml for discovery |
| render_heavy_js | boolean | No | false | Enable heavy JavaScript rendering |
| stealth | boolean | No | false | Enable stealth mode to bypass bot protection using advanced anti-detection techniques. Adds +4 credits to the request cost |
| webhook_url | string | No | - | Webhook URL to send the job result to. When provided, a signed webhook notification is sent upon job completion. See Webhook Signature Verification below. |
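A minimal request sketch in Python. The API host and the Authorization header shown here are assumptions for illustration; check your account dashboard for the real values. The body fields map directly to the table above.

```python
import requests

API_BASE = "https://api.example.com"  # assumption: replace with the real API host
API_KEY = "your-api-key"              # assumption: replace with your key

payload = {
    "url": "https://example.com",
    "prompt": "Extract product names and prices",  # required while extraction_mode is true
    "extraction_mode": True,
    "depth": 2,
    "max_pages": 10,
    "same_domain_only": True,
}

resp = requests.post(
    f"{API_BASE}/v1/crawl",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},  # assumption: auth header name may differ
)
resp.raise_for_status()

# Documented response shape: { "task_id": "<task_id>" }.
# Poll the Get Crawl Result endpoint (see its reference page) with this ID.
task_id = resp.json()["task_id"]
print(task_id)
```

Setting extraction_mode to false and dropping prompt switches the same request to markdown conversion mode at 2 credits per page.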
rules object:
| Field | Type | Default | Description |
|---|---|---|---|
| exclude | array | [] | List of URL patterns (regex) to exclude from crawling. Matches full URL. |
| include_paths | array | [] | (Optional) List of path patterns to include (e.g., ["/products/*", "/blog/**"]). Supports wildcards: * matches any characters, ** matches any path segments. If empty or not specified, all paths are included. |
| exclude_paths | array | [] | (Optional) List of path patterns to exclude (e.g., ["/admin/*", "/api/**"]). Supports wildcards: * matches any characters, ** matches any path segments. Takes precedence over include_paths. If empty or not specified, no paths are excluded. |
| same_domain | boolean | true | Restrict crawling to same domain |
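A sketch of a rules object combining the fields above (shown as a Python dict; it is sent as JSON inside the request body). The URL and path patterns are illustrative only; the precedence notes right after this example spell out how the fields interact.

```python
rules = {
    "exclude": [r"https://example\.com/login.*"],   # regex matched against the full URL
    "include_paths": ["/products/*", "/blog/**"],   # * = any characters, ** = any path segments
    "exclude_paths": ["/products/internal/*"],      # wins over include_paths on conflict
    "same_domain": True,                            # stay on the starting domain
}
```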
include_paths and exclude_paths are optional. If include_paths is not specified or empty, all paths are included. If exclude_paths is not specified or empty, no paths are excluded. exclude_paths patterns take precedence over include_paths patterns.

Markdown conversion mode: when extraction_mode is false, the prompt parameter is not required. This mode converts HTML to clean markdown with metadata extraction at only 2 credits per page (80% savings compared to AI mode).

Response: the endpoint returns { "task_id": "<task_id>" }. Use this task_id to retrieve the crawl result from the Get Crawl Result endpoint.

Webhook notifications: if you provide a webhook_url, SmartCrawler will send a POST request to your endpoint upon job completion. Each webhook request includes a signature header for verification.

Webhook request headers:
| Header | Description |
|---|---|
| Content-Type | application/json |
| X-Webhook-Signature | Signature for verifying the webhook authenticity |
The X-Webhook-Signature header contains your webhook secret in the format sha256={your_secret}. To verify a webhook, compare the received X-Webhook-Signature header value with sha256={your_secret} built from your own copy of the secret (see the sketch after the payload table below).

Webhook payload:
| Field | Type | Description |
|---|---|---|
| event | string | Always "crawler_completed" |
| request_id | string | The crawl job ID (same as task_id) |
| webhook_id | string | Unique ID for this webhook delivery |
| status | string | "success" or "failed" |
| timestamp | string | ISO 8601 timestamp of completion |
| duration_seconds | float | Total job duration in seconds |
| errors | array | List of error messages (empty if successful) |
| result | string | The crawl result data (null if failed) |
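Because the documented X-Webhook-Signature format is simply sha256={your_secret}, verification reduces to a constant-time comparison against that expected value. A minimal sketch, assuming you receive the raw headers and body from your web framework of choice; the secret-loading is a placeholder.

```python
import hmac
import json

WEBHOOK_SECRET = "your-webhook-secret"  # assumption: load this from your own config

def handle_webhook(headers: dict, body: bytes):
    # Per the docs, the header carries your secret as sha256={your_secret},
    # so verification is a constant-time string comparison against that value.
    expected = f"sha256={WEBHOOK_SECRET}"
    received = headers.get("X-Webhook-Signature", "")
    if not hmac.compare_digest(received, expected):
        raise ValueError("invalid webhook signature")

    payload = json.loads(body)           # fields documented in the table above
    if payload["status"] == "success":
        return payload["result"]         # the crawl result data
    raise RuntimeError(f"crawl failed: {payload['errors']}")
```

hmac.compare_digest is used instead of == so the comparison time does not leak how much of the expected value an attacker has guessed.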