SmartCrawler
AI-powered website crawling and multi-page extraction
Overview
SmartCrawler is our advanced LLM-powered web crawling and extraction service. Unlike SmartScraper, which extracts data from a single page, SmartCrawler can traverse multiple pages, follow links, and extract structured data from entire websites or sections, all guided by your prompt and schema.
Try SmartCrawler instantly in our interactive playground - no coding required!
Getting Started
Quick Start
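A minimal sketch of a crawl request in Python, assuming the REST endpoint is `https://api.scrapegraphai.com/v1/crawl` (verify the exact path in the API Reference):

```python
import requests

# Assumed endpoint path; check the API Reference for the exact URL.
API_URL = "https://api.scrapegraphai.com/v1/crawl"

payload = {
    "url": "https://example.com",
    "prompt": "Extract the title and summary of every article",
    "depth": 2,
    "max_pages": 10,
}

response = requests.post(
    API_URL,
    json=payload,
    headers={
        "SGAI-APIKEY": "your-api-key",
        "Content-Type": "application/json",
    },
)
response.raise_for_status()
print(response.json())
```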
Required Headers
Header | Description |
---|---|
SGAI-APIKEY | Your API authentication key |
Content-Type | application/json |
Parameters
Parameter | Type | Required | Description |
---|---|---|---|
url | string | Yes | The starting URL for the crawl. |
prompt | string | Yes | Instructions for what to extract. |
depth | int | No | How many link levels to follow (default: 1). |
max_pages | int | No | Maximum number of pages to crawl (default: 20). |
schema | object | No | Pydantic or Zod schema for structured output. |
rules | object | No | Crawl rules (see below). |
sitemap | bool | No | Use sitemap.xml for discovery (default: false). |
Get your API key from the dashboard
Crawl Rules
You can control the crawl behavior with the `rules` object:
Field | Type | Default | Description |
---|---|---|---|
exclude | list | [] | List of URL substrings to skip |
same_domain | bool | True | Restrict crawl to the same domain |
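For example, a request that skips login and cart pages while staying on the starting domain might look like this (illustrative values):

```python
payload = {
    "url": "https://example.com",
    "prompt": "Extract product names and prices",
    "rules": {
        "exclude": ["/login", "/cart"],  # skip any URL containing these substrings
        "same_domain": True,             # do not follow links off example.com
    },
}
```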
Example Response
- `llm_result`: Structured extraction based on your prompt/schema
- `crawled_urls`: List of all URLs visited
- `pages`: List of objects with `url` and extracted `markdown` content
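Putting it together, a response might look like the following (field values are illustrative placeholders):

```json
{
  "llm_result": {
    "articles": [
      { "title": "Example article", "summary": "..." }
    ]
  },
  "crawled_urls": [
    "https://example.com",
    "https://example.com/blog"
  ],
  "pages": [
    {
      "url": "https://example.com/blog",
      "markdown": "# Blog\n..."
    }
  ]
}
```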
Retrieve a Previous Crawl
You can retrieve the result of a crawl job by its task ID:
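A minimal retrieval call, assuming results are fetched with a GET on the crawl endpoint plus the task ID (the exact path may differ; see the API Reference):

```python
import requests

# Assumed retrieval path; verify against the API Reference.
task_id = "your-task-id"
response = requests.get(
    f"https://api.scrapegraphai.com/v1/crawl/{task_id}",
    headers={"SGAI-APIKEY": "your-api-key"},
)
response.raise_for_status()
print(response.json())
```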
Parameters
Parameter | Type | Required | Description |
---|---|---|---|
apiKey | string | Yes | The ScrapeGraph API Key. |
taskId | string | Yes | The crawl job task ID. |
Custom Schema Example
Define exactly what data you want to extract from every page:
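A sketch using a Pydantic model, assuming the schema is sent as its JSON Schema representation (the SDK may serialize it differently):

```python
import requests
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool

class Catalog(BaseModel):
    products: list[Product]

payload = {
    "url": "https://example.com/shop",
    "prompt": "Extract every product with its price and availability",
    # Send the JSON Schema derived from the Pydantic model (assumed wire format).
    "schema": Catalog.model_json_schema(),
    "depth": 2,
}

response = requests.post(
    "https://api.scrapegraphai.com/v1/crawl",  # assumed endpoint
    json=payload,
    headers={"SGAI-APIKEY": "your-api-key", "Content-Type": "application/json"},
)
print(response.json())
```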
Async Support
SmartCrawler supports async execution for large crawls:
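A sketch with `aiohttp`, assuming the same endpoint and headers as above (the official SDK may expose its own async client):

```python
import asyncio
import aiohttp

async def crawl(url: str, prompt: str) -> dict:
    payload = {"url": url, "prompt": prompt, "max_pages": 50}
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.scrapegraphai.com/v1/crawl",  # assumed endpoint
            json=payload,
            headers={"SGAI-APIKEY": "your-api-key"},
        ) as resp:
            resp.raise_for_status()
            return await resp.json()

result = asyncio.run(crawl("https://example.com", "Extract all blog titles"))
print(result)
```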
Validation & Error Handling
SmartCrawler performs advanced validation:
- Ensures either `url` or `website_html` is provided
- Validates HTML size (max 2MB)
- Checks for valid URLs and HTML structure
- Handles empty or invalid prompts
- Returns clear error messages for all validation failures
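For example, a validation failure can be surfaced like this (the exact error body is an assumption; only the presence of descriptive error messages is documented):

```python
import requests

response = requests.post(
    "https://api.scrapegraphai.com/v1/crawl",  # assumed endpoint
    json={"prompt": ""},  # missing url/website_html and an empty prompt
    headers={"SGAI-APIKEY": "your-api-key"},
)

if not response.ok:
    # The API returns a descriptive message for each validation failure.
    print(response.status_code, response.json())
```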
Endpoint Details
Required Headers
Header | Description |
---|---|
SGAI-APIKEY | Your API authentication key |
Content-Type | application/json |
Request Body
Field | Type | Required | Description |
---|---|---|---|
url | string | Yes* | Starting URL (*or website_html required) |
website_html | string | No | Raw HTML content (max 2MB) |
prompt | string | Yes | Extraction instructions |
schema | object | No | Output schema |
headers | object | No | Custom headers |
number_of_scrolls | int | No | Number of infinite-scroll passes to perform per page
depth | int | No | Crawl depth |
max_pages | int | No | Max pages to crawl |
rules | object | No | Crawl rules |
sitemap | bool | No | Use sitemap.xml |
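A full request body using most fields might look like this (illustrative values):

```json
{
  "url": "https://example.com",
  "prompt": "Extract the name and price of every product",
  "schema": { "type": "object" },
  "headers": { "User-Agent": "my-crawler/1.0" },
  "number_of_scrolls": 2,
  "depth": 2,
  "max_pages": 10,
  "rules": { "exclude": ["/login"], "same_domain": true },
  "sitemap": true
}
```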
Response Format
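Since crawls run as jobs retrieved by task ID, the initial response likely carries a task identifier to poll, and a finished job returns the fields described under Example Response. An illustrative sketch (assumed shape; confirm against the API Reference):

```json
{
  "task_id": "abc123",
  "status": "completed",
  "result": {
    "llm_result": { "...": "..." },
    "crawled_urls": ["https://example.com"],
    "pages": [{ "url": "https://example.com", "markdown": "..." }]
  }
}
```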
Key Features
Multi-Page Extraction
Crawl and extract from entire sites, not just single pages
AI Understanding
Contextual extraction across multiple pages
Crawl Rules
Fine-tune what gets crawled and extracted
Schema Support
Define custom output schemas for structured results
Use Cases
- Site-wide data extraction
- Product catalog crawling
- Legal/Privacy/Terms aggregation
- Research and competitive analysis
- Multi-page blog/news scraping
Best Practices
- Be specific in your prompts
- Use schemas for structured output
- Set reasonable `max_pages` and `depth` limits
- Use `rules` to avoid unwanted pages
- Handle errors and poll for results (see the polling sketch below)
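A simple polling sketch (assumed endpoint path and status values; adjust to the actual API):

```python
import time

import requests

def wait_for_crawl(task_id: str, api_key: str, interval: float = 5.0) -> dict:
    """Poll the crawl job until it finishes. Status names are assumptions."""
    url = f"https://api.scrapegraphai.com/v1/crawl/{task_id}"  # assumed path
    while True:
        data = requests.get(url, headers={"SGAI-APIKEY": api_key}).json()
        if data.get("status") in ("completed", "failed"):
            return data
        time.sleep(interval)
```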
Support & Resources
Documentation
Comprehensive guides and tutorials
API Reference
Detailed API documentation
Community
Join our Discord community
GitHub
Check out our open-source projects
Ready to Start Crawling?
Sign up now and get your API key to begin extracting data with SmartCrawler!