Overview
SmartCrawler is our advanced web crawling service that offers two modes:
- AI-Powered Extraction: LLM-powered web crawling with intelligent data extraction (10 credits per page)
- Markdown Conversion: Cost-effective HTML to markdown conversion without AI/LLM processing (2 credits per page - 80% savings!)
Try SmartCrawler instantly in our interactive playground - no coding required!
Getting Started
Quick Start
Required Headers
Header | Description |
---|---|
SGAI-APIKEY | Your API authentication key |
Content-Type | application/json |
Parameters
Parameter | Type | Required | Description |
---|---|---|---|
url | string | Yes | The starting URL for the crawl. |
prompt | string | No* | Instructions for what to extract (*required when extraction_mode=true). |
extraction_mode | bool | No | When false, enables markdown conversion mode (default: true). |
depth | int | No | How many link levels to follow (default: 1). |
max_pages | int | No | Maximum number of pages to crawl (default: 20). |
schema | object | No | Pydantic or Zod schema for structured output. |
rules | object | No | Crawl rules (see below). |
sitemap | bool | No | Use sitemap.xml for discovery (default: false). |
Get your API key from the dashboard
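The request described by the headers and parameters above can be sketched with nothing but the standard library. The endpoint path used here (`https://api.scrapegraphai.com/v1/crawl`) is an assumption for illustration; check the dashboard for the exact URL.

```python
import json
import urllib.request

API_URL = "https://api.scrapegraphai.com/v1/crawl"  # assumed endpoint path

def build_request(api_key: str, url: str, prompt: str,
                  depth: int = 1, max_pages: int = 20):
    """Assemble the headers and JSON body for a SmartCrawler request."""
    headers = {"SGAI-APIKEY": api_key, "Content-Type": "application/json"}
    payload = {"url": url, "prompt": prompt, "depth": depth, "max_pages": max_pages}
    return headers, payload

def start_crawl(api_key: str, url: str, prompt: str, **options) -> dict:
    """POST the request and return the parsed JSON response."""
    headers, payload = build_request(api_key, url, prompt, **options)
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A call like `start_crawl("sgai-...", "https://example.com", "Extract page titles")` kicks off an AI-extraction crawl with the defaults from the parameters table.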
Markdown Conversion Mode
For cost-effective content archival, and when you only need clean markdown without AI processing, use the markdown conversion mode. This mode offers significant cost savings and is perfect for documentation, content migration, and simple data collection.
Benefits
- 80% Cost Savings: Only 2 credits per page vs 10 credits for AI mode
- No AI/LLM Processing: Pure HTML to markdown conversion
- Clean Output: Well-formatted markdown with metadata extraction
- Fast Processing: No AI inference delays
- Perfect for: Documentation, content archival, site migration
Quick Start - Markdown Mode
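A markdown-mode request differs from the AI-mode request only in that extraction_mode is set to false and no prompt is needed. A minimal sketch of the request body (the endpoint path itself is as assumed above):

```python
def build_markdown_request(api_key: str, url: str,
                           depth: int = 1, max_pages: int = 20):
    """Build a markdown-conversion request: no prompt, extraction_mode off."""
    headers = {"SGAI-APIKEY": api_key, "Content-Type": "application/json"}
    payload = {
        "url": url,
        # False switches from AI extraction (10 credits/page)
        # to markdown conversion (2 credits/page).
        "extraction_mode": False,
        "depth": depth,
        "max_pages": max_pages,
    }
    return headers, payload
```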
Markdown Mode Response
Markdown Conversion Response
Crawl Rules
You can control the crawl behavior with the rules object:
Field | Type | Default | Description |
---|---|---|---|
exclude | list | [] | List of URL substrings to skip |
same_domain | bool | True | Restrict crawl to the same domain |
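For instance, the two fields above can be combined to keep a crawl on its own domain while skipping session pages (URLs here are illustrative):

```python
payload = {
    "url": "https://example.com",
    "prompt": "Extract every page's title and summary",
    "rules": {
        "exclude": ["/logout", "/admin"],  # skip URLs containing these substrings
        "same_domain": True,               # stay on example.com
    },
}
```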
Example Response
- llm_result: Structured extraction based on your prompt/schema
- crawled_urls: List of all URLs visited
- pages: List of objects with url and extracted markdown content
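Put together, a response has roughly the shape below; the values are made up for illustration, not captured from a live response:

```python
# Illustrative response shape -- field values here are invented.
result = {
    "llm_result": {"title": "Example Domain"},
    "crawled_urls": ["https://example.com", "https://example.com/about"],
    "pages": [
        {"url": "https://example.com", "markdown": "# Example Domain\n..."},
        {"url": "https://example.com/about", "markdown": "# About\n..."},
    ],
}

# Typical consumption: walk the per-page markdown alongside the structured result.
for page in result["pages"]:
    print(page["url"], "->", len(page["markdown"]), "chars of markdown")
```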
Retrieve a Previous Crawl
You can retrieve the result of a crawl job by its task ID:
Parameters
Parameter | Type | Required | Description |
---|---|---|---|
apiKey | string | Yes | The ScrapeGraph API Key. |
taskId | string | Yes | The crawl job task ID. |
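Retrieval is a simple authenticated GET. As with the earlier sketch, the `/v1/crawl/{taskId}` path is an assumption; only the two parameters in the table are required:

```python
import json
import urllib.request

def build_status_url(task_id: str) -> str:
    """Build the retrieval URL for a crawl job (path is assumed)."""
    return f"https://api.scrapegraphai.com/v1/crawl/{task_id}"

def get_crawl_result(api_key: str, task_id: str) -> dict:
    """Fetch a crawl job's result by task ID."""
    req = urllib.request.Request(
        build_status_url(task_id),
        headers={"SGAI-APIKEY": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```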
Custom Schema Example
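A schema can be supplied as a plain JSON Schema object; Pydantic's `model_json_schema()` (or a Zod schema on the JavaScript side) produces the same shape. A sketch with illustrative field names:

```python
# Plain JSON Schema describing the desired output per page.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Product name, e.g. 'Acme Widget'"},
        "price": {"type": "number", "description": "Price in USD"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["name", "price"],
}

payload = {
    "url": "https://example.com/products",
    "prompt": "Extract every product's name, price, and stock status",
    "schema": product_schema,
}
```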
Define exactly what data you want to extract from every page.
Async Support
SmartCrawler supports async execution for large crawls.
Validation & Error Handling
SmartCrawler performs advanced validation:
- Ensures either url or website_html is provided
- Validates HTML size (max 2MB)
- Checks for valid URLs and HTML structure
- Handles empty or invalid prompts
- Returns clear error messages for all validation failures
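The checks above can also be mirrored client-side to fail fast before spending a request. A sketch of such a pre-validation helper (the helper itself is not part of the API):

```python
MAX_HTML_BYTES = 2 * 1024 * 1024  # the documented 2MB limit

def prevalidate(payload: dict) -> list[str]:
    """Mirror SmartCrawler's documented validation rules client-side."""
    errors = []
    if not payload.get("url") and not payload.get("website_html"):
        errors.append("either url or website_html must be provided")
    html = payload.get("website_html")
    if html is not None and len(html.encode("utf-8")) > MAX_HTML_BYTES:
        errors.append("website_html exceeds the 2MB limit")
    if payload.get("extraction_mode", True) and not payload.get("prompt", "").strip():
        errors.append("prompt is required in AI extraction mode")
    return errors
```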
Endpoint Details
Required Headers
Header | Description |
---|---|
SGAI-APIKEY | Your API authentication key |
Content-Type | application/json |
Request Body
Field | Type | Required | Description |
---|---|---|---|
url | string | Yes* | Starting URL (*or website_html required) |
website_html | string | No | Raw HTML content (max 2MB) |
prompt | string | Yes | Extraction instructions |
schema | object | No | Output schema |
headers | object | No | Custom headers |
number_of_scrolls | int | No | Number of infinite-scroll actions per page |
depth | int | No | Crawl depth |
max_pages | int | No | Max pages to crawl |
rules | object | No | Crawl rules |
sitemap | bool | No | Use sitemap.xml |
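Every request-body field from the table above can appear in a single payload; all values below are illustrative:

```python
payload = {
    "url": "https://example.com",  # or website_html with raw HTML (max 2MB)
    "prompt": "Extract product names and prices",
    "schema": {"type": "object", "properties": {"name": {"type": "string"}}},
    "headers": {"User-Agent": "my-crawler/1.0"},
    "number_of_scrolls": 3,        # scroll each page three times before reading it
    "depth": 2,
    "max_pages": 50,
    "rules": {"exclude": ["/logout"], "same_domain": True},
    "sitemap": True,               # use sitemap.xml for URL discovery
}
```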
Response Format
Key Features
Multi-Page Extraction
Crawl and extract from entire sites, not just single pages
AI Understanding
Contextual extraction across multiple pages
Markdown Conversion
Cost-effective HTML to markdown conversion (80% savings!)
Crawl Rules
Fine-tune what gets crawled and extracted
Schema Support
Define custom output schemas for structured results
Dual Mode Support
Choose between AI extraction or markdown conversion
Use Cases
AI Extraction Mode
- Site-wide data extraction with smart understanding
- Product catalog crawling with structured output
- Legal/Privacy/Terms aggregation with AI parsing
- Research and competitive analysis with insights
- Multi-page blog/news scraping with content analysis
Markdown Conversion Mode
- Website documentation archival and migration
- Content backup and preservation (80% cheaper!)
- Blog/article collection in markdown format
- Site content analysis without AI overhead
- Fast bulk content conversion for CMS migration
Best Practices
AI Extraction Mode
- Be specific in your prompts for better results
- Use schemas for structured output validation
- Test prompts on single pages first
- Include examples in your schema descriptions
Markdown Conversion Mode
- Perfect for content archival and documentation
- No prompt required: set extraction_mode: false
- 80% cheaper than AI mode (2 credits vs 10 per page)
- Ideal for bulk content migration
General
- Set reasonable max_pages and depth limits
- Use rules to avoid unwanted pages (like /logout)
- Always handle errors and poll for results
- Monitor your credit usage and rate limits
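The "poll for results" advice above can be sketched as a small loop. The terminal status values used here ("success", "failed") are assumptions; `fetch_status` stands in for any callable that retrieves a job's JSON by task ID:

```python
import time

def poll_until_done(fetch_status, task_id: str,
                    interval: float = 2.0, timeout: float = 300.0) -> dict:
    """Poll a crawl job until it reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status(task_id)
        if job.get("status") in ("success", "failed"):
            return job
        time.sleep(interval)  # back off between polls to respect rate limits
    raise TimeoutError(f"crawl {task_id} did not finish within {timeout}s")
```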
API Reference
For detailed API documentation, see:
Support & Resources
Documentation
Comprehensive guides and tutorials
API Reference
Detailed API documentation
Community
Join our Discord community
GitHub
Check out our open-source projects
Ready to Start Crawling?
Sign up now and get your API key to begin extracting data with SmartCrawler!