Overview

SmartCrawler is our advanced web crawling service that offers two modes:
  1. AI-Powered Extraction: LLM-powered web crawling with intelligent data extraction (10 credits per page)
  2. Markdown Conversion: Cost-effective HTML to markdown conversion without AI/LLM processing (2 credits per page - 80% savings!)
Unlike SmartScraper, which extracts data from a single page, SmartCrawler can traverse multiple pages, follow links, and either extract structured data or convert content to clean markdown from entire websites or sections.
Try SmartCrawler instantly in our interactive playground - no coding required!

Getting Started

Quick Start

from scrapegraph_py import Client

client = Client(api_key="your-api-key")

response = client.crawl(
    url="https://scrapegraphai.com/",
    prompt="Extract info about the company",
    depth=2,
    max_pages=10
)

Parameters

Parameter        Type    Required  Description
url              string  Yes       The starting URL for the crawl.
prompt           string  No*       Instructions for what to extract (*required when extraction_mode is true).
extraction_mode  bool    No        When false, enables markdown conversion mode (default: true).
depth            int     No        How many link levels to follow (default: 1).
max_pages        int     No        Maximum number of pages to crawl (default: 20).
schema           object  No        Pydantic or Zod schema for structured output.
rules            object  No        Crawl rules (see below).
sitemap          bool    No        Use sitemap.xml for discovery (default: false).

Get your API key from the dashboard

Markdown Conversion Mode

Use markdown conversion mode when you only need clean markdown without AI processing, for example for cost-effective content archival. This mode offers significant cost savings and is perfect for documentation, content migration, and simple data collection.

Benefits

  • 80% Cost Savings: Only 2 credits per page vs 10 credits for AI mode
  • No AI/LLM Processing: Pure HTML to markdown conversion
  • Clean Output: Well-formatted markdown with metadata extraction
  • Fast Processing: No AI inference delays
  • Perfect for: Documentation, content archival, site migration

Quick Start - Markdown Mode

from scrapegraph_py import Client

client = Client(api_key="your-api-key")

# Markdown conversion mode - no prompt needed
response = client.crawl(
    url="https://scrapegraphai.com/",
    extraction_mode=False,  # False = Markdown conversion (NO AI/LLM)
    depth=2,
    max_pages=5
)

Markdown Mode Response
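
In markdown conversion mode the response carries the converted pages rather than a structured llm_result. The shape below is illustrative, mirroring the Response Format section later on this page; all values are placeholders.

{
  "status": "success",
  "result": {
    "status": "done",
    "crawled_urls": [
      "https://scrapegraphai.com/",
      "https://scrapegraphai.com/pricing"
    ],
    "pages": [
      { "url": "https://scrapegraphai.com/", "markdown": "# ScrapeGraphAI\n..." },
      { "url": "https://scrapegraphai.com/pricing", "markdown": "# Pricing\n..." }
    ]
  }
}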

Crawl Rules

You can control the crawl behavior with the rules object:
rules = {
    "exclude": ["/logout", "/private"],  # List of URL patterns to exclude
    "same_domain": True  # Only crawl links on the same domain
}
Field        Type  Default  Description
exclude      list  []       List of URL substrings to skip
same_domain  bool  True     Restrict the crawl to the same domain
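
To apply them, pass the rules object alongside the other crawl parameters. A minimal sketch reusing the dict above:

response = client.crawl(
    url="https://scrapegraphai.com/",
    prompt="Extract info about the company",
    rules=rules,  # skip /logout and /private, stay on the same domain
    depth=2,
    max_pages=10
)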

Example Response
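
An illustrative AI-extraction response, following the Response Format section below. The llm_result contents depend on your prompt and schema; the values here are placeholders.

{
  "status": "success",
  "result": {
    "status": "done",
    "llm_result": {
      "name": "ScrapeGraphAI",
      "description": "AI-powered web scraping APIs"
    },
    "crawled_urls": ["https://scrapegraphai.com/"],
    "pages": [
      { "url": "https://scrapegraphai.com/", "markdown": "..." }
    ]
  }
}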

Retrieve a Previous Crawl

You can retrieve the result of a crawl job by its task ID:
result = client.get_crawl_result(task_id="your-task-id")
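
Crawl jobs complete asynchronously, so the result may not be ready on the first call. A minimal polling sketch, assuming the top-level status field shown in the Response Format section below (pending-state values may vary):

import time

while True:
    result = client.get_crawl_result(task_id="your-task-id")
    if result.get("status") in ("success", "failed"):  # terminal states assumed
        break
    time.sleep(5)  # wait before checking again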

Parameters

Parameter  Type    Required  Description
api_key    string  Yes       Your ScrapeGraph API key (set when creating the Client).
task_id    string  Yes       The crawl job task ID.

Custom Schema Example

Define exactly what data you want to extract from every page:
from pydantic import BaseModel, Field

class CompanyData(BaseModel):
    name: str = Field(description="Company name")
    description: str = Field(description="Description")
    features: list[str] = Field(description="Features")

response = client.crawl(
    url="https://example.com",
    prompt="Extract company info",
    schema=CompanyData,
    depth=1,
    max_pages=5
)
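
Because the schema is a Pydantic model, you can round-trip the structured output back into it for type-safe access. A sketch assuming the response follows the Response Format shown below and that you are on Pydantic v2:

# model_validate is Pydantic v2; use CompanyData.parse_obj(...) on v1
data = CompanyData.model_validate(response["result"]["llm_result"])
print(data.name, data.features)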

Async Support

SmartCrawler supports async execution for large crawls:
import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient(api_key="your-api-key") as client:
        task = await client.crawl(
            url="https://scrapegraphai.com/",
            prompt="Extract info about the company",
            depth=2,
            max_pages=10
        )
        # Poll until the job reports a terminal status; field names follow
        # the Response Format section below (pending-state values may vary)
        while True:
            result = await client.get_crawl_result(task["task_id"])
            if result.get("status") in ("success", "failed"):
                break
            await asyncio.sleep(5)
        print(result)

if __name__ == "__main__":
    asyncio.run(main())

Validation & Error Handling

SmartCrawler validates every request before crawling (a defensive call pattern is sketched after this list):
  • Ensures either url or website_html is provided
  • Validates HTML size (max 2MB)
  • Checks for valid URLs and HTML structure
  • Handles empty or invalid prompts
  • Returns clear error messages for all validation failures
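
Validation failures typically surface as exceptions in the Python SDK. A minimal sketch of defensive usage; the exact exception classes raised by scrapegraph_py are not documented here, so a generic handler is shown as an assumption:

from scrapegraph_py import Client

client = Client(api_key="your-api-key")

try:
    response = client.crawl(
        url="https://scrapegraphai.com/",
        prompt="Extract info about the company",
        max_pages=5
    )
except Exception as e:
    # Validation failures (missing url/website_html, oversized HTML,
    # empty prompt) arrive with a descriptive error message
    print(f"Crawl request failed: {e}")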

Endpoint Details

POST https://api.scrapegraphai.com/v1/crawl

Required Headers

Header        Description
SGAI-APIKEY   Your API authentication key
Content-Type  application/json

Request Body

Field              Type    Required  Description
url                string  Yes*      Starting URL (*either url or website_html is required)
website_html       string  No        Raw HTML content to crawl instead of a URL (max 2MB)
prompt             string  Yes*      Extraction instructions (*required when extraction_mode is true)
extraction_mode    bool    No        Set to false for markdown conversion mode (default: true)
schema             object  No        Output schema for structured extraction
headers            object  No        Custom headers to send with page requests
number_of_scrolls  int     No        Number of infinite-scroll actions per page
depth              int     No        How many link levels to follow
max_pages          int     No        Maximum number of pages to crawl
rules              object  No        Crawl rules (see above)
sitemap            bool    No        Use sitemap.xml for page discovery
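
If you are not using an SDK, the endpoint can be called directly over HTTP. A sketch using Python's requests library with the headers and fields documented above:

import requests

resp = requests.post(
    "https://api.scrapegraphai.com/v1/crawl",
    headers={
        "SGAI-APIKEY": "your-api-key",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://scrapegraphai.com/",
        "prompt": "Extract info about the company",
        "depth": 2,
        "max_pages": 10,
    },
)
print(resp.json())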

Response Format

{
  "status": "success",
  "result": {
    "status": "done",
    "llm_result": { /* Structured extraction */ },
    "crawled_urls": ["..."],
    "pages": [ { "url": "...", "markdown": "..." }, ... ]
  }
}

Key Features

  • Multi-Page Extraction: Crawl and extract from entire sites, not just single pages
  • AI Understanding: Contextual extraction across multiple pages
  • Markdown Conversion: Cost-effective HTML to markdown conversion (80% savings!)
  • Crawl Rules: Fine-tune what gets crawled and extracted
  • Schema Support: Define custom output schemas for structured results
  • Dual Mode Support: Choose between AI extraction or markdown conversion

Use Cases

AI Extraction Mode

  • Site-wide data extraction with smart understanding
  • Product catalog crawling with structured output
  • Legal/Privacy/Terms aggregation with AI parsing
  • Research and competitive analysis with insights
  • Multi-page blog/news scraping with content analysis

Markdown Conversion Mode

  • Website documentation archival and migration
  • Content backup and preservation (80% cheaper!)
  • Blog/article collection in markdown format
  • Site content analysis without AI overhead
  • Fast bulk content conversion for CMS migration

Best Practices

AI Extraction Mode

  • Be specific in your prompts for better results
  • Use schemas for structured output validation
  • Test prompts on single pages first
  • Include examples in your schema descriptions

Markdown Conversion Mode

  • Perfect for content archival and documentation
  • No prompt required - set extraction_mode: false
  • 80% cheaper than AI mode (2 credits vs 10 per page)
  • Ideal for bulk content migration

General

  • Set reasonable max_pages and depth limits
  • Use rules to avoid unwanted pages (like /logout)
  • Always handle errors and poll for results
  • Monitor your credit usage and rate limits


Ready to Start Crawling?

Sign up now and get your API key to begin extracting data with SmartCrawler!