Overview

SmartCrawler is our advanced LLM-powered web crawling and extraction service. Unlike SmartScraper, which extracts data from a single page, SmartCrawler can traverse multiple pages, follow links, and extract structured data from entire websites or sections, all guided by your prompt and schema.

Try SmartCrawler instantly in our interactive playground - no coding required!

Getting Started

Quick Start

from scrapegraph_py import Client

client = Client(api_key="your-api-key")

response = client.smartcrawler(
    url="https://scrapegraphai.com/",
    prompt="Extract info about the company",
    depth=2,
    max_pages=10
)

Parameters

Parameter   Type    Required  Description
url         string  Yes       The starting URL for the crawl.
prompt      string  Yes       Instructions for what to extract.
depth       int     No        How many link levels to follow (default: 1).
max_pages   int     No        Maximum number of pages to crawl (default: 20).
schema      object  No        Pydantic or Zod schema for structured output.
rules       object  No        Crawl rules (see below).
sitemap     bool    No        Use sitemap.xml for discovery (default: false).

Get your API key from the dashboard
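
When a site publishes a sitemap.xml, enabling the sitemap flag lets the crawler use it for page discovery. A minimal sketch, reusing the client from the Quick Start:

# Sketch: sitemap-based discovery, reusing the client from the Quick Start
response = client.smartcrawler(
    url="https://scrapegraphai.com/",
    prompt="Extract info about the company",
    sitemap=True,     # discover pages via sitemap.xml
    depth=2,
    max_pages=10
)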

Crawl Rules

You can control the crawl behavior with the rules object:

rules = {
    "exclude": ["/logout", "/private"],  # List of URL patterns to exclude
    "same_domain": True  # Only crawl links on the same domain
}

Field        Type  Default  Description
exclude      list  []       List of URL substrings to skip
same_domain  bool  True     Restrict crawl to the same domain
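
To apply these rules, pass the object through the rules parameter of the crawl call. A minimal sketch, reusing the client from the Quick Start and the rules dict above:

# Sketch: pass the crawl rules defined above into the crawl call
response = client.smartcrawler(
    url="https://scrapegraphai.com/",
    prompt="Extract info about the company",
    rules=rules,      # skip /logout and /private, stay on the same domain
    depth=2,
    max_pages=10
)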

Example Response

A completed crawl returns the structure shown under Response Format below: the structured llm_result, the list of crawled_urls, and the markdown content of each crawled page.

Retrieve a Previous Crawl

You can retrieve the result of a crawl job by its task ID:

result = client.get_crawl_result(task_id="your-task-id")
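
Crawl jobs run asynchronously, so the result may not be ready on the first request. A minimal polling sketch; the status value checked here follows the Response Format example below and may differ in your SDK version:

import time

# Sketch: poll until the crawl job completes. The "status" value checked here
# follows the Response Format example below and may differ in practice.
result = client.get_crawl_result(task_id="your-task-id")
for _ in range(30):                  # give up after ~30 polls
    if result.get("status") == "success":
        break
    time.sleep(5)                    # wait between polls
    result = client.get_crawl_result(task_id="your-task-id")
print(result)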

Parameters

Parameter  Type    Required  Description
apiKey     string  Yes       The ScrapeGraph API Key.
taskId     string  Yes       The crawl job task ID.

Custom Schema Example

Define exactly what data you want to extract from every page:

from pydantic import BaseModel, Field

class CompanyData(BaseModel):
    name: str = Field(description="Company name")
    description: str = Field(description="Description")
    features: list[str] = Field(description="Features")

response = client.smartcrawler(
    url="https://example.com",
    prompt="Extract company info",
    schema=CompanyData,
    depth=1,
    max_pages=5
)
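
After the crawl finishes, the structured output can be validated back into the model. A minimal sketch, assuming the response follows the shape shown under Response Format below (the exact nesting may vary):

# Sketch: validate the extracted data against the schema defined above.
# Assumes the response mirrors the Response Format shape; adjust the keys if needed.
data = response["result"]["llm_result"]
company = CompanyData(**data)
print(company.name, company.features)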

Async Support

SmartCrawler supports async execution for large crawls:

import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient(api_key="your-api-key") as client:
        task = await client.smartcrawler(
            url="https://scrapegraphai.com/",
            prompt="Extract info about the company",
            depth=2,
            max_pages=10
        )
        # Poll for result
        result = await client.get_crawl_result(task["task_id"])
        print(result)

if __name__ == "__main__":
    asyncio.run(main())
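
The async client also makes it easy to run several crawls concurrently. A minimal sketch, reusing the asyncio and AsyncClient imports from the example above; the start URLs are placeholders:

# Sketch: launch several crawl jobs concurrently with the async client
async def crawl_many(urls):
    async with AsyncClient(api_key="your-api-key") as client:
        tasks = [
            client.smartcrawler(
                url=u,
                prompt="Extract info about the company",
                depth=1,
                max_pages=5
            )
            for u in urls
        ]
        return await asyncio.gather(*tasks)  # run all crawl requests in parallel

results = asyncio.run(crawl_many(["https://scrapegraphai.com/", "https://example.com/"]))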

Validation & Error Handling

SmartCrawler validates every request before it starts crawling (an error-handling sketch follows the list below):

  • Ensures either url or website_html is provided
  • Validates HTML size (max 2MB)
  • Checks for valid URLs and HTML structure
  • Handles empty or invalid prompts
  • Returns clear error messages for all validation failures
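
Because validation failures come back as clear error messages, wrapping calls in a try/except keeps batch crawls from crashing. A minimal sketch; the SDK's specific exception classes are not documented here, so a broad except is used as a placeholder:

# Sketch: catch validation and request errors. The SDK's exception types are
# not documented here, so the base Exception is caught as a placeholder.
try:
    response = client.smartcrawler(
        url="not-a-valid-url",  # fails URL validation
        prompt="Extract info about the company"
    )
except Exception as exc:
    print(f"Crawl request failed: {exc}")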

Endpoint Details

POST https://api.scrapegraphai.com/v1/crawl
Required Headers

Header        Description
SGAI-APIKEY   Your API authentication key
Content-Type  application/json

Request Body

Field              Type    Required  Description
url                string  Yes*      Starting URL (*either url or website_html is required)
website_html       string  No        Raw HTML content (max 2MB)
prompt             string  Yes       Extraction instructions
schema             object  No        Output schema
headers            object  No        Custom headers
number_of_scrolls  int     No        Number of infinite scrolls to perform per page
depth              int     No        Crawl depth
max_pages          int     No        Max pages to crawl
rules              object  No        Crawl rules
sitemap            bool    No        Use sitemap.xml
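
For direct HTTP access, send these fields as a JSON body together with the required headers. A minimal sketch using Python's requests library; the URL and prompt are placeholders:

import requests

# Sketch: call the crawl endpoint directly over HTTP
response = requests.post(
    "https://api.scrapegraphai.com/v1/crawl",
    headers={
        "SGAI-APIKEY": "your-api-key",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://scrapegraphai.com/",
        "prompt": "Extract info about the company",
        "depth": 2,
        "max_pages": 10,
    },
)
print(response.json())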

Response Format

{
  "status": "success",
  "result": {
    "status": "done",
    "llm_result": { /* Structured extraction */ },
    "crawled_urls": ["..."],
    "pages": [ { "url": "...", "markdown": "..." }, ... ]
  }
}
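
The pages array pairs each crawled URL with its markdown content, so per-page processing is straightforward. A minimal sketch, assuming result holds the "result" object from the response above:

# Sketch: iterate over the crawled pages. Assumes `result` is the "result"
# object of a completed crawl, as shown in the response above.
for page in result["pages"]:
    print(page["url"], "->", len(page["markdown"]), "characters of markdown")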

Key Features

  • Multi-Page Extraction: crawl and extract from entire sites, not just single pages
  • AI Understanding: contextual extraction across multiple pages
  • Crawl Rules: fine-tune what gets crawled and extracted
  • Schema Support: define custom output schemas for structured results

Use Cases

  • Site-wide data extraction
  • Product catalog crawling
  • Legal/Privacy/Terms aggregation
  • Research and competitive analysis
  • Multi-page blog/news scraping

Best Practices

  • Be specific in your prompts
  • Use schemas for structured output
  • Set reasonable max_pages and depth
  • Use rules to avoid unwanted pages
  • Handle errors and poll for results

API Reference

For detailed API documentation, see the Endpoint Details section above.

Ready to Start Crawling?

Sign up now and get your API key to begin extracting data with SmartCrawler!