Overview

SmartCrawler is our advanced LLM-powered web crawling and extraction service. Unlike SmartScraper, which extracts data from a single page, SmartCrawler can traverse multiple pages, follow links, and extract structured data from entire websites or sections, all guided by your prompt and schema.

Try SmartCrawler instantly in our interactive playground - no coding required!

Getting Started

Quick Start

from scrapegraph_py import Client

client = Client(api_key="your-api-key")

response = client.smartcrawler(
    url="https://scrapegraphai.com/",
    prompt="Extract info about the company",
    depth=2,
    max_pages=10
)

Parameters

Parameter   Type    Required  Description
url         string  Yes       The starting URL for the crawl.
prompt      string  Yes       Instructions for what to extract.
depth       int     No        How many link levels to follow (default: 1).
max_pages   int     No        Maximum number of pages to crawl (default: 20).
schema      object  No        Pydantic or Zod schema for structured output.
rules       object  No        Crawl rules (see below).
sitemap     bool    No        Use sitemap.xml for discovery (default: false).

Get your API key from the dashboard
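
When a site publishes a sitemap.xml, enabling the sitemap flag lets the crawler use it for page discovery. A minimal sketch, reusing the client from the Quick Start:

# Sketch: sitemap-based discovery, reusing the client from the Quick Start
response = client.smartcrawler(
    url="https://scrapegraphai.com/",
    prompt="Extract info about the company",
    sitemap=True,     # discover pages via sitemap.xml
    depth=2,
    max_pages=10
)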

Crawl Rules

You can control the crawl behavior with the rules object:

rules = {
    "exclude": ["/logout", "/private"],  # List of URL patterns to exclude
    "same_domain": True  # Only crawl links on the same domain
}

Field        Type  Default  Description
exclude      list  []       List of URL substrings to skip
same_domain  bool  True     Restrict crawl to the same domain
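
To apply these rules, pass the object through the rules parameter of the crawl call. A minimal sketch, reusing the client from the Quick Start and the rules dict above:

# Sketch: pass the crawl rules defined above into the crawl call
response = client.smartcrawler(
    url="https://scrapegraphai.com/",
    prompt="Extract info about the company",
    rules=rules,      # skip /logout and /private, stay on the same domain
    depth=2,
    max_pages=10
)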

Example Response

A completed crawl returns the structure shown under Response Format below: the structured llm_result, the list of crawled_urls, and the markdown content of each crawled page.

Retrieve a Previous Crawl

You can retrieve the result of a crawl job by its task ID:

result = client.get_crawl_result(task_id="your-task-id")
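
Crawl jobs run asynchronously, so the result may not be ready on the first request. A minimal polling sketch; the status value checked here follows the Response Format example below and may differ in your SDK version:

import time

# Sketch: poll until the crawl job completes. The "status" value checked here
# follows the Response Format example below and may differ in practice.
result = client.get_crawl_result(task_id="your-task-id")
for _ in range(30):                  # give up after ~30 polls
    if result.get("status") == "success":
        break
    time.sleep(5)                    # wait between polls
    result = client.get_crawl_result(task_id="your-task-id")
print(result)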

Parameters

Parameter  Type    Required  Description
apiKey     string  Yes       The ScrapeGraph API Key.
taskId     string  Yes       The crawl job task ID.

Custom Schema Example

Define exactly what data you want to extract from every page:

from pydantic import BaseModel, Field

class CompanyData(BaseModel):
    name: str = Field(description="Company name")
    description: str = Field(description="Description")
    features: list[str] = Field(description="Features")

response = client.smartcrawler(
    url="https://example.com",
    prompt="Extract company info",
    schema=CompanyData,
    depth=1,
    max_pages=5
)
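
After the crawl finishes, the structured output can be validated back into the model. A minimal sketch, assuming the response follows the shape shown under Response Format below (the exact nesting may vary):

# Sketch: validate the extracted data against the schema defined above.
# Assumes the response mirrors the Response Format shape; adjust the keys if needed.
data = response["result"]["llm_result"]
company = CompanyData(**data)
print(company.name, company.features)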

Async Support

SmartCrawler supports async execution for large crawls:

import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient(api_key="your-api-key") as client:
        task = await client.smartcrawler(
            url="https://scrapegraphai.com/",
            prompt="Extract info about the company",
            depth=2,
            max_pages=10
        )
        # Poll for result
        result = await client.get_crawl_result(task["task_id"])
        print(result)

if __name__ == "__main__":
    asyncio.run(main())
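
The async client also makes it easy to run several crawls concurrently. A minimal sketch, reusing the asyncio and AsyncClient imports from the example above; the start URLs are placeholders:

# Sketch: launch several crawl jobs concurrently with the async client
async def crawl_many(urls):
    async with AsyncClient(api_key="your-api-key") as client:
        tasks = [
            client.smartcrawler(
                url=u,
                prompt="Extract info about the company",
                depth=1,
                max_pages=5
            )
            for u in urls
        ]
        return await asyncio.gather(*tasks)  # run all crawl requests in parallel

results = asyncio.run(crawl_many(["https://scrapegraphai.com/", "https://example.com/"]))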

Validation & Error Handling

SmartCrawler validates every request before it starts crawling (an error-handling sketch follows the list below):

  • Ensures either url or website_html is provided
  • Validates HTML size (max 2MB)
  • Checks for valid URLs and HTML structure
  • Handles empty or invalid prompts
  • Returns clear error messages for all validation failures
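
Because validation failures come back as clear error messages, wrapping calls in a try/except keeps batch crawls from crashing. A minimal sketch; the SDK's specific exception classes are not documented here, so a broad except is used as a placeholder:

# Sketch: catch validation and request errors. The SDK's exception types are
# not documented here, so the base Exception is caught as a placeholder.
try:
    response = client.smartcrawler(
        url="not-a-valid-url",  # fails URL validation
        prompt="Extract info about the company"
    )
except Exception as exc:
    print(f"Crawl request failed: {exc}")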

Endpoint Details

POST https://api.scrapegraphai.com/v1/crawl
Required Headers

Header        Description
SGAI-APIKEY   Your API authentication key
Content-Type  application/json

Request Body

Field              Type    Required  Description
url                string  Yes*      Starting URL (*either url or website_html is required)
website_html       string  No        Raw HTML content (max 2MB)
prompt             string  Yes       Extraction instructions
schema             object  No        Output schema
headers            object  No        Custom headers
number_of_scrolls  int     No        Number of infinite scrolls to perform per page
depth              int     No        Crawl depth
max_pages          int     No        Max pages to crawl
rules              object  No        Crawl rules
sitemap            bool    No        Use sitemap.xml
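
For direct HTTP access, send these fields as a JSON body together with the required headers. A minimal sketch using Python's requests library; the URL and prompt are placeholders:

import requests

# Sketch: call the crawl endpoint directly over HTTP
response = requests.post(
    "https://api.scrapegraphai.com/v1/crawl",
    headers={
        "SGAI-APIKEY": "your-api-key",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://scrapegraphai.com/",
        "prompt": "Extract info about the company",
        "depth": 2,
        "max_pages": 10,
    },
)
print(response.json())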

Response Format

{
  "status": "success",
  "result": {
    "status": "done",
    "llm_result": { /* Structured extraction */ },
    "crawled_urls": ["..."],
    "pages": [ { "url": "...", "markdown": "..." }, ... ]
  }
}
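
The pages array pairs each crawled URL with its markdown content, so per-page processing is straightforward. A minimal sketch, assuming result holds the "result" object from the response above:

# Sketch: iterate over the crawled pages. Assumes `result` is the "result"
# object of a completed crawl, as shown in the response above.
for page in result["pages"]:
    print(page["url"], "->", len(page["markdown"]), "characters of markdown")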

Key Features

  • Multi-Page Extraction: crawl and extract from entire sites, not just single pages
  • AI Understanding: contextual extraction across multiple pages
  • Crawl Rules: fine-tune what gets crawled and extracted
  • Schema Support: define custom output schemas for structured results

Use Cases

  • Site-wide data extraction
  • Product catalog crawling
  • Legal/Privacy/Terms aggregation
  • Research and competitive analysis
  • Multi-page blog/news scraping

Best Practices

  • Be specific in your prompts
  • Use schemas for structured output
  • Set reasonable max_pages and depth
  • Use rules to avoid unwanted pages
  • Handle errors and poll for results

API Reference

For detailed API documentation, see the Endpoint Details section above.

Ready to Start Crawling?

Sign up now and get your API key to begin extracting data with SmartCrawler!