Overview

Crawl traverses a site starting from a URL, follows links up to a depth you set, and returns each page in the formats you request. Crawls are asynchronous: you start a job, then poll (or get notified via webhook) until it completes.
Try Crawl instantly in our interactive playground.

Pricing

A Crawl job costs 2 credits to start, plus the per-page Scrape cost for every page processed. Per-page format costs:
| Format | Credits |
| --- | --- |
| markdown | 1 |
| html | 1 |
| links | 1 |
| images | 1 |
| summary | 1 |
| json | 5 |
| screenshot | 2 |
| branding | 25 |
When a page is requested in multiple formats, the per-format costs are summed. Enabling stealth in fetchConfig adds 5 credits per page; render mode (auto/fast/js) does not affect the cost. See the pricing page for the full breakdown.
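
The pricing rules above reduce to simple arithmetic. The following is a minimal sketch of an estimator, assuming every page is processed with the same set of formats; `estimate_crawl_credits` is a hypothetical helper and not part of the SDK, and the pricing page remains the authoritative source.

```python
# Per-page credit cost of each format, copied from the pricing table above.
FORMAT_CREDITS = {
    "markdown": 1,
    "html": 1,
    "links": 1,
    "images": 1,
    "summary": 1,
    "json": 5,
    "screenshot": 2,
    "branding": 25,
}
CRAWL_START_CREDITS = 2       # flat cost to start a crawl job
STEALTH_CREDITS_PER_PAGE = 5  # surcharge when stealth is enabled in fetchConfig


def estimate_crawl_credits(formats, pages, stealth=False):
    """Estimate total credits for a crawl (hypothetical helper, not an SDK call)."""
    per_page = sum(FORMAT_CREDITS[f] for f in formats)  # format costs are summed
    if stealth:
        per_page += STEALTH_CREDITS_PER_PAGE
    return CRAWL_START_CREDITS + per_page * pages


print(estimate_crawl_credits(["markdown"], pages=5))  # 2 + 1*5 = 7
print(estimate_crawl_credits(["markdown", "json"], pages=10, stealth=True))  # 2 + (1+5+5)*10 = 112
```

Because the per-page cost is multiplied by the number of pages processed, max_pages is effectively the upper bound on what a job can cost.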

Getting Started

Quick Start

import time
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig

# reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI(api_key="...")
sgai = ScrapeGraphAI()

start = sgai.crawl.start(
    "https://scrapegraphai.com/",
    formats=[MarkdownFormatConfig()],
    max_pages=5,
    max_depth=2,
)

if start.status != "success":
    print("Failed:", start.error)
else:
    crawl_id = start.data.id
    print("Crawl started:", crawl_id)

    while True:
        time.sleep(2)
        status = sgai.crawl.get(crawl_id)
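        if status.status != "success":
            # Report why the status check failed instead of exiting silently
            print("Status check failed:", status.error)
            break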
        print(f"{status.data.finished}/{status.data.total} - {status.data.status}")
        if status.data.status in ("completed", "failed"):
            for page in status.data.pages:
                print(f"  {page.url} - {page.status}")
            break
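
The loop above polls indefinitely. A minimal sketch of the same idea with a timeout follows; `wait_for_crawl` and the dict-shaped status are hypothetical stand-ins so the example runs without the SDK or an API key. The terminal states ("completed", "failed") are the ones used in the Quick Start.

```python
import time


def wait_for_crawl(fetch_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll fetch_status() until it reports a terminal state or the timeout passes.

    fetch_status is any zero-argument callable returning a dict with a
    "status" key (hypothetical shape; adapt it to the SDK response object).
    """
    deadline = time.monotonic() + timeout
    while True:
        data = fetch_status()
        if data["status"] in ("completed", "failed"):
            return data
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl still {data['status']!r} after {timeout}s")
        sleep(interval)


# Stand-in for the real API call so the sketch runs on its own.
_responses = iter([
    {"status": "running", "finished": 0, "total": 3},
    {"status": "running", "finished": 2, "total": 3},
    {"status": "completed", "finished": 3, "total": 3},
])
result = wait_for_crawl(lambda: next(_responses), interval=0.01)
print(result["status"])  # completed
```

With the SDK, the callable could wrap sgai.crawl.get(crawl_id) and convert the returned data into this shape; that adapter is omitted here because it depends on response details not covered on this page.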

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | Starting URL to crawl. |
| formats | array | No | Output formats per page (see Scrape formats). |
| maxPages / max_pages | int | No | Maximum number of pages to crawl. Default 50, max 1000. |
| maxDepth / max_depth | int | No | How many levels deep to follow links. Default 2. |
| maxLinksPerPage / max_links_per_page | int | No | Cap on links expanded per page. Default 10. |
| allowExternal / allow_external | bool | No | Whether to follow links to other domains. Default false (same-origin only). |
| includePatterns / include_patterns | array | No | URL patterns to include (e.g. ["/blog/*"]). |
| excludePatterns / exclude_patterns | array | No | URL patterns to exclude (e.g. ["/admin/*"]). |
| contentTypes / content_types | array | No | Limit crawled pages to these MIME types, e.g. ["text/html", "application/pdf"]. |
| fetchConfig / fetch_config | object | No | Fetch options (see Scrape · FetchConfig). |
Get your API key from the dashboard.
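
To build intuition for how max_pages, max_depth, and max_links_per_page bound a crawl together, here is a conceptual sketch over a toy in-memory link graph. It is not the service's actual traversal (ordering and deduplication may differ); `bounded_crawl` and the `site` data are illustrative only. The start URL is treated as depth 0, matching the sample response on this page.

```python
from collections import deque


def bounded_crawl(graph, start, max_pages=50, max_depth=2, max_links_per_page=10):
    """Breadth-first traversal of an in-memory link graph under crawl-style limits.

    graph maps each URL to the list of URLs it links to (toy data, nothing is fetched).
    Returns (url, depth) pairs in visit order.
    """
    visited = [(start, 0)]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for link in graph.get(url, [])[:max_links_per_page]:
            if link in seen:
                continue
            if len(visited) >= max_pages:
                return visited  # page budget exhausted
            seen.add(link)
            visited.append((link, depth + 1))
            queue.append((link, depth + 1))
    return visited


site = {
    "/": ["/blog", "/docs", "/pricing"],
    "/blog": ["/blog/a", "/blog/b"],
    "/docs": ["/docs/api"],
    "/blog/a": ["/blog/a/comments"],
}
print(bounded_crawl(site, "/", max_depth=1))
# [('/', 0), ('/blog', 1), ('/docs', 1), ('/pricing', 1)]
print(len(bounded_crawl(site, "/", max_depth=2, max_pages=5)))  # 5
```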

Example response while the job is still running:
{
  "id": "79694e03-f2ea-43f2-93cc-7c6fc26f999a",
  "status": "running",
  "total": 3,
  "finished": 0,
  "pages": []
}

Example response once the job has completed:
{
  "id": "79694e03-f2ea-43f2-93cc-7c6fc26f999a",
  "status": "completed",
  "total": 3,
  "finished": 1,
  "pages": [
    {
      "url": "https://example.com",
      "depth": 0,
      "title": "",
      "status": "completed",
      "parentUrl": null,
      "contentType": "text/html",
      "links": ["https://iana.org/domains/example"],
      "scrapeRefId": "83a911ed-c0bc-4a8c-ad62-8efeeb93f33a"
    }
  ]
}
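
When working with the raw JSON rather than SDK objects, the fields above can be read directly. This is a small sketch using only the standard library on a trimmed copy of the completed response; the variable names are illustrative.

```python
import json

# Trimmed copy of the completed-job response shown above.
raw = """
{
  "id": "79694e03-f2ea-43f2-93cc-7c6fc26f999a",
  "status": "completed",
  "total": 3,
  "finished": 1,
  "pages": [
    {
      "url": "https://example.com",
      "depth": 0,
      "status": "completed",
      "parentUrl": null,
      "scrapeRefId": "83a911ed-c0bc-4a8c-ad62-8efeeb93f33a"
    }
  ]
}
"""

job = json.loads(raw)
# Guard against division by zero before any pages have been discovered.
progress = job["finished"] / job["total"] if job["total"] else 0.0
# Map each successfully crawled URL to the ID needed to fetch its content later.
ref_ids = {p["url"]: p["scrapeRefId"] for p in job["pages"] if p["status"] == "completed"}

print(f"{job['status']}: {progress:.0%}")  # completed: 33%
print(ref_ids)
```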

Fetching page content

The crawl response returns each page as lightweight metadata (url, depth, scrapeRefId, …), not the full body. Use the History service with each scrapeRefId to pull the formatted content that the underlying scrape produced.
# After the crawl completes, fetch the markdown for each page
for page in status.data.pages:
    if page.status != "completed":
        continue
    entry = sgai.history.get(page.scrape_ref_id)
    # Fall back to [None] when the markdown entry is missing or its data list is empty
    data = entry.data.result.results.get("markdown", {}).get("data") or [None]
    md = data[0]
    print(page.url, "->", md[:80] if md else "(empty)")
See the History service for the full entry shape and the requestParentId linkage that ties each child scrape back to its parent crawl.

Managing Crawl Jobs

Python:

# Check status
status = sgai.crawl.get(crawl_id)

# Stop / resume / delete
sgai.crawl.stop(crawl_id)
sgai.crawl.resume(crawl_id)
sgai.crawl.delete(crawl_id)

JavaScript:

// Check status
await sgai.crawl.get(crawlId);

// Stop / resume / delete
await sgai.crawl.stop(crawlId);
await sgai.crawl.resume(crawlId);
await sgai.crawl.delete(crawlId);

Advanced Usage

URL patterns and fetch config

from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig, FetchConfig

sgai = ScrapeGraphAI()

res = sgai.crawl.start(
    "https://example.com",
    formats=[MarkdownFormatConfig()],
    max_depth=2,
    max_pages=10,
    include_patterns=["/blog/*"],
    exclude_patterns=["/admin/*"],
    fetch_config=FetchConfig(mode="js", stealth=True, wait=1000),
)
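
To preview which URLs a pair of include and exclude patterns might admit, the sketch below applies glob-style matching locally. It rests on assumptions this page does not confirm: that patterns are shell-style globs matched against the URL path only, and that excludes take precedence over includes. `url_allowed` is a hypothetical helper, and the service's real matching rules may differ.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def url_allowed(url, include_patterns=None, exclude_patterns=None):
    """Return True if the URL path passes glob-style include/exclude filters.

    Assumption: patterns are shell-style globs matched against the path only,
    and excludes win over includes. The service's real rules may differ.
    """
    path = urlparse(url).path or "/"
    if exclude_patterns and any(fnmatchcase(path, p) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(fnmatchcase(path, p) for p in include_patterns)
    return True


urls = [
    "https://example.com/blog/hello-world",
    "https://example.com/admin/settings",
    "https://example.com/pricing",
]
for u in urls:
    print(u, url_allowed(u, ["/blog/*"], ["/admin/*"]))
```

Under these assumed rules only the /blog/hello-world URL passes. Note that /blog itself, without a trailing segment, would not match "/blog/*".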

Async Support (Python)

import asyncio
from scrapegraph_py import AsyncScrapeGraphAI

async def main():
    async with AsyncScrapeGraphAI() as sgai:
        start = await sgai.crawl.start(
            "https://example.com",
            max_pages=5,
            max_depth=2,
        )
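        # Bail out early if the job could not be started
        if start.status != "success":
            print("Failed:", start.error)
            return
        status = await sgai.crawl.get(start.data.id)
        print("Status:", status.data.status)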

asyncio.run(main())
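
One reason to use the async client is to wait on several crawls at once. The following is a sketch of that pattern with asyncio.gather; `wait_for_crawl_async` and `fake_job` are hypothetical stand-ins that simulate status responses, so the example runs without the SDK.

```python
import asyncio


async def wait_for_crawl_async(fetch_status, interval=2.0):
    """Await fetch_status() repeatedly until the job is completed or failed."""
    while True:
        data = await fetch_status()
        if data["status"] in ("completed", "failed"):
            return data
        await asyncio.sleep(interval)  # yields control so other jobs can progress


def fake_job(name, polls_until_done):
    """Build a stand-in status coroutine that completes after N polls."""
    calls = {"n": 0}

    async def fetch_status():
        calls["n"] += 1
        done = calls["n"] >= polls_until_done
        return {"id": name, "status": "completed" if done else "running"}

    return fetch_status


async def wait_for_all():
    jobs = [fake_job("a", 3), fake_job("b", 1), fake_job("c", 2)]
    # gather returns results in the order the awaitables were passed in
    return await asyncio.gather(*(wait_for_crawl_async(j, interval=0.01) for j in jobs))


results = asyncio.run(wait_for_all())
print([(r["id"], r["status"]) for r in results])
# [('a', 'completed'), ('b', 'completed'), ('c', 'completed')]
```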

Key Features

Multi-Page Crawling

Traverse entire sites, following links automatically.

Flexible Formats

Request markdown, HTML, links, images, and more per page.

Job Control

Start, stop, resume, and delete crawl jobs.

URL Filtering

Include or exclude by URL pattern.
