Crawl

Overview

Crawl traverses a site starting from a URL, follows links up to a depth you set, and returns each page in the formats you request. Crawls are asynchronous: you start a job, then poll (or receive a webhook notification) until it completes.

Try Crawl instantly in our interactive playground.

Pricing

A Crawl job costs 2 credits to start, plus the per-page Scrape cost for every page processed. Per-page format costs:

Format	Credits
`markdown`	1
`html`	1
`links`	1
`images`	1
`summary`	1
`json`	5
`screenshot`	2
`branding`	25

When a page is requested in multiple formats, the per-format costs are summed. Enabling `stealth` in `fetchConfig` adds 5 credits per page; render mode (`auto`/`fast`/`js`) does not affect the cost. See the pricing page for the full breakdown.
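As a rough illustration of the arithmetic above, the following sketch estimates the credits for a crawl. The helper name `estimate_crawl_credits` is hypothetical and not part of any SDK; the numbers are copied from the table, and the result is an estimate that assumes every requested page is actually processed.

```python
# Per-page credits by format, copied from the pricing table above.
FORMAT_CREDITS = {
    "markdown": 1,
    "html": 1,
    "links": 1,
    "images": 1,
    "summary": 1,
    "json": 5,
    "screenshot": 2,
    "branding": 25,
}
CRAWL_START_CREDITS = 2        # flat cost to start a job
STEALTH_CREDITS_PER_PAGE = 5   # added per page when stealth is enabled


def estimate_crawl_credits(formats, pages, stealth=False):
    """Hypothetical helper: estimate credits for a crawl that processes `pages` pages."""
    # Costs of all requested formats are summed for each page.
    per_page = sum(FORMAT_CREDITS[name] for name in formats)
    if stealth:
        per_page += STEALTH_CREDITS_PER_PAGE
    return CRAWL_START_CREDITS + pages * per_page


print(estimate_crawl_credits(["markdown"], pages=5))                           # 7
print(estimate_crawl_credits(["json", "screenshot"], pages=10, stealth=True))  # 122
```

Because `maxPages` caps the number of pages, passing it as `pages` gives an upper bound on the job's cost under these assumptions.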

Getting Started

Quick Start

Python

import time
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig

# reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI(api_key="...")
sgai = ScrapeGraphAI()

start = sgai.crawl.start(
    "https://scrapegraphai.com/",
    formats=[MarkdownFormatConfig()],
    max_pages=5,
    max_depth=2,
)

if start.status != "success":
    print("Failed:", start.error)
else:
    crawl_id = start.data.id
    print("Crawl started:", crawl_id)

    while True:
        time.sleep(2)
        status = sgai.crawl.get(crawl_id)
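        if status.status != "success":
            # Report why polling stopped instead of exiting silently
            print("Status check failed:", status.error)
            break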
        print(f"{status.data.finished}/{status.data.total} - {status.data.status}")
        if status.data.status in ("completed", "failed"):
            for page in status.data.pages:
                print(f"  {page.url} - {page.status}")
            break

JavaScript

import { ScrapeGraphAI } from "scrapegraph-js";

const sgai = ScrapeGraphAI();

const start = await sgai.crawl.start({
  url: "https://scrapegraphai.com/",
  formats: [{ type: "markdown" }],
  maxPages: 5,
  maxDepth: 2,
});

if (start.status !== "success" || !start.data) {
  console.error("Failed:", start.error);
} else {
  const crawlId = start.data.id;
  console.log("Crawl started:", crawlId);

  while (true) {
    await new Promise((r) => setTimeout(r, 2000));
    const status = await sgai.crawl.get(crawlId);
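    if (status.status !== "success" || !status.data) {
      // Report why polling stopped instead of exiting silently
      console.error("Status check failed:", status.error);
      break;
    }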
    console.log(`${status.data.finished}/${status.data.total} - ${status.data.status}`);
    if (status.data.status === "completed" || status.data.status === "failed") {
      for (const p of status.data.pages) console.log(`  ${p.url} - ${p.status}`);
      break;
    }
  }
}

cURL

# Start a crawl
curl -X POST https://v2-api.scrapegraphai.com/api/crawl \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://scrapegraphai.com/",
    "formats": [{ "type": "markdown" }],
    "maxPages": 5,
    "maxDepth": 2
  }'

# Check status (replace :id with the crawl id returned above)
curl -X GET https://v2-api.scrapegraphai.com/api/crawl/:id \
  -H "SGAI-APIKEY: $SGAI_API_KEY"

# Fetch pages with resolved scrape results
curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/:id/pages?limit=50&cursor=0" \
  -H "SGAI-APIKEY: $SGAI_API_KEY"

Parameters

Parameter	Type	Required	Description
`url`	string	Yes	Starting URL to crawl.
`formats`	array	No	Output formats per page (see Scrape formats).
`maxPages` / `max_pages`	int	No	Maximum number of pages to crawl. Default `50`, max `1000`.
`maxDepth` / `max_depth`	int	No	How many levels deep to follow links. Default `2`.
`maxLinksPerPage` / `max_links_per_page`	int	No	Cap on links expanded per page. Default `10`.
`allowExternal` / `allow_external`	bool	No	Whether to follow links to other domains. Default `false` (same-origin only).
`includePatterns` / `include_patterns`	array	No	URL patterns to include (e.g. `["/blog/*"]`).
`excludePatterns` / `exclude_patterns`	array	No	URL patterns to exclude (e.g. `["/admin/*"]`).
`contentTypes` / `content_types`	array	No	Limit crawled pages to these MIME types, e.g. `["text/html", "application/pdf"]`.
`fetchConfig` / `fetch_config`	object	No	Fetch options (see Scrape · FetchConfig).
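To make the defaults and limits in the table concrete, here is a minimal sketch that assembles a `POST /api/crawl` request body. The function `build_crawl_payload` is hypothetical (it is not an SDK function), the lower bound of 1 on `max_pages` is an assumption, and the server may validate requests differently.

```python
import json

MAX_PAGES_LIMIT = 1000  # documented maximum for maxPages


def build_crawl_payload(url, formats=None, max_pages=50, max_depth=2,
                        max_links_per_page=10, allow_external=False,
                        include_patterns=None, exclude_patterns=None):
    """Hypothetical helper: build a crawl request body using the documented defaults."""
    # The lower bound of 1 is an assumption; only the maximum is documented.
    if not 1 <= max_pages <= MAX_PAGES_LIMIT:
        raise ValueError(f"max_pages must be between 1 and {MAX_PAGES_LIMIT}")

    payload = {
        "url": url,
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "maxLinksPerPage": max_links_per_page,
        "allowExternal": allow_external,
    }
    # Optional fields are left out when unset.
    if formats:
        payload["formats"] = formats
    if include_patterns:
        payload["includePatterns"] = include_patterns
    if exclude_patterns:
        payload["excludePatterns"] = exclude_patterns
    return payload


body = build_crawl_payload(
    "https://example.com",
    formats=[{"type": "markdown"}],
    include_patterns=["/blog/*"],
)
print(json.dumps(body, indent=2))
```

The REST API takes the camelCase names, while the Python SDK takes the snake_case names as keyword arguments, as shown in the Quick Start.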

Get your API key from the dashboard.

Example Response (start)

{
  "id": "79694e03-f2ea-43f2-93cc-7c6fc26f999a",
  "status": "running",
  "total": 3,
  "finished": 0,
  "pages": []
}

Example Response (get)

{
  "id": "79694e03-f2ea-43f2-93cc-7c6fc26f999a",
  "status": "completed",
  "total": 3,
  "finished": 1,
  "pages": [
    {
      "url": "https://example.com",
      "depth": 0,
      "title": "",
      "status": "completed",
      "parentUrl": null,
      "contentType": "text/html",
      "links": ["https://iana.org/domains/example"],
      "scrapeRefId": "83a911ed-c0bc-4a8c-ad62-8efeeb93f33a"
    }
  ]
}
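The status payload above can drive a polling loop with a timeout, which the Quick Start loop does not have. The sketch below is one possible shape, not the SDK's implementation: `wait_for_crawl` and its `get_status` parameter are hypothetical names, and `get_status` is assumed to return a dict shaped like the example response. The terminal states are the two the Quick Start loop stops on.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}  # states the Quick Start loop stops on


def wait_for_crawl(get_status, crawl_id, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll until the job reaches a terminal status or `timeout` seconds elapse.

    `get_status` is any callable that takes a crawl id and returns a dict
    shaped like the example response above (hypothetical injection point).
    """
    deadline = time.monotonic() + timeout
    while True:
        job = get_status(crawl_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl {crawl_id} still {job['status']} after {timeout}s")
        sleep(interval)


# Demo with canned responses instead of real API calls.
canned = iter([
    {"status": "running", "total": 3, "finished": 0, "pages": []},
    {"status": "completed", "total": 3, "finished": 3, "pages": []},
])
job = wait_for_crawl(lambda _id: next(canned), "demo-id", sleep=lambda _s: None)
print(job["status"], f'{job["finished"]}/{job["total"]}')
```

In real use, `get_status` would wrap a `GET /api/crawl/:id` request; injecting it keeps the loop testable without network access.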

Fetching page content

`GET /api/crawl/:id` is designed for status polling and returns lightweight page metadata. To fetch the actual per-page content, call the paginated pages endpoint:

cURL

curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/79694e03-f2ea-43f2-93cc-7c6fc26f999a/pages?limit=50&cursor=0" \
  -H "SGAI-APIKEY: $SGAI_API_KEY"

The response is cursor-paginated:

{
  "data": [
    {
      "url": "https://example.com",
      "status": "completed",
      "depth": 0,
      "parentUrl": null,
      "scrapeRefId": "83a911ed-c0bc-4a8c-ad62-8efeeb93f33a",
      "scrape": {
        "results": {
          "markdown": {
            "data": ["# Example Domain\n\nThis domain is for use in illustrative examples..."]
          }
        },
        "metadata": {
          "contentType": "text/html"
        }
      }
    }
  ],
  "pagination": {
    "limit": 50,
    "nextCursor": null
  }
}

`limit` controls how many crawl pages are returned in one response. It defaults to 50, with a maximum of 100. `cursor` is a zero-based index into the ordered crawl page list. Start with `cursor=0`, then use `pagination.nextCursor` as the next request’s cursor until it returns `null`.

# First 50 pages
curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/:id/pages?limit=50&cursor=0" \
  -H "SGAI-APIKEY: $SGAI_API_KEY"

# Next 50 pages when the previous response returns "nextCursor": "50"
curl -X GET "https://v2-api.scrapegraphai.com/api/crawl/:id/pages?limit=50&cursor=50" \
  -H "SGAI-APIKEY: $SGAI_API_KEY"
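The cursor loop described above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not SDK code: `fetch_pages`, `iter_crawl_pages`, and `page_markdown` are hypothetical helpers, the endpoint and `SGAI-APIKEY` header are taken from the cURL examples, and the response is assumed to match the example shape shown earlier.

```python
import json
import os
import urllib.parse
import urllib.request

BASE_URL = "https://v2-api.scrapegraphai.com/api"


def fetch_pages(crawl_id, cursor, limit):
    """GET /api/crawl/:id/pages for one cursor position; returns the decoded JSON body."""
    query = urllib.parse.urlencode({"limit": limit, "cursor": cursor})
    request = urllib.request.Request(
        f"{BASE_URL}/crawl/{crawl_id}/pages?{query}",
        headers={"SGAI-APIKEY": os.environ["SGAI_API_KEY"]},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_crawl_pages(crawl_id, limit=50, fetch=fetch_pages):
    """Yield every page entry, following pagination.nextCursor until it is null."""
    cursor = 0
    while cursor is not None:
        body = fetch(crawl_id, cursor, limit)
        yield from body["data"]
        # nextCursor is passed back unchanged, whether it arrives as a number or a string.
        cursor = body["pagination"]["nextCursor"]


def page_markdown(page):
    """Return the markdown chunks of one page entry, or [] if there are none."""
    results = (page.get("scrape") or {}).get("results", {})
    return results.get("markdown", {}).get("data", [])
```

A typical use would be `for page in iter_crawl_pages(crawl_id): print(page["url"], page_markdown(page))`. The `fetch` parameter exists so the loop can be exercised with a stub instead of live requests.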

See the Get crawl pages API reference for the full response shape. If you only need one page’s underlying Scrape request, fetch that page’s `scrapeRefId` through History:

curl -X GET https://v2-api.scrapegraphai.com/api/history/83a911ed-c0bc-4a8c-ad62-8efeeb93f33a \
  -H "SGAI-APIKEY: $SGAI_API_KEY"

Managing Crawl Jobs

Python

# Check status
status = sgai.crawl.get(crawl_id)

# Stop / resume / delete
sgai.crawl.stop(crawl_id)
sgai.crawl.resume(crawl_id)
sgai.crawl.delete(crawl_id)

JavaScript

await sgai.crawl.get(crawlId);
await sgai.crawl.stop(crawlId);
await sgai.crawl.resume(crawlId);
await sgai.crawl.delete(crawlId);

Advanced Usage

URL patterns and fetch config

Python

from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig, FetchConfig

sgai = ScrapeGraphAI()

res = sgai.crawl.start(
    "https://example.com",
    formats=[MarkdownFormatConfig()],
    max_depth=2,
    max_pages=10,
    include_patterns=["/blog/*"],
    exclude_patterns=["/admin/*"],
    fetch_config=FetchConfig(mode="js", stealth=True, wait=1000),
)

Async Support (Python)

import asyncio
from scrapegraph_py import AsyncScrapeGraphAI

async def main():
    async with AsyncScrapeGraphAI() as sgai:
        start = await sgai.crawl.start(
            "https://example.com",
            max_pages=5,
            max_depth=2,
        )
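        # Bail out early if the job could not be started
        if start.status != "success":
            print("Failed:", start.error)
            return
        status = await sgai.crawl.get(start.data.id)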
        print("Status:", status.data.status)

asyncio.run(main())

Key Features

Multi-Page Crawling

Traverse entire sites, following links automatically.

Flexible Formats

Request markdown, HTML, links, images, and more per page.

Job Control

Start, stop, resume, and delete crawl jobs.

URL Filtering

Include or exclude by URL pattern.

Integration Options

Official SDKs

Python SDK
JavaScript SDK (scrapegraph-js ≥ 2.1.0, Node ≥ 22)

AI Framework Integrations

Support & Resources

Documentation

Guides and tutorials

API Reference

Detailed API documentation

Community

Join our Discord community

GitHub

Check out our open-source projects
