Overview

Crawl is an advanced web crawling service that traverses multiple pages, follows links, and returns content in your preferred format (markdown or HTML). It provides namespaced operations for starting, monitoring, stopping, and resuming crawl jobs.
Try Crawl instantly in our interactive playground

Getting Started

Quick Start

from scrapegraph_py import Client

client = Client(api_key="your-api-key")

# Start a crawl
response = client.crawl.start(
    "https://example.com",
    depth=2,
    max_pages=10,
    format="markdown",
)
print("Crawl started:", response)

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The starting URL to crawl. |
| depth | int | No | How many levels deep to follow links. |
| max_pages | int | No | Maximum number of pages to crawl. |
| format | string | No | Output format: "markdown" or "html". Default: "markdown". |
| include_patterns | list | No | URL patterns to include (e.g., ["/blog/*"]). |
| exclude_patterns | list | No | URL patterns to exclude (e.g., ["/admin/*"]). |
| fetch_config | FetchConfig | No | Configuration for page fetching (headers, stealth, etc.). |
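To build intuition for how include/exclude patterns interact, here is a small illustrative matcher using Python's glob-style `fnmatch`. The glob semantics and the rule "exclude wins, then include must match if given" are assumptions for illustration only; the service's actual matching rules may differ.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def url_allowed(url, include_patterns=None, exclude_patterns=None):
    """Illustrative glob-style filter: exclusions win, then inclusions must match."""
    path = urlparse(url).path
    # An exclude pattern match rejects the URL outright.
    if exclude_patterns and any(fnmatch(path, p) for p in exclude_patterns):
        return False
    # If include patterns are given, at least one must match.
    if include_patterns:
        return any(fnmatch(path, p) for p in include_patterns)
    return True
```

With `include_patterns=["/blog/*"]`, a URL like `https://example.com/blog/post-1` would be crawled, while `https://example.com/about` would be skipped.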
Get your API key from the dashboard

Managing Crawl Jobs

Check Status

# crawl_id is the id returned when the crawl was started
status = client.crawl.status(crawl_id)
print("Status:", status)
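A common pattern is to poll the status until the job reaches a terminal state. The sketch below assumes the status response is a dict with a "status" key and that "completed", "failed", and "stopped" are terminal values; verify these against the actual response shape.

```python
import time

def wait_for_crawl(get_status, interval=2.0, timeout=60.0):
    """Poll a zero-argument status callable until a terminal state or timeout.

    Pass e.g. `lambda: client.crawl.status(crawl_id)` as `get_status`.
    The terminal state names here are assumptions for illustration.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()
        if status.get("status") in ("completed", "failed", "stopped"):
            return status
        time.sleep(interval)
    raise TimeoutError("crawl did not finish within the timeout")
```

Keeping the status fetch behind a callable makes the helper easy to reuse and to test without network access.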

Stop a Running Crawl

client.crawl.stop(crawl_id)

Resume a Stopped Crawl

client.crawl.resume(crawl_id)

Advanced Usage

With FetchConfig

from scrapegraph_py import Client, FetchConfig

client = Client(api_key="your-api-key")

response = client.crawl.start(
    "https://example.com",
    depth=2,
    max_pages=10,
    format="markdown",
    include_patterns=["/blog/*"],
    exclude_patterns=["/admin/*"],
    fetch_config=FetchConfig(
        mode="js",
        stealth=True,
        wait=1000,
        headers={"User-Agent": "MyBot"},
    ),
)

Async Support

import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient(api_key="your-api-key") as client:
        job = await client.crawl.start(
            "https://example.com",
            depth=2,
            max_pages=5,
        )
        
        status = await client.crawl.status(job["id"])
        print("Crawl status:", status)

asyncio.run(main())
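The same polling pattern works asynchronously. As above, the "status" key and the terminal state names are assumptions for illustration; `get_status` is any coroutine function returning the status dict (e.g. `lambda: client.crawl.status(job["id"])` wrapped in the async client).

```python
import asyncio

async def wait_for_crawl_async(get_status, interval=2.0, timeout=60.0):
    """Await a status coroutine until a terminal state or timeout."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        status = await get_status()
        if status.get("status") in ("completed", "failed", "stopped"):
            return status
        await asyncio.sleep(interval)
    raise TimeoutError("crawl did not finish within the timeout")
```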

Key Features

Multi-Page Crawling

Traverse entire websites following links automatically

Flexible Formats

Get results in markdown or HTML format

Job Control

Start, stop, resume, and monitor crawl jobs

URL Filtering

Include or exclude pages by URL patterns

Integration Options

Official SDKs

  • Python SDK - Perfect for data science and backend applications
  • JavaScript SDK - Ideal for web applications and Node.js

Support & Resources

Documentation

Comprehensive guides and tutorials

API Reference

Detailed API documentation

Community

Join our Discord community

GitHub

Check out our open-source projects