Python Support

These docs cover scrapegraph-py ≥ 2.1.0 and require Python ≥ 3.12. Earlier 1.x releases expose the deprecated v1 API and point to a different backend; none of the snippets on this page work there. The 2.0.x series used typed request wrappers (ScrapeRequest, ExtractRequest, …); 2.1.0 removed those wrappers in favour of direct positional/keyword arguments, so upgrade if you are pinned to 2.0.x.

Installation

pip install "scrapegraph-py>=2.1.0"
# or
uv add "scrapegraph-py>=2.1.0"

What's New in v2

  • Complete rewrite built on Pydantic v2 + httpx.
  • Client rename: Client β†’ ScrapeGraphAI, AsyncClient β†’ AsyncScrapeGraphAI.
  • Direct arguments (v2.1.0): every method accepts positional/keyword args β€” no more ScrapeRequest/ExtractRequest/… wrappers.
  • ApiResult[T] wrapper: no exceptions on API errors β€” every call returns status: "success" | "error", data, error, and elapsed_ms.
  • Nested resources: sgai.crawl.*, sgai.monitor.*, sgai.history.*.
  • camelCase on the wire, snake_case in Python: automatic via Pydantic’s alias_generator.
  • Removed: markdownify(), agenticscraper(), sitemap(), feedback() β€” use scrape() with the appropriate format entry instead.
v2 is a breaking release. See the Migration Guide if you’re upgrading from v1.
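
A scrape call that used a typed wrapper on 2.0.x becomes a direct call in 2.1.0. A minimal before/after sketch (the exact 2.0.x ScrapeRequest signature is illustrative):

from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

# 2.0.x (typed request wrappers; illustrative):
#   result = sgai.scrape(ScrapeRequest(url="https://example.com"))

# 2.1.0+ (direct arguments):
result = sgai.scrape("https://example.com")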

Quick Start

from scrapegraph_py import ScrapeGraphAI

# reads SGAI_API_KEY from env, or pass it explicitly:
# sgai = ScrapeGraphAI(api_key="sgai-...")
sgai = ScrapeGraphAI()

result = sgai.scrape("https://example.com")

if result.status == "success":
    print(result.data.results["markdown"]["data"])
else:
    print(result.error)

ApiResult

Every method returns ApiResult[T], so no try/except is needed for API errors:
from typing import Generic, Literal, TypeVar
from pydantic import BaseModel

T = TypeVar("T")

class ApiResult(BaseModel, Generic[T]):
    status: Literal["success", "error"]
    data: T | None
    error: str | None = None
    elapsed_ms: int
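
If you prefer failing fast over branching on status at every call site, a thin helper can turn an error result into an exception. A minimal sketch built only on the fields above (unwrap and ApiError are illustrative names, not part of the SDK):

class ApiError(RuntimeError):
    pass

def unwrap(result: ApiResult[T]) -> T:
    # Convert an error result into an exception; return data on success.
    if result.status == "error":
        raise ApiError(result.error or "unknown API error")
    assert result.data is not None
    return result.data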

Environment Variables

  • SGAI_API_KEY: your ScrapeGraphAI API key (no default)
  • SGAI_API_URL: override the API base URL (default: https://v2-api.scrapegraphai.com/api)
  • SGAI_TIMEOUT: request timeout in seconds (default: 120)
  • SGAI_DEBUG: set to "1" to enable debug logging (off by default)
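
The same settings can be applied from code by setting the variables before constructing the client. A minimal sketch, assuming the SGAI_* variables are read at construction time:

import os

os.environ["SGAI_TIMEOUT"] = "30"  # override the 120 s default
os.environ["SGAI_DEBUG"] = "1"     # enable debug logging

from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()
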
The client supports context managers for automatic session cleanup:
with ScrapeGraphAI() as sgai:
    result = sgai.scrape("https://example.com")

Services

Scrape

Fetch a page in one or more formats (markdown, html, screenshot, json, links, images, summary, branding).
from scrapegraph_py import (
    ScrapeGraphAI, FetchConfig,
    MarkdownFormatConfig, ScreenshotFormatConfig, JsonFormatConfig,
)

sgai = ScrapeGraphAI()

res = sgai.scrape(
    "https://example.com",
    formats=[
        MarkdownFormatConfig(mode="reader"),
        ScreenshotFormatConfig(full_page=True, width=1440, height=900),
        JsonFormatConfig(prompt="Extract product info"),
    ],
    content_type="text/html",  # optional, auto-detected
    fetch_config=FetchConfig(
        mode="js",
        stealth=True,
        timeout=30000,
        wait=2000,
        scrolls=3,
    ),
)

if res.status == "success":
    markdown = res.data.results["markdown"]["data"]

scrape() parameters

  • url (str, required): URL to scrape (positional)
  • formats (list[FormatConfig], optional): defaults to [MarkdownFormatConfig()]
  • content_type (str, optional): override the detected content type (e.g. "application/pdf", "text/html")
  • fetch_config (FetchConfig, optional): fetch configuration (mode, stealth, timeout, cookies, country, …)

Format entries

  • MarkdownFormatConfig: mode ("normal" | "reader" | "prune")
  • HtmlFormatConfig: mode (same values as above)
  • ScreenshotFormatConfig: full_page, width (320–3840), height (200–2160), quality
  • JsonFormatConfig: prompt (1–10k chars), schema (JSON Schema dict; pass a Pydantic model's model_json_schema() to reuse a BaseModel), mode
  • LinksFormatConfig, ImagesFormatConfig, SummaryFormatConfig, BrandingFormatConfig: no fields
Duplicate type entries in formats are rejected by a Pydantic validator.
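
Because the check runs client-side, a duplicated format type fails before any request is sent. A minimal sketch, assuming the validator surfaces as a pydantic.ValidationError at call time:

from pydantic import ValidationError
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig

sgai = ScrapeGraphAI()

try:
    sgai.scrape(
        "https://example.com",
        formats=[MarkdownFormatConfig(), MarkdownFormatConfig()],  # duplicate type
    )
except ValidationError as exc:
    print(exc)  # rejected locally, before the HTTP call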

Extract

Run structured extraction against a URL, HTML, or markdown using AI.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

res = sgai.extract(
    "Extract product names and prices",
    url="https://example.com",
    schema={
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name":  {"type": "string"},
                        "price": {"type": "string"},
                    },
                },
            },
        },
    },
)

if res.status == "success":
    print(res.data.json_data)
    print(f"Tokens: {res.data.usage.prompt_tokens} / {res.data.usage.completion_tokens}")
Using a Pydantic model as the schema
schema= is a JSON Schema dict. Any Pydantic BaseModel produces one via model_json_schema(), so you can define the desired shape once and reuse it to validate the response client-side.
from pydantic import BaseModel, Field
from scrapegraph_py import ScrapeGraphAI

class Product(BaseModel):
    name: str
    price: str | None = None

class Products(BaseModel):
    products: list[Product] = Field(default_factory=list)

sgai = ScrapeGraphAI()

res = sgai.extract(
    "Extract product names and prices",
    url="https://example.com",
    schema=Products.model_json_schema(),
)

if res.status == "success":
    parsed = Products.model_validate(res.data.json_data)
    for p in parsed.products:
        print(p.name, p.price)
The same pattern works for JsonFormatConfig(schema=...) in scrape() and for search(schema=...).

extract() parameters

  • prompt (str, required): 1–10,000 chars (positional)
  • url (str, required*): page URL
  • html (str, required*): raw HTML (alternative to url)
  • markdown (str, required*): raw markdown (alternative to url)
  • schema (dict, optional): JSON Schema for the structured output; pass a Pydantic model's model_json_schema() to reuse a BaseModel
  • mode (str, optional): "normal" (default), "reader", "prune"
  • content_type (str, optional): override the detected content type
  • fetch_config (FetchConfig, optional): fetch configuration
*At least one of url, html, or markdown is required.
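
When you already have the content in hand, extraction skips fetching entirely. A minimal sketch using the html= alternative from the list above:

from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

html = "<html><body><h1>Widget</h1><span>$9.99</span></body></html>"

res = sgai.extract(
    "Extract the product name and price",
    html=html,  # markdown= works the same way; no url needed
)

if res.status == "success":
    print(res.data.json_data)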

Search

Run a web search and optionally extract structured data from the results.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

res = sgai.search(
    "best programming languages 2024",
    num_results=5,
    prompt="Summarize the top languages and reasons",
    time_range="past_week",
    location_geo_code="us",
)

if res.status == "success":
    for hit in res.data.results:
        print(hit.title, hit.url)
    print(res.data.json_data)  # when prompt/schema are set

search() parameters

  • query (str, required): 1–500 chars (positional)
  • num_results (int, optional): 1–20, default 3
  • format (str, optional): "markdown" (default) or "html"
  • mode (str, optional): HTML processing; "prune" (default), "normal", "reader"
  • prompt (str, optional): required when schema is set
  • schema (dict, optional): JSON Schema for structured output; pass a Pydantic model's model_json_schema() to reuse a BaseModel
  • location_geo_code (str, optional): two-letter country code (e.g. "us", "it")
  • time_range (str, optional): "past_hour", "past_24_hours", "past_week", "past_month", "past_year"
  • fetch_config (FetchConfig, optional): fetch configuration
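
To get structured output from a search, pass schema together with prompt (as noted above, prompt is required whenever schema is set). A minimal sketch reusing the Product/Products models from the Extract section:

from pydantic import BaseModel, Field
from scrapegraph_py import ScrapeGraphAI

class Product(BaseModel):
    name: str
    price: str | None = None

class Products(BaseModel):
    products: list[Product] = Field(default_factory=list)

sgai = ScrapeGraphAI()

res = sgai.search(
    "best laptops 2024",
    num_results=5,
    prompt="Extract product names and prices",  # required with schema
    schema=Products.model_json_schema(),
)

if res.status == "success":
    parsed = Products.model_validate(res.data.json_data)
    for p in parsed.products:
        print(p.name, p.price)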

Crawl

Crawl a site and its linked pages asynchronously. Access via the sgai.crawl resource.
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig

sgai = ScrapeGraphAI()

# Start
start = sgai.crawl.start(
    "https://example.com",
    formats=[MarkdownFormatConfig()],
    max_depth=2,
    max_pages=50,
    max_links_per_page=10,
    include_patterns=["/blog/*"],
    exclude_patterns=["/admin/*"],
)

crawl_id = start.data.id

# Poll
status = sgai.crawl.get(crawl_id)
print(f"{status.data.finished}/{status.data.total} - {status.data.status}")

# Control
sgai.crawl.stop(crawl_id)
sgai.crawl.resume(crawl_id)
sgai.crawl.delete(crawl_id)

crawl.start() parameters

  • url (str, required): starting URL (positional)
  • formats (list[FormatConfig], optional): defaults to [MarkdownFormatConfig()]
  • max_depth (int, optional): ≥ 0, default 2
  • max_pages (int, optional): 1–1000, default 50
  • max_links_per_page (int, optional): ≥ 1, default 10
  • allow_external (bool, optional): default False
  • include_patterns (list[str], optional): URL glob patterns to include
  • exclude_patterns (list[str], optional): URL glob patterns to exclude
  • content_types (list[str], optional): allowed response content types
  • fetch_config (FetchConfig, optional): fetch configuration
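
Since crawls run in the background, the usual pattern is to poll crawl.get() until the job reaches a terminal state. A minimal sketch; the "completed" and "failed" status strings are assumptions, so compare finished against total if your status values differ:

import time

from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

start = sgai.crawl.start("https://example.com", max_pages=25)
crawl_id = start.data.id

while True:
    status = sgai.crawl.get(crawl_id)
    print(f"{status.data.finished}/{status.data.total} - {status.data.status}")
    if status.data.status in ("completed", "failed"):  # assumed terminal states
        break
    time.sleep(5)  # avoid hammering the status endpoint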

Monitor

Scheduled extraction jobs. Access via the sgai.monitor resource.
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig

sgai = ScrapeGraphAI()

mon = sgai.monitor.create(
    "https://example.com",
    "0 * * * *",                 # cron expression (positional)
    name="Price Monitor",
    formats=[MarkdownFormatConfig()],
    webhook_url="https://example.com/webhook",
)

cron_id = mon.data.cron_id

sgai.monitor.list()
sgai.monitor.get(cron_id)
sgai.monitor.update(cron_id, interval="0 */6 * * *")
sgai.monitor.pause(cron_id)
sgai.monitor.resume(cron_id)
sgai.monitor.delete(cron_id)

monitor.activity(): poll tick history

Paginate through the per-run ticks a monitor has produced (what changed on each scheduled run).
act = sgai.monitor.activity(cron_id, limit=20)

if act.status == "success":
    for tick in act.data.ticks:
        status = "CHANGED" if tick.changed else "no change"
        print(f"[{tick.created_at}] {tick.status} - {status} ({tick.elapsed_ms}ms)")

    if act.data.next_cursor:
        more = sgai.monitor.activity(cron_id, limit=20, cursor=act.data.next_cursor)
monitor.activity() accepts limit (1–100, default 20) and an optional cursor for pagination. Each MonitorTickEntry exposes id, created_at, status, changed, elapsed_ms, and a diffs model with per-format deltas.

monitor.create() parameters

  • url (str, required): URL to monitor (positional)
  • interval (str, required): cron expression, 1–100 chars (positional)
  • name (str, optional): ≤ 200 chars
  • formats (list[FormatConfig], optional): defaults to [MarkdownFormatConfig()]
  • webhook_url (str, optional): webhook invoked on change detection
  • fetch_config (FetchConfig, optional): fetch configuration

History

Fetch recent request history. Access via the sgai.history resource.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

page = sgai.history.list(service="scrape", page=1, limit=20)
for entry in page.data.data:
    print(entry.id, entry.service, entry.status, entry.elapsed_ms)

one = sgai.history.get("request-id")
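
To walk the full history rather than a single page, increment page until a request comes back empty. A minimal sketch, assuming an empty data list marks the end:

page_num = 1
while True:
    page = sgai.history.list(service="scrape", page=page_num, limit=20)
    if page.status != "success" or not page.data.data:
        break  # assumed: an empty page means no more entries
    for entry in page.data.data:
        print(entry.id, entry.status, entry.elapsed_ms)
    page_num += 1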

Credits / Health

credits = sgai.credits()
# ApiResult[CreditsResponse] with .remaining, .used, .plan, .jobs.crawl, .jobs.monitor

health = sgai.health()
# ApiResult[HealthResponse] with .status, .uptime, .services
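
A pre-flight balance check before kicking off expensive work only needs the fields above. A minimal sketch (the 100-credit threshold is arbitrary):

credits = sgai.credits()
if credits.status == "success" and credits.data.remaining < 100:
    raise SystemExit(f"Low balance: {credits.data.remaining} credits left")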

Configuration Objects

FetchConfig

Controls how pages are fetched. See the proxy configuration guide for details on modes and geotargeting.
from scrapegraph_py import FetchConfig

config = FetchConfig(
    mode="js",            # "auto" (default), "fast", "js"
    stealth=True,         # Residential proxies / anti-bot headers (+5 credits)
    timeout=30000,        # 1,000–60,000 ms
    wait=2000,            # 0–30,000 ms
    scrolls=3,            # 0–100
    country="us",         # ISO 3166-1 alpha-2
    headers={"X-Custom": "header"},
    cookies={"session": "abc"},
    mock=False,           # Or a MockConfig object for testing
)

Async Support

Every sync method has an async equivalent on AsyncScrapeGraphAI:
import asyncio
from scrapegraph_py import AsyncScrapeGraphAI

async def main():
    async with AsyncScrapeGraphAI() as sgai:
        res = await sgai.scrape("https://example.com")
        if res.status == "success":
            print(res.data.results["markdown"]["data"])

        start = await sgai.crawl.start("https://example.com", max_pages=25)
        status = await sgai.crawl.get(start.data.id)
        print(status.data.status)

        credits = await sgai.credits()
        print(credits.data.remaining)

asyncio.run(main())
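
Because every call is awaitable, concurrent scrapes compose with standard asyncio tooling. A minimal sketch using asyncio.gather:

import asyncio

from scrapegraph_py import AsyncScrapeGraphAI

async def fetch_all(urls: list[str]) -> None:
    async with AsyncScrapeGraphAI() as sgai:
        # Issue one scrape per URL concurrently over the shared session.
        results = await asyncio.gather(*(sgai.scrape(u) for u in urls))
    for url, res in zip(urls, results):
        print(url, res.status)

asyncio.run(fetch_all(["https://example.com", "https://example.org"]))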

Support

GitHub

Report issues and contribute to the SDK

Email Support

Get help from our development team