These docs cover scrapegraph-py ≥ 2.1.0 and require Python ≥ 3.12. Earlier 1.x releases expose the deprecated v1 API and point to a different backend; none of the snippets on this page work there. The 2.0.x series used typed request wrappers (ScrapeRequest, ExtractRequest, …); 2.1.0 removed those wrappers in favour of direct positional/keyword arguments, so upgrade if you are pinned to 2.0.x.
Installation
pip install "scrapegraph-py>=2.1.0"
# or
uv add "scrapegraph-py>=2.1.0"
What's New in v2
Complete rewrite built on Pydantic v2 + httpx.
Client rename: Client → ScrapeGraphAI, AsyncClient → AsyncScrapeGraphAI.
Direct arguments (v2.1.0): every method accepts positional/keyword args; no more ScrapeRequest/ExtractRequest/… wrappers.
ApiResult[T] wrapper: no exceptions on API errors; every call returns status: "success" | "error", data, error, and elapsed_ms.
Nested resources: sgai.crawl.*, sgai.monitor.*, sgai.history.*.
camelCase on the wire, snake_case in Python: automatic via Pydantic's alias_generator.
Removed: markdownify(), agenticscraper(), sitemap(), feedback(); use scrape() with the appropriate format entry instead.
v2 is a breaking release. See the Migration Guide if you're upgrading from v1.
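For orientation, here is a hedged before/after of the wrapper removal; the v2.0.x call shape is an assumption, shown for illustration only:

from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

# v2.0.x (typed request wrapper; assumed shape, no longer available):
# result = sgai.scrape(ScrapeRequest(url="https://example.com"))

# v2.1.0+ (direct positional/keyword arguments):
result = sgai.scrape("https://example.com")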
Quick Start
from scrapegraph_py import ScrapeGraphAI

# reads SGAI_API_KEY from env, or pass it explicitly:
# sgai = ScrapeGraphAI(api_key="sgai-...")
sgai = ScrapeGraphAI()

result = sgai.scrape("https://example.com")
if result.status == "success":
    print(result.data.results["markdown"]["data"])
else:
    print(result.error)
ApiResult
Every method returns ApiResult[T]; no try/except needed for API errors:

from typing import Generic, Literal, TypeVar

from pydantic import BaseModel

T = TypeVar("T")

class ApiResult(BaseModel, Generic[T]):
    status: Literal["success", "error"]
    data: T | None
    error: str | None = None
    elapsed_ms: int
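Because errors arrive as values rather than exceptions, a small unwrap helper is a common pattern. The unwrap function below is our own sketch, not part of the SDK:

from scrapegraph_py import ScrapeGraphAI

def unwrap(result):
    # Not an SDK helper: raise only when the caller cannot proceed;
    # otherwise branch on result.status as in the examples above.
    if result.status == "error":
        raise RuntimeError(result.error)
    return result.data

sgai = ScrapeGraphAI()
data = unwrap(sgai.scrape("https://example.com"))
print(data.results["markdown"]["data"])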
Environment Variables
| Variable | Description | Default |
| --- | --- | --- |
| SGAI_API_KEY | Your ScrapeGraphAI API key | – |
| SGAI_API_URL | Override API base URL | https://v2-api.scrapegraphai.com/api |
| SGAI_TIMEOUT | Request timeout in seconds | 120 |
| SGAI_DEBUG | Enable debug logging (set to "1") | off |
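These can also be set programmatically before the client is created; the values below are placeholders:

import os

# Placeholder values; prefer setting these in your shell or deployment config.
os.environ["SGAI_API_KEY"] = "sgai-..."
os.environ["SGAI_TIMEOUT"] = "60"  # seconds
os.environ["SGAI_DEBUG"] = "1"     # enable debug logging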
The client supports context managers for automatic session cleanup:
with ScrapeGraphAI() as sgai:
    result = sgai.scrape("https://example.com")
Services
Scrape
Fetch a page in one or more formats (markdown, html, screenshot, json, links, images, summary, branding).
from scrapegraph_py import (
    ScrapeGraphAI,
    FetchConfig,
    MarkdownFormatConfig,
    ScreenshotFormatConfig,
    JsonFormatConfig,
)

sgai = ScrapeGraphAI()
res = sgai.scrape(
    "https://example.com",
    formats=[
        MarkdownFormatConfig(mode="reader"),
        ScreenshotFormatConfig(full_page=True, width=1440, height=900),
        JsonFormatConfig(prompt="Extract product info"),
    ],
    content_type="text/html",  # optional, auto-detected
    fetch_config=FetchConfig(
        mode="js",
        stealth=True,
        timeout=30000,
        wait=2000,
        scrolls=3,
    ),
)
if res.status == "success":
    markdown = res.data.results["markdown"]["data"]
scrape() parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | str | Yes | URL to scrape (positional) |
| formats | list[FormatConfig] | No | Defaults to [MarkdownFormatConfig()] |
| content_type | str | No | Override detected content type (e.g. "application/pdf", "text/html") |
| fetch_config | FetchConfig | No | Fetch configuration (mode, stealth, timeout, cookies, country, …) |
| Class | Fields |
| --- | --- |
| MarkdownFormatConfig | mode: "normal" \| "reader" \| "prune" |
| HtmlFormatConfig | mode: same as above |
| ScreenshotFormatConfig | full_page, width (320–3840), height (200–2160), quality |
| JsonFormatConfig | prompt (1–10k chars), schema (JSON Schema dict; pass a Pydantic model's model_json_schema() to reuse a BaseModel), mode |
| LinksFormatConfig | – |
| ImagesFormatConfig | – |
| SummaryFormatConfig | – |
| BrandingFormatConfig | – |
Duplicate type entries in formats are rejected by a Pydantic validator.
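When several formats are requested, each result lands under its own key in res.data.results. Continuing the example above; the non-markdown key names ("screenshot", "json") are assumptions based on the format names:

if res.status == "success":
    results = res.data.results
    markdown = results["markdown"]["data"]
    screenshot = results.get("screenshot")  # assumed key name
    structured = results.get("json")        # assumed key name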
Extract
Run structured extraction against a URL, HTML, or markdown using AI.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()
res = sgai.extract(
    "Extract product names and prices",
    url="https://example.com",
    schema={
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "string"},
                    },
                },
            },
        },
    },
)
if res.status == "success":
    print(res.data.json_data)
    print(f"Tokens: {res.data.usage.prompt_tokens} / {res.data.usage.completion_tokens}")
Using a Pydantic model as the schema
schema= is a JSON Schema dict. Any Pydantic BaseModel produces one via model_json_schema(), so you can define the desired shape once and reuse it to validate the response client-side.
from pydantic import BaseModel, Field

from scrapegraph_py import ScrapeGraphAI

class Product(BaseModel):
    name: str
    price: str | None = None

class Products(BaseModel):
    products: list[Product] = Field(default_factory=list)

sgai = ScrapeGraphAI()
res = sgai.extract(
    "Extract product names and prices",
    url="https://example.com",
    schema=Products.model_json_schema(),
)
if res.status == "success":
    parsed = Products.model_validate(res.data.json_data)
    for p in parsed.products:
        print(p.name, p.price)
The same pattern works for JsonFormatConfig(schema=...) in scrape() and for search(schema=...).
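For instance, reusing the Products model from above inside scrape() could look like this sketch:

from scrapegraph_py import ScrapeGraphAI, JsonFormatConfig

sgai = ScrapeGraphAI()
res = sgai.scrape(
    "https://example.com",
    formats=[
        JsonFormatConfig(
            prompt="Extract product names and prices",
            schema=Products.model_json_schema(),  # Products as defined above
        ),
    ],
)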
extract() parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | str | Yes | 1–10,000 chars (positional) |
| url | str | Yes* | Page URL |
| html | str | Yes* | Raw HTML (alternative to url) |
| markdown | str | Yes* | Raw markdown (alternative to url) |
| schema | dict | No | JSON Schema for the structured output. Pass a Pydantic model's model_json_schema() to reuse a BaseModel. |
| mode | str | No | "normal" (default), "reader", "prune" |
| content_type | str | No | Override detected content type |
| fetch_config | FetchConfig | No | Fetch configuration |
*At least one of url, html, or markdown is required.
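Extraction also works on content you already have in hand; a minimal sketch using the html= alternative:

from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()
res = sgai.extract(
    "Extract the page title",
    html="<html><head><title>Acme</title></head><body>...</body></html>",
    schema={"type": "object", "properties": {"title": {"type": "string"}}},
)
if res.status == "success":
    print(res.data.json_data)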
Search
Run a web search and optionally extract structured data from the results.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()
res = sgai.search(
    "best programming languages 2024",
    num_results=5,
    prompt="Summarize the top languages and reasons",
    time_range="past_week",
    location_geo_code="us",
)
if res.status == "success":
    for hit in res.data.results:
        print(hit.title, hit.url)
    print(res.data.json_data)  # when prompt/schema are set
search() parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| query | str | Yes | 1–500 chars (positional) |
| num_results | int | No | 1–20, default 3 |
| format | str | No | "markdown" (default) or "html" |
| mode | str | No | HTML processing: "prune" (default), "normal", "reader" |
| prompt | str | No | Required when schema is set |
| schema | dict | No | JSON Schema for structured output. Pass a Pydantic model's model_json_schema() to reuse a BaseModel. |
| location_geo_code | str | No | Two-letter country code (e.g. "us", "it") |
| time_range | str | No | "past_hour", "past_24_hours", "past_week", "past_month", "past_year" |
| fetch_config | FetchConfig | No | Fetch configuration |
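Setting prompt together with schema turns the hits into structured output; a sketch reusing the Pydantic pattern shown for extract():

from pydantic import BaseModel, Field

from scrapegraph_py import ScrapeGraphAI

class Language(BaseModel):
    name: str
    reason: str | None = None

class Languages(BaseModel):
    languages: list[Language] = Field(default_factory=list)

sgai = ScrapeGraphAI()
res = sgai.search(
    "best programming languages 2024",
    prompt="List the top languages and why they rank",  # required when schema is set
    schema=Languages.model_json_schema(),
)
if res.status == "success":
    parsed = Languages.model_validate(res.data.json_data)
    for lang in parsed.languages:
        print(lang.name, lang.reason)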
Crawl
Crawl a site and its linked pages asynchronously. Access via the sgai.crawl resource.
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig

sgai = ScrapeGraphAI()

# Start
start = sgai.crawl.start(
    "https://example.com",
    formats=[MarkdownFormatConfig()],
    max_depth=2,
    max_pages=50,
    max_links_per_page=10,
    include_patterns=["/blog/*"],
    exclude_patterns=["/admin/*"],
)
crawl_id = start.data.id

# Poll
status = sgai.crawl.get(crawl_id)
print(f"{status.data.finished}/{status.data.total} - {status.data.status}")

# Control
sgai.crawl.stop(crawl_id)
sgai.crawl.resume(crawl_id)
sgai.crawl.delete(crawl_id)
crawl.start() parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | str | Yes | Starting URL (positional) |
| formats | list[FormatConfig] | No | Defaults to [MarkdownFormatConfig()] |
| max_depth | int | No | ≥ 0, default 2 |
| max_pages | int | No | 1–1000, default 50 |
| max_links_per_page | int | No | ≥ 1, default 10 |
| allow_external | bool | No | Default False |
| include_patterns | list[str] | No | URL glob patterns to include |
| exclude_patterns | list[str] | No | URL glob patterns to exclude |
| content_types | list[str] | No | Allowed response content types |
| fetch_config | FetchConfig | No | Fetch configuration |
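Because crawls run in the background, a polling loop is the usual pattern. Continuing from the example above; the terminal status names checked here are assumptions, so verify them against what crawl.get() actually returns:

import time

while True:
    status = sgai.crawl.get(crawl_id)
    if status.status == "error":
        print(status.error)
        break
    print(f"{status.data.finished}/{status.data.total} - {status.data.status}")
    # Assumed terminal states; check the live API's actual values.
    if status.data.status in ("completed", "failed", "stopped"):
        break
    time.sleep(5)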
Monitor
Scheduled extraction jobs. Access via the sgai.monitor resource.
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig

sgai = ScrapeGraphAI()
mon = sgai.monitor.create(
    "https://example.com",
    "0 * * * *",  # cron expression (positional)
    name="Price Monitor",
    formats=[MarkdownFormatConfig()],
    webhook_url="https://example.com/webhook",
)
cron_id = mon.data.cron_id

sgai.monitor.list()
sgai.monitor.get(cron_id)
sgai.monitor.update(cron_id, interval="0 */6 * * *")
sgai.monitor.pause(cron_id)
sgai.monitor.resume(cron_id)
sgai.monitor.delete(cron_id)
monitor.activity(): poll tick history
Paginate through the per-run ticks a monitor has produced (what changed on each scheduled run).
act = sgai.monitor.activity(cron_id, limit=20)
if act.status == "success":
    for tick in act.data.ticks:
        status = "CHANGED" if tick.changed else "no change"
        print(f"[{tick.created_at}] {tick.status} - {status} ({tick.elapsed_ms} ms)")
    if act.data.next_cursor:
        more = sgai.monitor.activity(cron_id, limit=20, cursor=act.data.next_cursor)
monitor.activity() accepts limit (1–100, default 20) and optional cursor for pagination. Each MonitorTickEntry exposes id, created_at, status, changed, elapsed_ms, and a diffs model with per-format deltas.
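To walk the full tick history, follow next_cursor until it is exhausted; a sketch built only on the fields listed above:

ticks = []
act = sgai.monitor.activity(cron_id, limit=100)
while act.status == "success":
    ticks.extend(act.data.ticks)
    if not act.data.next_cursor:
        break
    act = sgai.monitor.activity(cron_id, limit=100, cursor=act.data.next_cursor)
print(f"{len(ticks)} ticks total")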
monitor.create() parameters
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | str | Yes | URL to monitor (positional) |
| interval | str | Yes | Cron expression, 1–100 chars (positional) |
| name | str | No | ≤ 200 chars |
| formats | list[FormatConfig] | No | Defaults to [MarkdownFormatConfig()] |
| webhook_url | str | No | Webhook invoked on change detection |
| fetch_config | FetchConfig | No | Fetch configuration |
History
Fetch recent request history. Access via the sgai.history resource.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()
page = sgai.history.list(service="scrape", page=1, limit=20)
for entry in page.data.data:
    print(entry.id, entry.service, entry.status, entry.elapsed_ms)

one = sgai.history.get("request-id")
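To page through the whole history, increment page until a call comes back empty; a sketch that assumes an empty data list marks the end:

page_num = 1
while True:
    page = sgai.history.list(service="scrape", page=page_num, limit=50)
    if page.status != "success" or not page.data.data:
        break  # assumption: an empty page means we are past the end
    for entry in page.data.data:
        print(entry.id, entry.status, entry.elapsed_ms)
    page_num += 1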
Credits / Health
credits = sgai.credits()
# ApiResult[CreditsResponse] with .remaining, .used, .plan, .jobs.crawl, .jobs.monitor
health = sgai.health()
# ApiResult[HealthResponse] with .status, .uptime, .services
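A typical use is guarding an expensive job behind a credit check; a sketch with an arbitrary threshold:

credits = sgai.credits()
if credits.status == "success" and credits.data.remaining < 100:  # arbitrary threshold
    print("Low on credits; skipping the crawl")
else:
    sgai.crawl.start("https://example.com", max_pages=50)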
Configuration Objects
FetchConfig
Controls how pages are fetched. See the proxy configuration guide for details on modes and geotargeting.
from scrapegraph_py import FetchConfig

config = FetchConfig(
    mode="js",                       # "auto" (default), "fast", "js"
    stealth=True,                    # Residential proxies / anti-bot headers (+5 credits)
    timeout=30000,                   # 1,000–60,000 ms
    wait=2000,                       # 0–30,000 ms
    scrolls=3,                       # 0–100
    country="us",                    # ISO 3166-1 alpha-2
    headers={"X-Custom": "header"},
    cookies={"session": "abc"},
    mock=False,                      # Or a MockConfig object for testing
)
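Every service accepts fetch_config, so one FetchConfig can be shared across calls, e.g. to geotarget a whole session:

from scrapegraph_py import FetchConfig, ScrapeGraphAI

geo = FetchConfig(mode="js", stealth=True, country="it")

sgai = ScrapeGraphAI()
sgai.scrape("https://example.com", fetch_config=geo)
sgai.search("ristoranti milano", fetch_config=geo)
sgai.crawl.start("https://example.com", fetch_config=geo)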
Async Support
Every sync method has an async equivalent on AsyncScrapeGraphAI:
import asyncio

from scrapegraph_py import AsyncScrapeGraphAI

async def main():
    async with AsyncScrapeGraphAI() as sgai:
        res = await sgai.scrape("https://example.com")
        if res.status == "success":
            print(res.data.results["markdown"]["data"])

        start = await sgai.crawl.start("https://example.com", max_pages=25)
        status = await sgai.crawl.get(start.data.id)
        print(status.data.status)

        credits = await sgai.credits()
        print(credits.data.remaining)

asyncio.run(main())
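The async client pays off when fanning out over many URLs; a sketch that scrapes several pages concurrently with asyncio.gather:

import asyncio

from scrapegraph_py import AsyncScrapeGraphAI

async def scrape_all(urls: list[str]) -> None:
    async with AsyncScrapeGraphAI() as sgai:
        # One coroutine per URL, awaited concurrently.
        results = await asyncio.gather(*(sgai.scrape(u) for u in urls))
    for url, res in zip(urls, results):
        print(url, res.status)

asyncio.run(scrape_all(["https://example.com", "https://example.org"]))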
Support
GitHub: report issues and contribute to the SDK.
Email support: get help from our development team.