Overview

Extract uses an LLM to pull structured data from a URL, HTML, or markdown. Provide a prompt (and optionally a JSON schema) and it returns typed JSON — no selectors or post-processing required.
Try Extract instantly in our interactive playground.

Pricing

Each Extract call costs 5 credits. Enabling stealth in fetchConfig adds 5 credits; render mode (auto / fast / js) does not affect the cost. See the pricing page for the full breakdown.
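
The credit arithmetic above can be sketched as a small estimator. The function name `extract_credits` and its parameters are hypothetical and not part of the SDK; only the two constants come from the figures quoted in this section.

```python
BASE_EXTRACT_CREDITS = 5  # cost of one Extract call, per the pricing note above
STEALTH_SURCHARGE = 5     # extra credits when stealth is enabled in fetchConfig


def extract_credits(calls: int = 1, stealth: bool = False) -> int:
    """Estimate credits for `calls` Extract requests (hypothetical helper).

    Render mode (auto / fast / js) is deliberately not a parameter,
    because it does not affect the cost.
    """
    if calls < 0:
        raise ValueError("calls must be non-negative")
    per_call = BASE_EXTRACT_CREDITS + (STEALTH_SURCHARGE if stealth else 0)
    return calls * per_call


print(extract_credits())                         # 5
print(extract_credits(stealth=True))             # 10
print(extract_credits(calls=20, stealth=True))   # 200
```

Treat the result as an estimate only; the pricing page remains the authoritative source.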

Getting Started

Quick Start

from scrapegraph_py import ScrapeGraphAI

# reads SGAI_API_KEY from env, or pass explicitly: ScrapeGraphAI(api_key="...")
sgai = ScrapeGraphAI()

res = sgai.extract(
    "What does the company do? Extract name and description.",
    url="https://scrapegraphai.com",
)

if res.status == "success":
    print(res.data.json_data)
else:
    print("Failed:", res.error)

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Conditional | URL of the page to extract from. One of url, html, or markdown is required. |
| html | string | Conditional | Raw HTML to extract from. |
| markdown | string | Conditional | Markdown content to extract from. |
| prompt | string | Yes | Natural-language description of what to extract. |
| schema | object | No | JSON schema describing the desired output shape. In Python you can pass a Pydantic model via MyModel.model_json_schema(). |
| mode | string | No | HTML processing mode: "normal", "reader", "prune". |
| fetchConfig / fetch_config | object | No | Fetch options (see Scrape · FetchConfig). |
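
The input rules in the table can be checked locally before spending credits. The sketch below is a hypothetical pre-flight helper, not part of the SDK. It assumes that "one of url, html, or markdown" means exactly one (the API may behave differently when several are supplied) and that the Python client accepts `mode` as a keyword argument under that name.

```python
from typing import Optional

VALID_MODES = {"normal", "reader", "prune"}  # values listed in the table above


def build_extract_kwargs(
    prompt: str,
    url: Optional[str] = None,
    html: Optional[str] = None,
    markdown: Optional[str] = None,
    mode: Optional[str] = None,
) -> dict:
    """Validate inputs and return keyword arguments for an extract call."""
    if not prompt.strip():
        raise ValueError("prompt is required")

    # Keep only the content sources that were actually supplied.
    sources = {"url": url, "html": html, "markdown": markdown}
    given = {name: value for name, value in sources.items() if value is not None}
    if len(given) != 1:
        # Assumption: exactly one source per call.
        raise ValueError("provide exactly one of url, html, or markdown")

    if mode is not None and mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")

    kwargs = dict(given)
    if mode is not None:
        kwargs["mode"] = mode
    return kwargs


print(build_extract_kwargs("Extract the title", url="https://example.com"))
# {'url': 'https://example.com'}
```

The returned dict could then be unpacked into the call, for example `sgai.extract(prompt, **kwargs)`.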
Get your API key from the dashboard.

Example response (for the Quick Start request above):
{
  "id": "9a2178b6-2525-4f98-85e6-9f8c7da17541",
  "raw": null,
  "json": {
    "name": "ScrapeGraphAI",
    "description": "ScrapeGraphAI is an AI-powered web scraping platform that uses natural language prompts to turn any webpage into structured data via a simple API."
  },
  "usage": {
    "promptTokens": 10002,
    "completionTokens": 509
  },
  "metadata": {
    "chunker": { "chunks": [{ "size": 5000 }, { "size": 2535 }] },
    "fetch": {}
  }
}
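
The `usage` and `metadata.chunker` fields in a response of this shape can be summarized with the standard library alone. This is a minimal sketch that works on the raw JSON as printed above; the helper name `summarize_usage` is hypothetical, and the SDK response object may expose these fields under different attribute names.

```python
import json

# Trimmed copy of the example response above (only the fields used here).
raw = """
{
  "usage": {"promptTokens": 10002, "completionTokens": 509},
  "metadata": {
    "chunker": {"chunks": [{"size": 5000}, {"size": 2535}]},
    "fetch": {}
  }
}
"""


def summarize_usage(response: dict) -> dict:
    """Collect token counts and chunk statistics from a raw response dict."""
    usage = response.get("usage", {})
    prompt_tokens = usage.get("promptTokens", 0)
    completion_tokens = usage.get("completionTokens", 0)
    chunks = response.get("metadata", {}).get("chunker", {}).get("chunks", [])
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "chunk_count": len(chunks),
        # The unit of "size" is not stated in the response, so it is only summed.
        "chunked_size": sum(chunk.get("size", 0) for chunk in chunks),
    }


print(summarize_usage(json.loads(raw)))
# {'prompt_tokens': 10002, 'completion_tokens': 509, 'total_tokens': 10511,
#  'chunk_count': 2, 'chunked_size': 7535}
```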

With a JSON Schema

Pass a JSON schema to pin down the exact output shape.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

res = sgai.extract(
    "Extract structured information about this page",
    url="https://example.com",
    schema={
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "description": {"type": "string"},
            "links": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title"],
    },
)

if res.status == "success":
    print(res.data.json_data)
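
Because the output comes from an LLM, it can be worth confirming that the keys listed under `required` actually came back. The sketch below is a deliberately shallow check using only the standard library; `missing_required` is a hypothetical helper and the sample results are made up for illustration. For full validation, a dedicated validator such as the `jsonschema` package is the more complete option.

```python
# Same schema as in the example above.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"},
        "links": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title"],
}


def missing_required(schema: dict, data: dict) -> list[str]:
    """Return the top-level required keys that are absent from `data`."""
    return [key for key in schema.get("required", []) if key not in data]


# Illustrative results, not real API output.
complete = {"title": "Example Domain", "links": []}
incomplete = {"description": "No title was extracted"}

print(missing_required(schema, complete))    # []
print(missing_required(schema, incomplete))  # ['title']
```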

With a Pydantic Schema (Python)

If you already model your data with Pydantic, use the same BaseModel to drive the extraction. model_json_schema() produces the JSON Schema dict the API expects, and model_validate() parses the response back into typed objects.
from pydantic import BaseModel, Field
from scrapegraph_py import ScrapeGraphAI

class Product(BaseModel):
    name: str = Field(description="Product name")
    price: str | None = Field(default=None, description="Listed price, if any")

class Products(BaseModel):
    products: list[Product] = Field(default_factory=list)

sgai = ScrapeGraphAI()

res = sgai.extract(
    "Extract product names and prices",
    url="https://example.com",
    schema=Products.model_json_schema(),
)

if res.status == "success":
    parsed = Products.model_validate(res.data.json_data)
    for p in parsed.products:
        print(p.name, p.price)
The wire format is JSON Schema either way — model_json_schema() is just the standard Pydantic v2 helper that produces it. Field descriptions are forwarded to the LLM and improve extraction quality on ambiguous fields.

Extract from HTML or Markdown

Skip the fetch and extract from content you already have.
from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()

res = sgai.extract(
    "Extract product name and price",
    html="<html><body><h1>Widget</h1><p>$9.99</p></body></html>",
)

if res.status == "success":
    print(res.data.json_data)

FetchConfig

Control how the page is fetched before extraction (JS rendering, stealth, headers, etc.). See the full options in Scrape · FetchConfig.
from scrapegraph_py import ScrapeGraphAI, FetchConfig

sgai = ScrapeGraphAI()

res = sgai.extract(
    "Extract the main content",
    url="https://example.com",
    fetch_config=FetchConfig(mode="js", stealth=True, wait=2000),
)

Async Support (Python)

import asyncio
from scrapegraph_py import AsyncScrapeGraphAI

async def main():
    async with AsyncScrapeGraphAI() as sgai:
        res = await sgai.extract(
            "Summarize what this product does",
            url="https://scrapegraphai.com",
        )
        if res.status == "success":
            print(res.data.json_data)

asyncio.run(main())
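
The async client is most useful when several pages are extracted at once. The following is a generic sketch of capping concurrency with `asyncio.Semaphore`; `run_limited` and `fake_extract` are hypothetical names, the limit of 2 is arbitrary, and `fake_extract` stands in for a real `sgai.extract(...)` call so that the example runs without network access or an API key.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")


async def run_limited(
    jobs: Iterable[Callable[[], Awaitable[T]]],
    limit: int = 3,
) -> list[T]:
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:
            return await job()

    # gather() returns results in the same order as the jobs were given.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_extract(url: str) -> str:
    """Stand-in for an async extract call (assumption, not the SDK)."""
    await asyncio.sleep(0.01)
    return f"extracted:{url}"


async def demo() -> list[str]:
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    # Bind each URL with a default argument so every lambda keeps its own value.
    jobs = [lambda u=u: fake_extract(u) for u in urls]
    return await run_limited(jobs, limit=2)


print(asyncio.run(demo()))
```

With the real client, each job would presumably wrap a call such as `sgai.extract(prompt, url=u)` inside the `async with AsyncScrapeGraphAI() as sgai:` block shown above.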

Key Features

Universal Compatibility

Works with any URL, raw HTML, or markdown input.

AI Understanding

Contextual extraction — no XPath or brittle selectors.

Structured Output

JSON schema support for typed, predictable results.

Token Accounting

Response includes prompt/completion token usage.
