Overview

Extract is our flagship LLM-powered web scraping service that pulls structured data from any website. It understands page context and content the way a human reader would, making web data extraction more reliable and efficient.
Try Extract instantly in our interactive playground

Getting Started

Quick Start

from scrapegraph_py import Client

client = Client(api_key="your-api-key")

response = client.extract(
    url="https://scrapegraphai.com/",
    prompt="Extract info about the company"
)

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The URL of the webpage to scrape. |
| prompt | string | Yes | A textual description of what you want to extract. |
| output_schema | object | No | Pydantic or Zod schema for the structured response format. |
| fetch_config | FetchConfig | No | Configuration for page fetching (headers, cookies, stealth, etc.). |
Get your API key from the dashboard.

Example response:
{
  "id": "sg-req-abc123",
  "status": "completed",
  "result": {
    "company_name": "ScrapeGraphAI",
    "description": "ScrapeGraphAI is a powerful AI scraping API designed for efficient web data extraction...",
    "features": [
      "Effortless, cost-effective, and AI-powered data extraction",
      "Handles proxy rotation and rate limits",
      "Supports a wide variety of websites"
    ]
  }
}
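A minimal sketch of handling a response shaped like the payload above. The field names ("status", "result") follow the example JSON; checking the status before reading the result is a defensive assumption, not documented SDK behaviour.

```python
def unpack_result(response: dict) -> dict:
    """Return the extracted data, or raise if the request did not complete."""
    if response.get("status") != "completed":
        raise RuntimeError(f"extraction not completed: {response.get('status')}")
    return response["result"]

sample = {
    "id": "sg-req-abc123",
    "status": "completed",
    "result": {"company_name": "ScrapeGraphAI"},
}

data = unpack_result(sample)
```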

FetchConfig

Use FetchConfig to control how the page is fetched:
from scrapegraph_py import Client, FetchConfig

client = Client(api_key="your-api-key")

response = client.extract(
    url="https://example.com",
    prompt="Extract the main content",
    fetch_config=FetchConfig(
        mode="js",
        stealth=True,
        headers={"User-Agent": "MyBot"},
        cookies={"session": "abc123"},
        scrolls=3,
        wait=2000,
    ),
)
In the JavaScript SDK, pass fetchConfig as a property of the params object: extract(apiKey, { url, prompt, fetchConfig: { ... } }).
| Parameter | Type | Description |
| --- | --- | --- |
| mode | string | Fetch mode: "auto" (default), "fast", or "js". |
| stealth | bool | Enable stealth mode with residential proxies and anti-bot headers. |
| headers | dict | Custom HTTP headers to send. |
| cookies | dict | Cookies to include in the request. |
| scrolls | int | Number of page scrolls (0-100). |
| wait | int | Milliseconds to wait after page load (0-30000). |
| timeout | int | Request timeout in milliseconds (1000-60000). |
| country | string | Two-letter ISO country code for geo-targeted proxy routing. |
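The numeric parameters above have hard limits. The helper below is a hypothetical client-side check, not part of the SDK; it just makes the documented ranges from the table concrete.

```python
def validate_fetch_ranges(scrolls: int = 0, wait: int = 0, timeout: int = 30000) -> bool:
    """Check scrolls/wait/timeout against the documented limits."""
    if not 0 <= scrolls <= 100:
        raise ValueError("scrolls must be in 0-100")
    if not 0 <= wait <= 30000:
        raise ValueError("wait must be in 0-30000 ms")
    if not 1000 <= timeout <= 60000:
        raise ValueError("timeout must be in 1000-60000 ms")
    return True
```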

Custom Schema Example

Define exactly what data you want to extract:
from pydantic import BaseModel, Field
from scrapegraph_py import Client

class ArticleData(BaseModel):
    title: str = Field(description="Article title")
    author: str = Field(description="Author name")
    content: str = Field(description="Main article content")
    publish_date: str = Field(description="Publication date")

client = Client(api_key="your-api-key")

response = client.extract(
    url="https://example.com/article",
    prompt="Extract the article information",
    output_schema=ArticleData
)
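When a schema is supplied, the result can be re-validated locally into the same typed model. The sketch below assumes the raw result payload mirrors the schema's fields; `model_validate` is standard Pydantic v2, but the payload shape is an assumption based on the earlier JSON example.

```python
from pydantic import BaseModel, Field

class ArticleData(BaseModel):
    title: str = Field(description="Article title")
    author: str = Field(description="Author name")

# Hypothetical raw result, as it might appear in response["result"].
raw = {"title": "Hello", "author": "Jane Doe"}
article = ArticleData.model_validate(raw)
```

Validating locally gives you typed attribute access (`article.title`) and an early, descriptive error if the extraction came back missing a required field.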

Async Support

For applications requiring asynchronous execution:
import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient(api_key="your-api-key") as client:
        urls = [
            "https://scrapegraphai.com/",
            "https://github.com/ScrapeGraphAI/Scrapegraph-ai",
        ]

        tasks = [
            client.extract(
                url=url,
                prompt="Summarize the main content",
            )
            for url in urls
        ]

        responses = await asyncio.gather(*tasks, return_exceptions=True)

        for i, response in enumerate(responses):
            if isinstance(response, Exception):
                print(f"Error for {urls[i]}: {response}")
            else:
                print(f"Result for {urls[i]}: {response}")

if __name__ == "__main__":
    asyncio.run(main())
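With many URLs, an unbounded `asyncio.gather` can run into rate limits. A common pattern is to cap concurrency with `asyncio.Semaphore`; this is a general asyncio sketch, with `fake_extract` standing in for `client.extract` so it runs without network access.

```python
import asyncio

async def bounded_gather(coros, limit: int = 5):
    """Run awaitables concurrently, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros), return_exceptions=True)

async def demo():
    async def fake_extract(url: str) -> str:  # stand-in for client.extract
        await asyncio.sleep(0)
        return f"result for {url}"

    urls = ["https://a.example", "https://b.example"]
    return await bounded_gather([fake_extract(u) for u in urls], limit=2)

results = asyncio.run(demo())
```

`gather` preserves input order, so `results[i]` still corresponds to `urls[i]` even though completion order may vary.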

Key Features

Universal Compatibility

Works with any website structure, including JavaScript-rendered content

AI Understanding

Contextual understanding of content for accurate extraction

Structured Output

Returns clean, structured data in your preferred format

Schema Support

Define custom output schemas using Pydantic or Zod

Integration Options

Official SDKs

  • Python SDK - Perfect for data science and backend applications
  • JavaScript SDK - Ideal for web applications and Node.js

AI Framework Integrations

Support & Resources

Documentation

Comprehensive guides and tutorials

API Reference

Detailed API documentation

Community

Join our Discord community

GitHub

Check out our open-source projects