Overview

Extract is our flagship LLM-powered web scraping service that pulls structured data from any website. It understands page context and content the way a human reader would, making web data extraction more reliable and efficient.
Try Extract instantly in our interactive playground

Getting Started

Quick Start

from scrapegraph_py import Client

client = Client(api_key="your-api-key")

response = client.extract(
    url="https://scrapegraphai.com/",
    prompt="Extract info about the company"
)

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The URL of the webpage to scrape. |
| prompt | string | Yes | A textual description of what you want to extract. |
| output_schema | object | No | Pydantic or Zod schema for the structured response format. |
| fetch_config | FetchConfig | No | Configuration for page fetching (headers, cookies, stealth, etc.). |
Get your API key from the dashboard.

Example response:
{
  "id": "sg-req-abc123",
  "status": "completed",
  "result": {
    "company_name": "ScrapeGraphAI",
    "description": "ScrapeGraphAI is a powerful AI scraping API designed for efficient web data extraction...",
    "features": [
      "Effortless, cost-effective, and AI-powered data extraction",
      "Handles proxy rotation and rate limits",
      "Supports a wide variety of websites"
    ]
  }
}
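A minimal sketch of handling a response shaped like the payload above. The field names ("status", "result") follow the example JSON; checking the status before reading the result is a defensive assumption, not documented SDK behaviour.

```python
def unpack_result(response: dict) -> dict:
    """Return the extracted data, or raise if the request did not complete."""
    if response.get("status") != "completed":
        raise RuntimeError(f"extraction not completed: {response.get('status')}")
    return response["result"]

sample = {
    "id": "sg-req-abc123",
    "status": "completed",
    "result": {"company_name": "ScrapeGraphAI"},
}

data = unpack_result(sample)
```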

FetchConfig

Use FetchConfig to control how the page is fetched:
from scrapegraph_py import Client, FetchConfig

client = Client(api_key="your-api-key")

response = client.extract(
    url="https://example.com",
    prompt="Extract the main content",
    fetch_config=FetchConfig(
        mode="js",
        stealth=True,
        headers={"User-Agent": "MyBot"},
        cookies={"session": "abc123"},
        scrolls=3,
        wait=2000,
    ),
)
In the JavaScript SDK, pass fetchConfig as a property of the params object: extract(apiKey, { url, prompt, fetchConfig: { ... } }).
| Parameter | Type | Description |
| --- | --- | --- |
| mode | string | Fetch mode: "auto" (default), "fast", or "js". |
| stealth | bool | Enable stealth mode with residential proxies and anti-bot headers. |
| headers | dict | Custom HTTP headers to send. |
| cookies | dict | Cookies to include in the request. |
| scrolls | int | Number of page scrolls (0-100). |
| wait | int | Milliseconds to wait after page load (0-30000). |
| timeout | int | Request timeout in milliseconds (1000-60000). |
| country | string | Two-letter ISO country code for geo-targeted proxy routing. |
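The numeric parameters above have hard limits. The helper below is a hypothetical client-side check, not part of the SDK; it just makes the documented ranges from the table concrete.

```python
def validate_fetch_ranges(scrolls: int = 0, wait: int = 0, timeout: int = 30000) -> bool:
    """Check scrolls/wait/timeout against the documented limits."""
    if not 0 <= scrolls <= 100:
        raise ValueError("scrolls must be in 0-100")
    if not 0 <= wait <= 30000:
        raise ValueError("wait must be in 0-30000 ms")
    if not 1000 <= timeout <= 60000:
        raise ValueError("timeout must be in 1000-60000 ms")
    return True
```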

Custom Schema Example

Define exactly what data you want to extract:
from pydantic import BaseModel, Field
from scrapegraph_py import Client

class ArticleData(BaseModel):
    title: str = Field(description="Article title")
    author: str = Field(description="Author name")
    content: str = Field(description="Main article content")
    publish_date: str = Field(description="Publication date")

client = Client(api_key="your-api-key")

response = client.extract(
    url="https://example.com/article",
    prompt="Extract the article information",
    output_schema=ArticleData
)
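When a schema is supplied, the result can be re-validated locally into the same typed model. The sketch below assumes the raw result payload mirrors the schema's fields; `model_validate` is standard Pydantic v2, but the payload shape is an assumption based on the earlier JSON example.

```python
from pydantic import BaseModel, Field

class ArticleData(BaseModel):
    title: str = Field(description="Article title")
    author: str = Field(description="Author name")

# Hypothetical raw result, as it might appear in response["result"].
raw = {"title": "Hello", "author": "Jane Doe"}
article = ArticleData.model_validate(raw)
```

Validating locally gives you typed attribute access (`article.title`) and an early, descriptive error if the extraction came back missing a required field.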

Async Support

For applications requiring asynchronous execution:
import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient(api_key="your-api-key") as client:
        urls = [
            "https://scrapegraphai.com/",
            "https://github.com/ScrapeGraphAI/Scrapegraph-ai",
        ]

        tasks = [
            client.extract(
                url=url,
                prompt="Summarize the main content",
            )
            for url in urls
        ]

        responses = await asyncio.gather(*tasks, return_exceptions=True)

        for i, response in enumerate(responses):
            if isinstance(response, Exception):
                print(f"Error for {urls[i]}: {response}")
            else:
                print(f"Result for {urls[i]}: {response}")

if __name__ == "__main__":
    asyncio.run(main())
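With many URLs, an unbounded `asyncio.gather` can run into rate limits. A common pattern is to cap concurrency with `asyncio.Semaphore`; this is a general asyncio sketch, with `fake_extract` standing in for `client.extract` so it runs without network access.

```python
import asyncio

async def bounded_gather(coros, limit: int = 5):
    """Run awaitables concurrently, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros), return_exceptions=True)

async def demo():
    async def fake_extract(url: str) -> str:  # stand-in for client.extract
        await asyncio.sleep(0)
        return f"result for {url}"

    urls = ["https://a.example", "https://b.example"]
    return await bounded_gather([fake_extract(u) for u in urls], limit=2)

results = asyncio.run(demo())
```

`gather` preserves input order, so `results[i]` still corresponds to `urls[i]` even though completion order may vary.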

Key Features

Universal Compatibility

Works with any website structure, including JavaScript-rendered content

AI Understanding

Contextual understanding of content for accurate extraction

Structured Output

Returns clean, structured data in your preferred format

Schema Support

Define custom output schemas using Pydantic or Zod

Integration Options

Official SDKs

  • Python SDK - Perfect for data science and backend applications
  • JavaScript SDK - Ideal for web applications and Node.js

AI Framework Integrations

Support & Resources

Documentation

Comprehensive guides and tutorials

API Reference

Detailed API documentation

Community

Join our Discord community

GitHub

Check out our open-source projects