
Overview
SmartScraper is our flagship LLM-powered web scraping service that intelligently extracts structured data from any website. Using advanced language models, it understands context and content the way a human would, making web data extraction more reliable and efficient than ever. Try SmartScraper instantly in our interactive playground - no coding required!
Getting Started
Quick Start
Parameters
Parameter | Type | Required | Description |
---|---|---|---|
apiKey | string | Yes | The ScrapeGraph API Key. |
websiteUrl | string | Yes | The URL of the webpage that needs to be scraped. |
prompt | string | Yes | A textual description of what you want to achieve. |
schema | object | No | The Pydantic or Zod object that describes the structure and format of the response. |
render_heavy_js | boolean | No | Enable enhanced JavaScript rendering for heavy JS websites (React, Vue, Angular, etc.). Default: false |
mock | boolean | No | Return mock data for testing instead of performing a live scrape. |
plain_text | boolean | No | Return the result as plain text instead of JSON. |
Get your API key from the dashboard
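A minimal quick-start sketch is below. It assumes the scrapegraph-py SDK (`pip install scrapegraph-py`); the method and argument names follow the parameter table above and may differ slightly in your SDK version.

```python
# Hedged quick-start sketch: assumes the scrapegraph-py SDK; the live
# call only runs once SGAI_API_KEY is set in the environment.
import os

REQUIRED = ("api_key", "website_url", "prompt")

def missing_params(**params) -> list:
    """Return the required parameters (per the table above) that are unset."""
    return [name for name in REQUIRED if not params.get(name)]

params = {
    "api_key": os.environ.get("SGAI_API_KEY", ""),
    "website_url": "https://example.com",
    "prompt": "Extract the page title and main heading",
}

missing = missing_params(**params)
if missing:
    print(f"Set these before running a live request: {missing}")
else:
    from scrapegraph_py import Client
    client = Client(api_key=params["api_key"])
    response = client.smartscraper(
        website_url=params["website_url"],
        user_prompt=params["prompt"],
    )
    print(response)
```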
Enhanced JavaScript Rendering
For websites that rely heavily on JavaScript frameworks like React, Vue, or Angular, or other Single Page Applications (SPAs), enable the render_heavy_js parameter to ensure complete content rendering before extraction.
When to Use render_heavy_js
Enable this parameter for:
- React/Vue/Angular Applications: Modern web apps built with JavaScript frameworks
- Single Page Applications (SPAs): Sites that load content dynamically via JavaScript
- Heavy JavaScript Sites: Websites with complex client-side rendering
- Dynamic Content: Pages where content appears after JavaScript execution
- Interactive Elements: Sites with JavaScript-dependent content loading
Enhanced JavaScript Rendering Example
Here’s how to use render_heavy_js for modern web applications:
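The sketch below shows a request body with enhanced rendering enabled. The endpoint URL in the trailing comment is an assumption; confirm it against the API reference.

```python
# Hedged sketch of a SmartScraper request body with enhanced
# JavaScript rendering enabled, per the parameter table above.
import json

payload = {
    "website_url": "https://spa-example.com",  # e.g. a React/Vue/Angular app
    "user_prompt": "Extract all product names and prices",
    "render_heavy_js": True,  # wait for client-side rendering before extraction
}
headers = {
    "SGAI-APIKEY": "your-api-key",
    "Content-Type": "application/json",
}

print(json.dumps(payload, indent=2))
# To send: requests.post("https://api.scrapegraphai.com/v1/smartscraper",
#                        headers=headers, json=payload)  # URL is assumed
```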
Performance Considerations
- Enhanced Rendering: render_heavy_js=true provides more thorough JavaScript execution
- Processing Time: May take slightly longer due to enhanced rendering capabilities
- Use When Needed: Only enable for sites that actually require it, to avoid unnecessary overhead
- Default Behavior: Standard rendering works for most websites
Pro Tip: If you’re getting incomplete or missing data from a modern web application, try enabling render_heavy_js=true to ensure all JavaScript-rendered content is captured.
Example Response
- request_id: Unique identifier for tracking your request
- status: Current status of the extraction (“completed”, “running”, “failed”)
- result: The extracted data in structured JSON format
- error: Error message (if any occurred during extraction)
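A short sketch of handling these fields follows; the field names match the list above, and the example values are illustrative only.

```python
# Hedged sketch of unwrapping a SmartScraper response.
def unwrap(response: dict) -> dict:
    """Return `result` when extraction completed; raise otherwise."""
    if response["status"] == "failed":
        raise RuntimeError(response["error"] or "extraction failed")
    if response["status"] != "completed":
        raise RuntimeError(f"request still {response['status']}")
    return response["result"]

example = {
    "request_id": "example-request-id",
    "status": "completed",
    "result": {"title": "Example Domain"},
    "error": "",
}
print(unwrap(example))  # → {'title': 'Example Domain'}
```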
Using Your Own HTML
Instead of providing a URL, you can optionally pass your own HTML content. This is useful when:
- You already have the HTML content cached
- You want to process modified HTML
- You’re working with dynamically generated content
- You need to process content offline
- You want to pre-process the HTML before extraction
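A minimal sketch of supplying your own HTML is below; the field names follow the request-body table later in this document.

```python
# Hedged sketch of passing raw HTML via website_html instead of a URL.
html = """
<html><body>
  <h1>ACME Widget</h1>
  <span class="price">$19.99</span>
</body></html>
"""

payload = {
    "website_html": html,          # raw HTML to process
    "user_prompt": "Extract the product name and price",
}
print("website_url" in payload, "website_html" in payload)  # → False True
```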
When both website_url and website_html are provided, website_html takes precedence and will be used for extraction.
Key Features
- Universal Compatibility: Works with any website structure, including JavaScript-rendered content
- AI Understanding: Contextual understanding of content for accurate extraction
- Structured Output: Returns clean, structured data in your preferred format
- Schema Support: Define custom output schemas using Pydantic or Zod
Use Cases
Content Aggregation
- News article extraction
- Blog post summarization
- Product information gathering
- Research data collection
Data Analysis
- Market research
- Competitor analysis
- Price monitoring
- Trend tracking
AI Training
- Dataset creation
- Training data collection
- Content classification
- Knowledge base building
Want to learn more about our AI-powered scraping technology? Visit our main website to discover how we’re revolutionizing web data extraction.
Other Functionality
Retrieve a previous request
If you know the request ID of a previous request you made, you can retrieve all of its information.
Parameters
Parameter | Type | Required | Description |
---|---|---|---|
apiKey | string | Yes | The ScrapeGraph API Key. |
requestId | string | Yes | The request ID associated with the output of a previous smartScraper request. |
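A sketch of polling a previous request is below. The GET path is an assumption modeled on the POST endpoint documented later; verify it (or the SDK's equivalent retrieval method) against the API reference.

```python
# Hedged sketch: the retrieval path below is assumed, not confirmed.
BASE = "https://api.scrapegraphai.com/v1"  # assumed base URL

def status_url(request_id: str) -> str:
    """Assumed endpoint for retrieving a previous smartscraper request."""
    return f"{BASE}/smartscraper/{request_id}"

print(status_url("example-request-id"))
# To fetch: requests.get(status_url(rid), headers={"SGAI-APIKEY": api_key})
```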
Custom Schema Example
Define exactly what data you want to extract.

Async Support
For applications requiring asynchronous execution, SmartScraper provides comprehensive async support through the AsyncClient:
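The sketch below combines a Pydantic output schema with the async client. It assumes scrapegraph-py's AsyncClient supports use as an async context manager and that pydantic is installed; parameter names follow the request-body table later in this document.

```python
# Hedged sketch: Pydantic schema + AsyncClient; the live call is left
# commented out so the example runs without an API key.
import asyncio
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: str

async def main() -> None:
    from scrapegraph_py import AsyncClient
    async with AsyncClient(api_key="your-api-key") as client:
        response = await client.smartscraper(
            website_url="https://example.com/product",
            user_prompt="Extract the product name and price",
            output_schema=Product,  # response is validated against this schema
        )
        print(response["result"])

# asyncio.run(main())  # uncomment once a real API key is set
```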
Infinite Scroll Support
SmartScraper can handle infinite scroll pages by automatically scrolling to load more content before extraction. This is perfect for social media feeds, e-commerce product listings, and other dynamic content.
Parameters for Infinite Scroll
Parameter | Type | Required | Description |
---|---|---|---|
number_of_scrolls | number | No | Number of times to scroll down to load more content (default: 0) |
Infinite scroll is particularly useful for:
- Social media feeds (Twitter, Instagram, LinkedIn)
- E-commerce product listings
- News websites with continuous scrolling
- Any page that loads content dynamically as you scroll
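A minimal sketch of a request body using number_of_scrolls (per the table above) for a continuously scrolling feed:

```python
# Hedged sketch: scroll the page before extracting its content.
payload = {
    "website_url": "https://example.com/feed",
    "user_prompt": "Extract every post title and author",
    "number_of_scrolls": 10,  # scroll ten times to load more content first
}
print(payload["number_of_scrolls"])  # → 10
```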
SmartScraper Endpoint
The SmartScraper endpoint is our core service for extracting structured data from any webpage using advanced language models. It automatically adapts to different website layouts and content types, enabling quick and reliable data extraction.
Key Capabilities
- Universal Compatibility: Works with any website structure, including JavaScript-rendered content
- Schema Validation: Supports both Pydantic (Python) and Zod (JavaScript) schemas
- Concurrent Processing: Efficient handling of multiple URLs through async support
- Custom Extraction: Flexible user prompts for targeted data extraction
Endpoint Details
Required Headers
Header | Description |
---|---|
SGAI-APIKEY | Your API authentication key |
Content-Type | application/json |
Request Body
Field | Type | Required | Description |
---|---|---|---|
website_url | string | Yes* | URL to scrape (*either this or website_html required) |
website_html | string | No | Raw HTML content to process |
user_prompt | string | Yes | Instructions for data extraction |
output_schema | object | No | Pydantic or Zod schema for response validation |
render_heavy_js | boolean | No | Enable enhanced JavaScript rendering for heavy JS websites (React, Vue, Angular, etc.). Default: false |
mock | boolean | No | Return mock data for testing instead of performing a live scrape. |
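The request can be assembled directly from the header and body tables above using only the standard library, as sketched below; the endpoint URL is an assumption to confirm against the API reference.

```python
# Hedged sketch of building the POST request documented above.
import json
import urllib.request

API_URL = "https://api.scrapegraphai.com/v1/smartscraper"  # assumed

def build_request(api_key: str, body: dict) -> urllib.request.Request:
    """Attach the required SGAI-APIKEY and Content-Type headers."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"SGAI-APIKEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_request("your-api-key", {
    "website_url": "https://example.com",
    "user_prompt": "Extract the page title",
})
print(req.get_method())  # → POST
# To send: urllib.request.urlopen(req)
```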
Response Format
Best Practices
- Schema Definition:
  - Define schemas to ensure consistent data structure
  - Use descriptive field names and types
  - Include field descriptions for better extraction accuracy
- Async Processing:
  - Use async clients for concurrent requests
  - Implement proper error handling
  - Monitor rate limits and implement backoff strategies
- Error Handling:
  - Always wrap requests in try-catch blocks
  - Check response status before processing
  - Implement retry logic for failed requests
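The retry practice above can be sketched as a small wrapper with exponential backoff; `do_request` here is a hypothetical stand-in for any SmartScraper call.

```python
# Hedged retry sketch: exponential backoff between failed attempts.
import time

def with_retries(do_request, max_attempts: int = 3, base_delay: float = 1.0):
    """Call do_request(), retrying failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return do_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo with a stand-in that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # → ok
```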
Integration Options
Official SDKs
- Python SDK - Perfect for data science and backend applications
- JavaScript SDK - Ideal for web applications and Node.js
AI Framework Integrations
- LangChain Integration - Use SmartScraper in your LLM workflows
- LlamaIndex Integration - Build powerful search and QA systems
Best Practices
Optimizing Extraction
- Be specific in your prompts
- Use schemas for structured data
- Handle pagination for multi-page content
- Implement error handling and retries
Rate Limiting
- Implement reasonable delays between requests
- Use async clients for better performance
- Monitor your API usage
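A client-side rate-limiting pattern can be sketched with asyncio: a semaphore caps concurrency and an optional delay spaces out requests. `scrape` below is a hypothetical stand-in for an AsyncClient call.

```python
# Hedged sketch of bounding concurrent requests with a semaphore.
import asyncio

async def scrape(url: str) -> str:
    await asyncio.sleep(0.01)  # stands in for the real API round trip
    return f"result for {url}"

async def scrape_all(urls, max_concurrent: int = 5, delay: float = 0.0):
    sem = asyncio.Semaphore(max_concurrent)

    async def limited(url):
        async with sem:
            result = await scrape(url)
            if delay:
                await asyncio.sleep(delay)  # pause before freeing the slot
            return result

    return await asyncio.gather(*(limited(u) for u in urls))

results = asyncio.run(scrape_all([f"https://example.com/p{i}" for i in range(3)]))
print(results)
```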
Example Projects
Check out our cookbook for real-world examples:
- E-commerce product scraping
- News aggregation
- Research data collection
- Content monitoring
API Reference
For detailed API documentation, see:

Support & Resources
- Documentation: Comprehensive guides and tutorials
- API Reference: Detailed API documentation
- Community: Join our Discord community
- GitHub: Check out our open-source projects
Ready to Start?
Sign up now and get your API key to begin extracting data with SmartScraper!