Installation

Install the package using pip:
pip install scrapegraph-py

Features

  • AI-Powered Extraction: Advanced web scraping using artificial intelligence
  • Flexible Clients: Both synchronous and asynchronous support
  • Type Safety: Structured output with Pydantic schemas
  • Production Ready: Detailed logging and automatic retries
  • Developer Friendly: Comprehensive error handling
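The automatic-retry behavior mentioned above can be approximated in plain Python. This is only an illustrative sketch of the retry-with-backoff pattern, not the SDK's internal implementation; the `with_retries` helper and its parameters are hypothetical.

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn, retrying with exponential backoff on failure (illustrative sketch)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))  # back off: 0.1s, 0.2s, ...

# Example: a flaky operation that succeeds on the third call
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(with_retries(flaky))  # "ok" after two retries
```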

Quick Start

Initialize the client with your API key:
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")
You can also set the SGAI_API_KEY environment variable and initialize the client without parameters:
client = Client()
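Conceptually, the environment-variable fallback works like the sketch below. The real Client reads SGAI_API_KEY internally; this standalone `resolve_api_key` helper is hypothetical and only demonstrates the lookup pattern.

```python
import os

def resolve_api_key(explicit_key=None):
    """Return an explicit key if given, else fall back to the SGAI_API_KEY env var."""
    key = explicit_key or os.environ.get("SGAI_API_KEY")
    if not key:
        raise ValueError("No API key: pass api_key=... or set SGAI_API_KEY")
    return key

os.environ["SGAI_API_KEY"] = "sk-test-123"
print(resolve_api_key())               # falls back to the environment variable
print(resolve_api_key("sk-explicit"))  # an explicit key always wins
```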

Services

SmartScraper

Extract specific information from any webpage using AI:
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main heading and description"
)

Parameters

Parameter        Type     Required  Description
website_url      string   Yes       The URL of the webpage that needs to be scraped.
user_prompt      string   Yes       A textual description of what you want to achieve.
output_schema    object   No        The Pydantic object that describes the structure and format of the response.
render_heavy_js  boolean  No        Enable enhanced JavaScript rendering for heavy JS websites (React, Vue, Angular, etc.). Default: False.
Define a simple schema for basic data extraction:
from pydantic import BaseModel, Field

class ArticleData(BaseModel):
    title: str = Field(description="The article title")
    author: str = Field(description="The author's name")
    publish_date: str = Field(description="Article publication date")
    content: str = Field(description="Main article content")
    category: str = Field(description="Article category")

response = client.smartscraper(
    website_url="https://example.com/blog/article",
    user_prompt="Extract the article information",
    output_schema=ArticleData
)

print(f"Title: {response.title}")
print(f"Author: {response.author}")
print(f"Published: {response.publish_date}")
Define a complex schema for nested data structures:
from typing import List
from pydantic import BaseModel, Field

class Employee(BaseModel):
    name: str = Field(description="Employee's full name")
    position: str = Field(description="Job title")
    department: str = Field(description="Department name")
    email: str = Field(description="Email address")

class Office(BaseModel):
    location: str = Field(description="Office location/city")
    address: str = Field(description="Full address")
    phone: str = Field(description="Contact number")

class CompanyData(BaseModel):
    name: str = Field(description="Company name")
    description: str = Field(description="Company description")
    industry: str = Field(description="Industry sector")
    founded_year: int = Field(description="Year company was founded")
    employees: List[Employee] = Field(description="List of key employees")
    offices: List[Office] = Field(description="Company office locations")
    website: str = Field(description="Company website URL")

# Extract comprehensive company information
response = client.smartscraper(
    website_url="https://example.com/about",
    user_prompt="Extract detailed company information including employees and offices",
    output_schema=CompanyData
)

# Access nested data
print(f"Company: {response.name}")
print("\nKey Employees:")
for employee in response.employees:
    print(f"- {employee.name} ({employee.position})")

print("\nOffice Locations:")
for office in response.offices:
    print(f"- {office.location}: {office.address}")
For modern web applications built with React, Vue, Angular, or other JavaScript frameworks:
from scrapegraph_py import Client
from pydantic import BaseModel, Field

class ProductInfo(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Product price")
    description: str = Field(description="Product description")
    availability: str = Field(description="Product availability status")

client = Client(api_key="your-api-key")

# Enable enhanced JavaScript rendering for a React-based e-commerce site
response = client.smartscraper(
    website_url="https://example-react-store.com/products/123",
    user_prompt="Extract product details including name, price, description, and availability",
    output_schema=ProductInfo,
    render_heavy_js=True  # Enable for React/Vue/Angular sites
)

print(f"Product: {response['result']['name']}")
print(f"Price: {response['result']['price']}")
print(f"Available: {response['result']['availability']}")
When to use render_heavy_js:
  • React, Vue, or Angular applications
  • Single Page Applications (SPAs)
  • Sites with heavy client-side rendering
  • Dynamic content loaded via JavaScript
  • Interactive elements that depend on JavaScript execution
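As a rough rule of thumb, a page that ships little server-rendered markup and mounts into an empty root element is a candidate for render_heavy_js. The heuristic below is purely illustrative and not part of the SDK; the marker strings and size threshold are assumptions.

```python
def looks_js_heavy(html: str) -> bool:
    """Heuristic: SPA shells ship little markup and mount into an empty root element."""
    markers = ('id="root"', 'id="app"', "window.__NUXT__", "ng-version")
    return any(m in html for m in markers) or len(html) < 2000

# A typical React shell: nearly empty body plus a bundle script
spa_shell = '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>'
print(looks_js_heavy(spa_shell))  # True: a classic React mount point
```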

SearchScraper

Search and extract information from multiple web sources using AI:
response = client.searchscraper(
    user_prompt="What are the key features and pricing of ChatGPT Plus?"
)

Parameters

Parameter        Type     Required  Description
user_prompt      string   Yes       A textual description of what you want to achieve.
num_results      number   No        Number of websites to search (3-20). Default: 3.
extraction_mode  boolean  No        True = AI extraction mode (10 credits/page); False = markdown mode (2 credits/page). Default: True.
output_schema    object   No        The Pydantic object that describes the structure and format of the response (AI extraction mode only).
Define a simple schema for structured search results:
from pydantic import BaseModel, Field
from typing import List

class ProductInfo(BaseModel):
    name: str = Field(description="Product name")
    description: str = Field(description="Product description")
    price: str = Field(description="Product price")
    features: List[str] = Field(description="List of key features")
    availability: str = Field(description="Availability information")

response = client.searchscraper(
    user_prompt="Find information about iPhone 15 Pro",
    output_schema=ProductInfo
)

print(f"Product: {response.name}")
print(f"Price: {response.price}")
print("\nFeatures:")
for feature in response.features:
    print(f"- {feature}")
Define a complex schema for comprehensive market research:
from typing import List
from pydantic import BaseModel, Field

class MarketPlayer(BaseModel):
    name: str = Field(description="Company name")
    market_share: str = Field(description="Market share percentage")
    key_products: List[str] = Field(description="Main products in market")
    strengths: List[str] = Field(description="Company's market strengths")

class MarketTrend(BaseModel):
    name: str = Field(description="Trend name")
    description: str = Field(description="Trend description")
    impact: str = Field(description="Expected market impact")
    timeframe: str = Field(description="Trend timeframe")

class MarketAnalysis(BaseModel):
    market_size: str = Field(description="Total market size")
    growth_rate: str = Field(description="Annual growth rate")
    key_players: List[MarketPlayer] = Field(description="Major market players")
    trends: List[MarketTrend] = Field(description="Market trends")
    challenges: List[str] = Field(description="Industry challenges")
    opportunities: List[str] = Field(description="Market opportunities")

# Perform comprehensive market research
response = client.searchscraper(
    user_prompt="Analyze the current AI chip market landscape",
    output_schema=MarketAnalysis
)

# Access structured market data
print(f"Market Size: {response.market_size}")
print(f"Growth Rate: {response.growth_rate}")

print("\nKey Players:")
for player in response.key_players:
    print(f"\n{player.name}")
    print(f"Market Share: {player.market_share}")
    print("Key Products:")
    for product in player.key_products:
        print(f"- {product}")

print("\nMarket Trends:")
for trend in response.trends:
    print(f"\n{trend.name}")
    print(f"Impact: {trend.impact}")
    print(f"Timeframe: {trend.timeframe}")
Use markdown mode for cost-effective content gathering:
from scrapegraph_py import Client

client = Client(api_key="your-api-key")

# Enable markdown mode for cost-effective content gathering
response = client.searchscraper(
    user_prompt="Latest developments in artificial intelligence",
    num_results=3,
    extraction_mode=False  # Enable markdown mode (2 credits per page vs 10 credits)
)

# Access the raw markdown content
markdown_content = response['markdown_content']
reference_urls = response['reference_urls']

print(f"Markdown content length: {len(markdown_content)} characters")
print(f"Reference URLs: {len(reference_urls)}")

# Process the markdown content
print("Content preview:", markdown_content[:500] + "...")

# Save to file for analysis
with open('ai_research_content.md', 'w', encoding='utf-8') as f:
    f.write(markdown_content)

print("Content saved to ai_research_content.md")
Markdown Mode Benefits:
  • Cost-effective: Only 2 credits per page (vs 10 credits for AI extraction)
  • Full content: Get complete page content in markdown format
  • Faster: No AI processing overhead
  • Perfect for: Content analysis, bulk data collection, building datasets
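Using the per-page rates above (10 credits for AI extraction, 2 for markdown mode), the savings on a bulk job are straightforward arithmetic. The `job_cost` helper below is not an SDK function, just a sketch over the documented rates:

```python
AI_CREDITS_PER_PAGE = 10        # AI extraction mode
MARKDOWN_CREDITS_PER_PAGE = 2   # markdown mode

def job_cost(pages: int, extraction_mode: bool = True) -> int:
    """Estimated credit cost for scraping `pages` pages in the given mode."""
    rate = AI_CREDITS_PER_PAGE if extraction_mode else MARKDOWN_CREDITS_PER_PAGE
    return pages * rate

pages = 100
print(job_cost(pages, extraction_mode=True))   # 1000 credits with AI extraction
print(job_cost(pages, extraction_mode=False))  # 200 credits in markdown mode
```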

Markdownify

Convert any webpage into clean, formatted markdown:
response = client.markdownify(
    website_url="https://example.com"
)

Async Support

All endpoints support asynchronous operations:
import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient() as client:
        response = await client.smartscraper(
            website_url="https://example.com",
            user_prompt="Extract the main content"
        )
        print(response)

asyncio.run(main())
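The async client pairs naturally with asyncio.gather for fanning out many requests concurrently. The sketch below substitutes a stand-in coroutine for the live AsyncClient call so it runs anywhere, but the fan-out pattern is the same:

```python
import asyncio

async def scrape(url: str) -> dict:
    # Stand-in for `await client.smartscraper(website_url=url, ...)`
    await asyncio.sleep(0.01)  # simulate network latency
    return {"url": url, "result": "..."}

async def main():
    urls = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ]
    # Fan out all requests concurrently; gather preserves input order
    return await asyncio.gather(*(scrape(u) for u in urls))

results = asyncio.run(main())
print([r["url"] for r in results])
```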

Feedback

Help us improve by submitting feedback programmatically:
client.submit_feedback(
    request_id="your-request-id",
    rating=5,
    feedback_text="Great results!"
)

License

This project is licensed under the MIT License. See the LICENSE file for details.