Installation
Install the package using pip:
```bash
pip install scrapegraph-py
```
Features
- AI-Powered Extraction: Advanced web scraping using artificial intelligence
- Flexible Clients: Both synchronous and asynchronous support
- Type Safety: Structured output with Pydantic schemas
- Production Ready: Detailed logging and automatic retries (a logging sketch follows this list)
- Developer Friendly: Comprehensive error handling
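As a quick illustration of the logging feature, here is a minimal sketch. The `sgai_logger` helper and its `set_logging` method follow the package README, but treat the exact import path as an assumption and verify it against your installed version:

```python
# Assumption: the logger helper lives at scrapegraph_py.logger, per the package README
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")  # surface request and retry details in the logs
```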
Quick Start
Initialize the client with your API key:
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")
```
You can also set the `SGAI_API_KEY` environment variable and initialize the client without parameters: `client = Client()`.
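A minimal sketch of the environment-variable approach (the `export` line assumes a POSIX shell; any mechanism that sets `SGAI_API_KEY` before the process starts works the same way):

```python
# Assumes SGAI_API_KEY was set in the environment beforehand, e.g.:
#   export SGAI_API_KEY="your-api-key-here"
from scrapegraph_py import Client

client = Client()  # reads SGAI_API_KEY from the environment
```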
Services
SmartScraper
Extract specific information from any webpage using AI:
```python
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main heading and description"
)
```
Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| website_url | string | Yes | The URL of the webpage that needs to be scraped. |
| user_prompt | string | Yes | A textual description of what you want to achieve. |
| output_schema | object | No | The Pydantic object that describes the structure and format of the response. |
| render_heavy_js | boolean | No | Enable enhanced JavaScript rendering for heavy JS websites (React, Vue, Angular, etc.). Default: False |
Define a simple schema for basic data extraction:

```python
from pydantic import BaseModel, Field

class ArticleData(BaseModel):
    title: str = Field(description="The article title")
    author: str = Field(description="The author's name")
    publish_date: str = Field(description="Article publication date")
    content: str = Field(description="Main article content")
    category: str = Field(description="Article category")

response = client.smartscraper(
    website_url="https://example.com/blog/article",
    user_prompt="Extract the article information",
    output_schema=ArticleData
)

print(f"Title: {response.title}")
print(f"Author: {response.author}")
print(f"Published: {response.publish_date}")
```
Define a complex schema for nested data structures:

```python
from typing import List
from pydantic import BaseModel, Field

class Employee(BaseModel):
    name: str = Field(description="Employee's full name")
    position: str = Field(description="Job title")
    department: str = Field(description="Department name")
    email: str = Field(description="Email address")

class Office(BaseModel):
    location: str = Field(description="Office location/city")
    address: str = Field(description="Full address")
    phone: str = Field(description="Contact number")

class CompanyData(BaseModel):
    name: str = Field(description="Company name")
    description: str = Field(description="Company description")
    industry: str = Field(description="Industry sector")
    founded_year: int = Field(description="Year company was founded")
    employees: List[Employee] = Field(description="List of key employees")
    offices: List[Office] = Field(description="Company office locations")
    website: str = Field(description="Company website URL")

# Extract comprehensive company information
response = client.smartscraper(
    website_url="https://example.com/about",
    user_prompt="Extract detailed company information including employees and offices",
    output_schema=CompanyData
)

# Access nested data
print(f"Company: {response.name}")

print("\nKey Employees:")
for employee in response.employees:
    print(f"- {employee.name} ({employee.position})")

print("\nOffice Locations:")
for office in response.offices:
    print(f"- {office.location}: {office.address}")
```
Enhanced JavaScript Rendering Example
For modern web applications built with React, Vue, Angular, or other JavaScript frameworks:

```python
from scrapegraph_py import Client
from pydantic import BaseModel, Field

class ProductInfo(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Product price")
    description: str = Field(description="Product description")
    availability: str = Field(description="Product availability status")

client = Client(api_key="your-api-key")

# Enable enhanced JavaScript rendering for a React-based e-commerce site
response = client.smartscraper(
    website_url="https://example-react-store.com/products/123",
    user_prompt="Extract product details including name, price, description, and availability",
    output_schema=ProductInfo,
    render_heavy_js=True  # Enable for React/Vue/Angular sites
)

print(f"Product: {response['result']['name']}")
print(f"Price: {response['result']['price']}")
print(f"Available: {response['result']['availability']}")
```
When to use `render_heavy_js`:
- React, Vue, or Angular applications
- Single Page Applications (SPAs)
- Sites with heavy client-side rendering
- Dynamic content loaded via JavaScript
- Interactive elements that depend on JavaScript execution
SearchScraper
Search and extract information from multiple web sources using AI:
```python
response = client.searchscraper(
    user_prompt="What are the key features and pricing of ChatGPT Plus?"
)
```
Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| user_prompt | string | Yes | A textual description of what you want to achieve. |
| num_results | number | No | Number of websites to search (3-20). Default: 3. |
| extraction_mode | boolean | No | True = AI extraction mode (10 credits/page), False = markdown mode (2 credits/page). Default: True |
| output_schema | object | No | The Pydantic object that describes the structure and format of the response (AI extraction mode only). |
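The examples below focus on `output_schema` and `extraction_mode`, so here is a minimal sketch of `num_results` on its own (the prompt and chosen value are illustrative):

```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key")

# Search more sources than the default of 3; num_results accepts 3-20
response = client.searchscraper(
    user_prompt="What are the key features and pricing of ChatGPT Plus?",
    num_results=5
)
print(response)
```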
Define a simple schema for structured search results:

```python
from typing import List
from pydantic import BaseModel, Field

class ProductInfo(BaseModel):
    name: str = Field(description="Product name")
    description: str = Field(description="Product description")
    price: str = Field(description="Product price")
    features: List[str] = Field(description="List of key features")
    availability: str = Field(description="Availability information")

response = client.searchscraper(
    user_prompt="Find information about iPhone 15 Pro",
    output_schema=ProductInfo
)

print(f"Product: {response.name}")
print(f"Price: {response.price}")

print("\nFeatures:")
for feature in response.features:
    print(f"- {feature}")
```
Define a complex schema for comprehensive market research:

```python
from typing import List
from pydantic import BaseModel, Field

class MarketPlayer(BaseModel):
    name: str = Field(description="Company name")
    market_share: str = Field(description="Market share percentage")
    key_products: List[str] = Field(description="Main products in market")
    strengths: List[str] = Field(description="Company's market strengths")

class MarketTrend(BaseModel):
    name: str = Field(description="Trend name")
    description: str = Field(description="Trend description")
    impact: str = Field(description="Expected market impact")
    timeframe: str = Field(description="Trend timeframe")

class MarketAnalysis(BaseModel):
    market_size: str = Field(description="Total market size")
    growth_rate: str = Field(description="Annual growth rate")
    key_players: List[MarketPlayer] = Field(description="Major market players")
    trends: List[MarketTrend] = Field(description="Market trends")
    challenges: List[str] = Field(description="Industry challenges")
    opportunities: List[str] = Field(description="Market opportunities")

# Perform comprehensive market research
response = client.searchscraper(
    user_prompt="Analyze the current AI chip market landscape",
    output_schema=MarketAnalysis
)

# Access structured market data
print(f"Market Size: {response.market_size}")
print(f"Growth Rate: {response.growth_rate}")

print("\nKey Players:")
for player in response.key_players:
    print(f"\n{player.name}")
    print(f"Market Share: {player.market_share}")
    print("Key Products:")
    for product in player.key_products:
        print(f"- {product}")

print("\nMarket Trends:")
for trend in response.trends:
    print(f"\n{trend.name}")
    print(f"Impact: {trend.impact}")
    print(f"Timeframe: {trend.timeframe}")
```
Use markdown mode for cost-effective content gathering:

```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key")

# Enable markdown mode for cost-effective content gathering
response = client.searchscraper(
    user_prompt="Latest developments in artificial intelligence",
    num_results=3,
    extraction_mode=False  # Markdown mode: 2 credits per page vs 10 for AI extraction
)

# Access the raw markdown content
markdown_content = response['markdown_content']
reference_urls = response['reference_urls']

print(f"Markdown content length: {len(markdown_content)} characters")
print(f"Reference URLs: {len(reference_urls)}")

# Process the markdown content
print("Content preview:", markdown_content[:500] + "...")

# Save to file for analysis
with open('ai_research_content.md', 'w', encoding='utf-8') as f:
    f.write(markdown_content)

print("Content saved to ai_research_content.md")
```
Markdown Mode Benefits:
- Cost-effective: Only 2 credits per page (vs 10 credits for AI extraction)
- Full content: Get complete page content in markdown format
- Faster: No AI processing overhead
- Perfect for: Content analysis, bulk data collection, and building datasets (see the sketch below)
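As a bulk-collection illustration, here is a minimal sketch that gathers markdown for several topics and writes one file per topic; the topic list and filenames are hypothetical:

```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key")

# Hypothetical topic list; each markdown-mode page costs 2 credits
topics = ["large language models", "AI chips", "vector databases"]

for topic in topics:
    response = client.searchscraper(
        user_prompt=f"Latest developments in {topic}",
        extraction_mode=False  # markdown mode
    )
    filename = topic.replace(" ", "_") + ".md"
    with open(filename, "w", encoding="utf-8") as f:
        f.write(response["markdown_content"])
    print(f"Saved {filename}")
```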
Markdownify
Convert any webpage into clean, formatted markdown:
```python
response = client.markdownify(
    website_url="https://example.com"
)
```
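A hedged sketch for handling the result and saving it to disk; the assumption that the markdown string lives under a `result` key is not confirmed by the snippet above, so check the response shape you actually receive:

```python
# Assumption: the response is a dict whose "result" field holds the markdown string
markdown = response["result"]

with open("example_com.md", "w", encoding="utf-8") as f:
    f.write(markdown)
```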
Async Support
All endpoints support asynchronous operations:
```python
import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient() as client:
        response = await client.smartscraper(
            website_url="https://example.com",
            user_prompt="Extract the main content"
        )
        print(response)

asyncio.run(main())
```
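Concurrency is the main payoff of the async client. A minimal sketch, assuming the async methods mirror the synchronous ones shown above; the URLs are placeholders:

```python
import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient() as client:
        # Run two independent requests concurrently instead of sequentially
        scrape, markdown = await asyncio.gather(
            client.smartscraper(
                website_url="https://example.com/a",
                user_prompt="Extract the main heading"
            ),
            client.markdownify(website_url="https://example.com/b"),
        )
        print(scrape)
        print(markdown)

asyncio.run(main())
```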
Feedback
Help us improve by submitting feedback programmatically:
```python
client.submit_feedback(
    request_id="your-request-id",
    rating=5,
    feedback_text="Great results!"
)
```
License
This project is licensed under the MIT License. See the LICENSE file for details.