SearchScraper is our advanced LLM-powered search service that intelligently searches and aggregates information from multiple web sources. Using state-of-the-art language models, it understands your queries and extracts relevant information across the web, providing comprehensive answers with full source attribution.
```python
from scrapegraph_py import Client

client = Client(api_key="your-api-key")

response = client.searchscraper(
    user_prompt="What are the key features and pricing of ChatGPT Plus?"
)
```
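The returned response is a dictionary; the fields used throughout these examples are `result` (the synthesized answer) and `reference_urls` (the source attribution). A quick sketch of reading those fields, with a stub dictionary standing in for a live API response (the answer text here is illustrative, not real output):

```python
# Stub with the same shape as a searchscraper response: `result` and
# `reference_urls` are the fields used throughout these examples.
response = {
    "result": "ChatGPT Plus costs $20/month and includes access to GPT-4 ...",
    "reference_urls": ["https://openai.com/chatgpt/pricing"],
}

print(response["result"])
for url in response["reference_urls"]:
    print(f"- {url}")
```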
The schema system in SearchScraper is a powerful way to ensure you get exactly the data structure you need. Here are some advanced techniques for using schemas effectively:
Always provide clear, detailed descriptions for each field in your schema:
```python
from pydantic import BaseModel, Field

class CompanyInfo(BaseModel):
    revenue: str = Field(
        description="Annual revenue in USD, including the year of reporting"
        # Good: "Annual revenue in USD, including the year of reporting"
        # Bad: "Revenue"
    )
    market_position: str = Field(
        description="Company's market position including market share percentage and rank among competitors"
        # Good: "Company's market position including market share percentage and rank among competitors"
        # Bad: "Position"
    )
```
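A schema defined this way can also be used to re-validate the structured result on your side. A minimal sketch, with a hypothetical payload dict standing in for the structured output the API would return for this schema (the figures are placeholders):

```python
from pydantic import BaseModel, Field

class CompanyInfo(BaseModel):
    revenue: str = Field(description="Annual revenue in USD, including the year of reporting")
    market_position: str = Field(description="Company's market position including market share percentage and rank among competitors")

# Hypothetical payload shaped like the structured result for this schema.
# In real use it would come from:
#   client.searchscraper(user_prompt=..., output_schema=CompanyInfo)
payload = {
    "revenue": "USD 1.2B (2023)",
    "market_position": "Roughly 20% market share, #2 among competitors",
}

info = CompanyInfo(**payload)
print(info.revenue)
```

Loading the payload back through the model means any missing or mistyped field fails loudly instead of propagating silently.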
Combine schemas with well-structured prompts for better results:
```python
response = client.searchscraper(
    user_prompt="""
    Find information about Tesla's electric vehicles with specific focus on:
    - Latest Model 3 and Model Y specifications
    - Current pricing structure
    - Available customization options
    - Delivery timeframes
    Please include only verified information from official sources.
    """,
    output_schema=TeslaVehicleInfo,
)
```
Implement comprehensive validation to ensure data quality:
```python
from datetime import datetime
from pydantic import BaseModel, Field, validator

class MarketData(BaseModel):
    timestamp: str = Field(description="Data timestamp in ISO format")
    value: float = Field(description="Market value")
    confidence_score: float = Field(description="Confidence score between 0 and 1")

    @validator('timestamp')
    def validate_timestamp(cls, v):
        try:
            datetime.fromisoformat(v)
            return v
        except ValueError:
            raise ValueError('Invalid ISO timestamp format')

    @validator('confidence_score')
    def validate_confidence(cls, v):
        if not 0 <= v <= 1:
            raise ValueError('Confidence score must be between 0 and 1')
        return v
```
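The same checks can be exercised on their own to see what passes and what gets rejected. A stdlib-only sketch mirroring the two validators as a plain function (`validate_market_data` is a hypothetical helper, not part of the SDK):

```python
from datetime import datetime

def validate_market_data(record: dict) -> dict:
    # Mirrors the MarketData validators as a plain function
    try:
        datetime.fromisoformat(record["timestamp"])
    except ValueError:
        raise ValueError("Invalid ISO timestamp format")
    if not 0 <= record["confidence_score"] <= 1:
        raise ValueError("Confidence score must be between 0 and 1")
    return record

# A well-formed record passes through unchanged
ok = validate_market_data(
    {"timestamp": "2024-01-15T09:30:00", "value": 101.5, "confidence_score": 0.92}
)

# An out-of-range confidence score is rejected
try:
    validate_market_data(
        {"timestamp": "2024-01-15T09:30:00", "value": 1.0, "confidence_score": 1.5}
    )
except ValueError as e:
    print(e)  # prints "Confidence score must be between 0 and 1"
```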
Example of using the async client to run multiple SearchScraper queries concurrently:
```python
import asyncio

from scrapegraph_py import AsyncClient
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

async def main():
    # Initialize async client
    sgai_client = AsyncClient(api_key="your-api-key-here")

    # List of search queries
    queries = [
        "What is the latest version of Python and what are its main features?",
        "What are the key differences between Python 2 and Python 3?",
        "What is Python's GIL and how does it work?",
    ]

    # Create tasks for concurrent execution
    tasks = [sgai_client.searchscraper(user_prompt=query) for query in queries]

    # Execute requests concurrently
    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Process results
    for i, response in enumerate(responses):
        if isinstance(response, Exception):
            print(f"\nError for query {i+1}: {response}")
        else:
            print(f"\nSearch {i+1}:")
            print(f"Query: {queries[i]}")
            print(f"Result: {response['result']}")
            print("Reference URLs:")
            for url in response["reference_urls"]:
                print(f"- {url}")

    await sgai_client.close()

if __name__ == "__main__":
    asyncio.run(main())
```
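When fanning out many queries at once, you may want to cap how many run concurrently. A minimal stdlib sketch of that pattern using `asyncio.Semaphore`, with a stub coroutine (`fetch`) standing in for the real `sgai_client.searchscraper` call:

```python
import asyncio

async def fetch(query: str) -> str:
    # Stand-in for `await sgai_client.searchscraper(user_prompt=query)`
    await asyncio.sleep(0.01)
    return f"result for: {query}"

async def bounded_gather(queries, limit=2):
    # The semaphore caps how many requests are in flight at any moment
    sem = asyncio.Semaphore(limit)

    async def run(query):
        async with sem:
            return await fetch(query)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run(q) for q in queries))

results = asyncio.run(bounded_gather(["a", "b", "c"]))
print(results)
```

The same wrapper drops into the example above by replacing `fetch` with the real client call, keeping concurrency bounded while still collecting results in query order.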