Enhancing AI Applications with Web Data

Learn how to integrate ScrapeGraphAI with your AI and LLM applications to enhance their capabilities with real-time web data.

Common Use Cases

  • RAG (Retrieval Augmented Generation): Enhance your LLM responses with up-to-date web content
  • AI Assistants: Build domain-specific AI assistants with access to web data
  • Knowledge Bases: Create and maintain dynamic knowledge bases from web sources
  • Research Agents: Develop autonomous agents that can research and analyze web content

Integration Examples

RAG with LangChain

from langchain.chains import LLMChain
from scrapegraph_py import Client
from pydantic import BaseModel, Field
from typing import Optional

class ArticleSchema(BaseModel):
    """Schema for article content"""
    title: str = Field(description="Article title")
    content: str = Field(description="Main article content")
    author: Optional[str] = Field(description="Article author name")
    date: Optional[str] = Field(description="Publication date")
    summary: Optional[str] = Field(description="Article summary or description")

# Initialize the client (pass api_key="..." if it is not set in your environment)
client = Client()

try:
    # Scrape relevant content
    response = client.smartscraper(
        website_url="https://example.com/article",
        user_prompt="Extract the main article content, title, author, and publication date",
        output_schema=ArticleSchema
    )

    # Use in your RAG pipeline (text_splitter, vectorstore, and llm_chain
    # are assumed to be configured elsewhere; see the sketch below)
    text_content = f"Title: {response.title}\n\nContent: {response.content}"
    chunks = text_splitter.split_text(text_content)  # split_text returns plain strings
    vectorstore.add_texts(chunks)  # add_texts takes strings; add_documents expects Document objects

    # Query your LLM with the enhanced context
    answer = llm_chain.run("Summarize the latest developments...")

except Exception as e:
    print(f"Error occurred: {str(e)}")

AI Research Assistant

from scrapegraph_py import Client
from pydantic import BaseModel, Field
from typing import List

class ResearchData(BaseModel):
    title: str = Field(description="Article title")
    content: str = Field(description="Main article content")
    author: str = Field(description="Article author")
    date: str = Field(description="Publication date")

class ResearchResults(BaseModel):
    articles: List[ResearchData]

# Initialize the client
client = Client()

try:
    # Search and scrape multiple sources
    search_results = client.searchscraper(
        user_prompt="What are the latest developments in artificial intelligence?",
        output_schema=ResearchResults,
        num_results=5,  # Number of websites to search (3-20)
        extraction_mode=True  # Use AI extraction mode for structured data
    )

    # Process with your AI model (ai_model is a placeholder for your own
    # analysis pipeline; see the sketch after this example)
    if search_results and search_results.articles:
        analysis = ai_model.analyze(search_results.articles)
        print(f"Analyzed {len(search_results.articles)} articles")
    else:
        print("No articles found in the search results")

except Exception as e:
    print(f"Error during research: {str(e)}")

Best Practices

  1. Data Freshness: Regularly update your knowledge base with fresh web content
  2. Content Filtering: Use our filtering options to get only relevant content
  3. Rate Limiting: Implement appropriate rate limiting for production applications
  4. Error Handling: Always handle potential scraping errors gracefully (a retry-with-backoff sketch covering both of these points follows below)
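
For rate limiting and error handling, a minimal sketch that wraps smartscraper calls in retries with exponential backoff and jitter (the retry count and delays are illustrative defaults, not official ScrapeGraphAI limits):

import random
import time

from scrapegraph_py import Client

client = Client()

def scrape_with_retry(url: str, prompt: str, max_retries: int = 3, base_delay: float = 1.0):
    """Call smartscraper, backing off exponentially between failed attempts."""
    for attempt in range(max_retries):
        try:
            return client.smartscraper(website_url=url, user_prompt=prompt)
        except Exception as exc:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            # Exponential backoff with jitter: ~1s, ~2s, ~4s
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)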