Enhancing AI Applications with Web Data
Learn how to integrate ScrapeGraphAI with your AI and LLM applications to enhance their capabilities with real-time web data.
Common Use Cases
- RAG (Retrieval-Augmented Generation): Ground your LLM responses in up-to-date web content
- AI Assistants: Build domain-specific AI assistants with access to web data
- Knowledge Bases: Create and maintain dynamic knowledge bases from web sources
- Research Agents: Develop autonomous agents that can research and analyze web content
Integration Examples
RAG with LangChain
from langchain.chains import LLMChain  # legacy chain API; llm_chain below is an instance of it
from scrapegraph_py import Client
from pydantic import BaseModel, Field
from typing import Optional

class ArticleSchema(BaseModel):
    """Schema for article content"""
    title: str = Field(description="Article title")
    content: str = Field(description="Main article content")
    # Optional fields default to None so a missing value does not fail validation
    author: Optional[str] = Field(default=None, description="Article author name")
    date: Optional[str] = Field(default=None, description="Publication date")
    summary: Optional[str] = Field(default=None, description="Article summary or description")

# Initialize the client
client = Client(api_key="your-api-key")

# text_splitter, vectorstore, and llm_chain are assumed to be configured
# elsewhere in your LangChain application.
try:
    # Scrape relevant content
    response = client.extract(
        url="https://example.com/article",
        prompt="Extract the main article content, title, author, and publication date",
        output_schema=ArticleSchema
    )

    # Use in your RAG pipeline
    text_content = f"Title: {response.title}\n\nContent: {response.content}"
    chunks = text_splitter.split_text(text_content)  # split_text takes a string and returns a list of strings
    vectorstore.add_texts(chunks)  # add_texts accepts raw strings; add_documents expects Document objects

    # Query your LLM with the enhanced context
    answer = llm_chain.run("Summarize the latest developments...")
except Exception as e:
    print(f"Error occurred: {e}")
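The splitting step above can be sketched without any framework. The following is a minimal illustration of what a character-based text splitter does before chunks reach a vector store. `chunk_text` is a hypothetical helper, not part of LangChain or ScrapeGraphAI, and the title and content values are placeholders standing in for scraped fields.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into character chunks of at most chunk_size, overlapping by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Placeholder values standing in for response.title and response.content
title = "Example article"
content = "word " * 300
text_content = f"Title: {title}\n\nContent: {content}"

chunks = chunk_text(text_content, chunk_size=500, overlap=50)
print(f"Split {len(text_content)} characters into {len(chunks)} chunks")
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side. Production splitters usually also try to break on paragraph or sentence boundaries, which this sketch does not attempt.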
AI Research Assistant
from scrapegraph_py import Client
from pydantic import BaseModel, Field
from typing import List

class ResearchData(BaseModel):
    title: str = Field(description="Article title")
    content: str = Field(description="Main article content")
    author: str = Field(description="Article author")
    date: str = Field(description="Publication date")

class ResearchResults(BaseModel):
    articles: List[ResearchData]

# Initialize the client
client = Client(api_key="your-api-key")

# ai_model is a placeholder for your own analysis component.
try:
    # Search and scrape multiple sources
    search_results = client.search(
        query="What are the latest developments in artificial intelligence?",
        output_schema=ResearchResults,
        num_results=5
    )

    # Process with your AI model
    if search_results and search_results.articles:
        analysis = ai_model.analyze(search_results.articles)
        print(f"Analyzed {len(search_results.articles)} articles")
    else:
        print("No articles found in the search results")
except Exception as e:
    print(f"Error during research: {e}")
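The "process with your AI model" step is left open above. One possible approach is to fold the scraped articles into a single prompt with numbered sources. The sketch below assumes that approach. `Article` is a dataclass stand-in for `ResearchData` so the example runs without pydantic, and `build_analysis_prompt` is a hypothetical helper, not an SDK function.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Stand-in for the ResearchData model above (same four fields)."""
    title: str
    content: str
    author: str
    date: str


def build_analysis_prompt(question: str, articles: list[Article], max_chars: int = 1000) -> str:
    """Number each article, truncate its body to max_chars, and append the question."""
    sections = []
    for i, article in enumerate(articles, start=1):
        body = article.content[:max_chars]  # keep the prompt within a rough size budget
        sections.append(f"[{i}] {article.title} ({article.author}, {article.date})\n{body}")
    sources = "\n\n".join(sections)
    return f"Sources:\n\n{sources}\n\nQuestion: {question}"


# Illustrative placeholder data, not real search results
sample_articles = [
    Article("First article", "Body of the first article.", "A. Author", "2024-01-01"),
    Article("Second article", "Body of the second article.", "B. Author", "2024-02-02"),
]
prompt = build_analysis_prompt("What changed recently?", sample_articles)
print(prompt)
```

Numbering the sources lets the model cite them as [1], [2], and so on in its answer. Truncating by characters is a crude budget; a token-based limit would be more precise.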
Best Practices
- Data Freshness: Re-scrape sources on a schedule so your knowledge base does not drift out of date
- Content Filtering: Use our filtering options to get only relevant content
- Rate Limiting: Throttle your requests in production applications instead of sending them as fast as possible
- Error Handling: Catch scraping errors such as network failures or timeouts, and retry or fall back instead of letting the pipeline crash
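The rate-limiting and error-handling points above can be combined in a small wrapper. This is a minimal sketch using only the standard library. `MinIntervalThrottle` and `call_with_retry` are hypothetical names, not part of the ScrapeGraphAI SDK. It retries on any `Exception` because the SDK's specific error types are not covered here; narrowing the `except` clause to transient errors would be better in practice.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class MinIntervalThrottle:
    """Blocks so that successive calls are at least min_interval seconds apart."""

    def __init__(self, min_interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> None:
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


def call_with_retry(fn: Callable[[], T], *, throttle: MinIntervalThrottle,
                    max_attempts: int = 3, base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        throttle.wait()  # respect the minimum spacing between requests
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ... base_delay
    raise ValueError("max_attempts must be at least 1")
```

To use it, wrap the scraping call in a zero-argument function, for example a lambda around the `client.extract` call from the RAG example, and pass it to `call_with_retry` together with a shared `MinIntervalThrottle`. The `clock` and `sleep` parameters exist so the timing can be replaced in tests.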