LlamaIndex is a data framework for building LLM-powered agents and RAG applications. This page shows how to wire scrapegraph-py ≥ 2.0.1 into LlamaIndex as a set of FunctionTools so your agents can scrape pages, extract structured data, search the web, run asynchronous crawls, and manage scheduled monitors.
Learn more about building agents and RAG pipelines in the official LlamaIndex documentation: https://docs.llamaindex.ai/
Which package? LlamaIndex also ships a pre-built tool spec at llama-index-tools-scrapegraphai, but it currently depends on scrapegraph-py<2 and targets the legacy v1 backend. New v2 API keys are rejected by that path. The recipes below use the v2 SDK directly — they work with the current dashboard and every v2 endpoint (scrape, extract, search, crawl, monitor).
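If you are starting fresh, install the v2 SDK alongside LlamaIndex and point it at your dashboard key. A minimal setup sketch, assuming the v2 client reads the key from the same `SGAI_API_KEY` environment variable the v1 SDK used:

```python
# pip install "scrapegraph-py>=2.0.1" llama-index llama-index-llms-openai
import os

# Assumption: the v2 client picks up the key from SGAI_API_KEY, as the v1 SDK did.
os.environ["SGAI_API_KEY"] = "sgai-..."  # your v2 dashboard key

from scrapegraph_py import ScrapeGraphAI

sgai = ScrapeGraphAI()  # every recipe below relies on this implicit setup
```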
The following recipes are ported from the official scrapegraph-py cookbook notebooks, swapped to call the v2 extract endpoint so they run against the current dashboard API key.
Pull founders, pricing plans, and social links off a company homepage. Based on cookbook/company-info/.
```python
from pydantic import BaseModel, Field
from typing import List

from scrapegraph_py import ScrapeGraphAI


class FounderSchema(BaseModel):
    name: str = Field(description="Name of the founder")
    role: str = Field(description="Role of the founder in the company")
    linkedin: str = Field(description="LinkedIn profile of the founder")


class PricingPlanSchema(BaseModel):
    tier: str = Field(description="Name of the pricing tier")
    price: str = Field(description="Price of the plan")
    credits: int = Field(description="Number of credits included in the plan")


class SocialLinksSchema(BaseModel):
    linkedin: str
    twitter: str
    github: str


class CompanyInfoSchema(BaseModel):
    company_name: str
    description: str
    founders: List[FounderSchema] = Field(default_factory=list)
    pricing_plans: List[PricingPlanSchema] = Field(default_factory=list)
    social_links: SocialLinksSchema


sgai = ScrapeGraphAI()

res = sgai.extract(
    "Extract info about the company",
    url="https://scrapegraphai.com/",
    schema=CompanyInfoSchema.model_json_schema(),
)

if res.status == "success":
    print(res.data.json_data)
```
Pull headlines from a news section. Based on cookbook/wired-news/.
```python
from pydantic import BaseModel, Field
from typing import List

from scrapegraph_py import ScrapeGraphAI


class NewsItemSchema(BaseModel):
    category: str = Field(description="Category of the news (e.g. 'Health', 'Environment')")
    title: str = Field(description="Title of the news article")
    link: str = Field(description="URL to the news article")
    author: str = Field(description="Author of the news article")


class ListNewsSchema(BaseModel):
    news: List[NewsItemSchema]


sgai = ScrapeGraphAI()

res = sgai.extract(
    "Extract the first 10 news articles on the page",
    url="https://www.wired.com/category/science/",
    schema=ListNewsSchema.model_json_schema(),
)

if res.status == "success":
    for item in res.data.json_data["news"]:
        print(f"[{item['category']}] {item['title']}")
```
Pull house listings with price, address, and tags. Based on cookbook/homes-forsale/.
```python
from pydantic import BaseModel, Field
from typing import List

from scrapegraph_py import ScrapeGraphAI, FetchConfig


class HouseListingSchema(BaseModel):
    price: int = Field(description="Price of the house in USD")
    bedrooms: int
    bathrooms: int
    square_feet: int = Field(description="Total square footage of the house")
    address: str
    city: str
    state: str
    zip_code: str
    tags: List[str] = Field(description="Tags like 'New construction' or 'Large garage'")
    agent_name: str
    agency: str


class HousesListingsSchema(BaseModel):
    houses: List[HouseListingSchema]


sgai = ScrapeGraphAI()

# Anti-bot heavy sites need stealth + JS rendering
res = sgai.extract(
    "Extract information about houses for sale",
    url="https://www.zillow.com/san-francisco-ca/",
    schema=HousesListingsSchema.model_json_schema(),
    fetch_config=FetchConfig(mode="js", stealth=True, wait=2000),
)
```
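The call above returns normally even when the fetch fails, so mirror the success check from the other recipes before reading the payload. A short follow-up, assuming `json_data` carries a `houses` key shaped like the schema:

```python
if res.status == "success":
    for house in res.data.json_data["houses"]:
        print(f"{house['address']}, {house['city']}: ${house['price']:,}")
else:
    print(f"Extraction failed: {res.error}")
```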
Combine scrape + extract into a LlamaIndex ReActAgent so the LLM decides which tool to call per step. Based on cookbook/research-agent/.
```python
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig
from llama_index.core.tools import FunctionTool
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI

sgai = ScrapeGraphAI()


def scrape(url: str) -> str:
    """Fetch a page and return its markdown content."""
    res = sgai.scrape(url, formats=[MarkdownFormatConfig()])
    if res.status != "success":
        return res.error or ""
    return res.data.results.get("markdown", {}).get("data", [""])[0]


def extract(url: str, prompt: str) -> dict:
    """Extract structured data from a URL using the given prompt."""
    res = sgai.extract(prompt, url=url)
    return res.data.json_data if res.status == "success" else {"error": res.error}


tools = [FunctionTool.from_defaults(fn=f) for f in (scrape, extract)]

agent = ReActAgent.from_tools(
    tools,
    llm=OpenAI(model="gpt-4o"),
    verbose=True,
)

response = agent.chat(
    "Extract all the keyboard names and prices from "
    "https://www.ebay.com/sch/i.html?_nkw=keyboards"
)
print(response)
```
Hand the full tool list to an agent and let it pick the right tool per step:
```python
from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig
from llama_index.core.tools import FunctionTool
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.openai import OpenAI

sgai = ScrapeGraphAI()


def scrape(url: str) -> str:
    """Fetch a page and return its markdown content."""
    res = sgai.scrape(url, formats=[MarkdownFormatConfig()])
    if res.status != "success":
        return res.error or ""
    return res.data.results.get("markdown", {}).get("data", [""])[0]


def extract(url: str, prompt: str) -> dict:
    """Extract structured data from a URL using the given prompt."""
    res = sgai.extract(prompt, url=url)
    return res.data.json_data if res.status == "success" else {"error": res.error}


def search(query: str, num_results: int = 5) -> list[dict]:
    """Search the web and return result titles and URLs."""
    res = sgai.search(query, num_results=num_results)
    if res.status == "error":
        return [{"error": res.error}]
    return [{"title": r.title, "url": r.url} for r in res.data.results]


def crawl(url: str, max_pages: int = 20) -> dict:
    """Start an asynchronous multi-page crawl and return its id."""
    res = sgai.crawl.start(url, formats=[MarkdownFormatConfig()], max_pages=max_pages)
    return {"crawl_id": res.data.id} if res.status == "success" else {"error": res.error}


def create_monitor(url: str, name: str, interval: str) -> dict:
    """Create a scheduled monitor that re-fetches the URL on the given interval."""
    res = sgai.monitor.create(
        url,
        interval,
        name=name,
        formats=[MarkdownFormatConfig()],
    )
    return {"cron_id": res.data.cron_id} if res.status == "success" else {"error": res.error}


tools = [FunctionTool.from_defaults(fn=f) for f in (
    scrape,
    extract,
    search,
    crawl,
    create_monitor,
)]

agent = FunctionAgent(
    tools=tools,
    llm=OpenAI(model="gpt-4o"),
    system_prompt=(
        "You are a web research assistant powered by ScrapeGraphAI v2. "
        "Pick the most specific tool for the job: scrape for a single page, "
        "extract for structured data, search for open-web questions, "
        "crawl for multi-page jobs, and create_monitor for recurring jobs."
    ),
)

response = await agent.run(
    "Research the latest blog posts on scrapegraphai.com and summarize them."
)
print(response)
```
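`FunctionAgent.run` is a coroutine, so the top-level `await` above only works in a notebook or another async REPL. In a plain script, wrap it in `asyncio.run`:

```python
import asyncio


async def main() -> None:
    response = await agent.run(
        "Research the latest blog posts on scrapegraphai.com and summarize them."
    )
    print(response)


asyncio.run(main())
```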
- **Tool selection** — pass only the tools the agent actually needs; a shorter tool list keeps prompts tighter and routing more accurate.
- **Schema design** — when calling `extract` or `search`, pass a concrete JSON schema (`YourSchema.model_json_schema()`) so the extractor has a clear target.
- **Format entries** — `scrape` accepts a list of format entries; combine `MarkdownFormatConfig`, `ScreenshotFormatConfig`, and `JsonFormatConfig` in one call to avoid multiple round-trips (see the multi-format sketch after this list).
- **Async crawls** — `sgai.crawl.start` returns immediately; always poll `sgai.crawl.get(id)` until `status in ("completed", "failed", "stopped")` (see the polling sketch after this list).
- **ApiResult** — branch on `result.status` instead of wrapping calls in try/except; the SDK never raises on API-level errors.
- **Hard pages** — stealth mode plus a `mode="js"` fetch config handles most anti-bot sites (see the Zillow recipe above).
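For the multi-format point above, a sketch of one `scrape` call that requests markdown, a screenshot, and structured JSON together; it assumes `ScreenshotFormatConfig` and `JsonFormatConfig` take no required arguments, matching how `MarkdownFormatConfig` is used in the recipes:

```python
from scrapegraph_py import (
    ScrapeGraphAI,
    JsonFormatConfig,
    MarkdownFormatConfig,
    ScreenshotFormatConfig,
)

sgai = ScrapeGraphAI()

# One round-trip returns all three representations of the page.
res = sgai.scrape(
    "https://scrapegraphai.com/",
    formats=[MarkdownFormatConfig(), ScreenshotFormatConfig(), JsonFormatConfig()],
)

if res.status == "success":
    print(res.data.results.keys())
```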
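And for the crawl-polling point, a minimal loop that starts a crawl and branches on `result.status` instead of catching exceptions. Where the crawl record exposes its status (`job.data.status` here) and the one-second sleep are illustrative assumptions:

```python
import time

from scrapegraph_py import ScrapeGraphAI, MarkdownFormatConfig

sgai = ScrapeGraphAI()

res = sgai.crawl.start(
    "https://scrapegraphai.com/",
    formats=[MarkdownFormatConfig()],
    max_pages=20,
)

if res.status != "success":  # API-level errors never raise; branch on status
    raise SystemExit(res.error)

crawl_id = res.data.id

# Poll until the crawl reaches a terminal state.
while True:
    job = sgai.crawl.get(crawl_id)
    if job.data.status in ("completed", "failed", "stopped"):
        break
    time.sleep(1)  # assumed backoff; tune for your workload

print(job.data.status)
```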