Overview
LlamaIndex is a data framework for building LLM-powered agents and RAG applications. This page shows how to wire scrapegraph-py ≥ 2.0.1 into LlamaIndex as a set of FunctionTools so your agents can scrape pages, extract structured data, search the web, run asynchronous crawls, and manage scheduled monitors.
Official LlamaIndex Documentation
Learn more about building agents and RAG pipelines with LlamaIndex
Which package? LlamaIndex also ships a pre-built tool spec at llama-index-tools-scrapegraphai, but it currently depends on scrapegraph-py<2 and targets the legacy v1 backend. New v2 API keys are rejected by that path. The recipes below use the v2 SDK directly: they work with the current dashboard and every v2 endpoint (scrape, extract, search, crawl, monitor).
Installation
Quick Start
Initialize the v2 client and expose a tool to any LlamaIndex agent.
Cookbook recipes
The following recipes are ported from the official scrapegraph-py cookbook notebooks and adapted to call the v2 extract endpoint, so they run with a current dashboard API key.
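The recipes share one shape: a plain Python function that calls the extract endpoint and hands the result back to the agent. A minimal sketch of that shape follows. It assumes only the call form `sgai.extract(prompt, url=...)` and the `status`, `data`, and `error` result fields listed in the reference section of this page. `make_extract_fn` and `StubClient` are hypothetical names, and the stub stands in for the real client so the block runs offline.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict


def make_extract_fn(sgai: Any) -> Callable[[str, str], Dict[str, Any]]:
    """Bind a client to a plain function that an agent framework can wrap as a tool."""

    def extract_from_url(url: str, prompt: str) -> Dict[str, Any]:
        """Extract structured data from a web page, guided by a natural-language prompt."""
        # Call shape taken from this page's reference table: extract(prompt, url=...).
        result = sgai.extract(prompt, url=url)
        # Pass the result fields through unchanged so the agent also sees failures.
        return {"status": result.status, "data": result.data, "error": result.error}

    return extract_from_url


class StubClient:
    """Hypothetical offline stand-in for the real v2 client."""

    def extract(self, prompt: str, url: str) -> SimpleNamespace:
        # "success" is a placeholder value, not a documented status string.
        return SimpleNamespace(
            status="success", data={"prompt": prompt, "url": url}, error=None
        )


extract_from_url = make_extract_fn(StubClient())
print(extract_from_url("https://example.com", "List the founders"))
```

With both libraries installed, the same function could be handed to LlamaIndex through `FunctionTool.from_defaults(fn=extract_from_url)`, which by default derives the tool name and description from the function name and docstring.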
1. Extract company info
Pull founders, pricing plans, and social links off a company homepage. Based on cookbook/company-info/.
2. Extract GitHub trending repos
Pull a ranked list of trending repositories. Based on cookbook/github-trending/.
3. Extract a news feed
Pull headlines from a news section. Based on cookbook/wired-news/.
4. Extract real-estate listings
Pull house listings with price, address, and tags. Based on cookbook/homes-forsale/.
5. Research agent with ReActAgent
Combine scrape + extract into a LlamaIndex ReActAgent so the LLM decides which tool to call per step. Based on cookbook/research-agent/.
Usage Reference
Scrape tool
Extract tool
Search tool
Crawl tool
Crawls are asynchronous: poll sgai.crawl.get(id) until status in ("completed", "failed", "stopped").
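The polling loop described above can be sketched as follows. Where the crawl status sits on the response object is not spelled out here, so the helper takes a caller-supplied `get_status` function instead of touching the SDK directly. `poll_until_terminal`, the interval, the timeout, and the `"running"` placeholder status are illustrative choices, not SDK features.

```python
import time
from typing import Callable

# Terminal states named on this page.
TERMINAL_STATUSES = ("completed", "failed", "stopped")


def poll_until_terminal(
    get_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status repeatedly until it returns a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl still {status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Demo with a scripted status sequence instead of a live crawl.
statuses = iter(["running", "running", "completed"])
print(poll_until_terminal(lambda: next(statuses), interval_s=0))
```

In real use, `get_status` would wrap `sgai.crawl.get(id)` and return whichever field carries the crawl state.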
Monitor tool
Configuration Options
The v2 ScrapeGraphAI client accepts:
| Parameter | Type | Default | Description |
|---|---|---|---|
| api_key | str \| None | None | Falls back to SGAI_API_KEY. |
| base_url | str | https://v2-api.scrapegraphai.com/api | Override via SGAI_API_URL. |
| timeout | int | 120 | Request timeout in seconds. Override via SGAI_TIMEOUT. |

| SDK call | Endpoint | First positional arg |
|---|---|---|
| sgai.scrape(url, ...) | Scrape | url |
| sgai.extract(prompt, url=..., ...) | Extract | prompt |
| sgai.search(query, ...) | Search | query |
| sgai.crawl.start(url, ...), .get/.stop/.resume/.delete(id) | Crawl | url / id |
| sgai.monitor.create(url, interval, ...), .list/.get/.update/.pause/.resume/.delete/.activity(...) | Monitor | url, interval |

Each call returns an ApiResult[T] with status, data, error, and elapsed_ms, so tools can surface errors without exceptions.
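One way a tool might turn such a result into text for the agent is sketched below. The four field names come from this page. The `StubResult` class, the `describe_result` helper, and the `"success"` status literal are assumptions made for illustration, so check the SDK for the actual status values before relying on the comparison.

```python
from dataclasses import dataclass
from typing import Any, Optional

OK_STATUS = "success"  # assumed value; the real status strings are not listed here


@dataclass
class StubResult:
    """Hypothetical stand-in mirroring the four ApiResult fields named above."""

    status: str
    data: Any = None
    error: Optional[str] = None
    elapsed_ms: int = 0


def describe_result(result: Any) -> str:
    """Render a result as a short message, branching on status instead of try/except."""
    if result.status == OK_STATUS:
        return f"ok ({result.elapsed_ms} ms): {result.data}"
    return f"failed ({result.elapsed_ms} ms): {result.error}"


print(describe_result(StubResult("success", data={"title": "Example"}, elapsed_ms=42)))
print(describe_result(StubResult("error", error="invalid API key", elapsed_ms=7)))
```

Returning the failure as text lets the agent read the error and decide whether to retry or pick another tool.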
Advanced Usage
Combining every endpoint in one agent
Hand the full tool list to an agent and let it pick the right tool per step.
Async client
Every resource has an async twin via AsyncScrapeGraphAI:
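A hedged sketch of how an async client could fan out over several URLs with `asyncio.gather`. It assumes the async client exposes an awaitable `scrape(url)` mirroring the sync call. `scrape_many`, `max_concurrency`, and `StubAsyncClient` are hypothetical names, and the stub replaces the real client so the example runs offline.

```python
import asyncio
from typing import Any, List


async def scrape_many(client: Any, urls: List[str], max_concurrency: int = 5) -> List[Any]:
    """Scrape several URLs concurrently; results come back in input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def scrape_one(url: str) -> Any:
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            return await client.scrape(url)

    return await asyncio.gather(*(scrape_one(url) for url in urls))


class StubAsyncClient:
    """Hypothetical offline stand-in for the async client."""

    async def scrape(self, url: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"scraped:{url}"


print(asyncio.run(scrape_many(StubAsyncClient(), ["https://a.example", "https://b.example"])))
```

Capping concurrency is a design choice to stay within rate limits; drop the semaphore if the workload is small.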
Custom agent configuration
Plug the tools into any LlamaIndex agent, whether ReActAgent, workflow-based, or third-party:
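Because agent classes differ across LlamaIndex versions, the sketch below stops at building the tool list and leaves agent construction to the reader. To stay runnable without either library, the tool factory is injected: in real use it would be `FunctionTool.from_defaults` from `llama_index.core.tools`, which accepts `fn` and `name` keyword arguments. `build_tools`, `plain_factory`, and `StubClient` are hypothetical names, and the wrapped calls use only the first positional arguments listed in the reference table.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional


def build_tools(
    sgai: Any,
    make_tool: Callable[..., Any],
    only: Optional[Iterable[str]] = None,
) -> List[Any]:
    """Wrap selected client calls as tools using the supplied factory."""

    def scrape_page(url: str) -> Any:
        """Fetch a single web page."""
        return sgai.scrape(url)

    def extract_data(url: str, prompt: str) -> Any:
        """Extract structured data from a page, guided by a prompt."""
        return sgai.extract(prompt, url=url)

    def search_web(query: str) -> Any:
        """Search the web for a query."""
        return sgai.search(query)

    registry: Dict[str, Callable[..., Any]] = {
        "scrape": scrape_page,
        "extract": extract_data,
        "search": search_web,
    }
    # Default to every tool; pass `only` to keep the agent's tool list short.
    wanted = list(registry) if only is None else list(only)
    return [make_tool(fn=registry[name], name=name) for name in wanted]


class StubClient:
    """Hypothetical offline stand-in for the v2 client."""

    def scrape(self, url: str) -> str:
        return f"page:{url}"

    def extract(self, prompt: str, url: str) -> str:
        return f"extract:{prompt}@{url}"

    def search(self, query: str) -> str:
        return f"results:{query}"


def plain_factory(fn: Callable[..., Any], name: str) -> Any:
    """Stand-in for FunctionTool.from_defaults: returns a (name, fn) pair."""
    return (name, fn)


tools = build_tools(StubClient(), plain_factory, only=["scrape", "search"])
print([name for name, _ in tools])
```

The resulting list can then be passed to whichever agent constructor your LlamaIndex version provides.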
Features
Scrape
Fetch pages as markdown, HTML, screenshots, JSON, links, images, summary, or branding
Extract
Structured extraction with a prompt and a JSON schema
Search
AI-powered web search with optional structured output
Crawl
Asynchronous multi-page crawls with start / stop / resume controls
Monitor
Cron-scheduled jobs with webhook notifications on change
Typed Requests
Pydantic request models and ApiResult[T] responses, no surprises
Async-Ready
AsyncScrapeGraphAI mirrors every resource for parallel pipelines
Agent-Ready
Every endpoint exposed as a drop-in LlamaIndex FunctionTool
Best Practices
- Tool selection: pass only the tools the agent actually needs; a shorter tool list keeps prompts tighter and routing more accurate.
- Schema design: when calling extract or search, pass a concrete JSON schema (YourSchema.model_json_schema()) so the extractor has a clear target.
- Format entries: scrape accepts a list of format entries; combine MarkdownFormatConfig, ScreenshotFormatConfig, and JsonFormatConfig in one call to avoid multiple round-trips.
- Async crawls: sgai.crawl.start returns immediately; always poll sgai.crawl.get(id) until status in ("completed", "failed", "stopped").
- ApiResult: branch on result.status instead of wrapping calls in try/except; the SDK never raises on API-level errors.
- Hard pages: stealth mode + mode="js" fetch config handles most anti-bot sites (see the Zillow recipe above).
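The schema-design advice above can be illustrated with Pydantic v2, whose `model_json_schema()` method produces the JSON schema dictionary. The `Listing` and `ListingPage` models are hypothetical and borrow their fields from the real-estate recipe (price, address, tags). How the schema is passed to extract or search is left to the SDK reference and is not shown here.

```python
from typing import List

from pydantic import BaseModel, Field


class Listing(BaseModel):
    """One house listing (hypothetical model for illustration)."""

    price: str = Field(description="Listed price exactly as shown on the page")
    address: str = Field(description="Street address of the property")
    tags: List[str] = Field(default_factory=list, description="Labels such as 'new' or 'open house'")


class ListingPage(BaseModel):
    """Top-level extraction target: every listing found on the page."""

    listings: List[Listing]


# A plain dict that can be serialized and handed to an extraction call.
schema = ListingPage.model_json_schema()
print(sorted(schema["properties"]))
print(schema["$defs"]["Listing"]["required"])
```

Field descriptions travel with the schema, which gives the extractor a concrete target for each value.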
Support
LlamaIndex Discord
Join the LlamaIndex community for support and discussions
scrapegraph-py cookbook
Browse the full set of notebook examples
ScrapeGraphAI Discord
Get help with ScrapeGraphAI features
Documentation
Explore the full API reference