AI-powered website crawling and multi-page extraction
Header | Description |
---|---|
SGAI-APIKEY | Your API authentication key |
Content-Type | application/json |
Parameter | Type | Required | Description |
---|---|---|---|
url | string | Yes | The starting URL for the crawl. |
prompt | string | No* | Instructions for what to extract (*required when extraction_mode=true). |
extraction_mode | bool | No | When false, enables markdown conversion mode (default: true).
depth | int | No | How many link levels to follow (default: 1). |
max_pages | int | No | Maximum number of pages to crawl (default: 20). |
schema | object | No | Pydantic or Zod schema for structured output. |
rules | object | No | Crawl rules (see below). |
sitemap | bool | No | Use sitemap.xml for discovery (default: false). |
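As an illustration, a crawl request using these parameters might look like the sketch below. The SGAI-APIKEY and Content-Type headers and the body fields come from the tables above; the endpoint URL https://api.scrapegraphai.com/v1/crawl and all example values are assumptions, not taken from this page.

```python
import requests

API_KEY = "sgai-..."  # your SGAI-APIKEY value
CRAWL_ENDPOINT = "https://api.scrapegraphai.com/v1/crawl"  # assumed endpoint path

payload = {
    "url": "https://example.com",                    # starting URL for the crawl
    "prompt": "Extract the product name and price",  # required because extraction_mode is true
    "extraction_mode": True,                         # default; set False for markdown conversion
    "depth": 2,                                      # follow links two levels deep
    "max_pages": 20,                                 # cap on pages crawled
    "sitemap": False,                                # set True to discover pages via sitemap.xml
}

response = requests.post(
    CRAWL_ENDPOINT,
    headers={"SGAI-APIKEY": API_KEY, "Content-Type": "application/json"},
    json=payload,
)
response.raise_for_status()
print(response.json())  # the crawl job is typically identified by a task ID in this response
```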
Markdown Conversion Response
The rules object:
Field | Type | Default | Description |
---|---|---|---|
exclude | list | [] | List of URL substrings to skip |
same_domain | bool | True | Restrict crawl to the same domain |
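For example, a rules object that keeps the crawl on the starting domain and skips URLs containing /logout could be written as the sketch below; the exclude values are illustrative only.

```python
# Field names match the rules table above; values are illustrative.
rules = {
    "exclude": ["/logout", "/cart"],  # skip any URL containing these substrings
    "same_domain": True,              # default: do not follow links off the starting domain
}

# Merged into the crawl request body from the previous sketch:
# payload["rules"] = rules
```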
Example Response
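A hypothetical sketch of the shape such a response can take, limited to the three fields described below; the nested keys (for example markdown inside each page object) and all values are assumptions for illustration.

```python
# Illustrative only: top-level keys follow the field list below,
# nested keys and values are assumed.
example_response = {
    "llm_result": {                    # structured extraction driven by your prompt/schema
        "products": [{"name": "Widget", "price": "$9.99"}],
    },
    "crawled_urls": [                  # every URL the crawler visited
        "https://example.com",
        "https://example.com/products",
    ],
    "pages": [                         # per-page objects with url and markdown content
        {"url": "https://example.com", "markdown": "# Example Domain\n..."},
    ],
}
```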
- llm_result: Structured extraction based on your prompt/schema
- crawled_urls: List of all URLs visited
- pages: List of objects with url and extracted markdown content

To retrieve the status or result of a crawl job, provide:

Parameter | Type | Required | Description |
---|---|---|---|
apiKey | string | Yes | The ScrapeGraph API Key. |
taskId | string | Yes | The crawl job task ID. |
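A sketch of polling for a crawl job's result with these two values over plain HTTP; the GET path https://api.scrapegraphai.com/v1/crawl/{taskId} and the status field are assumptions, not confirmed by this page.

```python
import time
import requests

API_KEY = "sgai-..."      # the ScrapeGraph API key
TASK_ID = "3fa85f64-..."  # task ID returned when the crawl was started

# Assumed result endpoint; adjust to the path documented for your account.
result_url = f"https://api.scrapegraphai.com/v1/crawl/{TASK_ID}"

while True:
    resp = requests.get(result_url, headers={"SGAI-APIKEY": API_KEY})
    resp.raise_for_status()
    data = resp.json()
    # "status" and its values are assumed names used only for this sketch.
    if data.get("status") not in ("pending", "processing"):
        break
    time.sleep(5)

print(data.get("llm_result"))    # structured extraction
print(data.get("crawled_urls"))  # URLs visited during the crawl
```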
Either url or website_html must be provided.

Header | Description |
---|---|
SGAI-APIKEY | Your API authentication key |
Content-Type | application/json |
Field | Type | Required | Description |
---|---|---|---|
url | string | Yes* | Starting URL (*or website_html required) |
website_html | string | No | Raw HTML content (max 2MB) |
prompt | string | Yes | Extraction instructions |
schema | object | No | Output schema |
headers | object | No | Custom headers |
number_of_scrolls | int | No | Infinite scroll per page |
depth | int | No | Crawl depth |
max_pages | int | No | Max pages to crawl |
rules | object | No | Crawl rules |
sitemap | bool | No | Use sitemap.xml |
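A sketch of a request built from this table, using website_html instead of url (one of the two is required). The endpoint path and the plain-dict schema format are assumptions; the SDKs accept Pydantic or Zod models for the same purpose.

```python
import requests

API_KEY = "sgai-..."
ENDPOINT = "https://api.scrapegraphai.com/v1/crawl"  # assumed endpoint path

raw_html = "<html><body><h1>Acme Widget</h1><p>Price: $9.99</p></body></html>"

payload = {
    "website_html": raw_html,           # used instead of url; max 2MB of raw HTML
    "prompt": "Extract the product name and price",
    "schema": {                         # assumed wire format: a JSON-schema-like object
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "string"},
        },
    },
}

resp = requests.post(
    ENDPOINT,
    headers={"SGAI-APIKEY": API_KEY, "Content-Type": "application/json"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```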
- Use extraction_mode: false when you only need markdown conversion (see the sketch below)
- Set max_pages and depth limits to keep crawls bounded
- Use rules to avoid unwanted pages (like /logout)
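Putting those tips together, a bounded markdown-conversion crawl might be configured like this sketch; the endpoint path is assumed, as in the earlier examples.

```python
import requests

API_KEY = "sgai-..."
ENDPOINT = "https://api.scrapegraphai.com/v1/crawl"  # assumed endpoint path

payload = {
    "url": "https://example.com/docs",
    "extraction_mode": False,   # markdown conversion mode, so no prompt is needed
    "depth": 2,                 # keep link-following shallow
    "max_pages": 15,            # hard cap on crawled pages
    "rules": {"exclude": ["/logout"], "same_domain": True},
}

resp = requests.post(
    ENDPOINT,
    headers={"SGAI-APIKEY": API_KEY, "Content-Type": "application/json"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # typically returns a task ID; poll it as shown above for crawled_urls and pages
```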