Scrape Service

Overview

The Scrape service provides direct access to raw HTML content from web pages, with optional JavaScript rendering support. This service is perfect for applications that need the complete HTML structure of a webpage, including dynamically generated content.
Try the Scrape service instantly in our interactive playground - no coding required!

Getting Started

Quick Start

from scrapegraph_py import Client
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

# Initialize the client
sgai_client = Client(api_key="your-api-key")

# Scrape request
response = sgai_client.htmlify(
    website_url="https://example.com",
    render_heavy_js=False  # Set to True for heavy JavaScript rendering
)

print("HTML Content:", response.html)
print("Request ID:", response.scrape_request_id)
print("Status:", response.status)

Parameters

Parameter        Type     Required  Description
apiKey           string   Yes       The ScrapeGraph API Key.
websiteUrl       string   Yes       The URL of the webpage to scrape.
render_heavy_js  boolean  No        Set to true for heavy JavaScript rendering. Default: false
Get your API key from the dashboard

Key Features

Raw HTML Access

Get complete HTML structure including all elements

JavaScript Rendering

Optional support for heavy JavaScript rendering

Fast Processing

Quick extraction for simple HTML content

Reliable Output

Consistent results across different websites

Use Cases

Web Development

  • Extract HTML templates
  • Analyze page structure
  • Test website rendering
  • Debug HTML issues

Data Analysis

  • Parse HTML content
  • Extract specific elements
  • Monitor website changes
  • Build web scrapers

Content Processing

  • Process dynamic content
  • Handle JavaScript-heavy sites
  • Extract embedded data
  • Analyze page performance

Want to learn more about our AI-powered scraping technology? Visit our main website to discover how we’re revolutionizing web data extraction.

JavaScript Rendering

The render_heavy_js parameter controls whether JavaScript is executed on the target page before the HTML is captured. The lists below summarize when to enable it, followed by a short sketch contrasting the two modes.

When to Use JavaScript Rendering

  • Single Page Applications (SPAs): React, Vue, Angular apps
  • Dynamic Content: Content loaded via AJAX/fetch
  • Interactive Elements: Dropdowns, modals, infinite scroll
  • Client-side Routing: Hash-based or history API routing

When to Skip JavaScript Rendering

  • Static HTML Pages: Traditional server-rendered content
  • Performance: Faster processing for simple pages
  • Cost Optimization: Lower API usage for basic scraping
  • Reliability: More predictable results for static content
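
As a minimal sketch reusing the htmlify call from the Quick Start, the only difference between the two modes is the flag itself (the SPA URL below is hypothetical, for illustration only):
from scrapegraph_py import Client

sgai_client = Client(api_key="your-api-key")

# Static page: skip JavaScript rendering for speed and lower cost
static_response = sgai_client.htmlify(
    website_url="https://example.com",
    render_heavy_js=False
)

# JavaScript-heavy SPA: enable rendering so client-side content is captured
# (https://app.example-spa.com is a hypothetical URL used for illustration)
spa_response = sgai_client.htmlify(
    website_url="https://app.example-spa.com",
    render_heavy_js=True
)

print("Static HTML length:", len(static_response.html))
print("Rendered HTML length:", len(spa_response.html))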

Advanced Usage

Async Support

For applications requiring asynchronous execution, the Scrape service provides async support:
from scrapegraph_py import AsyncClient
import asyncio

async def main():
    async with AsyncClient(api_key="your-api-key") as client:
        response = await client.htmlify(
            website_url="https://example.com",
            render_heavy_js=True
        )
        print(response)

# Run the async function
asyncio.run(main())

Concurrent Processing

Process multiple URLs concurrently for better performance:
import asyncio
from scrapegraph_py import AsyncClient
from scrapegraph_py.logger import sgai_logger

sgai_logger.set_logging(level="INFO")

async def main():
    # Initialize async client
    sgai_client = AsyncClient(api_key="your-api-key")

    # URLs to scrape
    urls = [
        "https://example.com",
        "https://scrapegraphai.com/",
        "https://github.com/ScrapeGraphAI/Scrapegraph-ai",
    ]

    tasks = [sgai_client.htmlify(website_url=url, render_heavy_js=False) for url in urls]

    # Execute requests concurrently
    responses = await asyncio.gather(*tasks, return_exceptions=True)

    # Process results
    for i, response in enumerate(responses):
        if isinstance(response, Exception):
            print(f"\nError for {urls[i]}: {response}")
        else:
            print(f"\nPage {i+1} HTML:")
            print(f"URL: {urls[i]}")
            print(f"HTML Length: {len(response['html'])} characters")

    await sgai_client.close()

if __name__ == "__main__":
    asyncio.run(main())
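
When scraping many URLs, an unbounded gather fires every request at once. A common refinement (a sketch, reusing the AsyncClient context-manager pattern shown above) is to cap in-flight requests with an asyncio.Semaphore:
import asyncio
from scrapegraph_py import AsyncClient

async def scrape_bounded(urls, max_concurrency=5):
    # Allow at most max_concurrency requests in flight at any time
    semaphore = asyncio.Semaphore(max_concurrency)

    async with AsyncClient(api_key="your-api-key") as client:
        async def fetch(url):
            async with semaphore:
                return await client.htmlify(website_url=url, render_heavy_js=False)

        return await asyncio.gather(*(fetch(url) for url in urls), return_exceptions=True)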

Best Practices

Performance Optimization

  1. Use render_heavy_js=False for static content
  2. Process multiple URLs concurrently
  3. Cache results when possible (see the sketch after this list)
  4. Monitor API usage and costs
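
A simple in-process cache, for example, avoids paying for a page you fetched moments ago. This is a sketch; the TTL and keying strategy are assumptions, not SDK features:
import time
from scrapegraph_py import Client

sgai_client = Client(api_key="your-api-key")
_cache = {}  # url -> (fetched_at, html)

def cached_htmlify(url, ttl_seconds=300):
    # Reuse a recent result instead of spending another API call
    entry = _cache.get(url)
    if entry and time.time() - entry[0] < ttl_seconds:
        return entry[1]
    response = sgai_client.htmlify(website_url=url, render_heavy_js=False)
    _cache[url] = (time.time(), response.html)
    return response.html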

Error Handling

  • Always check the status field
  • Handle network timeouts gracefully
  • Implement retry logic for failed requests
  • Log errors for debugging
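
Taken together, the bullets above might look like this in practice. This is a sketch: the retry policy and the "completed" status value are assumptions to adapt to the responses you actually observe:
import time
from scrapegraph_py import Client

sgai_client = Client(api_key="your-api-key")

def scrape_with_retries(url, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            response = sgai_client.htmlify(website_url=url, render_heavy_js=False)
            # "completed" is an assumed success value; check your API responses
            if response.status == "completed":
                return response.html
            print(f"Attempt {attempt}: unexpected status {response.status}")
        except Exception as exc:  # e.g. network timeouts or transient API errors
            print(f"Attempt {attempt} failed for {url}: {exc}")
        if attempt < max_attempts:
            time.sleep(2 ** attempt)  # exponential backoff before the next try
    raise RuntimeError(f"Scrape failed after {max_attempts} attempts: {url}")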

Content Processing

  • Validate HTML structure before parsing
  • Handle different character encodings
  • Extract only needed content sections
  • Clean up HTML for further processing
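
For instance, you can parse the returned HTML and keep only the sections you need. The sketch below uses BeautifulSoup, which is not part of the SDK (install it separately with pip install beautifulsoup4):
from bs4 import BeautifulSoup
from scrapegraph_py import Client

sgai_client = Client(api_key="your-api-key")
response = sgai_client.htmlify(
    website_url="https://example.com",
    render_heavy_js=False
)

# Parse the raw HTML and extract only the parts we care about
soup = BeautifulSoup(response.html, "html.parser")
title = soup.title.string if soup.title else None
links = [a.get("href") for a in soup.find_all("a", href=True)]

print("Title:", title)
print("Links found:", len(links))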

Example Projects

Check out our cookbook for real-world examples:
  • Web scraping automation tools
  • Content monitoring systems
  • HTML analysis applications
  • Dynamic content extractors

Ready to Start?

Sign up now and get your API key to begin scraping web content!