
Integrate Crawl4AI with n8n: Complete Workflow Automation Guide

Learn how to integrate Crawl4AI with n8n workflows using a FastAPI microservice architecture. Includes complete setup, deployment, and production best practices.

⭐ Get the Docker Compose Template on GitHub

Ready-to-use deployment files included • Fork and customize for your needs

Why Combine Crawl4AI with n8n?

n8n is a powerful workflow automation tool that excels at connecting services, orchestrating multi-step processes, and handling data transformations. However, its built-in HTTP Request node has limitations when dealing with modern, JavaScript-heavy websites.

By wrapping Crawl4AI in a FastAPI microservice, you get the best of both worlds:

  • n8n handles workflow orchestration, scheduling, error handling, and integrations with 400+ services
  • Crawl4AI handles complex web scraping—JavaScript rendering, CSS extraction, LLM-based content parsing
  • FastAPI provides a clean REST API interface between n8n and Crawl4AI
ℹ️ Why official documentation isn't enough

The Crawl4AI documentation shows you how to scrape websites programmatically, and n8n's docs explain how to make HTTP requests. But neither explains how to build a production-ready microservice that bridges the two—you need to design API endpoints, handle authentication, manage async operations, implement proper error handling, and deploy with Docker. This guide fills that gap.

Understanding the Architecture

The integration follows this microservice pattern:

text
n8n Workflow → HTTP Request Node → FastAPI Microservice → Crawl4AI → Target Website

n8n Workflow:          orchestration, scheduling, error handling
FastAPI Microservice:  REST API, authentication, rate limiting
Crawl4AI:              async crawler, Playwright, content extraction
Target Website:        HTTP request, JavaScript rendering

n8n never directly touches Crawl4AI or Playwright. Instead, it makes HTTP requests to your FastAPI service, which handles the heavy lifting. This separation keeps n8n lightweight while giving you full control over the scraping logic.

Building the FastAPI Microservice

Create a project structure:

bash
mkdir crawl4ai-n8n-service
cd crawl4ai-n8n-service
mkdir app
touch app/__init__.py app/main.py app/models.py app/crawler.py app/requirements.txt

Step 1: Define Data Models

Create app/models.py:

python
from pydantic import BaseModel, HttpUrl, Field
from typing import Optional, List

class ScrapeRequest(BaseModel):
    url: HttpUrl
    selector: Optional[str] = Field(None, description="CSS selector for targeted extraction")
    wait_for_selector: Optional[str] = Field(None, description="CSS selector to wait for before scraping")
    word_count_threshold: int = Field(10, description="Minimum word count for content blocks")
    bypass_cache: bool = Field(False, description="Force fresh scrape")

class ScrapeResponse(BaseModel):
    success: bool
    url: str
    markdown: Optional[str] = None
    extracted_content: Optional[str] = None
    error: Optional[str] = None
    metadata: dict = {}

class HealthResponse(BaseModel):
    status: str
    version: str
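
If you want to sanity-check the models before wiring them into FastAPI, a quick script shows how validation behaves (this assumes Pydantic v2, as pinned in the requirements below):

python
from pydantic import ValidationError

from app.models import ScrapeRequest

# A valid request only needs a URL; everything else falls back to defaults
req = ScrapeRequest(url="https://example.com", selector=".article-content")
print(req.model_dump(mode="json"))
# {'url': 'https://example.com/', 'selector': '.article-content', 'wait_for_selector': None,
#  'word_count_threshold': 10, 'bypass_cache': False}

# Invalid URLs are rejected before they ever reach the crawler
try:
    ScrapeRequest(url="not-a-url")
except ValidationError as exc:
    print(exc.errors()[0]["msg"])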

Step 2: Create Crawler Wrapper

Create app/crawler.py:

python
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from typing import Optional
import logging

logger = logging.getLogger(__name__)

class CrawlerWrapper:
    def __init__(self):
        self.crawler = None

    async def initialize(self):
        """Initialize the async crawler (call once on startup)"""
        self.crawler = AsyncWebCrawler(
            verbose=True,
            headless=True
        )
        await self.crawler.start()

    async def scrape(
        self,
        url: str,
        selector: Optional[str] = None,
        wait_for_selector: Optional[str] = None,
        word_count_threshold: int = 10,
        bypass_cache: bool = False
    ) -> dict:
        """Perform scraping with configurable options"""
        try:
            if not self.crawler:
                await self.initialize()

            result = await self.crawler.arun(
                url=str(url),
                word_count_threshold=word_count_threshold,
                css_selector=selector,
                wait_for=wait_for_selector,  # crawl4ai's parameter is wait_for, not wait_for_selector
                bypass_cache=bypass_cache,
                exclude_external_links=False,
                exclude_social_media_links=True
            )

            if result.success:
                return {
                    "success": True,
                    "url": url,
                    "markdown": result.markdown,
                    "extracted_content": result.extracted_content if hasattr(result, 'extracted_content') else None,
                    "metadata": {
                        "word_count": result.extracted_content_word_count if hasattr(result, 'extracted_content_word_count') else 0,
                        "links": len(result.links.get('internal', [])) + len(result.links.get('external', [])) if hasattr(result, 'links') else 0,
                        "screenshot": hasattr(result, 'screenshot')
                    }
                }
            else:
                error_msg = result.error_message if hasattr(result, 'error_message') else 'Unknown error'
                logger.error(f"Scraping failed for {url}: {error_msg}")
                return {
                    "success": False,
                    "url": url,
                    "error": error_msg,
                    "metadata": {}
                }

        except Exception as e:
            logger.exception(f"Exception during scraping: {str(e)}")
            return {
                "success": False,
                "url": url,
                "error": str(e),
                "metadata": {}
            }

# Global crawler instance
crawler_wrapper = CrawlerWrapper()
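
Before putting FastAPI in front of it, you can exercise the wrapper on its own. A minimal smoke-test sketch (save it as, say, test_crawler.py in the project root; the file name is just an example):

python
import asyncio

from app.crawler import crawler_wrapper

async def main():
    await crawler_wrapper.initialize()  # one browser for the whole session
    try:
        result = await crawler_wrapper.scrape("https://example.com")
        print("success:", result["success"])
        print((result.get("markdown") or "")[:200])
    finally:
        if crawler_wrapper.crawler:
            await crawler_wrapper.crawler.close()

if __name__ == "__main__":
    asyncio.run(main())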

Step 3: Create FastAPI Application

Create app/main.py. Secure the endpoint with an API key so that only your n8n instance can reach the scraping service.

python
from fastapi import FastAPI, HTTPException, Security, status
from fastapi.security import APIKeyHeader
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
import os
import logging
from contextlib import asynccontextmanager

from .models import ScrapeRequest, ScrapeResponse, HealthResponse
from .crawler import crawler_wrapper

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# API key authentication (optional but recommended)
API_KEY = os.getenv("API_KEY", "your-secret-api-key")
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

async def verify_api_key(api_key: str = Security(api_key_header)):
    """Verify the X-API-Key header if a real key is configured"""
    if API_KEY and API_KEY != "your-secret-api-key":
        if api_key != API_KEY:
            raise HTTPException(
                status_code=status.HTTP_401_UNAUTHORIZED,
                detail="Invalid API key"
            )
    return True

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Manage application lifecycle"""
    # Startup
    logger.info("Initializing Crawl4AI crawler...")
    await crawler_wrapper.initialize()
    logger.info("Crawler initialized successfully")
    yield
    # Shutdown
    logger.info("Shutting down crawler...")
    if crawler_wrapper.crawler:
        await crawler_wrapper.crawler.close()
    logger.info("Shutdown complete")

# Create FastAPI app
app = FastAPI(
    title="Crawl4AI n8n Service",
    description="FastAPI microservice for n8n workflows to use Crawl4AI",
    version="1.0.0",
    lifespan=lifespan
)

# Enable CORS for n8n
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # In production, specify n8n's domain
    allow_credentials=True,
    allow_methods=["POST"],
    allow_headers=["*"],
)

@app.get("/health", response_model=HealthResponse)
async def health_check():
    """Health check endpoint for monitoring"""
    return HealthResponse(status="healthy", version="1.0.0")

@app.post("/scrape", response_model=ScrapeResponse)
async def scrape_url(request: ScrapeRequest, auth: bool = Security(verify_api_key)):
    """
    Main scraping endpoint for n8n workflows.

    Expects a JSON body with:
    - url (required): The URL to scrape
    - selector (optional): CSS selector for targeted extraction
    - wait_for_selector (optional): CSS selector to wait for before scraping
    - word_count_threshold (optional): Minimum word count (default: 10)
    - bypass_cache (optional): Force fresh scrape (default: false)
    """
    try:
        logger.info(f"Scraping request for URL: {request.url}")

        result = await crawler_wrapper.scrape(
            url=str(request.url),
            selector=request.selector,
            wait_for_selector=request.wait_for_selector,
            word_count_threshold=request.word_count_threshold,
            bypass_cache=request.bypass_cache
        )

        return ScrapeResponse(**result)

    except Exception as e:
        logger.exception(f"Unexpected error: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Internal server error: {str(e)}"
        )

@app.exception_handler(Exception)
async def global_exception_handler(request, exc):
    """Global exception handler"""
    logger.error(f"Unhandled exception: {str(exc)}")
    return JSONResponse(
        status_code=500,
        content={"detail": "Internal server error"}
    )

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Step 4: Dependencies

Create app/requirements.txt:

text
fastapi==0.109.0
uvicorn[standard]==0.27.0
crawl4ai>=0.3.0  # AsyncWebCrawler requires the async (Playwright-based) releases
pydantic==2.5.3
python-multipart==0.0.6

Deploy Crawl4AI with Docker for n8n Automation

Create a Dockerfile in the project root:

dockerfile
FROM python:3.11-slim

# Install system dependencies (curl is used by the docker-compose health check)
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements and install Python dependencies first
# (the playwright CLI only becomes available after crawl4ai is installed)
COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers
RUN playwright install --with-deps chromium

# Copy application code
COPY app/ ./app/

# Expose port
EXPOSE 8000

# Health check (standard library only, so no extra dependency is needed)
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

# Run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Create docker-compose.yml for easier deployment:

yaml
version: '3.8'

services:
  crawl4ai-service:
    build: .
    container_name: crawl4ai-n8n
    ports:
      - "8000:8000"
    environment:
      - API_KEY=${API_KEY:-your-secret-api-key}
    restart: unless-stopped
    volumes:
      - ./logs:/app/logs
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5s

Build and run:

bash
# Build the image
docker-compose build

# Start the service
docker-compose up -d

# Check logs
docker-compose logs -f

# Stop the service
docker-compose down
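
With the container up, a quick smoke test confirms the service answers before you point n8n at it. A sketch using the requests library, assuming the service listens on localhost:8000 and you exported the same API_KEY the container uses (the X-API-Key header matches the APIKeyHeader scheme in main.py):

python
import os

import requests

BASE_URL = "http://localhost:8000"
HEADERS = {"X-API-Key": os.getenv("API_KEY", "your-secret-api-key")}

# The health endpoint should report {"status": "healthy", "version": "1.0.0"}
print(requests.get(f"{BASE_URL}/health", timeout=10).json())

# Scrape a page the same way an n8n HTTP Request node will
payload = {"url": "https://example.com", "word_count_threshold": 10}
resp = requests.post(f"{BASE_URL}/scrape", json=payload, headers=HEADERS, timeout=120)
resp.raise_for_status()
data = resp.json()
print(data["success"], len(data.get("markdown") or ""))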

Configuring n8n Workflows

Once your FastAPI service is running, configure n8n to use it:

1. Add HTTP Request Node

Drag an HTTP Request node into your n8n workflow and configure it:

  • Method: POST
  • URL: http://localhost:8000/scrape (or your deployed URL)
  • Authentication: a Header Auth credential that sends X-API-Key: <your key> (if you configured one)
  • Body: JSON with the scraping parameters

2. Example Request Body

json
{
  "url": "https://example.com",
  "selector": ".article-content",
  "word_count_threshold": 10,
  "bypass_cache": true
}

3. Handle the Response

The HTTP Request node will receive a JSON response with:

json
{
  "success": true,
  "url": "https://example.com",
  "markdown": "# Scraped content...\n\nFull markdown...",
  "extracted_content": null,
  "error": null,
  "metadata": {
    "word_count": 1234,
    "links": 45,
    "screenshot": true
  }
}
ℹ️ Workflow example

A typical workflow might be: Schedule Trigger → HTTP Request (scrape) → IF (check success field) → Code Node (process markdown) → Send to Slack/Database/Notion
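
The step after the IF node usually just reshapes the scrape response for its destination. That logic is plain data wrangling; here it is sketched as ordinary Python (the Slack-style payload shape is only an example) so you can port it into whichever Code node or downstream service you use:

python
def to_notification(scrape_response: dict, max_chars: int = 500) -> dict:
    """Turn a /scrape response into a short message payload for Slack, a database, etc."""
    if not scrape_response.get("success"):
        return {"text": f"Scrape failed for {scrape_response.get('url')}: {scrape_response.get('error')}"}

    markdown = scrape_response.get("markdown") or ""
    snippet = markdown[:max_chars] + ("..." if len(markdown) > max_chars else "")
    word_count = scrape_response.get("metadata", {}).get("word_count", 0)
    return {"text": f"New content from {scrape_response['url']} ({word_count} words):\n{snippet}"}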

Common Mistakes and How to Avoid Them

Mistake 1: Not Handling Async Lifecycle

Forgetting to initialize and clean up the crawler properly

Problem: Creating a new crawler instance for every request is slow and resource-intensive.

❌ Wrong - new instance per request

python
@app.post("/scrape")
async def scrape_url(request: ScrapeRequest):
    async with AsyncWebCrawler() as crawler:  # Spawns new browser every time
        result = await crawler.arun(url=request.url)
    return result

✅ Correct - reuse crawler instance

python
crawler_wrapper = CrawlerWrapper()

@asynccontextmanager
async def lifespan(app: FastAPI):
    await crawler_wrapper.initialize()  # Once on startup
    yield
    await crawler_wrapper.crawler.close()  # Cleanup on shutdown

Mistake 2: No Rate Limiting

Allowing unlimited requests that overload the server

Problem: n8n workflows can trigger rapid-fire requests that crash your service or get you blocked.

✅ Add rate limiting with slowapi

python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/scrape")
@limiter.limit("10/minute")  # 10 requests per minute per IP
async def scrape_url(payload: ScrapeRequest, request: Request):
    # slowapi needs the raw `request: Request` argument in the signature
    # ... scraping logic (use `payload` for the scrape parameters)

Mistake 3: Not Validating URLs

Accepting any URL without validation

Problem: Malicious actors can exploit your service for SSRF attacks against internal networks, or use it as an open proxy to fetch arbitrary sites.

✅ Validate URLs before scraping

python
from urllib.parse import urlparse
import ipaddress

def is_safe_url(url: str) -> bool:
    """Check if URL is safe to scrape"""
    parsed = urlparse(url)

    # Reject non-HTTP(S) schemes
    if parsed.scheme not in ['http', 'https']:
        return False

    # Reject missing or obviously local hostnames
    if not parsed.hostname or parsed.hostname == 'localhost':
        return False

    # Reject private and loopback IP literals
    try:
        ip = ipaddress.ip_address(parsed.hostname)
        if ip.is_private or ip.is_loopback:
            return False
    except ValueError:
        pass  # Not an IP literal; note this does not resolve DNS names

    return True

# In your endpoint
if not is_safe_url(str(request.url)):
    raise HTTPException(status_code=400, detail="Unsafe URL")

Mistake 4: No Timeout Handling

Requests hanging indefinitely on slow sites

Problem: Some sites take minutes to load, blocking your workers and exhausting resources.

✅ Add timeouts to crawler

python
result = await crawler.arun(
    url=url,
    page_timeout=30000  # page load timeout in milliseconds (30 seconds)
)
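
For a hard ceiling that holds no matter what the crawler does internally, you can also wrap the call at the service level. A sketch using asyncio.wait_for (the 45-second budget is an arbitrary assumption):

python
import asyncio

async def scrape_with_hard_timeout(crawler, url: str, hard_timeout: float = 45.0) -> dict:
    """Cancel the scrape outright if it exceeds the overall time budget."""
    try:
        result = await asyncio.wait_for(crawler.arun(url=url), timeout=hard_timeout)
        return {"success": result.success, "url": url, "markdown": result.markdown,
                "error": None, "metadata": {}}
    except asyncio.TimeoutError:
        return {"success": False, "url": url, "markdown": None,
                "error": f"Hard timeout after {hard_timeout}s", "metadata": {}}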

Mistake 5: Missing Error Context

Returning generic errors that make debugging impossible

Problem: When scraping fails, n8n workflows can't tell if it's a timeout, blocking, or parsing error.

❌ Wrong - generic error

python
return {"success": False, "error": "Scraping failed"}

✅ Correct - detailed error

python
return {
    "success": False,
    "error": f"Timeout after 30s waiting for selector '.content'",
    "error_type": "timeout",
    "url": url
}
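
To keep those detailed errors consistent, it helps to classify exceptions into a small set of error types that n8n IF nodes can branch on. A sketch; the categories and keyword matching are illustrative rather than exhaustive (if you adopt it, add the error_type field to the ScrapeResponse model as well):

python
import asyncio

def classify_error(exc: Exception) -> str:
    """Map an exception to a coarse error_type field that workflows can branch on."""
    if isinstance(exc, asyncio.TimeoutError):
        return "timeout"
    message = str(exc).lower()
    if "net::" in message or "dns" in message or "connection" in message:
        return "network"
    if "403" in message or "blocked" in message or "captcha" in message:
        return "blocked"
    return "unknown"

# In the except branch of CrawlerWrapper.scrape():
#     return {"success": False, "url": url, "error": str(e),
#             "error_type": classify_error(e), "metadata": {}}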

When This Integration Makes Sense (And When It Doesn't)

Good Fit ✅

  • Complex workflows: You need to scrape, transform, and send data to multiple services (Slack, Notion, Google Sheets)
  • Scheduled tasks: Daily/weekly scraping jobs that require minimal code
  • Dynamic sites: SPAs, React apps that need JavaScript rendering
  • Non-technical users: Teams who prefer drag-and-drop workflow builders over writing Python scripts
  • Integration-heavy: You need to connect scraped data to 5+ other services

Not a Good Fit ❌

  • High-volume scraping: Millions of pages per day—use dedicated scraping infrastructure
  • Simple scripts: One-off tasks where a 10-line Python script suffices
  • Real-time requirements: Sub-second response times needed
  • Static sites: Basic HTML sites that don't need Playwright overhead
  • Cost-sensitive: Running n8n + Docker + FastAPI has higher resource costs than a simple script

Alternative Approaches

  • For simple workflows: Use n8n's built-in HTTP Request node, optionally paired with a hosted Puppeteer or Cheerio-based scraping service
  • For developers: Write Python scripts with Airflow or Prefect for orchestration
  • For enterprise: Consider Apify, which provides hosted web scraping with n8n integration out of the box

Production Best Practices

1. Authentication & Security

  • Use strong API keys (32+ characters) and store them in environment variables, not code
  • Restrict CORS to your n8n domain instead of allowing all origins
  • Add HTTPS in production (use a reverse proxy like nginx or Traefik)
  • Implement request signing for additional security
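
Request signing can be as simple as an HMAC-SHA256 over the raw body. A minimal sketch; the X-Signature header name and SIGNING_SECRET variable are assumptions for illustration, and the n8n side would compute the same digest in a Code node before the HTTP Request node:

python
import hashlib
import hmac
import os

from fastapi import HTTPException, Request, status

SIGNING_SECRET = os.getenv("SIGNING_SECRET", "change-me")

async def verify_signature(request: Request):
    """Reject requests whose X-Signature header doesn't match an HMAC of the raw body."""
    body = await request.body()
    expected = hmac.new(SIGNING_SECRET.encode(), body, hashlib.sha256).hexdigest()
    provided = request.headers.get("X-Signature", "")
    if not hmac.compare_digest(expected, provided):
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED,
                            detail="Invalid request signature")

# Attach it to the endpoint alongside the API key check:
#     @app.post("/scrape", dependencies=[Depends(verify_signature)])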

2. Monitoring & Logging

Add structured logging with correlation IDs:

python
import uuid
from fastapi import Request

@app.middleware("http")
async def add_correlation_id(request: Request, call_next):
    correlation_id = str(uuid.uuid4())
    request.state.correlation_id = correlation_id
    response = await call_next(request)
    response.headers["X-Correlation-ID"] = correlation_id
    return response

@app.post("/scrape")
async def scrape_url(request: ScrapeRequest, http_request: Request):
    correlation_id = http_request.state.correlation_id
    logger.info(f"[{correlation_id}] Scraping {request.url}")

3. Scaling Considerations

  • Horizontal scaling: Run multiple containers behind a load balancer (each needs its own Crawl4AI instance)
  • Resource limits: Set Docker memory limits (2GB per container is typical)
  • Queue system: For heavy workloads, add Redis Queue or Celery between n8n and your scraper
  • Connection pooling: Reuse browser instances when possible, but recycle them periodically to prevent memory leaks
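
One way to implement that recycling is to count requests and restart the browser after a threshold. A simplified sketch that builds on the CrawlerWrapper from this guide; the threshold of 200 is arbitrary, and a production version would also coordinate with in-flight requests:

python
import asyncio

from app.crawler import CrawlerWrapper

class RecyclingCrawlerWrapper(CrawlerWrapper):
    """Restart the underlying browser every N scrapes to keep memory in check."""

    def __init__(self, recycle_after: int = 200):
        super().__init__()
        self.recycle_after = recycle_after
        self._request_count = 0
        self._lock = asyncio.Lock()

    async def scrape(self, url, **kwargs) -> dict:
        async with self._lock:
            self._request_count += 1
            if self.crawler and self._request_count % self.recycle_after == 0:
                await self.crawler.close()  # drop the old browser process
                self.crawler = None         # the next scrape() re-initializes it

        return await super().scrape(url, **kwargs)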

4. Cost Optimization

  • Use serverless or spot instances for scheduled workflows (you only pay during execution)
  • Cache aggressively—if content hasn't changed, return cached results instead of re-scraping (see the sketch after this list)
  • Batch requests when possible—scrape 10 URLs in one workflow execution instead of 10 separate runs
  • Monitor costs per 1000 pages and set alerts for unexpected spikes
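
Aggressive caching does not require extra infrastructure at first. A minimal in-process TTL cache keyed by URL (the one-hour TTL is an arbitrary choice, and a shared Redis cache is the natural upgrade once you scale horizontally):

python
import time

from app.crawler import crawler_wrapper

CACHE_TTL_SECONDS = 3600  # one hour; tune per source
_cache: dict[str, tuple[float, dict]] = {}

async def scrape_with_cache(url: str, **kwargs) -> dict:
    """Return a recent cached result for a URL instead of re-scraping it."""
    now = time.time()
    cached = _cache.get(url)
    if cached and now - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]

    result = await crawler_wrapper.scrape(url=url, **kwargs)
    if result.get("success"):
        _cache[url] = (now, result)
    return result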

Summary

Integrating Crawl4AI with n8n through a FastAPI microservice gives you a powerful, flexible automation platform. The key implementation points are:

  1. Microservice architecture: FastAPI provides a clean REST interface between n8n and Crawl4AI
  2. Lifecycle management: Initialize the crawler once at startup, cleanup on shutdown
  3. Security: Validate URLs, use API keys, enable HTTPS in production
  4. Reliability: Add rate limiting, timeouts, detailed error messages, and logging
  5. Scalability: Design for horizontal scaling with load balancers and queue systems

This integration shines for complex, multi-step workflows that need to connect scraped data to multiple services. For simpler use cases or high-volume scraping, consider lightweight scripts or dedicated scraping platforms.

Need This for Your AI Coding Agent Instead?

If you're using Cursor, Claude Desktop, or other AI assistants, check out our MCP guide. Set up web scraping in your AI workflow in under 5 minutes.

View Crawl4AI MCP Guide