
Integrate Crawl4AI with n8n: Complete Workflow Automation Guide

Learn how to integrate Crawl4AI with n8n workflows using a FastAPI microservice architecture. Includes complete setup, deployment, and production best practices.

⭐ Get the Docker Compose Template on GitHub

Ready-to-use deployment files included • Fork and customize for your needs

Why Combine Crawl4AI with n8n?

n8n is a powerful workflow automation tool that excels at connecting services, orchestrating multi-step processes, and handling data transformations. However, its built-in HTTP Request node has limitations when dealing with modern, JavaScript-heavy websites.

By wrapping Crawl4AI in a FastAPI microservice, you get the best of both worlds:

  • n8n handles workflow orchestration, scheduling, error handling, and integrations with 400+ services
  • Crawl4AI handles complex web scraping—JavaScript rendering, CSS extraction, LLM-based content parsing
  • FastAPI provides a clean REST API interface between n8n and Crawl4AI
ℹ️ Why official documentation isn't enough

The Crawl4AI documentation shows you how to scrape websites programmatically, and n8n's docs explain how to make HTTP requests. But neither explains how to build a production-ready microservice that bridges the two—you need to design API endpoints, handle authentication, manage async operations, implement proper error handling, and deploy with Docker. This guide fills that gap.

Understanding the Architecture

The integration follows this microservice pattern:

text
n8n Workflow → HTTP Request Node → FastAPI Microservice → Crawl4AI → Target Website

n8n Workflow:          orchestration, scheduling, error handling
FastAPI Microservice:  REST API, authentication, rate limiting
Crawl4AI:              async crawler, Playwright, content extraction
Target Website:        HTTP request, JavaScript rendering

n8n never directly touches Crawl4AI or Playwright. Instead, it makes HTTP requests to your FastAPI service, which handles the heavy lifting. This separation keeps n8n lightweight while giving you full control over the scraping logic.

Building the FastAPI Microservice

Create a project structure:

bash
mkdir crawl4ai-n8n-service
cd crawl4ai-n8n-service
mkdir app
touch app/__init__.py app/main.py app/models.py app/crawler.py app/requirements.txt

Step 1: Define Data Models

Create app/models.py:

python
from pydantic import BaseModel, HttpUrl, Field
from typing import Optional, List

class ScrapeRequest(BaseModel):
    url: HttpUrl
    selector: Optional[str] = Field(None, description="CSS selector for targeted extraction")
    wait_for_selector: Optional[str] = Field(None, description="CSS selector to wait for before scraping")
    word_count_threshold: int = Field(10, description="Minimum word count for content blocks")
    bypass_cache: bool = Field(False, description="Force fresh scrape")

class ScrapeResponse(BaseModel):
    success: bool
    url: str
    markdown: Optional[str] = None
    extracted_content: Optional[str] = None
    error: Optional[str] = None
    metadata: dict = {}

class HealthResponse(BaseModel):
    status: str
    version: str
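
If you want to sanity-check the models before wiring them into FastAPI, a quick script shows how validation behaves (this assumes Pydantic v2, as pinned in the requirements below):

python
from pydantic import ValidationError

from app.models import ScrapeRequest

# A valid request only needs a URL; everything else falls back to defaults
req = ScrapeRequest(url="https://example.com", selector=".article-content")
print(req.model_dump(mode="json"))
# {'url': 'https://example.com/', 'selector': '.article-content', 'wait_for_selector': None,
#  'word_count_threshold': 10, 'bypass_cache': False}

# Invalid URLs are rejected before they ever reach the crawler
try:
    ScrapeRequest(url="not-a-url")
except ValidationError as exc:
    print(exc.errors()[0]["msg"])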

Step 2: Create Crawler Wrapper

Create app/crawler.py:

python
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from typing import Optional
import logging

logger = logging.getLogger(__name__)

class CrawlerWrapper:
    def __init__(self):
        self.crawler = None

    async def initialize(self):
        """Initialize the async crawler (call once on startup)"""
        self.crawler = AsyncWebCrawler(
            verbose=True,
            headless=True
        )
        await self.crawler.start()

    async def scrape(
        self,
        url: str,
        selector: Optional[str] = None,
        wait_for_selector: Optional[str] = None,
        word_count_threshold: int = 10,
        bypass_cache: bool = False
    ) -> dict:
        """Perform scraping with configurable options"""
        try:
            if not self.crawler:
                await self.initialize()

            result = await self.crawler.arun(
                url=str(url),
                word_count_threshold=word_count_threshold,
                css_selector=selector,
                wait_for=wait_for_selector,  # crawl4ai's parameter is wait_for, not wait_for_selector
                bypass_cache=bypass_cache,
                exclude_external_links=False,
                exclude_social_media_links=True
            )

            if result.success:
                return {
                    "success": True,
                    "url": url,
                    "markdown": result.markdown,
                    "extracted_content": result.extracted_content if hasattr(result, 'extracted_content') else None,
                    "metadata": {
                        "word_count": result.extracted_content_word_count if hasattr(result, 'extracted_content_word_count') else 0,
                        "links": len(result.links.get('internal', [])) + len(result.links.get('external', [])) if hasattr(result, 'links') else 0,
                        "screenshot": hasattr(result, 'screenshot')
                    }
                }
            else:
                error_msg = result.error_message if hasattr(result, 'error_message') else 'Unknown error'
                logger.error(f"Scraping failed for {url}: {error_msg}")
                return {
                    "success": False,
                    "url": url,
                    "error": error_msg,
                    "metadata": {}
                }

        except Exception as e:
            logger.exception(f"Exception during scraping: {str(e)}")
            return {
                "success": False,
                "url": url,
                "error": str(e),
                "metadata": {}
            }

# Global crawler instance
crawler_wrapper = CrawlerWrapper()
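
Before putting FastAPI in front of it, you can exercise the wrapper on its own. A minimal smoke-test sketch (save it as, say, test_crawler.py in the project root; the file name is just an example):

python
import asyncio

from app.crawler import crawler_wrapper

async def main():
    await crawler_wrapper.initialize()  # one browser for the whole session
    try:
        result = await crawler_wrapper.scrape("https://example.com")
        print("success:", result["success"])
        print((result.get("markdown") or "")[:200])
    finally:
        if crawler_wrapper.crawler:
            await crawler_wrapper.crawler.close()

if __name__ == "__main__":
    asyncio.run(main())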

Step 3: Create FastAPI Application

Create app/main.py. Secure the endpoint with an API key so that only your n8n instance can reach the scraping service.

python
from fastapi import FastAPI, HTTPException, Security, status
from fastapi.security import APIKeyHeader
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
import os
import logging
from contextlib import asynccontextmanager

from .models import ScrapeRequest, ScrapeResponse, HealthResponse
from .crawler import crawler_wrapper

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# API key authentication (optional but recommended)
API_KEY = os.getenv("API_KEY", "your-secret-api-key")
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

async def verify_api_key(api_key: str = Security(api_key_header)):
    """Verify the X-API-Key header if a real key is configured"""
    if API_KEY and API_KEY != "your-secret-api-key":
        if api_key != API_KEY:
            raise HTTPException(
                status_code=status.HTTP_401_UNAUTHORIZED,
                detail="Invalid API key"
            )
    return True

@asynccontextmanager
async def lifespan(app: FastAPI):
    """Manage application lifecycle"""
    # Startup
    logger.info("Initializing Crawl4AI crawler...")
    await crawler_wrapper.initialize()
    logger.info("Crawler initialized successfully")
    yield
    # Shutdown
    logger.info("Shutting down crawler...")
    if crawler_wrapper.crawler:
        await crawler_wrapper.crawler.close()
    logger.info("Shutdown complete")

# Create FastAPI app
app = FastAPI(
    title="Crawl4AI n8n Service",
    description="FastAPI microservice for n8n workflows to use Crawl4AI",
    version="1.0.0",
    lifespan=lifespan
)

# Enable CORS for n8n
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # In production, specify n8n's domain
    allow_credentials=True,
    allow_methods=["POST"],
    allow_headers=["*"],
)

@app.get("/health", response_model=HealthResponse)
async def health_check():
    """Health check endpoint for monitoring"""
    return HealthResponse(status="healthy", version="1.0.0")

@app.post("/scrape", response_model=ScrapeResponse)
async def scrape_url(request: ScrapeRequest, auth: bool = Security(verify_api_key)):
    """
    Main scraping endpoint for n8n workflows.

    Expects a JSON body with:
    - url (required): The URL to scrape
    - selector (optional): CSS selector for targeted extraction
    - wait_for_selector (optional): CSS selector to wait for before scraping
    - word_count_threshold (optional): Minimum word count (default: 10)
    - bypass_cache (optional): Force fresh scrape (default: false)
    """
    try:
        logger.info(f"Scraping request for URL: {request.url}")

        result = await crawler_wrapper.scrape(
            url=str(request.url),
            selector=request.selector,
            wait_for_selector=request.wait_for_selector,
            word_count_threshold=request.word_count_threshold,
            bypass_cache=request.bypass_cache
        )

        return ScrapeResponse(**result)

    except Exception as e:
        logger.exception(f"Unexpected error: {str(e)}")
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Internal server error: {str(e)}"
        )

@app.exception_handler(Exception)
async def global_exception_handler(request, exc):
    """Global exception handler"""
    logger.error(f"Unhandled exception: {str(exc)}")
    return JSONResponse(
        status_code=500,
        content={"detail": "Internal server error"}
    )

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Step 4: Dependencies

Create app/requirements.txt:

text
fastapi==0.109.0
uvicorn[standard]==0.27.0
crawl4ai>=0.3.0  # AsyncWebCrawler requires the async (Playwright-based) releases
pydantic==2.5.3
python-multipart==0.0.6

Deploy Crawl4AI with Docker for n8n Automation

Create a Dockerfile in the project root:

dockerfile
FROM python:3.11-slim

# Install system dependencies (curl is used by the docker-compose health check)
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements and install Python dependencies first
# (the playwright CLI only becomes available after crawl4ai is installed)
COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers
RUN playwright install --with-deps chromium

# Copy application code
COPY app/ ./app/

# Expose port
EXPOSE 8000

# Health check (standard library only, so no extra dependency is needed)
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

# Run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Create docker-compose.yml for easier deployment:

yaml
version: '3.8'

services:
  crawl4ai-service:
    build: .
    container_name: crawl4ai-n8n
    ports:
      - "8000:8000"
    environment:
      - API_KEY=${API_KEY:-your-secret-api-key}
    restart: unless-stopped
    volumes:
      - ./logs:/app/logs
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 5s

Build and run:

bash
# Build the image
docker-compose build

# Start the service
docker-compose up -d

# Check logs
docker-compose logs -f

# Stop the service
docker-compose down
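
With the container up, a quick smoke test confirms the service answers before you point n8n at it. A sketch using the requests library, assuming the service listens on localhost:8000 and you exported the same API_KEY the container uses (the X-API-Key header matches the APIKeyHeader scheme in main.py):

python
import os

import requests

BASE_URL = "http://localhost:8000"
HEADERS = {"X-API-Key": os.getenv("API_KEY", "your-secret-api-key")}

# The health endpoint should report {"status": "healthy", "version": "1.0.0"}
print(requests.get(f"{BASE_URL}/health", timeout=10).json())

# Scrape a page the same way an n8n HTTP Request node will
payload = {"url": "https://example.com", "word_count_threshold": 10}
resp = requests.post(f"{BASE_URL}/scrape", json=payload, headers=HEADERS, timeout=120)
resp.raise_for_status()
data = resp.json()
print(data["success"], len(data.get("markdown") or ""))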

Configuring n8n Workflows

Once your FastAPI service is running, configure n8n to use it:

1. Add HTTP Request Node

Drag an HTTP Request node into your n8n workflow and configure it:

  • Method: POST
  • URL: http://localhost:8000/scrape (or your deployed URL)
  • Authentication: a Header Auth credential that sends X-API-Key: <your key> (if you configured one)
  • Body: JSON with the scraping parameters

2. Example Request Body

json
{
  "url": "https://example.com",
  "selector": ".article-content",
  "word_count_threshold": 10,
  "bypass_cache": true
}

3. Handle the Response

The HTTP Request node will receive a JSON response with:

json
{
  "success": true,
  "url": "https://example.com",
  "markdown": "# Scraped content...\n\nFull markdown...",
  "extracted_content": null,
  "error": null,
  "metadata": {
    "word_count": 1234,
    "links": 45,
    "screenshot": true
  }
}
ℹ️ Workflow example

A typical workflow might be: Schedule Trigger → HTTP Request (scrape) → IF (check success field) → Code Node (process markdown) → Send to Slack/Database/Notion
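
The step after the IF node usually just reshapes the scrape response for its destination. That logic is plain data wrangling; here it is sketched as ordinary Python (the Slack-style payload shape is only an example) so you can port it into whichever Code node or downstream service you use:

python
def to_notification(scrape_response: dict, max_chars: int = 500) -> dict:
    """Turn a /scrape response into a short message payload for Slack, a database, etc."""
    if not scrape_response.get("success"):
        return {"text": f"Scrape failed for {scrape_response.get('url')}: {scrape_response.get('error')}"}

    markdown = scrape_response.get("markdown") or ""
    snippet = markdown[:max_chars] + ("..." if len(markdown) > max_chars else "")
    word_count = scrape_response.get("metadata", {}).get("word_count", 0)
    return {"text": f"New content from {scrape_response['url']} ({word_count} words):\n{snippet}"}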

Common Mistakes and How to Avoid Them

Mistake 1: Not Handling Async Lifecycle

Forgetting to initialize and clean up the crawler properly

Problem: Creating a new crawler instance for every request is slow and resource-intensive.

❌ Wrong - new instance per request

python
@app.post("/scrape")
async def scrape_url(request: ScrapeRequest):
    async with AsyncWebCrawler() as crawler:  # Spawns new browser every time
        result = await crawler.arun(url=request.url)
    return result

✅ Correct - reuse crawler instance

python
crawler_wrapper = CrawlerWrapper()

@asynccontextmanager
async def lifespan(app: FastAPI):
    await crawler_wrapper.initialize()  # Once on startup
    yield
    await crawler_wrapper.crawler.close()  # Cleanup on shutdown

Mistake 2: No Rate Limiting

Allowing unlimited requests that overload the server

Problem: n8n workflows can trigger rapid-fire requests that crash your service or get you blocked.

✅ Add rate limiting with slowapi

python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/scrape")
@limiter.limit("10/minute")  # 10 requests per minute per IP
async def scrape_url(payload: ScrapeRequest, request: Request):
    # slowapi needs the raw `request: Request` argument in the signature
    # ... scraping logic (use `payload` for the scrape parameters)

Mistake 3: Not Validating URLs

Accepting any URL without validation

Problem: Malicious actors can exploit your service for SSRF attacks against internal networks, or use it as an open proxy to fetch arbitrary sites.

✅ Validate URLs before scraping

python
from urllib.parse import urlparse
import ipaddress

def is_safe_url(url: str) -> bool:
    """Check if URL is safe to scrape"""
    parsed = urlparse(url)

    # Reject non-HTTP(S) schemes
    if parsed.scheme not in ['http', 'https']:
        return False

    # Reject missing or obviously local hostnames
    if not parsed.hostname or parsed.hostname == 'localhost':
        return False

    # Reject private and loopback IP literals
    try:
        ip = ipaddress.ip_address(parsed.hostname)
        if ip.is_private or ip.is_loopback:
            return False
    except ValueError:
        pass  # Not an IP literal; note this does not resolve DNS names

    return True

# In your endpoint
if not is_safe_url(str(request.url)):
    raise HTTPException(status_code=400, detail="Unsafe URL")

Mistake 4: No Timeout Handling

Requests hanging indefinitely on slow sites

Problem: Some sites take minutes to load, blocking your workers and exhausting resources.

✅ Add timeouts to crawler

python
result = await crawler.arun(
    url=url,
    page_timeout=30000  # page load timeout in milliseconds (30 seconds)
)
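
For a hard ceiling that holds no matter what the crawler does internally, you can also wrap the call at the service level. A sketch using asyncio.wait_for (the 45-second budget is an arbitrary assumption):

python
import asyncio

async def scrape_with_hard_timeout(crawler, url: str, hard_timeout: float = 45.0) -> dict:
    """Cancel the scrape outright if it exceeds the overall time budget."""
    try:
        result = await asyncio.wait_for(crawler.arun(url=url), timeout=hard_timeout)
        return {"success": result.success, "url": url, "markdown": result.markdown,
                "error": None, "metadata": {}}
    except asyncio.TimeoutError:
        return {"success": False, "url": url, "markdown": None,
                "error": f"Hard timeout after {hard_timeout}s", "metadata": {}}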

Mistake 5: Missing Error Context

Returning generic errors that make debugging impossible

Problem: When scraping fails, n8n workflows can't tell if it's a timeout, blocking, or parsing error.

❌ Wrong - generic error

python
return {"success": False, "error": "Scraping failed"}

✅ Correct - detailed error

python
return {
    "success": False,
    "error": f"Timeout after 30s waiting for selector '.content'",
    "error_type": "timeout",
    "url": url
}
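
To keep those detailed errors consistent, it helps to classify exceptions into a small set of error types that n8n IF nodes can branch on. A sketch; the categories and keyword matching are illustrative rather than exhaustive (if you adopt it, add the error_type field to the ScrapeResponse model as well):

python
import asyncio

def classify_error(exc: Exception) -> str:
    """Map an exception to a coarse error_type field that workflows can branch on."""
    if isinstance(exc, asyncio.TimeoutError):
        return "timeout"
    message = str(exc).lower()
    if "net::" in message or "dns" in message or "connection" in message:
        return "network"
    if "403" in message or "blocked" in message or "captcha" in message:
        return "blocked"
    return "unknown"

# In the except branch of CrawlerWrapper.scrape():
#     return {"success": False, "url": url, "error": str(e),
#             "error_type": classify_error(e), "metadata": {}}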

When This Integration Makes Sense (And When It Doesn't)

Good Fit ✅

  • Complex workflows: You need to scrape, transform, and send data to multiple services (Slack, Notion, Google Sheets)
  • Scheduled tasks: Daily/weekly scraping jobs that require minimal code
  • Dynamic sites: SPAs, React apps that need JavaScript rendering
  • Non-technical users: Teams who prefer drag-and-drop workflow builders over writing Python scripts
  • Integration-heavy: You need to connect scraped data to 5+ other services

Not a Good Fit ❌

  • High-volume scraping: Millions of pages per day—use dedicated scraping infrastructure
  • Simple scripts: One-off tasks where a 10-line Python script suffices
  • Real-time requirements: Sub-second response times needed
  • Static sites: Basic HTML sites that don't need Playwright overhead
  • Cost-sensitive: Running n8n + Docker + FastAPI has higher resource costs than a simple script

Alternative Approaches

  • For simple workflows: Use n8n's built-in HTTP Request node, optionally paired with a hosted Puppeteer or Cheerio-based scraping service
  • For developers: Write Python scripts with Airflow or Prefect for orchestration
  • For enterprise: Consider Apify, which provides hosted web scraping with n8n integration out of the box

Production Best Practices

1. Authentication & Security

  • Use strong API keys (32+ characters) and store them in environment variables, not code
  • Restrict CORS to your n8n domain instead of allowing all origins
  • Add HTTPS in production (use a reverse proxy like nginx or Traefik)
  • Implement request signing for additional security
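
Request signing can be as simple as an HMAC-SHA256 over the raw body. A minimal sketch; the X-Signature header name and SIGNING_SECRET variable are assumptions for illustration, and the n8n side would compute the same digest in a Code node before the HTTP Request node:

python
import hashlib
import hmac
import os

from fastapi import HTTPException, Request, status

SIGNING_SECRET = os.getenv("SIGNING_SECRET", "change-me")

async def verify_signature(request: Request):
    """Reject requests whose X-Signature header doesn't match an HMAC of the raw body."""
    body = await request.body()
    expected = hmac.new(SIGNING_SECRET.encode(), body, hashlib.sha256).hexdigest()
    provided = request.headers.get("X-Signature", "")
    if not hmac.compare_digest(expected, provided):
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED,
                            detail="Invalid request signature")

# Attach it to the endpoint alongside the API key check:
#     @app.post("/scrape", dependencies=[Depends(verify_signature)])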

2. Monitoring & Logging

Add structured logging with correlation IDs:

python
import uuid
from fastapi import Request

@app.middleware("http")
async def add_correlation_id(request: Request, call_next):
    correlation_id = str(uuid.uuid4())
    request.state.correlation_id = correlation_id
    response = await call_next(request)
    response.headers["X-Correlation-ID"] = correlation_id
    return response

@app.post("/scrape")
async def scrape_url(request: ScrapeRequest, http_request: Request):
    correlation_id = http_request.state.correlation_id
    logger.info(f"[{correlation_id}] Scraping {request.url}")

3. Scaling Considerations

  • Horizontal scaling: Run multiple containers behind a load balancer (each needs its own Crawl4AI instance)
  • Resource limits: Set Docker memory limits (2GB per container is typical)
  • Queue system: For heavy workloads, add Redis Queue or Celery between n8n and your scraper
  • Connection pooling: Reuse browser instances when possible, but recycle them periodically to prevent memory leaks
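
One way to implement that recycling is to count requests and restart the browser after a threshold. A simplified sketch that builds on the CrawlerWrapper from this guide; the threshold of 200 is arbitrary, and a production version would also coordinate with in-flight requests:

python
import asyncio

from app.crawler import CrawlerWrapper

class RecyclingCrawlerWrapper(CrawlerWrapper):
    """Restart the underlying browser every N scrapes to keep memory in check."""

    def __init__(self, recycle_after: int = 200):
        super().__init__()
        self.recycle_after = recycle_after
        self._request_count = 0
        self._lock = asyncio.Lock()

    async def scrape(self, url, **kwargs) -> dict:
        async with self._lock:
            self._request_count += 1
            if self.crawler and self._request_count % self.recycle_after == 0:
                await self.crawler.close()  # drop the old browser process
                self.crawler = None         # the next scrape() re-initializes it

        return await super().scrape(url, **kwargs)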

4. Cost Optimization

  • Use serverless or spot instances for scheduled workflows (you only pay during execution)
  • Cache aggressively—if content hasn't changed, return cached results instead of re-scraping (see the sketch after this list)
  • Batch requests when possible—scrape 10 URLs in one workflow execution instead of 10 separate runs
  • Monitor costs per 1000 pages and set alerts for unexpected spikes
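
Aggressive caching does not require extra infrastructure at first. A minimal in-process TTL cache keyed by URL (the one-hour TTL is an arbitrary choice, and a shared Redis cache is the natural upgrade once you scale horizontally):

python
import time

from app.crawler import crawler_wrapper

CACHE_TTL_SECONDS = 3600  # one hour; tune per source
_cache: dict[str, tuple[float, dict]] = {}

async def scrape_with_cache(url: str, **kwargs) -> dict:
    """Return a recent cached result for a URL instead of re-scraping it."""
    now = time.time()
    cached = _cache.get(url)
    if cached and now - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]

    result = await crawler_wrapper.scrape(url=url, **kwargs)
    if result.get("success"):
        _cache[url] = (now, result)
    return result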

Summary

Integrating Crawl4AI with n8n through a FastAPI microservice gives you a powerful, flexible automation platform. The key implementation points are:

  1. Microservice architecture: FastAPI provides a clean REST interface between n8n and Crawl4AI
  2. Lifecycle management: Initialize the crawler once at startup, cleanup on shutdown
  3. Security: Validate URLs, use API keys, enable HTTPS in production
  4. Reliability: Add rate limiting, timeouts, detailed error messages, and logging
  5. Scalability: Design for horizontal scaling with load balancers and queue systems

This integration shines for complex, multi-step workflows that need to connect scraped data to multiple services. For simpler use cases or high-volume scraping, consider lightweight scripts or dedicated scraping platforms.

Need This for Your AI Coding Agent Instead?

If you're using Cursor, Claude Desktop, or other AI assistants, check out our MCP guide. Set up web scraping in your AI workflow in under 5 minutes.

View Crawl4AI MCP Guide