Integrate Crawl4AI with n8n: Complete Workflow Automation Guide
Learn how to integrate Crawl4AI with n8n workflows using a FastAPI microservice architecture. Includes complete setup, deployment, and production best practices.
Why Combine Crawl4AI with n8n?
n8n is a powerful workflow automation tool that excels at connecting services, orchestrating multi-step processes, and handling data transformations. However, its built-in HTTP Request node has limitations when dealing with modern, JavaScript-heavy websites.
By wrapping Crawl4AI in a FastAPI microservice, you get the best of both worlds:
- n8n handles workflow orchestration, scheduling, error handling, and integrations with 400+ services
- Crawl4AI handles complex web scraping—JavaScript rendering, CSS extraction, LLM-based content parsing
- FastAPI provides a clean REST API interface between n8n and Crawl4AI
Why official documentation isn't enough
The Crawl4AI documentation shows you how to scrape websites programmatically, and n8n's docs explain how to make HTTP requests. But neither explains how to build a production-ready microservice that bridges the two—you need to design API endpoints, handle authentication, manage async operations, implement proper error handling, and deploy with Docker. This guide fills that gap.
Understanding the Architecture
The integration follows this microservice pattern:
n8n Workflow → HTTP Request Node → FastAPI Microservice → Crawl4AI → Target Website
      ↓                      ↓                        ↓                      ↓
Orchestration           REST API                 Async crawler          HTTP request
Scheduling              Authentication           Playwright             JavaScript
Error handling          Rate limiting            Content extraction     rendering

n8n never directly touches Crawl4AI or Playwright. Instead, it makes HTTP requests to your FastAPI service, which handles the heavy lifting. This separation keeps n8n lightweight while giving you full control over the scraping logic.
Building the FastAPI Microservice
Create a project structure:
mkdir crawl4ai-n8n-service
cd crawl4ai-n8n-service
mkdir app
touch app/main.py app/models.py app/crawler.py app/requirements.txt
Step 1: Define Data Models
Create app/models.py:
from pydantic import BaseModel, HttpUrl, Field
from typing import Optional, List
class ScrapeRequest(BaseModel):
url: HttpUrl
selector: Optional[str] = Field(None, description="CSS selector for targeted extraction")
    wait_for_selector: Optional[str] = Field(None, description="CSS selector to wait for before scraping")
word_count_threshold: int = Field(10, description="Minimum word count for content blocks")
bypass_cache: bool = Field(False, description="Force fresh scrape")
class ScrapeResponse(BaseModel):
success: bool
url: str
markdown: Optional[str] = None
extracted_content: Optional[str] = None
error: Optional[str] = None
metadata: dict = {}
class HealthResponse(BaseModel):
status: str
    version: str
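A quick way to sanity-check these models is to instantiate them directly; Pydantic validates the URL before any scraping happens. A minimal sketch (the field values are illustrative):

from pydantic import ValidationError
from app.models import ScrapeRequest

# A well-formed request: HttpUrl accepts the URL and the optional fields keep their defaults
req = ScrapeRequest(url="https://example.com", selector=".article-content")
print(req.model_dump())

# A malformed URL is rejected by Pydantic before it ever reaches the crawler
try:
    ScrapeRequest(url="not-a-url")
except ValidationError as exc:
    print(exc.errors()[0]["msg"])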
Step 2: Create Crawler Wrapper
Create app/crawler.py:
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from typing import Optional
import logging
logger = logging.getLogger(__name__)
class CrawlerWrapper:
def __init__(self):
self.crawler = None
async def initialize(self):
"""Initialize the async crawler (call once on startup)"""
self.crawler = AsyncWebCrawler(
verbose=True,
headless=True
)
await self.crawler.start()
async def scrape(
self,
url: str,
selector: Optional[str] = None,
wait_for_selector: Optional[str] = None,
word_count_threshold: int = 10,
bypass_cache: bool = False
) -> dict:
"""Perform scraping with configurable options"""
try:
if not self.crawler:
await self.initialize()
result = await self.crawler.arun(
url=str(url),
word_count_threshold=word_count_threshold,
css_selector=selector,
wait_for_selector=wait_for_selector,
bypass_cache=bypass_cache,
exclude_external_links=False,
exclude_social_media_links=True
)
if result.success:
return {
"success": True,
"url": url,
"markdown": result.markdown,
"extracted_content": result.extracted_content if hasattr(result, 'extracted_content') else None,
"metadata": {
"word_count": result.extracted_content_word_count if hasattr(result, 'extracted_content_word_count') else 0,
"links": len(result.links.get('internal', [])) + len(result.links.get('external', [])) if hasattr(result, 'links') else 0,
"screenshot": hasattr(result, 'screenshot')
}
}
else:
error_msg = result.error_message if hasattr(result, 'error_message') else 'Unknown error'
logger.error(f"Scraping failed for {url}: {error_msg}")
return {
"success": False,
"url": url,
"error": error_msg,
"metadata": {}
}
except Exception as e:
logger.exception(f"Exception during scraping: {str(e)}")
return {
"success": False,
"url": url,
"error": str(e),
"metadata": {}
}
# Global crawler instance
crawler_wrapper = CrawlerWrapper()
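You can exercise the wrapper on its own before wiring it into FastAPI. A minimal sketch, assuming Playwright's Chromium is installed locally:

import asyncio

from app.crawler import crawler_wrapper

async def main():
    # scrape() initializes the shared crawler lazily on first use
    result = await crawler_wrapper.scrape(url="https://example.com")
    print(result["success"], len(result.get("markdown") or ""))
    # Clean up the shared browser instance when done
    if crawler_wrapper.crawler:
        await crawler_wrapper.crawler.close()

asyncio.run(main())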
Step 3: Create FastAPI Application
Create app/main.py. Secure the endpoint with an API key so that only your n8n instance can use the scraping service.
from fastapi import FastAPI, HTTPException, Security, status
from fastapi.security import APIKeyHeader
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
import os
import logging
from contextlib import asynccontextmanager
from .models import ScrapeRequest, ScrapeResponse, HealthResponse
from .crawler import crawler_wrapper
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# API Key authentication (optional but recommended)
API_KEY = os.getenv("API_KEY", "your-secret-api-key")
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)  # clients send the key in the X-API-Key header

async def verify_api_key(api_key: str = Security(api_key_header)):
"""Verify API key if one is set"""
if API_KEY and API_KEY != "your-secret-api-key":
if api_key != API_KEY:
raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Invalid API key"
)
return True
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Manage application lifecycle"""
# Startup
logger.info("Initializing Crawl4AI crawler...")
await crawler_wrapper.initialize()
logger.info("Crawler initialized successfully")
yield
# Shutdown
logger.info("Shutting down crawler...")
if crawler_wrapper.crawler:
await crawler_wrapper.crawler.close()
logger.info("Shutdown complete")
# Create FastAPI app
app = FastAPI(
title="Crawl4AI n8n Service",
description="FastAPI microservice for n8n workflows to use Crawl4AI",
version="1.0.0",
lifespan=lifespan
)
# Enable CORS for n8n
app.add_middleware(
CORSMiddleware,
allow_origins=["*"], # In production, specify n8n's domain
allow_credentials=True,
allow_methods=["POST"],
allow_headers=["*"],
)
@app.get("/health", response_model=HealthResponse)
async def health_check():
"""Health check endpoint for monitoring"""
return HealthResponse(status="healthy", version="1.0.0")
@app.post("/scrape", response_model=ScrapeResponse)
async def scrape_url(request: ScrapeRequest, auth: bool = Security(verify_api_key)):
"""
Main scraping endpoint for n8n workflows.
Expects a JSON body with:
- url (required): The URL to scrape
- selector (optional): CSS selector for targeted extraction
- wait_for_selector (optional): CSS selector to wait for before scraping
- word_count_threshold (optional): Minimum word count (default: 10)
- bypass_cache (optional): Force fresh scrape (default: false)
"""
try:
logger.info(f"Scraping request for URL: {request.url}")
result = await crawler_wrapper.scrape(
url=str(request.url),
selector=request.selector,
wait_for_selector=request.wait_for_selector,
word_count_threshold=request.word_count_threshold,
bypass_cache=request.bypass_cache
)
return ScrapeResponse(**result)
except Exception as e:
logger.exception(f"Unexpected error: {str(e)}")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail=f"Internal server error: {str(e)}"
)
@app.exception_handler(Exception)
async def global_exception_handler(request, exc):
"""Global exception handler"""
logger.error(f"Unhandled exception: {str(exc)}")
return JSONResponse(
status_code=500,
content={"detail": "Internal server error"}
)
if __name__ == "__main__":
import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
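With the service running locally (for example via uvicorn app.main:app --reload from the project root), you can smoke-test it before containerizing. A minimal sketch using only the standard library; the X-API-Key header matches the APIKeyHeader scheme above, and the key value is whatever you exported as API_KEY:

import json
import urllib.request

payload = json.dumps({"url": "https://example.com", "word_count_threshold": 10}).encode()
req = urllib.request.Request(
    "http://localhost:8000/scrape",
    data=payload,
    headers={"Content-Type": "application/json", "X-API-Key": "your-secret-api-key"},
    method="POST",
)
with urllib.request.urlopen(req, timeout=120) as resp:
    print(json.loads(resp.read())["success"])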
Step 4: Dependencies
Create app/requirements.txt:
fastapi==0.109.0
uvicorn[standard]==0.27.0
crawl4ai==0.2.77
pydantic==2.5.3
python-multipart==0.0.6
Deploy Crawl4AI with Docker for n8n Automation
Create a Dockerfile in the project root:
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
wget \
gnupg \
&& rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app

# Copy requirements and install Python dependencies (this also installs the Playwright CLI)
COPY app/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers (must come after pip install, which provides the playwright command)
RUN playwright install --with-deps chromium
# Copy application code
COPY app/ ./app/
# Expose port
EXPOSE 8000
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
# Run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]Create docker-compose.yml for easier deployment:
version: '3.8'
services:
crawl4ai-service:
build: .
container_name: crawl4ai-n8n
ports:
- "8000:8000"
environment:
- API_KEY=${API_KEY:-your-secret-api-key}
restart: unless-stopped
volumes:
- ./logs:/app/logs
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
      start_period: 5s
Build and run:
# Build the image
docker-compose build
# Start the service
docker-compose up -d
# Check logs
docker-compose logs -f
# Stop the service
docker-compose down
Configuring n8n Workflows
Once your FastAPI service is running, configure n8n to use it:
1. Add HTTP Request Node
Drag an HTTP Request node into your n8n workflow and configure it:
- Method: POST
- URL: http://localhost:8000/scrape (or your deployed URL)
- Authentication: Header API Key, sent in the X-API-Key header (if you set one)
- Body: JSON with the scraping parameters
2. Example Request Body
{
"url": "https://example.com",
"selector": ".article-content",
"word_count_threshold": 10,
"bypass_cache": true
}
3. Handle the Response
The HTTP Request node will receive a JSON response with:
{
"success": true,
"url": "https://example.com",
"markdown": "# Scraped content...\n\nFull markdown...",
"extracted_content": null,
"error": null,
"metadata": {
"word_count": 1234,
"links": 45,
"screenshot": true
}
}
Workflow example
A typical workflow might be: Schedule Trigger → HTTP Request (scrape) → IF (check success field) → Code Node (process markdown) → Send to Slack/Database/Notion
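The processing step is usually small. Here is a sketch of the kind of transformation a Code node might apply to the /scrape response, written as standalone Python; the field names match the ScrapeResponse model above, everything else is illustrative:

def summarize_scrape(item: dict) -> dict:
    """Reduce a /scrape response to the fields a downstream node needs."""
    if not item.get("success"):
        return {"ok": False, "error": item.get("error", "unknown")}
    markdown = item.get("markdown") or ""
    return {
        "ok": True,
        "url": item["url"],
        "preview": markdown[:500],  # first 500 characters, e.g. for a Slack message
        "word_count": item.get("metadata", {}).get("word_count", 0),
    }

print(summarize_scrape({"success": True, "url": "https://example.com",
                        "markdown": "# Scraped content...", "metadata": {"word_count": 1234}}))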
Common Mistakes and How to Avoid Them
Mistake 1: Not Handling Async Lifecycle
Forgetting to initialize and clean up the crawler properly
Problem: Creating a new crawler instance for every request is slow and resource-intensive.
❌ Wrong - new instance per request
@app.post("/scrape")
async def scrape_url(request: ScrapeRequest):
async with AsyncWebCrawler() as crawler: # Spawns new browser every time
result = await crawler.arun(url=request.url)
        return result
✅ Correct - reuse crawler instance
crawler_wrapper = CrawlerWrapper()
@asynccontextmanager
async def lifespan(app: FastAPI):
await crawler_wrapper.initialize() # Once on startup
yield
    await crawler_wrapper.crawler.close()  # Cleanup on shutdown
Mistake 2: No Rate Limiting
Allowing unlimited requests that overload the server
Problem: n8n workflows can trigger rapid-fire requests that crash your service or get you blocked.
✅ Add rate limiting with slowapi
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
@app.post("/scrape")
@limiter.limit("10/minute") # 10 requests per minute per IP
async def scrape_url(request: ScrapeRequest, http_request: Request):
# ... scraping logicMistake 3: Not Validating URLs
Accepting any URL without validation
Problem: Without validation, attackers can use your service to reach internal network addresses (SSRF) or to scrape arbitrary sites on your behalf.
✅ Validate URLs before scraping
from urllib.parse import urlparse
import ipaddress
def is_safe_url(url: str) -> bool:
"""Check if URL is safe to scrape"""
parsed = urlparse(url)
# Reject non-HTTP(S) schemes
if parsed.scheme not in ['http', 'https']:
return False
# Reject private IPs
try:
ip = ipaddress.ip_address(parsed.hostname)
if ip.is_private or ip.is_loopback:
return False
except ValueError:
pass # Not an IP, continue
return True
# In your endpoint
if not is_safe_url(str(request.url)):
    raise HTTPException(status_code=400, detail="Unsafe URL")
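This check only catches literal IP addresses; a hostname such as internal.example.com can still resolve to a private address. If SSRF is a real concern, you could also resolve the hostname and inspect the result (a sketch):

import ipaddress
import socket

def resolves_to_private(hostname: str) -> bool:
    """Return True if any address the hostname resolves to is private or loopback."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return True  # treat unresolvable hosts as unsafe
    for family, _, _, _, sockaddr in infos:
        try:
            ip = ipaddress.ip_address(sockaddr[0])
        except ValueError:
            continue  # e.g. scoped IPv6 literals
        if ip.is_private or ip.is_loopback:
            return True
    return False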
Mistake 4: No Timeout Handling
Requests hanging indefinitely on slow sites
Problem: Some sites take minutes to load, blocking your workers and exhausting resources.
✅ Add timeouts to crawler
result = await crawler.arun(
url=url,
timeout=30, # 30 second timeout
page_timeout=30 # Page load timeout
)
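If you want a hard ceiling that does not depend on which timeout options your Crawl4AI version supports, you can also wrap the call with asyncio's own timeout. A sketch; the 45-second budget and the function name are illustrative:

import asyncio

async def scrape_with_budget(crawler, url: str, budget_s: float = 45.0) -> dict:
    """Enforce an upper bound on the whole crawl, independent of crawler-level timeouts."""
    try:
        result = await asyncio.wait_for(crawler.arun(url=url), timeout=budget_s)
    except asyncio.TimeoutError:
        return {"success": False, "url": url, "error": f"Crawl exceeded {budget_s}s budget",
                "error_type": "timeout", "metadata": {}}
    return {"success": result.success, "url": url, "markdown": result.markdown, "metadata": {}}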
Mistake 5: Missing Error Context
Returning generic errors that make debugging impossible
Problem: When scraping fails, n8n workflows can't tell if it's a timeout, blocking, or parsing error.
❌ Wrong - generic error
return {"success": False, "error": "Scraping failed"}✅ Correct - detailed error
return {
"success": False,
"error": f"Timeout after 30s waiting for selector '.content'",
"error_type": "timeout",
"url": url
}
When This Integration Makes Sense (And When It Doesn't)
Good Fit ✅
- Complex workflows: You need to scrape, transform, and send data to multiple services (Slack, Notion, Google Sheets)
- Scheduled tasks: Daily/weekly scraping jobs that require minimal code
- Dynamic sites: SPAs, React apps that need JavaScript rendering
- Non-technical users: Teams who prefer drag-and-drop workflow builders over writing Python scripts
- Integration-heavy: You need to connect scraped data to 5+ other services
Not a Good Fit ❌
- High-volume scraping: Millions of pages per day—use dedicated scraping infrastructure
- Simple scripts: One-off tasks where a 10-line Python script suffices
- Real-time requirements: Sub-second response times needed
- Static sites: Basic HTML sites that don't need Playwright overhead
- Cost-sensitive: Running n8n + Docker + FastAPI has higher resource costs than a simple script
Alternative Approaches
- For simple workflows: Use n8n's built-in HTTP Request node with puppeteer or cheerio services
- For developers: Write Python scripts with Airflow or Prefect for orchestration
- For enterprise: Consider Apify, which provides hosted web scraping with n8n integration out of the box
Production Best Practices
1. Authentication & Security
- Use strong API keys (32+ characters) and store them in environment variables, not code
- Restrict CORS to your n8n domain instead of allowing all origins
- Add HTTPS in production (use a reverse proxy like nginx or Traefik)
- Implement request signing for additional security (see the sketch below)
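A minimal request-signing sketch, assuming a shared SIGNING_SECRET and an X-Signature header (both are illustrative choices; the n8n workflow would compute the same HMAC over the raw JSON body before calling the service):

import hashlib
import hmac
import os

from fastapi import HTTPException, Request, status

SIGNING_SECRET = os.getenv("SIGNING_SECRET", "").encode()

async def verify_signature(request: Request) -> None:
    """Reject requests whose X-Signature header is not an HMAC-SHA256 of the raw body."""
    if not SIGNING_SECRET:
        return  # signing disabled unless a secret is configured
    body = await request.body()  # Starlette caches the body, so the endpoint can still parse it
    expected = hmac.new(SIGNING_SECRET, body, hashlib.sha256).hexdigest()
    provided = request.headers.get("X-Signature", "")
    if not hmac.compare_digest(expected, provided):
        raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail="Invalid request signature")

Attach it to the endpoint alongside the API-key check, for example with @app.post("/scrape", dependencies=[Depends(verify_signature)]).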
2. Monitoring & Logging
Add structured logging with correlation IDs:
import uuid
from fastapi import Request
@app.middleware("http")
async def add_correlation_id(request: Request, call_next):
correlation_id = str(uuid.uuid4())
request.state.correlation_id = correlation_id
response = await call_next(request)
response.headers["X-Correlation-ID"] = correlation_id
return response
@app.post("/scrape")
async def scrape_url(request: ScrapeRequest, http_request: Request):
correlation_id = http_request.state.correlation_id
logger.info(f"[{correlation_id}] Scraping {request.url}")3. Scaling Considerations
3. Scaling Considerations
- Horizontal scaling: Run multiple containers behind a load balancer (each needs its own Crawl4AI instance)
- Resource limits: Set Docker memory limits (2GB per container is typical)
- Queue system: For heavy workloads, add Redis Queue or Celery between n8n and your scraper
- Connection pooling: Reuse browser instances when possible, but recycle them periodically to prevent memory leaks (see the sketch below)
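One way to implement the periodic recycling is a thin subclass of the wrapper from Step 2; the class name and the 200-request threshold are illustrative:

from app.crawler import CrawlerWrapper

RECYCLE_AFTER = 200  # requests served before the browser is torn down

class RecyclingCrawlerWrapper(CrawlerWrapper):
    def __init__(self):
        super().__init__()
        self._requests_served = 0

    async def scrape(self, url: str, **kwargs) -> dict:
        if self.crawler and self._requests_served >= RECYCLE_AFTER:
            # Close the old browser; the next scrape() call starts a fresh one
            await self.crawler.close()
            self.crawler = None
            self._requests_served = 0
        self._requests_served += 1
        return await super().scrape(url=url, **kwargs)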
4. Cost Optimization
- Use serverless or spot instances for scheduled workflows (you only pay during execution)
- Cache aggressively: if content hasn't changed, return cached results instead of re-scraping (see the sketch below)
- Batch requests when possible—scrape 10 URLs in one workflow execution instead of 10 separate runs
- Monitor costs per 1000 pages and set alerts for unexpected spikes
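The caching bullet above can be as simple as a small in-memory TTL cache in front of the wrapper (the names and the one-hour TTL are illustrative; with multiple containers you would move the cache into Redis or similar):

import time

from app.crawler import crawler_wrapper

_CACHE: dict[str, tuple[float, dict]] = {}
CACHE_TTL_S = 3600  # serve cached results for up to an hour

async def cached_scrape(url: str, **kwargs) -> dict:
    now = time.time()
    entry = _CACHE.get(url)
    if entry and now - entry[0] < CACHE_TTL_S and not kwargs.get("bypass_cache"):
        return entry[1]
    result = await crawler_wrapper.scrape(url=url, **kwargs)
    if result.get("success"):
        _CACHE[url] = (now, result)
    return result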
Summary
Integrating Crawl4AI with n8n through a FastAPI microservice gives you a powerful, flexible automation platform. The key implementation points are:
- Microservice architecture: FastAPI provides a clean REST interface between n8n and Crawl4AI
- Lifecycle management: Initialize the crawler once at startup, cleanup on shutdown
- Security: Validate URLs, use API keys, enable HTTPS in production
- Reliability: Add rate limiting, timeouts, detailed error messages, and logging
- Scalability: Design for horizontal scaling with load balancers and queue systems
This integration shines for complex, multi-step workflows that need to connect scraped data to multiple services. For simpler use cases or high-volume scraping, consider lightweight scripts or dedicated scraping platforms.
Need This for Your AI Coding Agent Instead?
If you're using Cursor, Claude Desktop, or other AI assistants, check out our MCP guide. Set up web scraping in your AI workflow in under 5 minutes.