Crawl4AI MCP Server Guide: Setup for Cursor & Claude (2026 Update)
Complete guide to building a Model Context Protocol (MCP) server with Crawl4AI. Enable Cursor IDE, Claude Desktop, and other MCP-compatible AI assistants to scrape websites and extract structured data in real-time conversations.
What is MCP and Why It Matters for Web Scraping
The Model Context Protocol (MCP) is an open standard that allows AI assistants like Claude Desktop to connect to external tools and data sources. Think of it as a universal API layer—instead of building custom integrations for each AI model, you create an MCP server once, and any MCP-compliant assistant can use it.
For web scraping, this is particularly valuable. Claude excels at analysis and reasoning but cannot fetch live web data on its own. By exposing Crawl4AI through MCP, you give Claude the ability to:
- Extract structured data from URLs during conversations
- Crawl multi-page sites without leaving the chat interface
- Process JavaScript-heavy content that basic HTTP clients miss
- Return cleaned, markdown-formatted content ready for analysis
Why official documentation falls short
The Crawl4AI documentation covers installation and basic usage, but MCP integration requires bridging two different frameworks—you need to understand both Crawl4AI's configuration options and MCP's server implementation patterns. The official docs don't explain how to structure an MCP server, handle tool registration, or manage error propagation between systems.
Prerequisites
Before starting, ensure you have:
- Python 3.10+ installed
- Crawl4AI installed (pip install crawl4ai; see the setup sketch below)
- Claude Desktop (Mac/Windows) with MCP enabled
- Basic familiarity with async/await patterns in Python
- A target website to scrape (respect robots.txt and rate limits)
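A minimal setup sketch, assuming a fresh virtual environment on macOS/Linux and Chromium as the Playwright browser (adjust paths and the Python executable for your system):
# Create and activate an isolated environment
python3 -m venv .venv
source .venv/bin/activate
# Install Crawl4AI and download the browser it drives
pip install crawl4ai
playwright install chromium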
You'll also need to install the MCP SDK:
pip install mcp
Understanding the MCP + Crawl4AI Architecture
The integration follows this flow:
Claude Desktop → MCP Client → MCP Server (your code) → Crawl4AI → Target Website
      ↓              ↓                 ↓                    ↓             ↓
 User request   MCP protocol     Tool handler         Async crawler  HTTP request
                 (JSON-RPC)
Your MCP server exposes "tools" that Claude can call. Each tool corresponds to a scraping operation—basic scraping, CSS selector extraction, or multi-page crawling. When Claude needs web data, it calls your tool, your server invokes Crawl4AI, and results return through the MCP protocol.
Basic MCP Server Implementation
Create a new file mcp_server.py:
#!/usr/bin/env python3
"""
Crawl4AI MCP Server
Exposes web scraping tools to Claude Desktop via Model Context Protocol
"""
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from mcp.server.models import InitializationOptions
from mcp.server import NotificationOptions, Server
from mcp.server.stdio import stdio_server
from mcp.types import (
Resource,
Tool,
TextContent,
ImageContent,
EmbeddedResource,
)
from typing import Any, Optional
import json
import asyncio
app = Server("crawl4ai-server")
@app.list_tools()
async def handle_list_tools() -> list[Tool]:
"""
List available scraping tools.
Claude Desktop uses this to discover what operations are available.
"""
return [
Tool(
name="scrape_url",
description="Scrape a single URL and return content as markdown",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "URL to scrape (must include http:// or https://)"
},
"wait_for_selector": {
"type": "string",
"description": "Optional CSS selector to wait for before scraping"
}
},
"required": ["url"]
}
),
Tool(
name="extract_with_css",
description="Extract specific elements using CSS selectors",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "URL to scrape"
},
"selector": {
"type": "string",
"description": "CSS selector for elements to extract"
},
"attribute": {
"type": "string",
"description": "Attribute to extract (text, href, src, etc.)",
"default": "text"
}
},
"required": ["url", "selector"]
}
),
Tool(
name="crawl_multiple",
description="Crawl multiple pages from a website",
inputSchema={
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "Starting URL"
},
"max_pages": {
"type": "integer",
"description": "Maximum number of pages to crawl",
"default": 5
}
},
"required": ["url"]
}
)
]
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict | None) -> list[TextContent | ImageContent | EmbeddedResource]:
"""
Handle tool calls from Claude Desktop.
This is where the actual scraping happens.
"""
if arguments is None:
arguments = {}
try:
if name == "scrape_url":
return await scrape_url(arguments.get("url"), arguments.get("wait_for_selector"))
elif name == "extract_with_css":
return await extract_with_css(
arguments.get("url"),
arguments.get("selector"),
arguments.get("attribute", "text")
)
elif name == "crawl_multiple":
return await crawl_multiple(
arguments.get("url"),
arguments.get("max_pages", 5)
)
else:
raise ValueError(f"Unknown tool: {name}")
except Exception as e:
return [TextContent(
type="text",
text=f"Error during scraping: {str(e)}\n\nTip: Check if the URL is accessible and respects robots.txt"
)]
async def scrape_url(url: str, wait_for_selector: Optional[str] = None) -> list[TextContent]:
"""Scrape a single URL and return markdown content"""
if not url:
return [TextContent(type="text", text="Error: URL is required")]
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url=url,
word_count_threshold=10,
exclude_external_links=False,
            wait_for=wait_for_selector  # optional CSS selector to wait for before extracting
)
if result.success:
markdown_content = f"""# Scraped Content from {url}
{result.markdown}
---
## Metadata
- **Status**: Success
- **Word Count**: {result.extracted_content_word_count if hasattr(result, 'extracted_content_word_count') else 'N/A'}
- **Links Found**: {len(result.links.get('internal', [])) + len(result.links.get('external', [])) if hasattr(result, 'links') else 'N/A'}
- **Screenshot**: {'Available' if getattr(result, 'screenshot', None) else 'Not available'}
"""
return [TextContent(type="text", text=markdown_content)]
else:
return [TextContent(
type="text",
text=f"Failed to scrape {url}\n\nError: {result.error_message if hasattr(result, 'error_message') else 'Unknown error'}"
)]
async def extract_with_css(url: str, selector: str, attribute: str = "text") -> list[TextContent]:
"""Extract specific elements using CSS selectors"""
if not url or not selector:
return [TextContent(type="text", text="Error: URL and selector are required")]
async with AsyncWebCrawler(verbose=True) as crawler:
# Use CSS extraction strategy
result = await crawler.arun(
url=url,
css_selector=selector,
bypass_cache=True
)
if result.success:
extracted_data = result.extracted_content if hasattr(result, 'extracted_content') else result.markdown
content = f"""# CSS Extraction Results from {url}
**Selector**: `{selector}`
**Attribute**: `{attribute}`
## Extracted Data
```
{extracted_data}
```
---
## Metadata
- **Status**: Success
- **Elements Found**: {len(result.extracted_content) if isinstance(getattr(result, 'extracted_content', None), list) else '1' if hasattr(result, 'extracted_content') else 'N/A'}
"""
return [TextContent(type="text", text=content)]
else:
return [TextContent(
type="text",
text=f"CSS extraction failed for {url}\n\nError: {result.error_message if hasattr(result, 'error_message') else 'Unknown error'}"
)]
async def crawl_multiple(url: str, max_pages: int = 5) -> list[TextContent]:
"""Crawl multiple pages from a starting URL"""
if not url:
return [TextContent(type="text", text="Error: URL is required")]
results = []
async with AsyncWebCrawler(verbose=True) as crawler:
# Basic crawl implementation
result = await crawler.arun(url=url, word_count_threshold=10)
if result.success:
results.append(f"## Page 1: {url}\n{result.markdown}\n")
# Get internal links for further crawling
if hasattr(result, 'links'):
internal_links = result.links.get('internal', [])[:max_pages-1]
for i, link in enumerate(internal_links, 2):
if i > max_pages:
break
try:
page_result = await crawler.arun(url=link, word_count_threshold=10)
if page_result.success:
results.append(f"## Page {i}: {link}\n{page_result.markdown}\n")
except Exception as e:
results.append(f"## Page {i}: {link}\nError: {str(e)}\n")
content = f"""# Multi-Page Crawl Results from {url}
**Pages Crawled**: {len(results)}
{chr(10).join(results)}
---
## Metadata
- **Total Pages**: {len(results)}
- **Starting URL**: {url}
- **Max Pages Limit**: {max_pages}
"""
return [TextContent(type="text", text=content)]
async def main():
"""Run the MCP server"""
async with stdio_server() as (read_stream, write_stream):
await app.run(
read_stream,
write_stream,
InitializationOptions(
server_name="crawl4ai-server",
server_version="1.0.0",
capabilities=app.get_capabilities(
notification_options=NotificationOptions(),
experimental_capabilities={}
)
)
)
if __name__ == "__main__":
    asyncio.run(main())
Configuring Claude Desktop
To connect your MCP server to Claude Desktop, create or edit the Claude config file:
- Mac: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
Add this configuration:
{
"mcpServers": {
"crawl4ai": {
"command": "python3",
"args": ["/absolute/path/to/your/mcp_server.py"],
"env": {
"PYTHONPATH": "/absolute/path/to/your/project"
}
}
}
}
Replace /absolute/path/to/your/ with your actual file paths.
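If Crawl4AI lives in a virtual environment, point "command" at that environment's interpreter rather than the system python3 so the server process can import it (the paths below are placeholders, not real locations):
{
  "mcpServers": {
    "crawl4ai": {
      "command": "/absolute/path/to/your/.venv/bin/python",
      "args": ["/absolute/path/to/your/mcp_server.py"]
    }
  }
}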
Testing the Integration
After restarting Claude Desktop:
- Open a new conversation
- Try a command like: "Scrape https://example.com and summarize the main points"
- Claude will automatically use the scrape_url tool
- For targeted extraction: "Extract all product prices from https://shop.example.com using the selector .price"
Debugging tip
Check Claude Desktop's developer console (View → Developer → Show Console) for MCP server logs and error messages.
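Claude Desktop also writes MCP logs to disk; tailing them is often the quickest way to spot startup failures. On macOS the files typically live under ~/Library/Logs/Claude/, though exact filenames may vary by version:
# Follow all MCP-related log files (macOS)
tail -f ~/Library/Logs/Claude/mcp*.log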
Common Mistakes and How to Avoid Them
Mistake 1: Blocking the Event Loop
Problem: Using synchronous time.sleep() or blocking I/O in async handlers
❌ Wrong - blocks the entire server
async def scrape_url(url: str):
time.sleep(5) # This blocks everything
    return await crawler.arun(url=url)
✅ Correct - use asyncio.sleep
async def scrape_url(url: str):
await asyncio.sleep(5) # Non-blocking
    return await crawler.arun(url=url)
Mistake 2: Not Handling Errors Gracefully
Problem: Unhandled exceptions crash the MCP server and disconnect Claude
❌ Wrong - crashes on invalid URL
async def scrape_url(url: str):
result = await crawler.arun(url=url)
    return result.markdown
✅ Correct - wrap in try/except
async def scrape_url(url: str):
try:
if not url.startswith(('http://', 'https://')):
raise ValueError("URL must start with http:// or https://")
result = await crawler.arun(url=url)
if not result.success:
return f"Scraping failed: {result.error_message}"
return result.markdown
except Exception as e:
return f"Error: {str(e)}. Please check the URL and try again."Mistake 3: Ignoring Rate Limits and robots.txt
Problem: Aggressive crawling that gets you blocked
❌ Wrong - no delays
for url in urls:
    await crawler.arun(url=url)
✅ Correct - respectful crawling
import asyncio
for url in urls:
await crawler.arun(url=url)
    await asyncio.sleep(2)  # 2 second delay between requests
Always check robots.txt before crawling:
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
if not rp.can_fetch("*", url):
return "Error: This URL is disallowed by robots.txt"Mistake 4: Returning Malformed Tool Responses
Problem: MCP expects specific response types
❌ Wrong - returning raw strings
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
return "Some text content"âś… Correct - wrapping in TextContent
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    return [TextContent(type="text", text="Some text content")]
Mistake 5: Not Cleaning Extracted Content
Problem: Returning raw HTML with scripts, navbars, and footers
✅ Use Crawl4AI's built-in content extraction
result = await crawler.arun(
url=url,
word_count_threshold=10, # Filter out boilerplate
exclude_external_links=False,
exclude_social_media_links=True
)
When Crawl4AI is a Good Fit (And When It's Not)
Good Fit ✅
- JavaScript-heavy sites: Crawl4AI runs Playwright/Chromium, so it handles SPAs, React apps, and dynamic content
- Complex extraction needs: CSS selectors, XPath, LLM-based extraction
- Authenticated content: Can handle cookies and sessions
- Screenshot needs: Can capture page screenshots alongside content
- Local development: Free and self-hosted for testing
- Cost-effective alternative: Compared to hosted APIs like Firecrawl, a Crawl4AI MCP server processes everything locally without consuming API credits, which suits users who want full control over their scraping pipeline
Not a Good Fit ❌
- High-scale production: Running browser instances is resource-intensive; consider dedicated APIs or hosted solutions
- Real-time requirements: Browser spawning adds 1-3 seconds per request
- Simple static sites: Overkill for basic HTML—use requests + BeautifulSoup instead
- Strict rate limits: Heavy resource usage may trigger anti-bot measures faster than lightweight scrapers
- Cost-sensitive deployments: Browser instances require significant CPU/memory compared to HTTP clients
Alternative Approaches
- For simple scraping: requests, httpx, or scrapy (see the sketch below)
- For hosted APIs: Firecrawl, Apify, ScrapingBee
- For enterprise scale: Custom proxy rotation + dedicated scraping infrastructure
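For comparison, scraping a static page needs nothing more than an HTTP client and an HTML parser. A minimal sketch, assuming requests and beautifulsoup4 are installed and using example.com as a stand-in target:
import requests
from bs4 import BeautifulSoup

# Fetch a static page with a plain HTTP client -- no browser required
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()

# Parse the HTML and pull out the pieces you need
soup = BeautifulSoup(response.text, "html.parser")
title = soup.title.string if soup.title else "No title"
links = [a.get("href") for a in soup.find_all("a", href=True)]

print(title)
print(f"{len(links)} links found")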
Production Considerations
Docker Deployment
Create a Dockerfile:
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
wget \
gnupg \
&& rm -rf /var/lib/apt/lists/*
# Install Python dependencies first (this provides the playwright CLI)
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Install Playwright browsers
RUN playwright install --with-deps chromium
# Copy server code
COPY mcp_server.py .
# Run the server
CMD ["python", "mcp_server.py"]
Build and run:
docker build -t crawl4ai-mcp .
docker run -v ~/.config:/root/.config crawl4ai-mcp
Security Hardening
- Validate all URLs: Reject non-HTTP(S) schemes, internal IPs, or invalid patterns (see the sketch after this list)
- Rate limiting: Implement per-session and per-IP rate limits
- Content size limits: Cap response size to prevent memory exhaustion
- Sandboxing: Run in Docker or separate user namespace
- Logging: Log all scraping operations for audit trails
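A minimal validation sketch for the first point, using only the standard library; the is_safe_url helper and the blocked ranges are illustrative, not an exhaustive policy:
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url: str) -> bool:
    """Reject non-HTTP(S) schemes and URLs that resolve to private/internal addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        # Resolve the hostname and check every returned address
        for info in socket.getaddrinfo(parsed.hostname, None):
            ip = ipaddress.ip_address(info[4][0])
            if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
                return False
    except (socket.gaierror, ValueError):
        return False
    return True

# Example: call this before handing the URL to Crawl4AI
# if not is_safe_url(url):
#     return [TextContent(type="text", text="Error: URL rejected by validation policy")]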
Monitoring
Add structured logging:
import logging
import json
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
logger.info(json.dumps({
"event": "tool_call",
"tool": name,
"arguments": arguments
}))
    # ... rest of implementation
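If you also want latency and outcomes in the logs, one option is to time each call and emit a second event. A sketch layered on the handler above (the event and field names are arbitrary, and the scrape_url call is just an example of dispatching to one of the tool functions defined earlier):
import time

@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    start = time.monotonic()
    logger.info(json.dumps({"event": "tool_call", "tool": name, "arguments": arguments}))
    try:
        # Dispatch to scrape_url / extract_with_css / crawl_multiple as before
        result = await scrape_url(arguments.get("url"))  # example dispatch
        logger.info(json.dumps({"event": "tool_done", "tool": name,
                                "duration_s": round(time.monotonic() - start, 2)}))
        return result
    except Exception as e:
        logger.error(json.dumps({"event": "tool_failed", "tool": name, "error": str(e),
                                 "duration_s": round(time.monotonic() - start, 2)}))
        raise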
Fixing "ModuleNotFoundError: No module named 'crawl4ai'"
Problem: MCP server fails to start with module import error
Common Error
ModuleNotFoundError: No module named 'crawl4ai'. This usually happens when the Python environment in Claude/Cursor config doesn't match your local pip install. The MCP server runs in a separate Python process, so you must install Crawl4AI in the exact same Python environment specified in your config file.
Solutions:
- Install Crawl4AI: pip install crawl4ai
- Check that the Python path in your Claude Desktop config points to the correct environment
- Use absolute paths in claude_desktop_config.json
- Verify you're using the same Python environment where Crawl4AI is installed
# Check if crawl4ai is installed
python3 -m pip list | grep crawl4ai
# Install if missing
python3 -m pip install crawl4ai
# Find Python path
which python3
Fixing "MCP server crashed: Connection closed" in Cursor
Problem: Cursor shows MCP server connection error immediately after startup
Solutions:
- Check Cursor's MCP logs in Developer Console (View → Developer → Show Console)
- Verify the Python path is correct for your system (use which python3 on macOS/Linux)
- Ensure the script has execute permissions: chmod +x mcp_server.py
- Test the server manually: python3 mcp_server.py (should wait for stdin)
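To go one step beyond the stdin check, you can hand the server a single MCP initialize request by hand. This sketch assumes the stdio transport's newline-delimited JSON-RPC framing and the 2024-11-05 protocol version; if the environment is healthy, an initialize response should appear on stdout:
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"manual-test","version":"0.0.1"}}}' | python3 mcp_server.py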
Fixing "playwright> error: Executable doesn't exist at /path/to/chromium"
Problem: Crawl4AI cannot find Playwright browsers
Solutions:
- Install Playwright browsers: playwright install chromium
- Install system dependencies: playwright install --with-deps chromium
- On macOS, ensure Xcode command line tools are installed
- On Linux, install missing libraries: apt-get install -y wget gnupg
# Install Playwright browsers
playwright install chromium
# Install with system dependencies (Linux)
playwright install --with-deps chromium
# Verify installation
playwright install --help
Fixing "asyncio.run() cannot be called from a running event loop"
Problem: Server crashes when Claude calls tools
Solution: The MCP SDK already manages the event loop. Never call asyncio.run() inside tool handlers.
# ❌ WRONG - will crash
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
result = asyncio.run(crawler.arun(url=url)) # Crashes!
return [TextContent(type="text", text=result)]
# âś… CORRECT - just await
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=url) # Works!
        return [TextContent(type="text", text=result.markdown)]
Fixing "Claude Desktop doesn't show MCP tools"
Problem: Claude Desktop starts but tools don't appear in the UI
Solutions:
- Restart Claude Desktop completely (quit and reopen, not just window close)
- Verify config file syntax is valid JSON (use a JSON validator)
- Check the config file path is correct for your OS
- Look for MCP errors in Claude Desktop → Help → Developer → Show Console
- Ensure the mcpServers key exists and your server is listed
// Valid config example
{
"mcpServers": {
"crawl4ai": {
"command": "python3",
"args": ["/Users/yourname/projects/crawl4ai-mcp/mcp_server.py"],
"env": {
"PYTHONPATH": "/Users/yourname/projects/crawl4ai-mcp"
}
}
}
}Fixing "TimeoutError: Tool call timed out after 30 seconds"
Problem: Scraping operations timeout on large pages
Solutions:
- Increase the timeout in your Claude Desktop config (add "timeout": 120)
- Use word_count_threshold to limit content extraction
- Skip screenshots with screenshot=False in Crawl4AI options
- For large sites, use CSS selectors to extract only needed content
# Faster scraping with limits
result = await crawler.arun(
url=url,
word_count_threshold=10, # Skip boilerplate
screenshot=False, # Skip screenshots
bypass_cache=False, # Use cache when possible
page_timeout=10000 # 10 second page load timeout
)
Summary
Building an MCP server with Crawl4AI bridges the gap between Claude's reasoning capabilities and live web data. The key implementation points are:
- Tool registration: Expose scraping operations as MCP tools with clear input schemas
- Async patterns: Use async/await throughout to avoid blocking
- Error handling: Catch exceptions and return helpful error messages
- Rate limiting: Respect robots.txt and add delays between requests
- Production readiness: Use Docker, validation, and monitoring for deployments
This integration is ideal for development, research, and internal tools where control and flexibility outweigh raw performance concerns. For high-scale production use cases, consider dedicated scraping APIs or more lightweight solutions.