Crawl4AI MCP Server Guide: Setup for Cursor & Claude (2026 Update)

Complete guide to building a Model Context Protocol (MCP) server with Crawl4AI. Enable Cursor IDE, Claude Desktop, and other MCP-compatible AI assistants to scrape websites and extract structured data in real-time conversations.

What is MCP and Why It Matters for Web Scraping

The Model Context Protocol (MCP) is an open standard that allows AI assistants like Claude Desktop to connect to external tools and data sources. Think of it as a universal API layer—instead of building custom integrations for each AI model, you create an MCP server once, and any MCP-compliant assistant can use it.

For web scraping, this is particularly valuable. Claude excels at analysis and reasoning but cannot fetch live web data on its own. By exposing Crawl4AI through MCP, you give Claude the ability to:

  • Extract structured data from URLs during conversations
  • Crawl multi-page sites without leaving the chat interface
  • Process JavaScript-heavy content that basic HTTP clients miss
  • Return cleaned, markdown-formatted content ready for analysis
ℹ️ Why official documentation falls short

The Crawl4AI documentation covers installation and basic usage, but MCP integration requires bridging two different frameworks—you need to understand both Crawl4AI's configuration options and MCP's server implementation patterns. The official docs don't explain how to structure an MCP server, handle tool registration, or manage error propagation between systems.

Prerequisites

Before starting, ensure you have:

  • Python 3.10+ installed
  • Crawl4AI installed (pip install crawl4ai)
  • Claude Desktop (Mac/Windows) with MCP enabled
  • Basic familiarity with async/await patterns in Python
  • A target website to scrape (respect robots.txt and rate limits)

You'll also need to install the MCP SDK:

bash
pip install mcp
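
The MCP server runs in its own Python process launched by Claude Desktop or Cursor, so both packages must be installed in the interpreter you will later point your config at. A quick sanity check, as a sketch to run with that exact interpreter:

python
# Run this with the same python3 you will reference in claude_desktop_config.json
import sys

assert sys.version_info >= (3, 10), "Python 3.10+ required (the server uses `dict | None` type hints)"

import crawl4ai  # must import cleanly in this environment
import mcp       # must import cleanly in this environment

print(f"OK, using interpreter: {sys.executable}")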

Understanding the MCP + Crawl4AI Architecture

The integration follows this flow:

text
Claude Desktop → MCP Client → MCP Server (your code) → Crawl4AI → Target Website
     ↓                    ↓                    ↓               ↓            ↓
 User request       MCP protocol         Tool handler    Async crawler   HTTP request
                     (JSON-RPC)

Your MCP server exposes "tools" that Claude can call. Each tool corresponds to a scraping operation—basic scraping, CSS selector extraction, or multi-page crawling. When Claude needs web data, it calls your tool, your server invokes Crawl4AI, and results return through the MCP protocol.
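
Under the hood, the client and server exchange JSON-RPC 2.0 messages over stdio. The SDK handles the framing for you, but it helps to know the rough shape of a tool call; shown here as illustrative Python dicts, not code you need to write:

python
# Approximate shape of an MCP tool call for the scrape_url tool defined below
tool_call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "scrape_url",
        "arguments": {"url": "https://example.com"},
    },
}

# The server answers with a list of content blocks (TextContent in this guide)
tool_call_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [
            {"type": "text", "text": "# Scraped Content from https://example.com ..."}
        ]
    },
}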

Basic MCP Server Implementation

Create a new file mcp_server.py:

python
#!/usr/bin/env python3
"""
Crawl4AI MCP Server
Exposes web scraping tools to Claude Desktop via Model Context Protocol
"""

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from mcp.server.models import InitializationOptions
from mcp.server import NotificationOptions, Server
from mcp.server.stdio import stdio_server
from mcp.types import (
    Resource,
    Tool,
    TextContent,
    ImageContent,
    EmbeddedResource,
)
from typing import Any, Optional
import json
import asyncio

app = Server("crawl4ai-server")

@app.list_tools()
async def handle_list_tools() -> list[Tool]:
    """
    List available scraping tools.
    Claude Desktop uses this to discover what operations are available.
    """
    return [
        Tool(
            name="scrape_url",
            description="Scrape a single URL and return content as markdown",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "URL to scrape (must include http:// or https://)"
                    },
                    "wait_for_selector": {
                        "type": "string",
                        "description": "Optional CSS selector to wait for before scraping"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="extract_with_css",
            description="Extract specific elements using CSS selectors",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "URL to scrape"
                    },
                    "selector": {
                        "type": "string",
                        "description": "CSS selector for elements to extract"
                    },
                    "attribute": {
                        "type": "string",
                        "description": "Attribute to extract (text, href, src, etc.)",
                        "default": "text"
                    }
                },
                "required": ["url", "selector"]
            }
        ),
        Tool(
            name="crawl_multiple",
            description="Crawl multiple pages from a website",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "Starting URL"
                    },
                    "max_pages": {
                        "type": "integer",
                        "description": "Maximum number of pages to crawl",
                        "default": 5
                    }
                },
                "required": ["url"]
            }
        )
    ]

@app.call_tool()
async def handle_call_tool(name: str, arguments: dict | None) -> list[TextContent | ImageContent | EmbeddedResource]:
    """
    Handle tool calls from Claude Desktop.
    This is where the actual scraping happens.
    """
    if arguments is None:
        arguments = {}

    try:
        if name == "scrape_url":
            return await scrape_url(arguments.get("url"), arguments.get("wait_for_selector"))
        elif name == "extract_with_css":
            return await extract_with_css(
                arguments.get("url"),
                arguments.get("selector"),
                arguments.get("attribute", "text")
            )
        elif name == "crawl_multiple":
            return await crawl_multiple(
                arguments.get("url"),
                arguments.get("max_pages", 5)
            )
        else:
            raise ValueError(f"Unknown tool: {name}")
    except Exception as e:
        return [TextContent(
            type="text",
            text=f"Error during scraping: {str(e)}\n\nTip: Check if the URL is accessible and respects robots.txt"
        )]

async def scrape_url(url: str, wait_for_selector: Optional[str] = None) -> list[TextContent]:
    """Scrape a single URL and return markdown content"""
    if not url:
        return [TextContent(type="text", text="Error: URL is required")]

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=10,
            exclude_external_links=False,
            wait_for_selector=wait_for_selector
        )

        if result.success:
            markdown_content = f"""# Scraped Content from {url}

{result.markdown}

---

## Metadata
- **Status**: Success
- **Word Count**: {result.extracted_content_word_count if hasattr(result, 'extracted_content_word_count') else 'N/A'}
- **Links Found**: {len(result.links.get('internal', [])) + len(result.links.get('external', [])) if hasattr(result, 'links') else 'N/A'}
- **Screenshot**: {'Available' if getattr(result, 'screenshot', None) else 'Not available'}
"""
            return [TextContent(type="text", text=markdown_content)]
        else:
            return [TextContent(
                type="text",
                text=f"Failed to scrape {url}\n\nError: {result.error_message if hasattr(result, 'error_message') else 'Unknown error'}"
            )]

async def extract_with_css(url: str, selector: str, attribute: str = "text") -> list[TextContent]:
    """Extract specific elements using CSS selectors"""
    if not url or not selector:
        return [TextContent(type="text", text="Error: URL and selector are required")]

    async with AsyncWebCrawler(verbose=True) as crawler:
        # Use CSS extraction strategy
        result = await crawler.arun(
            url=url,
            css_selector=selector,
            bypass_cache=True
        )

        if result.success:
            extracted_data = result.extracted_content if hasattr(result, 'extracted_content') else result.markdown
            content = f"""# CSS Extraction Results from {url}

**Selector**: `{selector}`
**Attribute**: `{attribute}`

## Extracted Data
```
{extracted_data}
```

---

## Metadata
- **Status**: Success
- **Elements Found**: {len(extracted_data) if isinstance(extracted_data, list) else 'N/A'}
"""
            return [TextContent(type="text", text=content)]
        else:
            return [TextContent(
                type="text",
                text=f"CSS extraction failed for {url}\n\nError: {result.error_message if hasattr(result, 'error_message') else 'Unknown error'}"
            )]

async def crawl_multiple(url: str, max_pages: int = 5) -> list[TextContent]:
    """Crawl multiple pages from a starting URL"""
    if not url:
        return [TextContent(type="text", text="Error: URL is required")]

    results = []
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Basic crawl implementation
        result = await crawler.arun(url=url, word_count_threshold=10)

        if result.success:
            results.append(f"## Page 1: {url}\n{result.markdown}\n")

            # Get internal links for further crawling
            if hasattr(result, 'links'):
                internal_links = result.links.get('internal', [])[:max_pages-1]

                for i, link in enumerate(internal_links, 2):
                    if i > max_pages:
                        break
                    try:
                        page_result = await crawler.arun(url=link, word_count_threshold=10)
                        if page_result.success:
                            results.append(f"## Page {i}: {link}\n{page_result.markdown}\n")
                    except Exception as e:
                        results.append(f"## Page {i}: {link}\nError: {str(e)}\n")

        if not results:
            return [TextContent(
                type="text",
                text=f"Failed to crawl {url}\n\nError: {result.error_message if hasattr(result, 'error_message') else 'Unknown error'}"
            )]

        content = f"""# Multi-Page Crawl Results from {url}

**Pages Crawled**: {len(results)}

{chr(10).join(results)}

---

## Metadata
- **Total Pages**: {len(results)}
- **Starting URL**: {url}
- **Max Pages Limit**: {max_pages}
"""
        return [TextContent(type="text", text=content)]

async def main():
    """Run the MCP server"""
    async with stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="crawl4ai-server",
                server_version="1.0.0",
                capabilities=app.get_capabilities(
                    notification_options=NotificationOptions(),
                    experimental_capabilities={}
                )
            )
        )

if __name__ == "__main__":
    asyncio.run(main())

Configuring Claude Desktop

To connect your MCP server to Claude Desktop, create or edit the Claude config file:

  • Mac: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json

Add this configuration:

json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "python3",
      "args": ["/absolute/path/to/your/mcp_server.py"],
      "env": {
        "PYTHONPATH": "/absolute/path/to/your/project"
      }
    }
  }
}

Replace /absolute/path/to/your/ with your actual file paths.
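
The most common misconfiguration is pointing "command" at a different Python than the one where crawl4ai and mcp are installed. A quick way to get a path that is guaranteed to match, run from that environment:

python
# Prints the absolute interpreter path; use it as the "command" value in the config
import sys
print(sys.executable)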

Testing the Integration

After restarting Claude Desktop:

  1. Open a new conversation
  2. Try a command like:
    "Scrape https://example.com and summarize the main points"
  3. Claude will automatically use the scrape_url tool
  4. For targeted extraction:
    "Extract all product prices from https://shop.example.com using the selector .price"
ℹ️ Debugging tip

Check Claude Desktop's developer console (View → Developer → Show Console) for MCP server logs and error messages.

Common Mistakes and How to Avoid Them

Mistake 1: Blocking the Event Loop

Using synchronous blocking operations in async handlers

Problem: Using synchronous time.sleep() or blocking I/O in async handlers

❌ Wrong - blocks the entire server

python
async def scrape_url(url: str):
    time.sleep(5)  # This blocks everything
    return await crawler.arun(url=url)

✅ Correct - use asyncio.sleep

python
async def scrape_url(url: str):
    await asyncio.sleep(5)  # Non-blocking
    return await crawler.arun(url=url)

Mistake 2: Not Handling Errors Gracefully

Unhandled exceptions crash the MCP server

Problem: Unhandled exceptions crash the MCP server and disconnect Claude

❌ Wrong - crashes on invalid URL

python
async def scrape_url(url: str):
    result = await crawler.arun(url=url)
    return result.markdown

✅ Correct - wrap in try/except

python
async def scrape_url(url: str):
    try:
        if not url.startswith(('http://', 'https://')):
            raise ValueError("URL must start with http:// or https://")

        result = await crawler.arun(url=url)

        if not result.success:
            return f"Scraping failed: {result.error_message}"

        return result.markdown

    except Exception as e:
        return f"Error: {str(e)}. Please check the URL and try again."

Mistake 3: Ignoring Rate Limits and robots.txt

Aggressive crawling that gets you blocked

Problem: Aggressive crawling that gets you blocked

❌ Wrong - no delays

python
for url in urls:
    await crawler.arun(url=url)

✅ Correct - respectful crawling

python
import asyncio
for url in urls:
    await crawler.arun(url=url)
    await asyncio.sleep(2)  # 2 second delay between requests

Always check robots.txt before crawling:

python
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if not rp.can_fetch("*", url):
    return "Error: This URL is disallowed by robots.txt"

Mistake 4: Returning Malformed Tool Responses

MCP expects specific response types

Problem: MCP expects specific response types

❌ Wrong - returning raw strings

python
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    return "Some text content"

✅ Correct - wrapping in TextContent

python
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    return [TextContent(type="text", text="Some text content")]

Mistake 5: Not Cleaning Extracted Content

Returning raw HTML with scripts, navbars, and footers

Problem: Returning raw HTML with scripts, navbars, and footers

✅ Use Crawl4AI's built-in content extraction

python
result = await crawler.arun(
    url=url,
    word_count_threshold=10,  # Filter out boilerplate
    exclude_external_links=False,
    exclude_social_media_links=True
)

When Crawl4AI is a Good Fit (And When It's Not)

Good Fit ✅

  • JavaScript-heavy sites: Crawl4AI runs Playwright/Chromium, so it handles SPAs, React apps, and dynamic content
  • Complex extraction needs: CSS selectors, XPath, LLM-based extraction
  • Authenticated content: Can handle cookies and sessions
  • Screenshot needs: Can capture page screenshots alongside content
  • Local development: Free and self-hosted for testing
  • Cost-effective alternative: Unlike hosted APIs such as Firecrawl, Crawl4AI processes pages locally with no API credits, which suits power users who want full control over their web scraping pipeline.

Not a Good Fit ❌

  • High-scale production: Running browser instances is resource-intensive; consider dedicated APIs or hosted solutions
  • Real-time requirements: Browser spawning adds 1-3 seconds per request
  • Simple static sites: Overkill for basic HTML—use requests + BeautifulSoup instead
  • Strict rate limits: Heavy resource usage may trigger anti-bot measures faster than lightweight scrapers
  • Cost-sensitive deployments: Browser instances require significant CPU/memory compared to HTTP clients

Alternative Approaches

  • For simple scraping: requests, httpx, or scrapy
  • For hosted APIs: Firecrawl, Apify, ScrapingBee
  • For enterprise scale: Custom proxy rotation + dedicated scraping infrastructure

Production Considerations

Docker Deployment

Create a Dockerfile:

dockerfile
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies first so the playwright CLI is available
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers (requires the playwright package from requirements.txt)
RUN playwright install --with-deps chromium

# Copy server code
COPY mcp_server.py .

# Run the server
CMD ["python", "mcp_server.py"]

Build and run:

bash
docker build -t crawl4ai-mcp .
docker run -v ~/.config:/root/.config crawl4ai-mcp

Security Hardening

  1. Validate all URLs: Reject non-HTTP(S) schemes, internal IPs, or invalid patterns (see the sketch after this list)
  2. Rate limiting: Implement per-session and per-IP rate limits
  3. Content size limits: Cap response size to prevent memory exhaustion
  4. Sandboxing: Run in Docker or separate user namespace
  5. Logging: Log all scraping operations for audit trails
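
For point 1, a minimal validation helper might look like the sketch below; it resolves the hostname and rejects private, loopback, and link-local addresses as a basic SSRF guard. Since name resolution blocks, call it via asyncio.to_thread inside async handlers, as shown earlier.

python
import ipaddress
import socket
from urllib.parse import urlparse

def validate_url(url: str) -> None:
    """Raise ValueError if the URL is not a public HTTP(S) target."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("Only http:// and https:// URLs are allowed")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    # Resolve the hostname and reject internal addresses
    for _family, _type, _proto, _canonname, sockaddr in socket.getaddrinfo(parsed.hostname, None):
        ip = ipaddress.ip_address(sockaddr[0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            raise ValueError("URL resolves to a private or internal address")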

Monitoring

Add structured logging:

python
import logging
import json

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    logger.info(json.dumps({
        "event": "tool_call",
        "tool": name,
        "arguments": arguments
    }))
    # ... rest of implementation

Troubleshooting Common Errors

Fixing "ModuleNotFoundError: No module named 'crawl4ai'"

Problem: MCP server fails to start with module import error

⚠️ Common Error

ModuleNotFoundError: No module named 'crawl4ai'. This usually happens when the Python environment in Claude/Cursor config doesn't match your local pip install. The MCP server runs in a separate Python process, so you must install Crawl4AI in the exact same Python environment specified in your config file.

Solutions:

  • Install Crawl4AI: pip install crawl4ai
  • Check Python path in Claude Desktop config includes the correct environment
  • Use absolute paths in claude_desktop_config.json
  • Verify you're using the same Python environment where Crawl4AI is installed
bash
# Check if crawl4ai is installed
python3 -m pip list | grep crawl4ai

# Install if missing
python3 -m pip install crawl4ai

# Find Python path
which python3

Fixing "MCP server crashed: Connection closed" in Cursor

Problem: Cursor shows MCP server connection error immediately after startup

Solutions:

  • Check Cursor's MCP logs in Developer Console (View → Developer → Show Console)
  • Verify the Python path is correct for your system (use which python3 on macOS/Linux)
  • Ensure the script has execute permissions: chmod +x mcp_server.py
  • Test the server manually: python3 mcp_server.py (should wait for stdin)
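
For a stronger check than waiting for stdin, the MCP Python SDK ships a stdio client that can act as a smoke test. A sketch, assuming mcp_server.py sits in the current directory:

python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def smoke_test() -> None:
    # Launch the server over stdio, exactly the way Claude Desktop or Cursor would
    params = StdioServerParameters(command="python3", args=["mcp_server.py"])
    async with stdio_client(params) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Tools:", [tool.name for tool in tools.tools])  # expect the three scraping tools

asyncio.run(smoke_test())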

Fixing "playwright> error: Executable doesn't exist at /path/to/chromium"

Problem: Crawl4AI cannot find Playwright browsers

Solutions:

  • Install Playwright browsers: playwright install chromium
  • Install system dependencies: playwright install --with-deps chromium
  • On macOS, ensure Xcode command line tools are installed
  • On Linux, install missing libraries: apt-get install -y wget gnupg
bash
# Install Playwright browsers
playwright install chromium

# Install with system dependencies (Linux)
playwright install --with-deps chromium

# Confirm the Playwright CLI is available
playwright --version

Fixing "asyncio.run() cannot be called from a running event loop"

Problem: Server crashes when Claude calls tools

Solution: The MCP SDK already manages the event loop. Never call asyncio.run() inside tool handlers.

python
# ❌ WRONG - will crash
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    result = asyncio.run(crawler.arun(url=url))  # Crashes!
    return [TextContent(type="text", text=result)]

# ✅ CORRECT - just await inside the async handler
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    url = arguments.get("url")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)  # Works!
        return [TextContent(type="text", text=result.markdown)]

Fixing " Claude Desktop doesn't show MCP tools"

Problem: Claude Desktop starts but tools don't appear in the UI

Solutions:

  • Restart Claude Desktop completely (quit and reopen, not just window close)
  • Verify config file syntax is valid JSON (use a JSON validator)
  • Check the config file path is correct for your OS
  • Look for MCP errors in Claude Desktop → Help → Developer → Show Console
  • Ensure mcpServers key exists and your server is listed
json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "python3",
      "args": ["/Users/yourname/projects/crawl4ai-mcp/mcp_server.py"],
      "env": {
        "PYTHONPATH": "/Users/yourname/projects/crawl4ai-mcp"
      }
    }
  }
}

Fixing "TimeoutError: Tool call timed out after 30 seconds"

Problem: Scraping operations timeout on large pages

Solutions:

  • Increase timeout in Claude Desktop config (add "timeout": 120)
  • Use word_count_threshold to limit content extraction
  • Skip screenshots with screenshot=False in Crawl4AI options
  • For large sites, use CSS selectors to extract only needed content
python
# Faster scraping with limits
result = await crawler.arun(
    url=url,
    word_count_threshold=10,  # Skip boilerplate
    screenshot=False,  # Skip screenshots
    bypass_cache=False,  # Use cache when possible
    page_timeout=10000  # 10 second page load timeout
)

Summary

Building an MCP server with Crawl4AI bridges the gap between Claude's reasoning capabilities and live web data. The key implementation points are:

  1. Tool registration: Expose scraping operations as MCP tools with clear input schemas
  2. Async patterns: Use async/await throughout to avoid blocking
  3. Error handling: Catch exceptions and return helpful error messages
  4. Rate limiting: Respect robots.txt and add delays between requests
  5. Production readiness: Use Docker, validation, and monitoring for deployments

This integration is ideal for development, research, and internal tools where control and flexibility outweigh raw performance concerns. For high-scale production use cases, consider dedicated scraping APIs or more lightweight solutions.