Crawl4AI MCP Server Guide: Setup for Cursor & Claude (2026 Update)

Complete guide to building a Model Context Protocol (MCP) server with Crawl4AI. Enable Cursor IDE, Claude Desktop, and other MCP-compatible AI assistants to scrape websites and extract structured data in real-time conversations.

What is MCP and Why It Matters for Web Scraping

The Model Context Protocol (MCP) is an open standard that allows AI assistants like Claude Desktop to connect to external tools and data sources. Think of it as a universal API layer—instead of building custom integrations for each AI model, you create an MCP server once, and any MCP-compliant assistant can use it.

For web scraping, this is particularly valuable. Claude excels at analysis and reasoning but cannot fetch live web data on its own. By exposing Crawl4AI through MCP, you give Claude the ability to:

  • Extract structured data from URLs during conversations
  • Crawl multi-page sites without leaving the chat interface
  • Process JavaScript-heavy content that basic HTTP clients miss
  • Return cleaned, markdown-formatted content ready for analysis
ℹ️ Why official documentation falls short

The Crawl4AI documentation covers installation and basic usage, but MCP integration requires bridging two different frameworks—you need to understand both Crawl4AI's configuration options and MCP's server implementation patterns. The official docs don't explain how to structure an MCP server, handle tool registration, or manage error propagation between systems.

Prerequisites

Before starting, ensure you have:

  • Python 3.10+ installed
  • Crawl4AI installed (pip install crawl4ai)
  • Claude Desktop (Mac/Windows) with MCP enabled
  • Basic familiarity with async/await patterns in Python
  • A target website to scrape (respect robots.txt and rate limits)

You'll also need to install the MCP SDK:

bash
pip install mcp
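
The MCP server runs in its own Python process launched by Claude Desktop or Cursor, so both packages must be installed in the interpreter you will later point your config at. A quick sanity check, as a sketch to run with that exact interpreter:

python
# Run this with the same python3 you will reference in claude_desktop_config.json
import sys

assert sys.version_info >= (3, 10), "Python 3.10+ required (the server uses `dict | None` type hints)"

import crawl4ai  # must import cleanly in this environment
import mcp       # must import cleanly in this environment

print(f"OK, using interpreter: {sys.executable}")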

Understanding the MCP + Crawl4AI Architecture

The integration follows this flow:

text
Claude Desktop → MCP Client → MCP Server (your code) → Crawl4AI → Target Website
     ↓                    ↓                    ↓               ↓            ↓
 User request       MCP protocol         Tool handler    Async crawler   HTTP request
                     (JSON-RPC)

Your MCP server exposes "tools" that Claude can call. Each tool corresponds to a scraping operation—basic scraping, CSS selector extraction, or multi-page crawling. When Claude needs web data, it calls your tool, your server invokes Crawl4AI, and results return through the MCP protocol.
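
Under the hood, the client and server exchange JSON-RPC 2.0 messages over stdio. The SDK handles the framing for you, but it helps to know the rough shape of a tool call; shown here as illustrative Python dicts, not code you need to write:

python
# Approximate shape of an MCP tool call for the scrape_url tool defined below
tool_call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "scrape_url",
        "arguments": {"url": "https://example.com"},
    },
}

# The server answers with a list of content blocks (TextContent in this guide)
tool_call_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [
            {"type": "text", "text": "# Scraped Content from https://example.com ..."}
        ]
    },
}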

Basic MCP Server Implementation

Create a new file mcp_server.py:

python
#!/usr/bin/env python3
"""
Crawl4AI MCP Server
Exposes web scraping tools to Claude Desktop via Model Context Protocol
"""

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from mcp.server.models import InitializationOptions
from mcp.server import NotificationOptions, Server
from mcp.server.stdio import stdio_server
from mcp.types import (
    Resource,
    Tool,
    TextContent,
    ImageContent,
    EmbeddedResource,
)
from typing import Any, Optional
import json
import asyncio

app = Server("crawl4ai-server")

@app.list_tools()
async def handle_list_tools() -> list[Tool]:
    """
    List available scraping tools.
    Claude Desktop uses this to discover what operations are available.
    """
    return [
        Tool(
            name="scrape_url",
            description="Scrape a single URL and return content as markdown",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "URL to scrape (must include http:// or https://)"
                    },
                    "wait_for_selector": {
                        "type": "string",
                        "description": "Optional CSS selector to wait for before scraping"
                    }
                },
                "required": ["url"]
            }
        ),
        Tool(
            name="extract_with_css",
            description="Extract specific elements using CSS selectors",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "URL to scrape"
                    },
                    "selector": {
                        "type": "string",
                        "description": "CSS selector for elements to extract"
                    },
                    "attribute": {
                        "type": "string",
                        "description": "Attribute to extract (text, href, src, etc.)",
                        "default": "text"
                    }
                },
                "required": ["url", "selector"]
            }
        ),
        Tool(
            name="crawl_multiple",
            description="Crawl multiple pages from a website",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "Starting URL"
                    },
                    "max_pages": {
                        "type": "integer",
                        "description": "Maximum number of pages to crawl",
                        "default": 5
                    }
                },
                "required": ["url"]
            }
        )
    ]

@app.call_tool()
async def handle_call_tool(name: str, arguments: dict | None) -> list[TextContent | ImageContent | EmbeddedResource]:
    """
    Handle tool calls from Claude Desktop.
    This is where the actual scraping happens.
    """
    if arguments is None:
        arguments = {}

    try:
        if name == "scrape_url":
            return await scrape_url(arguments.get("url"), arguments.get("wait_for_selector"))
        elif name == "extract_with_css":
            return await extract_with_css(
                arguments.get("url"),
                arguments.get("selector"),
                arguments.get("attribute", "text")
            )
        elif name == "crawl_multiple":
            return await crawl_multiple(
                arguments.get("url"),
                arguments.get("max_pages", 5)
            )
        else:
            raise ValueError(f"Unknown tool: {name}")
    except Exception as e:
        return [TextContent(
            type="text",
            text=f"Error during scraping: {str(e)}\n\nTip: Check if the URL is accessible and respects robots.txt"
        )]

async def scrape_url(url: str, wait_for_selector: Optional[str] = None) -> list[TextContent]:
    """Scrape a single URL and return markdown content"""
    if not url:
        return [TextContent(type="text", text="Error: URL is required")]

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=10,
            exclude_external_links=False,
            wait_for_selector=wait_for_selector
        )

        if result.success:
            markdown_content = f"""# Scraped Content from {url}

{result.markdown}

---

## Metadata
- **Status**: Success
- **Word Count**: {result.extracted_content_word_count if hasattr(result, 'extracted_content_word_count') else 'N/A'}
- **Links Found**: {len(result.links.get('internal', [])) + len(result.links.get('external', [])) if hasattr(result, 'links') else 'N/A'}
- **Screenshot**: {'Available' if getattr(result, 'screenshot', None) else 'Not available'}
"""
            return [TextContent(type="text", text=markdown_content)]
        else:
            return [TextContent(
                type="text",
                text=f"Failed to scrape {url}\n\nError: {result.error_message if hasattr(result, 'error_message') else 'Unknown error'}"
            )]

async def extract_with_css(url: str, selector: str, attribute: str = "text") -> list[TextContent]:
    """Extract specific elements using CSS selectors"""
    if not url or not selector:
        return [TextContent(type="text", text="Error: URL and selector are required")]

    async with AsyncWebCrawler(verbose=True) as crawler:
        # Use CSS extraction strategy
        result = await crawler.arun(
            url=url,
            css_selector=selector,
            bypass_cache=True
        )

        if result.success:
            extracted_data = result.extracted_content if hasattr(result, 'extracted_content') else result.markdown
            content = f"""# CSS Extraction Results from {url}

**Selector**: `{selector}`
**Attribute**: `{attribute}`

## Extracted Data
```
{extracted_data}
```

---

## Metadata
- **Status**: Success
- **Elements Found**: {len(extracted_data) if isinstance(extracted_data, list) else 'N/A'}
"""
            return [TextContent(type="text", text=content)]
        else:
            return [TextContent(
                type="text",
                text=f"CSS extraction failed for {url}\n\nError: {result.error_message if hasattr(result, 'error_message') else 'Unknown error'}"
            )]

async def crawl_multiple(url: str, max_pages: int = 5) -> list[TextContent]:
    """Crawl multiple pages from a starting URL"""
    if not url:
        return [TextContent(type="text", text="Error: URL is required")]

    results = []
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Basic crawl implementation
        result = await crawler.arun(url=url, word_count_threshold=10)

        if result.success:
            results.append(f"## Page 1: {url}\n{result.markdown}\n")

            # Get internal links for further crawling
            if hasattr(result, 'links'):
                internal_links = result.links.get('internal', [])[:max_pages-1]

                for i, link in enumerate(internal_links, 2):
                    if i > max_pages:
                        break
                    try:
                        page_result = await crawler.arun(url=link, word_count_threshold=10)
                        if page_result.success:
                            results.append(f"## Page {i}: {link}\n{page_result.markdown}\n")
                    except Exception as e:
                        results.append(f"## Page {i}: {link}\nError: {str(e)}\n")

        if not results:
            return [TextContent(
                type="text",
                text=f"Failed to crawl {url}\n\nError: {result.error_message if hasattr(result, 'error_message') else 'Unknown error'}"
            )]

        content = f"""# Multi-Page Crawl Results from {url}

**Pages Crawled**: {len(results)}

{chr(10).join(results)}

---

## Metadata
- **Total Pages**: {len(results)}
- **Starting URL**: {url}
- **Max Pages Limit**: {max_pages}
"""
        return [TextContent(type="text", text=content)]

async def main():
    """Run the MCP server"""
    async with stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            InitializationOptions(
                server_name="crawl4ai-server",
                server_version="1.0.0",
                capabilities=app.get_capabilities(
                    notification_options=NotificationOptions(),
                    experimental_capabilities={}
                )
            )
        )

if __name__ == "__main__":
    asyncio.run(main())

Configuring Claude Desktop

To connect your MCP server to Claude Desktop, create or edit the Claude config file:

  • Mac: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json

Add this configuration:

json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "python3",
      "args": ["/absolute/path/to/your/mcp_server.py"],
      "env": {
        "PYTHONPATH": "/absolute/path/to/your/project"
      }
    }
  }
}

Replace /absolute/path/to/your/ with your actual file paths.
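
The most common misconfiguration is pointing "command" at a different Python than the one where crawl4ai and mcp are installed. A quick way to get a path that is guaranteed to match, run from that environment:

python
# Prints the absolute interpreter path; use it as the "command" value in the config
import sys
print(sys.executable)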

Testing the Integration

After restarting Claude Desktop:

  1. Open a new conversation
  2. Try a command like:
    "Scrape https://example.com and summarize the main points"
  3. Claude will automatically use the scrape_url tool
  4. For targeted extraction:
    "Extract all product prices from https://shop.example.com using the selector .price"
ℹ️ Debugging tip

Check Claude Desktop's developer console (View → Developer → Show Console) for MCP server logs and error messages.

Common Mistakes and How to Avoid Them

Mistake 1: Blocking the Event Loop

Using synchronous blocking operations in async handlers

Problem: Using synchronous time.sleep() or blocking I/O in async handlers

❌ Wrong - blocks the entire server

python
async def scrape_url(url: str):
    time.sleep(5)  # This blocks everything
    return await crawler.arun(url=url)

✅ Correct - use asyncio.sleep

python
async def scrape_url(url: str):
    await asyncio.sleep(5)  # Non-blocking
    return await crawler.arun(url=url)

Mistake 2: Not Handling Errors Gracefully

Unhandled exceptions crash the MCP server

Problem: Unhandled exceptions crash the MCP server and disconnect Claude

❌ Wrong - crashes on invalid URL

python
async def scrape_url(url: str):
    result = await crawler.arun(url=url)
    return result.markdown

✅ Correct - wrap in try/except

python
async def scrape_url(url: str):
    try:
        if not url.startswith(('http://', 'https://')):
            raise ValueError("URL must start with http:// or https://")

        result = await crawler.arun(url=url)

        if not result.success:
            return f"Scraping failed: {result.error_message}"

        return result.markdown

    except Exception as e:
        return f"Error: {str(e)}. Please check the URL and try again."

Mistake 3: Ignoring Rate Limits and robots.txt

Aggressive crawling that gets you blocked

Problem: Aggressive crawling that gets you blocked

❌ Wrong - no delays

python
for url in urls:
    await crawler.arun(url=url)

✅ Correct - respectful crawling

python
import asyncio
for url in urls:
    await crawler.arun(url=url)
    await asyncio.sleep(2)  # 2 second delay between requests

Always check robots.txt before crawling:

python
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if not rp.can_fetch("*", url):
    return "Error: This URL is disallowed by robots.txt"

Mistake 4: Returning Malformed Tool Responses

MCP expects specific response types

Problem: MCP expects specific response types

❌ Wrong - returning raw strings

python
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    return "Some text content"

✅ Correct - wrapping in TextContent

python
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    return [TextContent(type="text", text="Some text content")]

Mistake 5: Not Cleaning Extracted Content

Returning raw HTML with scripts, navbars, and footers

Problem: Returning raw HTML with scripts, navbars, and footers

✅ Use Crawl4AI's built-in content extraction

python
result = await crawler.arun(
    url=url,
    word_count_threshold=10,  # Filter out boilerplate
    exclude_external_links=False,
    exclude_social_media_links=True
)

When Crawl4AI is a Good Fit (And When It's Not)

Good Fit ✅

  • JavaScript-heavy sites: Crawl4AI runs Playwright/Chromium, so it handles SPAs, React apps, and dynamic content
  • Complex extraction needs: CSS selectors, XPath, LLM-based extraction
  • Authenticated content: Can handle cookies and sessions
  • Screenshot needs: Can capture page screenshots alongside content
  • Local development: Free and self-hosted for testing
  • Cost-effective alternative: Unlike hosted APIs such as Firecrawl, Crawl4AI processes pages locally with no API credits, which suits power users who want full control over their web scraping pipeline.

Not a Good Fit ❌

  • High-scale production: Running browser instances is resource-intensive; consider dedicated APIs or hosted solutions
  • Real-time requirements: Browser spawning adds 1-3 seconds per request
  • Simple static sites: Overkill for basic HTML—use requests + BeautifulSoup instead
  • Strict rate limits: Heavy resource usage may trigger anti-bot measures faster than lightweight scrapers
  • Cost-sensitive deployments: Browser instances require significant CPU/memory compared to HTTP clients

Alternative Approaches

  • For simple scraping: requests, httpx, or scrapy
  • For hosted APIs: Firecrawl, Apify, ScrapingBee
  • For enterprise scale: Custom proxy rotation + dedicated scraping infrastructure

Production Considerations

Docker Deployment

Create a Dockerfile:

dockerfile
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies first so the playwright CLI is available
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Install Playwright browsers (requires the playwright package from requirements.txt)
RUN playwright install --with-deps chromium

# Copy server code
COPY mcp_server.py .

# Run the server
CMD ["python", "mcp_server.py"]

Build and run:

bash
docker build -t crawl4ai-mcp .
docker run -v ~/.config:/root/.config crawl4ai-mcp

Security Hardening

  1. Validate all URLs: Reject non-HTTP(S) schemes, internal IPs, or invalid patterns (see the sketch after this list)
  2. Rate limiting: Implement per-session and per-IP rate limits
  3. Content size limits: Cap response size to prevent memory exhaustion
  4. Sandboxing: Run in Docker or separate user namespace
  5. Logging: Log all scraping operations for audit trails
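
For point 1, a minimal validation helper might look like the sketch below; it resolves the hostname and rejects private, loopback, and link-local addresses as a basic SSRF guard. Since name resolution blocks, call it via asyncio.to_thread inside async handlers, as shown earlier.

python
import ipaddress
import socket
from urllib.parse import urlparse

def validate_url(url: str) -> None:
    """Raise ValueError if the URL is not a public HTTP(S) target."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("Only http:// and https:// URLs are allowed")
    if not parsed.hostname:
        raise ValueError("URL has no hostname")
    # Resolve the hostname and reject internal addresses
    for _family, _type, _proto, _canonname, sockaddr in socket.getaddrinfo(parsed.hostname, None):
        ip = ipaddress.ip_address(sockaddr[0])
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            raise ValueError("URL resolves to a private or internal address")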

Monitoring

Add structured logging:

python
import logging
import json

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    logger.info(json.dumps({
        "event": "tool_call",
        "tool": name,
        "arguments": arguments
    }))
    # ... rest of implementation

Troubleshooting Common Errors

Fixing "ModuleNotFoundError: No module named 'crawl4ai'"

Problem: MCP server fails to start with module import error

⚠️ Common Error

ModuleNotFoundError: No module named 'crawl4ai'. This usually happens when the Python environment in Claude/Cursor config doesn't match your local pip install. The MCP server runs in a separate Python process, so you must install Crawl4AI in the exact same Python environment specified in your config file.

Solutions:

  • Install Crawl4AI: pip install crawl4ai
  • Check Python path in Claude Desktop config includes the correct environment
  • Use absolute paths in claude_desktop_config.json
  • Verify you're using the same Python environment where Crawl4AI is installed
bash
# Check if crawl4ai is installed
python3 -m pip list | grep crawl4ai

# Install if missing
python3 -m pip install crawl4ai

# Find Python path
which python3

Fixing "MCP server crashed: Connection closed" in Cursor

Problem: Cursor shows MCP server connection error immediately after startup

Solutions:

  • Check Cursor's MCP logs in Developer Console (View → Developer → Show Console)
  • Verify the Python path is correct for your system (use which python3 on macOS/Linux)
  • Ensure the script has execute permissions: chmod +x mcp_server.py
  • Test the server manually: python3 mcp_server.py (should wait for stdin)
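
For a stronger check than waiting for stdin, the MCP Python SDK ships a stdio client that can act as a smoke test. A sketch, assuming mcp_server.py sits in the current directory:

python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def smoke_test() -> None:
    # Launch the server over stdio, exactly the way Claude Desktop or Cursor would
    params = StdioServerParameters(command="python3", args=["mcp_server.py"])
    async with stdio_client(params) as (read_stream, write_stream):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("Tools:", [tool.name for tool in tools.tools])  # expect the three scraping tools

asyncio.run(smoke_test())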

Fixing "playwright> error: Executable doesn't exist at /path/to/chromium"

Problem: Crawl4AI cannot find Playwright browsers

Solutions:

  • Install Playwright browsers: playwright install chromium
  • Install system dependencies: playwright install --with-deps chromium
  • On macOS, ensure Xcode command line tools are installed
  • On Linux, install missing libraries: apt-get install -y wget gnupg
bash
# Install Playwright browsers
playwright install chromium

# Install with system dependencies (Linux)
playwright install --with-deps chromium

# Confirm the Playwright CLI is available
playwright --version

Fixing "asyncio.run() cannot be called from a running event loop"

Problem: Server crashes when Claude calls tools

Solution: The MCP SDK already manages the event loop. Never call asyncio.run() inside tool handlers.

python
# ❌ WRONG - will crash
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    result = asyncio.run(crawler.arun(url=url))  # Crashes!
    return [TextContent(type="text", text=result)]

# ✅ CORRECT - just await inside the async handler
@app.call_tool()
async def handle_call_tool(name: str, arguments: dict):
    url = arguments.get("url")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)  # Works!
        return [TextContent(type="text", text=result.markdown)]

Fixing " Claude Desktop doesn't show MCP tools"

Problem: Claude Desktop starts but tools don't appear in the UI

Solutions:

  • Restart Claude Desktop completely (quit and reopen, not just window close)
  • Verify config file syntax is valid JSON (use a JSON validator)
  • Check the config file path is correct for your OS
  • Look for MCP errors in Claude Desktop → Help → Developer → Show Console
  • Ensure mcpServers key exists and your server is listed
json
{
  "mcpServers": {
    "crawl4ai": {
      "command": "python3",
      "args": ["/Users/yourname/projects/crawl4ai-mcp/mcp_server.py"],
      "env": {
        "PYTHONPATH": "/Users/yourname/projects/crawl4ai-mcp"
      }
    }
  }
}

Fixing "TimeoutError: Tool call timed out after 30 seconds"

Problem: Scraping operations timeout on large pages

Solutions:

  • Increase timeout in Claude Desktop config (add "timeout": 120)
  • Use word_count_threshold to limit content extraction
  • Skip screenshots with screenshot=False in Crawl4AI options
  • For large sites, use CSS selectors to extract only needed content
python
# Faster scraping with limits
result = await crawler.arun(
    url=url,
    word_count_threshold=10,  # Skip boilerplate
    screenshot=False,  # Skip screenshots
    bypass_cache=False,  # Use cache when possible
    page_timeout=10000  # 10 second page load timeout
)

Summary

Building an MCP server with Crawl4AI bridges the gap between Claude's reasoning capabilities and live web data. The key implementation points are:

  1. Tool registration: Expose scraping operations as MCP tools with clear input schemas
  2. Async patterns: Use async/await throughout to avoid blocking
  3. Error handling: Catch exceptions and return helpful error messages
  4. Rate limiting: Respect robots.txt and add delays between requests
  5. Production readiness: Use Docker, validation, and monitoring for deployments

This integration is ideal for development, research, and internal tools where control and flexibility outweigh raw performance concerns. For high-scale production use cases, consider dedicated scraping APIs or more lightweight solutions.