Crawl4AI

AI-optimized web scraping with JavaScript rendering, LLM-friendly markdown conversion, and RAG pipeline integration.

Overview

Crawl4AI is an open-source web scraping solution optimized for AI applications:
- LLM-optimized output - Clean markdown perfect for RAG pipelines
- JavaScript rendering - Playwright-based for modern SPAs
- Fast and efficient - Parallel crawling with smart caching
- Structured extraction - JSON schema-based data extraction
- 100% private - No telemetry, all processing local
- CPU-based - No GPU required

Access

| Endpoint | URL |
|---|---|
| Web UI / API | https://crawler.haiven.local |
| Direct Port | http://localhost:11235 |
| Health Check | https://crawler.haiven.local/health |

Quick Start

cd /mnt/apps/docker/ai/crawl4ai

# Start the service
docker compose up -d

# Watch logs (first start downloads Chromium browser ~200MB)
docker compose logs -f crawl4ai

Note: The first startup takes 5-10 minutes while the Chromium browser is downloaded.
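
Before sending crawl requests, you can poll the health endpoint until the container reports ready. This is a minimal sketch that assumes the direct port mapping shown under Access:

# Wait until the health endpoint responds
until curl -sf http://localhost:11235/health > /dev/null; do
  echo "Waiting for crawl4ai to become ready..."
  sleep 10
done
echo "crawl4ai is ready"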

API Usage

Health Check

curl http://localhost:11235/health

Basic Crawl

curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"]}' | jq '.results[0].markdown.raw_markdown'

Crawl with JavaScript Rendering

curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://react-app.com"],
    "js_code": "await new Promise(r => setTimeout(r, 3000));",
    "wait_for": "css:.content"
  }' | jq '.results[0].markdown.raw_markdown'
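The wait_for parameter above uses a css: prefix to wait for a selector to appear; Crawl4AI also accepts a js: prefix that waits for a JavaScript expression to become truthy. A variant of the same request (this assumes the deployed version supports the js: form):

# Wait for a JavaScript condition instead of a CSS selector
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://react-app.com"],
    "wait_for": "js:() => document.querySelectorAll(\".content\").length > 0"
  }' | jq '.results[0].markdown.raw_markdown'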

Batch Crawl (Multiple URLs)

curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com",
      "https://example.org"
    ]
  }' | jq '.results[] | {url: .url, title: .metadata.title}'
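For batch jobs it is often handy to keep the full JSON response and summarize it afterwards. A sketch, with the output path under Storage below chosen for illustration:

# Save the full batch response, then print a per-URL summary from the saved file
out=/mnt/storage/crawler/outputs/batch-$(date +%Y%m%d-%H%M%S).json
curl -s -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com", "https://example.org"]}' \
  > "$out"
jq '.results[] | {url: .url, title: .metadata.title}' "$out"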

Configuration

See .env.example for all configuration options.

| Variable | Default | Description |
|---|---|---|
| CRAWL4AI_API_TOKEN | (required) | API authentication token |
| CRAWL4AI_MAX_CONCURRENT_REQUESTS | 10 | Maximum parallel crawl requests |
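
When CRAWL4AI_API_TOKEN is set, requests must authenticate. The sketch below assumes the token is accepted as a bearer header (check the deployed Crawl4AI version's documentation, as the auth scheme has changed between releases); the token value is a placeholder:

# Authenticated request; replace the placeholder with the token from .env
export CRAWL4AI_API_TOKEN=your-token-here
curl -s -X POST http://localhost:11235/crawl \
  -H "Authorization: Bearer ${CRAWL4AI_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"]}' | jq '.results[0].metadata.title'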

Storage

| Path | Purpose |
|---|---|
| /mnt/storage/crawler/cache | Cached web pages |
| /mnt/storage/crawler/outputs | Crawled content (markdown, JSON) |
| /mnt/storage/crawler/sessions | Browser sessions and cookies |
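
The cache and outputs directories grow with use; a quick way to check how much space each one takes:

# Show disk usage per crawler storage directory
du -sh /mnt/storage/crawler/cache /mnt/storage/crawler/outputs /mnt/storage/crawler/sessions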

Maintenance

View Logs

docker logs -f crawl4ai

Clear Cache

rm -rf /mnt/storage/crawler/cache/*

Update Service

cd /mnt/apps/docker/ai/crawl4ai
docker compose pull
docker compose down
docker compose up -d
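
After the stack comes back up, it is worth confirming the new image is in use and the API responds (docker compose images lists the image each service is running):

# Verify the running image and that the API answers
docker compose images crawl4ai
curl -sf http://localhost:11235/health && echo "OK"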

Resources

- Crawl4AI project: https://github.com/unclecode/crawl4ai
- Crawl4AI documentation: https://docs.crawl4ai.com

Generated by haiven-service-onboarding plugin