MabudAlam/QuickCrawl

Quickcrawl

🚀 Web Scraping API for AI Agents: scrape, crawl, and map websites with a single binary.

Playground: Try it online

Go Gin MCP Chrome Chrome LightPanda DuckDuckGo


⚡ Overview

Quickcrawl is a Go-based web scraping service for AI agents. Whether you are scraping a single page, crawling an entire website, or mapping its link structure, Quickcrawl handles the heavy lifting with a multi-layered architecture that combines HTTP fetching, browser automation, and LLM-powered structured extraction.


✨ Features

| Feature | Description |
| --- | --- |
| 🌐 Web Scraping | Convert any URL to Markdown, HTML, Plain Text, or Links |
| 🔄 Async Crawling | BFS website crawler with depth/page limits and rate limiting |
| 🗺️ URL Mapping | Discover all URLs on a site instantly without scraping content |
| 🧠 JavaScript Rendering | Auto-detect SPAs and render via LightPanda or Chrome |
| 📊 LLM Extraction | Send a JSON schema, get validated structured data back |
| 🔍 Web Search | DuckDuckGo-powered search for AI agent integration |
| 🤖 MCP Server | Built-in stdio transport for seamless AI agent integration |
| 📦 Multi-format Output | Markdown, HTML, RawHTML, PlainText, Links, JSON |

📊 Benchmark

85.4% scrape success rate across 1,000 diverse URLs from the Firecrawl Scrape Content Dataset v1

Tested against Firecrawl v2.5 on the same dataset:

| Metric | Quickcrawl | Firecrawl |
| --- | --- | --- |
| Coverage | 85.4% | 79.4% |
| Avg Scrape Latency | 1,841.9 ms | 4,048 ms |
| Self-hosting | Single binary | Multi-container (~4GB+) |
| Cost / 1K scrapes | $0 (self-hosted) | $9 |

Quality Metrics

| Metric | Value |
| --- | --- |
| Content Recall | 44.03% |
| Noise Rejection | 86.65% |
| Content Matches | 376 |
| Noise Leaks | 114 |
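
The figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers copied from the two tables. Treating both quality metrics as ratios over the successfully scraped pages is an inference from the figures, not something the benchmark states.

```python
# Figures copied from the benchmark tables above.
total_urls = 1000
coverage_pct = 85.4
quickcrawl_ms = 1841.9
firecrawl_ms = 4048.0
content_matches = 376
noise_leaks = 114

# 85.4% of 1,000 URLs were scraped successfully.
scraped_pages = round(total_urls * coverage_pct / 100)

# Ratio of average scrape latencies (Firecrawl / Quickcrawl).
speedup = firecrawl_ms / quickcrawl_ms

# ASSUMPTION: both quality metrics are ratios over the scraped pages.
content_recall_pct = 100 * content_matches / scraped_pages
noise_rejection_pct = 100 * (1 - noise_leaks / scraped_pages)

print(scraped_pages, round(speedup, 2),
      round(content_recall_pct, 2), round(noise_rejection_pct, 2))
```

Under that assumption the computed recall and rejection round to the reported 44.03% and 86.65%, and the latency ratio rounds to about 2.2x.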

Use Cases

Build RAG pipelines with clean LLM-ready markdown, give AI agents real-time web access, monitor content changes, extract structured data, convert HTML to clean markdown, or archive web pages at scale.


🤖 MCP Server

The Quickcrawl MCP server gives AI agents web scraping capabilities.

Available Tools

| Tool | Description |
| --- | --- |
| scrape | Scrape a single URL |
| crawl | Start async website crawl |
| check_crawl_status | Check crawl job status |
| cancel_crawl | Cancel running crawl |
| map | Discover URLs on a site |
| site_map | Discover URLs without scraping content (sitemap-aware) |
| search | Search DuckDuckGo |

OpenCode Integration

Add Quickcrawl to your OpenCode configuration:

{
  "mcp": {
    "quickcrawl": {
      "type": "local",
      "command": ["npx", "-y", "@mabudalam/quickcrawl-mcp"],
      "enabled": true
    }
  }
}

Configuration

Renderer selection in MCP:

  • Set renderJs: true to use the configured Chrome browser (chromedp) for the request
  • Set renderJs: false (default) to use the plain HTTP fetcher
  • When [renderer.chrome].ws_url is unset in quickcrawl.toml, MCP auto-launches a local LightPanda and uses its CDP endpoint. The launched process is killed on MCP shutdown.

Example MCP tool arguments:

{
  "url": "https://www.notion.so/",
  "formats": ["markdown"],
  "renderJs": true
}

💻 CLI

The Quickcrawl CLI provides standalone command-line access to all features. No server or Python needed.

Installation

# Quick install (recommended)
curl -fsSL https://raw.githubusercontent.com/MabudAlam/quickcrawl/main/install.sh | sh

# Or from source (go install needs a version suffix when run outside a module)
go install github.com/MabudAlam/quickcrawl/cli@latest

# Download from GitHub releases
curl -L https://github.com/MabudAlam/QuickCrawl/releases/latest/download/quickcrawl_darwin_arm64.tar.gz | tar -xz
./quickcrawl --help

Usage

# Scrape a single URL
quickcrawl scrape https://example.com
quickcrawl scrape https://example.com --formats html,markdown

# Crawl a website
quickcrawl crawl https://example.com --max-pages 10 --max-depth 3

# Discover URLs (without scraping content)
quickcrawl map https://example.com --max-depth 2

# Search DuckDuckGo
quickcrawl search "golang web scraping"
quickcrawl search "python" --scrape --formats markdown

CLI Examples

See cli/cmd/ for all subcommands.


🐍 Python SDK

Python SDK for Quickcrawl: scrape, crawl, and map websites from Python code.

Installation

# From PyPI (coming soon)
pip install quickcrawl

# From GitHub
pip install git+https://github.com/MabudAlam/quickcrawl.git@python-sdk#subdirectory=python

# Or clone and install
git clone https://github.com/MabudAlam/quickcrawl
cd quickcrawl/python
pip install -e .

Quick Start

from quickcrawl import QuickCrawlClient

# CLI mode (zero config, auto-downloads binary)
with QuickCrawlClient() as client:
    result = client.scrape("https://example.com")
    print(result["markdown"])

# HTTP mode (connect to deployed server)
client = QuickCrawlClient(api_url="https://your-server.com", api_key="...")
result = client.scrape("https://example.com")

Python SDK Examples

See python/examples/.


🏗️ Architecture

graph TD
    Client["🖥️ Client (HTTP / MCP)"]
    Server["🚀 Quickcrawl Server"]

    Client --> Server

    Server --> Router["📡 Gin Router"]
    Router --> Handlers["🔧 API Handlers"]

    Handlers --> Renderer["🎨 Renderer Layer"]
    Renderer --> HTTPFetcher["🌐 HTTP Fetcher"]
    Renderer --> Browser["🌎 Browser (CDP)"]
    Browser --> LightPanda["🐼 LightPanda"]
    Browser --> Chrome["🔵 Chrome DevTools"]

    Renderer --> Extractor["📝 Extractor"]
    Extractor --> Markdown["📄 Markdown"]
    Extractor --> HTML["🌐 HTML"]
    Extractor --> PlainText["📃 Plain Text"]
    Extractor --> Links["🔗 Links"]

    Handlers --> Crawler["🔄 Crawler"]
    Crawler --> Robots["🤖 Robots.txt"]
    Crawler --> Sitemap["🗺️ Sitemap"]
    Crawler --> RateLimit["⚡ Rate Limiter"]

    Handlers --> Search["🔍 DuckDuckGo Search"]
    Search --> LLM["🧠 LLM Extraction"]
    LLM --> JSONSchema["📋 JSON Schema Output"]

Renderer Selection

Request
     ↓
*core.Scraper  (single render path; chromedp + shared HTTPFetcher)
     ↓
     ├── renderJs=false → renderer.HTTPFetcher  (plain HTTP GET, no JS)
     │
     └── renderJs=true  → chromedp RemoteAllocator → persistent Chrome
                          (JS rendered, anti-bot stealth, SPA readiness poll)

Both the cli and mcp entry points auto-launch a local LightPanda when no Chrome WS URL is configured. The HTTP server does not: it requires a user-supplied WS URL and falls back to HTTP-only when none is configured.
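
The selection rules above can be summarized as a small decision function. This is only an illustrative model of the behavior described in this section, not Quickcrawl's Go code: the function name `select_fetch_path` and the returned labels are hypothetical.

```python
from typing import Optional


def select_fetch_path(entry_point: str, render_js: bool,
                      chrome_ws_url: Optional[str]) -> str:
    """Return a label for the fetch path described above (illustrative only)."""
    if not render_js:
        # renderJs=false: plain HTTP GET, no JavaScript execution.
        return "http"
    if chrome_ws_url:
        # A configured WS URL: chromedp attaches to that browser.
        return "browser:configured"
    if entry_point in ("cli", "mcp"):
        # cli and mcp launch a local LightPanda and use its CDP endpoint.
        return "browser:auto-lightpanda"
    # HTTP server without a WS URL: no browser is available.
    return "browser-unavailable"


print(select_fetch_path("mcp", True, None))
print(select_fetch_path("server", True, None))
```

How the server reports the last case to the caller is covered by the error table in the API section.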

Scrape API End-to-End Flow

When a client calls POST /v1/scrape, Quickcrawl executes the following pipeline:

HTTP POST /v1/scrape
    │
    ▼
handlers.Scrape()                       [internal/api/handlers/handler.go:64]
    │  • Parse + JSON-decode into core.ScrapeRequest
    │  • Validate URL (http/https required, non-empty)
    │  • Default formats to ["markdown"] if empty
    │  • Optional robots.txt check (config: crawler.respect_robots_txt)
    │
    ▼
core.Scraper.Scrape()                  [internal/core/scraper.go:69]
    │  • resolveRenderJS()   - request override, defaults to false
    │  • resolveWaitMs()     - request override, defaults to 0
    │  • resolveFormats()    - string→types.OutputFormat conversion
    │
    ▼
core.Renderer.FetchOrchestrator()      [internal/core/renderer.go:213]
    │
    ├── (renderJs=false) ── HTTP path ──────────────────┐
    │                                                    │
    │                                     renderer.HTTPFetcher.Fetch()
    │                                     [internal/renderer/http.go]
    │                                     • HTTP GET with stealth headers
    │                                     • Returns FetchResult{HTML, StatusCode}
    │
    └── (renderJs=true)  ── CDP path ───────────────────┐
                                                      ▼
core.Renderer.fetchWithCDPBrowser()   [internal/core/renderer.go:334]
    │  • Acquire per-host concurrency slot
    │  • Create isolated browser context (chromedp.NewContext)
    │  • Apply page_timeout_ms to chromedp.Run
    │  • Action sequence:
    │      - enableNetworkTracking(networkBundle)
    │      - stealthInjectionAction()           (when crawler.stealth.enabled)
    │      - navigateIgnoringHTTPStatus()
    │      - dismissCookieBannersFastAction()   (when waitMs == 0)
    │      - WaitForSPAReady()                  (polls for content readiness,
    │                                            network-idle, or selector hit)
    │      - autoScrollAction()                 (when waitMs == 0 + lazy markers)
    │      - OuterHTML of <head> + <body>
    │  • Anti-bot challenge detection (status 4xx/5xx)
    │  • Returns FetchResult{HTML, FinalURL, StatusCode, ContentType}
    │
    ▼
core.Extractor.Extract()               [internal/core/extractor.go]
    │  • ExtractMetadata - title, description, OG tags, canonical, language
    │  • preprocessHTML  - strip head, cleanNoise, applyNoisePatterns
    │                       (and IncludeTags / ExcludeTags / CSSSelector if set)
    │  • postprocessHTML - sanitize, dedupe, normalize whitespace
    │  • HTMLToMarkdown  - primary conversion (fullClean → structural → plaintext)
    │  • HTMLToPlaintext
    │  • ExtractLinks, ExtractImageURLs
    │
    ▼ Returns: core.ScrapeData{Markdown, HTML, PlainText, Links, ImageLinks, Metadata}
    │
    ▼
(Optional) LLM Structured Extraction   [internal/core/llm.go]
    │  • Triggered when formats contains "json" and [extraction.llm] is configured
    │  • buildLLMInput → callOpenAI(chat/completions) → validateDataAgainstSchema
    │  • Populates data.JSON
    │
    ▼
handlers.Scrape()
    │  • If statusCode >= 400 and body < 200 chars → surface as failure
    │  • Else return success with data + warning
    │
    ▼
c.JSON(http.StatusOK, APIResponse{ScrapeData})
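
The first two stages of the pipeline (URL validation and request defaults) can be sketched in a few lines. This is a Python approximation of the behavior listed above, not the Go handler. The helper name `normalize_scrape_request` and its error messages are hypothetical, while the defaults (`["markdown"]`, `renderJs=false`, `waitFor=0`) come from the flow above.

```python
from urllib.parse import urlparse


def normalize_scrape_request(body: dict) -> dict:
    """Validate the URL and fill in the documented defaults (illustrative)."""
    url = body.get("url")
    if not isinstance(url, str) or not url.strip():
        raise ValueError("invalid_request: url is required")

    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("invalid_request: url must be an absolute http(s) URL")

    normalized = dict(body)
    # An empty or missing formats list defaults to markdown.
    normalized["formats"] = body.get("formats") or ["markdown"]
    # Both overrides fall back to the documented defaults.
    normalized["renderJs"] = bool(body.get("renderJs", False))
    normalized["waitFor"] = int(body.get("waitFor", 0))
    return normalized


print(normalize_scrape_request({"url": "https://example.com"}))
```

The optional robots.txt check and everything after `core.Scraper.Scrape()` are deliberately left out of this sketch.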

Crawl API End-to-End Flow

HTTP POST /v1/crawl
    │
    ▼
handlers.StartCrawl()                  [internal/api/handlers/handler.go:166]
    │  • Parse + JSON-decode into types.CrawlRequest
    │  • Validate URL, maxDepth (0-10), maxPages (1-1000)
    │  • Default maxDepth/maxPages from config
    │  • Reject formats=["json"] with 400 (use /v1/scrape for LLM extraction)
    │  • Generate job ID, store in AppState.CrawlJobs
    │
    ▼
crawler.RunCrawl()                     [internal/crawler/crawl.go]
    │  • BFS from seed URL, respecting same-origin
    │  • robots.txt check per page (if enabled)
    │  • Per-host rate limiter (crawler.requests_per_second)
    │  • Per-host + global concurrency slots
    │  • For each page:
    │      - core.Scraper.Scrape() (same pipeline as above)
    │      - Stealth jitter added to inter-request sleep
    │      - Update CrawlState via stateCh
    │
    ▼ Returns: types.CrawlState{Total, Completed, Data[], Status}
    │
    ▼
c.JSON(http.StatusOK, CrawlStartResponse{ID})

GET /v1/crawl/:id  →  handlers.GetCrawlStatus()  →  returns CrawlState
DELETE /v1/crawl/:id → handlers.CancelCrawl()     →  204 No Content
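
The core of the crawl step is a breadth-first traversal bounded by depth and page count. The sketch below shows that traversal over an in-memory link graph. It is a simplified illustration rather than the code in internal/crawler: `bfs_crawl` and `get_links` are hypothetical names, the seed is assumed to sit at depth 0, and robots.txt checks, rate limiting, and concurrency are omitted.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_crawl(seed, get_links, max_depth=2, max_pages=100):
    """Visit pages breadth-first, staying on the seed's origin."""
    origin = urlparse(seed)[:2]          # (scheme, netloc)
    seen = {seed}
    queue = deque([(seed, 0)])
    visited = []

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)              # a real crawler would scrape here
        if depth >= max_depth:
            continue                     # do not follow links any deeper
        for link in get_links(url):
            if link not in seen and urlparse(link)[:2] == origin:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Tiny fake site used instead of real HTTP requests.
site = {
    "https://a.test/": ["https://a.test/x", "https://b.test/", "https://a.test/y"],
    "https://a.test/x": ["https://a.test/z"],
    "https://a.test/z": ["https://a.test/deep"],
}
print(bfs_crawl("https://a.test/", lambda u: site.get(u, []), max_depth=2))
```

Because the queue is first-in, first-out, every page at depth 1 is visited before any page at depth 2, which is what makes the `maxPages` cut-off favor pages close to the seed.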

🔌 API Endpoints

🌐 Scraping: /v1/scrape

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/scrape | Scrape a single URL with one or more output formats |

Scrape a single URL. This is the canonical endpoint for fetching and extracting content from one page. It supports HTTP, browser (JS) rendering, content filters, and LLM-based structured extraction.

Request body (core.ScrapeRequest):

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | yes | Absolute http:// or https:// URL to scrape |
| formats | string[] | no | Output formats. One or more of markdown, html, rawHtml, plainText, links, imageLinks, json. Defaults to ["markdown"] |
| renderJs | bool | no | When true, fetch the page through a headless Chrome (chromedp). When false (default), use plain HTTP via the shared *renderer.HTTPFetcher |
| waitFor | int | no | Milliseconds to wait after navigation for late content / XHRs. 0 = use the SPA-readiness poll (default) |
| headers | object | no | Custom HTTP headers sent on the fetch |
| includeTags | string[] | no | CSS selectors to keep (e.g. ["article", "h1"]), applied during preprocessHTML |
| excludeTags | string[] | no | CSS selectors to drop (e.g. ["nav", "footer", ".ad"]) |
| cssSelector | string | no | Extract content matching this CSS selector only |
| jsonSchema | object | no | JSON Schema used by formats:["json"] to constrain LLM extraction |
| extract | object | no | LLM extraction overrides: { schema, prompt, responseFormat } |
| llmExtractionPrompt | string | no | Per-request LLM system prompt override |
| llmResponseFormat | string | no | Per-request LLM response_format name override |
| browser | string | no | Deprecated. Accepted for backward-compat; ignored. |

Minimal request:

{
  "url": "https://example.com"
}

Full request with filters and JS rendering:

{
  "url": "https://example.com/article",
  "formats": ["markdown", "html", "links"],
  "renderJs": true,
  "waitFor": 2000,
  "headers": { "Cookie": "session=abc" },
  "includeTags": ["article", "h1", "h2", "p"],
  "excludeTags": ["nav", "footer", ".advertisement"]
}

Response (200 OK on success):

{
  "success": true,
  "data": {
    "markdown": "# Example Domain\n\nThis domain is for use in documentation examples...",
    "html": "<h1>Example Domain</h1><p>This domain is for use in...</p>",
    "plainText": "Example Domain This domain is for use in documentation examples...",
    "links": ["https://www.iana.org/domains/example"],
    "imageLinks": [],
    "metadata": {
      "title": "Example Domain",
      "description": null,
      "ogpTitle": null,
      "ogpDescription": null,
      "ogpImage": null,
      "canonicalUrl": null,
      "sourceURL": "https://example.com",
      "language": "en",
      "statusCode": 200,
      "renderedMode": "http",
      "timeTaken": 281
    }
  },
  "warning": null
}

renderedMode is "http" when fetched via the HTTP fetcher, or "browser" when fetched via chromedp. See Metadata.renderedMode.

LLM-extraction response (when formats includes "json"):

{
  "success": true,
  "data": {
    "markdown": "...",
    "json": {
      "title": "Example Domain",
      "purpose": "documentation example"
    },
    "metadata": { "...": "..." }
  }
}

data.json is populated only when [extraction.llm] is configured in the server TOML. See the LLM Extraction section below.

Error responses:

| Status | Code | Cause |
| --- | --- | --- |
| 400 | invalid_request | Missing url, non-http(s) scheme, malformed JSON, or headers/includeTags/etc. of the wrong type |
| 400 | forbidden | crawler.respect_robots_txt=true and the page is disallowed |
| 500 | internal_error | Scraper not initialized |
| 200 with success:false | http | Target returned HTTP 4xx/5xx with a small body (surfaced as a soft failure rather than an HTTP error so callers can still inspect the metadata) |
| 200 with success:false | renderer_error | Browser path requested but no Chrome WS URL is configured (set [renderer.chrome].ws_url) |
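
Because some failures arrive as 200 OK with success:false, a client has to check both the HTTP status and the body. The sketch below shows one way to do that. The function name `interpret_scrape_response` and the shape of its return value are hypothetical, and reading the code from an `errorCode` field is an assumption based on the LLM Extraction error example later in this README.

```python
import json


def interpret_scrape_response(http_status: int, body_text: str) -> dict:
    """Classify a /v1/scrape response as success, soft failure, or HTTP error."""
    payload = json.loads(body_text)

    if http_status != 200:
        # 400 / 500 rows of the table: the request itself was rejected.
        return {"ok": False, "kind": "http_error",
                "status": http_status, "detail": payload}

    if not payload.get("success", False):
        # 200 with success:false: the target or the renderer failed.
        return {"ok": False, "kind": "soft_failure",
                "code": payload.get("errorCode"), "detail": payload}

    return {"ok": True, "data": payload.get("data", {}),
            "warning": payload.get("warning")}


sample = '{"success": false, "errorCode": "renderer_error", "error": "no browser"}'
print(interpret_scrape_response(200, sample)["kind"])
```

Keeping the soft-failure branch separate lets callers still read whatever metadata the server returned alongside the failure.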

🔄 Crawling: /v1/crawl

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/crawl | Start an async BFS crawl of a website |
| GET | /v1/crawl/:id | Check crawl status and retrieve results |
| DELETE | /v1/crawl/:id | Cancel a running crawl job |

Start a BFS crawl from a seed URL. The job runs asynchronously: POST returns a job ID immediately, and you poll GET /v1/crawl/:id for progress and results.

Request body (types.CrawlRequest):

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | yes | Starting URL |
| maxDepth | int | no | Maximum link depth to follow. 0-10. Defaults to crawler.default_max_depth (TOML) |
| maxPages | int | no | Maximum pages to scrape. 1-1000. Defaults to crawler.default_max_pages (TOML) |
| formats | string[] | no | Output formats per page. Any subset of markdown, html, rawHtml, plainText, links, imageLinks. Note: "json" is rejected with 400; use /v1/scrape for LLM extraction |
| renderJs | bool | no | Force JS rendering on every page (chromedp path) |
| waitFor | int | no | Milliseconds to wait after each navigation |
| browser | string | no | Deprecated. Accepted for backward-compat; ignored. |

Start crawl request:

{
  "url": "https://example.com",
  "maxDepth": 2,
  "maxPages": 50,
  "formats": ["markdown", "links"],
  "renderJs": false
}

Start response (200 OK):

{
  "success": true,
  "id": "crawl-1748899200000000000"
}

Check status: GET /v1/crawl/:id. No body required.

Status response while running:

{
  "id": "crawl-1748899200000000000",
  "success": true,
  "status": "scraping",
  "total": 47,
  "completed": 12,
  "data": []
}

Status response when complete:

{
  "id": "crawl-1748899200000000000",
  "success": true,
  "status": "completed",
  "total": 47,
  "completed": 47,
  "data": [
    {
      "markdown": "# Example Domain\n\n...",
      "html": null,
      "plainText": null,
      "links": ["https://www.iana.org/domains/example"],
      "imageLinks": [],
      "metadata": {
        "sourceURL": "https://example.com",
        "statusCode": 200,
        "renderedMode": "http",
        "timeTaken": 281
      }
    }
  ]
}

status is one of pending, scraping, completed, failed. When the crawl fails, error contains a human-readable message and success is false.

Cancel: DELETE /v1/crawl/:id. No body. Returns 204 No Content on success, 404 Not Found if the job ID is unknown.
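
Since a crawl is asynchronous, clients typically poll the status endpoint until the job reaches a terminal state. The sketch below shows such a loop with the status lookup injected as a callable, so it can be exercised without a server. `wait_for_crawl` and `get_status` are hypothetical names, and only the status values listed above are assumed.

```python
import time

TERMINAL_STATES = {"completed", "failed"}


def wait_for_crawl(job_id, get_status, interval_s=1.0, timeout_s=300.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_status(job_id) until the crawl completes, fails, or times out."""
    deadline = clock() + timeout_s
    while True:
        state = get_status(job_id)       # e.g. a wrapper around GET /v1/crawl/:id
        if state.get("status") in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"crawl {job_id} still {state.get('status')!r}")
        sleep(interval_s)


# Simulated status sequence standing in for real HTTP calls.
fake_states = iter([
    {"status": "pending", "completed": 0},
    {"status": "scraping", "completed": 12},
    {"status": "completed", "completed": 47},
])
final = wait_for_crawl("crawl-demo", lambda _id: next(fake_states),
                       sleep=lambda _s: None)
print(final["status"], final["completed"])
```

In real use, `get_status` would issue the GET request and decode the JSON body. A caller that gives up early can send the DELETE request to cancel the job.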

Error responses for POST /v1/crawl:

| Status | Code | Cause |
| --- | --- | --- |
| 400 | invalid_request | Missing url, non-http(s) scheme, malformed JSON, or formats contains "json" |
| 500 | internal_error | Scraper not initialized |

🗺️ Mapping: /v1/map

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/map | Discover all URLs on a site instantly |

Request:

{
  "url": "https://www.mabud.dev/",
  "maxDepth": 2,
  "useSitemap": true,
  "timeout": 30000
}

Response:

{
  "success": true,
  "data": {
    "links": [
      "https://www.mabud.dev/blog",
      "https://www.mabud.dev/projects",
      "https://www.mabud.dev/resume"
    ]
  }
}

🔍 Search: /v1/search

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/search | Search DuckDuckGo and optionally scrape results in parallel |

By default /v1/search returns only search-result metadata (title, URL, snippet). Set "scrape": true to also fetch and extract content (markdown/html/etc.) from each result URL, using 10 workers in parallel.

{
  "query": "golang web scraping",
  "scrape": true,
  "formats": ["markdown"]
}

🏥 Health: /health

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Health check with browser availability and active job count |

🧠 LLM Extraction

Quickcrawl supports JSON-Schema-based structured extraction for /v1/scrape. Add "json" to the formats array and supply a jsonSchema (or extract.schema) describing the shape you want. The page's markdown is sent to the LLM along with the schema, and the response is returned in data.json.

Request:

{
  "url": "https://news.example.com/article",
  "formats": ["markdown", "json"],
  "jsonSchema": {
    "type": "object",
    "properties": {
      "title":     { "type": "string" },
      "author":    { "type": "string" },
      "published": { "type": "string" }
    },
    "required": ["title", "author"]
  },
  "extract": {
    "prompt": "Extract article title, author, and publish date",
    "responseFormat": "article"
  }
}

Response (200 OK):

{
  "success": true,
  "data": {
    "markdown": "# Headline\n\nBy Jane Doe. Published 2024-01-15...",
    "json": {
      "title": "Headline",
      "author": "Jane Doe",
      "published": "2024-01-15"
    },
    "metadata": { "...": "..." }
  }
}

Field reference:

| Field | Type | Description |
| --- | --- | --- |
| jsonSchema | object | Top-level shortcut for extract.schema: JSON Schema for the data you want extracted |
| extract.schema | object | Same as jsonSchema (nested form) |
| extract.prompt | string | Per-request system prompt override (otherwise [extraction.llm].extraction_prompt from the server TOML is used) |
| extract.responseFormat | string | OpenAI response_format.name for the structured output. Defaults to "extracted_data" |
| llmExtractionPrompt | string | Top-level shortcut for extract.prompt |
| llmResponseFormat | string | Top-level shortcut for extract.responseFormat |

Server-side configuration (quickcrawl.toml):

[extraction.llm]
api_key   = ""                # or set EXTRACTION__LLM__API_KEY in the env
model     = "gpt-4o-mini"
base_url  = ""                # override for non-OpenAI endpoints
max_tokens = 8192
extraction_prompt = "You are a data extraction assistant..."
response_format   = "extracted_data"

The LLM is only invoked when formats includes "json". If formats:["json"] is requested but [extraction.llm] is not configured, the scrape returns {"success": false, "errorCode": "extraction_error", "error": "json extraction requested but no LLM configured. Set [extraction.llm] in server config."}.
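
Clients may still want to sanity-check data.json before using it. The sketch below checks a result against the `required` and `type` keywords of a flat object schema like the one in the request example. It covers only that small subset of JSON Schema, it is not the server's validateDataAgainstSchema, and the name `check_against_schema` is hypothetical.

```python
# Maps JSON Schema type names to Python types (subset, illustrative only).
TYPE_MAP = {
    "string": str,
    "object": dict,
    "array": list,
    "boolean": bool,
    "integer": int,
    "number": (int, float),
}


def check_against_schema(data, schema):
    """Return a list of problems; an empty list means the data looks valid."""
    if not isinstance(data, dict):
        return ["expected an object"]

    errors = []
    for name in schema.get("required", []):
        if name not in data:
            errors.append(f"missing required field: {name}")

    for name, rule in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(rule.get("type"))
        if name in data and expected and not isinstance(data[name], expected):
            errors.append(f"{name}: expected {rule['type']}")
    return errors


article_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "published": {"type": "string"},
    },
    "required": ["title", "author"],
}
print(check_against_schema({"title": "Headline", "author": "Jane Doe"}, article_schema))
```

For anything beyond flat objects (nested schemas, enums, formats), a full JSON Schema validator is the better choice.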


⚙️ Configuration

Config file: quickcrawl.toml

[server]
host = "0.0.0.0"
port = 3000
rate_limit_rps = 10

[renderer]
page_timeout_ms = 30000
pool_size = 4

[renderer.chrome]
ws_url = ""

[crawler]
max_concurrency = 40
requests_per_second = 40.0
respect_robots_txt = true
default_max_depth = 2
default_max_pages = 100

[extraction.llm]
model = "gpt-4o-mini"
api_key = ""
base_url = ""
max_tokens = 8192
extraction_prompt = "You are a data extraction assistant..."
response_format   = "extracted_data"

Or via environment variables:

SERVER__PORT=3000
RENDERER__CHROME__WS_URL=ws://127.0.0.1:9222/devtools/browser/...
CRAWLER__MAX_CONCURRENCY=40
EXTRACTION__LLM__API_KEY=your-key
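
The variable names above suggest a simple convention: a double underscore separates TOML sections from the key, and names are upper-cased. The sketch below applies that inferred convention to build a nested dictionary. It is an illustration of the naming scheme only: `env_to_config` is a hypothetical helper, and the real loader may differ in type conversion or precedence.

```python
def env_to_config(environ):
    """Turn SECTION__SUBSECTION__KEY variables into a nested dict (illustrative)."""
    config = {}
    for name, value in environ.items():
        if "__" not in name:
            continue                     # not a section-scoped variable
        *sections, key = name.lower().split("__")
        node = config
        for section in sections:
            node = node.setdefault(section, {})
        node[key] = value                # values stay strings in this sketch
    return config


example_env = {
    "SERVER__PORT": "3000",
    "CRAWLER__MAX_CONCURRENCY": "40",
    "EXTRACTION__LLM__API_KEY": "your-key",
    "PATH": "/usr/bin",
}
print(env_to_config(example_env))
```

Under this reading, EXTRACTION__LLM__API_KEY corresponds to api_key under [extraction.llm], matching the comment in the LLM Extraction configuration.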

📂 Project Structure

quickcrawl/
├── cmd/
│   ├── server/              # HTTP API server
│   └── mcp/                 # MCP server for AI agents
├── cli/                     # Standalone CLI binary
│   ├── main.go              # Entry point
│   └── cmd/                 # Cobra subcommands
├── internal/
│   ├── api/                 # HTTP handlers, routes, middleware
│   ├── crawler/             # BFS crawler, robots.txt, sitemap
│   ├── extractor/           # HTML cleaning, markdown conversion, link extraction
│   ├── renderer/            # HTTP, browser fetching via CDP
│   ├── search/              # DuckDuckGo integration
│   ├── mcp/                 # MCP tool implementation
│   └── types/               # Type definitions
├── playground/              # Web UI playground
├── npm/                     # NPM wrapper package
├── python/                  # Python SDK
│   ├── examples/            # Python usage examples
│   └── README.md            # Python SDK documentation
├── bench/                   # Benchmarks
├── scripts/                 # Release scripts
└── workflows/               # CI/CD workflows

👨‍💻 Tech Stack

| Tech | Use Case |
| --- | --- |
| Go 1.21+ | Core backend, CLI, server |
| Gin | HTTP framework |
| Cobra | CLI framework |
| goquery | HTML parsing and DOM manipulation |
| LightPanda | Headless browser automation (CDP over WebSocket) |
| Chrome DevTools | Browser automation via CDP WebSocket |
| MCP SDK | Model Context Protocol server |
| slog | Structured logging |

🧪 Playground UI

Quickcrawl includes a web-based playground for testing:

playground/
├── app/
│   └── playground/
│       └── page.tsx         # Main playground UI
├── components/
│   └── response-viewer.tsx  # Response display components
└── lib/
    └── api-client.ts        # API client functions

Access at http://localhost:3000/playground when the server is running.


🚀 Deployment

Docker

Docker images are published to GitHub Container Registry:

# Server
docker build -f infra/Dockerfile.server -t quickcrawl .
docker run -p 3000:3000 quickcrawl

# Playground
docker build -f infra/Dockerfile.playground -t quickcrawl-playground .
docker run -p 3000:3000 quickcrawl-playground

Railway

Deploy on Railway

  1. Click the button above
  2. Connect your GitHub repository
  3. Configure environment variables
  4. Deploy

Roadmap

  1. Add Page Interaction
  2. Hooks & Auth
  3. Improve Search
  4. Better SPA Handling
  5. Auto mode improvements for JS rendering
  6. Add support for https://github.com/h4ckf0r0day/obscura headless
  7. Focus on scalability
  8. Cache System
  9. Improve the SDK performance

MCP Testing on Inspector

CONFIG=/Users/skmabudalam/Desktop/quickcrawl/quickcrawl.toml
npx @modelcontextprotocol/inspector /Users/skmabudalam/Desktop/quickcrawl/bin/quickcrawl-mcp

About

Fast, lightweight web scraper, crawler, and search API with an MCP server for AI agents. 2.2x faster than Firecrawl in 1K-URL benchmarks. Drop-in QuickCrawl CLI, QuickCrawl API, and QuickCrawl MCP. Supports Markdown, HTML, PlainText, Links, and JSON output. Built-in DuckDuckGo search. Free for self-hosting.
