Playground: Try it online
Quickcrawl is a powerful Go-based web scraping service that brings intelligence to AI agents. Whether you're scraping a single page, crawling an entire website, or mapping out link structures β Quickcrawl handles the heavy lifting with a sophisticated multi-layered architecture combining HTTP fetching, browser automation, and LLM-powered structured extraction.
| Feature | Description |
|---|---|
| π Web Scraping | Convert any URL to Markdown, HTML, Plain Text, or Links |
| π Async Crawling | BFS website crawler with depth/page limits and rate limiting |
| πΊοΈ URL Mapping | Discover all URLs on a site instantly without scraping content |
| π§ JavaScript Rendering | Auto-detect SPAs and render via LightPanda or Chrome |
| π LLM Extraction | Send a JSON schema, get validated structured data back |
| π Web Search | DuckDuckGo-powered search for AI agent integration |
| π€ MCP Server | Built-in stdio transport for seamless AI agent integration |
| π¦ Multi-format Output | Markdown, HTML, RawHTML, PlainText, Links, JSON |
85.4% scrape success rate across 1,000 diverse URLs from the Firecrawl Scrape Content Dataset v1
Tested against Firecrawl v2.5 on the same dataset:
| Feature | Quickcrawl | Firecrawl |
|---|---|---|
| Coverage | 85.4% | 79.4% |
| Avg Scrape Latency | 1,841.9ms | 4,048ms |
| Self-hosting | Single binary | Multi-container (~4GB+) |
| Cost / 1K scrapes | $0 (self-hosted) | $9 |
| Metric | Value |
|---|---|
| Content Recall | 44.03% |
| Noise Rejection | 86.65% |
| Content Matches | 376 |
| Noise Leaks | 114 |
Build RAG pipelines with clean LLM-ready markdown, give AI agents real-time web access, monitor content changes, extract structured data, convert HTML to clean markdown, or archive web pages at scale.
Quickcrawl MCP server provides AI agents with web scraping capabilities:
| Tool | Description |
|---|---|
scrape |
Scrape a single URL |
crawl |
Start async website crawl |
check_crawl_status |
Check crawl job status |
cancel_crawl |
Cancel running crawl |
map |
Discover URLs on a site |
site_map |
Discover URLs without scraping content (sitemap-aware) |
search |
Search DuckDuckGo |
Add Quickcrawl to your OpenCode configuration:
{
"mcp": {
"quickcrawl": {
"type": "local",
"command": ["npx", "-y", "@mabudalam/quickcrawl-mcp"],
"enabled": true
}
}
}Renderer selection in MCP:
- Set
renderJs: trueto use the configured Chrome browser (chromedp) for the request - Set
renderJs: false(default) to use the plain HTTP fetcher - When
[renderer.chrome].ws_urlis unset inquickcrawl.toml, MCP auto-launches a local LightPanda and uses its CDP endpoint. The launched process is killed on MCP shutdown.
Example MCP tool arguments:
{
"url": "https://www.notion.so/",
"formats": ["markdown"],
"renderJs": true
}The Quickcrawl CLI provides standalone command-line access to all features. No server or Python needed.
# Quick install (recommended)
curl -fsSL https://raw.githubusercontent.com/MabudAlam/quickcrawl/main/install.sh | sh
# Or from source
go install github.com/MabudAlam/quickcrawl/cli
# Download from GitHub releases
curl -L https://github.com/MabudAlam/QuickCrawl/releases/latest/download/quickcrawl_darwin_arm64.tar.gz | tar -xz
./quickcrawl --help# Scrape a single URL
quickcrawl scrape https://example.com
quickcrawl scrape https://example.com --formats html,markdown
# Crawl a website
quickcrawl crawl https://example.com --max-pages 10 --max-depth 3
# Discover URLs (without scraping content)
quickcrawl map https://example.com --max-depth 2
# Search DuckDuckGo
quickcrawl search "golang web scraping"
quickcrawl search "python" --scrape --formats markdownSee cli/cmd/ for all subcommands:
scrape.goβ Scrape single URLscrawl.goβ Crawl websitesmap.goβ Discover URLssearch.goβ Web search
Python SDK for Quickcrawl β scrape, crawl, and map websites from Python code.
# From PyPI (coming soon)
pip install quickcrawl
# From GitHub
pip install git+https://github.com/MabudAlam/quickcrawl.git@python-sdk#subdirectory=python
# Or clone and install
git clone https://github.com/MabudAlam/quickcrawl
cd quickcrawl/python
pip install -e .from quickcrawl import QuickCrawlClient
# CLI mode (zero config, auto-downloads binary)
with QuickCrawlClient() as client:
result = client.scrape("https://example.com")
print(result["markdown"])
# HTTP mode (connect to deployed server)
client = QuickCrawlClient(api_url="https://your-server.com", api_key="...")
result = client.scrape("https://example.com")See python/examples/:
01_scrape.pyβ Scrape a single URL02_crawl.pyβ Crawl entire website03_map.pyβ Discover URLs without scraping04_formats.pyβ Multiple output formats05_cloud.pyβ Connect to deployed server06_search.pyβ Web search with scrapingperplexity.pyβ Perplexity-style AI research agent with Google ADK
graph TD
Client["π₯οΈ Client (HTTP / MCP)"]
Server["π Quickcrawl Server"]
Client --> Server
Server --> Router["π‘ Gin Router"]
Router --> Handlers["π§ API Handlers"]
Handlers --> Renderer["π¨ Renderer Layer"]
Renderer --> HTTPFetcher["π HTTP Fetcher"]
Renderer --> Browser["π Browser (CDP)"]
Browser --> LightPanda["πΌ LightPanda"]
Browser --> Chrome["π΅ Chrome DevTools"]
Browser --> Chrome["π Chrome"]
Renderer --> Extractor["π Extractor"]
Extractor --> Markdown["π Markdown"]
Extractor --> HTML["π HTML"]
Extractor --> PlainText["π Plain Text"]
Extractor --> Links["π Links"]
Handlers --> Crawler["π Crawler"]
Crawler --> Robots["π€ Robots.txt"]
Crawler --> Sitemap["πΊοΈ Sitemap"]
Crawler --> RateLimit["β‘ Rate Limiter"]
Handlers --> Search["π DuckDuckGo Search"]
Search --> LLM["π§ LLM Extraction"]
LLM --> JSONSchema["π JSON Schema Output"]
Request
β
*core.Scraper (single render path; chromedp + shared HTTPFetcher)
β
βββ renderJs=false β renderer.HTTPFetcher (plain HTTP GET, no JS)
β
βββ renderJs=true β chromedp RemoteAllocator β persistent Chrome
(JS rendered, anti-bot stealth, SPA readiness poll)
Both cli and mcp entry points auto-launch a local LightPanda when no
Chrome WS URL is configured (HTTP server does not β it requires a user-supplied
WS URL and falls back to HTTP-only when none is configured).
When a client calls POST /v1/scrape, Quickcrawl executes the following pipeline:
HTTP POST /v1/scrape
β
βΌ
handlers.Scrape() [internal/api/handlers/handler.go:64]
β β’ Parse + JSON-decode into core.ScrapeRequest
β β’ Validate URL (http/https required, non-empty)
β β’ Default formats to ["markdown"] if empty
β β’ Optional robots.txt check (config: crawler.respect_robots_txt)
β
βΌ
core.Scraper.Scrape() [internal/core/scraper.go:69]
β β’ resolveRenderJS() β request override, defaults to false
β β’ resolveWaitMs() β request override, defaults to 0
β β’ resolveFormats() β stringβtypes.OutputFormat conversion
β
βΌ
core.Renderer.FetchOrchestrator() [internal/core/renderer.go:213]
β
βββ (renderJs=false) ββ HTTP path βββββββββββββββββββ
β β
β renderer.HTTPFetcher.Fetch()
β [internal/renderer/http.go]
β β’ HTTP GET with stealth headers
β β’ Returns FetchResult{HTML, StatusCode}
β
βββ (renderJs=true) ββ CDP path ββββββββββββββββββββ
βΌ
core.Renderer.fetchWithCDPBrowser() [internal/core/renderer.go:334]
β β’ Acquire per-host concurrency slot
β β’ Create isolated browser context (chromedp.NewContext)
β β’ Apply page_timeout_ms to chromedp.Run
β β’ Action sequence:
β - enableNetworkTracking(networkBundle)
β - stealthInjectionAction() (when crawler.stealth.enabled)
β - navigateIgnoringHTTPStatus()
β - dismissCookieBannersFastAction() (when waitMs == 0)
β - WaitForSPAReady() (polls for content readiness,
β network-idle, or selector hit)
β - autoScrollAction() (when waitMs == 0 + lazy markers)
β - OuterHTML of <head> + <body>
β β’ Anti-bot challenge detection (status 4xx/5xx)
β β’ Returns FetchResult{HTML, FinalURL, StatusCode, ContentType}
β
βΌ
core.Extractor.Extract() [internal/core/extractor.go]
β β’ ExtractMetadata β title, description, OG tags, canonical, language
β β’ preprocessHTML β strip head, cleanNoise, applyNoisePatterns
β (and IncludeTags / ExcludeTags / CSSSelector if set)
β β’ postprocessHTML β sanitize, dedupe, normalize whitespace
β β’ HTMLToMarkdown β primary conversion (fullClean β structural β plaintext)
β β’ HTMLToPlaintext
β β’ ExtractLinks, ExtractImageURLs
β
βΌ Returns: core.ScrapeData{Markdown, HTML, PlainText, Links, ImageLinks, Metadata}
β
βΌ
(Optional) LLM Structured Extraction [internal/core/llm.go]
β β’ Triggered when formats contains "json" and [extraction.llm] is configured
β β’ buildLLMInput β callOpenAI(chat/completions) β validateDataAgainstSchema
β β’ Populates data.JSON
β
βΌ
handlers.Scrape()
β β’ If statusCode >= 400 and body < 200 chars β surface as failure
β β’ Else return success with data + warning
β
βΌ
c.JSON(http.StatusOK, APIResponse{ScrapeData})
HTTP POST /v1/crawl
β
βΌ
handlers.StartCrawl() [internal/api/handlers/handler.go:166]
β β’ Parse + JSON-decode into types.CrawlRequest
β β’ Validate URL, maxDepth (0-10), maxPages (1-1000)
β β’ Default maxDepth/maxPages from config
β β’ Reject formats=["json"] with 400 (use /v1/scrape for LLM extraction)
β β’ Generate job ID, store in AppState.CrawlJobs
β
βΌ
crawler.RunCrawl() [internal/crawler/crawl.go]
β β’ BFS from seed URL, respecting same-origin
β β’ robots.txt check per page (if enabled)
β β’ Per-host rate limiter (crawler.requests_per_second)
β β’ Per-host + global concurrency slots
β β’ For each page:
β - core.Scraper.Scrape() (same pipeline as above)
β - Stealth jitter added to inter-request sleep
β - Update CrawlState via stateCh
β
βΌ Returns: types.CrawlState{Total, Completed, Data[], Status}
β
βΌ
c.JSON(http.StatusOK, CrawlStartResponse{ID})
GET /v1/crawl/:id β handlers.GetCrawlStatus() β returns CrawlState
DELETE /v1/crawl/:id β handlers.CancelCrawl() β 204 No Content
| Method | Path | Description |
|---|---|---|
POST |
/v1/scrape |
Scrape a single URL with one or more output formats |
Scrape a single URL. This is the canonical endpoint for fetching and extracting content from one page β supports HTTP, browser (JS) rendering, content filters, and LLM-based structured extraction.
Request body (core.ScrapeRequest):
| Field | Type | Required | Description |
|---|---|---|---|
url |
string | yes | Absolute http:// or https:// URL to scrape |
formats |
string[] | no | Output formats. One or more of markdown, html, rawHtml, plainText, links, imageLinks, json. Defaults to ["markdown"] |
renderJs |
bool | no | When true, fetch the page through a headless Chrome (chromedp). When false (default), use plain HTTP via the shared *renderer.HTTPFetcher |
waitFor |
int | no | Milliseconds to wait after navigation for late content / XHRs. 0 = use the SPA-readiness poll (default) |
headers |
object | no | Custom HTTP headers sent on the fetch |
includeTags |
string[] | no | CSS selectors to keep (e.g. ["article", "h1"]) β applied during preprocessHTML |
excludeTags |
string[] | no | CSS selectors to drop (e.g. ["nav", "footer", ".ad"]) |
cssSelector |
string | no | Extract content matching this CSS selector only |
jsonSchema |
object | no | JSON Schema used by formats:["json"] to constrain LLM extraction |
extract |
object | no | LLM extraction overrides: { schema, prompt, responseFormat } |
llmExtractionPrompt |
string | no | Per-request LLM system prompt override |
llmResponseFormat |
string | no | Per-request LLM response_format name override |
browser |
string | no | Deprecated. Accepted for backward-compat; ignored. |
Minimal request:
{
"url": "https://example.com"
}Full request with filters and JS rendering:
{
"url": "https://example.com/article",
"formats": ["markdown", "html", "links"],
"renderJs": true,
"waitFor": 2000,
"headers": { "Cookie": "session=abc" },
"includeTags": ["article", "h1", "h2", "p"],
"excludeTags": ["nav", "footer", ".advertisement"]
}Response (200 OK on success):
{
"success": true,
"data": {
"markdown": "# Example Domain\n\nThis domain is for use in documentation examples...",
"html": "<h1>Example Domain</h1><p>This domain is for use in...</p>",
"plainText": "Example Domain This domain is for use in documentation examples...",
"links": ["https://www.iana.org/domains/example"],
"imageLinks": [],
"metadata": {
"title": "Example Domain",
"description": null,
"ogpTitle": null,
"ogpDescription": null,
"ogpImage": null,
"canonicalUrl": null,
"sourceURL": "https://example.com",
"language": "en",
"statusCode": 200,
"renderedMode": "http",
"timeTaken": 281
}
},
"warning": null
}renderedMode is "http" when fetched via the HTTP fetcher, or "browser"
when fetched via chromedp. See Metadata.renderedMode.
LLM-extraction response (when formats includes "json"):
{
"success": true,
"data": {
"markdown": "...",
"json": {
"title": "Example Domain",
"purpose": "documentation example"
},
"metadata": { "...": "..." }
}
}data.json is populated only when [extraction.llm] is configured in the
server TOML. See the LLM Extraction section below.
Error responses:
| Status | Code | Cause |
|---|---|---|
400 |
invalid_request |
Missing url, non-http(s) scheme, malformed JSON, or headers/includeTags/etc. of the wrong type |
400 |
forbidden |
crawler.respect_robots_txt=true and the page is disallowed |
500 |
internal_error |
Scraper not initialized |
200 with success:false |
http |
Target returned HTTP 4xx/5xx with a small body (surfaced as a soft failure rather than an HTTP error so callers can still inspect the metadata) |
200 with success:false |
renderer_error |
Browser path requested but no Chrome WS URL is configured (set [renderer.chrome].ws_url) |
| Method | Path | Description |
|---|---|---|
POST |
/v1/crawl |
Start an async BFS crawl of a website |
GET |
/v1/crawl/:id |
Check crawl status and retrieve results |
DELETE |
/v1/crawl/:id |
Cancel a running crawl job |
Start a BFS crawl from a seed URL. The job runs asynchronously β POST returns
a job ID immediately, and you poll GET /v1/crawl/:id for progress and results.
Request body (types.CrawlRequest):
| Field | Type | Required | Description |
|---|---|---|---|
url |
string | yes | Starting URL |
maxDepth |
int | no | Maximum link depth to follow. 0-10. Defaults to crawler.default_max_depth (TOML) |
maxPages |
int | no | Maximum pages to scrape. 1-1000. Defaults to crawler.default_max_pages (TOML) |
formats |
string[] | no | Output formats per page. Any subset of markdown, html, rawHtml, plainText, links, imageLinks. Note: "json" is rejected with 400 β use /v1/scrape for LLM extraction |
renderJs |
bool | no | Force JS rendering on every page (chromedp path) |
waitFor |
int | no | Milliseconds to wait after each navigation |
browser |
string | no | Deprecated. Accepted for backward-compat; ignored. |
Start crawl request:
{
"url": "https://example.com",
"maxDepth": 2,
"maxPages": 50,
"formats": ["markdown", "links"],
"renderJs": false
}Start response (200 OK):
{
"success": true,
"id": "crawl-1748899200000000000"
}Check status β GET /v1/crawl/:id. No body required.
Status response while running:
{
"id": "crawl-1748899200000000000",
"success": true,
"status": "scraping",
"total": 47,
"completed": 12,
"data": []
}Status response when complete:
{
"id": "crawl-1748899200000000000",
"success": true,
"status": "completed",
"total": 47,
"completed": 47,
"data": [
{
"markdown": "# Example Domain\n\n...",
"html": null,
"plainText": null,
"links": ["https://www.iana.org/domains/example"],
"imageLinks": [],
"metadata": {
"sourceURL": "https://example.com",
"statusCode": 200,
"renderedMode": "http",
"timeTaken": 281
}
}
]
}status is one of pending, scraping, completed, failed.
When the crawl fails, error contains a human-readable message and
success is false.
Cancel β DELETE /v1/crawl/:id. No body. Returns 204 No Content on
success, 404 Not Found if the job ID is unknown.
Error responses for POST /v1/crawl:
| Status | Code | Cause |
|---|---|---|
400 |
invalid_request |
Missing url, non-http(s) scheme, malformed JSON, or formats contains "json" |
500 |
internal_error |
Scraper not initialized |
| Method | Path | Description |
|---|---|---|
POST |
/v1/map |
Discover all URLs on a site instantly |
Request:
{
"url": "https://www.mabud.dev/",
"maxDepth": 2,
"useSitemap": true,
"timeout": 30000
}Response:
{
"success": true,
"data": {
"links": [
"https://www.mabud.dev/blog",
"https://www.mabud.dev/projects",
"https://www.mabud.dev/resume"
]
}
}| Method | Path | Description |
|---|---|---|
POST |
/v1/search |
Search DuckDuckGo and optionally scrape results in parallel |
By default /v1/search returns only search-result metadata (title, URL, snippet). Set "scrape": true to also fetch and extract content (markdown/html/etc.) from each result URL β 10 workers in parallel.
{
"query": "golang web scraping",
"scrape": true,
"formats": ["markdown"]
}| Method | Path | Description |
|---|---|---|
GET |
/health |
Health check with browser availability and active job count |
Quickcrawl supports JSON-Schema-based structured extraction for /v1/scrape.
Add "json" to the formats array and supply a jsonSchema (or extract.schema)
describing the shape you want. The page's markdown is sent to the LLM along
with the schema, and the response is returned in data.json.
Request:
{
"url": "https://news.example.com/article",
"formats": ["markdown", "json"],
"jsonSchema": {
"type": "object",
"properties": {
"title": { "type": "string" },
"author": { "type": "string" },
"published": { "type": "string" }
},
"required": ["title", "author"]
},
"extract": {
"prompt": "Extract article title, author, and publish date",
"responseFormat": "article"
}
}Response (200 OK):
{
"success": true,
"data": {
"markdown": "# Headline\n\nBy Jane Doe. Published 2024-01-15...",
"json": {
"title": "Headline",
"author": "Jane Doe",
"published": "2024-01-15"
},
"metadata": { "...": "..." }
}
}Field reference:
| Field | Type | Description |
|---|---|---|
jsonSchema |
object | Top-level shortcut for extract.schema β JSON Schema for the data you want extracted |
extract.schema |
object | Same as jsonSchema (nested form) |
extract.prompt |
string | Per-request system prompt override (otherwise [extraction.llm].extraction_prompt from the server TOML is used) |
extract.responseFormat |
string | OpenAI response_format.name for the structured output. Defaults to "extracted_data" |
llmExtractionPrompt |
string | Top-level shortcut for extract.prompt |
llmResponseFormat |
string | Top-level shortcut for extract.responseFormat |
Server-side configuration (quickcrawl.toml):
[extraction.llm]
api_key = "" # or set EXTRACTION__LLM__API_KEY in the env
model = "gpt-4o-mini"
base_url = "" # override for non-OpenAI endpoints
max_tokens = 8192
extraction_prompt = "You are a data extraction assistant..."
response_format = "extracted_data"The LLM is only invoked when formats includes "json". If formats:["json"]
is requested but [extraction.llm] is not configured, the scrape returns
{"success": false, "errorCode": "extraction_error", "error": "json extraction requested but no LLM configured. Set [extraction.llm] in server config."}.
Config file: quickcrawl.toml
[server]
host = "0.0.0.0"
port = 3000
rate_limit_rps = 10
[renderer]
page_timeout_ms = 30000
pool_size = 4
[renderer.chrome]
ws_url = ""
[crawler]
max_concurrency = 40
requests_per_second = 40.0
respect_robots_txt = true
default_max_depth = 2
default_max_pages = 100
[extraction.llm]
model = "gpt-4o-mini"
api_key = ""
base_url = ""
max_tokens = 8192
extraction_prompt = "You are a data extraction assistant..."
response_format = "extracted_data"Or via environment variables:
SERVER__PORT=3000
RENDERER__CHROME__WS_URL=ws://127.0.0.1:9222/devtools/browser/...
CRAWLER__MAX_CONCURRENCY=40
EXTRACTION__LLM__API_KEY=your-keyquickcrawl/
βββ cmd/
β βββ server/ # HTTP API server
β βββ mcp/ # MCP server for AI agents
βββ cli/ # Standalone CLI binary
β βββ main.go # Entry point
β βββ cmd/ # Cobra subcommands
βββ internal/
β βββ api/ # HTTP handlers, routes, middleware
β βββ crawler/ # BFS crawler, robots.txt, sitemap
β βββ extractor/ # HTML cleaning, markdown conversion, link extraction
β βββ renderer/ # HTTP, browser fetching via CDP
β βββ search/ # DuckDuckGo integration
β βββ mcp/ # MCP tool implementation
β βββ types/ # Type definitions
βββ playground/ # Web UI playground
βββ npm/ # NPM wrapper package
βββ python/ # Python SDK
β βββ examples/ # Python usage examples
β βββ README.md # Python SDK documentation
βββ bench/ # Benchmarks
βββ scripts/ # Release scripts
βββ workflows/ # CI/CD workflows
| Tech | Use Case |
|---|---|
| Go 1.21+ | Core backend, CLI, server |
| Gin | HTTP framework |
| Cobra | CLI framework |
| goquery | HTML parsing and DOM manipulation |
| lightpanda | Headless browser automation (CDP over WebSocket) |
| Chrome DevTools | Browser automation via CDP WebSocket |
| MCP SDK | Model Context Protocol server |
| slog | Structured logging |
Quickcrawl includes a web-based playground for testing:
playground/
βββ app/
β βββ playground/
β βββ page.tsx # Main playground UI
βββ components/
β βββ response-viewer.tsx # Response display components
βββ lib/
βββ api-client.ts # API client functions
Access at http://localhost:3000/playground when the server is running.
Docker images are published to GitHub Container Registry:
# Server
docker build -f infra/Dockerfile.server -t quickcrawl .
docker run -p 3000:3000 quickcrawl
# Playground
docker build -f infra/Dockerfile.playground -t quickcrawl-playground .
docker run -p 3000:3000 quickcrawl-playground- Click the button above
- Connect your GitHub repository
- Configure environment variables
- Deploy
- Add Page Interaction
- Hooks & Auth
- Improve Search
- Better SPA Handling
- Auto mode improvements for JS rendering
- Add support for https://github.com/h4ckf0r0day/obscura headless
- Focus on scalablity
- Cache System
- Improve the SDK perfomance
MCP Testing on Inspector
CONFIG=/Users/skmabudalam/Desktop/quickcrawl/quickcrawl.toml
npx @modelcontextprotocol/inspector /Users/skmabudalam/Desktop/quickcrawl/bin/quickcrawl-mcp