Local benchmark harness for comparing agent workflows across Docs/API, CLI, MCP, and optional official-skills context. The default provider is InsForge; the benchmark also includes a Supabase local-stack workflow comparison.
The current public database task set is documented in docs/db-task-set.md.
The benchmark loop is deliberately small:
- prepare task state
- run an agent adapter
- verify success
- run attack/security checks
- clean up
- write structured JSON results
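The loop above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the harness's actual code: `TaskScripts`, `TaskResult`, and `runTask` are hypothetical names, and the steps are synchronous for brevity where a real runner would most likely be async.

```typescript
// Hypothetical shapes for illustration; the real harness types may differ.
interface TaskScripts {
  prepare(): void;
  verify(): boolean;
  attack?(): boolean; // optional attack/security check
  cleanup?(): void;
}

interface TaskResult {
  taskId: string;
  verifyPassed: boolean;
  attackPassed: boolean;
  status: "PASS" | "FAIL";
}

function runTask(
  taskId: string,
  scripts: TaskScripts,
  runAgent: () => void,
): TaskResult {
  let verifyPassed = false;
  let attackPassed = false;
  try {
    scripts.prepare();
    runAgent();
    verifyPassed = scripts.verify();
    // Assumption: a task without an attack script passes the attack step.
    attackPassed = scripts.attack ? scripts.attack() : true;
  } finally {
    // Cleanup runs even when an earlier step throws.
    scripts.cleanup?.();
  }
  // A task passes only when both verify and attack pass.
  const status = verifyPassed && attackPassed ? "PASS" : "FAIL";
  return { taskId, verifyPassed, attackPassed, status };
}

const demo = runTask(
  "owner_notes",
  { prepare: () => {}, verify: () => true, attack: () => true },
  () => {},
);
// Structured JSON result, as the last step of the loop.
console.log(JSON.stringify(demo, null, 2));
```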
For harness-only checks:
- Node.js 20 or newer
- npm
For real Claude Code runs:
- Claude Code installed and authenticated
- ANTHROPIC_API_KEY set in .env or the shell environment
For real InsForge pilot tasks:
- Docker running. The default provider sandbox starts an isolated InsForge Compose stack from locked public Docker images, not from a local InsForge source checkout. Docs/API context is extracted from the same locked InsForge runtime image.
For real Supabase pilot tasks:
- Docker running
- Supabase CLI available through npx. The default provider sandbox creates and starts a temporary Supabase local project for the run.
npm install
npm run typecheck
npm run smoke
The smoke command uses the mock adapter so the harness can be validated without Claude Code or an API key.
The benchmark CLI automatically loads .env from the repo root. Values already
present in the shell environment take precedence over .env.
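The precedence rule can be illustrated with a small sketch. It assumes a simplified `KEY=VALUE` parser with no quoting support. `parseEnvFile` and `mergeEnv` are hypothetical helpers, and the harness may well use a different loader.

```typescript
// Hypothetical, simplified .env parser: KEY=VALUE lines, '#' comments, no quoting.
function parseEnvFile(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // skip lines without a key
    out[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return out;
}

// Shell values win over .env values when both define the same key.
function mergeEnv(
  fileVars: Record<string, string>,
  shellVars: Record<string, string | undefined>,
): Record<string, string> {
  const merged: Record<string, string> = { ...fileVars };
  for (const [key, value] of Object.entries(shellVars)) {
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}

const fromFile = parseEnvFile("# pins\nANTHROPIC_MODEL=claude-sonnet-4-6\n");
const effective = mergeEnv(fromFile, { ANTHROPIC_MODEL: "model-from-shell" });
console.log(effective.ANTHROPIC_MODEL); // the shell value
```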
The local .env should stay narrow: user secrets plus reproducibility pins only.
Provider URLs, provider API keys, Supabase project directories, Docs/API context,
MCP configs, workflow tool lists, ports, and sandbox credentials are generated
by the harness for each run.
Each run writes a summary.json with result totals plus reproducibility metadata such as benchmark package version, Node runtime, git commit/dirty state, agent command version, model, permission mode, and allowed tools.
Runs also record a workflow profile: direct for the baseline prompt, or docs-api, cli, cli-skills, mcp, and mcp-skills for workflow comparisons.
Use npm run benchmark:k -- --k <count> --workflow <workflow> for reliability checks. K-run attempts are independent child benchmark runs, each with its own run id and provider sandbox. The generated k-run report summarizes pass@k, pass^k, backend stability, workflow-compliance stability, time, and token metrics.
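As a rough illustration of the two headline metrics, the sketch below assumes the common reading: pass@k counts a task as solved when at least one of its k attempts passes, and pass^k only when all k attempts pass. The `Attempts` shape and `summarizeKRuns` are hypothetical; the generated report may aggregate differently.

```typescript
// Hypothetical input: task id -> pass/fail outcome of each independent attempt.
type Attempts = Record<string, boolean[]>;

function summarizeKRuns(attempts: Attempts): { passAtK: number; passHatK: number } {
  const perTask = Object.values(attempts);
  if (perTask.length === 0) return { passAtK: 0, passHatK: 0 };
  // pass@k: fraction of tasks with at least one passing attempt.
  const anyPass = perTask.filter((runs) => runs.some(Boolean)).length;
  // pass^k: fraction of tasks where every attempt passed.
  const allPass = perTask.filter((runs) => runs.length > 0 && runs.every(Boolean)).length;
  return { passAtK: anyPass / perTask.length, passHatK: allPass / perTask.length };
}

const kSummary = summarizeKRuns({
  owner_notes: [true, true, true],
  team_docs: [true, false, true], // hypothetical task id
});
console.log(kSummary);
```

A large gap between the two numbers suggests a workflow that can solve a task but does not do so reliably.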
The CLI exits non-zero when any selected task fails, so shell scripts can stop on failed benchmark runs.
Tasks can declare workflow-specific unsupported surfaces in meta.json; skipped tasks are written to results as SKIP:unsupported_workflow and are not counted in the executed pass-rate denominator.
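The effect on the pass rate can be sketched as follows. Only the `SKIP:unsupported_workflow` label comes from the harness; the `PASS` and `FAIL` labels and the `executedPassRate` helper are assumptions made for illustration.

```typescript
// "PASS" and "FAIL" are assumed labels; the skip label is the one the harness writes.
type Status = "PASS" | "FAIL" | "SKIP:unsupported_workflow";

// Returns null when nothing was executed, so callers do not divide by zero.
function executedPassRate(statuses: Status[]): number | null {
  const executed = statuses.filter((s) => s !== "SKIP:unsupported_workflow");
  if (executed.length === 0) return null;
  const passed = executed.filter((s) => s === "PASS").length;
  return passed / executed.length;
}

// Two passes out of three executed tasks; the skip does not enter the denominator.
const rate = executedPassRate(["PASS", "FAIL", "SKIP:unsupported_workflow", "PASS"]);
console.log(rate);
```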
CLI and MCP runs also compute workflow evidence: npx @insforge/cli or npx supabase command logs for CLI, and MCP tools/call logs for MCP. Evidence is diagnostic by default; pass --require-workflow-evidence to make missing workflow evidence fail the run.
Use --hard-isolation when comparing workflow surfaces strictly. Hard isolation requires workflow evidence and narrows Claude Code allowed/disallowed tools per workflow:
- Docs/API may make direct backend requests but must derive product details from the workspace docs-context snapshot.
- CLI must use the provider CLI from a harness-prelinked local project and runs CLI subprocesses with workspace-local HOME/config/cache state. It cannot use direct backend requests, ambient local Skill-tool calls, MCP tools, or file read/edit tools, and it cannot link or discover cloud projects.
- cli-skills additionally fetches the pinned official skills snapshot into the workspace and enables file reads for that workspace context.
- MCP gets explicit provider MCP tools without Bash, direct file read/edit tools, or raw-SQL request tunneling.
- mcp-skills additionally gets the official skills snapshot as read-only context.
Provider runs always use an isolated Docker sandbox. The harness starts a
provider stack per run from the versions selected by
BENCH_INSFORGE_IMAGE_VERSION and BENCH_SUPABASE_IMAGE_VERSION, resolves them
to exact digests through configs/provider-images.lock.json, injects generated
provider env values, and tears the stack down after the run.
Create local benchmark config:
cp configs/local.example.env .env
At minimum, edit .env for real pilot runs:
ANTHROPIC_API_KEY=<your Anthropic API key>
The rest of configs/local.example.env is only reproducibility pins:
ANTHROPIC_MODEL=claude-sonnet-4-6
CLAUDE_CODE_PERMISSION_MODE=acceptEdits
BENCH_INSFORGE_IMAGE_VERSION=v2.1.10
BENCH_SUPABASE_IMAGE_VERSION=local-stack-2.101.0
BENCH_INSFORGE_CLI_PACKAGE=@insforge/cli@0.1.82
BENCH_SUPABASE_CLI_PACKAGE=supabase@2.101.0
BENCH_MCP_PACKAGE=@insforge/mcp@1.2.11
BENCH_SKILLS_LOCK_PATH=configs/insforge-skills.lock.json
BENCH_SUPABASE_SKILLS_LOCK_PATH=configs/supabase-skills.lock.json
Provider image pins must stay in lockstep with configs/provider-images.lock.json.
The Docs/API workflow does not read a local InsForge checkout. It extracts
/app/docs from the locked ghcr.io/insforge/insforge-oss image and copies
that snapshot into each task workspace.
With Docker running, .env populated, and Claude Code ready:
npm run benchmark:cli
npm run benchmark:cli-skills
npm run benchmark:mcp
npm run benchmark:mcp-skills
npm run benchmark:all
Docs/API is currently opt-in because the locked InsForge runtime image does not yet include the product OpenAPI YAML. Run it explicitly when you want to test that surface:
npm run benchmark:api
By default these commands start their own Docker Compose project for each
workflow run. The InsForge sandbox uses BENCH_INSFORGE_IMAGE_VERSION plus
configs/provider-images.lock.json, assigns free host ports, injects generated
provider connection values, and tears down the Compose project with volumes
after the run. Use --keep-provider-sandbox when you want to debug the
generated stack after a run.
These benchmark shortcuts run the canonical task set with claude-code under hard workflow isolation. benchmark:api maps to docs-api, benchmark:cli maps to bare CLI, benchmark:cli-skills maps to CLI+official skills, benchmark:mcp maps to bare MCP, and benchmark:mcp-skills maps to MCP+official skills. benchmark:all runs the default non-Docs workflows only: InsForge runs cli, cli-skills, and mcp; Supabase runs cli, cli-skills, mcp, and mcp-skills. benchmark:matrix also excludes Docs/API by default. Both continue after a workflow failure so comparison data is not cut short, then exit non-zero if any workflow failed. Each benchmark shortcut writes a Notion-friendly Markdown report under reports/ after the run finishes.
Pass filters and run options after --:
npm run benchmark:api -- --module db --task owner_notes
npm run benchmark:all -- --module db
npm run benchmark:all -- --capability access-control
npm run benchmark:all -- --dry-run
npm run benchmark:all -- --no-report
DB tasks ask Claude Code to configure public-schema database capabilities, then verifiers perform legitimate user operations, invariant checks, query checks, and attack checks where relevant. Database tasks are grouped into access-control, integrity, query, and vector; query tasks may also record query performance metrics. Storage tasks ask Claude Code to configure buckets and object policies; verifiers handle object uploads/downloads and access attacks.
The Supabase phase uses the Supabase CLI local development stack in a temporary project created by the benchmark harness for each run. It does not clone the Supabase monorepo, does not connect to hosted Supabase projects, and does not target an already-running local Supabase container.
With Docker running and Claude Code ready:
npm run benchmark:supabase:cli
npm run benchmark:supabase:cli-skills
npm run benchmark:supabase:mcp
npm run benchmark:supabase:mcp-skills
npm run benchmark:supabase:all
npm run benchmark:matrix
The Supabase sandbox creates results/<run-id>/.provider/supabase/project,
assigns free local ports in supabase/config.toml, starts the pinned Supabase
CLI stack, parses npx supabase status -o env, and injects the generated
connection values, project dir, and local MCP URL. It stops the stack with
npx supabase stop --no-backup unless --keep-provider-sandbox is set. The
harness verifies the started Supabase containers against the Supabase image
digests in configs/provider-images.lock.json. None of those generated
Supabase values belong in .env.
Supabase shortcuts request the central task catalog with claude-code under
hard workflow isolation. DB runs use the db suite by default when
--module db is supplied; other modules keep the pilot suite default. The
harness resolves tasks from the canonical module-first catalog under tasks/db,
tasks/storage, tasks/auth, and tasks/e2e; provider-specific verifier
details live inside those task directories instead of a parallel provider suite.
Supabase Auth config tasks are active for cli-skills, where the agent can
combine the pinned Supabase CLI with the official skills snapshot and edit the
task-local supabase/config.toml. Bare cli remains unsupported for Auth
config tasks because hard isolation exposes only npx supabase ..., and the
Supabase CLI does not provide a local auth config mutation subcommand.
For Supabase CLI runs, the harness copies the sandbox project's supabase/
directory into each isolated task workspace before the agent runs. This is the
Supabase equivalent of the InsForge local prelink: the agent should not run
hosted login, link, project discovery, or cloud commands.
Supabase CLI+Skills runs use npx supabase ... plus the official
supabase/agent-skills GitHub snapshot pinned by
configs/supabase-skills.lock.json. The harness pins npx supabase to
BENCH_SUPABASE_CLI_PACKAGE and requires it to match the provider image lock
for reproducible CLI runs. It also requires BENCH_SUPABASE_IMAGE_VERSION to
match the locked local-stack image set. Supabase MCP runs auto-configure the
sandbox MCP endpoint from generated sandbox values. Supabase MCP+Skills uses the
same MCP surface plus the official skills snapshot as read-only context.
Auth signup/session is intentionally not a standalone pilot task: user creation and login checks belong in verifiers, not in the workflow task surface. Supabase Storage parity is included in the central task catalog. Supabase MCP Auth config tasks are explicit unsupported skips: the official Supabase MCP tool surface currently lists database, debugging, development, edge functions, account, docs, and storage configuration tools, but no Auth configuration tools.
For a first real Supabase workflow comparison, run a single known-supported DB task through each strict Supabase workflow surface:
npm run benchmark:supabase:all -- --module db --task owner_notes
Then inspect the generated Markdown report under reports/, or print the quick
terminal summary:
npm run report -- --limit 10
Some task/workflow pairs are expected to skip or fail because the current
workflow surface does not expose that backend capability. Those cases are
recorded in each task's meta.json; see docs/db-task-set.md for the
current public database task set.
Install and authenticate Claude Code following Anthropic's CLI docs, then configure local secrets:
cp configs/local.example.env .env
# edit .env and set ANTHROPIC_API_KEY
npm run benchmark:api -- --module db --task owner_notes
The Claude adapter invokes:
claude -p "<task prompt>" --model "$ANTHROPIC_MODEL" --output-format json --permission-mode "$CLAUDE_CODE_PERMISSION_MODE"
Default model is claude-sonnet-4-6.
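For readers who want to see how such an invocation might be assembled, here is a minimal sketch that builds the argument list for the command above. `ClaudeRunOptions` and `buildClaudeArgs` are hypothetical names; only the flags and the default model are taken from this README.

```typescript
// Hypothetical option bag; the real adapter's types are not shown in this README.
interface ClaudeRunOptions {
  prompt: string;
  model?: string; // falls back to the documented default model
  permissionMode: string; // e.g. the CLAUDE_CODE_PERMISSION_MODE pin
}

// Builds argv for `claude`, using only the flags documented above.
function buildClaudeArgs(options: ClaudeRunOptions): string[] {
  return [
    "-p", options.prompt,
    "--model", options.model ?? "claude-sonnet-4-6",
    "--output-format", "json",
    "--permission-mode", options.permissionMode,
  ];
}

const claudeArgs = buildClaudeArgs({
  prompt: "<task prompt>",
  permissionMode: "acceptEdits",
});
console.log(["claude", ...claudeArgs].join(" "));
```

Passing arguments as an array rather than a single shell string avoids quoting problems when the task prompt contains spaces or quotes.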
To run a workflow profile directly:
npm run benchmark:api -- --module db --task owner_notes
To run a capability subset in one summary:
npm run benchmark:mcp -- --capability auth,access-control,integrity,storage
The cli and cli-skills workflows preconfigure each task workspace with a local .insforge/project.json that points at the generated sandbox backend URL; this is harness setup, not part of the agent workflow, and it is not read from .env. They record npx @insforge/cli invocations in workflow-commands.jsonl when they happen and capture the pinned InsForge CLI package/version in run metadata. Agents still invoke npx @insforge/cli ..., but the harness wrapper executes BENCH_INSFORGE_CLI_PACKAGE (default @insforge/cli@0.1.82) for reproducibility. CLI+Skills runs fetch the official InsForge/insforge-skills GitHub snapshot pinned by configs/insforge-skills.lock.json into each task workspace, so edits to locally installed skills do not silently change benchmark results. Command logs include sanitized args, exit code, and sanitized stdout/stderr snippets; API keys, bearer/JWT-like tokens, access tokens, passwords, and values passed to secrets add are redacted.
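The redaction step could look roughly like the sketch below. The patterns, placeholders, and the `sanitize` name are assumptions chosen for illustration; the harness's actual rules are not documented here and are likely broader.

```typescript
// Hypothetical redaction pass over a command-log snippet.
function sanitize(text: string): string {
  return text
    // Bearer tokens in headers or flags.
    .replace(/\b(Bearer)\s+[A-Za-z0-9._~+\/-]+=*/gi, "$1 [REDACTED]")
    // JWT-like values: three base64url segments starting with "eyJ".
    .replace(/\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/g, "[REDACTED_JWT]")
    // KEY=value pairs whose key looks sensitive.
    .replace(
      /\b([A-Za-z0-9_]*(?:api[_-]?key|token|password|secret)[A-Za-z0-9_]*)\s*=\s*\S+/gi,
      "$1=[REDACTED]",
    );
}

console.log(sanitize("API_KEY=sk-123 ok"));
console.log(sanitize("Authorization: Bearer abc.def"));
```

Pattern-based redaction is best-effort: it can miss secrets in unusual formats, so logging fewer fields is generally safer than relying on redaction alone.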
Update the CLI package pin or GitHub skills commit lock only when intentionally moving the official benchmark version; do not patch local skills to turn a failing benchmark task into a passing one.
The docs-api workflow copies a workspace-local docs-context/ snapshot extracted from the locked InsForge runtime image. The task prompt intentionally does not embed endpoint details; Docs/API agents must discover them from that snapshot.
Workflow prompts explicitly tell CLI agents to use the provider CLI and MCP agents to use provider MCP tools. The cli-skills and mcp-skills workflows add an official GitHub skills snapshot as workspace context; bare cli and mcp workflows do not. MCP workflows auto-generate provider-local MCP config from sandbox values, run strict MCP config mode by default, and record a sanitized MCP server summary in run metadata. Inline MCP configs are wrapped in a workspace-local proxy that writes sanitized tool-call metadata to mcp-tool-calls.jsonl; it records timestamps, JSON-RPC ids, and tool names only, never request arguments, responses, headers, or env vars. The default MCP allowed-tool list enumerates the pinned server's current tools explicitly for Claude Code.
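To make the logging boundary concrete, the sketch below turns an MCP `tools/call` JSON-RPC request into one JSONL line that keeps only a timestamp, the JSON-RPC id, and the tool name. The log field names (`timestamp`, `id`, `tool`) and the `toToolCallLogLine` helper are hypothetical; the actual mcp-tool-calls.jsonl schema may differ.

```typescript
// Minimal JSON-RPC request shape as used by MCP tools/call.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id?: string | number;
  method: string;
  params?: { name?: string; arguments?: unknown };
}

// Returns one JSONL line for tools/call requests, or null for anything else.
// Arguments, responses, headers, and env vars are never copied into the line.
function toToolCallLogLine(request: JsonRpcRequest, now: Date): string | null {
  if (request.method !== "tools/call") return null;
  return JSON.stringify({
    timestamp: now.toISOString(),
    id: request.id ?? null,
    tool: request.params?.name ?? null,
  });
}

const logLine = toToolCallLogLine(
  {
    jsonrpc: "2.0",
    id: 7,
    method: "tools/call",
    params: { name: "run-sql", arguments: { query: "select 'secret'" } }, // hypothetical tool
  },
  new Date(0),
);
console.log(logLine);
```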
To summarize recent results:
npm run report -- --limit 30
To regenerate a Markdown report for existing run ids:
npm run report:generate -- --run-id fresh-core-api,fresh-core-cli,fresh-core-mcp
npm run report:generate -- --current --output reports/current-benchmark-report.md
configs/run-registry.json marks local run ids as current, historical,
polluted, or superseded. Both npm run report and
npm run report:generate read it automatically when present; use --current or
--status current,polluted to filter report inputs.
To run an independent k-run reliability check:
npm run benchmark:k -- --k 3 --workflow mcp --module db --task owner_notes
npm run benchmark:k -- --k 3 --provider supabase --workflow cli-skills --module db
To require actual MCP/CLI usage during a workflow experiment:
npm run benchmark:mcp -- --module db --task owner_notes --require-workflow-evidence
The benchmark shortcuts already force strict workflow isolation. For low-level experiments, the underlying CLI remains available:
npm run run -- --module db --agent claude-code --workflow mcp --task owner_notes --hard-isolation
In hard isolation, CLI workflows must produce CLI workflow evidence from the harness-prelinked local project, perform at least one successful backend-relevant CLI operation, and must not use direct backend requests, inherited CLI login state, ambient local Skill-tool calls, cloud auth behavior, cloud account discovery, MCP tools, file read/edit tools, or any CLI link command. The cli-skills variant must also use only the pinned official workspace skills snapshot for skills guidance. MCP workflows must not use Bash, direct backend requests, CLI, file read/edit tools, or backend request tunneling through raw SQL. The mcp-skills variant must also use only the pinned official workspace skills snapshot for skills guidance. Tasks with known workflow-surface gaps should declare unsupportedWorkflows in metadata so the runner records a skip instead of asking the agent to rediscover an impossible operation.
- src/cli.ts - command entrypoint
- src/benchmark.ts - workflow benchmark shortcuts
- src/benchmark-k.ts - independent k-run reliability orchestration
- src/runner/benchmark-runner.ts - task lifecycle orchestration
- src/adapters/ - agent workflow adapters
- src/tasks/ - task discovery and script loading
- src/reporting/ - result persistence
- tasks/db/<category>/, tasks/storage/, tasks/auth/, tasks/e2e/ - module-first canonical task catalog
- results/ - generated run artifacts
- reports/ - generated Markdown reports
Add the logical task first under the central catalog:
tasks/<module>/<task_id>/
For database tasks, place public tasks under one of the category directories:
tasks/db/access-control/<task_id>/
tasks/db/integrity/<task_id>/
tasks/db/query/<task_id>/
tasks/db/vector/<task_id>/
Each database task owns its prompt and lifecycle scripts:
tasks/db/<category>/<task_id>/prompt.md
tasks/db/<category>/<task_id>/prepare.ts
tasks/db/<category>/<task_id>/verify.ts
tasks/db/<category>/<task_id>/attack.ts
tasks/db/<category>/<task_id>/cleanup.ts
Use capability in meta.json as the lightweight category filter:
access-control, integrity, query, or vector. Query verifiers may return
query performance details in their normal result details; other categories do
not need to emit empty performance fields.
When verifier or cleanup logic must differ by provider, keep the provider adapter inside the same task directory:
tasks/<module>/<task_id>/run.ts
tasks/<module>/<task_id>/<provider>/...
Discovery only reads the module-first catalog. Reports use the same suite/task id across providers, so provider and workflow comparisons line up against one shared task list.
Each task should include:
- meta.json
- prompt.md
- lifecycle script exports for prepare, verify, optional attack, and optional cleanup, either as individual files or a shared run.ts
Task meta.json uses a three-level difficulty scale:
- simple: one resource or one configuration surface, with straightforward ownership or config checks.
- medium: joins, role checks, uniqueness, soft-delete, JSON/temporal predicates, or one focused lifecycle invariant.
- hard: multi-table state machines, ACL matrices, quotas, trigger-maintained consistency, field-level write masks, or combined backend surfaces.
Security-sensitive tasks should make the attack export a hard gate: a task only passes when both verify and attack pass.
If a workflow cannot support the task by design, add unsupportedWorkflows to
meta.json with a short reason. If support differs by provider, use
unsupportedWorkflowsByProvider. Entries are workflow-exact, so mark both mcp
and mcp-skills when a task is unsupported for both MCP variants.
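The exact-match rule can be illustrated with a small lookup. The shape used here, a map from workflow name to reason string, is an assumption; this README names the `unsupportedWorkflows` and `unsupportedWorkflowsByProvider` fields but does not specify their schema. `TaskMeta` and `skipReason` are hypothetical.

```typescript
type Workflow = "direct" | "docs-api" | "cli" | "cli-skills" | "mcp" | "mcp-skills";

// Assumed meta.json fragment: workflow -> short reason.
interface TaskMeta {
  unsupportedWorkflows?: Partial<Record<Workflow, string>>;
  unsupportedWorkflowsByProvider?: Record<string, Partial<Record<Workflow, string>>>;
}

// Returns the skip reason, or undefined when the task should run.
// Matching is exact: an "mcp" entry does not cover "mcp-skills".
function skipReason(meta: TaskMeta, provider: string, workflow: Workflow): string | undefined {
  return (
    meta.unsupportedWorkflowsByProvider?.[provider]?.[workflow] ??
    meta.unsupportedWorkflows?.[workflow]
  );
}

const authConfigMeta: TaskMeta = {
  unsupportedWorkflowsByProvider: {
    supabase: {
      mcp: "no Auth configuration tools in the MCP surface",
      "mcp-skills": "no Auth configuration tools in the MCP surface",
    },
  },
};

console.log(skipReason(authConfigMeta, "supabase", "mcp-skills"));
console.log(skipReason(authConfigMeta, "insforge", "mcp-skills")); // undefined: task runs
```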