InsForge Agent Benchmark

Local benchmark harness for comparing agent workflows across Docs/API, CLI, MCP, and optional official-skills context. The default provider is InsForge; the benchmark also includes a Supabase local-stack workflow comparison.

The current public database task set is documented in docs/db-task-set.md.

The benchmark loop is deliberately small:

prepare task state
run an agent adapter
verify success
run attack/security checks
clean up
write structured JSON results
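
The loop above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the type names (StepResult, TaskScripts, TaskResult) and the functions runTask and toJsonLine are hypothetical, and the real runner in src/runner/benchmark-runner.ts may be structured differently.

```typescript
// Hypothetical types: the harness's real interfaces are not shown in this README.
type StepResult = { ok: boolean; details?: string };

interface TaskScripts {
  prepare: () => Promise<void>;
  verify: () => Promise<StepResult>;
  attack?: () => Promise<StepResult>; // optional attack/security checks
  cleanup?: () => Promise<void>; // optional teardown
}

interface TaskResult {
  taskId: string;
  status: "PASS" | "FAIL";
  verify: StepResult;
  attack: StepResult | null;
}

// Runs one task through prepare -> agent -> verify -> attack -> cleanup.
async function runTask(
  taskId: string,
  scripts: TaskScripts,
  runAgent: () => Promise<void>,
): Promise<TaskResult> {
  await scripts.prepare();
  try {
    await runAgent();
    const verify = await scripts.verify();
    const attack = scripts.attack ? await scripts.attack() : null;
    // A task passes only when verify passes and any attack check also passes.
    const passed = verify.ok && (attack === null || attack.ok);
    return { taskId, status: passed ? "PASS" : "FAIL", verify, attack };
  } finally {
    // Cleanup runs even when the agent or a verifier throws.
    if (scripts.cleanup) await scripts.cleanup();
  }
}

// Serializes a result as one structured JSON line.
function toJsonLine(result: TaskResult): string {
  return JSON.stringify(result);
}
```

Putting cleanup in a finally block is one way to keep a failed agent run from leaking task state into the next task.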

Prerequisites

For harness-only checks:

Node.js 20 or newer
npm

For real Claude Code runs:

Claude Code installed and authenticated
ANTHROPIC_API_KEY set in .env or the shell environment

For real InsForge pilot tasks:

Docker running. The default provider sandbox starts an isolated InsForge Compose stack from locked public Docker images, not from a local InsForge source checkout. Docs/API context is extracted from the same locked InsForge runtime image.

For real Supabase pilot tasks:

Docker running
Supabase CLI available through npx. The default provider sandbox creates and starts a temporary Supabase local project for the run.

Quickstart

npm install
npm run typecheck
npm run smoke

The smoke command uses the mock adapter so the harness can be validated without Claude Code or an API key.

The benchmark CLI automatically loads .env from the repo root. Values already present in the shell environment take precedence over .env. The local .env should stay narrow: user secrets plus reproducibility pins only. Provider URLs, provider API keys, Supabase project directories, Docs/API context, MCP configs, workflow tool lists, ports, and sandbox credentials are generated by the harness for each run.

Each run writes a summary.json with result totals plus reproducibility metadata such as benchmark package version, Node runtime, git commit/dirty state, agent command version, model, permission mode, and allowed tools. Runs also record a workflow profile: direct for the baseline prompt, or docs-api, cli, cli-skills, mcp, and mcp-skills for workflow comparisons.

Use npm run benchmark:k -- --k <count> --workflow <workflow> for reliability checks. K-run attempts are independent child benchmark runs, each with its own run id and provider sandbox. The generated k-run report summarizes pass@k, pass^k, backend stability, workflow-compliance stability, time, and token metrics.

The CLI exits non-zero when any selected task fails, so shell scripts can stop on failed benchmark runs. Tasks can declare workflow-specific unsupported surfaces in meta.json; skipped tasks are written to results as SKIP:unsupported_workflow and are not counted in the executed pass-rate denominator.

CLI and MCP runs also compute workflow evidence: npx @insforge/cli or npx supabase command logs for CLI, and MCP tools/call logs for MCP. Evidence is diagnostic by default; pass --require-workflow-evidence to make missing workflow evidence fail the run. Use --hard-isolation when comparing workflow surfaces strictly.
Hard isolation requires workflow evidence and narrows Claude Code allowed/disallowed tools per workflow:

Docs/API: may make direct backend requests but must derive product details from the workspace docs-context snapshot.
CLI: must use the provider CLI from a harness-prelinked local project and runs CLI subprocesses with workspace-local HOME/config/cache state. It cannot use direct backend requests, ambient local Skill-tool calls, MCP tools, or file read/edit tools, and it cannot link or discover cloud projects.
cli-skills: same as CLI, but additionally fetches the pinned official skills snapshot into the workspace and enables file reads for that workspace context.
MCP: gets explicit provider MCP tools without Bash, direct file read/edit tools, or raw-SQL request tunneling.
mcp-skills: same as MCP, but additionally gets the official skills snapshot as read-only context.

Provider runs always use an isolated Docker sandbox. The harness starts a provider stack per run from the versions selected by BENCH_INSFORGE_IMAGE_VERSION and BENCH_SUPABASE_IMAGE_VERSION, resolves them to exact digests through configs/provider-images.lock.json, injects generated provider env values, and tears the stack down after the run.
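
The skip accounting and exit-code behavior described in this section can be illustrated with a small sketch. The Totals shape and the summarize function are hypothetical; only the SKIP:unsupported_workflow status string and the rule that skips stay out of the pass-rate denominator come from this README.

```typescript
// Status strings: PASS/FAIL are assumed labels; the SKIP value is from the README.
type Status = "PASS" | "FAIL" | "SKIP:unsupported_workflow";

// Hypothetical totals shape; the real summary.json layout may differ.
interface Totals {
  passed: number;
  failed: number;
  skipped: number;
  executed: number;
  passRate: number | null; // null when nothing was executed
  exitCode: number;
}

function summarize(statuses: Status[]): Totals {
  const passed = statuses.filter((s) => s === "PASS").length;
  const failed = statuses.filter((s) => s === "FAIL").length;
  const skipped = statuses.filter((s) => s === "SKIP:unsupported_workflow").length;
  // Skipped tasks are excluded from the executed denominator.
  const executed = passed + failed;
  return {
    passed,
    failed,
    skipped,
    executed,
    passRate: executed === 0 ? null : passed / executed,
    // Any failed task makes the process exit non-zero.
    exitCode: failed > 0 ? 1 : 0,
  };
}
```

With two passes, one failure, and one skip, this sketch reports a pass rate of 2/3 rather than 2/4, and a non-zero exit code.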

Pilot InsForge Tasks

Create local benchmark config:

cp configs/local.example.env .env

At minimum, edit .env for real pilot runs:

ANTHROPIC_API_KEY=<your Anthropic API key>

The rest of configs/local.example.env is only reproducibility pins:

ANTHROPIC_MODEL=claude-sonnet-4-6
CLAUDE_CODE_PERMISSION_MODE=acceptEdits
BENCH_INSFORGE_IMAGE_VERSION=v2.1.10
BENCH_SUPABASE_IMAGE_VERSION=local-stack-2.101.0
BENCH_INSFORGE_CLI_PACKAGE=@insforge/cli@0.1.82
BENCH_SUPABASE_CLI_PACKAGE=supabase@2.101.0
BENCH_MCP_PACKAGE=@insforge/mcp@1.2.11
BENCH_SKILLS_LOCK_PATH=configs/insforge-skills.lock.json
BENCH_SUPABASE_SKILLS_LOCK_PATH=configs/supabase-skills.lock.json
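
Because values already present in the shell environment take precedence over .env, pins like the ones above act as defaults that a shell export can override. A minimal sketch of that precedence rule, assuming a plain KEY=VALUE file without quoting or variable expansion (parseEnvFile and resolveEnv are hypothetical names, and the harness's real loader may behave differently):

```typescript
// Parses simple KEY=VALUE lines; blank lines and # comments are ignored.
// Quoting and variable expansion are deliberately not handled in this sketch.
function parseEnvFile(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // skip lines without a key
    out[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return out;
}

// Shell values win over .env values when both define the same key.
function resolveEnv(
  fileVars: Record<string, string>,
  shellVars: Record<string, string | undefined>,
): Record<string, string> {
  const merged: Record<string, string> = { ...fileVars };
  for (const [key, value] of Object.entries(shellVars)) {
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}
```

In a Node process, resolveEnv(parseEnvFile(fileText), process.env) would apply the same rule to the real environment.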

Provider image pins must stay in lockstep with configs/provider-images.lock.json. The Docs/API workflow does not read a local InsForge checkout. It extracts /app/docs from the locked ghcr.io/insforge/insforge-oss image and copies that snapshot into each task workspace.

With Docker running, .env populated, and Claude Code ready:

npm run benchmark:cli
npm run benchmark:cli-skills
npm run benchmark:mcp
npm run benchmark:mcp-skills
npm run benchmark:all

Docs/API is currently opt-in because the locked InsForge runtime image does not yet include the product OpenAPI YAML. Run it explicitly when you want to test that surface:

npm run benchmark:api

By default these commands start their own Docker Compose project for each workflow run. The InsForge sandbox uses BENCH_INSFORGE_IMAGE_VERSION plus configs/provider-images.lock.json, assigns free host ports, injects generated provider connection values, and tears down the Compose project with volumes after the run. Use --keep-provider-sandbox when you want to debug the generated stack after a run.

These benchmark shortcuts run the canonical task set with claude-code under hard workflow isolation. benchmark:api maps to docs-api, benchmark:cli maps to bare CLI, benchmark:cli-skills maps to CLI+official skills, benchmark:mcp maps to bare MCP, and benchmark:mcp-skills maps to MCP+official skills. benchmark:all runs the default non-Docs workflows only: InsForge runs cli, cli-skills, and mcp; Supabase runs cli, cli-skills, mcp, and mcp-skills. benchmark:matrix also excludes Docs/API by default. Both continue after a workflow failure so comparison data is not cut short, then exit non-zero if any workflow failed. Each benchmark shortcut writes a Notion-friendly Markdown report under reports/ after the run finishes.

Pass filters and run options after --:

npm run benchmark:api -- --module db --task owner_notes
npm run benchmark:all -- --module db
npm run benchmark:all -- --capability access-control
npm run benchmark:all -- --dry-run
npm run benchmark:all -- --no-report

DB tasks ask Claude Code to configure public-schema database capabilities, then verifiers perform legitimate user operations, invariant checks, query checks, and attack checks where relevant. Database tasks are grouped into access-control, integrity, query, and vector; query tasks may also record query performance metrics. Storage tasks ask Claude Code to configure buckets and object policies; verifiers handle object uploads/downloads and access attacks.

Pilot Supabase Local Tasks

The Supabase phase uses the Supabase CLI local development stack in a temporary project created by the benchmark harness for each run. It does not clone the Supabase monorepo, does not connect to hosted Supabase projects, and does not target an already-running local Supabase container.

With Docker running and Claude Code ready:

npm run benchmark:supabase:cli
npm run benchmark:supabase:cli-skills
npm run benchmark:supabase:mcp
npm run benchmark:supabase:mcp-skills
npm run benchmark:supabase:all
npm run benchmark:matrix

The Supabase sandbox creates results/<run-id>/.provider/supabase/project, assigns free local ports in supabase/config.toml, starts the pinned Supabase CLI stack, parses npx supabase status -o env, and injects the generated connection values, project dir, and local MCP URL. It stops the stack with npx supabase stop --no-backup unless --keep-provider-sandbox is set. The harness verifies the started Supabase containers against the Supabase image digests in configs/provider-images.lock.json. None of those generated Supabase values belong in .env.

Supabase shortcuts request the central task catalog with claude-code under hard workflow isolation. DB runs use the db suite by default when --module db is supplied; other modules keep the pilot suite default. The harness resolves tasks from the canonical module-first catalog under tasks/db, tasks/storage, tasks/auth, and tasks/e2e; provider-specific verifier details live inside those task directories instead of a parallel provider suite. Supabase Auth config tasks are active for cli-skills, where the agent can combine the pinned Supabase CLI with the official skills snapshot and edit the task-local supabase/config.toml. Bare cli remains unsupported for Auth config tasks because hard isolation exposes only npx supabase ..., and the Supabase CLI does not provide a local auth config mutation subcommand.

For Supabase CLI runs, the harness copies the sandbox project's supabase/ directory into each isolated task workspace before the agent runs. This is the Supabase equivalent of the InsForge local prelink: the agent should not run hosted login, link, project discovery, or cloud commands.

Supabase CLI+Skills runs use npx supabase ... plus the official supabase/agent-skills GitHub snapshot pinned by configs/supabase-skills.lock.json. The harness pins npx supabase to BENCH_SUPABASE_CLI_PACKAGE and requires it to match the provider image lock for reproducible CLI runs. It also requires BENCH_SUPABASE_IMAGE_VERSION to match the locked local-stack image set. Supabase MCP runs auto-configure the sandbox MCP endpoint from generated sandbox values. Supabase MCP+Skills uses the same MCP surface plus the official skills snapshot as read-only context.

Auth signup/session is intentionally not a standalone pilot task: user creation and login checks belong in verifiers, not in the workflow task surface. Supabase Storage parity is included in the central task catalog. Supabase MCP Auth config tasks are explicit unsupported skips: the official Supabase MCP tool surface currently lists database, debugging, development, edge functions, account, docs, and storage configuration tools, but no Auth configuration tools.

For a first real Supabase workflow comparison, run a single known-supported DB task through all strict Supabase workflow surfaces:

npm run benchmark:supabase:all -- --module db --task owner_notes

Then inspect the generated Markdown report under reports/, or print the quick terminal summary:

npm run report -- --limit 10

Some task/workflow pairs are expected to skip or fail because the current workflow surface does not expose that backend capability. Those cases are recorded in each task's meta.json; see docs/db-task-set.md for the current public database task set.

Claude Code Run

Install and authenticate Claude Code following Anthropic's CLI docs, then configure local secrets:

cp configs/local.example.env .env
# edit .env and set ANTHROPIC_API_KEY
npm run benchmark:api -- --module db --task owner_notes

The Claude adapter invokes:

claude -p "<task prompt>" --model "$ANTHROPIC_MODEL" --output-format json --permission-mode "$CLAUDE_CODE_PERMISSION_MODE"

Default model is claude-sonnet-4-6.
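
As a rough illustration, the argument list for that invocation could be assembled like this. buildClaudeArgs is a hypothetical helper, not the adapter's real code; the flags are the ones shown above, and the acceptEdits fallback is an assumption taken from configs/local.example.env rather than a documented default.

```typescript
// Builds the argv for the claude command shown above.
function buildClaudeArgs(
  prompt: string,
  env: Record<string, string | undefined>,
): string[] {
  // Default model as stated in this README.
  const model = env["ANTHROPIC_MODEL"] ?? "claude-sonnet-4-6";
  // Assumption: fall back to the value pinned in the example env file.
  const permissionMode = env["CLAUDE_CODE_PERMISSION_MODE"] ?? "acceptEdits";
  return [
    "-p", prompt,
    "--model", model,
    "--output-format", "json",
    "--permission-mode", permissionMode,
  ];
}
```

Passing the result as an argument array to a process spawner, rather than joining it into a shell string, avoids quoting problems when the task prompt contains quotes or newlines.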

To run a workflow profile directly:

npm run benchmark:api -- --module db --task owner_notes

To run a capability subset in one summary:

npm run benchmark:mcp -- --capability auth,access-control,integrity,storage

The cli and cli-skills workflows preconfigure each task workspace with a local .insforge/project.json that points at the generated sandbox backend URL; this is harness setup, not part of the agent workflow, and it is not read from .env. They record npx @insforge/cli invocations in workflow-commands.jsonl when they happen and capture the pinned InsForge CLI package/version in run metadata. Agents still invoke npx @insforge/cli ..., but the harness wrapper executes BENCH_INSFORGE_CLI_PACKAGE (default @insforge/cli@0.1.82) for reproducibility.

CLI+Skills runs fetch the official InsForge/insforge-skills GitHub snapshot pinned by configs/insforge-skills.lock.json into each task workspace, so local installed skill edits do not silently change benchmark results. Command logs include sanitized args, exit code, and sanitized stdout/stderr snippets; API keys, bearer/JWT-like tokens, access tokens, passwords, and secrets add values are redacted. Update the CLI package pin or GitHub skills commit lock only when intentionally moving the official benchmark version; do not patch local skills to turn a failing benchmark task into a passing one.

The docs-api workflow copies a workspace-local docs-context/ snapshot extracted from the locked InsForge runtime image. The task prompt intentionally does not embed endpoint details; Docs/API agents must discover them from that snapshot.

Workflow prompts explicitly tell CLI agents to use the provider CLI and MCP agents to use provider MCP tools. The cli-skills and mcp-skills workflows add an official GitHub skills snapshot as workspace context; bare cli and mcp workflows do not.

MCP workflows auto-generate provider-local MCP config from sandbox values, run strict MCP config mode by default, and record a sanitized MCP server summary in run metadata. Inline MCP configs are wrapped in a workspace-local proxy that writes sanitized tool-call metadata to mcp-tool-calls.jsonl; it records timestamps, JSON-RPC ids, and tool names only, never request arguments, responses, headers, or env vars. The default MCP allowed-tool list enumerates the pinned server's current tools explicitly for Claude Code.
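
The tool-call logging rule (timestamps, JSON-RPC ids, and tool names only) can be sketched as a pure function over one JSON-RPC request. The record field names and the toToolCallRecord function are hypothetical; the proxy's real log schema is not documented here. The sketch relies only on the MCP convention that a tools/call request carries the tool name in params.name.

```typescript
// Minimal JSON-RPC request shape for this sketch.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id?: string | number | null;
  method: string;
  params?: unknown;
}

// Hypothetical log record: nothing beyond timestamp, id, and tool name.
interface ToolCallRecord {
  timestamp: string;
  id: string | number | null;
  tool: string;
}

// Returns a sanitized record for tools/call requests, or null for anything else.
function toToolCallRecord(
  msg: JsonRpcRequest,
  now: Date = new Date(),
): ToolCallRecord | null {
  if (msg.method !== "tools/call") return null;
  const params = msg.params as { name?: unknown } | undefined;
  const tool = typeof params?.name === "string" ? params.name : "<unknown>";
  // Arguments, responses, headers, and env vars are never copied into the record.
  return { timestamp: now.toISOString(), id: msg.id ?? null, tool };
}
```

Building the record from an explicit allowlist of fields, instead of deleting sensitive fields from a copy of the request, means a new request field cannot leak into the log by accident.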

To summarize recent results:

npm run report -- --limit 30

To regenerate a Markdown report for existing run ids:

npm run report:generate -- --run-id fresh-core-api,fresh-core-cli,fresh-core-mcp
npm run report:generate -- --current --output reports/current-benchmark-report.md

configs/run-registry.json marks local run ids as current, historical, polluted, or superseded. Both npm run report and npm run report:generate read it automatically when present; use --current or --status current,polluted to filter report inputs.

To run an independent k-run reliability check:

npm run benchmark:k -- --k 3 --workflow mcp --module db --task owner_notes
npm run benchmark:k -- --k 3 --provider supabase --workflow cli-skills --module db
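
To make the k-run metrics concrete, here is a minimal sketch under stated assumptions: pass@k is taken as the fraction of tasks where at least one of the k attempts passed, and pass^k as the fraction where all k attempts passed. This README does not define the harness's exact estimators, so treat these definitions and the kRunMetrics name as illustrative.

```typescript
// Maps each task id to its k independent attempt outcomes (true = passed).
function kRunMetrics(
  attemptsByTask: Record<string, boolean[]>,
): { passAtK: number; passHatK: number } {
  const tasks = Object.values(attemptsByTask);
  if (tasks.length === 0) throw new Error("no tasks to summarize");
  let anyPassed = 0;
  let allPassed = 0;
  for (const attempts of tasks) {
    if (attempts.length === 0) throw new Error("task has no attempts");
    if (attempts.some((ok) => ok)) anyPassed += 1; // counts toward pass@k
    if (attempts.every((ok) => ok)) allPassed += 1; // counts toward pass^k
  }
  return {
    passAtK: anyPassed / tasks.length,
    passHatK: allPassed / tasks.length,
  };
}
```

Under these definitions a task that passes once in three attempts raises pass@k but not pass^k, which is why the two numbers together say more about reliability than a single pass rate.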

To require actual MCP/CLI usage during a workflow experiment:

npm run benchmark:mcp -- --module db --task owner_notes --require-workflow-evidence

The benchmark shortcuts already force strict workflow isolation. For low-level experiments, the underlying CLI remains available:

npm run run -- --module db --agent claude-code --workflow mcp --task owner_notes --hard-isolation

In hard isolation, CLI workflows must produce CLI workflow evidence from the harness-prelinked local project, perform at least one successful backend-relevant CLI operation, and must not use direct backend requests, inherited CLI login state, ambient local Skill-tool calls, cloud auth behavior, cloud account discovery, MCP tools, file read/edit tools, or any CLI link command. The cli-skills variant must also use only the pinned official workspace skills snapshot for skills guidance. MCP workflows must not use Bash, direct backend requests, CLI, file read/edit tools, or backend request tunneling through raw SQL. The mcp-skills variant must also use only the pinned official workspace skills snapshot for skills guidance. Tasks with known workflow-surface gaps should declare unsupportedWorkflows in metadata so the runner records a skip instead of asking the agent to rediscover an impossible operation.

Layout

src/cli.ts - command entrypoint
src/benchmark.ts - workflow benchmark shortcuts
src/benchmark-k.ts - independent k-run reliability orchestration
src/runner/benchmark-runner.ts - task lifecycle orchestration
src/adapters/ - agent workflow adapters
src/tasks/ - task discovery and script loading
src/reporting/ - result persistence
tasks/db/<category>/, tasks/storage/, tasks/auth/, tasks/e2e/ - module-first canonical task catalog
results/ - generated run artifacts
reports/ - generated Markdown reports

Next Provider Tasks

Add the logical task first under the central catalog:

tasks/<module>/<task_id>/

For database tasks, place public tasks under one of the category directories:

tasks/db/access-control/<task_id>/
tasks/db/integrity/<task_id>/
tasks/db/query/<task_id>/
tasks/db/vector/<task_id>/

Each database task owns its prompt and lifecycle scripts:

tasks/db/<category>/<task_id>/prompt.md
tasks/db/<category>/<task_id>/prepare.ts
tasks/db/<category>/<task_id>/verify.ts
tasks/db/<category>/<task_id>/attack.ts
tasks/db/<category>/<task_id>/cleanup.ts

Use capability in meta.json as the lightweight category filter: access-control, integrity, query, or vector. Query verifiers may return query performance details in their normal result details; other categories do not need to emit empty performance fields.

When verifier or cleanup logic must differ by provider, keep the provider adapter inside the same task directory:

tasks/<module>/<task_id>/run.ts
tasks/<module>/<task_id>/<provider>/...

Discovery only reads the module-first catalog. Reports use the same suite/task id across providers, so provider and workflow comparisons line up against one shared task list.

Each task should include:

meta.json
prompt.md
lifecycle script exports for prepare, verify, optional attack, and optional cleanup either as individual files or a shared run.ts

Task meta.json uses a three-level difficulty scale:

simple: one resource or one configuration surface, with straightforward ownership or config checks.
medium: joins, role checks, uniqueness, soft-delete, JSON/temporal predicates, or one focused lifecycle invariant.
hard: multi-table state machines, ACL matrices, quotas, trigger-maintained consistency, field-level write masks, or combined backend surfaces.

Security-sensitive tasks should make the attack export a hard gate: a task only passes when both verify and attack pass. If a workflow cannot support the task by design, add unsupportedWorkflows to meta.json with a short reason. If support differs by provider, use unsupportedWorkflowsByProvider. Entries are workflow-exact, so mark both mcp and mcp-skills when a task is unsupported for both MCP variants.
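
The workflow-exact skip lookup can be sketched as below. The field names unsupportedWorkflows and unsupportedWorkflowsByProvider come from this README, but their shape (a map from workflow name to a short reason), the provider identifier insforge, and the skipReason function are assumptions made for illustration.

```typescript
type Workflow = "direct" | "docs-api" | "cli" | "cli-skills" | "mcp" | "mcp-skills";
type Provider = "insforge" | "supabase"; // "insforge" is an assumed identifier

// Assumed shape: each entry maps one exact workflow name to a short reason.
type UnsupportedMap = Partial<Record<Workflow, string>>;

interface TaskMeta {
  capability: string;
  difficulty: "simple" | "medium" | "hard";
  unsupportedWorkflows?: UnsupportedMap;
  unsupportedWorkflowsByProvider?: Partial<Record<Provider, UnsupportedMap>>;
}

// Returns the skip reason for a provider/workflow pair, or null when supported.
function skipReason(
  meta: TaskMeta,
  provider: Provider,
  workflow: Workflow,
): string | null {
  const byProvider = meta.unsupportedWorkflowsByProvider?.[provider]?.[workflow];
  const general = meta.unsupportedWorkflows?.[workflow];
  // Lookups are exact: an "mcp" entry does not cover "mcp-skills".
  return byProvider ?? general ?? null;
}
```

Because matching is exact, a task that is unsupported for both MCP variants needs two entries, one for mcp and one for mcp-skills.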
