spec(cli): SPEC-013 v0.3 — macprovider-cli autotune subcommand by Augustas11 · Pull Request #108 · Augustas11/macprovider

Augustas11 · 2026-06-18T11:50:16Z

Summary

SPEC-013 v0.3 specifies a new macprovider-cli autotune subcommand that encodes the network's "use this Mac's capacity to its maximum useful capability" product strategy.

The load-bearing decision: biggest-fit, not max-tps. Autotune wraps the PR #105 serving knobs (--kv-bits, --max-context, --max-batch) in a two-stage pipeline. Stage 1 iterates a curated largest-first candidate model list and STOPS on the first model that passes the feasibility gate; Stage 2 hill-climbs the knobs WITHIN that chosen model. This is the explicit rejection of the PR #103 Python prototype's objective — its cartesian max-tps cell-search would push every capable Mac to serve the smallest model (the buyer could call any cloud API for that), destroying the network value. Throughput is the tiebreaker among knob settings, NOT the cross-model objective.

A 16 GB Mac that can host a 7B SHOULD serve a 7B even though the 1B is faster on the same hardware. SPEC-013 is what encodes that.

What this is NOT

Not implementation. SPEC-only. No Swift code, no Python code, no tests, no deploy. Implementation choice between Option A (Swift-native subcommand) vs Option B (Python-wrapper around the prototype) is explicitly deferred to the implementing PR per §10.
Not merge-ready. This PR opens as DRAFT for spec review. The audit cycle has converged (3 codex rounds, see below), but reviewers should still pass on the locked v0.3 before any implementation PR begins.
Not a coordinator change. SPEC-002 v1.3.5 / SPEC-006 v0.8.3 are untouched; no wire protocol changes; no buyer-facing behavior changes; no billing implications. With autotune unused, provider behavior is byte-identical to pre-SPEC-013.

Audit cycle (codex, 3 rounds)

Per the project's house pattern, the SPEC went through a codex audit loop before opening this PR. Full report at specs/SPEC-013-audit.md.

Round	Verdict	Closure
1 (on v0.1)	0 CRITICAL / 7 MAJOR / 11 MINOR / 2 QUESTION	v0.2 closes 17 of 19 (1 MINOR deferred to post-lock checklist)
2 (on v0.2)	LOCK READY + 1 MAJOR new + 3 MINOR new	v0.3 closes all 4
3 (on v0.3)	LOCK + 0 MAJOR / 1 trivial MINOR	folded into v0.3

Key round-1 closures were code-grounded fixes that would have broken on day one of implementation: the --apply config keys (max_context_override / max_concurrency_override per Config.swift:239), the launchd label (live.streamvc.macprovider per SPEC-003 v0.9.2 §FR-C5), and the tune_trials.stage SQLite migration (ALTER TABLE ... ADD COLUMN ... INTEGER NOT NULL DEFAULT 1).

What's in the SPEC

§3 architecture: two-stage pipeline with provider-lifecycle invariants (one provider at a time, --no-join so candidates never enter the coordinator pool)
§5 functional requirements: FR-A.1–4 (Stage 1), FR-B.1–3 (Stage 2), FR-C.1–3 (default curated candidate list), FR-D.1–2 (pre-warm contract with Shape A vs Shape B implementation choice + integrity-vs-transient failure split), FR-E.1–2 (provider-conflict safety, launchd drain), FR-F.1–3 (recommendation surface + JSON schema + opt-in --apply), FR-G.1–2 (tune_trials + tune_runs SQLite schema), FR-H.1–4 (failure modes)
§6 NFRs: wall-clock budget per RAM tier (~20 min on 8 GB → ~95 min on 64 GB+), single-slot resource posture, reversibility via collision-safe backups, nothing leaves the machine except FR-D.1 HuggingFace pre-warm fetches
§8 acceptance criteria: 19 ACs including AC-17 — the load-bearing "operator-supplied order is honored verbatim, no internal rerank" guard
§9 open questions: 5 OQs flagged for the in-flight air5 n=3 replication run (TPS_TIE_EPSILON default, stage2_replicates, kv-bits axis vs default, stage1_replicates stability, OQ-E thermal/cell-order bias). Each names a measurable decision threshold.
§11 out of scope for v1 / queued for v2: coordinator-served candidate list, recipe attestation, sticky-affinity from recipes, warm-swap-driven tuning. SPEC-014 provisionally reserved for the coordinator-served recommended catalog (cross-spec renumber per SPEC-011 §8).
§12 prototype migration note: what survives from PR spike: provider-side model-selection autotune loop (autoresearch) #103 (provider lifecycle, _is_new_best knob tiebreak, tune_trials schema, signal handling), what changes (the objective itself).

Open questions (pending air5 n=3 replication run)

Each has a quantitative decision threshold; v0.4 either confirms placeholders or adjusts via a narrow PR with the air5 data attached:

OQ-A TPS_TIE_EPSILON default given measured σ(tps)/μ(tps)
OQ-B stage2_replicates minimum discriminable tps gap ≤ TPS_TIE_EPSILON × median
OQ-C whether kv-bits stays a search axis or becomes a fixed default
OQ-D Stage 1 fit-determination needs N>1 replicates (false-fit/false-reject rate > 5% → escalate)
OQ-E thermal/cell-order bias (10 paired forward/reverse runs; mismatch_pairs/10 > 0.05 → randomize/cooldown)

Test plan

Audit cycle converged (3 codex rounds, LOCK verdict)
All FR/AC cross-references resolve within the document
Spec text grounded against actual code: Config.swift YAML keys, live.streamvc.macprovider launchd label, ModelRuntime.configuration(for:) HF cache + online-fallback path, beta/autotune.py provider lifecycle + _is_new_best
Spec review: human reviewer reads v0.3 end-to-end and approves LOCK (or files round-4 findings)
DO NOT MERGE until spec review is complete
No implementation work begins until this PR merges and §11 post-lock checklist runs (decision-log entry, SPEC-003 install-flow note, PR spike: provider-side model-selection autotune loop (autoresearch) #103 disposition)

Initial draft of the autotune subcommand spec + the round-1 codex audit prompt. NOT for merge — this commit lives on the feature branch only and the PR is held until the codex audit loop converges. SPEC-013 wraps the PR #105 serve flags (--kv-bits, --max-context, --max-batch) in a two-stage pipeline that encodes the "biggest-fit, not max-tps" product strategy. Stage 1 iterates a curated largest-first candidate list and STOPS on the first model that passes the feasibility gate; Stage 2 hill-climbs knobs WITHIN the chosen model. This is the load-bearing departure from the PR #103 Python prototype (whose cartesian max-tps loop would push every capable Mac to serve the smallest model). Four numerical defaults (TPS_TIE_EPSILON, stage1_replicates, stage2_replicates, kv-bits axis-vs-default) are flagged as Open Questions pending the in-flight air5 n=3 replication run; v0.2 either confirms placeholders or sends a narrow PR adjusting them. Files: - specs/SPEC-013-cli-autotune.md (new, v0.1 draft) - specs/AUDIT_SPEC_013_PROMPT.md (new, round-1 codex audit prompt) - specs/README.md (+1 row in the index table) Next step: fire AUDIT_SPEC_013_PROMPT.md at codex, address findings in v0.2, re-audit, loop until 0 CRITICAL / 0 MAJOR, then push + PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Round-1 codex audit (specs/SPEC-013-audit.md) returned 0 CRITICAL / 7 MAJOR / 11 MINOR / 2 QUESTION on v0.1, with verdict "not ready to lock as drafted." v0.2 closes all 7 MAJORs, 10 of 11 MINORs, and both QUESTIONs. The product framing (biggest-fit, not max-tps) and two-stage architecture are unchanged — round 1 explicitly preserved both. MAJORs closed: - A.1 fallback contradiction: replaced metrics-bearing `fallbacks` with NAME-ONLY `alternates` (the STOP-on-first-feasible rule meant smaller candidates were never probed; v0.1's fallback metrics were structurally impossible). - D.1 `models pull` precondition was bigger than admitted: FR-D reframed as "weights cache-warm before probe; load-fetch latency excluded from gate-ttft-ms" with Shape A (explicit pull) vs Shape B (rely on runtime online-fallback + isolate measurement) implementation choice. No longer depends on a not-yet-existing subcommand. - E.1 launchd label wrong: `com.macprovider.cli` → `live.streamvc.macprovider` (matching SPEC-003 v0.9.2 §FR-C5, install.sh, plist template). Drain sequence bound to `launchctl bootout/bootstrap gui/$UID/...`. - F.1 `--apply` wrote wrong YAML keys: `max_context_tokens` / `max_batch` were the CLI flag names; actual YAML keys per Config.swift:239-241 are `max_context_override` / `max_concurrency_override`. JSON `knobs` object now uses YAML key names for round-trip into config.yaml; `serve_command` retains CLI flag names for shell paste. - F.2 recipe_hash not deterministic: pinned to `sha256:<64-lowercase-hex>` + RFC 8785 JCS canonicalization + explicit hash input domain enumeration (machine + inputs + recommendation.model + recommendation.knobs; excludes run_id, timestamps, observed metrics). - G.1 SQLite migration invalid: `ALTER TABLE tune_trials ADD COLUMN stage INTEGER NOT NULL DEFAULT 1` spelled out; new inserts MUST set stage=1 or stage=2 explicitly. - J.1 no AC for operator-supplied order: added AC-17 with `--candidate-models 1B,32B` on a Mac where both fit — must pick 1B because operator order is the contract. MINORs closed (10): B.1 (max-context-axis semantics), C.1 (CLI summary kv-bits default), F.3 (backup naming collision-safe), G.2 (transactional retention), H.1 (--resume removed from §7), J.2 (AC-18 new), J.3 (AC-19 new + exit_reason enum), K.1 (OQ-B/OQ-D quantitative thresholds), L.1 (prototype migration note), M.1 (cross-spec renumber to SPEC-014). QUESTIONs resolved (2): D.2 (signature vs network failure now asymmetric — integrity aborts whole run, transient advances), K.2 (added OQ-E flagging thermal/order bias with quantitative threshold). Deferred to post-lock: M.2 documentation checklist (decision-log entry, SPEC-003 install note, PR #103 disposition) — captured as a §11 checklist but not in the binding contract. Files: - specs/SPEC-013-cli-autotune.md (v0.1 → v0.2; +566 lines) - specs/SPEC-013-audit.md (NEW, codex round-1 output) - specs/AUDIT_SPEC_013_V0_2_PROMPT.md (NEW, round-2 audit prompt) Next step: fire AUDIT_SPEC_013_V0_2_PROMPT.md at codex for the round-2 closure check, address any new findings, repeat until LOCK READY, then push + PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Round-2 codex audit (specs/SPEC-013-audit.md § Round 2) returned LOCK READY with 17 CLOSED / 1 PARTIAL / 1 OVER-CLOSED on the round-1 findings, plus 1 MAJOR new + 3 MINOR new. Codex recommended a narrow v0.3 closing the 4 new findings before implementation. v0.3 closes all 4. No architecture change. Round-2 closures: - N-D.1 (MAJOR) Shape B vs models-pull-only wording: v0.2's FR-D rewrite permitted Shape B (rely on runtime online-fallback + measurement isolation) but NFR-4's egress exception and AC-8 still spoke only of `models pull`. v0.3 reworords NFR-4 to admit both Shape A and Shape B HuggingFace pre-warm paths; AC-8 is now shape-neutral with explicit Shape A (mocked pull exit non-zero) and Shape B (block egress + runtime fallback fails during load) variants. A new sub-variant explicitly tests the FR-D.2 integrity-class abort path. - Z-B.1 (PARTIAL → CLOSED) `--max-context-axis` parse rules: v0.2 put the parse rules in non-normative §7. v0.3 lifts them into FR-B.1 as a normative paragraph (absolute caps, sorted ascending after parse, each cell >= --target-context, flag-parse-time rejection with exit_reason='config_error', duplicate rejection, empty-axis = single-cell). The §7 / §5 conflict-resolution rule is now stated explicitly. - N-OQ-E.1 (MINOR) thermal/order threshold lacked sampling protocol: v0.3 adds a 10-paired-runs forward/reverse protocol with 60s inter-pair idle, mismatch_pairs/10 > 0.05 trigger threshold. Operators can close OQ-E without relitigating methodology. - O.1 (MINOR) residual v0.1-era wording drift: v0.3 closes four discrete sites — `tune_runs.spec_version` SQL comment, FR-H.2 "v0.1 normative contract" prose, NFR-3 stale `.bak-<unix-ts>` pattern, and §7's "MAY change in v0.2" disclaimer. Files: - specs/SPEC-013-cli-autotune.md (v0.2 → v0.3 LOCK candidate) - specs/SPEC-013-audit.md (codex round-2 output landed) - specs/AUDIT_SPEC_013_V0_3_PROMPT.md (NEW, narrow round-3 closure-confirmation audit prompt) Next step: fire AUDIT_SPEC_013_V0_3_PROMPT.md at codex for the round-3 LOCK-confirmation check. Expected outcome: LOCK with 0 new findings or ≤1 MINOR. If LOCK, push the branch and open the DRAFT PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…fold-in Round-3 codex audit (specs/SPEC-013-audit.md § Round 3) returned LOCK with 4 CLOSED / 0 PARTIAL across the round-2 findings, plus 0 CRITICAL anti-regression / 0 MAJOR new / 1 MINOR new. The single round-3 MINOR (O-V03.1) is editorial — FR-F.2's JSON example still showed "SPEC-013 v0.2" inside a v0.3 document. Codex explicitly said this does not block LOCK (the adjacent SQL comment already stated writers emit their own producing version), but recommended folding the fix in before implementation. This commit folds it: the JSON example now uses "SPEC-013 v<producing-version>" as a placeholder, and the spec_version bullet teaches the rule. Round-3 closures (from specs/SPEC-013-audit.md § Round 3): - N-D.1 CLOSED: NFR-4 admits both Shape A (`models pull` or equivalent) and Shape B (runtime online fallback during model load) HuggingFace pre-warm paths; carve-out scoped to autotune runs and weight fetches; AC-8 shape-neutral with explicit Shape A + Shape B + integrity-class variants. - Z-B.1 CLOSED: `--max-context-axis` parse contract lifted from non-normative §7 into binding FR-B.1 (absolute caps, sorted ascending, ≥ target-context, flag-parse-time rejection with exit_reason='config_error', duplicate rejection, empty-axis = single-cell); §7 vs §5 conflict-resolution rule explicit. - N-OQ-E.1 CLOSED: OQ-E thermal/order threshold has a measurable 10-paired-runs forward/reverse sampling protocol on air5 with 60s inter-pair idle and the mismatch_pairs/10 > 0.05 trigger. - O.1 CLOSED: all 4 named drift sites updated (tune_runs SQL comment, FR-H.2 prose, NFR-3 backup pattern, §7 disclaimer). Specs index updated: specs/README.md row for SPEC-013 now reads v0.3. Files: - specs/SPEC-013-cli-autotune.md (O-V03.1 editorial fold-in) - specs/SPEC-013-audit.md (codex round-3 LOCK verdict landed) - specs/README.md (SPEC-013 row → v0.3) Audit cycle complete after 3 codex rounds: v0.1 → v0.2 (7 MAJOR + 10 MINOR + 2 QUESTION closed) → v0.3 (1 MAJOR + 3 MINOR closed from round 2) → LOCK. Next step: push branch + open DRAFT PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds the operator-paste BUILD prompt that a fresh Codex CLI session uses to implement SPEC-013 v0.3 against the existing phase3-binary/ Swift package. The BUILD prompt picks Option A (Swift-native subcommand inside macprovider-cli) per SPEC-013 §10, rationale captured inline: single-binary install consistency with SPEC-003, drain semantics match existing patterns (UninstallCommand / SelfUpdate), future SPEC-011 warm-swap integration needs Swift-native. Shape A vs Shape B for FR-D pre-warm is left as the implementer's call — the binding contract is FR-D.1's measurement-isolation requirement, not the mechanism. The 11-step build sequence: 1. AutotuneCommand subcommand scaffolding + --dry-run 2. --no-join flag on ServeCommand (FR-E.2 precondition) 3. SQLite schema + DB layer (tune_trials + tune_runs + migration + transactional retention) 4. Provider lifecycle (start/stop/wait-ready, single-provider invariant) 5. Provider-conflict pre-flight (FR-E.1, launchd bootout/bootstrap) 6. Pre-warm (FR-D Shape A or Shape B, integrity vs transient classification per FR-D.2) 7. Stage 1 — feasibility iteration (FR-A, STOP-on-first-feasible, AC-17 biggest-fit guard) 8. Stage 2 — knob hill-climb (FR-B, _is_new_best verbatim from prototype) 9. Recommendation surface (FR-F: terminal block + JSON schema + RFC 8785 JCS recipe_hash + --apply atomic write) 10. Failure modes + signal handling (FR-H, exit_reason enum) 11. Acceptance test suite (AC-1 through AC-19) Branch strategy: build work happens on feat/cli-autotune-impl stacked off spec/cli-autotune-v1, so the implementing PR rebases cleanly onto main after the SPEC PR (#108) merges. Hard rules: do not modify any file under specs/ from the build branch (it's downstream of the SPEC PR); do not touch beta/coordinator/gateway; do not pivot to Option B without an Open Question; preserve the biggest-fit objective (AC-17 catches this if forgotten). The prompt is self-contained: severity definitions, required reading, step sequence, acceptance gate, hard rules, anti-rules, operator checkpoint cadence, and an Open Question template. Files: - specs/BUILD_SPEC_013_PROMPT.md (NEW) Next step: branch feat/cli-autotune-impl off spec/cli-autotune-v1 and fire BUILD_SPEC_013_PROMPT.md at codex via omc ask. Expected wall-clock: 1-2 weeks of session work for one new subcommand inside an existing binary. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Augustas11 and others added 5 commits June 18, 2026 14:06

Augustas11 marked this pull request as ready for review June 18, 2026 17:29

Augustas11 merged commit e2ebcf1 into main Jun 18, 2026
5 checks passed

Augustas11 deleted the spec/cli-autotune-v1 branch June 18, 2026 17:30

This was referenced Jun 18, 2026

feat(autotune): implement SPEC-013 v0.3 macprovider-cli autotune (11 BUILD steps, 21 audit rounds) #109

Merged

spike: provider-side model-selection autotune loop (autoresearch) #103

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spec(cli): SPEC-013 v0.3 — macprovider-cli autotune subcommand#108

spec(cli): SPEC-013 v0.3 — macprovider-cli autotune subcommand#108
Augustas11 merged 5 commits into
mainfrom
spec/cli-autotune-v1

Augustas11 commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Augustas11 commented Jun 18, 2026

Summary

What this is NOT

Audit cycle (codex, 3 rounds)

What's in the SPEC

Open questions (pending air5 n=3 replication run)

Test plan

Related

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant