spec(cli): SPEC-013 v0.3 — macprovider-cli autotune subcommand#108
Merged
Conversation
Initial draft of the autotune subcommand spec + the round-1 codex audit prompt. NOT for merge — this commit lives on the feature branch only and the PR is held until the codex audit loop converges. SPEC-013 wraps the PR #105 serve flags (--kv-bits, --max-context, --max-batch) in a two-stage pipeline that encodes the "biggest-fit, not max-tps" product strategy. Stage 1 iterates a curated largest-first candidate list and STOPS on the first model that passes the feasibility gate; Stage 2 hill-climbs knobs WITHIN the chosen model. This is the load-bearing departure from the PR #103 Python prototype (whose cartesian max-tps loop would push every capable Mac to serve the smallest model). Four numerical defaults (TPS_TIE_EPSILON, stage1_replicates, stage2_replicates, kv-bits axis-vs-default) are flagged as Open Questions pending the in-flight air5 n=3 replication run; v0.2 either confirms placeholders or sends a narrow PR adjusting them. Files: - specs/SPEC-013-cli-autotune.md (new, v0.1 draft) - specs/AUDIT_SPEC_013_PROMPT.md (new, round-1 codex audit prompt) - specs/README.md (+1 row in the index table) Next step: fire AUDIT_SPEC_013_PROMPT.md at codex, address findings in v0.2, re-audit, loop until 0 CRITICAL / 0 MAJOR, then push + PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Round-1 codex audit (specs/SPEC-013-audit.md) returned 0 CRITICAL / 7 MAJOR / 11 MINOR / 2 QUESTION on v0.1, with verdict "not ready to lock as drafted." v0.2 closes all 7 MAJORs, 10 of 11 MINORs, and both QUESTIONs. The product framing (biggest-fit, not max-tps) and two-stage architecture are unchanged — round 1 explicitly preserved both. MAJORs closed: - A.1 fallback contradiction: replaced metrics-bearing `fallbacks` with NAME-ONLY `alternates` (the STOP-on-first-feasible rule meant smaller candidates were never probed; v0.1's fallback metrics were structurally impossible). - D.1 `models pull` precondition was bigger than admitted: FR-D reframed as "weights cache-warm before probe; load-fetch latency excluded from gate-ttft-ms" with Shape A (explicit pull) vs Shape B (rely on runtime online-fallback + isolate measurement) implementation choice. No longer depends on a not-yet-existing subcommand. - E.1 launchd label wrong: `com.macprovider.cli` → `live.streamvc.macprovider` (matching SPEC-003 v0.9.2 §FR-C5, install.sh, plist template). Drain sequence bound to `launchctl bootout/bootstrap gui/$UID/...`. - F.1 `--apply` wrote wrong YAML keys: `max_context_tokens` / `max_batch` were the CLI flag names; actual YAML keys per Config.swift:239-241 are `max_context_override` / `max_concurrency_override`. JSON `knobs` object now uses YAML key names for round-trip into config.yaml; `serve_command` retains CLI flag names for shell paste. - F.2 recipe_hash not deterministic: pinned to `sha256:<64-lowercase-hex>` + RFC 8785 JCS canonicalization + explicit hash input domain enumeration (machine + inputs + recommendation.model + recommendation.knobs; excludes run_id, timestamps, observed metrics). - G.1 SQLite migration invalid: `ALTER TABLE tune_trials ADD COLUMN stage INTEGER NOT NULL DEFAULT 1` spelled out; new inserts MUST set stage=1 or stage=2 explicitly. - J.1 no AC for operator-supplied order: added AC-17 with `--candidate-models 1B,32B` on a Mac where both fit — must pick 1B because operator order is the contract. MINORs closed (10): B.1 (max-context-axis semantics), C.1 (CLI summary kv-bits default), F.3 (backup naming collision-safe), G.2 (transactional retention), H.1 (--resume removed from §7), J.2 (AC-18 new), J.3 (AC-19 new + exit_reason enum), K.1 (OQ-B/OQ-D quantitative thresholds), L.1 (prototype migration note), M.1 (cross-spec renumber to SPEC-014). QUESTIONs resolved (2): D.2 (signature vs network failure now asymmetric — integrity aborts whole run, transient advances), K.2 (added OQ-E flagging thermal/order bias with quantitative threshold). Deferred to post-lock: M.2 documentation checklist (decision-log entry, SPEC-003 install note, PR #103 disposition) — captured as a §11 checklist but not in the binding contract. Files: - specs/SPEC-013-cli-autotune.md (v0.1 → v0.2; +566 lines) - specs/SPEC-013-audit.md (NEW, codex round-1 output) - specs/AUDIT_SPEC_013_V0_2_PROMPT.md (NEW, round-2 audit prompt) Next step: fire AUDIT_SPEC_013_V0_2_PROMPT.md at codex for the round-2 closure check, address any new findings, repeat until LOCK READY, then push + PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Round-2 codex audit (specs/SPEC-013-audit.md § Round 2) returned LOCK READY with 17 CLOSED / 1 PARTIAL / 1 OVER-CLOSED on the round-1 findings, plus 1 MAJOR new + 3 MINOR new. Codex recommended a narrow v0.3 closing the 4 new findings before implementation. v0.3 closes all 4. No architecture change. Round-2 closures: - N-D.1 (MAJOR) Shape B vs models-pull-only wording: v0.2's FR-D rewrite permitted Shape B (rely on runtime online-fallback + measurement isolation) but NFR-4's egress exception and AC-8 still spoke only of `models pull`. v0.3 reworords NFR-4 to admit both Shape A and Shape B HuggingFace pre-warm paths; AC-8 is now shape-neutral with explicit Shape A (mocked pull exit non-zero) and Shape B (block egress + runtime fallback fails during load) variants. A new sub-variant explicitly tests the FR-D.2 integrity-class abort path. - Z-B.1 (PARTIAL → CLOSED) `--max-context-axis` parse rules: v0.2 put the parse rules in non-normative §7. v0.3 lifts them into FR-B.1 as a normative paragraph (absolute caps, sorted ascending after parse, each cell >= --target-context, flag-parse-time rejection with exit_reason='config_error', duplicate rejection, empty-axis = single-cell). The §7 / §5 conflict-resolution rule is now stated explicitly. - N-OQ-E.1 (MINOR) thermal/order threshold lacked sampling protocol: v0.3 adds a 10-paired-runs forward/reverse protocol with 60s inter-pair idle, mismatch_pairs/10 > 0.05 trigger threshold. Operators can close OQ-E without relitigating methodology. - O.1 (MINOR) residual v0.1-era wording drift: v0.3 closes four discrete sites — `tune_runs.spec_version` SQL comment, FR-H.2 "v0.1 normative contract" prose, NFR-3 stale `.bak-<unix-ts>` pattern, and §7's "MAY change in v0.2" disclaimer. Files: - specs/SPEC-013-cli-autotune.md (v0.2 → v0.3 LOCK candidate) - specs/SPEC-013-audit.md (codex round-2 output landed) - specs/AUDIT_SPEC_013_V0_3_PROMPT.md (NEW, narrow round-3 closure-confirmation audit prompt) Next step: fire AUDIT_SPEC_013_V0_3_PROMPT.md at codex for the round-3 LOCK-confirmation check. Expected outcome: LOCK with 0 new findings or ≤1 MINOR. If LOCK, push the branch and open the DRAFT PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…fold-in Round-3 codex audit (specs/SPEC-013-audit.md § Round 3) returned LOCK with 4 CLOSED / 0 PARTIAL across the round-2 findings, plus 0 CRITICAL anti-regression / 0 MAJOR new / 1 MINOR new. The single round-3 MINOR (O-V03.1) is editorial — FR-F.2's JSON example still showed "SPEC-013 v0.2" inside a v0.3 document. Codex explicitly said this does not block LOCK (the adjacent SQL comment already stated writers emit their own producing version), but recommended folding the fix in before implementation. This commit folds it: the JSON example now uses "SPEC-013 v<producing-version>" as a placeholder, and the spec_version bullet teaches the rule. Round-3 closures (from specs/SPEC-013-audit.md § Round 3): - N-D.1 CLOSED: NFR-4 admits both Shape A (`models pull` or equivalent) and Shape B (runtime online fallback during model load) HuggingFace pre-warm paths; carve-out scoped to autotune runs and weight fetches; AC-8 shape-neutral with explicit Shape A + Shape B + integrity-class variants. - Z-B.1 CLOSED: `--max-context-axis` parse contract lifted from non-normative §7 into binding FR-B.1 (absolute caps, sorted ascending, ≥ target-context, flag-parse-time rejection with exit_reason='config_error', duplicate rejection, empty-axis = single-cell); §7 vs §5 conflict-resolution rule explicit. - N-OQ-E.1 CLOSED: OQ-E thermal/order threshold has a measurable 10-paired-runs forward/reverse sampling protocol on air5 with 60s inter-pair idle and the mismatch_pairs/10 > 0.05 trigger. - O.1 CLOSED: all 4 named drift sites updated (tune_runs SQL comment, FR-H.2 prose, NFR-3 backup pattern, §7 disclaimer). Specs index updated: specs/README.md row for SPEC-013 now reads v0.3. Files: - specs/SPEC-013-cli-autotune.md (O-V03.1 editorial fold-in) - specs/SPEC-013-audit.md (codex round-3 LOCK verdict landed) - specs/README.md (SPEC-013 row → v0.3) Audit cycle complete after 3 codex rounds: v0.1 → v0.2 (7 MAJOR + 10 MINOR + 2 QUESTION closed) → v0.3 (1 MAJOR + 3 MINOR closed from round 2) → LOCK. Next step: push branch + open DRAFT PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the operator-paste BUILD prompt that a fresh Codex CLI session uses to implement SPEC-013 v0.3 against the existing phase3-binary/ Swift package. The BUILD prompt picks Option A (Swift-native subcommand inside macprovider-cli) per SPEC-013 §10, rationale captured inline: single-binary install consistency with SPEC-003, drain semantics match existing patterns (UninstallCommand / SelfUpdate), future SPEC-011 warm-swap integration needs Swift-native. Shape A vs Shape B for FR-D pre-warm is left as the implementer's call — the binding contract is FR-D.1's measurement-isolation requirement, not the mechanism. The 11-step build sequence: 1. AutotuneCommand subcommand scaffolding + --dry-run 2. --no-join flag on ServeCommand (FR-E.2 precondition) 3. SQLite schema + DB layer (tune_trials + tune_runs + migration + transactional retention) 4. Provider lifecycle (start/stop/wait-ready, single-provider invariant) 5. Provider-conflict pre-flight (FR-E.1, launchd bootout/bootstrap) 6. Pre-warm (FR-D Shape A or Shape B, integrity vs transient classification per FR-D.2) 7. Stage 1 — feasibility iteration (FR-A, STOP-on-first-feasible, AC-17 biggest-fit guard) 8. Stage 2 — knob hill-climb (FR-B, _is_new_best verbatim from prototype) 9. Recommendation surface (FR-F: terminal block + JSON schema + RFC 8785 JCS recipe_hash + --apply atomic write) 10. Failure modes + signal handling (FR-H, exit_reason enum) 11. Acceptance test suite (AC-1 through AC-19) Branch strategy: build work happens on feat/cli-autotune-impl stacked off spec/cli-autotune-v1, so the implementing PR rebases cleanly onto main after the SPEC PR (#108) merges. Hard rules: do not modify any file under specs/ from the build branch (it's downstream of the SPEC PR); do not touch beta/coordinator/gateway; do not pivot to Option B without an Open Question; preserve the biggest-fit objective (AC-17 catches this if forgotten). The prompt is self-contained: severity definitions, required reading, step sequence, acceptance gate, hard rules, anti-rules, operator checkpoint cadence, and an Open Question template. Files: - specs/BUILD_SPEC_013_PROMPT.md (NEW) Next step: branch feat/cli-autotune-impl off spec/cli-autotune-v1 and fire BUILD_SPEC_013_PROMPT.md at codex via omc ask. Expected wall-clock: 1-2 weeks of session work for one new subcommand inside an existing binary. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced Jun 18, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SPEC-013 v0.3 specifies a new
macprovider-cli autotunesubcommand that encodes the network's "use this Mac's capacity to its maximum useful capability" product strategy.The load-bearing decision: biggest-fit, not max-tps. Autotune wraps the PR #105 serving knobs (
--kv-bits,--max-context,--max-batch) in a two-stage pipeline. Stage 1 iterates a curated largest-first candidate model list and STOPS on the first model that passes the feasibility gate; Stage 2 hill-climbs the knobs WITHIN that chosen model. This is the explicit rejection of the PR #103 Python prototype's objective — its cartesian max-tps cell-search would push every capable Mac to serve the smallest model (the buyer could call any cloud API for that), destroying the network value. Throughput is the tiebreaker among knob settings, NOT the cross-model objective.A 16 GB Mac that can host a 7B SHOULD serve a 7B even though the 1B is faster on the same hardware. SPEC-013 is what encodes that.
What this is NOT
Audit cycle (codex, 3 rounds)
Per the project's house pattern, the SPEC went through a codex audit loop before opening this PR. Full report at specs/SPEC-013-audit.md.
Key round-1 closures were code-grounded fixes that would have broken on day one of implementation: the
--applyconfig keys (max_context_override/max_concurrency_overrideper Config.swift:239), the launchd label (live.streamvc.macproviderper SPEC-003 v0.9.2 §FR-C5), and thetune_trials.stageSQLite migration (ALTER TABLE ... ADD COLUMN ... INTEGER NOT NULL DEFAULT 1).What's in the SPEC
--no-joinso candidates never enter the coordinator pool)--apply), FR-G.1–2 (tune_trials+tune_runsSQLite schema), FR-H.1–4 (failure modes)TPS_TIE_EPSILONdefault,stage2_replicates, kv-bits axis vs default,stage1_replicatesstability, OQ-E thermal/cell-order bias). Each names a measurable decision threshold._is_new_bestknob tiebreak,tune_trialsschema, signal handling), what changes (the objective itself).Open questions (pending air5 n=3 replication run)
Each has a quantitative decision threshold; v0.4 either confirms placeholders or adjusts via a narrow PR with the air5 data attached:
TPS_TIE_EPSILONdefault given measured σ(tps)/μ(tps)stage2_replicatesminimum discriminable tps gap ≤ TPS_TIE_EPSILON × medianTest plan
Config.swiftYAML keys,live.streamvc.macproviderlaunchd label,ModelRuntime.configuration(for:)HF cache + online-fallback path,beta/autotune.pyprovider lifecycle +_is_new_bestRelated
--kv-bits,--max-context,--max-batch) that autotune wraps🤖 Generated with Claude Code