
spec(cli): SPEC-013 v0.3 — macprovider-cli autotune subcommand #108

Merged
Augustas11 merged 5 commits into main from spec/cli-autotune-v1 on Jun 18, 2026

@Augustas11

Owner

Summary

SPEC-013 v0.3 specifies a new macprovider-cli autotune subcommand that encodes the network's "use this Mac's capacity to its maximum useful capability" product strategy.

The load-bearing decision: biggest-fit, not max-tps. Autotune wraps the PR #105 serving knobs (--kv-bits, --max-context, --max-batch) in a two-stage pipeline. Stage 1 iterates a curated largest-first candidate model list and STOPS on the first model that passes the feasibility gate; Stage 2 hill-climbs the knobs WITHIN that chosen model. This is the explicit rejection of the PR #103 Python prototype's objective — its cartesian max-tps cell-search would push every capable Mac to serve the smallest model (the buyer could call any cloud API for that), destroying the network value. Throughput is the tiebreaker among knob settings, NOT the cross-model objective.

A 16 GB Mac that can host a 7B SHOULD serve a 7B even though the 1B is faster on the same hardware. SPEC-013 is what encodes that.
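
The difference between the two objectives can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the spec's implementation: `Candidate`, `pick_biggest_fit`, `pick_max_tps`, and the `fits` predicate are hypothetical names, the feasibility gate is reduced to a boolean, and the throughput numbers are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str   # hypothetical label such as "7B"
    tps: float  # made-up throughput, used only by the rejected objective


def pick_biggest_fit(
    candidates: Sequence[Candidate],
    fits: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Stage 1 sketch: walk the list in the given order, stop on the first fit."""
    for candidate in candidates:
        if fits(candidate):
            return candidate  # STOP: later (smaller) candidates are never probed
    return None


def pick_max_tps(
    candidates: Sequence[Candidate],
    fits: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """The rejected cross-model objective: the fastest feasible model wins."""
    feasible = [c for c in candidates if fits(c)]
    return max(feasible, key=lambda c: c.tps) if feasible else None


# Assumed scenario: a 16 GB Mac where 32B does not fit but 7B and 1B do.
curated = [Candidate("32B", 5.0), Candidate("7B", 30.0), Candidate("1B", 120.0)]


def fits_on_16gb(candidate: Candidate) -> bool:
    return candidate.name != "32B"


biggest_fit = pick_biggest_fit(curated, fits_on_16gb)
max_tps = pick_max_tps(curated, fits_on_16gb)
print(biggest_fit.name, max_tps.name)
```

Because the first function never reorders its input, an operator-supplied list is honored verbatim: passing `1B` ahead of `32B` on a machine where both fit returns `1B`, which is the behavior AC-17 guards.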

What this is NOT

  • Not implementation. SPEC-only. No Swift code, no Python code, no tests, no deploy. Implementation choice between Option A (Swift-native subcommand) vs Option B (Python-wrapper around the prototype) is explicitly deferred to the implementing PR per §10.
  • Not merge-ready. This PR opens as DRAFT for spec review. The audit cycle has converged (3 codex rounds, see below), but reviewers should still pass on the locked v0.3 before any implementation PR begins.
  • Not a coordinator change. SPEC-002 v1.3.5 / SPEC-006 v0.8.3 are untouched; no wire protocol changes; no buyer-facing behavior changes; no billing implications. With autotune unused, provider behavior is byte-identical to pre-SPEC-013.

Audit cycle (codex, 3 rounds)

Per the project's house pattern, the SPEC went through a codex audit loop before opening this PR. Full report at specs/SPEC-013-audit.md.

| Round | Verdict | Closure |
|---|---|---|
| 1 (on v0.1) | 0 CRITICAL / 7 MAJOR / 11 MINOR / 2 QUESTION | v0.2 closes 17 of 19 (1 MINOR deferred to post-lock checklist) |
| 2 (on v0.2) | LOCK READY + 1 MAJOR new + 3 MINOR new | v0.3 closes all 4 |
| 3 (on v0.3) | LOCK + 0 MAJOR / 1 trivial MINOR | folded into v0.3 |

Key round-1 closures were code-grounded fixes that would have broken on day one of implementation: the --apply config keys (max_context_override / max_concurrency_override per Config.swift:239), the launchd label (live.streamvc.macprovider per SPEC-003 v0.9.2 §FR-C5), and the tune_trials.stage SQLite migration (ALTER TABLE ... ADD COLUMN ... INTEGER NOT NULL DEFAULT 1).
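
As a rough check of the migration statement, the sketch below applies it to an in-memory SQLite database using Python's standard `sqlite3` module. The pre-migration `tune_trials` columns (`id`, `model`, `tps`) and the `has_column` idempotency guard are assumptions for illustration; only the `ALTER TABLE` statement itself comes from the spec.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical pre-migration table; the real tune_trials columns are not shown in this PR.
conn.execute("CREATE TABLE tune_trials (id INTEGER PRIMARY KEY, model TEXT, tps REAL)")
conn.execute("INSERT INTO tune_trials (model, tps) VALUES ('7B', 30.0)")


def has_column(db: sqlite3.Connection, table: str, column: str) -> bool:
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return any(row[1] == column for row in db.execute(f"PRAGMA table_info({table})"))


def migrate(db: sqlite3.Connection) -> None:
    # Assumed guard so that re-running the migration is a no-op.
    if not has_column(db, "tune_trials", "stage"):
        db.execute("ALTER TABLE tune_trials ADD COLUMN stage INTEGER NOT NULL DEFAULT 1")


migrate(conn)
migrate(conn)  # second call does nothing

# New inserts set the stage explicitly, as the spec requires.
conn.execute("INSERT INTO tune_trials (model, tps, stage) VALUES ('7B', 31.5, 2)")
stages = [row[0] for row in conn.execute("SELECT stage FROM tune_trials ORDER BY id")]
print(stages)  # [1, 2]
```

The `DEFAULT 1` matters here: SQLite rejects `ADD COLUMN ... NOT NULL` without a non-NULL default, and the default is what backfills rows written before the migration.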

What's in the SPEC

  • §3 architecture: two-stage pipeline with provider-lifecycle invariants (one provider at a time, --no-join so candidates never enter the coordinator pool)
  • §5 functional requirements: FR-A.1–4 (Stage 1), FR-B.1–3 (Stage 2), FR-C.1–3 (default curated candidate list), FR-D.1–2 (pre-warm contract with Shape A vs Shape B implementation choice + integrity-vs-transient failure split), FR-E.1–2 (provider-conflict safety, launchd drain), FR-F.1–3 (recommendation surface + JSON schema + opt-in --apply), FR-G.1–2 (tune_trials + tune_runs SQLite schema), FR-H.1–4 (failure modes)
  • §6 NFRs: wall-clock budget per RAM tier (~20 min on 8 GB → ~95 min on 64 GB+), single-slot resource posture, reversibility via collision-safe backups, nothing leaves the machine except FR-D.1 HuggingFace pre-warm fetches
  • §8 acceptance criteria: 19 ACs including AC-17 — the load-bearing "operator-supplied order is honored verbatim, no internal rerank" guard
  • §9 open questions: 5 OQs flagged for the in-flight air5 n=3 replication run (TPS_TIE_EPSILON default, stage2_replicates, kv-bits axis vs default, stage1_replicates stability, OQ-E thermal/cell-order bias). Each names a measurable decision threshold.
  • §11 out of scope for v1 / queued for v2: coordinator-served candidate list, recipe attestation, sticky-affinity from recipes, warm-swap-driven tuning. SPEC-014 provisionally reserved for the coordinator-served recommended catalog (cross-spec renumber per SPEC-011 §8).
  • §12 prototype migration note: what survives from PR #103 (provider lifecycle, _is_new_best knob tiebreak, tune_trials schema, signal handling) and what changes (the objective itself).

Open questions (pending air5 n=3 replication run)

Each has a quantitative decision threshold; v0.4 either confirms placeholders or adjusts via a narrow PR with the air5 data attached:

  • OQ-A: TPS_TIE_EPSILON default, given measured σ(tps)/μ(tps)
  • OQ-B: stage2_replicates (minimum discriminable tps gap ≤ TPS_TIE_EPSILON × median)
  • OQ-C: whether kv-bits stays a search axis or becomes a fixed default
  • OQ-D: whether Stage 1 fit-determination needs N>1 replicates (false-fit/false-reject rate > 5% → escalate)
  • OQ-E: thermal/cell-order bias (10 paired forward/reverse runs; mismatch_pairs/10 > 0.05 → randomize/cooldown)
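
Two of these thresholds reduce to simple arithmetic, sketched below. The function names are hypothetical, and the sketch assumes a pair counts as a mismatch when its forward and reverse runs disagree; the spec's exact definition of a mismatch is not reproduced in this PR.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(tps_samples: Sequence[float]) -> float:
    """sigma(tps) / mu(tps), the quantity OQ-A is decided against (needs >= 2 samples)."""
    return stdev(tps_samples) / mean(tps_samples)


def oq_e_triggered(pair_agrees: Sequence[bool], threshold: float = 0.05) -> bool:
    """True when the share of mismatched forward/reverse pairs exceeds the threshold."""
    mismatch_pairs = sum(1 for agrees in pair_agrees if not agrees)
    return mismatch_pairs / len(pair_agrees) > threshold


all_agree = [True] * 10
one_mismatch = [True] * 9 + [False]
print(oq_e_triggered(all_agree), oq_e_triggered(one_mismatch))
```

With exactly 10 pairs, a single mismatch gives 1/10 = 0.1, which already exceeds 0.05, so the OQ-E trigger is in effect "any mismatch at all".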

Test plan

  • Audit cycle converged (3 codex rounds, LOCK verdict)
  • All FR/AC cross-references resolve within the document
  • Spec text grounded against actual code: Config.swift YAML keys, live.streamvc.macprovider launchd label, ModelRuntime.configuration(for:) HF cache + online-fallback path, beta/autotune.py provider lifecycle + _is_new_best
  • Spec review: human reviewer reads v0.3 end-to-end and approves LOCK (or files round-4 findings)
  • DO NOT MERGE until spec review is complete
  • No implementation work begins until this PR merges and the §11 post-lock checklist runs (decision-log entry, SPEC-003 install-flow note, PR #103 disposition)

Related

  • PR #105 — the serving knobs (--kv-bits, --max-context, --max-batch) that autotune wraps
  • PR #103 — the Python prototype; SPEC-013 §12 documents what survives and what's rejected. Disposition (close vs rebase) follows the implementation choice (Option A vs Option B).

🤖 Generated with Claude Code

Augustas11 and others added 5 commits June 18, 2026 14:06
Initial draft of the autotune subcommand spec + the round-1 codex
audit prompt. NOT for merge — this commit lives on the feature
branch only and the PR is held until the codex audit loop converges.

SPEC-013 wraps the PR #105 serve flags (--kv-bits, --max-context,
--max-batch) in a two-stage pipeline that encodes the "biggest-fit,
not max-tps" product strategy. Stage 1 iterates a curated
largest-first candidate list and STOPS on the first model that
passes the feasibility gate; Stage 2 hill-climbs knobs WITHIN the
chosen model. This is the load-bearing departure from the PR #103
Python prototype (whose cartesian max-tps loop would push every
capable Mac to serve the smallest model).

Four numerical defaults (TPS_TIE_EPSILON, stage1_replicates,
stage2_replicates, kv-bits axis-vs-default) are flagged as Open
Questions pending the in-flight air5 n=3 replication run; v0.2
either confirms placeholders or sends a narrow PR adjusting them.

Files:
- specs/SPEC-013-cli-autotune.md (new, v0.1 draft)
- specs/AUDIT_SPEC_013_PROMPT.md (new, round-1 codex audit prompt)
- specs/README.md (+1 row in the index table)

Next step: fire AUDIT_SPEC_013_PROMPT.md at codex, address findings
in v0.2, re-audit, loop until 0 CRITICAL / 0 MAJOR, then push + PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Round-1 codex audit (specs/SPEC-013-audit.md) returned 0 CRITICAL
/ 7 MAJOR / 11 MINOR / 2 QUESTION on v0.1, with verdict "not ready
to lock as drafted." v0.2 closes all 7 MAJORs, 10 of 11 MINORs,
and both QUESTIONs. The product framing (biggest-fit, not max-tps)
and two-stage architecture are unchanged — round 1 explicitly
preserved both.

MAJORs closed:
- A.1 fallback contradiction: replaced metrics-bearing `fallbacks`
  with NAME-ONLY `alternates` (the STOP-on-first-feasible rule
  meant smaller candidates were never probed; v0.1's fallback
  metrics were structurally impossible).
- D.1 `models pull` precondition was bigger than admitted: FR-D
  reframed as "weights cache-warm before probe; load-fetch
  latency excluded from gate-ttft-ms" with Shape A (explicit
  pull) vs Shape B (rely on runtime online-fallback + isolate
  measurement) implementation choice. No longer depends on a
  not-yet-existing subcommand.
- E.1 launchd label wrong: `com.macprovider.cli` →
  `live.streamvc.macprovider` (matching SPEC-003 v0.9.2 §FR-C5,
  install.sh, plist template). Drain sequence bound to
  `launchctl bootout/bootstrap gui/$UID/...`.
- F.1 `--apply` wrote wrong YAML keys: `max_context_tokens` /
  `max_batch` were the CLI flag names; actual YAML keys per
  Config.swift:239-241 are `max_context_override` /
  `max_concurrency_override`. JSON `knobs` object now uses YAML
  key names for round-trip into config.yaml; `serve_command`
  retains CLI flag names for shell paste.
- F.2 recipe_hash not deterministic: pinned to
  `sha256:<64-lowercase-hex>` + RFC 8785 JCS canonicalization +
  explicit hash input domain enumeration (machine + inputs +
  recommendation.model + recommendation.knobs; excludes run_id,
  timestamps, observed metrics).
- G.1 SQLite migration invalid: `ALTER TABLE tune_trials ADD
  COLUMN stage INTEGER NOT NULL DEFAULT 1` spelled out; new
  inserts MUST set stage=1 or stage=2 explicitly.
- J.1 no AC for operator-supplied order: added AC-17 with
  `--candidate-models 1B,32B` on a Mac where both fit — must
  pick 1B because operator order is the contract.
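
The F.2 closure can be illustrated with a small Python sketch. It is an approximation under stated assumptions: `json.dumps` with sorted keys and compact separators matches RFC 8785 JCS only for simple inputs (ASCII keys, string and integer values); full JCS also pins number serialization and key ordering by UTF-16 code units. The `recipe_hash` function name, the nesting of the hash input, and all sample values are hypothetical.

```python
import hashlib
import json
from typing import Any, Mapping


def recipe_hash(
    machine: Mapping[str, Any],
    inputs: Mapping[str, Any],
    model: str,
    knobs: Mapping[str, Any],
) -> str:
    # Hash input domain per the commit message: machine + inputs +
    # recommendation.model + recommendation.knobs. run_id, timestamps and
    # observed metrics are deliberately not parameters of this function.
    hash_input = {
        "machine": dict(machine),
        "inputs": dict(inputs),
        "recommendation": {"model": model, "knobs": dict(knobs)},
    }
    # Approximation of JCS canonical form: sorted keys, no insignificant whitespace.
    canonical = json.dumps(
        hash_input, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


# Sample values are invented; only the two YAML key names come from the PR.
first = recipe_hash(
    {"ram_gb": 16, "chip": "example-chip"},
    {"target_context": 4096},
    "example-7b",
    {"max_context_override": 8192, "max_concurrency_override": 4},
)
second = recipe_hash(
    {"chip": "example-chip", "ram_gb": 16},  # same content, different key order
    {"target_context": 4096},
    "example-7b",
    {"max_concurrency_override": 4, "max_context_override": 8192},
)
print(first == second)
```

Key order in the caller's dictionaries does not affect the result, which is the property the canonicalization step exists to provide.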

MINORs closed (10): B.1 (max-context-axis semantics), C.1 (CLI
summary kv-bits default), F.3 (backup naming collision-safe),
G.2 (transactional retention), H.1 (--resume removed from §7),
J.2 (AC-18 new), J.3 (AC-19 new + exit_reason enum),
K.1 (OQ-B/OQ-D quantitative thresholds), L.1 (prototype migration
note), M.1 (cross-spec renumber to SPEC-014).

QUESTIONs resolved (2): D.2 (signature vs network failure now
asymmetric — integrity aborts whole run, transient advances),
K.2 (added OQ-E flagging thermal/order bias with quantitative
threshold).

Deferred to post-lock: M.2 documentation checklist (decision-log
entry, SPEC-003 install note, PR #103 disposition) — captured as
a §11 checklist but not in the binding contract.

Files:
- specs/SPEC-013-cli-autotune.md (v0.1 → v0.2; +566 lines)
- specs/SPEC-013-audit.md (NEW, codex round-1 output)
- specs/AUDIT_SPEC_013_V0_2_PROMPT.md (NEW, round-2 audit prompt)

Next step: fire AUDIT_SPEC_013_V0_2_PROMPT.md at codex for the
round-2 closure check, address any new findings, repeat until
LOCK READY, then push + PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Round-2 codex audit (specs/SPEC-013-audit.md § Round 2) returned
LOCK READY with 17 CLOSED / 1 PARTIAL / 1 OVER-CLOSED on the
round-1 findings, plus 1 MAJOR new + 3 MINOR new. Codex
recommended a narrow v0.3 closing the 4 new findings before
implementation. v0.3 closes all 4. No architecture change.

Round-2 closures:
- N-D.1 (MAJOR) Shape B vs models-pull-only wording: v0.2's FR-D
  rewrite permitted Shape B (rely on runtime online-fallback +
  measurement isolation) but NFR-4's egress exception and AC-8
  still spoke only of `models pull`. v0.3 rewords NFR-4 to
  admit both Shape A and Shape B HuggingFace pre-warm paths;
  AC-8 is now shape-neutral with explicit Shape A (mocked pull
  exit non-zero) and Shape B (block egress + runtime fallback
  fails during load) variants. A new sub-variant explicitly
  tests the FR-D.2 integrity-class abort path.
- Z-B.1 (PARTIAL → CLOSED) `--max-context-axis` parse rules: v0.2
  put the parse rules in non-normative §7. v0.3 lifts them into
  FR-B.1 as a normative paragraph (absolute caps, sorted
  ascending after parse, each cell >= --target-context,
  flag-parse-time rejection with exit_reason='config_error',
  duplicate rejection, empty-axis = single-cell). The §7 /
  §5 conflict-resolution rule is now stated explicitly.
- N-OQ-E.1 (MINOR) thermal/order threshold lacked sampling
  protocol: v0.3 adds a 10-paired-runs forward/reverse protocol
  with 60s inter-pair idle, mismatch_pairs/10 > 0.05 trigger
  threshold. Operators can close OQ-E without relitigating
  methodology.
- O.1 (MINOR) residual v0.1-era wording drift: v0.3 closes four
  discrete sites — `tune_runs.spec_version` SQL comment,
  FR-H.2 "v0.1 normative contract" prose, NFR-3 stale
  `.bak-<unix-ts>` pattern, and §7's "MAY change in v0.2"
  disclaimer.
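
The Z-B.1 parse rules listed above could look roughly like the following Python sketch. Several details are assumptions rather than spec text: the comma-separated input format, the `ConfigError` and `parse_max_context_axis` names, and the reading of "empty-axis = single-cell" as a single cell equal to the target context.

```python
from typing import List


class ConfigError(ValueError):
    """Raised at flag-parse time; carries the exit_reason named in the commit message."""

    exit_reason = "config_error"


def parse_max_context_axis(raw: str, target_context: int) -> List[int]:
    # Assumption: an empty axis collapses to one cell at the target context.
    if not raw.strip():
        return [target_context]

    cells: List[int] = []
    for token in raw.split(","):
        try:
            cells.append(int(token.strip()))  # absolute caps, not multipliers
        except ValueError:
            raise ConfigError(f"not an integer: {token.strip()!r}") from None

    if len(set(cells)) != len(cells):
        raise ConfigError("duplicate cells in --max-context-axis")

    too_small = [cell for cell in cells if cell < target_context]
    if too_small:
        raise ConfigError(f"cells below --target-context: {too_small}")

    return sorted(cells)  # sorted ascending after parse


print(parse_max_context_axis("8192,4096", target_context=4096))  # [4096, 8192]
```

Rejecting bad input before any provider is started keeps a malformed axis from consuming part of the run's wall-clock budget.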

Files:
- specs/SPEC-013-cli-autotune.md (v0.2 → v0.3 LOCK candidate)
- specs/SPEC-013-audit.md (codex round-2 output landed)
- specs/AUDIT_SPEC_013_V0_3_PROMPT.md (NEW, narrow round-3
  closure-confirmation audit prompt)

Next step: fire AUDIT_SPEC_013_V0_3_PROMPT.md at codex for the
round-3 LOCK-confirmation check. Expected outcome: LOCK with
0 new findings or ≤1 MINOR. If LOCK, push the branch and open
the DRAFT PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…fold-in

Round-3 codex audit (specs/SPEC-013-audit.md § Round 3) returned
LOCK with 4 CLOSED / 0 PARTIAL across the round-2 findings, plus
0 CRITICAL anti-regression / 0 MAJOR new / 1 MINOR new.

The single round-3 MINOR (O-V03.1) is editorial — FR-F.2's JSON
example still showed "SPEC-013 v0.2" inside a v0.3 document.
Codex explicitly said this does not block LOCK (the adjacent SQL
comment already stated writers emit their own producing version),
but recommended folding the fix in before implementation. This
commit folds it: the JSON example now uses
"SPEC-013 v<producing-version>" as a placeholder, and the
spec_version bullet teaches the rule.

Round-3 closures (from specs/SPEC-013-audit.md § Round 3):
- N-D.1 CLOSED: NFR-4 admits both Shape A (`models pull` or
  equivalent) and Shape B (runtime online fallback during model
  load) HuggingFace pre-warm paths; carve-out scoped to autotune
  runs and weight fetches; AC-8 shape-neutral with explicit
  Shape A + Shape B + integrity-class variants.
- Z-B.1 CLOSED: `--max-context-axis` parse contract lifted from
  non-normative §7 into binding FR-B.1 (absolute caps, sorted
  ascending, ≥ target-context, flag-parse-time rejection with
  exit_reason='config_error', duplicate rejection, empty-axis
  = single-cell); §7 vs §5 conflict-resolution rule explicit.
- N-OQ-E.1 CLOSED: OQ-E thermal/order threshold has a measurable
  10-paired-runs forward/reverse sampling protocol on air5 with
  60s inter-pair idle and the mismatch_pairs/10 > 0.05 trigger.
- O.1 CLOSED: all 4 named drift sites updated (tune_runs
  SQL comment, FR-H.2 prose, NFR-3 backup pattern, §7
  disclaimer).

Specs index updated: specs/README.md row for SPEC-013 now reads
v0.3.

Files:
- specs/SPEC-013-cli-autotune.md (O-V03.1 editorial fold-in)
- specs/SPEC-013-audit.md (codex round-3 LOCK verdict landed)
- specs/README.md (SPEC-013 row → v0.3)

Audit cycle complete after 3 codex rounds: v0.1 → v0.2 (7 MAJOR
+ 10 MINOR + 2 QUESTION closed) → v0.3 (1 MAJOR + 3 MINOR closed
from round 2) → LOCK. Next step: push branch + open DRAFT PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the operator-paste BUILD prompt that a fresh Codex CLI
session uses to implement SPEC-013 v0.3 against the existing
phase3-binary/ Swift package.

The BUILD prompt picks Option A (Swift-native subcommand inside
macprovider-cli) per SPEC-013 §10, rationale captured inline:
single-binary install consistency with SPEC-003, drain semantics
match existing patterns (UninstallCommand / SelfUpdate), future
SPEC-011 warm-swap integration needs Swift-native. Shape A vs
Shape B for FR-D pre-warm is left as the implementer's call —
the binding contract is FR-D.1's measurement-isolation
requirement, not the mechanism.

The 11-step build sequence:
1. AutotuneCommand subcommand scaffolding + --dry-run
2. --no-join flag on ServeCommand (FR-E.2 precondition)
3. SQLite schema + DB layer (tune_trials + tune_runs +
   migration + transactional retention)
4. Provider lifecycle (start/stop/wait-ready, single-provider
   invariant)
5. Provider-conflict pre-flight (FR-E.1, launchd bootout/bootstrap)
6. Pre-warm (FR-D Shape A or Shape B, integrity vs transient
   classification per FR-D.2)
7. Stage 1 — feasibility iteration (FR-A, STOP-on-first-feasible,
   AC-17 biggest-fit guard)
8. Stage 2 — knob hill-climb (FR-B, _is_new_best verbatim from
   prototype)
9. Recommendation surface (FR-F: terminal block + JSON schema +
   RFC 8785 JCS recipe_hash + --apply atomic write)
10. Failure modes + signal handling (FR-H, exit_reason enum)
11. Acceptance test suite (AC-1 through AC-19)

Branch strategy: build work happens on feat/cli-autotune-impl
stacked off spec/cli-autotune-v1, so the implementing PR rebases
cleanly onto main after the SPEC PR (#108) merges.

Hard rules: do not modify any file under specs/ from the build
branch (it's downstream of the SPEC PR); do not touch
beta/coordinator/gateway; do not pivot to Option B without an
Open Question; preserve the biggest-fit objective (AC-17 catches
this if forgotten).

The prompt is self-contained: severity definitions, required
reading, step sequence, acceptance gate, hard rules, anti-rules,
operator checkpoint cadence, and an Open Question template.

Files:
- specs/BUILD_SPEC_013_PROMPT.md (NEW)

Next step: branch feat/cli-autotune-impl off spec/cli-autotune-v1
and fire BUILD_SPEC_013_PROMPT.md at codex via omc ask. Expected
wall-clock: 1-2 weeks of session work for one new subcommand
inside an existing binary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Augustas11 Augustas11 marked this pull request as ready for review June 18, 2026 17:29
@Augustas11 Augustas11 merged commit e2ebcf1 into main Jun 18, 2026
5 checks passed
@Augustas11 Augustas11 deleted the spec/cli-autotune-v1 branch June 18, 2026 17:30