spike: provider-side model-selection autotune loop (autoresearch)#103

Closed
Augustas11 wants to merge 7 commits into main from spike/provider-model-autotune

@Augustas11

Owner

What

First real autoresearch loop for MacProvider: a keep/revert hill-climb over the model dimension (size × quant) that discovers the optimal servable model for a given Mac.

Not a benchmark — a genuine optimization loop: propose config → load via macprovider-cli serve --model X → measure tok/s + TTFT at a target context → fit gate (errors / OOM / TTFT > 60s ⇒ infeasible) → keep best feasible → log every trial → declare winner.
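The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in beta/autotune.py: the `Trial` dataclass, the `fits` and `hill_climb` names, and the injected `measure` callable (standing in for "serve the model, fire the workload, collect tok/s and TTFT") are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

GATE_TTFT_MS = 60_000  # the 60s TTFT gate described above


@dataclass
class Trial:
    model: str
    ctx: int
    tps: Optional[float] = None      # None when no measurement came back
    ttft_ms: Optional[float] = None
    error: bool = False              # request error or OOM during the run


def fits(trial: Trial, gate_ttft_ms: float = GATE_TTFT_MS) -> bool:
    """Fit gate: errors, missing metrics, or TTFT over the gate are infeasible."""
    if trial.error or trial.tps is None or trial.ttft_ms is None:
        return False
    return trial.ttft_ms <= gate_ttft_ms


def hill_climb(
    candidates: Iterable[Tuple[str, int]],
    measure: Callable[[str, int], Trial],
) -> Tuple[Optional[Trial], List[Tuple[Trial, bool, bool]]]:
    """Propose, measure, gate, keep the best feasible trial, log every trial."""
    best: Optional[Trial] = None
    log: List[Tuple[Trial, bool, bool]] = []
    for model, ctx in candidates:
        trial = measure(model, ctx)  # hypothetical: one provider served per trial
        feasible = fits(trial)
        kept = feasible and (best is None or trial.tps > best.tps)
        if kept:
            best = trial
        log.append((trial, feasible, kept))  # one log entry per trial
    return best, log
```

Because `measure` is injected, the keep/revert logic can be exercised against recorded numbers without loading a model.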

First hill-climb (8GB M1 Air)

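| # | model | ctx | tps | ttft_ms | fits | kept |
|---|-------|-----|-----|---------|------|------|
| 1 | Llama-3.2-1B-4bit | 2000 | 9.7 | 2380 | | NEW BEST |
| 2 | Llama-3.2-1B-4bit | 8000 | 2.4 | 11268 | | |
| 3 | Llama-3.2-3B-4bit | 2000 | 3.1 | 9960 | | |
| 4 | Llama-3.2-3B-4bit | 8000 | 0.4 | 94604 | ❌ infeasible | |
| 5 | Phi-3.5-mini-4bit | 2000 | 0.5 | 83662 | ❌ infeasible | |
| 6 | Phi-3.5-mini-4bit | 8000 | | | ❌ infeasible | |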

Winner on 8GB: Llama-3.2-1B-Instruct-4bit @ ctx=2000 → 9.7 tok/s. Fit gate rejected 3/6 candidates as expected.

Files

  • beta/autotune.py — the loop CLI. Reuses harness.fire_stream + sweep.build_padded_prompt/aggregate_cell unchanged.
  • New additive tune_trials SQLite table (existing tables untouched).
  • HTML report at beta/reports/autotune-<run_id>.html.

Status

Spike / draft — designed to be machine-agnostic so the next run can target a roomier Mac (air5: Qwen-Coder-7B @ 50k context) where the real per-hardware recipe surfaces. Knob exposure (KV-bits / batch / max-context as serve flags in the Swift binary) is the natural follow-up to widen the search space beyond model choice.

🤖 Generated with Claude Code

Augustas11 added a commit that referenced this pull request Jun 18, 2026
…knobs for autoresearch (#105)

Widens the autoresearch search space for beta/autotune.py (PR #103)
beyond just --model. All three knobs are real downstream wiring into
mlx-swift 2.29.1, not just CLI cosmetics.

- --kv-bits {4,8}: forwarded to MLXLMCommon.GenerateParameters.kvBits
  (both complete + stream call sites); preflight rejects anything but 4/8.
- --max-context <N>: extends the existing per-tier maxContextTokens
  cap; tokens are still rejected at the existing context_length_exceeded
  413 boundary, and we additionally pass maxKVSize=maxContextTokens to
  GenerateParameters so the KV cache (RotatingKVCache) honors the cap.
- --max-batch <N> (default 1, prior single-slot behavior preserved):
  lifts the previously-hardcoded AsyncSemaphore(value: 1) inside
  ModelRuntime to be configurable. Reuses the existing
  maxConcurrencyOverride config field that was already plumbed from
  YAML/env but never wired to the CLI or runtime.

All knobs are triple-exposed (CLI > env > YAML > default), matching the
house convention. Preflight (runServingKnobsPreflight) fails loud at
serve start instead of mid-inference on invalid values.
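To illustrate the precedence and fail-loud preflight described above, here is a small Python sketch. The real implementation lives in the Swift binary (runServingKnobsPreflight); the function names, the dictionary-per-layer representation, and the positivity checks on max-context and max-batch are assumptions for the example. Only the CLI > env > YAML > default order and the 4/8 restriction on kv-bits come from the commit message.

```python
from typing import Mapping, Optional

ALLOWED_KV_BITS = (4, 8)  # from the commit message: anything else is rejected


def resolve_knob(
    name: str,
    cli: Mapping[str, object],
    env: Mapping[str, object],
    yaml_cfg: Mapping[str, object],
    default: Optional[int],
) -> Optional[int]:
    """Return the first value that is set, in CLI > env > YAML > default order."""
    for layer in (cli, env, yaml_cfg):
        value = layer.get(name)
        if value is not None:
            return int(value)  # env values arrive as strings
    return default


def preflight(kv_bits: Optional[int], max_context: Optional[int], max_batch: int) -> None:
    """Fail at serve start rather than mid-inference on invalid knob values."""
    if kv_bits is not None and kv_bits not in ALLOWED_KV_BITS:
        raise ValueError(f"--kv-bits must be one of {ALLOWED_KV_BITS}, got {kv_bits}")
    # Assumption: the remaining knobs only need to be positive integers.
    if max_context is not None and max_context <= 0:
        raise ValueError(f"--max-context must be positive, got {max_context}")
    if max_batch < 1:
        raise ValueError(f"--max-batch must be >= 1, got {max_batch}")
```

Resolving every knob first and validating once keeps the error at startup, which is the behavior the commit message calls "fails loud".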

A bug fix is folded in: ServeCommand.run() was hardcoding
maxConcurrencyOverride: 1 when building ProviderCapacity, silently
ignoring the resolved config. The capacity now reflects --max-batch.

Tests: ServingKnobsConfigTests.swift adds 21 cases covering
config-resolution precedence (CLI > env > YAML), defaults preserved,
preflight rejection of invalid values, runtime threading of all three
knobs, and a regression on the context_length_exceeded gate. Total
suite: 219 -> 240, all passing.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Augustas11 and others added 5 commits June 18, 2026 09:51
Adds three files for the Phase 2 buyer harness context/concurrency sweep:

- beta/sweep.py: CLI grid sweep over (context_target, concurrency) cells.
  Imports fire_stream from harness.py (SSE parser reuse). Writes results
  to new sweep_runs SQLite table. Gate: feasible = n_err==0 AND
  ttft_p95<=gate AND no stop_token_leak. Flags: --dry-run, --base-url
  (required, no remote default), --contexts, --concurrency overrides,
  --decode-control second pass, --gate-ttft-ms.

- beta/sweep_report.py: Reads sweep_runs for a sweep_id (or latest) and
  renders a self-contained HTML heatmap (green/red cells). Matches
  report.py single-file-HTML style. Drops into reports_dir.

- beta/mock_llm_server.py: Local SSE stub on port 18080 serving
  /v1/chat/completions. No remote traffic. Supports --error-rate flag
  to exercise the red/fail gate path. Used only for smoke-tests.

Smoke-tested: dry-run prints 28 cells; 4-cell real sweep (contexts
1000,2000 x conc 1,2) against local mock shows feasible=1 with
populated tps/ttft; error-rate=1.0 run confirms feasible=0 red path;
sweep_report.py renders correct green/red heatmap HTML.
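The cell gate quoted above (feasible = n_err==0 AND ttft_p95<=gate AND no stop_token_leak) might look like the sketch below. The `RequestResult` shape and the nearest-rank percentile method are assumptions; the commit message does not say how sweep.py computes p95.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestResult:
    ok: bool
    ttft_ms: Optional[float] = None
    stop_token_leak: bool = False


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile (an assumed method, not confirmed by the source)."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def cell_feasible(results: List[RequestResult], gate_ttft_ms: float) -> bool:
    """A cell is green only with zero errors, no leaked stop tokens, and p95 TTFT in gate."""
    n_err = sum(1 for r in results if not r.ok)
    if n_err > 0:
        return False
    if any(r.stop_token_leak for r in results):
        return False
    ttfts = [r.ttft_ms for r in results if r.ttft_ms is not None]
    if not ttfts:
        return False  # nothing measured, so nothing to certify
    return p95(ttfts) <= gate_ttft_ms
```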

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Halts the sweep the moment a cell returns request errors (n_err > 0) —
on a memory-constrained node that almost always means OOM, and the
ctx-major grid would otherwise re-slam the box with every heavier cell.
TTFT-gate-only failures (slow, no errors) do not stop the sweep. Fixes
the end-of-run summary to report attempted (not total) cells when the
sweep aborts early.
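A minimal sketch of that early-stop rule, assuming a hypothetical `run_cell` callable that returns a dict with `n_err` and `feasible` keys; the actual result shape in sweep.py is not shown in this thread.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Cell = Tuple[int, int]  # (context_target, concurrency), iterated ctx-major


def run_sweep(
    cells: Sequence[Cell],
    run_cell: Callable[[Cell], Dict[str, object]],
) -> Tuple[List[Tuple[Cell, Dict[str, object]]], bool]:
    """Run cells in order; halt on the first cell that reports request errors."""
    attempted: List[Tuple[Cell, Dict[str, object]]] = []
    aborted = False
    for cell in cells:
        result = run_cell(cell)
        attempted.append((cell, result))
        if result["n_err"] > 0:  # likely OOM on a constrained node: stop here
            aborted = True
            break
        # A slow-but-error-free cell (TTFT gate miss) does not stop the sweep.
    return attempted, aborted
```

The summary can then report `len(attempted)` rather than `len(cells)` when `aborted` is true.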

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- harness.fire_stream/fire_nonstream: optional headers= (backward-compatible)
- sweep.py: --api-key/--api-key-file (Bearer for the gateway leg; reads
  ~/.config/macprovider/buyer-api-key by default), --model override (also
  pins the provider via model-routing on the gateway path), --max-tokens
  override for fast runs on slow/constrained nodes. Local/direct runs send
  no auth. Verified header threading with a capture server.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add beta/autotune.py: a real provider-side optimization loop that
discovers the best servable model for a given Mac. Not a static
benchmark — it proposes each (model, context) config, measures agg
throughput_tps + ttft via a fixed workload, applies a feasibility gate
(request error / non-200 / OOM / TTFT-gate => fits=0), keeps the config
only if it beats the current best, logs one row per trial, and tracks
best-so-far. One provider served at a time (start -> wait-ready ->
fire -> pkill), never two at once.

Reuses harness.fire_stream (SSE metrics) and sweep.build_padded_prompt
+ sweep.aggregate_cell unchanged. New additive tune_trials SQLite table;
existing runs/adversarial_runs/sweep_runs untouched. Self-contained HTML
report mirrors sweep_report.py with the winner highlighted and the
best-so-far progression.

First hill-climb on the 8GB M1 Air (3 models x contexts 2000,8000):
WINNER = Llama-3.2-1B-Instruct-4bit @ 2000 (9.7 tok/s, ttft 2380ms).
Fit gate exercised: 1B fits at 8000 (2.4 tps) where 3B and Phi-3.5-mini
both fail at 8000; 3B@8000 and Phi@2000 completed but missed the 60s
TTFT gate (gated out). Provider stopped between every trial and at end.

Flags: --models --contexts --db-path --reports-dir --max-tokens
--ready-timeout --gate-ttft-ms --dry-run --report-only. Machine-agnostic.
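The one-provider-at-a-time lifecycle (start -> wait-ready -> fire -> stop) can be sketched as below. All four callables are injected and hypothetical; the prototype's own start_provider / wait_for_ready / stop_provider signatures are not shown in this thread, so treat the parameter lists here as assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

H = TypeVar("H")  # provider handle, e.g. a subprocess object
R = TypeVar("R")  # measurement result


def wait_for_ready(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def run_trial(
    start: Callable[[], H],
    is_ready: Callable[[H], bool],
    fire: Callable[[H], R],
    stop: Callable[[H], None],
    ready_timeout_s: float,
    interval_s: float = 0.5,
) -> Optional[R]:
    """Serve one provider, measure it, and always stop it before returning."""
    handle = start()
    try:
        if not wait_for_ready(lambda: is_ready(handle), ready_timeout_s, interval_s):
            return None  # never became ready: caller records the trial as infeasible
        return fire(handle)
    finally:
        stop(handle)  # runs on success, timeout, and exceptions alike
```

Putting the stop call in `finally` is what keeps the "never two at once" invariant intact when a trial crashes.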

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ch axes

PR #105 exposed --kv-bits / --max-context / --max-batch as macprovider-cli
serve flags. Wire them into autotune.py as three OPTIONAL search axes
(--kv-bits-options / --max-context-options / --max-batch-options). Each
defaults to [None], so omitting all three preserves the original
model x context candidate space exactly (the original 6-trial 8GB Air run
still produces an identical 6-trial plan). When set, candidates are the
full cartesian, with the chosen knobs passed through to start_provider
and recorded in three additively-migrated tune_trials columns (kv_bits,
max_context_cap, max_batch). Legacy rows keep NULL in the new columns;
existing reports remain readable.
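The "defaults to [None]" trick is what keeps the original plan unchanged: a single-element axis multiplies the candidate count by one. A sketch, with `build_candidates` and the dictionary keys chosen for illustration (the column names kv_bits, max_context_cap, and max_batch are from the commit message):

```python
from itertools import product
from typing import Dict, List, Optional, Sequence


def build_candidates(
    models: Sequence[str],
    contexts: Sequence[int],
    kv_bits_options: Sequence[Optional[int]] = (None,),
    max_context_options: Sequence[Optional[int]] = (None,),
    max_batch_options: Sequence[Optional[int]] = (None,),
) -> List[Dict[str, object]]:
    """Full cartesian product; an axis left at (None,) contributes exactly one value."""
    return [
        {"model": m, "ctx": c, "kv_bits": kv, "max_context_cap": mc, "max_batch": mb}
        for m, c, kv, mc, mb in product(
            models, contexts, kv_bits_options, max_context_options, max_batch_options
        )
    ]
```

With three models and two contexts and no knob axes set, this yields the same six candidates in model-major order; a `None` knob would simply mean "do not pass the flag".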

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Augustas11 Augustas11 force-pushed the spike/provider-model-autotune branch from 0014d91 to c13a91b Compare June 18, 2026 06:56
Augustas11 and others added 2 commits June 18, 2026 10:27
When a sweep cell failed, sweep_runs recorded n_err > 0 but notes was
NULL — the per-request HTTP status and error string from harness.py
were dropped on the floor. Caused a ctx=2000 production-gateway
misdiagnosis: a transient 503 provider_unavailable was misread as a
gateway streaming read-idle bug, with no way to confirm without
re-running.

aggregate_cell now collects up to 3 distinct (status, error[:80])
pairs from the per-request results, joins them into a ~200-char
summary, and exposes it via the existing notes column. notes stays
NULL on cells where every request succeeded.
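A sketch of that summarization step, assuming each per-request result is a dict with `ok`, `status`, and `error` keys and that pairs are joined with "; ". Both the result shape and the separator are assumptions; the limits (3 distinct pairs, 80 characters of error text, roughly 200 characters overall) are from the commit message.

```python
from typing import Dict, List, Optional, Tuple


def summarize_errors(
    results: List[Dict[str, object]],
    max_pairs: int = 3,
    err_chars: int = 80,
    max_len: int = 200,
) -> Optional[str]:
    """Collect distinct (status, error) pairs from failed requests into a short note."""
    seen: List[Tuple[object, str]] = []
    for r in results:
        if r.get("ok"):
            continue
        pair = (r.get("status"), str(r.get("error") or "")[:err_chars])
        if pair not in seen:
            seen.append(pair)
        if len(seen) == max_pairs:
            break
    if not seen:
        return None  # every request succeeded: notes stays NULL
    return "; ".join(f"{status}: {err}" for status, err in seen)[:max_len]
```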

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Two methodology fixes caught by the air5 24-trial hill-climb:

1. --replicates N (default 1, preserves single-shot behavior). When >1,
   fires N requests against ONE loaded provider per cell and aggregates
   by MEDIAN tps/ttft. The cell is feasible only if EVERY replicate is
   feasible — strict, befitting a 'recipe' meant to be applied as a
   recommendation. Provider is loaded once per cell and reused, so the
   extra cost is N-1 inferences (no extra model load). Recommended
   value: 3 when publishing a recipe (single-trial measurements drift
   10-15% from background CPU/GPU contention).

2. TTFT tiebreak in the keep-best decision (TPS_TIE_EPSILON = 2%). The
   old logic 'tps > best_tps' kept the FIRST trial in a tie band, even
   if a later trial had the same tps and a meaningfully better TTFT.
   Air5 hit this: 1B kv=8 mb=1 (10.9tps, 3.8s ttft) was kept over
   1B kv=8 mb=2 (10.9tps, 3.0s ttft). New _is_new_best() helper:
   strictly higher tps wins; within tie band, lower TTFT wins.
   Replaying air5's 24 trials through the new logic now picks the mb=2
   config (21% faster first-token).
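Both fixes can be sketched in a few lines. The commit message does not show how the tie band is computed, so the symmetric, relative band used here (within ±2% of the current best tps) is an assumption, as are the function signatures and the dict shape for replicates.

```python
from statistics import median
from typing import Dict, List, Optional

TPS_TIE_EPSILON = 0.02  # 2% tie band, per the commit message


def is_new_best(
    tps: float,
    ttft_ms: float,
    best_tps: Optional[float],
    best_ttft_ms: Optional[float],
    eps: float = TPS_TIE_EPSILON,
) -> bool:
    """Clearly higher tps wins; inside the tie band, lower TTFT wins."""
    if best_tps is None or best_ttft_ms is None:
        return True  # first feasible trial
    if tps > best_tps * (1 + eps):
        return True
    if tps >= best_tps * (1 - eps):  # assumed symmetric band around the best
        return ttft_ms < best_ttft_ms
    return False


def aggregate_replicates(replicates: List[Dict[str, float]]) -> Dict[str, object]:
    """Median tps/ttft across replicates; feasible only if every replicate fits."""
    return {
        "fits": all(r["fits"] for r in replicates),
        "tps": median(r["tps"] for r in replicates),
        "ttft_ms": median(r["ttft_ms"] for r in replicates),
    }
```

Under this sketch the air5 case above resolves as described: a 10.9 tps / 3.0 s trial replaces a 10.9 tps / 3.8 s incumbent, while the reverse order leaves the incumbent in place.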

Schema: additive replicates_n INTEGER column via the existing migration
mechanism. Existing rows keep NULL.
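One common way to do an additive, idempotent SQLite migration is to check `PRAGMA table_info` before issuing `ALTER TABLE ... ADD COLUMN`. The sketch below shows that pattern; it is not necessarily the mechanism autotune.py uses, and the three-column base table is a stand-in for the real tune_trials schema.

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add a nullable column only when absent; safe to call on every startup.

    Identifiers are interpolated directly, so pass trusted constants only.
    """
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema standing in for the real table.
conn.execute("CREATE TABLE tune_trials (id INTEGER PRIMARY KEY, model TEXT, tps REAL)")
conn.execute("INSERT INTO tune_trials (model, tps) VALUES ('legacy-model', 1.0)")

for col in ("kv_bits", "max_context_cap", "max_batch", "replicates_n"):
    add_column_if_missing(conn, "tune_trials", col, "INTEGER")

# The pre-existing row reads back NULL (None) in every new column.
legacy_row = conn.execute(
    "SELECT kv_bits, max_context_cap, max_batch, replicates_n FROM tune_trials"
).fetchone()
```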

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Augustas11

Owner Author

Closing without merge. This spike's objective (max-tps cartesian search over model × ctx × kv-bits × max-batch) is superseded by the v1 SPEC for macprovider-cli autotune, which adopts a biggest-fit-first objective (see .omc/prompts/spec-cli-autotune-v1.md in this repo; SPEC PR forthcoming). Two-stage search: model selection (largest-first iteration with feasibility gate) then knob hill-climb within the chosen model.

The spike branch stays accessible at spike/provider-model-autotune for code reference. The SPEC's implementing PR will reuse the durable bits:

  • Provider lifecycle (start_provider / stop_provider / wait_for_ready)
  • HF offline-mode handling (macprovider-cli doesn't auto-download — quirk discovered during runs)
  • tune_trials SQLite schema + additive migration pattern
  • _is_new_best() helper (still valid for stage-2 knob tiebreak under v1)
  • --replicates N median aggregation
  • HTML report rendering

Empirical data this spike produced (preserved in the SPEC's "Empirical findings" section):

  • 8GB MacBook Air 6-trial hill-climb (winner under old objective: 1B@2000)
  • air5 24-trial hill-climb at N=1 (originally claimed kv-bits=8 was universally better)
  • air5 24-trial hill-climb at N=3 (proved the kv-bits=8 finding was measurement noise; fit-gate determinations replicated 100%)

Key methodology lesson the spike surfaced: throughput measurements on mlx-swift have ≥20% trial-to-trial variance, dominating small knob-level deltas. Fit-gate determinations are stable. The v1 SPEC bakes this in (TPS_TIE_EPSILON raised to 10%, recommended publish-replicates N=5, no kv-bits prior).

@Augustas11 Augustas11 closed this Jun 18, 2026
Augustas11 added a commit that referenced this pull request Jun 18, 2026
* spec(cli): SPEC-013 v0.1 — macprovider-cli autotune subcommand

Initial draft of the autotune subcommand spec + the round-1 codex
audit prompt. NOT for merge — this commit lives on the feature
branch only and the PR is held until the codex audit loop converges.

SPEC-013 wraps the PR #105 serve flags (--kv-bits, --max-context,
--max-batch) in a two-stage pipeline that encodes the "biggest-fit,
not max-tps" product strategy. Stage 1 iterates a curated
largest-first candidate list and STOPS on the first model that
passes the feasibility gate; Stage 2 hill-climbs knobs WITHIN the
chosen model. This is the load-bearing departure from the PR #103
Python prototype (whose cartesian max-tps loop would push every
capable Mac to serve the smallest model).
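Stage 1 as described reduces to a first-match scan over an ordered list. A minimal sketch, with `select_biggest_fit` and the injected `probe` callable (load the model, apply the feasibility gate) as illustrative names; the model labels in the usage note are placeholders, not the spec's curated list.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def select_biggest_fit(
    candidates: Sequence[str],
    probe: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Walk candidates in the given order and STOP on the first feasible one.

    Returns (chosen model or None, list of models actually probed).
    Candidates after the chosen one are never probed.
    """
    probed: List[str] = []
    for model in candidates:
        probed.append(model)
        if probe(model):
            return model, probed
    return None, probed
```

For example, with a largest-first list `["32B", "7B", "1B"]` on a machine where only the two smaller models pass the gate, the scan returns `"7B"` after probing `"32B"` and `"7B"`, and `"1B"` is never loaded. A max-tps objective over the same list would tend to land on the smallest model, since smaller models generally decode faster on the same hardware.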

Four numerical defaults (TPS_TIE_EPSILON, stage1_replicates,
stage2_replicates, kv-bits axis-vs-default) are flagged as Open
Questions pending the in-flight air5 n=3 replication run; v0.2
either confirms placeholders or sends a narrow PR adjusting them.

Files:
- specs/SPEC-013-cli-autotune.md (new, v0.1 draft)
- specs/AUDIT_SPEC_013_PROMPT.md (new, round-1 codex audit prompt)
- specs/README.md (+1 row in the index table)

Next step: fire AUDIT_SPEC_013_PROMPT.md at codex, address findings
in v0.2, re-audit, loop until 0 CRITICAL / 0 MAJOR, then push + PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(cli): SPEC-013 v0.2 — round-1 codex audit response

Round-1 codex audit (specs/SPEC-013-audit.md) returned 0 CRITICAL
/ 7 MAJOR / 11 MINOR / 2 QUESTION on v0.1, with verdict "not ready
to lock as drafted." v0.2 closes all 7 MAJORs, 10 of 11 MINORs,
and both QUESTIONs. The product framing (biggest-fit, not max-tps)
and two-stage architecture are unchanged — round 1 explicitly
preserved both.

MAJORs closed:
- A.1 fallback contradiction: replaced metrics-bearing `fallbacks`
  with NAME-ONLY `alternates` (the STOP-on-first-feasible rule
  meant smaller candidates were never probed; v0.1's fallback
  metrics were structurally impossible).
- D.1 `models pull` precondition was bigger than admitted: FR-D
  reframed as "weights cache-warm before probe; load-fetch
  latency excluded from gate-ttft-ms" with Shape A (explicit
  pull) vs Shape B (rely on runtime online-fallback + isolate
  measurement) implementation choice. No longer depends on a
  not-yet-existing subcommand.
- E.1 launchd label wrong: `com.macprovider.cli` →
  `live.streamvc.macprovider` (matching SPEC-003 v0.9.2 §FR-C5,
  install.sh, plist template). Drain sequence bound to
  `launchctl bootout/bootstrap gui/$UID/...`.
- F.1 `--apply` wrote wrong YAML keys: `max_context_tokens` /
  `max_batch` were the CLI flag names; actual YAML keys per
  Config.swift:239-241 are `max_context_override` /
  `max_concurrency_override`. JSON `knobs` object now uses YAML
  key names for round-trip into config.yaml; `serve_command`
  retains CLI flag names for shell paste.
- F.2 recipe_hash not deterministic: pinned to
  `sha256:<64-lowercase-hex>` + RFC 8785 JCS canonicalization +
  explicit hash input domain enumeration (machine + inputs +
  recommendation.model + recommendation.knobs; excludes run_id,
  timestamps, observed metrics).
- G.1 SQLite migration invalid: `ALTER TABLE tune_trials ADD
  COLUMN stage INTEGER NOT NULL DEFAULT 1` spelled out; new
  inserts MUST set stage=1 or stage=2 explicitly.
- J.1 no AC for operator-supplied order: added AC-17 with
  `--candidate-models 1B,32B` on a Mac where both fit — must
  pick 1B because operator order is the contract.
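The F.2 fix can be illustrated with a short sketch. Two caveats: the field layout of `recipe` below is hypothetical, and `json.dumps(sort_keys=True, separators=(",", ":"))` only approximates RFC 8785 JCS. It matches for simple payloads (ASCII keys, strings, integers), but full JCS also prescribes ECMAScript number serialization and key ordering by UTF-16 code units, so a production implementation would need a dedicated JCS library.

```python
import hashlib
import json
from typing import Any, Dict


def canonical_json(obj: Any) -> bytes:
    """Approximate JCS: sorted keys, no insignificant whitespace, UTF-8 output."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def recipe_hash(recipe: Dict[str, Any]) -> str:
    """Hash only the enumerated input domain; run_id, timestamps, metrics are excluded."""
    hash_input = {
        "machine": recipe["machine"],
        "inputs": recipe["inputs"],
        "model": recipe["recommendation"]["model"],
        "knobs": recipe["recommendation"]["knobs"],
    }
    digest = hashlib.sha256(canonical_json(hash_input)).hexdigest()
    return f"sha256:{digest}"  # 64 lowercase hex characters after the prefix
```

Building `hash_input` explicitly, rather than deleting excluded keys from the full document, means a newly added field cannot silently change existing hashes.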

MINORs closed (10): B.1 (max-context-axis semantics), C.1 (CLI
summary kv-bits default), F.3 (backup naming collision-safe),
G.2 (transactional retention), H.1 (--resume removed from §7),
J.2 (AC-18 new), J.3 (AC-19 new + exit_reason enum),
K.1 (OQ-B/OQ-D quantitative thresholds), L.1 (prototype migration
note), M.1 (cross-spec renumber to SPEC-014).

QUESTIONs resolved (2): D.2 (signature vs network failure now
asymmetric — integrity aborts whole run, transient advances),
K.2 (added OQ-E flagging thermal/order bias with quantitative
threshold).

Deferred to post-lock: M.2 documentation checklist (decision-log
entry, SPEC-003 install note, PR #103 disposition) — captured as
a §11 checklist but not in the binding contract.

Files:
- specs/SPEC-013-cli-autotune.md (v0.1 → v0.2; +566 lines)
- specs/SPEC-013-audit.md (NEW, codex round-1 output)
- specs/AUDIT_SPEC_013_V0_2_PROMPT.md (NEW, round-2 audit prompt)

Next step: fire AUDIT_SPEC_013_V0_2_PROMPT.md at codex for the
round-2 closure check, address any new findings, repeat until
LOCK READY, then push + PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(cli): SPEC-013 v0.3 — round-2 audit response (LOCK candidate)

Round-2 codex audit (specs/SPEC-013-audit.md § Round 2) returned
LOCK READY with 17 CLOSED / 1 PARTIAL / 1 OVER-CLOSED on the
round-1 findings, plus 1 MAJOR new + 3 MINOR new. Codex
recommended a narrow v0.3 closing the 4 new findings before
implementation. v0.3 closes all 4. No architecture change.

Round-2 closures:
- N-D.1 (MAJOR) Shape B vs models-pull-only wording: v0.2's FR-D
  rewrite permitted Shape B (rely on runtime online-fallback +
  measurement isolation) but NFR-4's egress exception and AC-8
  still spoke only of `models pull`. v0.3 rewords NFR-4 to
  admit both Shape A and Shape B HuggingFace pre-warm paths;
  AC-8 is now shape-neutral with explicit Shape A (mocked pull
  exit non-zero) and Shape B (block egress + runtime fallback
  fails during load) variants. A new sub-variant explicitly
  tests the FR-D.2 integrity-class abort path.
- Z-B.1 (PARTIAL → CLOSED) `--max-context-axis` parse rules: v0.2
  put the parse rules in non-normative §7. v0.3 lifts them into
  FR-B.1 as a normative paragraph (absolute caps, sorted
  ascending after parse, each cell >= --target-context,
  flag-parse-time rejection with exit_reason='config_error',
  duplicate rejection, empty-axis = single-cell). The §7 /
  §5 conflict-resolution rule is now stated explicitly.
- N-OQ-E.1 (MINOR) thermal/order threshold lacked sampling
  protocol: v0.3 adds a 10-paired-runs forward/reverse protocol
  with 60s inter-pair idle, mismatch_pairs/10 > 0.05 trigger
  threshold. Operators can close OQ-E without relitigating
  methodology.
- O.1 (MINOR) residual v0.1-era wording drift: v0.3 closes four
  discrete sites — `tune_runs.spec_version` SQL comment,
  FR-H.2 "v0.1 normative contract" prose, NFR-3 stale
  `.bak-<unix-ts>` pattern, and §7's "MAY change in v0.2"
  disclaimer.
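The N-OQ-E.1 trigger is simple arithmetic once the paired runs are in hand. In the sketch below, a "mismatch" is assumed to mean that the forward-order and reverse-order runs of a pair produced different outcomes; the spec's exact definition is not quoted in this thread, and the function name is illustrative.

```python
from typing import Hashable, Sequence, Tuple


def order_bias_triggered(
    pairs: Sequence[Tuple[Hashable, Hashable]],
    threshold: float = 0.05,
) -> bool:
    """True when the share of mismatching forward/reverse pairs exceeds threshold."""
    if not pairs:
        raise ValueError("at least one forward/reverse pair is required")
    mismatch_pairs = sum(1 for forward, reverse in pairs if forward != reverse)
    return mismatch_pairs / len(pairs) > threshold
```

With exactly 10 pairs the smallest non-zero ratio is 1/10 = 0.1, which already exceeds 0.05, so under this reading a single mismatching pair is enough to trigger.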

Files:
- specs/SPEC-013-cli-autotune.md (v0.2 → v0.3 LOCK candidate)
- specs/SPEC-013-audit.md (codex round-2 output landed)
- specs/AUDIT_SPEC_013_V0_3_PROMPT.md (NEW, narrow round-3
  closure-confirmation audit prompt)

Next step: fire AUDIT_SPEC_013_V0_3_PROMPT.md at codex for the
round-3 LOCK-confirmation check. Expected outcome: LOCK with
0 new findings or ≤1 MINOR. If LOCK, push the branch and open
the DRAFT PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(cli): SPEC-013 v0.3 LOCK — round-3 codex confirmation + O-V03.1 fold-in

Round-3 codex audit (specs/SPEC-013-audit.md § Round 3) returned
LOCK with 4 CLOSED / 0 PARTIAL across the round-2 findings, plus
0 CRITICAL anti-regression / 0 MAJOR new / 1 MINOR new.

The single round-3 MINOR (O-V03.1) is editorial — FR-F.2's JSON
example still showed "SPEC-013 v0.2" inside a v0.3 document.
Codex explicitly said this does not block LOCK (the adjacent SQL
comment already stated writers emit their own producing version),
but recommended folding the fix in before implementation. This
commit folds it: the JSON example now uses
"SPEC-013 v<producing-version>" as a placeholder, and the
spec_version bullet teaches the rule.

Round-3 closures (from specs/SPEC-013-audit.md § Round 3):
- N-D.1 CLOSED: NFR-4 admits both Shape A (`models pull` or
  equivalent) and Shape B (runtime online fallback during model
  load) HuggingFace pre-warm paths; carve-out scoped to autotune
  runs and weight fetches; AC-8 shape-neutral with explicit
  Shape A + Shape B + integrity-class variants.
- Z-B.1 CLOSED: `--max-context-axis` parse contract lifted from
  non-normative §7 into binding FR-B.1 (absolute caps, sorted
  ascending, ≥ target-context, flag-parse-time rejection with
  exit_reason='config_error', duplicate rejection, empty-axis
  = single-cell); §7 vs §5 conflict-resolution rule explicit.
- N-OQ-E.1 CLOSED: OQ-E thermal/order threshold has a measurable
  10-paired-runs forward/reverse sampling protocol on air5 with
  60s inter-pair idle and the mismatch_pairs/10 > 0.05 trigger.
- O.1 CLOSED: all 4 named drift sites updated (tune_runs
  SQL comment, FR-H.2 prose, NFR-3 backup pattern, §7
  disclaimer).
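The Z-B.1 parse contract can be sketched as a small validator. Several details are assumptions because the spec text is not quoted here: the comma-separated input format, the `ConfigError` class, and representing the empty-axis single cell as `[None]` (meaning "one cell, no explicit cap").

```python
from typing import List, Optional


class ConfigError(ValueError):
    """Raised at flag-parse time; maps to exit_reason='config_error'."""

    exit_reason = "config_error"


def parse_max_context_axis(raw: Optional[str], target_context: int) -> List[Optional[int]]:
    """Parse absolute context caps, reject bad input early, return them ascending."""
    if raw is None or not raw.strip():
        return [None]  # empty axis: a single cell
    caps: List[int] = []
    for token in raw.split(","):
        token = token.strip()
        if not (token.isascii() and token.isdigit()):
            raise ConfigError(f"not a non-negative integer: {token!r}")
        caps.append(int(token))
    if len(set(caps)) != len(caps):
        raise ConfigError("duplicate caps in --max-context-axis")
    for cap in caps:
        if cap < target_context:
            raise ConfigError(f"cap {cap} is below --target-context {target_context}")
    return sorted(caps)
```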

Specs index updated: specs/README.md row for SPEC-013 now reads
v0.3.

Files:
- specs/SPEC-013-cli-autotune.md (O-V03.1 editorial fold-in)
- specs/SPEC-013-audit.md (codex round-3 LOCK verdict landed)
- specs/README.md (SPEC-013 row → v0.3)

Audit cycle complete after 3 codex rounds: v0.1 → v0.2 (7 MAJOR
+ 10 MINOR + 2 QUESTION closed) → v0.3 (1 MAJOR + 3 MINOR closed
from round 2) → LOCK. Next step: push branch + open DRAFT PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(cli): SPEC-013 BUILD prompt — Option A Swift-native impl plan

Adds the operator-paste BUILD prompt that a fresh Codex CLI
session uses to implement SPEC-013 v0.3 against the existing
phase3-binary/ Swift package.

The BUILD prompt picks Option A (Swift-native subcommand inside
macprovider-cli) per SPEC-013 §10, rationale captured inline:
single-binary install consistency with SPEC-003, drain semantics
match existing patterns (UninstallCommand / SelfUpdate), future
SPEC-011 warm-swap integration needs Swift-native. Shape A vs
Shape B for FR-D pre-warm is left as the implementer's call —
the binding contract is FR-D.1's measurement-isolation
requirement, not the mechanism.

The 11-step build sequence:
1. AutotuneCommand subcommand scaffolding + --dry-run
2. --no-join flag on ServeCommand (FR-E.2 precondition)
3. SQLite schema + DB layer (tune_trials + tune_runs +
   migration + transactional retention)
4. Provider lifecycle (start/stop/wait-ready, single-provider
   invariant)
5. Provider-conflict pre-flight (FR-E.1, launchd bootout/bootstrap)
6. Pre-warm (FR-D Shape A or Shape B, integrity vs transient
   classification per FR-D.2)
7. Stage 1 — feasibility iteration (FR-A, STOP-on-first-feasible,
   AC-17 biggest-fit guard)
8. Stage 2 — knob hill-climb (FR-B, _is_new_best verbatim from
   prototype)
9. Recommendation surface (FR-F: terminal block + JSON schema +
   RFC 8785 JCS recipe_hash + --apply atomic write)
10. Failure modes + signal handling (FR-H, exit_reason enum)
11. Acceptance test suite (AC-1 through AC-19)

Branch strategy: build work happens on feat/cli-autotune-impl
stacked off spec/cli-autotune-v1, so the implementing PR rebases
cleanly onto main after the SPEC PR (#108) merges.

Hard rules: do not modify any file under specs/ from the build
branch (it's downstream of the SPEC PR); do not touch
beta/coordinator/gateway; do not pivot to Option B without an
Open Question; preserve the biggest-fit objective (AC-17 catches
this if forgotten).

The prompt is self-contained: severity definitions, required
reading, step sequence, acceptance gate, hard rules, anti-rules,
operator checkpoint cadence, and an Open Question template.

Files:
- specs/BUILD_SPEC_013_PROMPT.md (NEW)

Next step: branch feat/cli-autotune-impl off spec/cli-autotune-v1
and fire BUILD_SPEC_013_PROMPT.md at codex via omc ask. Expected
wall-clock: 1-2 weeks of session work for one new subcommand
inside an existing binary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@Augustas11

Owner Author

Closing this spike — the v1 SPEC-013 surface has landed via:

The prototype's _is_new_best semantics (TPS-primary, TTFT-tiebreak-within-tie-band) were ported verbatim into Stage2HillClimb.swift, with Step 8's audit verifying the port branch-by-branch against the Python reference.

v0.4 work items (held for reference)

If/when a v0.4 SPEC bump happens, these items are the natural next steps. Re-reference this spike if useful:

  1. AC-17 size-parsed alternates — v1's RecommendationEmitter.alternates(...) uses position-based slicing, which mis-surfaces alternates for arbitrary operator orders like --candidate-models 1b,32b. Fix path: plumb size-parsed ordering through a new candidatesBySize: [String]? field on RecommendationInputs and extend AutotuneCommand.parseSizeB to scan arbitrary HF IDs for \d+B substrings. See implementation-notes.html Step 11 audit-response entry for the full deviation analysis.
  2. AC-6 real-subprocess detection — v1 has placeholder XCTSkip tests for the real launchd/foreground detection paths. The unit-level detection logic is already covered by ProviderConflictDetectorTests; the real-spawn harness is v2 scope.
  3. AC-7 real-subprocess + coordinator-pool observation — v1 unit-tests argv construction; full subprocess + coordinator pool observation is v2.
  4. AC-8 Shape A pull-subcommand variant — Step 6 selected Shape B (runtime online-fallback pre-warm classification); the Shape A subcommand-based variant is intentionally out of v1 scope.
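For item 1, the size scan could look like the Python sketch below (the real fix would be in Swift's AutotuneCommand.parseSizeB). The regex is an assumption that extends the \d+B idea in two ways: it is case-insensitive so `1b` and `32b` parse, and it refuses a `b` that is followed by another letter so quantization suffixes such as `4bit` are not mistaken for a size.

```python
import re
from typing import List, Optional

# digits (optionally fractional) + "b"/"B", not followed by another letter
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)b(?![a-z])", re.IGNORECASE)


def parse_size_b(model_id: str) -> Optional[float]:
    """Return the parameter count in billions parsed from a model ID, or None."""
    match = _SIZE_RE.search(model_id)
    return float(match.group(1)) if match else None


def order_largest_first(models: List[str]) -> List[str]:
    """Sort by parsed size, largest first; IDs with no parsable size go last."""
    sized = [(parse_size_b(m), i, m) for i, m in enumerate(models)]
    known = sorted((s for s in sized if s[0] is not None), key=lambda s: (-s[0], s[1]))
    unknown = [s for s in sized if s[0] is None]
    return [m for _, _, m in known + unknown]
```

This ordering would only feed the alternates list; per AC-17, the probe order stays whatever the operator supplied.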

The audit-loop discipline that landed v1 is memorialized at beta/DECISION_CRITERIA.md Entry 78 for future SPEC-013 v0.4 work to follow.

Closing as superseded.
