spike: provider-side model-selection autotune loop (autoresearch) #103
Augustas11 wants to merge 7 commits into
…knobs for autoresearch (#105)

Widens the autoresearch search space for beta/autotune.py (PR #103) beyond just --model. All three knobs are real downstream wiring into mlx-swift 2.29.1, not just CLI cosmetics.

- --kv-bits {4,8}: forwarded to MLXLMCommon.GenerateParameters.kvBits (both complete + stream call sites); preflight rejects anything but 4/8.
- --max-context <N>: extends the existing per-tier maxContextTokens cap; tokens are still rejected at the existing context_length_exceeded 413 boundary, and we additionally pass maxKVSize=maxContextTokens to GenerateParameters so the KV cache (RotatingKVCache) honors the cap.
- --max-batch <N> (default 1, prior single-slot behavior preserved): lifts the previously-hardcoded AsyncSemaphore(value: 1) inside ModelRuntime to be configurable. Reuses the existing maxConcurrencyOverride config field that was already plumbed from YAML/env but never wired to the CLI or runtime.

All knobs are triple-exposed (CLI > env > YAML > default), matching the house convention. Preflight (runServingKnobsPreflight) fails loud at serve start instead of mid-inference on invalid values.

A bug fix is folded in: ServeCommand.run() was hardcoding maxConcurrencyOverride: 1 when building ProviderCapacity, silently ignoring the resolved config. The capacity now reflects --max-batch.

Tests: ServingKnobsConfigTests.swift adds 21 cases covering config-resolution precedence (CLI > env > YAML), defaults preserved, preflight rejection of invalid values, runtime threading of all three knobs, and a regression on the context_length_exceeded gate. Total suite: 219 -> 240, all passing.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
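The precedence rule and the fail-loud preflight described in this commit can be illustrated with a minimal Python sketch. This is not the Swift implementation: the `MACPROVIDER_` environment prefix, the YAML key names, and both function names are assumptions made for illustration only.

```python
import os

ALLOWED_KV_BITS = {4, 8}  # the commit message says preflight rejects anything else


def resolve_knob(name, cli_value, yaml_config, default, env=None):
    """Resolve one knob with CLI > env > YAML > default precedence."""
    env = os.environ if env is None else env
    if cli_value is not None:
        return cli_value
    env_key = "MACPROVIDER_" + name.upper()  # hypothetical env naming scheme
    if env_key in env:
        return int(env[env_key])
    if name in yaml_config:  # hypothetical YAML key, same as the knob name
        return yaml_config[name]
    return default


def preflight_kv_bits(kv_bits):
    """Fail at startup rather than mid-inference on an invalid value."""
    if kv_bits is not None and kv_bits not in ALLOWED_KV_BITS:
        raise ValueError(f"--kv-bits must be 4 or 8, got {kv_bits!r}")
    return kv_bits
```

Passing `env` explicitly keeps the resolver easy to test without mutating the process environment.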
Adds three files for the Phase 2 buyer harness context/concurrency sweep:

- beta/sweep.py: CLI grid sweep over (context_target, concurrency) cells. Imports fire_stream from harness.py (SSE parser reuse). Writes results to new sweep_runs SQLite table. Gate: feasible = n_err==0 AND ttft_p95<=gate AND no stop_token_leak. Flags: --dry-run, --base-url (required, no remote default), --contexts, --concurrency overrides, --decode-control second pass, --gate-ttft-ms.
- beta/sweep_report.py: Reads sweep_runs for a sweep_id (or latest) and renders a self-contained HTML heatmap (green/red cells). Matches report.py single-file-HTML style. Drops into reports_dir.
- beta/mock_llm_server.py: Local SSE stub on port 18080 serving /v1/chat/completions. No remote traffic. Supports --error-rate flag to exercise the red/fail gate path. Used only for smoke-tests.

Smoke-tested: dry-run prints 28 cells; 4-cell real sweep (contexts 1000,2000 x conc 1,2) against local mock shows feasible=1 with populated tps/ttft; error-rate=1.0 run confirms feasible=0 red path; sweep_report.py renders correct green/red heatmap HTML.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
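The gate expression quoted above (`feasible = n_err==0 AND ttft_p95<=gate AND no stop_token_leak`) maps directly onto a small predicate. The sketch below is an illustration only; the `CellResult` type and its field names are assumptions, not the actual sweep.py data model.

```python
from dataclasses import dataclass


@dataclass
class CellResult:
    """Hypothetical per-cell aggregate; field names are illustrative."""
    n_err: int
    ttft_p95_ms: float
    stop_token_leak: bool


def is_feasible(cell: CellResult, gate_ttft_ms: float) -> bool:
    """A cell passes only if all three gate conditions hold."""
    return (
        cell.n_err == 0
        and cell.ttft_p95_ms <= gate_ttft_ms
        and not cell.stop_token_leak
    )
```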
Halts the sweep the moment a cell returns request errors (n_err > 0) — on a memory-constrained node that almost always means OOM, and the ctx-major grid would otherwise re-slam the box with every heavier cell. TTFT-gate-only failures (slow, no errors) do not stop the sweep. Fixes the end-of-run summary to report attempted (not total) cells when the sweep aborts early. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
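A minimal sketch of that early-abort rule, assuming a hypothetical `run_cell` callable that returns a dict with `n_err` and `feasible` keys (the real sweep.py internals are not shown in this PR):

```python
def run_sweep(cells, run_cell):
    """Run cells in order; stop at the first cell with request errors.

    A cell that is merely slow (gate failure with n_err == 0) does not
    stop the sweep. Returns the attempted (cell, result) pairs and an
    aborted flag so the summary can report attempted, not total, cells.
    """
    attempted = []
    aborted = False
    for cell in cells:
        result = run_cell(cell)
        attempted.append((cell, result))
        if result["n_err"] > 0:
            aborted = True
            break
    return attempted, aborted
```

Because the grid is context-major, every cell after the failing one is at least as heavy, which is why stopping is safer than skipping a single cell.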
- harness.fire_stream/fire_nonstream: optional headers= (backward-compatible)
- sweep.py: --api-key/--api-key-file (Bearer for the gateway leg; reads ~/.config/macprovider/buyer-api-key by default), --model override (also pins the provider via model-routing on the gateway path), --max-tokens override for fast runs on slow/constrained nodes.

Local/direct runs send no auth. Verified header threading with a capture server.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add beta/autotune.py: a real provider-side optimization loop that discovers the best servable model for a given Mac. Not a static benchmark — it proposes each (model, context) config, measures agg throughput_tps + ttft via a fixed workload, applies a feasibility gate (request error / non-200 / OOM / TTFT-gate => fits=0), keeps the config only if it beats the current best, logs one row per trial, and tracks best-so-far. One provider served at a time (start -> wait-ready -> fire -> pkill), never two at once. Reuses harness.fire_stream (SSE metrics) and sweep.build_padded_prompt + sweep.aggregate_cell unchanged. New additive tune_trials SQLite table; existing runs/adversarial_runs/sweep_runs untouched. Self-contained HTML report mirrors sweep_report.py with the winner highlighted and the best-so-far progression. First hill-climb on the 8GB M1 Air (3 models x contexts 2000,8000): WINNER = Llama-3.2-1B-Instruct-4bit @ 2000 (9.7 tok/s, ttft 2380ms). Fit gate exercised: 1B fits at 8000 (2.4 tps) where 3B and Phi-3.5-mini both fail at 8000; 3B@8000 and Phi@2000 completed but missed the 60s TTFT gate (gated out). Provider stopped between every trial and at end. Flags: --models --contexts --db-path --reports-dir --max-tokens --ready-timeout --gate-ttft-ms --dry-run --report-only. Machine-agnostic. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
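The propose, measure, gate, keep-best loop described in this commit can be sketched as follows. This is a simplified illustration, not the contents of beta/autotune.py: `start_provider`, `measure`, and `stop_provider` are hypothetical injected callables, and `measure` is assumed to return a dict with `ok`, `tps`, and `ttft_ms`.

```python
def autotune(candidates, start_provider, measure, stop_provider,
             gate_ttft_ms=60_000):
    """Hill-climb over (model, context) candidates, one provider at a time."""
    best = None
    trials = []
    for model, context in candidates:
        start_provider(model)
        try:
            result = measure(context)
        finally:
            stop_provider()  # provider is always stopped before the next trial
        # Fit gate: a request error or a TTFT over the gate marks fits=0.
        fits = bool(result["ok"]) and result["ttft_ms"] <= gate_ttft_ms
        kept = fits and (best is None or result["tps"] > best["tps"])
        trial = {
            "model": model,
            "context": context,
            "tps": result["tps"],
            "ttft_ms": result["ttft_ms"],
            "fits": int(fits),
            "kept": int(kept),
        }
        trials.append(trial)  # one row per trial, kept or not
        if kept:
            best = trial
    return best, trials
```

The `try`/`finally` around `measure` is one way to honor the "never two at once" invariant even when a measurement raises.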
…ch axes

PR #105 exposed --kv-bits / --max-context / --max-batch as macprovider-cli serve flags. Wire them into autotune.py as three OPTIONAL search axes (--kv-bits-options / --max-context-options / --max-batch-options). Each defaults to [None], so omitting all three preserves the original model x context candidate space exactly (the original 6-trial 8GB Air run still produces an identical 6-trial plan). When set, candidates are the full cartesian, with the chosen knobs passed through to start_provider and recorded in three additively-migrated tune_trials columns (kv_bits, max_context_cap, max_batch). Legacy rows keep NULL in the new columns; existing reports remain readable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
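The "defaults to [None]" trick that keeps the original plan intact is easy to see with `itertools.product`. The function below is a sketch with an assumed name and return shape; only the axis names and column names come from the commit message.

```python
from itertools import product


def build_candidates(models, contexts,
                     kv_bits_options=(None,),
                     max_context_options=(None,),
                     max_batch_options=(None,)):
    """Full cartesian product; an omitted axis contributes a single None."""
    return [
        {
            "model": model,
            "context": context,
            "kv_bits": kv_bits,
            "max_context_cap": max_context_cap,
            "max_batch": max_batch,
        }
        for model, context, kv_bits, max_context_cap, max_batch in product(
            models, contexts, kv_bits_options,
            max_context_options, max_batch_options,
        )
    ]
```

With three models and two contexts and no knob axes, this yields six candidates; adding `kv_bits_options=(4, 8)` doubles the plan to twelve.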
0014d91 to c13a91b
When a sweep cell failed, sweep_runs recorded n_err > 0 but notes was NULL — the per-request HTTP status and error string from harness.py were dropped on the floor. Caused a ctx=2000 production-gateway misdiagnosis: a transient 503 provider_unavailable was misread as a gateway streaming read-idle bug, with no way to confirm without re-running. aggregate_cell now collects up to 3 distinct (status, error[:80]) pairs from the per-request results, joins them into a ~200-char summary, and exposes it via the existing notes column. notes stays NULL on cells where every request succeeded. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
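A sketch of the notes summary this commit describes: up to 3 distinct `(status, error[:80])` pairs, joined and capped at roughly 200 characters, with `None` when every request succeeded. The per-request dict shape and the definition of a failed request (non-200 status or a non-empty error) are assumptions, not taken from harness.py.

```python
def summarize_errors(results, max_pairs=3, max_len=200):
    """Return a short notes string for failed requests, or None if all passed."""
    pairs = []
    for r in results:
        status = r.get("status")
        error = r.get("error") or ""
        if status == 200 and not error:
            continue  # successful request contributes nothing
        pair = (status, error[:80])
        if pair not in pairs:
            pairs.append(pair)
        if len(pairs) == max_pairs:
            break
    if not pairs:
        return None
    return "; ".join(f"{status}: {error}" for status, error in pairs)[:max_len]
```

Deduplicating before truncating means a cell with many identical failures still leaves room for a second, different error.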
Two methodology fixes caught by the air5 24-trial hill-climb:

1. --replicates N (default 1, preserves single-shot behavior). When >1, fires N requests against ONE loaded provider per cell and aggregates by MEDIAN tps/ttft. The cell is feasible only if EVERY replicate is feasible: strict, befitting a 'recipe' meant to be applied as a recommendation. Provider is loaded once per cell and reused, so the extra cost is N-1 inferences (no extra model load). Recommended value: 3 when publishing a recipe (single-trial measurements drift 10-15% from background CPU/GPU contention).
2. TTFT tiebreak in the keep-best decision (TPS_TIE_EPSILON = 2%). The old logic 'tps > best_tps' kept the FIRST trial in a tie band, even if a later trial had the same tps and a meaningfully better TTFT. Air5 hit this: 1B kv=8 mb=1 (10.9tps, 3.8s ttft) was kept over 1B kv=8 mb=2 (10.9tps, 3.0s ttft). New _is_new_best() helper: strictly higher tps wins; within tie band, lower TTFT wins. Replaying air5's 24 trials through the new logic now picks the mb=2 config (21% faster first-token).

Schema: additive replicates_n INTEGER column via the existing migration mechanism. Existing rows keep NULL.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
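Both fixes can be sketched in a few lines. This is one reading of the commit message rather than the actual `_is_new_best()` source: the tie band is interpreted here as a relative difference of at most `TPS_TIE_EPSILON` around the current best, and the function and key names are illustrative.

```python
from statistics import median

TPS_TIE_EPSILON = 0.02  # 2% tie band, per the commit message


def aggregate_replicates(replicates):
    """Median tps/ttft across replicates; feasible only if every one is."""
    return {
        "tps": median(r["tps"] for r in replicates),
        "ttft_ms": median(r["ttft_ms"] for r in replicates),
        "feasible": all(r["feasible"] for r in replicates),
    }


def is_new_best(tps, ttft_ms, best_tps, best_ttft_ms):
    """Clearly higher tps wins; inside the tie band, lower TTFT wins."""
    if best_tps is None:
        return True
    if tps > best_tps * (1 + TPS_TIE_EPSILON):
        return True
    if tps >= best_tps * (1 - TPS_TIE_EPSILON):
        return ttft_ms < best_ttft_ms
    return False
```

With the air5 pair quoted above, a later trial at the same 10.9 tps and a 3.0s TTFT replaces an earlier one at 3.8s, which the old `tps > best_tps` comparison would not have done.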
Closing without merge. This spike's objective (max-tps cartesian search over model × ctx × kv-bits × max-batch) is superseded by the v1 SPEC for The spike branch stays accessible at
Empirical data this spike produced (preserved in the SPEC's "Empirical findings" section):
Key methodology lesson the spike surfaced: throughput measurements on mlx-swift have ≥20% trial-to-trial variance, dominating small knob-level deltas. Fit-gate determinations are stable. The v1 SPEC bakes this in (TPS_TIE_EPSILON raised to 10%, recommended publish-replicates N=5, no kv-bits prior).
* spec(cli): SPEC-013 v0.1 - macprovider-cli autotune subcommand

Initial draft of the autotune subcommand spec + the round-1 codex audit prompt. NOT for merge: this commit lives on the feature branch only and the PR is held until the codex audit loop converges.

SPEC-013 wraps the PR #105 serve flags (--kv-bits, --max-context, --max-batch) in a two-stage pipeline that encodes the "biggest-fit, not max-tps" product strategy. Stage 1 iterates a curated largest-first candidate list and STOPS on the first model that passes the feasibility gate; Stage 2 hill-climbs knobs WITHIN the chosen model. This is the load-bearing departure from the PR #103 Python prototype (whose cartesian max-tps loop would push every capable Mac to serve the smallest model).

Four numerical defaults (TPS_TIE_EPSILON, stage1_replicates, stage2_replicates, kv-bits axis-vs-default) are flagged as Open Questions pending the in-flight air5 n=3 replication run; v0.2 either confirms placeholders or sends a narrow PR adjusting them.

Files:
- specs/SPEC-013-cli-autotune.md (new, v0.1 draft)
- specs/AUDIT_SPEC_013_PROMPT.md (new, round-1 codex audit prompt)
- specs/README.md (+1 row in the index table)

Next step: fire AUDIT_SPEC_013_PROMPT.md at codex, address findings in v0.2, re-audit, loop until 0 CRITICAL / 0 MAJOR, then push + PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(cli): SPEC-013 v0.2 - round-1 codex audit response

Round-1 codex audit (specs/SPEC-013-audit.md) returned 0 CRITICAL / 7 MAJOR / 11 MINOR / 2 QUESTION on v0.1, with verdict "not ready to lock as drafted." v0.2 closes all 7 MAJORs, 10 of 11 MINORs, and both QUESTIONs. The product framing (biggest-fit, not max-tps) and two-stage architecture are unchanged; round 1 explicitly preserved both.
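The two-stage pipeline from the v0.1 commit message above (Stage 1 stops on the first feasible model in a largest-first list, Stage 2 tunes knobs within that model) can be sketched in Python for illustration. The function names and the `probe_feasible`, `measure`, and `is_better` callables are hypothetical; the SPEC itself targets a Swift-native subcommand.

```python
def stage1_pick_model(candidates_largest_first, probe_feasible):
    """Return the first feasible model; smaller candidates are never probed."""
    probed = []
    for model in candidates_largest_first:
        probed.append(model)
        if probe_feasible(model):
            return model, probed
    return None, probed


def stage2_tune_knobs(model, knob_cells, measure, is_better):
    """Hill-climb knob settings within the model chosen by Stage 1."""
    best = None
    for knobs in knob_cells:
        metrics = measure(model, knobs)
        if metrics["feasible"] and (best is None or is_better(metrics, best)):
            best = dict(metrics, knobs=knobs)
    return best
```

Because Stage 1 returns as soon as a candidate passes, the order of the list is what decides the outcome, which is the opposite of a max-tps search that would favor the smallest model.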
MAJORs closed:
- A.1 fallback contradiction: replaced metrics-bearing `fallbacks` with NAME-ONLY `alternates` (the STOP-on-first-feasible rule meant smaller candidates were never probed; v0.1's fallback metrics were structurally impossible).
- D.1 `models pull` precondition was bigger than admitted: FR-D reframed as "weights cache-warm before probe; load-fetch latency excluded from gate-ttft-ms" with Shape A (explicit pull) vs Shape B (rely on runtime online-fallback + isolate measurement) implementation choice. No longer depends on a not-yet-existing subcommand.
- E.1 launchd label wrong: `com.macprovider.cli` → `live.streamvc.macprovider` (matching SPEC-003 v0.9.2 §FR-C5, install.sh, plist template). Drain sequence bound to `launchctl bootout/bootstrap gui/$UID/...`.
- F.1 `--apply` wrote wrong YAML keys: `max_context_tokens` / `max_batch` were the CLI flag names; actual YAML keys per Config.swift:239-241 are `max_context_override` / `max_concurrency_override`. JSON `knobs` object now uses YAML key names for round-trip into config.yaml; `serve_command` retains CLI flag names for shell paste.
- F.2 recipe_hash not deterministic: pinned to `sha256:<64-lowercase-hex>` + RFC 8785 JCS canonicalization + explicit hash input domain enumeration (machine + inputs + recommendation.model + recommendation.knobs; excludes run_id, timestamps, observed metrics).
- G.1 SQLite migration invalid: `ALTER TABLE tune_trials ADD COLUMN stage INTEGER NOT NULL DEFAULT 1` spelled out; new inserts MUST set stage=1 or stage=2 explicitly.
- J.1 no AC for operator-supplied order: added AC-17 with `--candidate-models 1B,32B` on a Mac where both fit; it must pick 1B because operator order is the contract.
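The F.2 closure (a deterministic `recipe_hash` over a fixed input domain) can be illustrated with a short Python sketch. Two caveats: `json.dumps` with sorted keys and compact separators only approximates RFC 8785 JCS (it matches for ASCII keys with string and integer values, but not in general for floats or non-BMP keys), and the nesting of the hash input shown here is an assumption rather than the SPEC's exact schema.

```python
import hashlib
import json


def recipe_hash(recipe):
    """Hash only the stable inputs; run_id, timestamps, metrics are excluded."""
    domain = {
        "machine": recipe["machine"],
        "inputs": recipe["inputs"],
        "recommendation": {
            "model": recipe["recommendation"]["model"],
            "knobs": recipe["recommendation"]["knobs"],
        },
    }
    # Approximation of JCS canonical form: sorted keys, no whitespace.
    canonical = json.dumps(
        domain, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs on the same machine with the same inputs and recommendation then produce the same hash even though their `run_id` and observed throughput differ.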
MINORs closed (10): B.1 (max-context-axis semantics), C.1 (CLI summary kv-bits default), F.3 (backup naming collision-safe), G.2 (transactional retention), H.1 (--resume removed from §7), J.2 (AC-18 new), J.3 (AC-19 new + exit_reason enum), K.1 (OQ-B/OQ-D quantitative thresholds), L.1 (prototype migration note), M.1 (cross-spec renumber to SPEC-014).

QUESTIONs resolved (2): D.2 (signature vs network failure now asymmetric: integrity aborts whole run, transient advances), K.2 (added OQ-E flagging thermal/order bias with quantitative threshold).

Deferred to post-lock: M.2 documentation checklist (decision-log entry, SPEC-003 install note, PR #103 disposition), captured as a §11 checklist but not in the binding contract.

Files:
- specs/SPEC-013-cli-autotune.md (v0.1 → v0.2; +566 lines)
- specs/SPEC-013-audit.md (NEW, codex round-1 output)
- specs/AUDIT_SPEC_013_V0_2_PROMPT.md (NEW, round-2 audit prompt)

Next step: fire AUDIT_SPEC_013_V0_2_PROMPT.md at codex for the round-2 closure check, address any new findings, repeat until LOCK READY, then push + PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(cli): SPEC-013 v0.3 - round-2 audit response (LOCK candidate)

Round-2 codex audit (specs/SPEC-013-audit.md § Round 2) returned LOCK READY with 17 CLOSED / 1 PARTIAL / 1 OVER-CLOSED on the round-1 findings, plus 1 MAJOR new + 3 MINOR new. Codex recommended a narrow v0.3 closing the 4 new findings before implementation. v0.3 closes all 4. No architecture change.

Round-2 closures:
- N-D.1 (MAJOR) Shape B vs models-pull-only wording: v0.2's FR-D rewrite permitted Shape B (rely on runtime online-fallback + measurement isolation) but NFR-4's egress exception and AC-8 still spoke only of `models pull`. v0.3 rewords NFR-4 to admit both Shape A and Shape B HuggingFace pre-warm paths; AC-8 is now shape-neutral with explicit Shape A (mocked pull exit non-zero) and Shape B (block egress + runtime fallback fails during load) variants.
A new sub-variant explicitly tests the FR-D.2 integrity-class abort path.
- Z-B.1 (PARTIAL → CLOSED) `--max-context-axis` parse rules: v0.2 put the parse rules in non-normative §7. v0.3 lifts them into FR-B.1 as a normative paragraph (absolute caps, sorted ascending after parse, each cell >= --target-context, flag-parse-time rejection with exit_reason='config_error', duplicate rejection, empty-axis = single-cell). The §7 / §5 conflict-resolution rule is now stated explicitly.
- N-OQ-E.1 (MINOR) thermal/order threshold lacked sampling protocol: v0.3 adds a 10-paired-runs forward/reverse protocol with 60s inter-pair idle, mismatch_pairs/10 > 0.05 trigger threshold. Operators can close OQ-E without relitigating methodology.
- O.1 (MINOR) residual v0.1-era wording drift: v0.3 closes four discrete sites: `tune_runs.spec_version` SQL comment, FR-H.2 "v0.1 normative contract" prose, NFR-3 stale `.bak-<unix-ts>` pattern, and §7's "MAY change in v0.2" disclaimer.

Files:
- specs/SPEC-013-cli-autotune.md (v0.2 → v0.3 LOCK candidate)
- specs/SPEC-013-audit.md (codex round-2 output landed)
- specs/AUDIT_SPEC_013_V0_3_PROMPT.md (NEW, narrow round-3 closure-confirmation audit prompt)

Next step: fire AUDIT_SPEC_013_V0_3_PROMPT.md at codex for the round-3 LOCK-confirmation check. Expected outcome: LOCK with 0 new findings or ≤1 MINOR. If LOCK, push the branch and open the DRAFT PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(cli): SPEC-013 v0.3 LOCK - round-3 codex confirmation + O-V03.1 fold-in

Round-3 codex audit (specs/SPEC-013-audit.md § Round 3) returned LOCK with 4 CLOSED / 0 PARTIAL across the round-2 findings, plus 0 CRITICAL anti-regression / 0 MAJOR new / 1 MINOR new. The single round-3 MINOR (O-V03.1) is editorial: FR-F.2's JSON example still showed "SPEC-013 v0.2" inside a v0.3 document.
Codex explicitly said this does not block LOCK (the adjacent SQL comment already stated writers emit their own producing version), but recommended folding the fix in before implementation. This commit folds it: the JSON example now uses "SPEC-013 v<producing-version>" as a placeholder, and the spec_version bullet teaches the rule.

Round-3 closures (from specs/SPEC-013-audit.md § Round 3):
- N-D.1 CLOSED: NFR-4 admits both Shape A (`models pull` or equivalent) and Shape B (runtime online fallback during model load) HuggingFace pre-warm paths; carve-out scoped to autotune runs and weight fetches; AC-8 shape-neutral with explicit Shape A + Shape B + integrity-class variants.
- Z-B.1 CLOSED: `--max-context-axis` parse contract lifted from non-normative §7 into binding FR-B.1 (absolute caps, sorted ascending, ≥ target-context, flag-parse-time rejection with exit_reason='config_error', duplicate rejection, empty-axis = single-cell); §7 vs §5 conflict-resolution rule explicit.
- N-OQ-E.1 CLOSED: OQ-E thermal/order threshold has a measurable 10-paired-runs forward/reverse sampling protocol on air5 with 60s inter-pair idle and the mismatch_pairs/10 > 0.05 trigger.
- O.1 CLOSED: all 4 named drift sites updated (tune_runs SQL comment, FR-H.2 prose, NFR-3 backup pattern, §7 disclaimer).

Specs index updated: specs/README.md row for SPEC-013 now reads v0.3.

Files:
- specs/SPEC-013-cli-autotune.md (O-V03.1 editorial fold-in)
- specs/SPEC-013-audit.md (codex round-3 LOCK verdict landed)
- specs/README.md (SPEC-013 row → v0.3)

Audit cycle complete after 3 codex rounds: v0.1 → v0.2 (7 MAJOR + 10 MINOR + 2 QUESTION closed) → v0.3 (1 MAJOR + 3 MINOR closed from round 2) → LOCK. Next step: push branch + open DRAFT PR.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* spec(cli): SPEC-013 BUILD prompt - Option A Swift-native impl plan

Adds the operator-paste BUILD prompt that a fresh Codex CLI session uses to implement SPEC-013 v0.3 against the existing phase3-binary/ Swift package.

The BUILD prompt picks Option A (Swift-native subcommand inside macprovider-cli) per SPEC-013 §10, rationale captured inline: single-binary install consistency with SPEC-003, drain semantics match existing patterns (UninstallCommand / SelfUpdate), future SPEC-011 warm-swap integration needs Swift-native. Shape A vs Shape B for FR-D pre-warm is left as the implementer's call: the binding contract is FR-D.1's measurement-isolation requirement, not the mechanism.

The 11-step build sequence:
1. AutotuneCommand subcommand scaffolding + --dry-run
2. --no-join flag on ServeCommand (FR-E.2 precondition)
3. SQLite schema + DB layer (tune_trials + tune_runs + migration + transactional retention)
4. Provider lifecycle (start/stop/wait-ready, single-provider invariant)
5. Provider-conflict pre-flight (FR-E.1, launchd bootout/bootstrap)
6. Pre-warm (FR-D Shape A or Shape B, integrity vs transient classification per FR-D.2)
7. Stage 1: feasibility iteration (FR-A, STOP-on-first-feasible, AC-17 biggest-fit guard)
8. Stage 2: knob hill-climb (FR-B, _is_new_best verbatim from prototype)
9. Recommendation surface (FR-F: terminal block + JSON schema + RFC 8785 JCS recipe_hash + --apply atomic write)
10. Failure modes + signal handling (FR-H, exit_reason enum)
11. Acceptance test suite (AC-1 through AC-19)

Branch strategy: build work happens on feat/cli-autotune-impl stacked off spec/cli-autotune-v1, so the implementing PR rebases cleanly onto main after the SPEC PR (#108) merges.
Hard rules: do not modify any file under specs/ from the build branch (it's downstream of the SPEC PR); do not touch beta/coordinator/gateway; do not pivot to Option B without an Open Question; preserve the biggest-fit objective (AC-17 catches this if forgotten).

The prompt is self-contained: severity definitions, required reading, step sequence, acceptance gate, hard rules, anti-rules, operator checkpoint cadence, and an Open Question template.

Files:
- specs/BUILD_SPEC_013_PROMPT.md (NEW)

Next step: branch feat/cli-autotune-impl off spec/cli-autotune-v1 and fire BUILD_SPEC_013_PROMPT.md at codex via omc ask. Expected wall-clock: 1-2 weeks of session work for one new subcommand inside an existing binary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
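For readers skimming the audit rounds above, the `--max-context-axis` parse contract that v0.3 lifted into FR-B.1 (absolute caps, sorted ascending, each cap at least the target context, duplicates rejected at flag-parse time with `exit_reason='config_error'`) can be sketched in Python. The comma-separated input format, the `ConfigError` type, and the choice to represent an empty axis as a single `None` cell are all assumptions made for illustration; the SPEC text is the authority.

```python
class ConfigError(ValueError):
    """Hypothetical error type carrying the exit_reason named in the SPEC."""
    exit_reason = "config_error"


def parse_max_context_axis(raw, target_context):
    """Parse an axis such as "16000,8000" into a sorted list of caps."""
    if raw is None or not raw.strip():
        return [None]  # assumed meaning of "empty-axis = single-cell"
    caps = []
    for token in raw.split(","):
        token = token.strip()
        if not token.isdigit():
            raise ConfigError(f"not an absolute token cap: {token!r}")
        caps.append(int(token))
    if len(set(caps)) != len(caps):
        raise ConfigError("duplicate cap in --max-context-axis")
    too_small = [cap for cap in caps if cap < target_context]
    if too_small:
        raise ConfigError(f"caps below --target-context: {too_small}")
    return sorted(caps)
```

Rejecting bad input while parsing flags keeps the failure ahead of any provider start, in the same spirit as the serve-time preflight from PR #105.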
Closing this spike — the v1 SPEC-013 surface has landed via:
The prototype's v0.4 work items (held for reference)
If/when a v0.4 SPEC bump happens, these items are the natural next steps. Re-reference this spike if useful:
The audit-loop discipline that landed v1 is memorialized at beta/DECISION_CRITERIA.md Entry 78 for future SPEC-013 v0.4 work to follow. Closing as superseded.
What
First real autoresearch loop for MacProvider: a keep/revert hill-climb over the model dimension (size × quant) that discovers the optimal servable model for a given Mac.
Not a benchmark but a genuine optimization loop: propose config → load via `macprovider-cli serve --model X` → measure tok/s + TTFT at a target context → fit gate (errors / OOM / TTFT > 60s ⇒ infeasible) → keep best feasible → log every trial → declare winner.
First hill-climb (8GB M1 Air)
Winner on 8GB: Llama-3.2-1B-Instruct-4bit @ ctx=2000 → 9.7 tok/s. Fit gate rejected 3/6 candidates as expected.
Files
- `beta/autotune.py`: the loop CLI. Reuses `harness.fire_stream` + `sweep.build_padded_prompt` / `aggregate_cell` unchanged.
- `tune_trials` SQLite table (existing tables untouched).
- `beta/reports/autotune-<run_id>.html`.
Status
Spike / draft: designed to be machine-agnostic so the next run can target a roomier Mac (air5: Qwen-Coder-7B @ 50k context) where the real per-hardware recipe surfaces. Knob exposure (KV-bits / batch / max-context as `serve` flags in the Swift binary) is the natural follow-up to widen the search space beyond model choice.
🤖 Generated with Claude Code