Skip to content

feat(report): show failed sweep cells with provenance#11

Open
dutifulbob wants to merge 7 commits into
mainfrom
feat/report-failed-sweep-cells
Open

feat(report): show failed sweep cells with provenance#11
dutifulbob wants to merge 7 commits into
mainfrom
feat/report-failed-sweep-cells

Conversation

@dutifulbob

Copy link
Copy Markdown
Member

Opened on behalf of Onur Solmaz (osolmaz). Work is ready for review.

Summary

Benchmark sweeps can have valid partial failures, especially when high-concurrency or high-context points hit memory limits.
This change keeps mixed-success sweeps reportable instead of treating one failed point as a failed overall run.
It also makes failed and skipped cells visible in the main HTML grid, and lets a reviewer click a cell to see the exact model/profile/workload settings and commands behind that result.

What Changed

The runner now distinguishes partial workload failures from all-failed runs.
A sweep with at least one completed measurement can finish as completed while still preserving failed/skipped measurements in the artifact.

  • Changed final run error handling so partial sweep failures do not force a nonzero run error.
  • Changed stored run status so mixed completed/failed sweeps are marked completed, while all-failed sweeps remain failed.
  • Added failed/skipped labels such as mem floor, not ready, and server exit to headline report cells.
  • Added native HTML detail disclosures for report cells with run, model, profile, workload, shape, profile config, vLLM serve command, benchmark command, engine args, serve JSON, and env JSON.
  • Removed the command loading cap so larger sweeps keep command provenance available for every measurement.

Testing

I tested the changed runner behavior, the HTML rendering path, and the required local quality gates.
The dry benchmark path was also checked without starting a real model server.

  • go test ./internal/report ./internal/vllmbench
  • go test ./...
  • go vet ./...
  • npx -y @simpledoc/simpledoc check
  • go run github.com/dutifuldev/slophammer/go/cmd/slophammer-go@v0.4.0 check .
  • go run ./cmd/localperf bench run --dry-run --spec examples/diffusiongemma-vllm-standard/spec.json --profile 4k-reference --workload claim-repro-1k-out1024 --concurrency 1 --run-dir /tmp/localperf-onecase-dry
  • go run ./cmd/localperf artifact check /tmp/localperf-onecase-dry.sqlite
  • go run ./cmd/localperf artifact render /tmp/localperf-onecase-dry.sqlite --output /tmp/localperf-onecase-dry.html

Risks

The main behavior change is intentional: a run with partial failed points no longer exits as failed if at least one point completed.
All-failed runs and infrastructure errors still fail, so completely broken sweeps should not look successful.

The HTML report now includes more command/config provenance in hidden cell details.
That can make very large reports somewhat larger, but it is needed for clickable per-cell auditability.

@dutifulbob

Copy link
Copy Markdown
Member Author

Final report:

  • Implemented failed/skipped benchmark cells in the HTML report with native click-through details for status, reason, profile config, vLLM serve command, benchmark command, engine args, serve JSON, and env JSON.
  • Partial benchmark-point failures stay reportable; all-failed sweeps and fatal run errors still fail the run. Fatal partial runs persist as failed.
  • Aggregate repeat/model-level rows now say they are aggregates and no longer claim first-repeat run or command provenance.
  • Dry-run planned rows stay neutral instead of rendering as failure labels.
  • Secret-bearing command argv values are redacted before rendering in standalone HTML reports.
  • Review loop: latest codex review --base main completed with no findings.
  • Validation passed: go test ./...; go vet ./...; npx -y @simpledoc/simpledoc check; go run github.com/dutifuldev/slophammer/go/cmd/slophammer-go@v0.4.0 check .; one-case dry localperf bench run; localperf artifact check; localperf artifact render; dry-render grep for planned failure labels.
  • CI: green (go checks passed).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants