feat(report): show failed sweep cells with provenance#11
Open
dutifulbob wants to merge 7 commits into
Open
Conversation
Member
Author
|
Final report:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Opened on behalf of Onur Solmaz (
osolmaz). Work is ready for review.Summary
Benchmark sweeps can have valid partial failures, especially when high-concurrency or high-context points hit memory limits.
This change keeps mixed-success sweeps reportable instead of treating one failed point as a failed overall run.
It also makes failed and skipped cells visible in the main HTML grid, and lets a reviewer click a cell to see the exact model/profile/workload settings and commands behind that result.
What Changed
The runner now distinguishes partial workload failures from all-failed runs.
A sweep with at least one completed measurement can finish as completed while still preserving failed/skipped measurements in the artifact.
completed, while all-failed sweeps remainfailed.mem floor,not ready, andserver exitto headline report cells.Testing
I tested the changed runner behavior, the HTML rendering path, and the required local quality gates.
The dry benchmark path was also checked without starting a real model server.
go test ./internal/report ./internal/vllmbenchgo test ./...go vet ./...npx -y @simpledoc/simpledoc checkgo run github.com/dutifuldev/slophammer/go/cmd/slophammer-go@v0.4.0 check .go run ./cmd/localperf bench run --dry-run --spec examples/diffusiongemma-vllm-standard/spec.json --profile 4k-reference --workload claim-repro-1k-out1024 --concurrency 1 --run-dir /tmp/localperf-onecase-drygo run ./cmd/localperf artifact check /tmp/localperf-onecase-dry.sqlitego run ./cmd/localperf artifact render /tmp/localperf-onecase-dry.sqlite --output /tmp/localperf-onecase-dry.htmlRisks
The main behavior change is intentional: a run with partial failed points no longer exits as failed if at least one point completed.
All-failed runs and infrastructure errors still fail, so completely broken sweeps should not look successful.
The HTML report now includes more command/config provenance in hidden cell details.
That can make very large reports somewhat larger, but it is needed for clickable per-cell auditability.