Fix benchmark bugs across multiple benchmarks#475
Merged
simonrosenberg merged 8 commits into main on Mar 3, 2026
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
all-hands-bot
requested changes
Mar 3, 2026
Collaborator
all-hands-bot
left a comment
🔴 Critical regression found - This PR fixes real bugs but introduces a new crash. See inline comments for details.
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
- Replace try/except pd.isna() with an explicit isinstance guard — dicts and lists are already handled by prior elif branches, so pd.isna() is only called on scalar-like values
- Add a comment documenting that the heredoc approach embeds the config in the bash command; this is acceptable since the config contains NPC scenario data, not secrets (API keys are resolved via env vars)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
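The guard described in the commit message above can be sketched as follows. This is an illustrative reconstruction, not the actual `run_infer.py` code; the function name `to_jsonable` and its structure are assumptions.

```python
import pandas as pd

def to_jsonable(value):
    """Illustrative sketch: containers are handled before pd.isna() runs."""
    if isinstance(value, dict):
        # Dicts are recursed into by this branch, so pd.isna() never sees them
        return {k: to_jsonable(v) for k, v in value.items()}
    elif isinstance(value, list):
        # Same for lists, where pd.isna() would return an elementwise array
        return [to_jsonable(v) for v in value]
    elif pd.isna(value):
        # Only scalar-like values reach here, so pd.isna() returns a plain bool
        return None
    return value
```

Because the container branches run first, the try/except around `pd.isna()` becomes unnecessary: the call can no longer receive a list, whose elementwise result would be ambiguous in a boolean context.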
The generate_report() function was treating all non-error instances as resolved, leaving unresolved_ids always empty. It now uses the same logic as eval_infer.py: an instance is resolved only when final_score.result > 0 and result == total.

Added tests covering: mixed results, empty file, missing file, malformed JSON lines, and missing final_score.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
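A minimal sketch of the corrected resolution logic, assuming a JSONL output file with `instance_id`, optional `error`, and `final_score` fields (the field names and return shape are inferred from the commit message, not taken from the real code):

```python
import json
from pathlib import Path

def generate_report(output_file):
    """Classify instances; resolved only when result > 0 and result == total."""
    resolved_ids, unresolved_ids, error_ids = [], [], []
    path = Path(output_file)
    if not path.exists():
        return {"resolved_ids": [], "unresolved_ids": [], "error_ids": []}
    for line in path.read_text().splitlines():
        try:
            instance = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed JSON lines
        iid = instance.get("instance_id")
        score = instance.get("final_score")
        if instance.get("error") or not isinstance(score, dict):
            error_ids.append(iid)
        elif score.get("result", 0) > 0 and score.get("result") == score.get("total"):
            resolved_ids.append(iid)
        else:
            unresolved_ids.append(iid)
    return {
        "resolved_ids": resolved_ids,
        "unresolved_ids": unresolved_ids,
        "error_ids": error_ids,
    }
```

The key difference from the buggy version is the middle branch: a non-error instance only lands in `resolved_ids` when the score check passes, so `unresolved_ids` is populated for partial scores.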
all-hands-bot
approved these changes
Mar 3, 2026
Collaborator
all-hands-bot
left a comment
Taste Rating: 🟢 Good taste - Pragmatic fixes for real bugs with proper tests
The critical unresolved_ids bug from the previous review is fixed, and the new tests actually exercise real code paths (not just mocks). Re-raising errors instead of swallowing them is exactly right. The lazy imports are justified workarounds with clear comments explaining why they're needed.
Verdict: ✅ Ship it
Key Insight: These are textbook pragmatic bug fixes - simple solutions that solve real crashes without over-engineering.
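The lazy-import workaround the review refers to follows a standard pattern: move the import from module scope into the function that needs it, so import-time side effects (here, a git subprocess) fire only when the function actually runs. A self-contained sketch with a stand-in module (the real one is `openhands.agent_server.docker.build`; `heavy_build` and `build_agent_image` are illustrative names):

```python
import sys
import types

# Stand-in for a module whose top-level import has side effects
# (the real module shells out to git when imported).
_stub = types.ModuleType("heavy_build")
_stub.build = lambda options: f"built:{options}"
sys.modules["heavy_build"] = _stub

def build_agent_image(options):
    # Deferred import: the side effect happens only when this function
    # is called, not when the enclosing module is imported.
    import heavy_build
    return heavy_build.build(options)

print(build_agent_image("my-image"))  # built:my-image
```

This is why the benchmarks can now be imported when installed as a package outside a git repo: merely importing `run_infer.py` or `build_utils.py` no longer triggers the git check.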
The explicit default= and updated help/comment were unnecessary since parser.set_defaults(**EVAL_DEFAULTS) already covers timeout. Keep only the function signature defaults on run_swebench_evaluation(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
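The commit's reasoning — that `parser.set_defaults(**EVAL_DEFAULTS)` already supplies the `--timeout` default — can be checked with a minimal argparse sketch (the `EVAL_DEFAULTS` value below is made up for illustration):

```python
import argparse

EVAL_DEFAULTS = {"timeout": 1800}  # hypothetical value, not the real default

parser = argparse.ArgumentParser()
parser.add_argument("--timeout", type=int, help="evaluation timeout in seconds")
parser.set_defaults(**EVAL_DEFAULTS)

args = parser.parse_args([])
print(args.timeout)  # 1800 -- no default= needed on add_argument
```

`set_defaults` overrides the `None` default that `add_argument` would otherwise give the `timeout` dest, so repeating `default=` on the flag is redundant.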
The f-string and concatenation forms are functionally identical. Keep the original f-string. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
simonrosenberg
added a commit
that referenced
this pull request
Mar 3, 2026
Resolve conflicts after benchmark bug-fix PR #475 was merged to main:
- Take main's malformed JSON error handling in multiswebench/build_images.py
- Take main's cleaner pd.isna() guard (isinstance instead of try/except)
- Take main's heredoc security tradeoff comment
- Take main's fixed generate_report resolution logic (final_score check)
- Take main's updated generate_report docstring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Extracts the benchmark-only bug fixes from the NeMo Evaluator PR (#455) into a standalone PR. These are independent improvements to existing benchmarks that don't depend on or relate to NeMo integration.
Changes
Multi-SWE-Bench (`multiswebench/`)
- `build_images.py`: Use `.get("version", "")` instead of `instance["version"]` to avoid `KeyError` on instances missing the version field; call `.lower()` on Docker image names to avoid registry rejection of mixed-case tags; switch from `get_dataset` to `download_and_concat_dataset` to handle the non-standard HuggingFace dataset structure; pass `base_image_to_custom_tag_fn` to `build_all_images`
- `eval_infer.py`: Re-raise errors from the evaluation subprocess instead of silently swallowing them; wrap `args.input_file` in `Path()` to fix an `AttributeError` when it is a plain string

OpenAgentSafety (`openagentsafety/`)
- `build_images.py`: Extract a `get_image_name()` helper so the image name is configurable via the `EVAL_AGENT_SERVER_IMAGE` and `IMAGE_TAG_PREFIX` env vars instead of being hardcoded; defer `get_vendor_sdk_commit()` until after the existing-image check to avoid unnecessary git calls
- `run_infer.py`: Wrap `pd.isna()` calls in try/except to prevent crashes on non-scalar types (dicts, lists); add Pydantic `model_dump()` support to `NumpyEncoder`; simplify NPC config writing (a single heredoc command instead of tempfile + upload); fall back to the pre-built image when the on-the-fly build fails; use `get_image_name()` instead of hardcoded image strings; pass `max_retries` through to the evaluator; add `generate_report()` to produce `.report.json` output

SWE-Bench (`swebench/`)
- `eval_infer.py`: Add default values from `EVAL_DEFAULTS` to the `run_swebench_evaluation()` function parameters, enabling direct programmatic invocation without passing every argument; add an explicit default for the `--timeout` CLI flag

SWE-Bench Multimodal (`swebenchmultimodal/`)
- `eval_infer.py`: Document the `modal` parameter in the `run_swebench_multimodal_evaluation()` docstring

SWT-Bench (`swtbench/`)
- `run_infer.py`: Lazy-import `_base_slug` from `openhands.agent_server.docker.build` to avoid git subprocess checks that fail when the package is installed outside a git repo

GAIA (`gaia/`)
- `run_infer.py`: Pass `max_retries` through to the evaluator (it was silently ignored before)

Shared utilities (`utils/`)
- `build_utils.py`: Import `TargetType` from `benchmarks.swebench.constants` instead of `openhands.agent_server.docker.build`; lazy-import `BuildOptions` and `build` to avoid git subprocess failures when installed outside a git repo

Why these changes are necessary
These fixes address real bugs that cause crashes or incorrect behavior during benchmark evaluation:
- `pd.isna()` on dicts/lists crashes the OpenAgentSafety JSON encoder
- The missing `max_retries` pass-through meant the retry config was silently ignored

Test plan
🤖 Generated with Claude Code