Fix benchmark bugs across multiple benchmarks #475

Merged
simonrosenberg merged 8 commits into main from fix/benchmark-bug-fixes on Mar 3, 2026


Conversation

@simonrosenberg
Collaborator

Summary

Extracts the benchmark-only bug fixes from the NeMo Evaluator PR (#455) into a standalone PR. These are independent improvements to existing benchmarks that don't depend on or relate to NeMo integration.

Changes

Multi-SWE-Bench (multiswebench/)

  • build_images.py:
      - Use .get("version", "") instead of instance["version"] to avoid KeyError on instances missing the version field
      - Call .lower() on Docker image names to avoid registry rejection of mixed-case tags
      - Switch from get_dataset to download_and_concat_dataset to fix the non-standard HuggingFace dataset structure
      - Pass base_image_to_custom_tag_fn to build_all_images
  • eval_infer.py: Re-raise errors from evaluation subprocess instead of silently swallowing them; wrap args.input_file in Path() to fix AttributeError when it's a plain string
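
The defensive patterns above can be sketched roughly as follows (function names, the tag format, and the error message are illustrative, not the actual benchmark code):

```python
import subprocess
from pathlib import Path


def image_tag_for(instance: dict) -> str:
    # .get() avoids a KeyError when an instance has no "version" field
    version = instance.get("version", "")
    tag = f"mswebench/{instance['org']}_{instance['repo']}:{version or 'latest'}"
    # container registries reject mixed-case repository names, so normalize
    return tag.lower()


def run_eval_step(cmd: list[str], input_file) -> Path:
    # Path() accepts both str and Path, fixing the AttributeError hit when
    # a plain string was passed where Path attributes were used
    input_path = Path(input_file)
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # surface the failure instead of silently swallowing it
        raise RuntimeError(f"evaluation failed for {input_path}: {result.stderr}")
    return input_path
```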

OpenAgentSafety (openagentsafety/)

  • build_images.py: Extract get_image_name() helper so the image name is configurable via EVAL_AGENT_SERVER_IMAGE and IMAGE_TAG_PREFIX env vars instead of being hardcoded; defer get_vendor_sdk_commit() until after the existing-image check to avoid unnecessary git calls
  • run_infer.py:
      - Wrap pd.isna() calls in try/except to prevent crashes on non-scalar types (dicts, lists)
      - Add Pydantic model_dump() support in NumpyEncoder
      - Simplify NPC config writing (single heredoc command instead of tempfile + upload)
      - Add try/except fallback to a pre-built image when the on-the-fly build fails
      - Use get_image_name() instead of hardcoded image strings
      - Add max_retries pass-through to the evaluator
      - Add generate_report() to produce .report.json output
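
A minimal sketch of the env-var-driven image name and the encoder change (the default image string and tag format here are placeholders, not the real values):

```python
import json
import os


def get_image_name() -> str:
    # Image name is configurable via env vars instead of hardcoded;
    # the fallback base and tag scheme below are made up for illustration.
    base = os.environ.get("EVAL_AGENT_SERVER_IMAGE", "ghcr.io/example/agent-server")
    prefix = os.environ.get("IMAGE_TAG_PREFIX", "oas")
    return f"{base}:{prefix}-latest"


class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        # Pydantic v2 models expose model_dump(); serialize via that if present
        if hasattr(obj, "model_dump"):
            return obj.model_dump()
        return super().default(obj)
```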

SWE-Bench (swebench/)

  • eval_infer.py: Add default values from EVAL_DEFAULTS to run_swebench_evaluation() function parameters, enabling direct programmatic invocation without passing every argument; add explicit default for --timeout CLI flag
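
The shape of that change, sketched with made-up default values (the real EVAL_DEFAULTS lives in the benchmark's constants and covers more parameters):

```python
# Shared defaults dict; values here are illustrative only.
EVAL_DEFAULTS = {"timeout": 1800, "max_workers": 4}


def run_swebench_evaluation(
    input_file,
    timeout: int = EVAL_DEFAULTS["timeout"],
    max_workers: int = EVAL_DEFAULTS["max_workers"],
):
    # With signature defaults, callers can invoke this programmatically
    # without re-specifying every CLI argument.
    return {"input_file": input_file, "timeout": timeout, "max_workers": max_workers}
```

The CLI path still gets the same values via parser.set_defaults(**EVAL_DEFAULTS), which is why a separate explicit default on the --timeout flag was later dropped as redundant.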

SWE-Bench Multimodal (swebenchmultimodal/)

  • eval_infer.py: Document the modal parameter in run_swebench_multimodal_evaluation() docstring

SWT-Bench (swtbench/)

  • run_infer.py: Lazy-import _base_slug from openhands.agent_server.docker.build to avoid git subprocess checks that fail when installed as a package outside a git repo

GAIA (gaia/)

  • run_infer.py: Pass max_retries through to the evaluator (was silently ignored before)

Shared utilities (utils/)

  • build_utils.py: Import TargetType from benchmarks.swebench.constants instead of openhands.agent_server.docker.build; lazy-import BuildOptions and build to avoid git subprocess failures when installed outside a git repo

Why these changes are necessary

These fixes address real bugs that cause crashes or incorrect behavior during benchmark evaluation:

  1. Silent error swallowing in multi-swe-bench evaluation makes debugging impossible
  2. KeyError crashes on missing version fields prevent building images for certain instances
  3. Mixed-case Docker tags are rejected by container registries
  4. pd.isna() on dicts/lists crashes the OpenAgentSafety JSON encoder
  5. Hardcoded image names prevent using pre-built or custom images
  6. Top-level git-dependent imports break when benchmarks are installed as a pip package
  7. Missing max_retries pass-through means the retry config was silently ignored

Test plan

  • Verify pre-commit checks pass
  • Run SWE-bench eval with default args
  • Run Multi-SWE-bench image build
  • Run OpenAgentSafety inference

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg self-assigned this Mar 3, 2026
Collaborator

@all-hands-bot all-hands-bot left a comment


🔴 Critical regression found - This PR fixes real bugs but introduces a new crash. See inline comments for details.

simonrosenberg and others added 3 commits March 3, 2026 11:57
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
- Replace try/except pd.isna() with explicit isinstance guard — dicts
  and lists are already handled by prior elif branches, so pd.isna()
  is only called on scalar-like values
- Add comment documenting that the heredoc approach embeds config in
  the bash command, but this is acceptable since the config contains
  NPC scenario data, not secrets (API keys are resolved via env vars)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
all-hands-bot

This comment was marked as outdated.

The generate_report() function was treating all non-error instances as
resolved, leaving unresolved_ids always empty. Now uses the same logic
as eval_infer.py: an instance is resolved only when
final_score.result > 0 and result == total.

Added tests covering: mixed results, empty file, missing file,
malformed JSON lines, and missing final_score.
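
The corrected resolution logic can be sketched as follows (field names follow the commit message; the report structure and error handling are assumptions, not the actual output format):

```python
import json


def generate_report(jsonl_path: str) -> dict:
    # An instance is resolved only when final_score.result > 0 and
    # result == total; anything else lands in unresolved_ids, which the
    # previous version left permanently empty.
    resolved, unresolved, malformed = [], [], 0
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                malformed += 1  # tolerate malformed JSON lines
                continue
            score = rec.get("final_score") or {}
            result, total = score.get("result", 0), score.get("total", 0)
            if result > 0 and result == total:
                resolved.append(rec.get("instance_id"))
            else:
                unresolved.append(rec.get("instance_id"))
    return {"resolved_ids": resolved, "unresolved_ids": unresolved, "malformed": malformed}
```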

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator

@all-hands-bot all-hands-bot left a comment


Taste Rating: 🟢 Good taste - Pragmatic fixes for real bugs with proper tests

The critical unresolved_ids bug from the previous review is fixed, and the new tests actually exercise real code paths (not just mocks). Re-raising errors instead of swallowing them is exactly right. The lazy imports are justified workarounds with clear comments explaining why they're needed.

Verdict: ✅ Ship it

Key Insight: These are textbook pragmatic bug fixes - simple solutions that solve real crashes without over-engineering.

simonrosenberg and others added 3 commits March 3, 2026 12:33
The explicit default= and updated help/comment were unnecessary since
parser.set_defaults(**EVAL_DEFAULTS) already covers timeout. Keep only
the function signature defaults on run_swebench_evaluation().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The f-string and concatenation forms are functionally identical.
Keep the original f-string.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg merged commit f3e9a99 into main Mar 3, 2026
2 checks passed
@simonrosenberg simonrosenberg deleted the fix/benchmark-bug-fixes branch March 3, 2026 15:46
simonrosenberg added a commit that referenced this pull request Mar 3, 2026
Resolve conflicts after benchmark bug-fix PR #475 was merged to main:
- Take main's malformed JSON error handling in multiswebench/build_images.py
- Take main's cleaner pd.isna() guard (isinstance instead of try/except)
- Take main's heredoc security tradeoff comment
- Take main's fixed generate_report resolution logic (final_score check)
- Take main's updated generate_report docstring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>