Fix benchmark bugs across multiple benchmarks#475
Merged
simonrosenberg merged 8 commits into main on Mar 3, 2026
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
all-hands-bot
requested changes
Mar 3, 2026
Collaborator
all-hands-bot
left a comment
🔴 Critical regression found - This PR fixes real bugs but introduces a new crash. See inline comments for details.
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
- Replace try/except pd.isna() with an explicit isinstance guard — dicts and lists are already handled by prior elif branches, so pd.isna() is only called on scalar-like values
- Add a comment documenting that the heredoc approach embeds the config in the bash command; this is acceptable since the config contains NPC scenario data, not secrets (API keys are resolved via env vars)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
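The guard described in the commit message above can be sketched as follows. This is an illustrative reconstruction, not the actual `run_infer.py` code; the function name `to_jsonable` and its structure are assumptions.

```python
import pandas as pd

def to_jsonable(value):
    """Illustrative sketch: containers are handled before pd.isna() runs."""
    if isinstance(value, dict):
        # Dicts are recursed into by this branch, so pd.isna() never sees them
        return {k: to_jsonable(v) for k, v in value.items()}
    elif isinstance(value, list):
        # Same for lists, where pd.isna() would return an elementwise array
        return [to_jsonable(v) for v in value]
    elif pd.isna(value):
        # Only scalar-like values reach here, so pd.isna() returns a plain bool
        return None
    return value
```

Because the container branches run first, the try/except around `pd.isna()` becomes unnecessary: the call can no longer receive a list, whose elementwise result would be ambiguous in a boolean context.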
The generate_report() function was treating all non-error instances as resolved, leaving unresolved_ids always empty. It now uses the same logic as eval_infer.py: an instance is resolved only when final_score.result > 0 and result == total.

Added tests covering: mixed results, empty file, missing file, malformed JSON lines, and missing final_score.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
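A minimal sketch of the corrected resolution logic, assuming a JSONL output file with `instance_id`, optional `error`, and `final_score` fields (the field names and return shape are inferred from the commit message, not taken from the real code):

```python
import json
from pathlib import Path

def generate_report(output_file):
    """Classify instances; resolved only when result > 0 and result == total."""
    resolved_ids, unresolved_ids, error_ids = [], [], []
    path = Path(output_file)
    if not path.exists():
        return {"resolved_ids": [], "unresolved_ids": [], "error_ids": []}
    for line in path.read_text().splitlines():
        try:
            instance = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed JSON lines
        iid = instance.get("instance_id")
        score = instance.get("final_score")
        if instance.get("error") or not isinstance(score, dict):
            error_ids.append(iid)
        elif score.get("result", 0) > 0 and score.get("result") == score.get("total"):
            resolved_ids.append(iid)
        else:
            unresolved_ids.append(iid)
    return {
        "resolved_ids": resolved_ids,
        "unresolved_ids": unresolved_ids,
        "error_ids": error_ids,
    }
```

The key difference from the buggy version is the middle branch: a non-error instance only lands in `resolved_ids` when the score check passes, so `unresolved_ids` is populated for partial scores.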
all-hands-bot
approved these changes
Mar 3, 2026
Collaborator
all-hands-bot
left a comment
Taste Rating: 🟢 Good taste - Pragmatic fixes for real bugs with proper tests
The critical unresolved_ids bug from the previous review is fixed, and the new tests actually exercise real code paths (not just mocks). Re-raising errors instead of swallowing them is exactly right. The lazy imports are justified workarounds with clear comments explaining why they're needed.
Verdict: ✅ Ship it
Key Insight: These are textbook pragmatic bug fixes - simple solutions that solve real crashes without over-engineering.
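The lazy-import workaround the review refers to follows a standard pattern: move the import from module scope into the function that needs it, so import-time side effects (here, a git subprocess) fire only when the function actually runs. A self-contained sketch with a stand-in module (the real one is `openhands.agent_server.docker.build`; `heavy_build` and `build_agent_image` are illustrative names):

```python
import sys
import types

# Stand-in for a module whose top-level import has side effects
# (the real module shells out to git when imported).
_stub = types.ModuleType("heavy_build")
_stub.build = lambda options: f"built:{options}"
sys.modules["heavy_build"] = _stub

def build_agent_image(options):
    # Deferred import: the side effect happens only when this function
    # is called, not when the enclosing module is imported.
    import heavy_build
    return heavy_build.build(options)

print(build_agent_image("my-image"))  # built:my-image
```

This is why the benchmarks can now be imported when installed as a package outside a git repo: merely importing `run_infer.py` or `build_utils.py` no longer triggers the git check.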
The explicit default= and updated help/comment were unnecessary since parser.set_defaults(**EVAL_DEFAULTS) already covers timeout. Keep only the function signature defaults on run_swebench_evaluation(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
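The commit's reasoning — that `parser.set_defaults(**EVAL_DEFAULTS)` already supplies the `--timeout` default — can be checked with a minimal argparse sketch (the `EVAL_DEFAULTS` value below is made up for illustration):

```python
import argparse

EVAL_DEFAULTS = {"timeout": 1800}  # hypothetical value, not the real default

parser = argparse.ArgumentParser()
parser.add_argument("--timeout", type=int, help="evaluation timeout in seconds")
parser.set_defaults(**EVAL_DEFAULTS)

args = parser.parse_args([])
print(args.timeout)  # 1800 -- no default= needed on add_argument
```

`set_defaults` overrides the `None` default that `add_argument` would otherwise give the `timeout` dest, so repeating `default=` on the flag is redundant.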
The f-string and concatenation forms are functionally identical. Keep the original f-string. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
simonrosenberg
added a commit
that referenced
this pull request
Mar 3, 2026
Resolve conflicts after benchmark bug-fix PR #475 was merged to main:
- Take main's malformed JSON error handling in multiswebench/build_images.py
- Take main's cleaner pd.isna() guard (isinstance instead of try/except)
- Take main's heredoc security tradeoff comment
- Take main's fixed generate_report resolution logic (final_score check)
- Take main's updated generate_report docstring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Extracts the benchmark-only bug fixes from the NeMo Evaluator PR (#455) into a standalone PR. These are independent improvements to existing benchmarks that don't depend on or relate to NeMo integration.
Changes
Multi-SWE-Bench (`multiswebench/`)
- `build_images.py`: Use `.get("version", "")` instead of `instance["version"]` to avoid `KeyError` on instances missing the version field; call `.lower()` on Docker image names to avoid registry rejection of mixed-case tags; switch from `get_dataset` to `download_and_concat_dataset` to handle the non-standard HuggingFace dataset structure; pass `base_image_to_custom_tag_fn` to `build_all_images`
- `eval_infer.py`: Re-raise errors from the evaluation subprocess instead of silently swallowing them; wrap `args.input_file` in `Path()` to fix an `AttributeError` when it is a plain string

OpenAgentSafety (`openagentsafety/`)
- `build_images.py`: Extract a `get_image_name()` helper so the image name is configurable via the `EVAL_AGENT_SERVER_IMAGE` and `IMAGE_TAG_PREFIX` env vars instead of being hardcoded; defer `get_vendor_sdk_commit()` until after the existing-image check to avoid unnecessary git calls
- `run_infer.py`: Wrap `pd.isna()` calls in try/except to prevent crashes on non-scalar types (dicts, lists); add Pydantic `model_dump()` support to `NumpyEncoder`; simplify NPC config writing (a single heredoc command instead of tempfile + upload); fall back to the pre-built image when the on-the-fly build fails; use `get_image_name()` instead of hardcoded image strings; pass `max_retries` through to the evaluator; add `generate_report()` to produce `.report.json` output

SWE-Bench (`swebench/`)
- `eval_infer.py`: Add default values from `EVAL_DEFAULTS` to the `run_swebench_evaluation()` function parameters, enabling direct programmatic invocation without passing every argument; add an explicit default for the `--timeout` CLI flag

SWE-Bench Multimodal (`swebenchmultimodal/`)
- `eval_infer.py`: Document the `modal` parameter in the `run_swebench_multimodal_evaluation()` docstring

SWT-Bench (`swtbench/`)
- `run_infer.py`: Lazy-import `_base_slug` from `openhands.agent_server.docker.build` to avoid git subprocess checks that fail when the package is installed outside a git repo

GAIA (`gaia/`)
- `run_infer.py`: Pass `max_retries` through to the evaluator (it was silently ignored before)

Shared utilities (`utils/`)
- `build_utils.py`: Import `TargetType` from `benchmarks.swebench.constants` instead of `openhands.agent_server.docker.build`; lazy-import `BuildOptions` and `build` to avoid git subprocess failures when installed outside a git repo

Why these changes are necessary
These fixes address real bugs that cause crashes or incorrect behavior during benchmark evaluation:
- `pd.isna()` on dicts/lists crashes the OpenAgentSafety JSON encoder
- The missing `max_retries` pass-through meant the retry config was silently ignored

Test plan
🤖 Generated with Claude Code