Conversation
Force-pushed c63273d to 8bfdd29 (compare)
The PR had a broken run_timeout parameter (immediately overwritten by env var). Revert to main's clean implementation and drop the dead argument from all 6 benchmark call sites.
PR #389 intentionally removed the eval_ prefix from run_ids in swtbench and swebenchmultimodal. The NeMo PR had re-added it. Revert to the current convention (bare stem, no prefix).
Instead of patching every eval script to add a "benchmark" field to .report.json, write a nemo_metadata.json from run_benchmark.py and read it from output.py. This keeps the NeMo concern in NeMo-specific code and avoids touching 7 existing benchmark eval scripts.
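A minimal sketch of that split, assuming a single-key `{"benchmark": str}` schema; the helper names here are illustrative, not the PR's actual functions:

```python
import json
import pathlib

# run_benchmark.py side (sketch): record which benchmark produced this
# output directory, so the NeMo-specific reader can recover it later.
def write_nemo_metadata(output_dir: str, benchmark: str) -> None:
    path = pathlib.Path(output_dir) / "nemo_metadata.json"
    path.write_text(json.dumps({"benchmark": benchmark}), encoding="utf-8")

# output.py side (sketch): read the benchmark name back instead of
# requiring every eval script to embed it in .report.json.
def read_benchmark_name(output_dir: str) -> str:
    path = pathlib.Path(output_dir) / "nemo_metadata.json"
    metadata = json.loads(path.read_text(encoding="utf-8"))
    return metadata["benchmark"]
```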
These choices are only used for argparse validation, never shown to users. Use absolute paths unconditionally instead of a try/except fallback for pip-installed packages.
Main branch reads the timeout from the CONVERSATION_TIMEOUT env var in fake_user_response.py. The CLI arg added by this PR was dead code since we reverted fake_user_response.py to main. Remove the full chain: args_parser, EvalMetadata field, all benchmark main() assignments, run_benchmark.py passthrough, and framework.yml config.
This is not a bug fix and not needed for NeMo. Main already skips failed samples by default. The feature added fail-fast behavior that would be harmful for long NeMo runs, and framework.yml had it configured to stop on first failure. Removed: SampleFailedError class, CLI arg, EvalMetadata field, all benchmark main() assignments, run_benchmark.py passthrough, and framework.yml config.
The uv check-and-fallback pattern is unnecessary: when running inside the venv (whether via uv or pip), sys.executable already points to the correct Python interpreter with all dependencies available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
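The point is easy to demonstrate with a minimal sketch: invoking the current interpreter via `sys.executable` behaves the same under a uv-created or pip-created venv, with no fallback logic needed.

```python
import subprocess
import sys

# sys.executable points at the interpreter running this script, i.e. the
# active venv's Python, so no "uv run python" wrapper is needed.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.version_info.major)"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # → "3" under any Python 3 venv
```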
Revert unnecessary restructuring (moved try/except blocks, removed default values, reformatted signatures) and keep only the essential changes: replace uv invocations with sys.executable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The --modal/--no-modal flag for swebench multimodal eval is being added in a separate PR (#452). Remove the modal plumbing from this PR to avoid depending on unreleased changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix import sorting (ruff I001) across all changed files - Remove unused imports (ruff F401): LLM in 6 run_infer.py files - Remove unused variable build_error (ruff F841) in swtbench - Fix undefined sdk_short_sha reference in swebench/run_infer.py - Fix formatting (ruff format) in 6 files - Fix trailing blank line (pycodestyle W391) in __init__.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pick up the split parameter change from main (default removed from function signature, args.split passed explicitly at call site). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Import TargetType from local benchmarks.swebench.constants instead of the removed openhands.agent_server.docker.build import. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These scripts are only used by the NeMo integration. Move them to nemo_evaluator/openhands_benchmarks/ to keep NeMo-specific code together. They can be moved to benchmarks/scripts/ later if they prove generally useful. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move nemo_evaluator from required dependencies to [project.optional-dependencies] so it can be installed via pip install openhands-benchmarks[nemo]. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"pkg*" already matches "pkg" via fnmatch glob. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve api_key_env and strip /chat/completions from base_url at config generation time rather than load time. This keeps the shared load_llm_config() utility free of NeMo-specific concerns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 8bfdd29 to 0ff388e (compare)
Adopt centralized Docker image auto-detection helpers (create_docker_workspace, ensure_local_image, local_image_exists) from PR #456 while keeping IMAGE_TAG_PREFIX naming from this branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The DelegateTool still exists in the SDK and get_default_tools does not include it. Removing the opt-in delegation code silently breaks the --enable-delegation CLI flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The FinishAction isinstance check is the correct way to extract the agent's final answer. The replacement using string matching on class names and wrong attributes was strictly worse. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The precise startswith("bytedance-research/Multi-SWE-Bench") check
from PR #454 is already correct. The loose substring match is
unnecessary and could false-positive.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove try/except "unknown" fallback from get_sdk_sha() — if the
submodule is missing, fail immediately instead of producing a useless
image tag that fails later at Docker pull time
- Migrate swefficiency/run_infer.py to IMAGE_TAG_PREFIX, removing the
scattered os.getenv("SDK_SHORT_SHA") override pattern
- Migrate modal_patches.py to IMAGE_TAG_PREFIX for consistency
- Document the purpose of IMAGE_TAG_PREFIX in version.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed ca0a242 to b0620fa (compare)
**all-hands-bot** left a comment
Testing Gap: The new nemo_evaluator package has no test coverage.
```diff
@@ -0,0 +1,98 @@
+from __future__ import annotations
```
🟠 Important: No tests for new integration package.
You've added 3 new modules (generate_llm_config.py, run_benchmark.py, output.py) totaling ~400 lines with zero test coverage. This is an integration shim to NeMo - if it breaks, debugging will be painful.
Minimum viable tests:
- `test_generate_config()`: verify JSON output format, env var resolution, URL stripping
- `test_build_infer_cmd()`: verify command building for each benchmark type
- `test_parse_output()`: verify report.json parsing and accuracy calculation
"Testing and practice sometimes clash. Testing wins. Every single time." - If this breaks in production, you'll wish you had tests to reproduce the issue locally.
Add tests in nemo_evaluator/tests/ before merging.
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Replace if-chain for benchmark-specific CLI flags with a declarative dict mapping benchmark names to parameter extraction functions. New benchmarks can now add their parameters without modifying _build_infer_cmd. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend the data-driven approach to _build_eval_cmd and env var setup: - BENCHMARK_EVAL_PARAMS: declares per-benchmark eval CLI flags - BENCHMARK_ENV_VARS: declares per-benchmark env vars for inference All benchmark-specific logic is now in declarative dicts at the top of the file. The functions _build_infer_cmd, _build_eval_cmd, and main() are fully generic — adding a new benchmark only requires updating the data tables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
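A condensed sketch of the data-driven dispatch: per-benchmark CLI flags live in a table and the command builder stays generic. The `--level` and `--repo-split` flags come from the PR; the base command and namespace fields are placeholders.

```python
import argparse
from typing import Callable

# Declarative table: benchmark name -> function extracting its CLI params.
BENCHMARK_INFER_PARAMS: dict[str, Callable[[argparse.Namespace], list[tuple[str, str]]]] = {
    "gaia": lambda args: [("--level", args.level)] if args.level else [],
    "commit0": lambda args: [("--repo-split", args.repo_split)] if args.repo_split else [],
}

def build_infer_cmd(benchmark: str, args: argparse.Namespace) -> list[str]:
    cmd = ["run-infer", benchmark]  # placeholder base command
    for flag, value in BENCHMARK_INFER_PARAMS.get(benchmark, lambda _: [])(args):
        cmd.extend([flag, value])
    return cmd

ns = argparse.Namespace(level="1", repo_split=None)
print(build_infer_cmd("gaia", ns))  # → ['run-infer', 'gaia', '--level', '1']
```

Adding a benchmark means adding one dict entry; `build_infer_cmd` itself never changes.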
**Nemo Evaluator Integration Validation Results**

Triggered CI benchmarks. Results: 5/6 benchmarks passed ✅
Configuration
**Note on multiswebench failure**
…tting
- Remove duplicate try block and duplicate lines in prepare_workspace()
- Fix ruff formatting for type hints in run_benchmark.py
- Update uv.lock
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
All benchmark runners previously hardcoded SDK_SHORT_SHA for Docker
image tags, with each file duplicating an
os.getenv("SDK_SHORT_SHA", SDK_SHORT_SHA) override pattern. This
centralizes the tag prefix into a single constant.
- New IMAGE_TAG_PREFIX = os.getenv("IMAGE_TAG_PREFIX", SDK_SHORT_SHA)
in version.py
- All runners import and use IMAGE_TAG_PREFIX directly with no per-file
env var fallback
- EVAL_AGENT_SERVER_IMAGE in constants.py is now overridable via
OPENHANDS_EVAL_AGENT_SERVER_IMAGE env var
- modal_patches.py updated to propagate IMAGE_TAG_PREFIX (replaces
SDK_SHA/SDK_SHORT_SHA pair)
Extracted from #455.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
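The resulting constant reduces to a one-liner; sketched here with a stand-in SHA value:

```python
import os

# version.py (sketch): single source of truth for the Docker image tag prefix.
SDK_SHORT_SHA = "abc1234"  # stand-in; derived from the vendored SDK in reality
IMAGE_TAG_PREFIX = os.getenv("IMAGE_TAG_PREFIX", SDK_SHORT_SHA)

# A runner then builds its tag directly, with no per-file env var fallback:
image = f"example-registry/agent-server:{IMAGE_TAG_PREFIX}"
print(image)
```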
Notice that for swebench, swtbench and commit0 the single instance was resolved.
- Add back ToolPresetType import - Restore get_tools_for_preset() function supporting default/gemini/planning presets - Restore tool_preset parameter in EvalMetadata - Restore tool preset logging Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
On the second pass the evaluations were run with branch main, not branch nemo-evaluator.
Update test_benchmark_metrics_collection to use get_tools_for_preset mock target for swebench benchmark, consistent with the restored function in run_infer.py. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
damn it
@OpenHands see software-agent-sdk, run 5 eval job actions there, one for each of these benchmarks: commit0, gaia, swebench, swebenchmultimodal, gaia. For each of those run a single instance; in all use trigger reason "testing benchmarks branch nemo-evaluator", and make sure to set the benchmark branch param to nemo-evaluator so that we can test that PR. Run it; the software agent sdk action will run an evaluation job. Once you have links to the 5 evaluation jobs, post a comment here with the links.
I'm on it! juanmichelini can track my progress at all-hands.dev
**Evaluation Jobs Triggered**

Successfully triggered 5 evaluation jobs testing the `nemo-evaluator` branch.
All jobs are configured with:
**Final Summary**

I successfully completed your request to trigger evaluation jobs for testing the `nemo-evaluator` branch.

**Actions Completed:**

✅ Triggered 5 evaluation job actions in the OpenHands/software-agent-sdk repository using the "Run Eval" workflow:
Each job was configured with:
✅ Posted a comment on PR #455 with links to all 5 evaluation jobs:
Comment URL: #455 (comment)

All evaluation jobs are now running and will test the NeMo Evaluator integration changes on the `nemo-evaluator` branch.
* Centralize IMAGE_TAG_PREFIX for Docker image tags
All benchmark runners previously hardcoded SDK_SHORT_SHA for Docker
image tags, with each file duplicating an
os.getenv("SDK_SHORT_SHA", SDK_SHORT_SHA) override pattern. This
centralizes the tag prefix into a single constant.
- New IMAGE_TAG_PREFIX = os.getenv("IMAGE_TAG_PREFIX", SDK_SHORT_SHA)
in version.py
- All runners import and use IMAGE_TAG_PREFIX directly with no per-file
env var fallback
- EVAL_AGENT_SERVER_IMAGE in constants.py is now overridable via
OPENHANDS_EVAL_AGENT_SERVER_IMAGE env var
- modal_patches.py updated to propagate IMAGE_TAG_PREFIX (replaces
SDK_SHA/SDK_SHORT_SHA pair)
Extracted from #455.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add backward compatibility for SDK_SHORT_SHA env var
Address review feedback: the env var rename from SDK_SHORT_SHA to
IMAGE_TAG_PREFIX is a breaking change for existing users.
- version.py: fall back to SDK_SHORT_SHA env var when IMAGE_TAG_PREFIX
is not set, emitting a DeprecationWarning
- modal_patches.py: continue propagating SDK_SHORT_SHA to modal
execution environments alongside IMAGE_TAG_PREFIX
- modal_patches.py: _get_image_tag_prefix() fallback also checks
SDK_SHORT_SHA when IMAGE_TAG_PREFIX is unset
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Revert "Add backward compatibility for SDK_SHORT_SHA env var"
This reverts commit 1d3f778.
* Add runtime deprecation warning for SDK_SHORT_SHA env var
- version.py: fall back to SDK_SHORT_SHA env var when IMAGE_TAG_PREFIX
is not set, emitting a DeprecationWarning so users know to migrate
- modal_patches.py: _get_image_tag_prefix() fallback also checks
SDK_SHORT_SHA when IMAGE_TAG_PREFIX is unset
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Revert "Add runtime deprecation warning for SDK_SHORT_SHA env var"
This reverts commit 80b38b2.
* Update benchmarks/utils/version.py
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
* Fix inconsistent log wording and add IMAGE_TAG_PREFIX tests
- swebench/run_infer.py: change "image tag prefix:" to "tag prefix:"
for consistency with all other benchmark runners
- tests/test_version.py: add unit tests for IMAGE_TAG_PREFIX resolution
covering default, env var override, deprecated SDK_SHORT_SHA fallback,
and precedence when both env vars are set
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Update uv.lock for SDK submodule v1.11.5
Regenerate lock file to match the current vendor/software-agent-sdk
submodule version (1.10.0 → 1.11.5).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix ruff format in version.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Update benchmarks/utils/modal_patches.py
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
* Fix indentation in modal_patches.py for IMAGE_TAG_PREFIX env vars
Three lines were accidentally left-aligned at column 0 instead of being
indented inside the _inject_modal_sitecustomize() function body.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Resolve conflicts after sub-PRs #471, #472, #473 were merged to main: - Take main's SDK_SHORT_SHA deprecation handling in version.py - Take main's backward-compat SDK_SHORT_SHA in modal_patches.py - Take main's log message wording in swebench/run_infer.py - Remove redundant prompt_dir setup (superseded by add_prompt_path_argument) - Keep lazy imports for git-dependent modules (build_utils.py, swtbench/run_infer.py) - Take main's comment cleanup in swtbench/eval_infer.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts after benchmark bug-fix PR #475 was merged to main: - Take main's malformed JSON error handling in multiswebench/build_images.py - Take main's cleaner pd.isna() guard (isinstance instead of try/except) - Take main's heredoc security tradeoff comment - Take main's fixed generate_report resolution logic (final_score check) - Take main's updated generate_report docstring Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
set_defaults(**EVAL_DEFAULTS) already handles the timeout default. The merge auto-resolved this the wrong way; restore main's version. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
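For context, argparse parser-level defaults set via `set_defaults()` always win over (and make redundant) a `default=` on the argument itself, which is why the duplicated default could simply be dropped. A minimal sketch with an illustrative value:

```python
import argparse

# set_defaults(**EVAL_DEFAULTS) supplies the timeout, so a per-argument
# default= would be redundant. 3600 is an illustrative value.
EVAL_DEFAULTS = {"run_timeout": 3600}

parser = argparse.ArgumentParser()
parser.add_argument("--run-timeout", type=int)  # no per-argument default
parser.set_defaults(**EVAL_DEFAULTS)

args = parser.parse_args([])
print(args.run_timeout)  # → 3600
```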
**all-hands-bot** left a comment
Taste Rating: 🟡 Acceptable - NeMo integration works but reveals abstraction issues.
Key Insight: Dict-of-lambdas returning tuple-lists with empty-string sentinel values is a code smell. This pattern works but signals the abstraction is leaking - as you add more benchmarks, maintenance cost grows non-linearly.
Critical Gap: Still no tests for 400+ lines of integration code (unresolved from previous review). This is NeMo glue - when it breaks, debugging will be painful without tests.
```diff
   pass_filenames: true
   always_run: false
-  exclude: ^legacy/
+  exclude: ^(legacy|nemo_evaluator)/
```
🔴 Critical: Excluding nemo_evaluator from pre-commit bypasses type checking and linting.
You've exempted this code from pyright and ruff. This means type errors and code quality issues will slip through. NeMo has its own conventions, but type safety and basic linting shouldn't be negotiable.
If NeMo's conventions conflict with your project's, document the specific conflicts and exclude only those rules. Blanket exclusion is how technical debt accumulates silently.
```python
BENCHMARK_INFER_PARAMS: dict[
    str, Callable[[argparse.Namespace], list[tuple[str, str]]]
] = {
    "gaia": lambda args: [("--level", args.level)] if args.level else [],
    "commit0": lambda args: [("--repo-split", args.repo_split)]
    if args.repo_split
    else [],
    "multiswebench": lambda args: [("--lang", args.language)] if args.language else [],
}
```
🟠 Important: Dict-of-lambdas is overengineered for simple conditional args.
This pattern:
```python
BENCHMARK_INFER_PARAMS: dict[str, Callable[[...], list[tuple[str, str]]]] = {
    "gaia": lambda args: [("--level", args.level)] if args.level else [],
}
```

Could be:
```python
def get_benchmark_args(benchmark: str, args: Namespace) -> list[str]:
    """Returns CLI args for benchmark-specific parameters."""
    if benchmark == "gaia" and args.level:
        return ["--level", args.level]
    elif benchmark == "commit0" and args.repo_split:
        return ["--repo-split", args.repo_split]
    # ...
    return []
```

Simpler data structures → simpler code. The dict-of-lambdas doesn't buy you anything except indirection.
```python
for flag, value in BENCHMARK_EVAL_PARAMS[benchmark](args, output_jsonl):
    if value:
        cmd.extend([flag, value])
    else:
        cmd.append(flag)
```
🟡 Suggestion: Empty string as sentinel for bare flags is a code smell.
Using ("--modal", "") to mean "bare flag" requires this special-case check. Better approach: separate the flag building from the tuple structure, or use a simple boolean/None pattern.
Example:
```python
for item in params:
    if isinstance(item, tuple):  # flag with value
        cmd.extend(item)
    else:  # bare flag
        cmd.append(item)
```

Or just handle it where the data is created instead of using empty strings.
```python
    )
    sys.exit(1)

output_jsonl = output_files[-1]  # Use the latest one
```
🟠 Important: sorted()[-1] assumes lexicographic ordering gives "latest" - fragile.
If output_dir contains multiple runs with timestamps or run IDs in different formats, sorted() might not pick the latest. You're relying on filesystem/naming conventions that aren't enforced.
Better: Either error if len(output_files) != 1 (strict), or use file modification time:
output_jsonl = max(output_files, key=lambda p: p.stat().st_mtime)Theory: "we'll only have one output.jsonl per run"
Practice: Multi-attempt runs, retries, or manual re-runs will break this assumption.
```python
output_path = pathlib.Path(output_dir)

# Find any .report.json file (all benchmarks use this naming convention)
report_files = sorted(output_path.rglob("*.report.json"))
```
🟢 Good taste: Explicit check for multiple report files prevents ambiguity.
Unlike run_benchmark.py:247 which silently picks the last file, this correctly errors on ambiguous input (line 19-23). This is the right behavior for validation logic - fail fast and explicit rather than guessing.
```python
metadata_file = output_path / "nemo_metadata.json"
if not metadata_file.exists():
    raise FileNotFoundError(
        f"nemo_metadata.json not found in {output_dir}. "
        "Make sure the benchmark was run via run_benchmark.py."
    )
metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
task_name = metadata["benchmark"]
```
🟡 Suggestion: Tight coupling via metadata file is pragmatic but document the contract.
output.py requires run_benchmark.py to write nemo_metadata.json. This works, but the coupling is implicit - nothing enforces that run_benchmark.py actually writes it, or what schema it uses.
Add a comment or type definition for the metadata schema:
```python
# Expected schema: {"benchmark": str}
metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
```

Or use a simple dataclass/TypedDict to document the contract.
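One lightweight way to follow that suggestion, assuming the single-key schema the review quotes:

```python
import json
from typing import TypedDict

class NemoMetadata(TypedDict):
    """Contract between run_benchmark.py (writer) and output.py (reader)."""
    benchmark: str

raw = '{"benchmark": "swebench"}'  # illustrative payload
metadata: NemoMetadata = json.loads(raw)
print(metadata["benchmark"])  # → swebench
```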
Summary
Adds NVIDIA NeMo Evaluator integration so OpenHands benchmarks can be run via NeMo's evaluation framework. The new `nemo_evaluator` package is an optional dependency (`pip install .[nemo]`) that bridges NeMo to the existing benchmark CLI entrypoints.

New:
- `nemo_evaluator/openhands_benchmarks/` package
- `framework.yml`
- `run_benchmark.py`
- `generate_llm_config.py` (strips the `/chat/completions` suffix for LiteLLM)
- `output.py`: parses `*.report.json` output into NeMo's `EvaluationResult` format (accuracy = resolved/submitted)
New console script entrypoints: `generate-llm-config`, `run-benchmark`.

`pyproject.toml`:
- `nemo = ["nemo_evaluator>=0.1.92,<0.2"]` optional dependency group
- console scripts `generate-llm-config`, `run-benchmark`
- packages: `["benchmarks*", "nemo_evaluator*"]`
- package data: `*.yml` for nemo_evaluator; `*.j2`, `Dockerfile*`, `*.json` for benchmarks
- removed `exclude = ["legacy"]` from ruff config (moved to `.pre-commit-config.yaml` exclude)

`.pre-commit-config.yaml`: added `nemo_evaluator/` to the exclude pattern (NeMo-specific code uses its own conventions)

Previous PRs
Continues from #412 (closed). Refactors extracted into separate PRs:
- `image_exists` → `remote_image_exists` rename (Rename image_exists to remote_image_exists #471)
- `uv run python` → `sys.executable` (Replace `uv run python` with `sys.executable` in eval scripts #472)
- `IMAGE_TAG_PREFIX` (Centralize IMAGE_TAG_PREFIX for Docker image tags #473)
- `load_llm_config` shared utility (Extract shared load_llm_config utility #453)