
NeMo Evaluator Integration #455

Open

simonrosenberg wants to merge 42 commits into main from nemo-evaluator

Conversation


@simonrosenberg simonrosenberg commented Feb 26, 2026

Summary

Adds NVIDIA NeMo Evaluator integration so OpenHands benchmarks can be run via NeMo's evaluation framework. The new nemo_evaluator package is an optional dependency (pip install .[nemo]) that bridges NeMo to the existing benchmark CLI entrypoints.

Note: Several cross-cutting refactors that were originally part of this PR have been extracted and already merged separately.

New: nemo_evaluator/openhands_benchmarks/ package

  • framework.yml: NeMo framework definition with 16 named evaluation configs (SWE-bench variants, GAIA, Commit0, Multi-SWE-bench per language, SWT-bench, SWE-bench Multimodal, OpenAgentSafety)
  • run_benchmark.py: Orchestrator that maps NeMo parameters to the existing benchmark CLI entrypoints, running inference and then evaluation
  • generate_llm_config.py: Generates the LLM config JSON from CLI args (resolves API key env vars, strips the /chat/completions suffix for LiteLLM)
  • output.py: Parses *.report.json output into NeMo's EvaluationResult format (accuracy = resolved/submitted)
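The accuracy computation that output.py performs on a *.report.json file can be sketched roughly as follows. The report field names used here are assumptions for illustration, not the actual schema:

```python
import json
from pathlib import Path


def parse_report(report_path: Path) -> float:
    """Compute accuracy as resolved/submitted from a *.report.json file.

    The field names ("submitted_instances", "resolved_instances") are
    illustrative placeholders; the real report schema may differ.
    """
    report = json.loads(report_path.read_text(encoding="utf-8"))
    submitted = report["submitted_instances"]
    resolved = report["resolved_instances"]
    return resolved / submitted if submitted else 0.0
```

NeMo then wraps a ratio like this in its EvaluationResult structure.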

New console script entrypoints: generate-llm-config, run-benchmark.

pyproject.toml

  • Optional dependency: nemo = ["nemo_evaluator>=0.1.92,<0.2"]
  • New entrypoints: generate-llm-config, run-benchmark
  • Package discovery: ["benchmarks*", "nemo_evaluator*"]
  • Package data: *.yml for nemo_evaluator, *.j2/Dockerfile*/*.json for benchmarks
  • Removed exclude = ["legacy"] from ruff config (moved to .pre-commit-config.yaml exclude)

.pre-commit-config.yaml

  • Added nemo_evaluator/ to the exclude pattern (NeMo-specific code uses its own conventions)

Previous PRs

Continues from #412 (closed). The refactors were extracted into separate PRs.


e-dobrowolska and others added 18 commits February 26, 2026 10:00
The PR had a broken run_timeout parameter (immediately overwritten by env
var). Revert to main's clean implementation and drop the dead argument
from all 6 benchmark call sites.
PR #389 intentionally removed the eval_ prefix from run_ids in swtbench
and swebenchmultimodal. The NeMo PR had re-added it. Revert to the
current convention (bare stem, no prefix).
Instead of patching every eval script to add a "benchmark" field to
.report.json, write a nemo_metadata.json from run_benchmark.py and
read it from output.py. This keeps the NeMo concern in NeMo-specific
code and avoids touching 7 existing benchmark eval scripts.
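A minimal sketch of this write-then-read handshake, with hypothetical helper names (the real logic lives in run_benchmark.py and output.py):

```python
import json
from pathlib import Path


def write_nemo_metadata(output_dir: Path, benchmark: str) -> None:
    """run_benchmark.py side: record which benchmark produced this directory."""
    (output_dir / "nemo_metadata.json").write_text(
        json.dumps({"benchmark": benchmark}), encoding="utf-8"
    )


def read_benchmark_name(output_dir: Path) -> str:
    """output.py side: recover the benchmark name without touching the
    existing eval scripts or their .report.json output."""
    metadata_file = output_dir / "nemo_metadata.json"
    if not metadata_file.exists():
        raise FileNotFoundError(f"nemo_metadata.json not found in {output_dir}")
    return json.loads(metadata_file.read_text(encoding="utf-8"))["benchmark"]
```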
These choices are only used for argparse validation, never shown to
users. Use absolute paths unconditionally instead of a try/except
fallback for pip-installed packages.
Main branch reads the timeout from the CONVERSATION_TIMEOUT env var
in fake_user_response.py. The CLI arg added by this PR was dead code
since we reverted fake_user_response.py to main. Remove the full chain:
args_parser, EvalMetadata field, all benchmark main() assignments,
run_benchmark.py passthrough, and framework.yml config.
This is not a bug fix and not needed for NeMo. Main already skips
failed samples by default. The feature added fail-fast behavior that
would be harmful for long NeMo runs, and framework.yml had it
configured to stop on first failure.

Removed: SampleFailedError class, CLI arg, EvalMetadata field,
all benchmark main() assignments, run_benchmark.py passthrough,
and framework.yml config.
The uv check-and-fallback pattern is unnecessary: when running inside
the venv (whether via uv or pip), sys.executable already points to the
correct Python interpreter with all dependencies available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
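The simplification can be illustrated as follows; the module name is a placeholder, not a real entrypoint:

```python
import subprocess
import sys

# Inside the venv, sys.executable already points at the right interpreter,
# whether the environment was created by uv or by pip, so no uv probe or
# fallback is needed. "some_benchmark_module" is a placeholder name.
cmd = [sys.executable, "-m", "some_benchmark_module", "--eval-limit", "1"]
# subprocess.run(cmd, check=True)  # would execute with the venv's own Python
```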
Revert unnecessary restructuring (moved try/except blocks, removed
default values, reformatted signatures) and keep only the essential
changes: replace uv invocations with sys.executable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The --modal/--no-modal flag for swebench multimodal eval is being added
in a separate PR (#452). Remove the modal plumbing from this PR to
avoid depending on unreleased changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix import sorting (ruff I001) across all changed files
- Remove unused imports (ruff F401): LLM in 6 run_infer.py files
- Remove unused variable build_error (ruff F841) in swtbench
- Fix undefined sdk_short_sha reference in swebench/run_infer.py
- Fix formatting (ruff format) in 6 files
- Fix trailing blank line (pycodestyle W391) in __init__.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pick up the split parameter change from main (default removed from
function signature, args.split passed explicitly at call site).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Import TargetType from local benchmarks.swebench.constants instead of
the removed openhands.agent_server.docker.build import.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These scripts are only used by the NeMo integration. Move them to
nemo_evaluator/openhands_benchmarks/ to keep NeMo-specific code
together. They can be moved to benchmarks/scripts/ later if they
prove generally useful.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move nemo_evaluator from required dependencies to
[project.optional-dependencies] so it can be installed
via pip install openhands-benchmarks[nemo].

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"pkg*" already matches "pkg" via fnmatch glob.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
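This can be checked directly with the standard library's fnmatch, the glob style the commit message refers to:

```python
from fnmatch import fnmatch

# A trailing "*" matches the empty string too, so "nemo_evaluator*" covers
# both the top-level package and every subpackage in one pattern.
assert fnmatch("nemo_evaluator", "nemo_evaluator*")
assert fnmatch("nemo_evaluator.openhands_benchmarks", "nemo_evaluator*")
```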
Resolve api_key_env and strip /chat/completions from base_url at config
generation time rather than load time. This keeps the shared
load_llm_config() utility free of NeMo-specific concerns.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
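A rough sketch of generation-time normalization; function and field names here are illustrative, not the actual generate_llm_config.py API:

```python
import os


def build_llm_config(model: str, base_url: str, api_key_env: str) -> dict:
    """Resolve the API key env var and strip a trailing /chat/completions
    suffix once, when the config JSON is generated, so the shared
    load_llm_config() utility stays free of NeMo-specific concerns.
    """
    suffix = "/chat/completions"
    if base_url.endswith(suffix):
        # LiteLLM appends the route itself, so the stored base_url must not.
        base_url = base_url[: -len(suffix)]
    return {
        "model": model,
        "base_url": base_url,
        "api_key": os.environ.get(api_key_env, ""),
    }
```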
simonrosenberg and others added 8 commits March 2, 2026 09:22
Adopt centralized Docker image auto-detection helpers (create_docker_workspace,
ensure_local_image, local_image_exists) from PR #456 while keeping IMAGE_TAG_PREFIX
naming from this branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The DelegateTool still exists in the SDK and get_default_tools does not
include it. Removing the opt-in delegation code silently breaks the
--enable-delegation CLI flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The FinishAction isinstance check is the correct way to extract the
agent's final answer. The replacement using string matching on class
names and wrong attributes was strictly worse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The precise startswith("bytedance-research/Multi-SWE-Bench") check
from PR #454 is already correct. The loose substring match is
unnecessary and could false-positive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
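The difference between the two checks, in a self-contained form (the helper name is hypothetical):

```python
DATASET_PREFIX = "bytedance-research/Multi-SWE-Bench"


def is_multi_swe_bench(dataset: str) -> bool:
    # Precise prefix check: a loose substring test such as
    # "Multi-SWE-Bench" in dataset would also match unrelated forks.
    return dataset.startswith(DATASET_PREFIX)
```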
- Remove try/except "unknown" fallback from get_sdk_sha() — if the
  submodule is missing, fail immediately instead of producing a useless
  image tag that fails later at Docker pull time
- Migrate swefficiency/run_infer.py to IMAGE_TAG_PREFIX, removing the
  scattered os.getenv("SDK_SHORT_SHA") override pattern
- Migrate modal_patches.py to IMAGE_TAG_PREFIX for consistency
- Document the purpose of IMAGE_TAG_PREFIX in version.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@all-hands-bot all-hands-bot left a comment


Testing Gap: The new nemo_evaluator package has no test coverage.

@@ -0,0 +1,98 @@
from __future__ import annotations


🟠 Important: No tests for new integration package.

You've added 3 new modules (generate_llm_config.py, run_benchmark.py, output.py) totaling ~400 lines with zero test coverage. This is an integration shim to NeMo - if it breaks, debugging will be painful.

Minimum viable tests:

  1. test_generate_config() - verify JSON output format, env var resolution, URL stripping
  2. test_build_infer_cmd() - verify command building for each benchmark type
  3. test_parse_output() - verify report.json parsing and accuracy calculation

"Testing and practice sometimes clash. Testing wins. Every single time." - If this breaks in production, you'll wish you had tests to reproduce the issue locally.

Add tests in nemo_evaluator/tests/ before merging.

simonrosenberg and others added 4 commits March 2, 2026 14:36
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Replace if-chain for benchmark-specific CLI flags with a declarative
dict mapping benchmark names to parameter extraction functions. New
benchmarks can now add their parameters without modifying _build_infer_cmd.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend the data-driven approach to _build_eval_cmd and env var setup:
- BENCHMARK_EVAL_PARAMS: declares per-benchmark eval CLI flags
- BENCHMARK_ENV_VARS: declares per-benchmark env vars for inference

All benchmark-specific logic is now in declarative dicts at the top of
the file. The functions _build_infer_cmd, _build_eval_cmd, and main()
are fully generic — adding a new benchmark only requires updating the
data tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
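The resulting dispatch shape can be sketched like this; it is a simplified stand-in for the real tables and command names in run_benchmark.py:

```python
import argparse
from typing import Callable

# Simplified stand-in for the declarative table described above; only the
# shape matters, the real lambdas and flag names live in run_benchmark.py.
BENCHMARK_INFER_PARAMS: dict[
    str, Callable[[argparse.Namespace], list[tuple[str, str]]]
] = {
    "gaia": lambda args: [("--level", args.level)] if args.level else [],
}


def build_infer_cmd(benchmark: str, args: argparse.Namespace) -> list[str]:
    """Fully generic builder: benchmark specifics come only from the table."""
    cmd = ["run-infer", benchmark]
    for flag, value in BENCHMARK_INFER_PARAMS.get(benchmark, lambda _a: [])(args):
        cmd.extend([flag, value])
    return cmd
```

Adding a benchmark then means adding one dict entry, not editing the builder.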
@simonrosenberg
Collaborator Author

NeMo Evaluator Integration Validation Results

Triggered CI benchmarks with eval_limit=1 to validate the NeMo Evaluator integration.

Results: 5/6 Benchmarks Passed ✅

  • gaia: ✅ Success (https://github.com/OpenHands/evaluation/actions/runs/22590769837)
  • swebench: ✅ Success (https://github.com/OpenHands/evaluation/actions/runs/22590769394)
  • swtbench: ✅ Success (https://github.com/OpenHands/evaluation/actions/runs/22590775404)
  • commit0: ✅ Success (https://github.com/OpenHands/evaluation/actions/runs/22590778789)
  • swebenchmultimodal: ✅ Success (https://github.com/OpenHands/evaluation/actions/runs/22590785406)
  • multiswebench: ❌ Failed (dataset schema issue, unrelated to this PR)

Configuration

  • SDK Branch: main (c31cdee)
  • Model: claude-sonnet-4-5-20250929
  • Eval Limit: 1 instance per benchmark

Note on multiswebench failure

The multiswebench failure is due to a known HuggingFace dataset schema incompatibility with bytedance-research/Multi-SWE-Bench - not related to this PR. Fix available in PR #304.

…tting

- Remove duplicate try block and duplicate lines in prepare_workspace()
- Fix ruff formatting for type hints in run_benchmark.py
- Update uv.lock

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
simonrosenberg added a commit that referenced this pull request Mar 2, 2026
All benchmark runners previously hardcoded SDK_SHORT_SHA for Docker
image tags, with each file duplicating an
os.getenv("SDK_SHORT_SHA", SDK_SHORT_SHA) override pattern. This
centralizes the tag prefix into a single constant.

- New IMAGE_TAG_PREFIX = os.getenv("IMAGE_TAG_PREFIX", SDK_SHORT_SHA)
  in version.py
- All runners import and use IMAGE_TAG_PREFIX directly with no per-file
  env var fallback
- EVAL_AGENT_SERVER_IMAGE in constants.py is now overridable via
  OPENHANDS_EVAL_AGENT_SERVER_IMAGE env var
- modal_patches.py updated to propagate IMAGE_TAG_PREFIX (replaces
  SDK_SHA/SDK_SHORT_SHA pair)

Extracted from #455.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@juanmichelini
Collaborator

Results: 5/6 Benchmarks Passed ✅

Notice that for swebench, swtbench, and commit0 the single instance was resolved.

- Add back ToolPresetType import
- Restore get_tools_for_preset() function supporting default/gemini/planning presets
- Restore tool_preset parameter in EvalMetadata
- Restore tool preset logging

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@juanmichelini
Collaborator

On the second pass, the evaluations were run with branch main and not branch nemo-evaluator.

Update test_benchmark_metrics_collection to use get_tools_for_preset
mock target for swebench benchmark, consistent with the restored
function in run_infer.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@simonrosenberg
Collaborator Author

On the second pass, the evaluations were run with branch main and not branch nemo-evaluator.

damn it

@juanmichelini
Collaborator

@OpenHands see software-agent-sdk and run 5 eval job actions there, one for each of these benchmarks: commit0, gaia, swebench, swebenchmultimodal, and swtbench. For each of those, run a single instance; in all of them use the trigger reason "testing benchmarks branch nemo-evaluator", and make sure to set the benchmark branch param to nemo-evaluator so that we can test that PR. Run it; the software-agent-sdk action will run an evaluation job. Once you have links to the 5 evaluation jobs, post a comment here with the links.

@openhands-ai

openhands-ai bot commented Mar 2, 2026

I'm on it! juanmichelini can track my progress at all-hands.dev


Evaluation Jobs Triggered

Successfully triggered 5 evaluation jobs testing the nemo-evaluator branch with eval_limit=1:

  • commit0: https://github.com/OpenHands/evaluation/actions/runs/22594781791
  • gaia: https://github.com/OpenHands/evaluation/actions/runs/22594775753
  • swebench: https://github.com/OpenHands/evaluation/actions/runs/22594781250
  • swebenchmultimodal: https://github.com/OpenHands/evaluation/actions/runs/22594790024
  • swtbench: https://github.com/OpenHands/evaluation/actions/runs/22594786078

All jobs are configured with:

  • SDK version: v1.11.5
  • Benchmarks branch: nemo-evaluator
  • Eval limit: 1 instance per benchmark
  • Trigger reason: testing benchmarks branch nemo-evaluator

@openhands-ai

openhands-ai bot commented Mar 2, 2026

Final Summary

I successfully completed your request to trigger evaluation jobs for testing the nemo-evaluator branch (PR #455).

Actions Completed:

Triggered 5 evaluation job actions in the OpenHands/software-agent-sdk repository using the "Run Eval" workflow:

  • commit0
  • gaia
  • swebench
  • swebenchmultimodal
  • swtbench

Each job was configured with:

  • SDK version: v1.11.5
  • Benchmarks branch: nemo-evaluator
  • Eval limit: 1 instance per benchmark
  • Trigger reason: "testing benchmarks branch nemo-evaluator"

Posted a comment on PR #455 with links to all 5 evaluation jobs:

Comment URL: #455 (comment)

All evaluation jobs are now running and will test the NeMo Evaluator integration changes on the nemo-evaluator branch.

simonrosenberg added a commit that referenced this pull request Mar 3, 2026
* Centralize IMAGE_TAG_PREFIX for Docker image tags

All benchmark runners previously hardcoded SDK_SHORT_SHA for Docker
image tags, with each file duplicating an
os.getenv("SDK_SHORT_SHA", SDK_SHORT_SHA) override pattern. This
centralizes the tag prefix into a single constant.

- New IMAGE_TAG_PREFIX = os.getenv("IMAGE_TAG_PREFIX", SDK_SHORT_SHA)
  in version.py
- All runners import and use IMAGE_TAG_PREFIX directly with no per-file
  env var fallback
- EVAL_AGENT_SERVER_IMAGE in constants.py is now overridable via
  OPENHANDS_EVAL_AGENT_SERVER_IMAGE env var
- modal_patches.py updated to propagate IMAGE_TAG_PREFIX (replaces
  SDK_SHA/SDK_SHORT_SHA pair)

Extracted from #455.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add backward compatibility for SDK_SHORT_SHA env var

Address review feedback: the env var rename from SDK_SHORT_SHA to
IMAGE_TAG_PREFIX is a breaking change for existing users.

- version.py: fall back to SDK_SHORT_SHA env var when IMAGE_TAG_PREFIX
  is not set, emitting a DeprecationWarning
- modal_patches.py: continue propagating SDK_SHORT_SHA to modal
  execution environments alongside IMAGE_TAG_PREFIX
- modal_patches.py: _get_image_tag_prefix() fallback also checks
  SDK_SHORT_SHA when IMAGE_TAG_PREFIX is unset

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "Add backward compatibility for SDK_SHORT_SHA env var"

This reverts commit 1d3f778.

* Add runtime deprecation warning for SDK_SHORT_SHA env var

- version.py: fall back to SDK_SHORT_SHA env var when IMAGE_TAG_PREFIX
  is not set, emitting a DeprecationWarning so users know to migrate
- modal_patches.py: _get_image_tag_prefix() fallback also checks
  SDK_SHORT_SHA when IMAGE_TAG_PREFIX is unset

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
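A sketch of the described fallback, with an illustrative function name (the real logic lives in benchmarks/utils/version.py):

```python
import os
import warnings


def resolve_image_tag_prefix(default: str) -> str:
    """Prefer IMAGE_TAG_PREFIX; fall back to the deprecated SDK_SHORT_SHA
    env var with a DeprecationWarning so users know to migrate."""
    prefix = os.environ.get("IMAGE_TAG_PREFIX")
    if prefix:
        return prefix
    legacy = os.environ.get("SDK_SHORT_SHA")
    if legacy:
        warnings.warn(
            "SDK_SHORT_SHA is deprecated; set IMAGE_TAG_PREFIX instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return legacy
    return default
```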

* Revert "Add runtime deprecation warning for SDK_SHORT_SHA env var"

This reverts commit 80b38b2.

* Update benchmarks/utils/version.py

Co-authored-by: OpenHands Bot <contact@all-hands.dev>

* Fix inconsistent log wording and add IMAGE_TAG_PREFIX tests

- swebench/run_infer.py: change "image tag prefix:" to "tag prefix:"
  for consistency with all other benchmark runners
- tests/test_version.py: add unit tests for IMAGE_TAG_PREFIX resolution
  covering default, env var override, deprecated SDK_SHORT_SHA fallback,
  and precedence when both env vars are set

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update uv.lock for SDK submodule v1.11.5

Regenerate lock file to match the current vendor/software-agent-sdk
submodule version (1.10.0 → 1.11.5).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix ruff format in version.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Update benchmarks/utils/modal_patches.py

Co-authored-by: OpenHands Bot <contact@all-hands.dev>

* Fix indentation in modal_patches.py for IMAGE_TAG_PREFIX env vars

Three lines were accidentally left-aligned at column 0 instead of being
indented inside the _inject_modal_sitecustomize() function body.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Resolve conflicts after sub-PRs #471, #472, #473 were merged to main:
- Take main's SDK_SHORT_SHA deprecation handling in version.py
- Take main's backward-compat SDK_SHORT_SHA in modal_patches.py
- Take main's log message wording in swebench/run_infer.py
- Remove redundant prompt_dir setup (superseded by add_prompt_path_argument)
- Keep lazy imports for git-dependent modules (build_utils.py, swtbench/run_infer.py)
- Take main's comment cleanup in swtbench/eval_infer.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
simonrosenberg and others added 2 commits March 3, 2026 12:48
Resolve conflicts after benchmark bug-fix PR #475 was merged to main:
- Take main's malformed JSON error handling in multiswebench/build_images.py
- Take main's cleaner pd.isna() guard (isinstance instead of try/except)
- Take main's heredoc security tradeoff comment
- Take main's fixed generate_report resolution logic (final_score check)
- Take main's updated generate_report docstring

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
set_defaults(**EVAL_DEFAULTS) already handles the timeout default.
The merge auto-resolved this the wrong way; restore main's version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg linked an issue Mar 3, 2026 that may be closed by this pull request

@all-hands-bot all-hands-bot left a comment


Taste Rating: 🟡 Acceptable - NeMo integration works but reveals abstraction issues.

Key Insight: Dict-of-lambdas returning tuple-lists with empty-string sentinel values is a code smell. This pattern works but signals the abstraction is leaking - as you add more benchmarks, maintenance cost grows non-linearly.

Critical Gap: Still no tests for 400+ lines of integration code (unresolved from previous review). This is NeMo glue - when it breaks, debugging will be painful without tests.

  pass_filenames: true
  always_run: false
- exclude: ^legacy/
+ exclude: ^(legacy|nemo_evaluator)/


🔴 Critical: Excluding nemo_evaluator from pre-commit bypasses type checking and linting.

You've exempted this code from pyright and ruff. This means type errors and code quality issues will slip through. NeMo has its own conventions, but type safety and basic linting shouldn't be negotiable.

If NeMo's conventions conflict with your project's, document the specific conflicts and exclude only those rules. Blanket exclusion is how technical debt accumulates silently.

Comment on lines +37 to +45
BENCHMARK_INFER_PARAMS: dict[
    str, Callable[[argparse.Namespace], list[tuple[str, str]]]
] = {
    "gaia": lambda args: [("--level", args.level)] if args.level else [],
    "commit0": lambda args: [("--repo-split", args.repo_split)]
    if args.repo_split
    else [],
    "multiswebench": lambda args: [("--lang", args.language)] if args.language else [],
}


🟠 Important: Dict-of-lambdas is overengineered for simple conditional args.

This pattern:

BENCHMARK_INFER_PARAMS: dict[str, Callable[[...], list[tuple[str, str]]]] = {
    "gaia": lambda args: [("--level", args.level)] if args.level else [],
}

Could be:

def get_benchmark_args(benchmark: str, args: Namespace) -> list[str]:
    """Returns CLI args for benchmark-specific parameters."""
    if benchmark == "gaia" and args.level:
        return ["--level", args.level]
    elif benchmark == "commit0" and args.repo_split:
        return ["--repo-split", args.repo_split]
    # ...
    return []

Simpler data structures → simpler code. The dict-of-lambdas doesn't buy you anything except indirection.

Comment on lines +133 to +137
for flag, value in BENCHMARK_EVAL_PARAMS[benchmark](args, output_jsonl):
    if value:
        cmd.extend([flag, value])
    else:
        cmd.append(flag)


🟡 Suggestion: Empty string as sentinel for bare flags is a code smell.

Using ("--modal", "") to mean "bare flag" requires this special-case check. Better approach: separate the flag building from the tuple structure, or use a simple boolean/None pattern.

Example:

for item in params:
    if isinstance(item, tuple):  # flag with value
        cmd.extend(item)
    else:  # bare flag
        cmd.append(item)

Or just handle it where the data is created instead of using empty strings.

    )
    sys.exit(1)

output_jsonl = output_files[-1]  # Use the latest one


🟠 Important: sorted()[-1] assumes lexicographic ordering gives "latest" - fragile.

If output_dir contains multiple runs with timestamps or run IDs in different formats, sorted() might not pick the latest. You're relying on filesystem/naming conventions that aren't enforced.

Better: Either error if len(output_files) != 1 (strict), or use file modification time:

output_jsonl = max(output_files, key=lambda p: p.stat().st_mtime)

Theory: "we'll only have one output.jsonl per run"
Practice: Multi-attempt runs, retries, or manual re-runs will break this assumption.

output_path = pathlib.Path(output_dir)

# Find any .report.json file (all benchmarks use this naming convention)
report_files = sorted(output_path.rglob("*.report.json"))


🟢 Good taste: Explicit check for multiple report files prevents ambiguity.

Unlike run_benchmark.py:247 which silently picks the last file, this correctly errors on ambiguous input (line 19-23). This is the right behavior for validation logic - fail fast and explicit rather than guessing.

Comment on lines +28 to +35
metadata_file = output_path / "nemo_metadata.json"
if not metadata_file.exists():
    raise FileNotFoundError(
        f"nemo_metadata.json not found in {output_dir}. "
        "Make sure the benchmark was run via run_benchmark.py."
    )
metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
task_name = metadata["benchmark"]


🟡 Suggestion: Tight coupling via metadata file is pragmatic but document the contract.

output.py requires run_benchmark.py to write nemo_metadata.json. This works, but the coupling is implicit - nothing enforces that run_benchmark.py actually writes it, or what schema it uses.

Add a comment or type definition for the metadata schema:

# Expected schema: {"benchmark": str}
metadata = json.loads(metadata_file.read_text(encoding="utf-8"))

Or use a simple dataclass/TypedDict to document the contract.
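A TypedDict version of that contract might look like this, with names assumed from the snippet above:

```python
import json
from pathlib import Path
from typing import TypedDict


class NemoMetadata(TypedDict):
    """Contract for nemo_metadata.json as written by run_benchmark.py."""

    benchmark: str


def load_metadata(path: Path) -> NemoMetadata:
    data = json.loads(path.read_text(encoding="utf-8"))
    if "benchmark" not in data:
        raise KeyError("nemo_metadata.json missing required 'benchmark' field")
    return data  # runtime dict; the TypedDict documents the expected shape
```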



Development

Successfully merging this pull request may close these issues.

Nemo Evaluator Integration with OpenHands
