Conversation
Force-pushed c63273d to 8bfdd29 (compare)
The PR had a broken run_timeout parameter (immediately overwritten by env var). Revert to main's clean implementation and drop the dead argument from all 6 benchmark call sites.
PR #389 intentionally removed the eval_ prefix from run_ids in swtbench and swebenchmultimodal. The NeMo PR had re-added it. Revert to the current convention (bare stem, no prefix).
Instead of patching every eval script to add a "benchmark" field to .report.json, write a nemo_metadata.json from run_benchmark.py and read it from output.py. This keeps the NeMo concern in NeMo-specific code and avoids touching 7 existing benchmark eval scripts.
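A minimal sketch of that split, assuming a single-key `{"benchmark": str}` schema; the helper names here are illustrative, not the PR's actual functions:

```python
import json
import pathlib

# run_benchmark.py side (sketch): record which benchmark produced this
# output directory, so the NeMo-specific reader can recover it later.
def write_nemo_metadata(output_dir: str, benchmark: str) -> None:
    path = pathlib.Path(output_dir) / "nemo_metadata.json"
    path.write_text(json.dumps({"benchmark": benchmark}), encoding="utf-8")

# output.py side (sketch): read the benchmark name back instead of
# requiring every eval script to embed it in .report.json.
def read_benchmark_name(output_dir: str) -> str:
    path = pathlib.Path(output_dir) / "nemo_metadata.json"
    metadata = json.loads(path.read_text(encoding="utf-8"))
    return metadata["benchmark"]
```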
These choices are only used for argparse validation, never shown to users. Use absolute paths unconditionally instead of a try/except fallback for pip-installed packages.
Main branch reads the timeout from the CONVERSATION_TIMEOUT env var in fake_user_response.py. The CLI arg added by this PR was dead code since we reverted fake_user_response.py to main. Remove the full chain: args_parser, EvalMetadata field, all benchmark main() assignments, run_benchmark.py passthrough, and framework.yml config.
This is not a bug fix and not needed for NeMo. Main already skips failed samples by default. The feature added fail-fast behavior that would be harmful for long NeMo runs, and framework.yml had it configured to stop on first failure. Removed: SampleFailedError class, CLI arg, EvalMetadata field, all benchmark main() assignments, run_benchmark.py passthrough, and framework.yml config.
The uv check-and-fallback pattern is unnecessary: when running inside the venv (whether via uv or pip), sys.executable already points to the correct Python interpreter with all dependencies available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
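The point is easy to demonstrate with a minimal sketch: invoking the current interpreter via `sys.executable` behaves the same under a uv-created or pip-created venv, with no fallback logic needed.

```python
import subprocess
import sys

# sys.executable points at the interpreter running this script, i.e. the
# active venv's Python, so no "uv run python" wrapper is needed.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.version_info.major)"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # → "3" under any Python 3 venv
```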
Revert unnecessary restructuring (moved try/except blocks, removed default values, reformatted signatures) and keep only the essential changes: replace uv invocations with sys.executable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The --modal/--no-modal flag for swebench multimodal eval is being added in a separate PR (#452). Remove the modal plumbing from this PR to avoid depending on unreleased changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix import sorting (ruff I001) across all changed files - Remove unused imports (ruff F401): LLM in 6 run_infer.py files - Remove unused variable build_error (ruff F841) in swtbench - Fix undefined sdk_short_sha reference in swebench/run_infer.py - Fix formatting (ruff format) in 6 files - Fix trailing blank line (pycodestyle W391) in __init__.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pick up the split parameter change from main (default removed from function signature, args.split passed explicitly at call site). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Import TargetType from local benchmarks.swebench.constants instead of the removed openhands.agent_server.docker.build import. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These scripts are only used by the NeMo integration. Move them to nemo_evaluator/openhands_benchmarks/ to keep NeMo-specific code together. They can be moved to benchmarks/scripts/ later if they prove generally useful. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move nemo_evaluator from required dependencies to [project.optional-dependencies] so it can be installed via pip install openhands-benchmarks[nemo]. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"pkg*" already matches "pkg" via fnmatch glob. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve api_key_env and strip /chat/completions from base_url at config generation time rather than load time. This keeps the shared load_llm_config() utility free of NeMo-specific concerns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 8bfdd29 to 0ff388e (compare)
Adopt centralized Docker image auto-detection helpers (create_docker_workspace, ensure_local_image, local_image_exists) from PR #456 while keeping IMAGE_TAG_PREFIX naming from this branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The DelegateTool still exists in the SDK and get_default_tools does not include it. Removing the opt-in delegation code silently breaks the --enable-delegation CLI flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The FinishAction isinstance check is the correct way to extract the agent's final answer. The replacement using string matching on class names and wrong attributes was strictly worse. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The precise startswith("bytedance-research/Multi-SWE-Bench") check
from PR #454 is already correct. The loose substring match is
unnecessary and could false-positive.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove try/except "unknown" fallback from get_sdk_sha() — if the
submodule is missing, fail immediately instead of producing a useless
image tag that fails later at Docker pull time
- Migrate swefficiency/run_infer.py to IMAGE_TAG_PREFIX, removing the
scattered os.getenv("SDK_SHORT_SHA") override pattern
- Migrate modal_patches.py to IMAGE_TAG_PREFIX for consistency
- Document the purpose of IMAGE_TAG_PREFIX in version.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed ca0a242 to b0620fa (compare)
**all-hands-bot** left a comment
Testing Gap: The new nemo_evaluator package has no test coverage.
```diff
@@ -0,0 +1,98 @@
+from __future__ import annotations
```
🟠 Important: No tests for new integration package.
You've added 3 new modules (generate_llm_config.py, run_benchmark.py, output.py) totaling ~400 lines with zero test coverage. This is an integration shim to NeMo - if it breaks, debugging will be painful.
Minimum viable tests:
- `test_generate_config()`: verify JSON output format, env var resolution, URL stripping
- `test_build_infer_cmd()`: verify command building for each benchmark type
- `test_parse_output()`: verify report.json parsing and accuracy calculation
"Testing and practice sometimes clash. Testing wins. Every single time." - If this breaks in production, you'll wish you had tests to reproduce the issue locally.
Add tests in nemo_evaluator/tests/ before merging.
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Replace if-chain for benchmark-specific CLI flags with a declarative dict mapping benchmark names to parameter extraction functions. New benchmarks can now add their parameters without modifying _build_infer_cmd. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend the data-driven approach to _build_eval_cmd and env var setup: - BENCHMARK_EVAL_PARAMS: declares per-benchmark eval CLI flags - BENCHMARK_ENV_VARS: declares per-benchmark env vars for inference All benchmark-specific logic is now in declarative dicts at the top of the file. The functions _build_infer_cmd, _build_eval_cmd, and main() are fully generic — adding a new benchmark only requires updating the data tables. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
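A condensed sketch of the data-driven dispatch: per-benchmark CLI flags live in a table and the command builder stays generic. The `--level` and `--repo-split` flags come from the PR; the base command and namespace fields are placeholders.

```python
import argparse
from typing import Callable

# Declarative table: benchmark name -> function extracting its CLI params.
BENCHMARK_INFER_PARAMS: dict[str, Callable[[argparse.Namespace], list[tuple[str, str]]]] = {
    "gaia": lambda args: [("--level", args.level)] if args.level else [],
    "commit0": lambda args: [("--repo-split", args.repo_split)] if args.repo_split else [],
}

def build_infer_cmd(benchmark: str, args: argparse.Namespace) -> list[str]:
    cmd = ["run-infer", benchmark]  # placeholder base command
    for flag, value in BENCHMARK_INFER_PARAMS.get(benchmark, lambda _: [])(args):
        cmd.extend([flag, value])
    return cmd

ns = argparse.Namespace(level="1", repo_split=None)
print(build_infer_cmd("gaia", ns))  # → ['run-infer', 'gaia', '--level', '1']
```

Adding a benchmark means adding one dict entry; `build_infer_cmd` itself never changes.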
**Nemo Evaluator Integration Validation Results**

Triggered CI benchmarks. Results: 5/6 benchmarks passed ✅
Configuration
**Note on multiswebench failure**
…tting
- Remove duplicate try block and duplicate lines in prepare_workspace()
- Fix ruff formatting for type hints in run_benchmark.py
- Update uv.lock
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
All benchmark runners previously hardcoded SDK_SHORT_SHA for Docker
image tags, with each file duplicating an
os.getenv("SDK_SHORT_SHA", SDK_SHORT_SHA) override pattern. This
centralizes the tag prefix into a single constant.
- New IMAGE_TAG_PREFIX = os.getenv("IMAGE_TAG_PREFIX", SDK_SHORT_SHA)
in version.py
- All runners import and use IMAGE_TAG_PREFIX directly with no per-file
env var fallback
- EVAL_AGENT_SERVER_IMAGE in constants.py is now overridable via
OPENHANDS_EVAL_AGENT_SERVER_IMAGE env var
- modal_patches.py updated to propagate IMAGE_TAG_PREFIX (replaces
SDK_SHA/SDK_SHORT_SHA pair)
Extracted from #455.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
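The resulting constant reduces to a one-liner; sketched here with a stand-in SHA value:

```python
import os

# version.py (sketch): single source of truth for the Docker image tag prefix.
SDK_SHORT_SHA = "abc1234"  # stand-in; derived from the vendored SDK in reality
IMAGE_TAG_PREFIX = os.getenv("IMAGE_TAG_PREFIX", SDK_SHORT_SHA)

# A runner then builds its tag directly, with no per-file env var fallback:
image = f"example-registry/agent-server:{IMAGE_TAG_PREFIX}"
print(image)
```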
Notice that for swebench, swtbench and commit0 the single instance was resolved.
- Add back ToolPresetType import - Restore get_tools_for_preset() function supporting default/gemini/planning presets - Restore tool_preset parameter in EvalMetadata - Restore tool preset logging Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
On the second pass the evaluations were run with branch main, not branch nemo-evaluator.
Update test_benchmark_metrics_collection to use get_tools_for_preset mock target for swebench benchmark, consistent with the restored function in run_infer.py. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
damn it
@OpenHands see software-agent-sdk, run 5 eval job actions there, one for each of these benchmarks: commit0, gaia, swebench, swebenchmultimodal, gaia. For each of those run a single instance; in all use trigger reason "testing benchmarks branch nemo-evaluator", and make sure to set the benchmark branch param to nemo-evaluator so that we can test that PR. Run it; the software agent sdk action will run an evaluation job. Once you have links to the 5 evaluation jobs, post a comment here with the links.
I'm on it! juanmichelini can track my progress at all-hands.dev
**Evaluation Jobs Triggered**

Successfully triggered 5 evaluation jobs testing the `nemo-evaluator` branch.
All jobs are configured with:
**Final Summary**

I successfully completed your request to trigger evaluation jobs for testing the `nemo-evaluator` branch.

**Actions Completed:**

✅ Triggered 5 evaluation job actions in the OpenHands/software-agent-sdk repository using the "Run Eval" workflow:
Each job was configured with:
✅ Posted a comment on PR #455 with links to all 5 evaluation jobs:
Comment URL: #455 (comment)

All evaluation jobs are now running and will test the NeMo Evaluator integration changes on the `nemo-evaluator` branch.
* Centralize IMAGE_TAG_PREFIX for Docker image tags
All benchmark runners previously hardcoded SDK_SHORT_SHA for Docker
image tags, with each file duplicating an
os.getenv("SDK_SHORT_SHA", SDK_SHORT_SHA) override pattern. This
centralizes the tag prefix into a single constant.
- New IMAGE_TAG_PREFIX = os.getenv("IMAGE_TAG_PREFIX", SDK_SHORT_SHA)
in version.py
- All runners import and use IMAGE_TAG_PREFIX directly with no per-file
env var fallback
- EVAL_AGENT_SERVER_IMAGE in constants.py is now overridable via
OPENHANDS_EVAL_AGENT_SERVER_IMAGE env var
- modal_patches.py updated to propagate IMAGE_TAG_PREFIX (replaces
SDK_SHA/SDK_SHORT_SHA pair)
Extracted from #455.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Add backward compatibility for SDK_SHORT_SHA env var
Address review feedback: the env var rename from SDK_SHORT_SHA to
IMAGE_TAG_PREFIX is a breaking change for existing users.
- version.py: fall back to SDK_SHORT_SHA env var when IMAGE_TAG_PREFIX
is not set, emitting a DeprecationWarning
- modal_patches.py: continue propagating SDK_SHORT_SHA to modal
execution environments alongside IMAGE_TAG_PREFIX
- modal_patches.py: _get_image_tag_prefix() fallback also checks
SDK_SHORT_SHA when IMAGE_TAG_PREFIX is unset
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Revert "Add backward compatibility for SDK_SHORT_SHA env var"
This reverts commit 1d3f778.
* Add runtime deprecation warning for SDK_SHORT_SHA env var
- version.py: fall back to SDK_SHORT_SHA env var when IMAGE_TAG_PREFIX
is not set, emitting a DeprecationWarning so users know to migrate
- modal_patches.py: _get_image_tag_prefix() fallback also checks
SDK_SHORT_SHA when IMAGE_TAG_PREFIX is unset
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Revert "Add runtime deprecation warning for SDK_SHORT_SHA env var"
This reverts commit 80b38b2.
* Update benchmarks/utils/version.py
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
* Fix inconsistent log wording and add IMAGE_TAG_PREFIX tests
- swebench/run_infer.py: change "image tag prefix:" to "tag prefix:"
for consistency with all other benchmark runners
- tests/test_version.py: add unit tests for IMAGE_TAG_PREFIX resolution
covering default, env var override, deprecated SDK_SHORT_SHA fallback,
and precedence when both env vars are set
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Update uv.lock for SDK submodule v1.11.5
Regenerate lock file to match the current vendor/software-agent-sdk
submodule version (1.10.0 → 1.11.5).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix ruff format in version.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Update benchmarks/utils/modal_patches.py
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
* Fix indentation in modal_patches.py for IMAGE_TAG_PREFIX env vars
Three lines were accidentally left-aligned at column 0 instead of being
indented inside the _inject_modal_sitecustomize() function body.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: OpenHands Bot <contact@all-hands.dev>
Resolve conflicts after sub-PRs #471, #472, #473 were merged to main: - Take main's SDK_SHORT_SHA deprecation handling in version.py - Take main's backward-compat SDK_SHORT_SHA in modal_patches.py - Take main's log message wording in swebench/run_infer.py - Remove redundant prompt_dir setup (superseded by add_prompt_path_argument) - Keep lazy imports for git-dependent modules (build_utils.py, swtbench/run_infer.py) - Take main's comment cleanup in swtbench/eval_infer.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts after benchmark bug-fix PR #475 was merged to main: - Take main's malformed JSON error handling in multiswebench/build_images.py - Take main's cleaner pd.isna() guard (isinstance instead of try/except) - Take main's heredoc security tradeoff comment - Take main's fixed generate_report resolution logic (final_score check) - Take main's updated generate_report docstring Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
set_defaults(**EVAL_DEFAULTS) already handles the timeout default. The merge auto-resolved this the wrong way; restore main's version. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
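For context, argparse parser-level defaults set via `set_defaults()` always win over (and make redundant) a `default=` on the argument itself, which is why the duplicated default could simply be dropped. A minimal sketch with an illustrative value:

```python
import argparse

# set_defaults(**EVAL_DEFAULTS) supplies the timeout, so a per-argument
# default= would be redundant. 3600 is an illustrative value.
EVAL_DEFAULTS = {"run_timeout": 3600}

parser = argparse.ArgumentParser()
parser.add_argument("--run-timeout", type=int)  # no per-argument default
parser.set_defaults(**EVAL_DEFAULTS)

args = parser.parse_args([])
print(args.run_timeout)  # → 3600
```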
**all-hands-bot** left a comment
Taste Rating: 🟡 Acceptable - NeMo integration works but reveals abstraction issues.
Key Insight: Dict-of-lambdas returning tuple-lists with empty-string sentinel values is a code smell. This pattern works but signals the abstraction is leaking - as you add more benchmarks, maintenance cost grows non-linearly.
Critical Gap: Still no tests for 400+ lines of integration code (unresolved from previous review). This is NeMo glue - when it breaks, debugging will be painful without tests.
```diff
   pass_filenames: true
   always_run: false
-  exclude: ^legacy/
+  exclude: ^(legacy|nemo_evaluator)/
```
🔴 Critical: Excluding nemo_evaluator from pre-commit bypasses type checking and linting.
You've exempted this code from pyright and ruff. This means type errors and code quality issues will slip through. NeMo has its own conventions, but type safety and basic linting shouldn't be negotiable.
If NeMo's conventions conflict with your project's, document the specific conflicts and exclude only those rules. Blanket exclusion is how technical debt accumulates silently.
```python
BENCHMARK_INFER_PARAMS: dict[
    str, Callable[[argparse.Namespace], list[tuple[str, str]]]
] = {
    "gaia": lambda args: [("--level", args.level)] if args.level else [],
    "commit0": lambda args: [("--repo-split", args.repo_split)]
    if args.repo_split
    else [],
    "multiswebench": lambda args: [("--lang", args.language)] if args.language else [],
}
```
🟠 Important: Dict-of-lambdas is overengineered for simple conditional args.
This pattern:
```python
BENCHMARK_INFER_PARAMS: dict[str, Callable[[...], list[tuple[str, str]]]] = {
    "gaia": lambda args: [("--level", args.level)] if args.level else [],
}
```

Could be:
```python
def get_benchmark_args(benchmark: str, args: Namespace) -> list[str]:
    """Returns CLI args for benchmark-specific parameters."""
    if benchmark == "gaia" and args.level:
        return ["--level", args.level]
    elif benchmark == "commit0" and args.repo_split:
        return ["--repo-split", args.repo_split]
    # ...
    return []
```

Simpler data structures → simpler code. The dict-of-lambdas doesn't buy you anything except indirection.
```python
for flag, value in BENCHMARK_EVAL_PARAMS[benchmark](args, output_jsonl):
    if value:
        cmd.extend([flag, value])
    else:
        cmd.append(flag)
```
🟡 Suggestion: Empty string as sentinel for bare flags is a code smell.
Using ("--modal", "") to mean "bare flag" requires this special-case check. Better approach: separate the flag building from the tuple structure, or use a simple boolean/None pattern.
Example:
```python
for item in params:
    if isinstance(item, tuple):  # flag with value
        cmd.extend(item)
    else:  # bare flag
        cmd.append(item)
```

Or just handle it where the data is created instead of using empty strings.
```python
    )
    sys.exit(1)

output_jsonl = output_files[-1]  # Use the latest one
```
🟠 Important: sorted()[-1] assumes lexicographic ordering gives "latest" - fragile.
If output_dir contains multiple runs with timestamps or run IDs in different formats, sorted() might not pick the latest. You're relying on filesystem/naming conventions that aren't enforced.
Better: Either error if len(output_files) != 1 (strict), or use file modification time:
output_jsonl = max(output_files, key=lambda p: p.stat().st_mtime)Theory: "we'll only have one output.jsonl per run"
Practice: Multi-attempt runs, retries, or manual re-runs will break this assumption.
```python
output_path = pathlib.Path(output_dir)

# Find any .report.json file (all benchmarks use this naming convention)
report_files = sorted(output_path.rglob("*.report.json"))
```
🟢 Good taste: Explicit check for multiple report files prevents ambiguity.
Unlike run_benchmark.py:247 which silently picks the last file, this correctly errors on ambiguous input (line 19-23). This is the right behavior for validation logic - fail fast and explicit rather than guessing.
```python
metadata_file = output_path / "nemo_metadata.json"
if not metadata_file.exists():
    raise FileNotFoundError(
        f"nemo_metadata.json not found in {output_dir}. "
        "Make sure the benchmark was run via run_benchmark.py."
    )
metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
task_name = metadata["benchmark"]
```
🟡 Suggestion: Tight coupling via metadata file is pragmatic but document the contract.
output.py requires run_benchmark.py to write nemo_metadata.json. This works, but the coupling is implicit - nothing enforces that run_benchmark.py actually writes it, or what schema it uses.
Add a comment or type definition for the metadata schema:
```python
# Expected schema: {"benchmark": str}
metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
```

Or use a simple dataclass/TypedDict to document the contract.
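One lightweight way to follow that suggestion, assuming the single-key schema the review quotes:

```python
import json
from typing import TypedDict

class NemoMetadata(TypedDict):
    """Contract between run_benchmark.py (writer) and output.py (reader)."""
    benchmark: str

raw = '{"benchmark": "swebench"}'  # illustrative payload
metadata: NemoMetadata = json.loads(raw)
print(metadata["benchmark"])  # → swebench
```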
Summary
Adds NVIDIA NeMo Evaluator integration so OpenHands benchmarks can be run via NeMo's evaluation framework. The new `nemo_evaluator` package is an optional dependency (`pip install .[nemo]`) that bridges NeMo to the existing benchmark CLI entrypoints.

New:
- `nemo_evaluator/openhands_benchmarks/` package
- `framework.yml`
- `run_benchmark.py`
- `generate_llm_config.py` (strips the `/chat/completions` suffix for LiteLLM)
- `output.py`: parses `*.report.json` output into NeMo's `EvaluationResult` format (accuracy = resolved/submitted)
New console script entrypoints: `generate-llm-config`, `run-benchmark`.

`pyproject.toml`:
- `nemo = ["nemo_evaluator>=0.1.92,<0.2"]` optional dependency group
- console scripts `generate-llm-config`, `run-benchmark`
- packages: `["benchmarks*", "nemo_evaluator*"]`
- package data: `*.yml` for nemo_evaluator; `*.j2`, `Dockerfile*`, `*.json` for benchmarks
- removed `exclude = ["legacy"]` from ruff config (moved to `.pre-commit-config.yaml` exclude)

`.pre-commit-config.yaml`: added `nemo_evaluator/` to the exclude pattern (NeMo-specific code uses its own conventions)

Previous PRs
Continues from #412 (closed). Refactors extracted into separate PRs:
- `image_exists` → `remote_image_exists` rename (Rename image_exists to remote_image_exists #471)
- `uv run python` → `sys.executable` (Replace `uv run python` with `sys.executable` in eval scripts #472)
- `IMAGE_TAG_PREFIX` (Centralize IMAGE_TAG_PREFIX for Docker image tags #473)
- `load_llm_config` shared utility (Extract shared load_llm_config utility #453)