refactor(swtbench): regroup all hyperparameters in constants.py by simonrosenberg · Pull Request #372 · OpenHands/benchmarks

simonrosenberg · 2026-01-27T18:11:56Z

Summary

This PR creates a single source of truth for all SWTBench hyperparameters and constant values by introducing benchmarks/swtbench/constants.py.

Fixes #364

Changes

1. Created `benchmarks/swtbench/constants.py`

This new file contains all constant values organized into logical categories:

Type Definitions: TargetType literal type alias for build targets
Docker/Image Related Constants: Docker image prefixes, agent server image base, prebaked registry, build targets, image tags
Build Mode: BuildMode enum with API and CLI options
Dataset Related Constants: Default dataset, split, model name
Environment Variable Names: All environment variable names used in the module
Default Values: Default values for various parameters (skip build, runtime API URL, startup timeout, workers, etc.)
File/Directory Paths: Repository directory, evaluation results directory, report filenames
Git/Repository Related Constants: SWT-bench repo URL, git user configuration
Patch Processing Constants: Files to remove from patches (as immutable tuple)
Environment Setup Commands: Default environment setup commands (as immutable tuple)

2. Updated all swtbench modules to import from constants.py

run_infer.py
eval_infer.py
build_eval_env_images.py
image_utils.py

3. Type Safety Improvements

All constants use typing.Final annotations for immutability
Mutable lists converted to immutable tuples:
- SETUP_FILES_TO_REMOVE
- BUILD_MODE_CHOICES
- DEFAULT_ENV_SETUP_COMMANDS
BuildMode enum for type-safe build mode selection
TargetType literal type for build targets (matches openhands.sdk.workspace.TargetType)
Numeric constants use proper int types instead of strings:
- DEFAULT_STARTUP_TIMEOUT: 600 (int)
- DEFAULT_EVAL_WORKERS: 12 (int)

Benefits

Single Source of Truth: All hyperparameters are now in one place, making it easy to review and modify them
Better Maintainability: Changes to constants only need to be made in one location
Improved Discoverability: Developers can easily find all configurable values
Type Safety: Constants are properly typed with Final annotations and appropriate types
Immutability: Collection constants use tuples to prevent accidental mutation
Pyright Compliance: All type checks pass with strict mode

This commit creates a single source of truth for all SWTBench hyperparameters and constant values by: 1. Creating benchmarks/swtbench/constants.py with all constants: - Docker/Image related constants (prefixes, registries, build targets) - Dataset related constants (default dataset, split, model name) - Environment variable names - Default values for various parameters - File/directory paths (repo dir, evaluation results dir, report filename) - Git/repository related constants - Patch processing constants - Environment setup commands 2. Updating all swtbench modules to import from constants.py: - run_infer.py - eval_infer.py - build_eval_env_images.py - image_utils.py This makes it easier to check and modify hyperparameters for the benchmark as they are now centralized in one location. Fixes #364 Co-authored-by: openhands <openhands@all-hands.dev>

…ility - Add typing.Final annotations to all constants for type safety - Convert mutable lists to immutable tuples: - SETUP_FILES_TO_REMOVE - BUILD_MODE_CHOICES - DEFAULT_ENV_SETUP_COMMANDS - Add BuildMode enum for type-safe build mode selection - Convert string numeric constants to proper int types: - DEFAULT_STARTUP_TIMEOUT: '600' -> 600 - DEFAULT_EVAL_WORKERS: '12' -> 12 - Update callers to handle type changes: - eval_infer.py: workers parameter now int, add type=int to argparse - run_infer.py: convert tuple to list for env_setup_commands, convert int to str for os.getenv default Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-01-28T14:23:51Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #372 at branch `openhands/swtbench-constants-refactor`

Feel free to include any additional details that might help me get this PR into a better state.

_{^{You can manage your notification settings}}

- Add TargetType alias to constants.py matching openhands.sdk.workspace.TargetType - Update DEFAULT_BUILD_TARGET to use TargetType instead of str - Update run_infer.py to use TargetType for build_target variable and function parameter Co-authored-by: openhands <openhands@all-hands.dev>

simonrosenberg · 2026-01-28T14:45:05Z

@OpenHands
/codereview-roasted
Make sure this PR doesn't alter any constant or behavior
Suggest next step refactoring for cleaner code

openhands-ai · 2026-01-28T14:45:15Z

I'm on it! simonrosenberg can track my progress at all-hands.dev

openhands-ai · 2026-01-28T14:48:19Z

🔴 Code Review: PR #372 - SWTBench Constants Refactor

Taste Rating: 🟡 Acceptable - Works but has one behavioral change that needs attention

[CRITICAL ISSUES] (Must fix - these break fundamental principles)

[benchmarks/swtbench/eval_infer.py, Lines 246-248] ⚠️ Breaking Change: The workers parameter type changed from str to int:

# Original:
def run_swtbench_evaluation(..., workers: str = "12") -> None:

# New:
def run_swtbench_evaluation(..., workers: int = DEFAULT_EVAL_WORKERS) -> None:

While this works at runtime (Python is dynamically typed), it's a function signature change that could break type-checked callers. The argparse also changed from accepting a string to type=int.

Recommendation: If this is intentional (improving type correctness), document it in the PR description as a breaking change. Otherwise, keep workers: str and use DEFAULT_EVAL_WORKERS_STR: Final[str] = "12".

[IMPROVEMENT OPPORTUNITIES] (Should fix - violates good taste)

[benchmarks/swtbench/constants.py, Line 35] 🔍 Magic Number: IMAGE_NAME_SEPARATOR = "1776" - This cryptic value needs a comment explaining its origin. Is it a date? A convention from SWE-bench? Future maintainers will be confused.

[benchmarks/swtbench/constants.py, Lines 38-49] 🔧 Inconsistent Enum Usage: You define BuildMode enum but then use .value everywhere:

DEFAULT_BUILD_MODE: Final[str] = BuildMode.CLI.value  # Why not just BuildMode.CLI?
BUILD_MODE_CHOICES: Final[Tuple[str, ...]] = tuple(m.value for m in BuildMode)

Consider using the enum directly and converting to string only at the argparse boundary.

[benchmarks/swtbench/constants.py, Line 13] 🔧 Inconsistent Type Definition: TargetType is a Literal while BuildMode is an Enum. For consistency, consider making TargetType an enum too, or document why they differ.

[STYLE NOTES] (Minor observations)

[benchmarks/swtbench/constants.py] 📝 The heavy use of section separators (# ====) is verbose. Consider using a single blank line between sections - the code is already well-organized.

[benchmarks/swtbench/eval_infer.py, Line 214] ✅ SETUP_FILES_TO_REMOVE changed from list to tuple - this is fine since remove_files_from_patch accepts any iterable.

VERDICT:

❌ Needs minor rework: The workers type change from str to int is a behavioral/API change that should be explicitly acknowledged or reverted.

KEY INSIGHT:

The refactoring successfully centralizes constants without changing their values, but inadvertently "improves" the workers parameter type from str to int - which, while arguably better, is a breaking change that violates the "don't alter behavior" requirement.

Suggested Next Steps for Cleaner Code:

Fix the workers type issue - either revert to str or document as intentional breaking change
Add docstring to IMAGE_NAME_SEPARATOR explaining the "1776" convention

Consider grouping related constants into @dataclass(frozen=True) for better organization:

@dataclass(frozen=True)
class DockerConfig:
    swebench_prefix: str = "docker.io/swebench/"
    swtbench_prefix: str = "docker.io/swtbench/"
    agent_server_base: str = "ghcr.io/all-hands-ai/agent-server"

Unify type definitions - either all Literals or all Enums for consistency
Consider adding a __all__ export list to make the public API explicit

View full conversation

Remove the BuildMode enum class and replace with simple string constants. The enum was only used to generate choices and default values, which can be done more simply with a tuple and string constant. Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai bot mentioned this pull request Jan 27, 2026

Regroup all swtbench hyper parameters in a single source of truth benchmarks/swtbench/constants.py #364

Closed

simonrosenberg force-pushed the openhands/swtbench-constants-refactor branch from e275ab3 to 328490e Compare January 28, 2026 10:16

OpenHands deleted a comment from openhands-ai bot Jan 28, 2026

simonrosenberg marked this pull request as draft January 29, 2026 20:12

simonrosenberg closed this Feb 13, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(swtbench): regroup all hyperparameters in constants.py#372

refactor(swtbench): regroup all hyperparameters in constants.py#372
simonrosenberg wants to merge 4 commits intomainfrom
openhands/swtbench-constants-refactor

simonrosenberg commented Jan 27, 2026 •

edited

Loading

Uh oh!

openhands-ai bot commented Jan 28, 2026

Uh oh!

simonrosenberg commented Jan 28, 2026

Uh oh!

openhands-ai bot commented Jan 28, 2026

Uh oh!

openhands-ai bot commented Jan 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

simonrosenberg commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

1. Created benchmarks/swtbench/constants.py

2. Updated all swtbench modules to import from constants.py

3. Type Safety Improvements

Benefits

Uh oh!

openhands-ai bot commented Jan 28, 2026

Uh oh!

simonrosenberg commented Jan 28, 2026

Uh oh!

openhands-ai bot commented Jan 28, 2026

Uh oh!

openhands-ai bot commented Jan 28, 2026

🔴 Code Review: PR #372 - SWTBench Constants Refactor

Taste Rating: 🟡 Acceptable - Works but has one behavioral change that needs attention

[CRITICAL ISSUES] (Must fix - these break fundamental principles)

[IMPROVEMENT OPPORTUNITIES] (Should fix - violates good taste)

[STYLE NOTES] (Minor observations)

VERDICT:

KEY INSIGHT:

Suggested Next Steps for Cleaner Code:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

simonrosenberg commented Jan 27, 2026 •

edited

Loading

1. Created `benchmarks/swtbench/constants.py`