
fix: Reuse pre-built SDK sdist to speed up image builds#499

Draft
juanmichelini wants to merge 3 commits into main from fix-build-sdist-reuse

Conversation

@juanmichelini
Collaborator

@juanmichelini juanmichelini commented Mar 10, 2026

Root Cause Analysis:

  1. Trigger: A commit updated the SDK submodule to a new commit.
  2. Tag Mismatch: Benchmark image tags include the SDK SHA. Because the SHA changed, the image-existence check misses for all 500 images, forcing a full rebuild of every image.
  3. Build Overhead: Each image build calls build_sdist_package(), which runs uv build.
  4. The Bottleneck: uv build takes about 15 seconds per run and outputs verbose logs (~1,200 lines of warnings).
    • Execution Time: 500 images * 15 seconds = ~2 hours of pure CPU/IO overhead just for packaging the SDK.
    • Logging Spam: 500 images * ~1,200 lines = ~600,000 lines of logs. This overwhelmed the previous attempt to fix it via log suppression (PR #487, "Disable rich logging by default to fix multiprocessing deadlock"), because suppressing logs doesn't remove the 15-second execution time.
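The overhead figures above are easy to sanity-check (numbers taken straight from the analysis):

```python
# Back-of-envelope check of the build overhead described above.
builds = 500
seconds_per_build = 15
lines_per_build = 1200

total_hours = builds * seconds_per_build / 3600
total_log_lines = builds * lines_per_build
print(f"{total_hours:.2f} hours, {total_log_lines:,} log lines")
# prints "2.08 hours, 600,000 log lines"
```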

The Fix:
This PR optimizes the build process by reusing the SDK sdist:

  1. Build Once: It builds the SDK sdist once at the start of the batch build process.
  2. Reuse: It monkeypatches the build process in worker threads to skip running uv build and instead copy the pre-built sdist.
  3. Result: This eliminates the 15-second overhead per image (saving ~2 hours) and completely removes the logging spam.

This reduces the per-image packaging step from ~15s (noisy) to <1s (silent).
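The reuse mechanism can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual benchmarks code: `Builder`, `build_sdist`, and `reuse_prebuilt_sdist` are hypothetical names.

```python
import shutil
from pathlib import Path


class Builder:
    """Stand-in for the per-image build helper (hypothetical name)."""

    def build_sdist(self, dest: Path) -> Path:
        # Normally this would shell out to `uv build` (~15s per call). Elided.
        raise NotImplementedError


def reuse_prebuilt_sdist(builder_cls, sdist_path: Path) -> None:
    """Monkeypatch build_sdist so every worker copies the cached sdist."""

    def copy_prebuilt(self, dest: Path) -> Path:
        # Fast file copy instead of re-running the packaging step.
        dest.mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy(sdist_path, dest))

    builder_cls.build_sdist = copy_prebuilt
```

After `reuse_prebuilt_sdist(Builder, cached_sdist)` runs, each worker's `build_sdist` call becomes a sub-second copy of the cached artifact.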

Why not change the SDK?
The SDK change that introduced this build step is valid for standalone builds. However, for mass-build scenarios like benchmarks, we need this optimization without forcing upstream changes in the SDK repo. Monkeypatching is a pragmatic way to inject the optimization locally within the benchmarks tooling.

Avoids running 'uv build' 500 times (once per image), saving ~15s per image and eliminating logging spam.
@juanmichelini
Collaborator Author

🚨 Evaluation OOM Failure Report

I tested this branch with two evaluation runs, and both failed with Out-Of-Memory (OOM) errors before any evaluations could start.

Test Runs

  1. First Attempt: Workflow 22906638498

    • Pod: eval-22906638498-qwen3-5-fl-pntvh
    • Result: Disappeared after 1h11m with 0 output (likely OOM)
  2. Second Attempt: Workflow 22921567775

    • Pod: eval-22921567775-qwen3-5-fl-vfrwq
    • Benchmarks Build: ✅ Completed successfully (50m48s)
    • Result: Pod evicted due to OOM after 3 minutes

OOM Details (Pod: eval-22921567775-qwen3-5-fl-vfrwq)

Kubernetes Event:

Warning   Evicted   pod/eval-22921567775-qwen3-5-fl-vfrwq   
The node was low on resource: memory. Threshold quantity: 100Mi, 
available: 79036Ki. Container eval-container was using 11461612Ki, 
request is 8Gi, has larger consumption of memory.

Memory Stats:

  • Container Request: 8 GB
  • Container Usage: 11.4 GB (143% of request!)
  • Available on Node: Only 79 MB
  • Result: Pod EVICTED

Timeline:

  • 20:47 UTC - Benchmarks build completed, evaluation started
  • 20:47-20:50 UTC - Pod installing system packages (apt-get)
  • 20:50 UTC - Pod evicted (OOM)
  • Duration: Only 3 minutes from start to eviction

Last Captured Log Lines:

Get:3 http://deb.debian.org/debian-security trixie-security InRelease [43.4 kB]
Get:4 http://deb.debian.org/debian trixie/main amd64 Packages [9670 kB]
Get:5 http://deb.debian.org/debian trixie-updates/main amd64 Packages [5412 B]
Get:6 http://deb.debian.org/debian-security trixie-security/main amd64 Packages [108 kB]
Fetched 9875 kB in 1s (10.8 MB/s)

The pod died very early - during initial package installation, before any evaluation logic ran.

Code Changes Analysis

I compared this branch with main and found these key changes:

  1. SDK Sdist Pre-building (the optimization this PR introduces)

    • Creates temp directory: Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
    • Builds SDK once, reuses for all images
    • ⚠️ Potential issue: If /tmp is mounted as tmpfs (in-memory), this could contribute to memory usage
  2. Image Check Function Rename

    • remote_image_exists() → image_exists()
    • ⚠️ Potential issue: May change how images are checked/loaded during evaluation

Theories on Root Cause

Most Likely: The memory spike happens during evaluation initialization, possibly due to:

  • Aggressive Docker image pulling/loading (loading many images at once instead of streaming)
  • Temp directory accumulation in memory (if tmpfs)
  • Image metadata caching without cleanup
  • The image_exists() behavior change affecting evaluation startup

Evidence Supporting This:

  • ✅ OOM happens in 3 minutes (very fast)
  • ✅ Dies during setup phase, not during evaluation
  • ✅ Consistent across both test attempts
  • ✅ 11.4 GB = reasonable size for Docker images/cache
  • ✅ Never reaches evaluation code (0, 0, 0, 0 output)

Comparison with Working Evals

Other pods running on main branch:

  • eval-22906539712-glm-5-ffvfq: Running 7h+ with output 25, 40, 0, 0
  • eval-22922402157-claude-4-6-nzvlj: Running 1h+ with output 3, 3, 0, 0

These successfully run evaluations without OOM issues.

Questions for Debugging

  1. Does the evaluation code also call build_sdist_package() or create temp directories?

    • If yes, and /tmp is tmpfs, that's memory pressure
  2. Does image_exists() behavior differ from remote_image_exists() during evaluation?

    • Could it be checking/loading images differently?
  3. Are there any changes in how Docker images are pulled/loaded in eval startup?

    • The 11.4 GB in 3 minutes suggests aggressive image operations
  4. Is there a way to test this locally with memory profiling?

    # Monitor memory during eval startup
    watch -n 1 'free -h; docker stats --no-stream'

Recommendation

The build optimization looks good (benchmarks build succeeded), but something in this branch causes evaluation pods to OOM during initialization. I'd suggest:

  1. Profile memory usage during evaluation startup on this branch
  2. Compare with main branch behavior
  3. Check if temp directories or image caching are the culprit
  4. Consider if this optimization should only apply to builds, not evaluations

Full investigation details available in these files:

  • fix-build-sdist-reuse_failure_analysis.md
  • oom_root_cause_analysis.md
  • Workflow runs linked above

@juanmichelini
Collaborator Author

@OpenHands please analyze why we got OOM twice, then update the PR to address the most likely cause.
The image_exists change seems unlikely to be the cause, since it is not part of this PR diff.

@openhands-ai

openhands-ai bot commented Mar 10, 2026

I'm on it! juanmichelini can track my progress at all-hands.dev

Convert build_sdist_package() to a context manager that:
1. Creates temp directory for sdist build
2. Yields the path for use in builds
3. Ensures cleanup in finally block (shutil.rmtree)

This addresses potential OOM issues by guaranteeing cleanup even if
builds fail or are interrupted.

Co-authored-by: openhands <openhands@all-hands.dev>
Collaborator Author

OOM Analysis & Fix

Analysis Summary

I analyzed the OOM failures and the code changes in this PR. Here are my findings:

Key Observations:

  1. The OOM occurred in the evaluation pod (not the build pod)
  2. The build completed successfully in 50m48s
  3. The evaluation pod died in only 3 minutes during apt-get installation
  4. Memory usage was 11.4 GB (143% of the 8 GB request)

Code Path Analysis:

  • The build_all_images() function (batch builds) uses build_sdist_package()
  • The ensure_local_image() function (called during evaluation) does NOT call build_sdist_package() - it receives sdist_path=None by default
  • When sdist_path=None, the patch_uv_build() context manager is a no-op

Identified Issue:
While the code paths suggest the evaluation shouldn't be directly affected by the sdist optimization, I found that the temp directory created by build_sdist_package() was never cleaned up:

cache_dir = Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
# ... no cleanup

If /tmp is mounted as tmpfs (common in Kubernetes), accumulated temp directories could contribute to memory pressure, especially on shared runners with sticky caches.
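The tmpfs theory can be checked from inside the pod by inspecting /proc/mounts. A small self-contained helper (hypothetical; it just parses the standard Linux mount-table format):

```python
def tmp_is_tmpfs(mounts_text: str, mount_point: str = "/tmp") -> bool:
    """Return True if mount_point is mounted as tmpfs, per /proc/mounts text.

    Each /proc/mounts line has the form:
        device mountpoint fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            return fields[2] == "tmpfs"
    # No dedicated mount: /tmp is backed by the root filesystem, not RAM.
    return False


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print(tmp_is_tmpfs(f.read()))
```

If this prints True, anything written under /tmp counts against the container's memory cgroup, so leaked temp directories translate directly into memory pressure.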

The Fix

I converted build_sdist_package() from a regular function to a context manager that:

  1. Creates the temp directory
  2. Builds the sdist
  3. Yields the path for use during builds
  4. Guarantees cleanup in a finally block using shutil.rmtree()
import contextlib
import shutil
import tempfile
from pathlib import Path

@contextlib.contextmanager
def build_sdist_package():
    cache_dir = Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
    try:
        # ... build sdist into cache_dir, producing sdist_path
        yield sdist_path
    finally:
        # Clean up the temp directory
        if cache_dir.exists():
            logger.info(f"Cleaning up sdist cache: {cache_dir}")
            shutil.rmtree(cache_dir, ignore_errors=True)
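The mkdtemp + try/finally pattern can be demonstrated in a fully self-contained form (hypothetical names; the real `uv build` step is replaced here by writing a placeholder file):

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def sdist_cache():
    # mkdtemp + try/finally guarantees the directory is removed even if
    # a build inside the `with` block raises or is interrupted.
    cache_dir = Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
    try:
        sdist_path = cache_dir / "sdk.tar.gz"
        sdist_path.write_bytes(b"placeholder")  # stands in for `uv build` output
        yield sdist_path
    finally:
        shutil.rmtree(cache_dir, ignore_errors=True)
```

Usage: `with sdist_cache() as sdist_path: ...` keeps the artifact alive for the duration of the batch build; once the block exits, by any path, the cache directory is gone.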

Note on OOM Root Cause

Caveat: While the cleanup fix is a definite improvement, the OOM during evaluation (which dies during apt-get, before benchmark code runs) may have a different root cause. Possible factors:

  • The evaluation pod and build pod run in different environments
  • 11.4 GB during apt-get is unusual and suggests something else consuming memory
  • Could be related to Docker image size or Kubernetes configuration

The cleanup fix ensures proper resource management during builds, but if OOMs persist after this fix, we should investigate other factors like:

  • Container memory limits
  • Base image changes
  • Kubernetes node pressure

Commit: 93ebaf6

@openhands-ai

openhands-ai bot commented Mar 10, 2026

Summary of Work Completed

Analysis

I analyzed the OOM failures reported in PR #499. Key findings:

  • The OOM occurred in the evaluation pod (not build pod), dying during apt-get with 11.4 GB memory usage
  • The build completed successfully (50m48s)
  • The evaluation code path doesn't directly call build_sdist_package() since ensure_local_image() defaults sdist_path=None

Issue Identified

The original build_sdist_package() function created a temp directory with tempfile.mkdtemp() but never cleaned it up - a resource leak that could contribute to memory pressure if /tmp is mounted as tmpfs.

Fix Applied

Converted build_sdist_package() from a regular function to a context manager that:

  1. Creates the temp directory
  2. Builds the sdist
  3. Yields the path for use during builds
  4. Guarantees cleanup in a finally block using shutil.rmtree()

Verification

  • ✅ Pre-commit checks pass (ruff format, ruff lint, pycodestyle, pyright)
  • ✅ Changes are concise - only modifying build_utils.py
  • ✅ Commit pushed to fix-build-sdist-reuse branch
  • ✅ Comment posted on PR explaining analysis and fix

Caveat

While this fix addresses a real resource leak, the OOM happening during apt-get (before benchmark code runs) in a separate evaluation pod may have additional contributing factors. If OOMs persist, further investigation into container limits or Kubernetes configuration may be needed.

Commit: 93ebaf6
PR: #499

Co-authored-by: openhands <openhands@all-hands.dev>