
fix: Reuse pre-built SDK sdist to speed up image builds#499

Draft
juanmichelini wants to merge 3 commits into main from fix-build-sdist-reuse

Conversation

@juanmichelini
Collaborator

@juanmichelini juanmichelini commented Mar 10, 2026

Root Cause Analysis:

  1. Trigger: A commit updated the SDK submodule to a new commit.
  2. Tag Mismatch: Benchmark image tags include the SDK SHA. Because the SHA changed, the image-existence check misses for all 500 images, forcing a full rebuild of every image.
  3. Build Overhead: Each image build calls build_sdist_package(), which runs uv build.
  4. The Bottleneck: uv build takes about 15 seconds per run and outputs verbose logs (~1,200 lines of warnings).
    • Execution Time: 500 images * 15 seconds = ~2 hours of pure CPU/IO overhead just for packaging the SDK.
    • Logging Spam: 500 images * ~1,200 lines = ~600,000 lines of logs. This overwhelmed the previous attempt to fix it via log suppression (PR #487, "Disable rich logging by default to fix multiprocessing deadlock"), because suppressing logs doesn't remove the 15-second execution time.
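The overhead figures above are easy to sanity-check (numbers taken straight from the analysis):

```python
# Back-of-envelope check of the build overhead described above.
builds = 500
seconds_per_build = 15
lines_per_build = 1200

total_hours = builds * seconds_per_build / 3600
total_log_lines = builds * lines_per_build
print(f"{total_hours:.2f} hours, {total_log_lines:,} log lines")
# prints "2.08 hours, 600,000 log lines"
```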

The Fix:
This PR optimizes the build process by reusing the SDK sdist:

  1. Build Once: It builds the SDK sdist once at the start of the batch build process.
  2. Reuse: It monkeypatches the build process in worker threads to skip running uv build and instead copy the pre-built sdist.
  3. Result: This eliminates the 15-second overhead per image (saving ~2 hours) and completely removes the logging spam.

This reduces the per-image packaging step from ~15s (noisy) to <1s (silent).
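The reuse mechanism can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual benchmarks code: `Builder`, `build_sdist`, and `reuse_prebuilt_sdist` are hypothetical names.

```python
import shutil
from pathlib import Path


class Builder:
    """Stand-in for the per-image build helper (hypothetical name)."""

    def build_sdist(self, dest: Path) -> Path:
        # Normally this would shell out to `uv build` (~15s per call). Elided.
        raise NotImplementedError


def reuse_prebuilt_sdist(builder_cls, sdist_path: Path) -> None:
    """Monkeypatch build_sdist so every worker copies the cached sdist."""

    def copy_prebuilt(self, dest: Path) -> Path:
        # Fast file copy instead of re-running the packaging step.
        dest.mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy(sdist_path, dest))

    builder_cls.build_sdist = copy_prebuilt
```

After `reuse_prebuilt_sdist(Builder, cached_sdist)` runs, each worker's `build_sdist` call becomes a sub-second copy of the cached artifact.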

Why not change the SDK?
The SDK change that introduced this build step is valid for standalone builds. However, for mass-build scenarios like benchmarks, we need this optimization without forcing upstream changes in the SDK repo. Monkeypatching is a pragmatic way to inject the optimization locally within the benchmarks tooling.

Avoids running 'uv build' 500 times (once per image), saving ~15s per image and eliminating logging spam.
@juanmichelini
Collaborator Author

🚨 Evaluation OOM Failure Report

I tested this branch with two evaluation runs, and both failed with Out-Of-Memory (OOM) errors before any evaluations could start.

Test Runs

  1. First Attempt: Workflow 22906638498

    • Pod: eval-22906638498-qwen3-5-fl-pntvh
    • Result: Disappeared after 1h11m with 0 output (likely OOM)
  2. Second Attempt: Workflow 22921567775

    • Pod: eval-22921567775-qwen3-5-fl-vfrwq
    • Benchmarks Build: ✅ Completed successfully (50m48s)
    • Result: Pod evicted due to OOM after 3 minutes

OOM Details (Pod: eval-22921567775-qwen3-5-fl-vfrwq)

Kubernetes Event:

Warning   Evicted   pod/eval-22921567775-qwen3-5-fl-vfrwq   
The node was low on resource: memory. Threshold quantity: 100Mi, 
available: 79036Ki. Container eval-container was using 11461612Ki, 
request is 8Gi, has larger consumption of memory.

Memory Stats:

  • Container Request: 8 GB
  • Container Usage: 11.4 GB (143% of request!)
  • Available on Node: Only 79 MB
  • Result: Pod EVICTED

Timeline:

  • 20:47 UTC - Benchmarks build completed, evaluation started
  • 20:47-20:50 UTC - Pod installing system packages (apt-get)
  • 20:50 UTC - Pod evicted (OOM)
  • Duration: Only 3 minutes from start to eviction

Last Captured Log Lines:

Get:3 http://deb.debian.org/debian-security trixie-security InRelease [43.4 kB]
Get:4 http://deb.debian.org/debian trixie/main amd64 Packages [9670 kB]
Get:5 http://deb.debian.org/debian trixie-updates/main amd64 Packages [5412 B]
Get:6 http://deb.debian.org/debian-security trixie-security/main amd64 Packages [108 kB]
Fetched 9875 kB in 1s (10.8 MB/s)

The pod died very early - during initial package installation, before any evaluation logic ran.

Code Changes Analysis

I compared this branch with main and found these key changes:

  1. SDK Sdist Pre-building (the optimization this PR introduces)

    • Creates temp directory: Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
    • Builds SDK once, reuses for all images
    • ⚠️ Potential issue: If /tmp is mounted as tmpfs (in-memory), this could contribute to memory usage
  2. Image Check Function Rename

    • remote_image_exists() → image_exists()
    • ⚠️ Potential issue: May change how images are checked/loaded during evaluation

Theories on Root Cause

Most Likely: The memory spike happens during evaluation initialization, possibly due to:

  • Aggressive Docker image pulling/loading (loading many images at once instead of streaming)
  • Temp directory accumulation in memory (if tmpfs)
  • Image metadata caching without cleanup
  • The image_exists() behavior change affecting evaluation startup

Evidence Supporting This:

  • ✅ OOM happens in 3 minutes (very fast)
  • ✅ Dies during setup phase, not during evaluation
  • ✅ Consistent across both test attempts
  • ✅ 11.4 GB = reasonable size for Docker images/cache
  • ✅ Never reaches evaluation code (0, 0, 0, 0 output)

Comparison with Working Evals

Other pods running on main branch:

  • eval-22906539712-glm-5-ffvfq: Running 7h+ with output 25, 40, 0, 0
  • eval-22922402157-claude-4-6-nzvlj: Running 1h+ with output 3, 3, 0, 0

These successfully run evaluations without OOM issues.

Questions for Debugging

  1. Does the evaluation code also call build_sdist_package() or create temp directories?

    • If yes, and /tmp is tmpfs, that's memory pressure
  2. Does image_exists() behavior differ from remote_image_exists() during evaluation?

    • Could it be checking/loading images differently?
  3. Are there any changes in how Docker images are pulled/loaded in eval startup?

    • The 11.4 GB in 3 minutes suggests aggressive image operations
  4. Is there a way to test this locally with memory profiling?

    # Monitor memory during eval startup
    watch -n 1 'free -h; docker stats --no-stream'

Recommendation

The build optimization looks good (benchmarks build succeeded), but something in this branch causes evaluation pods to OOM during initialization. I'd suggest:

  1. Profile memory usage during evaluation startup on this branch
  2. Compare with main branch behavior
  3. Check if temp directories or image caching are the culprit
  4. Consider if this optimization should only apply to builds, not evaluations

Full investigation details available in these files:

  • fix-build-sdist-reuse_failure_analysis.md
  • oom_root_cause_analysis.md
  • Workflow runs linked above

@juanmichelini
Collaborator Author

@OpenHands please analyze why we got OOM twice, then update the PR to address the most likely cause.
The image_exists change seems unlikely to be the cause, since it is not part of this PR diff.

@openhands-ai

openhands-ai bot commented Mar 10, 2026

I'm on it! juanmichelini can track my progress at all-hands.dev

Convert build_sdist_package() to a context manager that:
1. Creates temp directory for sdist build
2. Yields the path for use in builds
3. Ensures cleanup in finally block (shutil.rmtree)

This addresses potential OOM issues by guaranteeing cleanup even if
builds fail or are interrupted.

Co-authored-by: openhands <openhands@all-hands.dev>
Collaborator Author

OOM Analysis & Fix

Analysis Summary

I analyzed the OOM failures and the code changes in this PR. Here are my findings:

Key Observations:

  1. The OOM occurred in the evaluation pod (not the build pod)
  2. The build completed successfully in 50m48s
  3. The evaluation pod died in only 3 minutes during apt-get installation
  4. Memory usage was 11.4 GB (143% of the 8 GB request)

Code Path Analysis:

  • The build_all_images() function (batch builds) uses build_sdist_package()
  • The ensure_local_image() function (called during evaluation) does NOT call build_sdist_package() - it receives sdist_path=None by default
  • When sdist_path=None, the patch_uv_build() context manager is a no-op

Identified Issue:
While the code paths suggest the evaluation shouldn't be directly affected by the sdist optimization, I found that the temp directory created by build_sdist_package() was never cleaned up:

cache_dir = Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
# ... no cleanup

If /tmp is mounted as tmpfs (common in Kubernetes), accumulated temp directories could contribute to memory pressure, especially on shared runners with sticky caches.
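The tmpfs theory can be checked from inside the pod by inspecting /proc/mounts. A small self-contained helper (hypothetical; it just parses the standard Linux mount-table format):

```python
def tmp_is_tmpfs(mounts_text: str, mount_point: str = "/tmp") -> bool:
    """Return True if mount_point is mounted as tmpfs, per /proc/mounts text.

    Each /proc/mounts line has the form:
        device mountpoint fstype options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            return fields[2] == "tmpfs"
    # No dedicated mount: /tmp is backed by the root filesystem, not RAM.
    return False


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print(tmp_is_tmpfs(f.read()))
```

If this prints True, anything written under /tmp counts against the container's memory cgroup, so leaked temp directories translate directly into memory pressure.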

The Fix

I converted build_sdist_package() from a regular function to a context manager that:

  1. Creates the temp directory
  2. Builds the sdist
  3. Yields the path for use during builds
  4. Guarantees cleanup in a finally block using shutil.rmtree()
import contextlib
import shutil
import tempfile
from pathlib import Path

@contextlib.contextmanager
def build_sdist_package():
    cache_dir = Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
    try:
        # ... build sdist into cache_dir, producing sdist_path
        yield sdist_path
    finally:
        # Clean up the temp directory
        if cache_dir.exists():
            logger.info(f"Cleaning up sdist cache: {cache_dir}")
            shutil.rmtree(cache_dir, ignore_errors=True)
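The mkdtemp + try/finally pattern can be demonstrated in a fully self-contained form (hypothetical names; the real `uv build` step is replaced here by writing a placeholder file):

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def sdist_cache():
    # mkdtemp + try/finally guarantees the directory is removed even if
    # a build inside the `with` block raises or is interrupted.
    cache_dir = Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
    try:
        sdist_path = cache_dir / "sdk.tar.gz"
        sdist_path.write_bytes(b"placeholder")  # stands in for `uv build` output
        yield sdist_path
    finally:
        shutil.rmtree(cache_dir, ignore_errors=True)
```

Usage: `with sdist_cache() as sdist_path: ...` keeps the artifact alive for the duration of the batch build; once the block exits, by any path, the cache directory is gone.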

Note on OOM Root Cause

Caveat: While the cleanup fix is a definite improvement, the OOM during evaluation (which dies during apt-get, before benchmark code runs) may have a different root cause. Possible factors:

  • The evaluation pod and build pod run in different environments
  • 11.4 GB during apt-get is unusual and suggests something else consuming memory
  • Could be related to Docker image size or Kubernetes configuration

The cleanup fix ensures proper resource management during builds, but if OOMs persist after this fix, we should investigate other factors like:

  • Container memory limits
  • Base image changes
  • Kubernetes node pressure

Commit: 93ebaf6

@openhands-ai

openhands-ai bot commented Mar 10, 2026

Summary of Work Completed

Analysis

I analyzed the OOM failures reported in PR #499. Key findings:

  • The OOM occurred in the evaluation pod (not build pod), dying during apt-get with 11.4 GB memory usage
  • The build completed successfully (50m48s)
  • The evaluation code path doesn't directly call build_sdist_package() since ensure_local_image() defaults sdist_path=None

Issue Identified

The original build_sdist_package() function created a temp directory with tempfile.mkdtemp() but never cleaned it up - a resource leak that could contribute to memory pressure if /tmp is mounted as tmpfs.

Fix Applied

Converted build_sdist_package() from a regular function to a context manager that:

  1. Creates the temp directory
  2. Builds the sdist
  3. Yields the path for use during builds
  4. Guarantees cleanup in a finally block using shutil.rmtree()

Verification

  • ✅ Pre-commit checks pass (ruff format, ruff lint, pycodestyle, pyright)
  • ✅ Changes are concise - only modifying build_utils.py
  • ✅ Commit pushed to fix-build-sdist-reuse branch
  • ✅ Comment posted on PR explaining analysis and fix

Caveat

While this fix addresses a real resource leak, the OOM happening during apt-get (before benchmark code runs) in a separate evaluation pod may have additional contributing factors. If OOMs persist, further investigation into container limits or Kubernetes configuration may be needed.

Commit: 93ebaf6
PR: #499

Co-authored-by: openhands <openhands@all-hands.dev>