fix: Reuse pre-built SDK sdist to speed up image builds#499
fix: Reuse pre-built SDK sdist to speed up image builds#499juanmichelini wants to merge 3 commits intomainfrom
Conversation
Avoids running 'uv build' 500 times (once per image), saving ~15s per image and eliminating logging spam.
🚨 Evaluation OOM Failure ReportI tested this branch with two evaluation runs, and both failed with Out-Of-Memory (OOM) errors before any evaluations could start. Test Runs
OOM Details (Pod: eval-22921567775-qwen3-5-fl-vfrwq)Kubernetes Event: Memory Stats:
Timeline:
Last Captured Log Lines: The pod died very early - during initial package installation, before any evaluation logic ran. Code Changes AnalysisI compared this branch with main and found these key changes:
Theories on Root CauseMost Likely: The memory spike happens during evaluation initialization, possibly due to:
Evidence Supporting This:
Comparison with Working EvalsOther pods running on main branch:
These successfully run evaluations without OOM issues. Questions for Debugging
RecommendationThe build optimization looks good (benchmarks build succeeded), but something in this branch causes evaluation pods to OOM during initialization. I'd suggest:
Full investigation details available in these files:
|
|
@OpenHands please analyze why we got OOM twice and then update the PR to address the most likely cause. |
|
I'm on it! juanmichelini can track my progress at all-hands.dev |
Convert build_sdist_package() to a context manager that: 1. Creates temp directory for sdist build 2. Yields the path for use in builds 3. Ensures cleanup in finally block (shutil.rmtree) This addresses potential OOM issues by guaranteeing cleanup even if builds fail or are interrupted. Co-authored-by: openhands <openhands@all-hands.dev>
OOM Analysis & FixAnalysis SummaryI analyzed the OOM failures and the code changes in this PR. Here are my findings: Key Observations:
Code Path Analysis:
Identified Issue: cache_dir = Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
# ... no cleanupIf The FixI converted
@contextlib.contextmanager
def build_sdist_package():
cache_dir = Path(tempfile.mkdtemp(prefix="oh-sdist-cache-"))
try:
# ... build sdist
yield sdist_path
finally:
# Clean up the temp directory
if cache_dir.exists():
logger.info(f"Cleaning up sdist cache: {cache_dir}")
shutil.rmtree(cache_dir, ignore_errors=True)Note on OOM Root CauseCaveat: While the cleanup fix is a definite improvement, the OOM during evaluation (which dies during
The cleanup fix ensures proper resource management during builds, but if OOMs persist after this fix, we should investigate other factors like:
Commit: |
Summary of Work CompletedAnalysisI analyzed the OOM failures reported in PR #499. Key findings:
Issue IdentifiedThe original Fix AppliedConverted
Verification
CaveatWhile this fix addresses a real resource leak, the OOM happening during Commit: |
Co-authored-by: openhands <openhands@all-hands.dev>
|
testing again here https://github.com/OpenHands/software-agent-sdk/actions/runs/22929111032 |
Root Cause Analysis:
The Fix:
This PR optimizes the build process by reusing the SDK :
This reduces the step from ~15s (noisy) to <1s (silent).
Why not change the SDK?
The SDK change that introduced is valid for standalone builds. However, for mass-build scenarios like benchmarks, we need this optimization without forcing changes upstream in the SDK repo. Monkeypatching is a pragmatic solution to inject this optimization locally within the benchmarks tooling.