
Fix fork+threading deadlock in SWE-Bench image builds #486

Draft

juanmichelini wants to merge 1 commit into main from openhands/fix-image-build-deadlock


@juanmichelini (Collaborator)

Summary

This PR fixes the root cause of issue #476 where SWE-Bench image builds get stuck for 5+ hours during the batch build process.

Root Cause

Commit 2bfcc6c (#456) introduced the following import in benchmarks/utils/image_utils.py:

```python
from openhands.sdk import get_logger

logger = get_logger(__name__)
```

When this module is imported, the SDK logger auto-initializes with a RichHandler (from the rich library), which uses locks and potentially threads for console rendering. The issue occurs when:
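RichHandler is a standard logging.Handler subclass, and every handler allocates an internal lock that is held while a record is emitted; it is this lock state that a later fork() copies. A minimal stdlib illustration (StreamHandler is used as a stand-in, since the mechanism lives in the Handler base class):

```python
import logging

# Every logging.Handler (RichHandler included) creates an internal
# RLock via createLock(); emit() holds it while writing output.
handler = logging.StreamHandler()
lock_type = type(handler.lock).__name__
print(lock_type)  # RLock
```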

  1. build_utils.py imports image_utils, triggering logger initialization in the parent process
  2. build_all_images() creates a ProcessPoolExecutor using fork (default on Linux)
  3. Fork copies the parent process, including Rich logger's locks in their current state
  4. Child processes deadlock waiting for locks that will never be released

This is the same fork+threading deadlock that commit 744df225 (#459) fixed in evaluation.py; that fix, however, was applied only to evaluation.py and never reached build_utils.py.

Evidence

From commit 744df225 (#459):

Root cause: ProcessPoolExecutor uses fork() by default on Linux, which is unsafe when the parent process has threads. When fork() copies a process, it copies locks in their current state. If a thread holds a lock during fork(), the child process deadlocks waiting for that lock forever.

Evidence from Datadog logs:

  • Warning: 'This process is multi-threaded, use of fork() may lead to deadlocks'
  • Workers stuck in futex_wait_queue (mutex wait)
  • 84/500 instances succeeded before deadlock (timing-dependent)

The same pattern applies to image building:

  • Multiple workers (12 by default) get stuck in deadlock
  • The number of images built before deadlock is timing-dependent
  • Builds hang for hours without completing

Solution

Use spawn multiprocessing context instead of fork in build_all_images(), matching the fix from commit 744df225 for evaluation.py.

The spawn method starts fresh Python processes instead of forking, avoiding the fork+threads deadlock entirely.

Changes

  • Added import multiprocessing to build_utils.py
  • Modified ProcessPoolExecutor instantiation to use mp_context=multiprocessing.get_context("spawn")
  • Added explanatory comments matching the pattern from evaluation.py

Testing

✅ All pre-commit checks pass
✅ Type checking passes
✅ Linting passes

The fix should be verified by running a full 500-image build on GitHub Actions, which was previously getting stuck.

Related

Root cause: Commit 2bfcc6c added 'from openhands.sdk import get_logger'
to image_utils.py, which auto-initializes RichHandler with locks/threads.
When build_utils.py uses ProcessPoolExecutor with fork (default on Linux),
it copies the process with Rich logger's locks in their current state,
causing child processes to deadlock waiting for locks that will never be
released.

Solution: Use 'spawn' multiprocessing context instead of 'fork' in
build_all_images(), matching the fix in commit 744df22 for evaluation.py.

This prevents the fork+threading deadlock by starting fresh Python
processes instead of forking processes with inherited thread state.

Fixes #476

Co-authored-by: openhands <openhands@all-hands.dev>
