ISSUE-4357: Fix shared _temp race between cancelled and new run-service workers#4371
Open
herbenderbler wants to merge 1 commit intoactions:mainfrom
Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a race in run-service job dispatch where a canceled worker could still be tearing down (and cleaning <work>/_temp) while a new worker starts, deleting the new worker’s _runner_file_commands/* files mid-job.
Changes:
- Update
JobDispatcher.EnsureDispatchFinishedfor run-service jobs to cancel the previous worker and wait (up to 45s) for the previousWorkerDispatchto finish before proceeding. - Add an L0 regression test that verifies
EnsureDispatchFinisheddoes not complete until the previous worker dispatch task completes.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/Runner.Listener/JobDispatcher.cs |
Adds a 45s wait in the run-service cancellation path to prevent overlapping workers from racing on the shared _temp directory. |
src/Test/L0/Listener/JobDispatcherL0.cs |
Adds a regression test that drives the run-service EnsureDispatchFinished path and asserts it awaits the previous worker’s completion. |
Restructure EnsureDispatchFinished so run-service and shutdown cancel paths fall through to the common try/catch/finally (log faulted WorkerDispatch; _jobInfos TryRemove/Dispose) instead of returning before it. Dispose reflected WorkerDispatcher in the regression test. Fixes actions#4357
9bfa7b9 to
37dea8e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fixes #4357.
For run-service jobs,
JobDispatcher.EnsureDispatchFinishedcancelled the previous worker and returned without awaiting its exit. The newRunner.Workerthen spawned while the previous worker was still tearing down. Both processes share<work>/_temp, so the exiting worker'sTempDirectoryManager.CleanupTempDirectory()wiped the new worker's_runner_file_commands/*pipes mid-job, failing it withMissing file at path: .../set_output_<uuid>and exit code102.Implementation Details
src/Runner.Listener/JobDispatcher.cs: in the_isRunServiceJobbranch ofEnsureDispatchFinished, after cancelling the previous worker,await Task.WhenAny(jobDispatch.WorkerDispatch, Task.Delay(45s))and throwInvalidOperationExceptionon timeout — matching the existing non-run-service branch in the same method.src/Test/L0/Listener/JobDispatcherL0.cs: addEnsureDispatchFinishedAwaitsPreviousWorkerForRunServiceJobregression test (drives the path with a controlledTaskCompletionSource; verified to fail without the fix and pass with it).Checklist
./dev.sh formatis a no-op on the diff.JobDispatcherL0suite (14 tests) passes locally.main._templayout change.