Fix race condition between zombie reaper and RPC command execution#1805
Closed
mshipkovenski wants to merge 1 commit intocarvel-dev:developfrom
Closed
Fix race condition between zombie reaper and RPC command execution#1805mshipkovenski wants to merge 1 commit intocarvel-dev:developfrom
mshipkovenski wants to merge 1 commit intocarvel-dev:developfrom
Conversation
The sidecar container had a race condition between the reapZombies goroutine and the RPC command handler (CmdExec.Run). Both called Wait4 on the same child processes: - reapZombies called Wait4(-1, WNOHANG) to reap any exited child. - CmdExec.Run called cmd.Run(), which internally calls cmd.Wait(), which calls waitid(P_PID, <pid>) on the specific child process. Since the kernel only stores one exit status per process and the first Wait4/waitid call to match a PID consumes it, the reaper could steal the exit status before cmd.Wait() collected it. This caused cmd.Wait() to fail with "waitid: no child processes" (ECHILD), which surfaced in kapp-controller logs as: Fetching resources: waitid: no child processes The fix introduces a centralized Reaper that is the sole caller of Wait4 in the sidecar process. CmdExec.Run now uses cmd.Start() instead of cmd.Run() and registers the child PID with the Reaper via a channel. The Reaper's Wait4(-1) loop dispatches exit statuses to tracked PIDs through their channels, and silently discards statuses for untracked PIDs (orphaned zombies). A mutex ensures cmd.Start() and PID registration are atomic with respect to the reaping loop, preventing the Reaper from consuming an exit status before the channel is registered. Made-with: Cursor ai-assisted=yes Made-with: Cursor Signed-off-by: Miroslav Shipkovenski <miroslav.shipkovenski@broadcom.com>
0660ad2 to
2905324
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The sidecar container had a race condition between the reapZombies goroutine and the RPC command handler (CmdExec.Run). Both called Wait4 on the same child processes:
Since the kernel only stores one exit status per process and the first Wait4/waitid call to match a PID consumes it, the reaper could steal the exit status before cmd.Wait() collected it. This caused cmd.Wait() to fail with "waitid: no child processes" (ECHILD), which surfaced in kapp-controller logs as:
Fetching resources: waitid: no child processes
The fix introduces a centralized Reaper that is the sole caller of Wait4 in the sidecar process. CmdExec.Run now uses cmd.Start() instead of cmd.Run() and registers the child PID with the Reaper via a channel. The Reaper's Wait4(-1) loop dispatches exit statuses to tracked PIDs through their channels, and silently discards statuses for untracked PIDs (orphaned zombies). A mutex ensures cmd.Start() and PID registration are atomic with respect to the reaping loop, preventing the Reaper from consuming an exit status before the channel is registered.
Made-with: Cursor