Skip to content

Fix race condition between zombie reaper and RPC command execution#1805

Closed
mshipkovenski wants to merge 1 commit intocarvel-dev:developfrom
mshipkovenski:sidecar-wait-fix
Closed

Fix race condition between zombie reaper and RPC command execution#1805
mshipkovenski wants to merge 1 commit intocarvel-dev:developfrom
mshipkovenski:sidecar-wait-fix

Conversation

@mshipkovenski
Copy link
Copy Markdown

The sidecar container had a race condition between the reapZombies goroutine and the RPC command handler (CmdExec.Run). Both called Wait4 on the same child processes:

  • reapZombies called Wait4(-1, WNOHANG) to reap any exited child.
  • CmdExec.Run called cmd.Run(), which internally calls cmd.Wait(), which calls waitid(P_PID, ) on the specific child process.

Since the kernel only stores one exit status per process and the first Wait4/waitid call to match a PID consumes it, the reaper could steal the exit status before cmd.Wait() collected it. This caused cmd.Wait() to fail with "waitid: no child processes" (ECHILD), which surfaced in kapp-controller logs as:

Fetching resources: waitid: no child processes

The fix introduces a centralized Reaper that is the sole caller of Wait4 in the sidecar process. CmdExec.Run now uses cmd.Start() instead of cmd.Run() and registers the child PID with the Reaper via a channel. The Reaper's Wait4(-1) loop dispatches exit statuses to tracked PIDs through their channels, and silently discards statuses for untracked PIDs (orphaned zombies). A mutex ensures cmd.Start() and PID registration are atomic with respect to the reaping loop, preventing the Reaper from consuming an exit status before the channel is registered.

Made-with: Cursor

The sidecar container had a race condition between the reapZombies
goroutine and the RPC command handler (CmdExec.Run). Both called
Wait4 on the same child processes:

- reapZombies called Wait4(-1, WNOHANG) to reap any exited child.
- CmdExec.Run called cmd.Run(), which internally calls cmd.Wait(),
  which calls waitid(P_PID, <pid>) on the specific child process.

Since the kernel only stores one exit status per process and the first
Wait4/waitid call to match a PID consumes it, the reaper could steal
the exit status before cmd.Wait() collected it. This caused cmd.Wait()
to fail with "waitid: no child processes" (ECHILD), which surfaced in
kapp-controller logs as:

  Fetching resources: waitid: no child processes

The fix introduces a centralized Reaper that is the sole caller of
Wait4 in the sidecar process. CmdExec.Run now uses cmd.Start()
instead of cmd.Run() and registers the child PID with the Reaper via
a channel. The Reaper's Wait4(-1) loop dispatches exit statuses to
tracked PIDs through their channels, and silently discards statuses
for untracked PIDs (orphaned zombies). A mutex ensures cmd.Start()
and PID registration are atomic with respect to the reaping loop,
preventing the Reaper from consuming an exit status before the
channel is registered.

Made-with: Cursor

ai-assisted=yes

Made-with: Cursor
Signed-off-by: Miroslav Shipkovenski <miroslav.shipkovenski@broadcom.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Closed

Development

Successfully merging this pull request may close these issues.

2 participants