
Fix zombie process race condition in sidecar container #1806

Open
mdzhigarov wants to merge 2 commits into carvel-dev:develop from mdzhigarov:fix-zombie-process


@mdzhigarov
Contributor

Summary

Fixes a race condition in the sidecar container that causes intermittent "waitid: no child processes" errors during template operations (ytt, vendir, imgpkg).

Problem

The custom reapZombies() function was incorrectly implemented:

  • It called syscall.Wait4(-1, ...), which reaps any child of the calling process, not just orphans reparented to it
  • This raced with the normal parent-child wait performed in CmdExec.Run
  • When ytt, vendir, or imgpkg exited, reapZombies could collect the exit status before the cmd.Wait() call in CmdExec.Run did
  • Result: cmd.Wait() failed with ECHILD ("waitid: no child processes")

Root Cause

The reapZombies function violated Unix process management conventions: it reaped children that the same process was still going to wait on explicitly, instead of only reaping orphaned processes that had been reparented to it after their parent died.

Solution

  • Replace custom zombie reaper with tini: Industry standard init system for containers
  • Remove reapZombies() function entirely: Eliminates the race condition source
  • Proper PID 1 handling: with tini as PID 1, orphaned processes are reparented to tini and reaped there, not in the sidecar process
  • Maintains zombie cleanup: Without interfering with normal parent-child relationships

Changes

  • Dockerfile: Install tini package and set as entrypoint
  • sidecarexec.go: Remove reapZombies function and unused imports
  • deployment.yml: Fix ytt template comment syntax

Impact

  • ✅ Eliminates "waitid: no child processes" errors
  • ✅ Fixes intermittent PackageRepository reconciliation failures
  • ✅ Improves reliability under concurrent template operations
  • ✅ Uses industry standard solution (tini)
  • ✅ Backward compatible - no API changes

Test Plan

  • Build and deploy updated container image
  • Verify process tree shows tini as PID 1: ps -ef should show tini -- /kapp-controller
  • Test PackageRepository reconciliation under load
  • Confirm no more "waitid: no child processes" errors in logs
  • Validate graceful container shutdown behavior
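
The PID 1 check in the plan can also be done from inside the container without ps. This is a rough sketch, assuming Linux with /proc mounted and assuming the installed binary is named exactly `tini` (some packages ship it under a different name, so adjust as needed); `isTini` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isTini (hypothetical helper) reports whether a /proc/<pid>/comm value names
// the tini binary. Assumes the binary is called exactly "tini".
func isTini(comm string) bool {
	name := filepath.Base(strings.TrimSpace(comm))
	return name == "tini"
}

func main() {
	// /proc/1/comm holds the command name of PID 1, followed by a newline.
	data, err := os.ReadFile("/proc/1/comm")
	if err != nil {
		fmt.Println("cannot read /proc/1/comm:", err)
		return
	}
	comm := strings.TrimSpace(string(data))
	fmt.Printf("PID 1 is %q, tini: %v\n", comm, isTini(comm))
}
```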

Related Issues

This addresses the race condition described in the error logs where template operations fail with zombie process errors during high-frequency reconciliation.

Made with Cursor

Replace custom reapZombies implementation with tini as PID 1 to eliminate
race condition causing "waitid: no child processes" errors during template
operations.

Problem:
- reapZombies() used syscall.Wait4(-1, ...) which reaps ANY child process
- This interfered with normal parent-child process waiting in CmdExec.Run
- Race condition: reapZombies could reap ytt/vendir/imgpkg processes before
  their actual parent (sidecar process) could wait for them
- Result: cmd.Wait() failed with ECHILD ("waitid: no child processes")

Solution:
- Install and use tini as proper PID 1 init system in Dockerfile
- Remove problematic reapZombies function entirely
- tini correctly handles only orphaned processes, not normal children
- Eliminates race condition while maintaining proper zombie cleanup

Changes:
- Dockerfile: Install tini package and set as entrypoint
- sidecarexec.go: Remove reapZombies function and unused imports
- deployment.yml: Add documentation comment about tini configuration

This fixes intermittent failures during PackageRepository reconciliation
and other template-heavy operations under concurrent load.

Fixes: Race condition between zombie reaper and command execution
Made-with: Cursor
Signed-off-by: Marin Dzhigarov <m.dzhigarov@gmail.com>
Use ytt-specific comment syntax (#!) instead of regular comments (#)
to avoid template compilation errors.

Signed-off-by: Marin Dzhigarov <m.dzhigarov@gmail.com>
Made-with: Cursor
carvel-bot added this to Carvel on Apr 1, 2026
