Skip to content

Conversation

@isaac-dasan
Copy link
Member

Description

This PR fixes a critical bug in the SocketWatcher that caused the device plugin to enter a tight restart loop if the socket file was deleted (e.g., by Kubelet) or if the plugin restarted for any reason.

The Problem:

The SocketWatcher maintains a map socketChans of active watchers. When a watcher goroutine exited (due to socket deletion or context cancellation), it closed its notification channel but failed to remove the entry from the map.

Consequently, when the PluginManager attempted to restart the plugin:

  1. It called WatchSocket with the same socket path.
  2. WatchSocket found the existing entry in the map and returned the already-closed channel.
  3. The plugin's main loop (Run) received from this closed channel immediately and exited.
  4. The loop repeated indefinitely, causing the plugin to restart every ~100ms.

Symptoms:

Logs showing "starting device plugin for resource" and "registering with kubelet" repeating rapidly with no intermediate errors.
Pods failing with "UnexpectedAdmissionError: Allocate failed due to no healthy devices present". This occurred because the plugin was constantly churning, leaving no stable window for Kubelet to allocate devices.

The Fix:

Updated SocketWatcher to ensure that the socket entry is deleted from the socketChans map when the watcher gor
outine exits. This ensures that subsequent calls to WatchSocket create a fresh watcher and channel.

Testing:

Added a regression test TestWatchSocketCleanup in socketwatcher_test.go to verify that the map entry is cleaned up and new watchers can be established successfully.

@isaac-dasan
Copy link
Member Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical bug where the Device Plugin would enter an infinite restart loop when its socket was deleted or the plugin restarted. The root cause was the SocketWatcher failing to clean up map entries for closed watchers, causing subsequent WatchSocket calls to return already-closed channels.

Key Changes

  • Modified SocketWatcher.WatchSocket() to delete map entries before closing channels in the goroutine cleanup defer block
  • Added defensive socket cleanup in Server.Run() to remove stale sockets before listening
  • Enhanced test error handling and added regression test TestWatchSocketCleanup to verify map cleanup behavior

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
cns/deviceplugin/socketwatcher.go Core fix: Added mutex-protected map cleanup in defer block before closing channel to prevent returning closed channels on subsequent WatchSocket calls
cns/deviceplugin/socketwatcher_test.go Enhanced error handling by checking logger creation and file operations; added new regression test TestWatchSocketCleanup to verify map entry cleanup after watcher exit
cns/deviceplugin/server.go Added defensive socket removal before net.Listen() to handle stale socket files
cns/deviceplugin/server_test.go New test TestServer_Run_CleansUpExistingSocket validates that server properly removes and recreates stale socket files

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: GitHub <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants