fix: infinite loop in Device Plugin caused by stale SocketWatcher state #4165
+154
−20
Description
This PR fixes a critical bug in the SocketWatcher that caused the device plugin to enter a tight restart loop if the socket file was deleted (e.g., by Kubelet) or if the plugin restarted for any reason.

The Problem:
The SocketWatcher maintains a map, socketChans, of active watchers. When a watcher goroutine exited (due to socket deletion or context cancellation), it closed its notification channel but failed to remove the entry from the map.

Consequently, when the PluginManager attempted to restart the plugin:
1. It called WatchSocket with the same socket path.
2. WatchSocket found the existing entry in the map and returned the already-closed channel.
Because a receive on a closed Go channel returns immediately, the restarted plugin would see a notification right away, which is consistent with the restart cycle repeating without pause.

Symptoms:
- Logs showing "starting device plugin for resource" and "registering with kubelet" repeating rapidly with no intermediate errors.
- Pods failing with "UnexpectedAdmissionError: Allocate failed due to no healthy devices present". This occurred because the plugin was constantly churning, leaving no stable window for Kubelet to allocate devices.

The Fix:
SocketWatcher now deletes the socket entry from the socketChans map when the watcher goroutine exits, so subsequent calls to WatchSocket create a fresh watcher and channel.
Testing:
Added a regression test TestWatchSocketCleanup in socketwatcher_test.go to verify that the map entry is cleaned up and new watchers can be established successfully.