-
Notifications
You must be signed in to change notification settings - Fork 261
fix: infinite loop in Device Plugin caused by stale SocketWatcher state #4163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: GitHub <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a critical infinite loop bug in the device plugin's SocketWatcher that occurred when socket files were deleted or plugins restarted. The root cause was that the socketChans map wasn't being cleaned up when watcher goroutines exited, causing subsequent WatchSocket calls to return already-closed channels. This led to the plugin immediately exiting and restarting in a tight loop (~100ms intervals), preventing stable device allocation.
Key Changes:
- Added cleanup logic to remove map entries when watcher goroutines exit
- Added regression test to verify map cleanup and fresh watcher creation
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| cns/deviceplugin/socketwatcher.go | Modified defer block to delete socket entry from map before closing channel, ensuring subsequent watchers create fresh channels |
| cns/deviceplugin/socketwatcher_test.go | Added TestWatchSocketCleanup to verify map cleanup and that new watchers can be established after previous ones exit |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| func TestWatchSocketCleanup(t *testing.T) { | ||
| // Create a temporary directory | ||
| tempDir, err := os.MkdirTemp("", "socket-watcher-test-") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to prefer adding a .gitignore file with entries for every filename that we might create so you don't accidentally check in test data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will create a file in /tmp/socket-watcher-test-123456 and the defer os.RemoveAll(tempDir) will clean it up, so no need to add entries to .gitignore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can't guarantee that your os.RemoveAll will run, if your tests crash then the files will be left behind. On linux this is fine since the temp directory goes under /tmp but on windows I think os.MkdirTemp uses the current directory (not 100% sure about that). If you can guarantee that these temp dirs never get created in the scope of the git repo, then it is fine to not add a .gitignore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://pkg.go.dev/os#MkdirTemp
https://pkg.go.dev/os#TempDir
Yes it is guaranteed to not be created in repo for windows as well. It typically looks like:
C:\Users\<Username>\AppData\Local\Temp\socket-watcher-test-<random>\to-be-deleted.sock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do what matt said
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/tmp may not be accessible to a user with restricted permissions, may cross filesystem boundaries, may not actually be as temporary as you think (ie these files will persist until reboot, on Linux; and forever, on Windows). Etc. Also, this means your test has side-effects, and as you pointed out, escapes the repo scope - that's explicitly bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
conventionally tests use the testdata folder (go test will set the current working directory to the directory of your test file so you don't have to do anything tricky when building the base path), so just create testdata/.gitignore and add the filenames that your tests create, then you can just read from testdata/<file> in the test without needing to construct any special base paths
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least use T.TempDir. But keeping test filesystem I/O within the repo tree is best imo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated with t.TempDir() since it is designed to handle cleanup automatically, even if the test fails or panics.
Signed-off-by: GitHub <[email protected]>
Signed-off-by: GitHub <[email protected]>
Signed-off-by: GitHub <[email protected]>
e7ad386 to
8e12d5c
Compare
Signed-off-by: GitHub <[email protected]>
|
/azp run Azure Container Networking PR |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
|
||
| func TestWatchSocketCleanup(t *testing.T) { | ||
| // Create a temporary directory | ||
| tempDir, err := os.MkdirTemp("", "socket-watcher-test-") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do what matt said
Signed-off-by: GitHub <[email protected]>
426cc35
Replaces usage of `t.TempDir()` with a local `testdata` directory for temporary socket files in `socketwatcher_test.go` and `server_test.go`. This change addresses two issues on Windows: 1. `t.TempDir()` creates files in the system temp directory, which can cause cleanup failures and disk space issues on CI agents if files are left behind. 2. Explicitly closes file handles after `os.Create()` to prevent "Access is denied" errors during cleanup, as Windows does not allow deleting open files. Also adds a `.gitignore` to `cns/deviceplugin/testdata` to ensure temporary socket files are not tracked. Signed-off-by: GitHub <[email protected]>
|
/azp run Azure Container Networking PR |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: GitHub <[email protected]>
|
/azp run Azure Container Networking PR |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Description
This PR fixes a critical bug in the
SocketWatcherthat caused the device plugin to enter a tight restart loop if the socket file was deleted (e.g., by Kubelet) or if the plugin restarted for any reason.The Problem:
The
SocketWatchermaintains a mapsocketChansof active watchers. When a watcher goroutine exited (due to socket deletion or context cancellation), it closed its notification channel but failed to remove the entry from the map.Consequently, when the
PluginManagerattempted to restart the plugin:WatchSocketwith the same socket path.WatchSocketfound the existing entry in the map and returned the already-closed channel.Symptoms:
Logs showing
"starting device plugin for resource"and"registering with kubelet"repeating rapidly with no intermediate errors.Pods failing with
"UnexpectedAdmissionError: Allocate failed due to no healthy devices present". This occurred because the plugin was constantly churning, leaving no stable window for Kubelet to allocate devices.The Fix:
Updated SocketWatcher to ensure that the socket entry is deleted from the socketChans map when the watcher gor
outine exits. This ensures that subsequent calls to WatchSocket create a fresh watcher and channel.
Testing:
Added a regression test TestWatchSocketCleanup in socketwatcher_test.go to verify that the map entry is cleaned up and new watchers can be established successfully.