Skip to content

Conversation

@isaac-dasan
Copy link
Member

Description

This PR fixes a critical bug in the SocketWatcher that caused the device plugin to enter a tight restart loop if the socket file was deleted (e.g., by Kubelet) or if the plugin restarted for any reason.

The Problem:

The SocketWatcher maintains a map socketChans of active watchers. When a watcher goroutine exited (due to socket deletion or context cancellation), it closed its notification channel but failed to remove the entry from the map.

Consequently, when the PluginManager attempted to restart the plugin:

  1. It called WatchSocket with the same socket path.
  2. WatchSocket found the existing entry in the map and returned the already-closed channel.
  3. The plugin's main loop (Run) received from this closed channel immediately and exited.
  4. The loop repeated indefinitely, causing the plugin to restart every ~100ms.

Symptoms:

Logs showing "starting device plugin for resource" and "registering with kubelet" repeating rapidly with no intermediate errors.
Pods failing with "UnexpectedAdmissionError: Allocate failed due to no healthy devices present". This occurred because the plugin was constantly churning, leaving no stable window for Kubelet to allocate devices.

The Fix:

Updated SocketWatcher to ensure that the socket entry is deleted from the socketChans map when the watcher gor
outine exits. This ensures that subsequent calls to WatchSocket create a fresh watcher and channel.

Testing:

Added a regression test TestWatchSocketCleanup in socketwatcher_test.go to verify that the map entry is cleaned up and new watchers can be established successfully.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical infinite loop bug in the device plugin's SocketWatcher that occurred when socket files were deleted or plugins restarted. The root cause was that the socketChans map wasn't being cleaned up when watcher goroutines exited, causing subsequent WatchSocket calls to return already-closed channels. This led to the plugin immediately exiting and restarting in a tight loop (~100ms intervals), preventing stable device allocation.

Key Changes:

  • Added cleanup logic to remove map entries when watcher goroutines exit
  • Added regression test to verify map cleanup and fresh watcher creation

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
cns/deviceplugin/socketwatcher.go Modified defer block to delete socket entry from map before closing channel, ensuring subsequent watchers create fresh channels
cns/deviceplugin/socketwatcher_test.go Added TestWatchSocketCleanup to verify map cleanup and that new watchers can be established after previous ones exit

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


func TestWatchSocketCleanup(t *testing.T) {
// Create a temporary directory
tempDir, err := os.MkdirTemp("", "socket-watcher-test-")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to prefer adding a .gitignore file with entries for every filename that we might create so you don't accidentally check in test data

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will create a file in /tmp/socket-watcher-test-123456 and the defer os.RemoveAll(tempDir) will clean it up, so no need to add entries to .gitignore

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can't guarantee that your os.RemoveAll will run, if your tests crash then the files will be left behind. On linux this is fine since the temp directory goes under /tmp but on windows I think os.MkdirTemp uses the current directory (not 100% sure about that). If you can guarantee that these temp dirs never get created in the scope of the git repo, then it is fine to not add a .gitignore

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://pkg.go.dev/os#MkdirTemp
https://pkg.go.dev/os#TempDir

Yes it is guaranteed to not be created in repo for windows as well. It typically looks like:
C:\Users\<Username>\AppData\Local\Temp\socket-watcher-test-<random>\to-be-deleted.sock

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do what matt said

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/tmp may not be accessible to a user with restricted permissions, may cross filesystem boundaries, may not actually be as temporary as you think (ie these files will persist until reboot, on Linux; and forever, on Windows). Etc. Also, this means your test has side-effects, and as you pointed out, escapes the repo scope - that's explicitly bad.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conventionally tests use the testdata folder (go test will set the current working directory to the directory of your test file so you don't have to do anything tricky when building the base path), so just create testdata/.gitignore and add the filenames that your tests create, then you can just read from testdata/<file> in the test without needing to construct any special base paths

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least use T.TempDir. But keeping test filesystem I/O within the repo tree is best imo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated with t.TempDir() since it is designed to handle cleanup automatically, even if the test fails or panics.

@isaac-dasan isaac-dasan requested review from a team as code owners December 11, 2025 21:42
@isaac-dasan isaac-dasan force-pushed the users/isaac/watcherbugfix branch from e7ad386 to 8e12d5c Compare December 11, 2025 21:44
Signed-off-by: GitHub <[email protected]>
thatmattlong
thatmattlong previously approved these changes Dec 11, 2025
miguelgoms
miguelgoms previously approved these changes Dec 11, 2025
@isaac-dasan
Copy link
Member Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@isaac-dasan isaac-dasan added this pull request to the merge queue Dec 11, 2025
@rbtr rbtr removed this pull request from the merge queue due to a manual request Dec 12, 2025

func TestWatchSocketCleanup(t *testing.T) {
// Create a temporary directory
tempDir, err := os.MkdirTemp("", "socket-watcher-test-")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do what matt said

Signed-off-by: GitHub <[email protected]>
@isaac-dasan isaac-dasan dismissed stale reviews from miguelgoms and thatmattlong via 426cc35 December 12, 2025 00:52
Replaces usage of `t.TempDir()` with a local `testdata` directory for temporary socket files in `socketwatcher_test.go` and `server_test.go`.

This change addresses two issues on Windows:
1. `t.TempDir()` creates files in the system temp directory, which can cause cleanup failures and disk space issues on CI agents if files are left behind.
2. Explicitly closes file handles after `os.Create()` to prevent "Access is denied" errors during cleanup, as Windows does not allow deleting open files.

Also adds a `.gitignore` to `cns/deviceplugin/testdata` to ensure temporary socket files are not tracked.

Signed-off-by: GitHub <[email protected]>
@isaac-dasan
Copy link
Member Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: GitHub <[email protected]>
@isaac-dasan
Copy link
Member Author

/azp run Azure Container Networking PR

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants