-
Notifications
You must be signed in to change notification settings - Fork 514
Fix container auto-delete on rapid stop/start #841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix container auto-delete on rapid stop/start #841
Conversation
|
I think we should just get rid of the 5 second sleep+exit and deregister (and not ignore the error) in stop. The shutdown rpc is just meant as a way (our docs here are shitty, otherwise it'd be easier to tell 😄) for any implementations of the sandbox service to clean up any resources it may have created during container lifecycle (temp files, sockets, whatever). I think we can just rely on the APIServer to kill it, and not have the sandbox service itself exit on its own. I'm not sure on that entirely, but it seems fine to me to leave having it exit up to the APIServer. So:
|
I’m also gonna move service registration from create() to bootstrap(), because if create() registers the service and bootstrap() registers again it’ll fail. This way services start when containers are actually running. |
9a67abd to
ddf2c35
Compare
ddf2c35 to
1388dbc
Compare
1388dbc to
c8a4cb7
Compare
| let client = try state.getClient() | ||
| try await client.shutdown() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I still think shutdown shouldn't be fatal and we can leave what we had as it's mostly best effort cleanup on the sb services end, I just meant to not ignore all of the errors like what you had in your original change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're in a weird spot if deregistration fails though. Curious on your thoughts on how to handle that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, If deregister fails, we could catch it and attempt kill SIGKILL + deregister as a fallback?
Type of Change
Motivation and Context
Currently, when stopping and immediately restarting a container, it would fail with the error:
“container expected to be in created state, got: shuttingDown”and then be automatically deleted.The
SandboxServiceprocess waits five seconds before exiting after shutdown. During this interval, a rapid restart could reconnect to the still-terminating process in theshuttingDownstate, triggering a state validation error.This fix forcefully terminates the
SandboxServiceprocess withSIGKILLupon container exit, instead of waiting five seconds. The bootstrap now defensively checks for and cleans up any stale services before registering new ones, preventing reconnections to processes in theshuttingDownstate.Testing