Skip to content

Conversation

@realrajaryan
Copy link
Contributor

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

Motivation and Context

Currently, when stopping and immediately restarting a container, it would fail with the error:
“container expected to be in created state, got: shuttingDown” and then be automatically deleted.
The SandboxService process waits five seconds before exiting after shutdown. During this interval, a rapid restart could reconnect to the still-terminating process in the shuttingDown state, triggering a state validation error.

This fix forcefully terminates the SandboxService process with SIGKILL upon container exit, instead of waiting five seconds. The bootstrap now defensively checks for and cleans up any stale services before registering new ones, preventing reconnections to processes in the shuttingDown state.

Testing

  • Tested locally
  • Added/updated tests
  • Added/updated docs

@realrajaryan realrajaryan requested a review from dcantah November 4, 2025 02:32
@dcantah
Copy link
Member

dcantah commented Nov 4, 2025

I think we should just get rid of the 5 second sleep+exit and deregister (and not ignore the error) in stop. The shutdown rpc is just meant as a way (our docs here are shitty, otherwise it'd be easier to tell 😄) for any implementations of the sandbox service to clean up any resources it may have created during container lifecycle (temp files, sockets, whatever). I think we can just rely on the APIServer to kill it, and not have the sandbox service itself exit on its own. I'm not sure on that entirely, but it seems fine to me to leave having it exit up to the APIServer.

So:

  1. Remove the Task sleep and exit in shutdown rpc.
  2. Always deregister after shutdown comes back successfully.
  3. Get rid of the checks you added in bootstrap

@realrajaryan
Copy link
Contributor Author

I think we should just get rid of the 5 second sleep+exit and deregister (and not ignore the error) in stop. The shutdown rpc is just meant as a way (our docs here are shitty, otherwise it'd be easier to tell 😄) for any implementations of the sandbox service to clean up any resources it may have created during container lifecycle (temp files, sockets, whatever). I think we can just rely on the APIServer to kill it, and not have the sandbox service itself exit on its own. I'm not sure on that entirely, but it seems fine to me to leave having it exit up to the APIServer.

So:

  1. Remove the Task sleep and exit in shutdown rpc.
  2. Always deregister after shutdown comes back successfully.
  3. Get rid of the checks you added in bootstrap

I’m also gonna move service registration from create() to bootstrap(), because if create() registers the service and bootstrap() registers again it’ll fail. This way services start when containers are actually running.

@realrajaryan realrajaryan force-pushed the users/realrajaryan/kill-sandbox-service-on-exit branch from 9a67abd to ddf2c35 Compare November 4, 2025 22:06
@realrajaryan realrajaryan force-pushed the users/realrajaryan/kill-sandbox-service-on-exit branch from ddf2c35 to 1388dbc Compare November 4, 2025 22:07
@realrajaryan realrajaryan force-pushed the users/realrajaryan/kill-sandbox-service-on-exit branch from 1388dbc to c8a4cb7 Compare November 4, 2025 22:21
Comment on lines +464 to +465
let client = try state.getClient()
try await client.shutdown()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I still think shutdown shouldn't be fatal and we can leave what we had as it's mostly best effort cleanup on the sb services end, I just meant to not ignore all of the errors like what you had in your original change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're in a weird spot if deregistration fails though. Curious on your thoughts on how to handle that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, If deregister fails, we could catch it and attempt kill SIGKILL + deregister as a fallback?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Container auto deletes when we stop a container and start the same container within few milliseconds.

2 participants