Skip to content

Add graceful shutdown handler for SIGTERM/SIGINT #9

@xinbenlv

Description

@xinbenlv

Original Request

Add graceful shutdown handler for SIGTERM/SIGINT

Context: The server in packages/server/src/index.ts doesn't handle SIGTERM or SIGINT signals. On Toolforge, when the container is restarted or scaled, connections drop abruptly — SSE clients lose their stream mid-event, MongoDB connections may not close cleanly, and in-flight requests are terminated without response. Adding signal handlers to: (1) stop accepting new connections, (2) close SSE streams with a "server-shutdown" event so clients can reconnect, (3) flush pending bulk writes, and (4) close MongoDB connection pool — would improve reliability during deployments and container restarts.

Agent's Two Cents (could be wrong)

Everything below is the AI agent's best guess based on the current codebase.
Take with a grain of salt — the original request above is the only thing that came from a human.

Problem / Motivation

The standalone server (packages/server/src/index.ts) currently calls serve() and never captures the returned server handle, nor registers any signal handlers. When Toolforge sends SIGTERM during a container restart or scale-down, Node.js terminates immediately — dropping all active SSE streams in events.ts, leaving Mongoose connections in a dirty state, and abandoning any in-flight HTTP responses. This causes client-visible errors and potential data loss on every deployment.

Proposed Solution

Register process.on("SIGTERM", ...) and process.on("SIGINT", ...) handlers in the server startup block. On signal receipt the handler should:

  1. Stop accepting new connections — call server.close() on the handle returned by @hono/node-server's serve().
  2. Notify SSE clients — emit a server-shutdown event on the eventBus so connected SSE streams in events.ts can send a final event and close gracefully, giving clients a cue to reconnect.
  3. Close the revert-risk EventSource stream — shut down the Wikimedia EventSource opened by startRevertRiskStream() so the process isn't held open by it.
  4. Flush pending writes — if any bulk-write batches are pending, flush them before closing.
  5. Close the MongoDB connection pool — call mongoose.connection.close().
  6. Exitprocess.exit(0) after cleanup (with a hard timeout of ~10 s as a safety net).

Dependencies & Potential Blockers

  • @hono/node-server's serve() returns a Node http.Server — need to confirm it's captured and that server.close() stops new connections as expected.
  • The eventBus in packages/server/src/lib/eventBus.ts currently has no "shutdown" event type — a new event type will need to be added.
  • startRevertRiskStream() does not currently export a stop() function — it will need to expose one (the eventSource variable is module-scoped but not exported).

How to Validate

  • Send SIGTERM to the running server; verify it logs a shutdown message and exits with code 0 within ~10 s.
  • While an SSE client is connected, send SIGTERM; verify the client receives a server-shutdown event before the connection closes.
  • After SIGTERM, confirm no new HTTP requests are accepted (connection refused).
  • Check MongoDB logs or Mongoose debug output to confirm the connection pool closes cleanly (no "topology was destroyed" warnings on next startup).
  • Start the server, trigger SIGINT (Ctrl-C); same graceful behavior as SIGTERM.

Scope Estimate

Small — the change is confined to the server startup block and a few supporting modules; no API surface changes.

Key Files/Modules Likely Involved

  • packages/server/src/index.ts — main startup block where signal handlers will be registered
  • packages/server/src/routes/events.ts — SSE endpoint; needs to listen for shutdown event
  • packages/server/src/lib/eventBus.ts — needs a new shutdown event type
  • packages/server/src/lib/revertRiskStream.ts — needs to export a stopRevertRiskStream() function
  • packages/server/src/db/connection.ts — may add a disconnectDB() helper

Rough Implementation Sketch

  • Capture the return value of serve() in a variable (const server = serve(...))
  • Create an async function gracefulShutdown(signal: string) that:
    • Logs the signal
    • Calls server.close()
    • Emits shutdown on eventBus
    • Calls a new stopRevertRiskStream()
    • Calls mongoose.connection.close()
    • Sets a setTimeout(() => process.exit(1), 10_000) safety net
    • Calls process.exit(0) on success
  • Register process.on("SIGTERM", () => gracefulShutdown("SIGTERM")) and same for SIGINT
  • In events.ts, listen for the shutdown event on eventBus and write a final server-shutdown SSE event before resolving the stream promise
  • In revertRiskStream.ts, export stopRevertRiskStream() that calls eventSource?.close() and clears the retry timer

Open Questions

  • Should the shutdown handler wait for in-flight HTTP requests to finish (drain), or just close after SSE/DB cleanup? A drain timeout of a few seconds might be appropriate.
  • Is there a bulk-write buffer anywhere beyond the revert-risk ring buffer that needs flushing? (The ring buffer is in-memory and read-only, so it can be dropped safely.)
  • Should the health endpoint return a "shutting down" status during the grace period to help load balancers route away?

Potential Risks or Gotchas

  • If serve() from @hono/node-server doesn't return a standard http.Server, the server.close() approach may need adjustment.
  • The safety-net setTimeout must call process.exit(1) with .unref() so it doesn't itself keep the process alive if everything else closes cleanly.
  • On Toolforge, Kubernetes sends SIGTERM and then SIGKILL after 30 s by default. The 10 s internal timeout is well within that window, but worth documenting.

Metadata

Metadata

Assignees

No one assigned

    Labels

    AgentsCanDoSuitable for autonomous agent pickupenhancementNew feature or requestp2Medium priority

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions