feat: add /roll command for gated fleet image rollouts #161
Open
barnabasbusa wants to merge 3 commits into
Adds a gated, sequential, health-checked image rollout capability across a network's nodes, surfaced as a Discord /roll slash command and a `panda-pulse roll` CLI subcommand backed by a shared pkg/roll engine.

- pkg/roll: sequential engine (abort + leave-rest-untouched on failure), pluggable actuators (SSH-to-local-watchtower and watchtower vhost API), Dora-based health gating with per-node beacon fallback, cartographoor inventory resolution, and Ansible-style host selection (globs, exclusions, client/group/node, "all").
- Discord /roll: client autocomplete from the network inventory, live per-host progress in a channel message (survives the 15m interaction window), a force option to override health gating, and a completion ping.

Health uses Dora as the source of truth (one unauthenticated fleet-wide call); rolls trigger watchtower's HTTP API via the watchtower-<host> vhost, so no SSH or basic auth is required on the default path.
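The "abort + leave-rest-untouched on failure" behaviour of the sequential engine can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/roll: the `Actuator` and `HealthChecker` interfaces, the `Options` fields other than `PostTriggerWait`, and the `fakeFleet` test double are all assumptions made for the example, and the pre-flight gate is left out for brevity.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Actuator and HealthChecker are hypothetical interfaces for this sketch.
type (
	Actuator      interface{ Trigger(ctx context.Context, node string) error }
	HealthChecker interface{ Healthy(ctx context.Context, node string) (bool, error) }
)

// Options holds illustrative knobs, not the PR's actual Options struct.
type Options struct {
	PostTriggerWait, PollInterval time.Duration
	MaxPolls                      int  // health polls per node before giving up
	Force                         bool // skip health gating
}

// Roll updates nodes one at a time and returns those that completed.
// On the first failure it returns, leaving later nodes untouched.
func Roll(ctx context.Context, nodes []string, a Actuator, h HealthChecker, opts Options) ([]string, error) {
	rolled := make([]string, 0, len(nodes))
	for _, node := range nodes {
		if err := a.Trigger(ctx, node); err != nil {
			return rolled, fmt.Errorf("%s: trigger: %w", node, err)
		}
		if err := sleep(ctx, opts.PostTriggerWait); err != nil {
			return rolled, err
		}
		if !opts.Force {
			if err := waitHealthy(ctx, node, h, opts); err != nil {
				return rolled, fmt.Errorf("%s: %w", node, err)
			}
		}
		rolled = append(rolled, node)
	}
	return rolled, nil
}

// waitHealthy polls until the node reports healthy or MaxPolls is exhausted.
func waitHealthy(ctx context.Context, node string, h HealthChecker, opts Options) error {
	for i := 0; i < opts.MaxPolls; i++ {
		if ok, err := h.Healthy(ctx, node); err == nil && ok {
			return nil
		}
		if err := sleep(ctx, opts.PollInterval); err != nil {
			return err
		}
	}
	return errors.New("did not recover")
}

// sleep waits for d or until ctx is cancelled.
func sleep(ctx context.Context, d time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(d):
		return nil
	}
}

// fakeFleet is a test double: node `bad` goes unhealthy once triggered.
type fakeFleet struct {
	bad       string
	down      bool
	triggered []string
}

func (f *fakeFleet) Trigger(_ context.Context, node string) error {
	f.triggered = append(f.triggered, node)
	f.down = f.down || node == f.bad
	return nil
}

func (f *fakeFleet) Healthy(_ context.Context, node string) (bool, error) {
	return !(f.down && node == f.bad), nil
}

// demoRoll rolls three hypothetical nodes where the second never recovers.
func demoRoll() (rolled, triggered []string, err error) {
	f := &fakeFleet{bad: "node-2"}
	rolled, err = Roll(context.Background(), []string{"node-1", "node-2", "node-3"}, f, f,
		Options{PollInterval: time.Millisecond, MaxPolls: 3})
	return rolled, f.triggered, err
}
```

In `demoRoll`, the roll stops at `node-2` and `node-3` is never triggered, which is the property that keeps a bad image from propagating. Call it from a `main` function or a test to try it.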
The watchtower-vhost API actuator + Dora health gating is the only path now, so the SSH actuator was dead weight. Remove it and all SSH wiring (config, flags, env, golang.org/x/crypto/ssh). Target.SSH is retained purely as the host string the API actuator derives watchtower-<host> from. Also satisfies golangci-lint (govet shadow, wsl_v5 whitespace, and tagliatelle nolints on the beacon/Dora snake_case API structs). The reviewer's reported error-handling bug was a false positive — c.actuator()'s error is already captured and checked at the call site.
Simplify the rollout to a single, clean path with no SSH and no direct (basic-auth) beacon access:

- Health is sourced solely from Dora (status: ready), keyed by node name. Remove the beacon health checker and all basic-auth wiring (config, env, flags, Options fields).
- The API actuator targets each node's watchtower vhost (https://watchtower-<node>.srv.<network>.ethpandaops.io) with a bearer token; drop SSH host derivation.
- Inventory targets are grouped by node name (clientName); drop the unused ssh/beacon/rpc fields and the beaconScheme plumbing.

The only inputs are now the network, the cartographoor inventory, the Dora URL (derived from the network), and the watchtower API token.
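A rough sketch of what calling a node's watchtower vhost with a bearer token might look like. The function names `updateURL`, `trigger`, and `demoTrigger` are hypothetical, as are the example node and token values. The use of GET is an assumption based on watchtower's documented curl example, so the PR's actuator may use a different method.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// updateURL builds the per-node endpoint from the vhost pattern in the PR.
func updateURL(node, network string) string {
	return fmt.Sprintf("https://watchtower-%s.srv.%s.ethpandaops.io/v1/update", node, network)
}

// trigger calls url with a bearer token and treats any non-2xx status as a
// failure. GET is an assumption here, not confirmed by the PR.
func trigger(ctx context.Context, c *http.Client, url, token string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := c.Do(req)
	if err != nil {
		return fmt.Errorf("trigger: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("trigger: unexpected status %s", resp.Status)
	}
	return nil
}

// demoTrigger exercises trigger against a local stand-in for watchtower
// that only accepts the token "secret".
func demoTrigger(token string) error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/v1/update" || r.Header.Get("Authorization") != "Bearer secret" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	return trigger(context.Background(), srv.Client(), srv.URL+"/v1/update", token)
}
```

Passing the `*http.Client` in keeps the function testable against a local server, as `demoTrigger` does, without touching a real node.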
mattevans
approved these changes
Jun 1, 2026
            return fmt.Errorf("trigger: %w", err)
        }

        if err := sleep(ctx, opts.PostTriggerWait); err != nil {
Member
This will only verify liveness of the rolled client, not necessarily that it is running the new img. Not sure if it's feasible to check, but if watchtower is slow you'll get the old container reporting "healthy" without the new img having been rolled out.
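One possible way to address this, sketched under assumptions: record the image a node reports before triggering, then wait until the node is healthy on a different image. `NodeState`, `StateFunc`, and `waitRolled` are hypothetical names, and the PR does not say whether Dora or watchtower exposes the running image, so the state source here is a stand-in.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// NodeState is a hypothetical snapshot of what a node reports about itself.
type NodeState struct {
	Image   string // image reference or digest
	Healthy bool
}

// StateFunc fetches the current state of a node (hypothetical).
type StateFunc func(ctx context.Context, node string) (NodeState, error)

var (
	errNotRolled = errors.New("node is still on the old image")
	errUnhealthy = errors.New("node is on a new image but not healthy")
)

// waitRolled polls until the node is healthy on an image other than oldImage,
// returning the last observed reason if it never gets there.
func waitRolled(ctx context.Context, node, oldImage string, get StateFunc, interval time.Duration, attempts int) error {
	last := errors.New("no attempts made")
	for i := 0; i < attempts; i++ {
		st, err := get(ctx, node)
		switch {
		case err != nil:
			last = err
		case st.Image == oldImage:
			last = errNotRolled
		case !st.Healthy:
			last = errUnhealthy
		default:
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	return last
}

// demoWaitRolled simulates a node that switches image after swapAfter polls.
func demoWaitRolled(swapAfter int) error {
	calls := 0
	get := func(context.Context, string) (NodeState, error) {
		calls++
		if calls > swapAfter {
			return NodeState{Image: "sha256:new", Healthy: true}, nil
		}
		return NodeState{Image: "sha256:old", Healthy: true}, nil
	}
	return waitRolled(context.Background(), "node-1", "sha256:old", get, time.Millisecond, 5)
}
```

A check like this would fail for a node that was already on the latest image, so it would need a way to tell "nothing to update" apart from "update has not happened yet".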
    // the gated rollout. Progress is tracked in a normal bot message (not the
    // interaction reply) so it survives past Discord's 15-minute interaction-token
    // window; a multi-node roll can take longer than that.
    func (c *Command) run(s *discordgo.Session, i *discordgo.InteractionCreate, data discordgo.ApplicationCommandInteractionData) error {
Member
Not sure if you want some form of concurrency lock here, but simultaneous /roll commands can be run at the moment. I guess probably not too much of an issue?
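If a lock were wanted, one lightweight option is a per-network "try lock" that rejects a second /roll while one is in flight. The sketch below is illustrative only: `rollLocks`, `newRollLocks`, and `tryAcquire` are hypothetical names, and keying by network is an assumption about the right granularity.

```go
package main

import "sync"

// rollLocks allows at most one in-flight roll per network.
type rollLocks struct {
	mu     sync.Mutex
	active map[string]bool
}

func newRollLocks() *rollLocks {
	return &rollLocks{active: make(map[string]bool)}
}

// tryAcquire returns a release func and true if no roll is currently running
// for network. It never blocks; a busy network yields (nil, false).
func (l *rollLocks) tryAcquire(network string) (release func(), ok bool) {
	l.mu.Lock()
	defer l.mu.Unlock()

	if l.active[network] {
		return nil, false
	}
	l.active[network] = true

	var once sync.Once // makes release safe to call more than once
	return func() {
		once.Do(func() {
			l.mu.Lock()
			delete(l.active, network)
			l.mu.Unlock()
		})
	}, true
}
```

A handler could call `tryAcquire` at the top of `run`, reply with "a roll is already in progress" when it returns false, and `defer release()` otherwise. This only guards a single bot process; several instances would need a shared lock.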
Summary
Adds a gated, sequential, health-checked image rollout across a network's nodes: a Discord `/roll` slash command and a `panda-pulse roll` CLI subcommand, both backed by a shared `pkg/roll` engine.

watchtower auto-updates are uncontrolled: a bad image can hit the whole fleet at once. `/roll` rolls one node at a time, verifies each recovers, and aborts the moment a node fails, so a bad image never propagates.

How it works
- Rolls: each node's watchtower vhost (`https://watchtower-<node>.srv.<network>.ethpandaops.io/v1/update`) with a bearer token. Uses watchtower's stock `/v1/update` (no fork required).
- Health: Dora (`status: ready`), one unauthenticated fleet-wide call. Pre-flight gate + per-node recovery wait; `--force` / `force:true` skips gating for a known-bad node.
- Host selection: `client_group`, exact node, `*` globs, `!` exclusions, `all`.

Surfaces
- `/roll`: `network` + `client` (autocompleted from the network inventory) + optional `image`, `delay`, `force`, `dry_run`. Live per-host progress is posted to a channel message (so it survives Discord's 15-minute interaction window on long rolls), with a completion ping to the invoker.
- `panda-pulse roll`: the same engine from the CLI.

Configuration
- `WATCHTOWER_HTTP_API_TOKEN`: bearer token for the watchtower API.
- Dora URL: derived from the network (`https://dora.<network>.ethpandaops.io`); overridable.

Requires
- The `watchtower-<node>` vhost (separate ansible change). No SSH, no basic auth.

Test plan
- `go build ./...`
- `go test ./pkg/roll/...`
- `golangci-lint run` (clean)
- `/roll` in a guild and dry-run against a network
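For the selection syntax listed under "How it works", a simplified sketch of glob and exclusion matching is shown below. It assumes a comma-separated spec matched against node names with `path.Match`. The real parser in pkg/roll may differ, `client_group` resolution is not covered, and the node names in `demoSelect` are made up.

```go
package main

import (
	"path"
	"strings"
)

// selectNodes filters nodes using a comma-separated spec. "all" selects every
// node, "*" globs follow path.Match, and a leading "!" excludes matches.
func selectNodes(nodes []string, spec string) ([]string, error) {
	var include, exclude []string
	for _, p := range strings.Split(spec, ",") {
		p = strings.TrimSpace(p)
		switch {
		case p == "":
			continue
		case p == "all":
			include = append(include, "*")
		case strings.HasPrefix(p, "!"):
			exclude = append(exclude, p[1:])
		default:
			include = append(include, p)
		}
	}

	var out []string
	for _, n := range nodes {
		in, err := matchAny(include, n)
		if err != nil {
			return nil, err
		}
		ex, err := matchAny(exclude, n)
		if err != nil {
			return nil, err
		}
		if in && !ex {
			out = append(out, n)
		}
	}
	return out, nil
}

// matchAny reports whether name matches at least one pattern.
func matchAny(patterns []string, name string) (bool, error) {
	for _, p := range patterns {
		ok, err := path.Match(p, name)
		if err != nil {
			return false, err
		}
		if ok {
			return true, nil
		}
	}
	return false, nil
}

// demoSelect runs selectNodes over a made-up inventory and joins the result.
func demoSelect(spec string) string {
	nodes := []string{"lighthouse-geth-1", "lighthouse-besu-1", "prysm-geth-1"}
	got, err := selectNodes(nodes, spec)
	if err != nil {
		return "error: " + err.Error()
	}
	return strings.Join(got, ",")
}
```

Exclusions are applied after inclusions regardless of their order in the spec, which matches the usual Ansible reading of `!` patterns.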