feat: add /roll command for gated fleet image rollouts #161

Open
barnabasbusa wants to merge 3 commits into master from feat/roll-command

barnabasbusa (Contributor) commented May 29, 2026

Summary

Adds a gated, sequential, health-checked image rollout across a network's nodes — a Discord /roll slash command and a panda-pulse roll CLI subcommand, both backed by a shared pkg/roll engine.

watchtower auto-updates are uncontrolled: a bad image can hit the whole fleet at once. /roll rolls one node at a time, verifies each recovers, and aborts the moment a node fails — so a bad image never propagates.

How it works

  • Inventory — targets are resolved from cartographoor's per-network inventory and grouped by node.
  • Trigger — each node is updated by calling its watchtower HTTP API at its vhost (https://watchtower-<node>.srv.<network>.ethpandaops.io/v1/update) with a bearer token. This uses watchtower's stock /v1/update endpoint (no fork required).
  • Health — Dora is the source of truth (status: ready), one unauthenticated fleet-wide call. Pre-flight gate + per-node recovery wait; --force / force:true skips gating for a known-bad node.
  • Selection — Ansible-style host patterns: client, client_group, exact node, * globs, ! exclusions, all.
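As a rough illustration of the selection step, the sketch below filters node names with globs, `!` exclusions, and `all`. It makes several assumptions that are not taken from the PR: terms are comma-separated, matching is done on node names only (no client or client_group lookup), and the `selectNodes` name and the example node names are hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchAny reports whether name matches at least one shell-style glob.
func matchAny(globs []string, name string) (bool, error) {
	for _, g := range globs {
		ok, err := path.Match(g, name)
		if err != nil {
			return false, fmt.Errorf("bad pattern %q: %w", g, err)
		}

		if ok {
			return true, nil
		}
	}

	return false, nil
}

// selectNodes keeps the names matched by at least one include term and by no
// exclude term. Terms prefixed with "!" exclude, "all" includes everything.
// The comma separator is an assumption of this sketch.
func selectNodes(names []string, patterns string) ([]string, error) {
	var include, exclude []string

	for _, term := range strings.Split(patterns, ",") {
		term = strings.TrimSpace(term)

		switch {
		case term == "":
			continue
		case strings.HasPrefix(term, "!"):
			exclude = append(exclude, term[1:])
		case term == "all":
			include = append(include, "*")
		default:
			include = append(include, term)
		}
	}

	var out []string

	for _, name := range names {
		in, err := matchAny(include, name)
		if err != nil {
			return nil, err
		}

		ex, err := matchAny(exclude, name)
		if err != nil {
			return nil, err
		}

		if in && !ex {
			out = append(out, name)
		}
	}

	return out, nil
}

func main() {
	names := []string{"lighthouse-geth-1", "lighthouse-nethermind-1", "prysm-geth-1"}

	selected, err := selectNodes(names, "lighthouse-*,!*-nethermind-*")
	fmt.Println(selected, err)
}
```

Exclusions are applied after inclusions, so a node matched by both an include and an exclude term is dropped.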

Surfaces

  • Discord /roll — network + client (autocompleted from the network inventory) + optional image, delay, force, dry_run. Live per-host progress is posted to a channel message (so it survives Discord's 15-minute interaction window on long rolls), with a completion ping to the invoker.
  • panda-pulse roll — the same engine from the CLI.

Configuration

  • WATCHTOWER_HTTP_API_TOKEN — bearer token for the watchtower API.
  • Dora URL is derived from the network (https://dora.<network>.ethpandaops.io); overridable.

Requires

  • The watchtower HTTP API exposed per node at the watchtower-<node> vhost (separate ansible change). No SSH, no basic auth.

Test plan

  • go build ./...
  • go test ./pkg/roll/...
  • golangci-lint run (clean)
  • Register /roll in a guild and dry-run against a network
  • Live roll of a single client on a devnet

Adds a gated, sequential, health-checked image rollout capability across a
network's nodes, surfaced as a Discord /roll slash command and a
`panda-pulse roll` CLI subcommand backed by a shared pkg/roll engine.

- pkg/roll: sequential engine (abort + leave-rest-untouched on failure),
  pluggable actuators (SSH-to-local-watchtower and watchtower vhost API),
  Dora-based health gating with per-node beacon fallback, cartographoor
  inventory resolution, and Ansible-style host selection (globs, exclusions,
  client/group/node, "all").
- Discord /roll: client autocomplete from the network inventory, live
  per-host progress in a channel message (survives the 15m interaction
  window), a force option to override health gating, and a completion ping.

Health uses Dora as the source of truth (one unauthenticated fleet-wide
call); rolls trigger watchtower's HTTP API via the watchtower-<host> vhost,
so no SSH or basic auth is required on the default path.

barnabasbusa requested a review from mattevans as a code owner May 29, 2026 13:50

The watchtower-vhost API actuator + Dora health gating is the only path now,
so the SSH actuator was dead weight. Remove it and all SSH wiring (config,
flags, env, golang.org/x/crypto/ssh). Target.SSH is retained purely as the host
string the API actuator derives watchtower-<host> from.

Also satisfies golangci-lint (govet shadow, wsl_v5 whitespace, and tagliatelle
nolints on the beacon/Dora snake_case API structs).

The reviewer's reported error-handling bug was a false positive — c.actuator()'s
error is already captured and checked at the call site.

Simplify the rollout to a single, clean path with no SSH and no direct
(basic-auth) beacon access:

- Health is sourced solely from Dora (status: ready), keyed by node name.
  Remove the beacon health checker and all basic-auth wiring (config, env,
  flags, Options fields).
- The API actuator targets each node's watchtower vhost
  (https://watchtower-<node>.srv.<network>.ethpandaops.io) with a bearer token;
  drop SSH host derivation.
- Inventory targets are grouped by node name (clientName); drop the unused
  ssh/beacon/rpc fields and the beaconScheme plumbing.

The only inputs are now the network, the cartographoor inventory, the Dora URL
(derived from the network), and the watchtower API token.
Comment thread pkg/roll/engine.go
return fmt.Errorf("trigger: %w", err)
}

if err := sleep(ctx, opts.PostTriggerWait); err != nil {

Member

This only verifies liveness of the rolled client, not necessarily that it's running the new image. Not sure if it's feasible to check, but if watchtower is slow, you'll get the old container reporting "healthy" without the new image having been rolled out.

// the gated rollout. Progress is tracked in a normal bot message (not the
// interaction reply) so it survives past Discord's 15-minute interaction-token
// window — a multi-node roll can take longer than that.
func (c *Command) run(s *discordgo.Session, i *discordgo.InteractionCreate, data discordgo.ApplicationCommandInteractionData) error {

Member

Not sure if you want some form of concurrency lock here, but simultaneous /roll commands can currently be run. I guess it's probably not too much of an issue?
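
For reference, one possible shape for such a guard is a per-network "roll in progress" set. This is a hedged sketch only: `rollGuard` and `tryAcquire` are hypothetical names that do not exist in the PR, and the guard only protects a single bot process.

```go
package main

import (
	"fmt"
	"sync"
)

// rollGuard tracks which networks currently have a roll in progress.
// Hypothetical helper, not part of the PR. The zero value is ready to use.
type rollGuard struct {
	mu     sync.Mutex
	active map[string]bool
}

// tryAcquire marks the network as busy and returns a release function, or
// reports ok=false if a roll is already running for that network.
func (g *rollGuard) tryAcquire(network string) (release func(), ok bool) {
	g.mu.Lock()
	defer g.mu.Unlock()

	if g.active[network] {
		return nil, false
	}

	if g.active == nil {
		g.active = make(map[string]bool)
	}

	g.active[network] = true

	return func() {
		g.mu.Lock()
		defer g.mu.Unlock()

		delete(g.active, network)
	}, true
}

func main() {
	var guard rollGuard

	release, ok := guard.tryAcquire("example-devnet-0")
	fmt.Println("first roll acquired:", ok)

	_, ok = guard.tryAcquire("example-devnet-0")
	fmt.Println("second roll acquired:", ok)

	release()

	_, ok = guard.tryAcquire("example-devnet-0")
	fmt.Println("after release acquired:", ok)
}
```

A command handler could call `tryAcquire` before starting the engine and reply with a "roll already in progress" message when it returns false, deferring `release` otherwise.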
