Skip to content

[DPE-10397] The watcher now handles 4, 6, … units (2 units worked previously)#40

Draft
taurus-forever wants to merge 1 commit into
16/edgefrom
alutay/dpe10397
Draft

[DPE-10397] The watcher now handles 4, 6, … units (2 units worked previously)#40
taurus-forever wants to merge 1 commit into
16/edgefrom
alutay/dpe10397

Conversation

@taurus-forever

@taurus-forever taurus-forever commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

The new behavior keeps is untouched (votes at 1 unit; odd ≥3 stands down with an Active+WARNING status; AZ blocking logic unchanged).

The key finding from the audit: the vote rule itself (_should_watcher_vote) was already correct for even counts — what was broken was the lifecycle around it. The most serious bug was the 3→4 rejoin: it depended entirely on a single relation-changed event, with no recovery if that event was missed, and the update-status path never told PostgreSQL the watcher had stood down.

Fixes in src/relations/watcher_requirer.py:

  • Added a self-heal path in update-status: if the watcher should be voting but its Raft service isn't running, it reconfigures and flips raft-status back to "connected" (the core 3→4 rejoin fix).
  • The update-status stand-down now publishes raft-status: "disabled" so PostgreSQL actually drops the watcher from its Patroni raft config, stops the service before removing the raft member (a running pysyncobj node could rejoin in between), and is idempotent instead of re-probing every 5 minutes.
  • Fixed two return-instead-of-continue bugs that let one relation starve all others (multi-cluster), and the stand-down now sets the warning status immediately instead of waiting for the next update-status.
  • Provider disable-watcher is now distinguished from the odd-count stand-down: it tears the watcher down even at even counts, is never self-healed, and doesn't emit the misleading "odd number units" warning.
  • Status while stood down now reads "Watcher standing down, monitoring N PostgreSQL endpoints" instead of falsely claiming "Raft connected".

Tests (written first, watched fail, then made green):

  • 16 new unit tests: the vote rule parametrized over 0–7 units, plus stand-down/rejoin/idempotency/self-heal/multi-relation scenarios — tox run -e unit: 31 passed, tox run -e lint: clean.
  • New integration test tests/integration/ha_tests/test_watcher_scaling.py scaling 2→3→4→2 with continuous writes, asserting raft membership (3 → 3 without watcher → 5 → 3) via the shared verify_raft_cluster_health helper (moved to ha_tests/helpers.py with a new expect_watcher_absent flag), plus a spread task so CI picks it up. Not run locally — it needs a Juju+LXD model (tox run -e integration -- tests/integration/ha_tests/test_watcher_scaling.py --model testing), so the rejoin path should be considered proven only once that passes in CI.

Assisted-by: Claude:claude-fable-5 (before-banned-by-trump) :-(

The behavior you asked to keep is untouched (votes at 1 unit;
odd ≥3 stands down with an Active+WARNING status; AZ blocking logic unchanged).

The key finding from the audit: the vote rule itself (_should_watcher_vote) was
already correct for even counts — what was broken was the lifecycle around it.
The most serious bug was the 3→4 rejoin: it depended entirely on a single
relation-changed event, with no recovery if that event was missed,
and the update-status path never told PostgreSQL the watcher had stood down.

Fixes in src/relations/watcher_requirer.py:
- Added a self-heal path in update-status: if the watcher should be
  voting but its Raft service isn't running, it reconfigures and flips
  raft-status back to "connected" (the core 3→4 rejoin fix).
- The update-status stand-down now publishes raft-status:
  "disabled" so PostgreSQL actually drops the watcher from its Patroni raft config,
  stops the service before removing the raft member
  (a running pysyncobj node could rejoin in between),
  and is idempotent instead of re-probing every 5 minutes.
- Fixed two return-instead-of-continue bugs that let one relation starve all others (multi-cluster),
  and the stand-down now sets the warning status immediately instead of waiting for the next update-status.
- Provider disable-watcher is now distinguished from the odd-count stand-down:
  it tears the watcher down even at even counts, is never self-healed,
  and doesn't emit the misleading "odd number units" warning.
- Status while stood down now reads "Watcher standing down,
  monitoring N PostgreSQL endpoints" instead of falsely claiming "Raft connected".

Tests (written first, watched fail, then made green):
- 16 new unit tests: the vote rule parametrized over 0–7 units,
  plus stand-down/rejoin/idempotency/self-heal/multi-relation
  scenarios — tox run -e unit: 31 passed, tox run -e lint: clean.
- New integration test tests/integration/ha_tests/test_watcher_scaling.py
  scaling 2→3→4→2 with continuous writes, asserting raft membership
  (3 → 3 without watcher → 5 → 3) via the shared verify_raft_cluster_health helper
  (moved to ha_tests/helpers.py with a new expect_watcher_absent flag),
  plus a spread task so CI picks it up. Not run locally — it needs a Juju+LXD model
  (tox run -e integration -- tests/integration/ha_tests/test_watcher_scaling.py --model testing),
  so the rejoin path should be considered proven only once that passes in CI.

Assisted-by: Claude:claude-fable-5 (before-banned-by-trump) :-(
@taurus-forever taurus-forever changed the title [DPE-10397] The watcher now handles 2, 4, 6, … PostgreSQL units reliably [DPE-10397] The watcher now handles 4, 6, … units (2 units worked previously) Jun 14, 2026
@taurus-forever taurus-forever added the enhancement New feature or request label Jun 14, 2026
@taurus-forever

Copy link
Copy Markdown
Contributor Author

@dragomirp mentioned it is all should NOT be necessary, the current codebase should work well with 2,4,6,... units as documented in manuals. TODO: re-test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request Libraries: OK

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant