[DPE-10397] The watcher now handles 4, 6, … units (2 units worked previously)#40
Draft
taurus-forever wants to merge 1 commit into
Draft
[DPE-10397] The watcher now handles 4, 6, … units (2 units worked previously)#40taurus-forever wants to merge 1 commit into
taurus-forever wants to merge 1 commit into
Conversation
The behavior you asked to keep is untouched (votes at 1 unit; odd ≥3 stands down with an Active+WARNING status; AZ blocking logic unchanged). The key finding from the audit: the vote rule itself (_should_watcher_vote) was already correct for even counts — what was broken was the lifecycle around it. The most serious bug was the 3→4 rejoin: it depended entirely on a single relation-changed event, with no recovery if that event was missed, and the update-status path never told PostgreSQL the watcher had stood down. Fixes in src/relations/watcher_requirer.py: - Added a self-heal path in update-status: if the watcher should be voting but its Raft service isn't running, it reconfigures and flips raft-status back to "connected" (the core 3→4 rejoin fix). - The update-status stand-down now publishes raft-status: "disabled" so PostgreSQL actually drops the watcher from its Patroni raft config, stops the service before removing the raft member (a running pysyncobj node could rejoin in between), and is idempotent instead of re-probing every 5 minutes. - Fixed two return-instead-of-continue bugs that let one relation starve all others (multi-cluster), and the stand-down now sets the warning status immediately instead of waiting for the next update-status. - Provider disable-watcher is now distinguished from the odd-count stand-down: it tears the watcher down even at even counts, is never self-healed, and doesn't emit the misleading "odd number units" warning. - Status while stood down now reads "Watcher standing down, monitoring N PostgreSQL endpoints" instead of falsely claiming "Raft connected". Tests (written first, watched fail, then made green): - 16 new unit tests: the vote rule parametrized over 0–7 units, plus stand-down/rejoin/idempotency/self-heal/multi-relation scenarios — tox run -e unit: 31 passed, tox run -e lint: clean. - New integration test tests/integration/ha_tests/test_watcher_scaling.py scaling 2→3→4→2 with continuous writes, asserting raft membership (3 → 3 without watcher → 5 → 3) via the shared verify_raft_cluster_health helper (moved to ha_tests/helpers.py with a new expect_watcher_absent flag), plus a spread task so CI picks it up. Not run locally — it needs a Juju+LXD model (tox run -e integration -- tests/integration/ha_tests/test_watcher_scaling.py --model testing), so the rejoin path should be considered proven only once that passes in CI. Assisted-by: Claude:claude-fable-5 (before-banned-by-trump) :-(
Contributor
Author
|
@dragomirp mentioned it is all should NOT be necessary, the current codebase should work well with 2,4,6,... units as documented in manuals. TODO: re-test. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The new behavior keeps is untouched (votes at 1 unit; odd ≥3 stands down with an Active+WARNING status; AZ blocking logic unchanged).
The key finding from the audit: the vote rule itself (_should_watcher_vote) was already correct for even counts — what was broken was the lifecycle around it. The most serious bug was the 3→4 rejoin: it depended entirely on a single relation-changed event, with no recovery if that event was missed, and the update-status path never told PostgreSQL the watcher had stood down.
Fixes in src/relations/watcher_requirer.py:
Tests (written first, watched fail, then made green):
Assisted-by: Claude:claude-fable-5 (before-banned-by-trump) :-(