
Conversation


@arthurschreiber arthurschreiber commented Oct 2, 2025

Description

During EmergencyReparentShard we don't want to wait for the old primary to get acks for all pending changes, because we're performing a demotion at the same time as unlinking the existing replicas from the primary. If all replicas are unlinked while commits are still pending, the primary will never receive the semi-sync acks it needs to unblock, causing the demoted host to stick around as a PRIMARY.

The old primary won't directly cause any issues, as vtgates will notice that its PrimaryTermStartTime is lower than the newly elected primary's, so traffic will be cut off.

But vtorc will notice that an unhealthy primary is still around and will continuously try to demote it.
The tablet itself will also notice that there's a different tablet that has assumed the PRIMARY role, and will try to demote itself as well.

Both of these attempts will continuously fail if there are any pending writes stuck waiting for a semi-sync ack: without any replicas, the old primary won't receive any acks. These pending writes also cause other operations to fail, like setting the primary to read-only, because the read-only change requires a lock that can't be granted while semi-sync acks are pending.

This PR addresses this behavior by adding a new option to the DemotePrimaryRequest gRPC call that allows "forcing" a demotion: it detects whether the old primary is blocked waiting for semi-sync writes, and if it is, simply disables the "source" side of semi-sync.

Disabling the "source" side of semi-sync allows all pending commits to complete, unblocking other operations. This will very likely cause errant GTIDs on the demoted primary, as those commits won't have been acknowledged by any of the replicas.
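
Mechanically, disabling the source side is a single global variable change; pending commits stop waiting for acks as soon as it takes effect. A minimal sketch of the idea (the plugin-dependent variable names are an assumption about the MySQL environment, not code taken from this PR):

```go
package main

import "fmt"

// disableSourceSemiSync returns the statement that turns off the source
// (primary) side of semi-sync. The variable name depends on which plugin
// generation is loaded: rpl_semi_sync_source_enabled for the newer MySQL
// plugin, rpl_semi_sync_master_enabled for the legacy one. This is a
// sketch of the mechanism, not the code the PR adds.
func disableSourceSemiSync(legacyPlugin bool) string {
	if legacyPlugin {
		return "SET GLOBAL rpl_semi_sync_master_enabled = 0"
	}
	return "SET GLOBAL rpl_semi_sync_source_enabled = 0"
}

func main() {
	fmt.Println(disableSourceSemiSync(false))
}
```

Once this statement executes, every session blocked in a semi-sync wait commits immediately, without any replica having acknowledged the transaction, which is exactly where the errant GTIDs come from.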

These errant GTIDs will be noticed by vtorc, and the demoted primary will eventually be moved to the DRAINED type.

Because of an unrelated issue (#18763), this won't directly help with the initial call to ERS, but it should allow old primary tablets to properly demote themselves once they notice that a different tablet has been promoted.

Details

Here's my understanding of how a semi-sync-blocked primary will behave when DemotePrimary gets called with these new changes:

  • The primary is set to non-serving. This operation can take up to --shutdown-grace-period. Queries/transactions that haven't finished by the time --shutdown-grace-period has passed return an error to clients (even if at that point they might still be stuck on the MySQL side waiting for a semi-sync ack). This is important because it means we never acknowledge these writes to the clients that issued them. At the same time, no new reads can happen.

  • We check if a query / transaction is blocked on semi-sync. If it is, and we're forcing a demotion, we disable source-side semi-sync. This unblocks all of the blocked queries / transactions, and very likely leads to the demoted primary having one or more errant GTIDs.

  • super_read_only is enabled.

  • the latest GTID of the demoted primary is returned.
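
The steps above can be sketched as a single flow. The interface and method names below are illustrative stand-ins for the vitess mysqlctl/tabletmanager APIs, not the PR's actual signatures:

```go
package main

import (
	"errors"
	"fmt"
)

// mysqlCtl abstracts the MySQL-level calls used during demotion.
// The method names are illustrative, not the real vitess mysqlctl API.
type mysqlCtl interface {
	IsSemiSyncBlocked() (bool, error)
	DisableSourceSemiSync() error
	SetSuperReadOnly() error
	PrimaryPosition() (string, error)
}

// demotePrimary sketches the flow described above: if pending commits are
// blocked waiting for semi-sync acks and force is set, source-side
// semi-sync is disabled so those commits can complete before
// super_read_only is enabled and the final GTID position is returned.
func demotePrimary(m mysqlCtl, force bool) (string, error) {
	blocked, err := m.IsSemiSyncBlocked()
	if err != nil {
		return "", err
	}
	if blocked {
		if !force {
			return "", errors.New("primary is blocked on semi-sync acks; demotion would hang")
		}
		if err := m.DisableSourceSemiSync(); err != nil {
			return "", err
		}
	}
	if err := m.SetSuperReadOnly(); err != nil {
		return "", err
	}
	return m.PrimaryPosition()
}

// fakeMysql simulates a primary stuck waiting for semi-sync acks.
type fakeMysql struct {
	semiSyncEnabled bool
}

func (f *fakeMysql) IsSemiSyncBlocked() (bool, error) { return f.semiSyncEnabled, nil }
func (f *fakeMysql) DisableSourceSemiSync() error     { f.semiSyncEnabled = false; return nil }
func (f *fakeMysql) SetSuperReadOnly() error          { return nil }
func (f *fakeMysql) PrimaryPosition() (string, error) { return "source_id:1-100", nil }

func main() {
	m := &fakeMysql{semiSyncEnabled: true}
	if _, err := demotePrimary(m, false); err != nil {
		fmt.Println("without force:", err)
	}
	pos, _ := demotePrimary(m, true)
	fmt.Println("with force, demoted at position:", pos)
}
```

The key property is that without force the demotion fails fast instead of hanging, and with force the semi-sync disable happens before super_read_only, so the pending commits drain first.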

Reproduction

It took me a while to come up with a test case that demonstrates the issue, and shows that this fix works.

Here's one way to reproduce the problem:

  • The primary of a cluster loses connection to all semi-sync-acking replicas. New and in-flight writes will start blocking.
  • EmergencyReparentShard gets called to elect a new primary and fix the situation.
  • The old primary does not come back online before --wait-replicas-timeout has passed. This is important as there seems to be a separate bug where a leftover goroutine tries to execute SetReplicationSource on the old primary even after EmergencyReparentShard has finished executing.
  • Once the old primary notices that it's no longer supposed to be a primary, it will try to demote itself via DemotePrimary.
  • This is where things go sideways. The old primary can't be demoted, because the semi-sync unblock wait will time out: there are no replicas attached to the old primary that could unblock semi-sync. This makes the demotion fail or time out, and the old primary stays in PRIMARY mode (but is no longer serving).
  • If vtorc is set up, it will continue trying to demote the primary via DemotePrimary (which will continue failing).

With the new force option, the self-demotion of the old primary skips the semi-sync unblock step and instead disables the primary side of semi-sync to unblock any commits waiting for an ack. The primary will demote itself and switch to a REPLICA. It will fail to re-attach to the cluster because of errant GTIDs, and will stay in NOT_SERVING mode until vtorc notices the errant GTIDs and puts it into DRAINED.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

AI Disclosure

…mergencyReparentShard`.

During `EmergencyReparentShard` we don't want to wait for the old primary to get acks for all pending changes, because we're performing a demotion at the same time as unlinking existing replicas from the primary. If all replicas are unlinked while commits are still pending, the primary will never receive the semi-sync acks it needs to unblock, causing ERS to never complete.

Signed-off-by: Arthur Schreiber <[email protected]>

vitess-bot bot commented Oct 2, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Oct 2, 2025
@github-actions github-actions bot added this to the v23.0.0 milestone Oct 2, 2025

codecov bot commented Oct 3, 2025

Codecov Report

❌ Patch coverage is 51.78571% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.76%. Comparing base (8230b8a) to head (3a4c5a7).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
go/vt/mysqlctl/replication.go 0.00% 15 Missing ⚠️
go/vt/vttablet/tabletmanager/rpc_replication.go 37.50% 10 Missing ⚠️
go/vt/vtcombo/tablet_map.go 0.00% 1 Missing ⚠️
go/vt/vttablet/faketmclient/fake_client.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18714      +/-   ##
==========================================
- Coverage   69.77%   69.76%   -0.02%     
==========================================
  Files        1608     1608              
  Lines      214865   214908      +43     
==========================================
  Hits       149922   149922              
- Misses      64943    64986      +43     


@arthurschreiber arthurschreiber added Component: TabletManager Type: Enhancement Logical improvement (somewhere between a bug and feature) and removed NeedsWebsiteDocsUpdate What it says NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request labels Oct 3, 2025
@arthurschreiber arthurschreiber changed the title Add new force flag to DemotePrimary to force a failover during `E… Add new force flag to DemotePrimary to force a failover during EmergencyReparentShard Oct 3, 2025
@systay systay modified the milestones: v23.0.0, v24.0.0 Oct 8, 2025

@shlomi-noach shlomi-noach left a comment


Things I've seen:

  • When primary is blocked on semi-sync, you sometimes can't create a new connection. This shouldn't strictly happen, but it does, on my local sandbox
  • When primary is blocked, setting rpl_semi_sync_source_enabled = 0 should be fine and has the intended effect.

Another approach would be to actively kill queries/transactions that are waiting on semi-sync, but frankly that in itself could be blocking.


arthurschreiber commented Oct 19, 2025

Another approach would be to actively kill queries/transactions that are waiting on semi-sync, but frankly that in itself could be blocking.

My experience is that when things are blocked on semi-sync, the kill flag is not actively probed and the query can't be killed. 😞

@arthurschreiber
Member Author

  • When primary is blocked on semi-sync, you sometimes can't create a new connection. This shouldn't strictly happen, but it does, on my local sandbox

Do you mean a new connection through the connection pool? Or a new connection against MySQL?

The connection pool won't allow opening a new connection if all slots are used up (unlikely, but can happen if other work gets stuck waiting too and we hit the limit).

Connections to MySQL should be allowed as long as we don't hit the overall connection limit, no?

@shlomi-noach
Contributor

Do you mean a new connection through the connection pool? Or a new connection against MySQL?

A new connection against MySQL. Tested on my local dev host using dbdeployer and a mysql80 replication setup. Steps:

  • install plugin on primary
  • configure semi sync on primary (enable, timeout, number of acknowledgements)
  • create table t1(id int) - hangs
  • Try to open new connection to the primary - hangs.
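
The steps above boil down to enabling semi-sync on a primary that has no replica able to ack. A sketch of the statements involved, printed from Go (system variable and plugin names follow MySQL's newer rpl_semi_sync_source plugin; the legacy plugin uses "master"/"slave" in place of "source"/"replica" -- these are assumptions about the sandbox setup, not commands taken from this thread):

```go
package main

import "fmt"

// repro lists statements that put a primary into the blocked state
// described above: semi-sync is enabled and requires one ack, but no
// replica is attached to send acks, so the CREATE TABLE blocks until
// rpl_semi_sync_source_timeout expires.
var repro = []string{
	"INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so'",
	"SET GLOBAL rpl_semi_sync_source_enabled = 1",
	"SET GLOBAL rpl_semi_sync_source_timeout = 3600000", // 1h, so the stall is observable
	"SET GLOBAL rpl_semi_sync_source_wait_for_replica_count = 1",
	"CREATE TABLE t1 (id INT)", // hangs waiting for an ack that never arrives
}

func main() {
	for _, stmt := range repro {
		fmt.Println(stmt + ";")
	}
}
```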

@shlomi-noach
Contributor

Connections to MySQL should be allowed as long as we don't hit the overall connection limit, no?

Yes, I agree! And yet this is the behavior I'm seeing locally. It surprised me, too. I don't have recollection of such behavior.

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Arthur Schreiber <[email protected]>
@arthurschreiber arthurschreiber changed the title Add new force flag to DemotePrimary to force a failover during EmergencyReparentShard Add new force flag to DemotePrimary to force a demotion even when blocked on waiting for semi-sync acks Oct 22, 2025
Signed-off-by: Arthur Schreiber <[email protected]>
@arthurschreiber arthurschreiber marked this pull request as ready for review October 22, 2025 14:09
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
This reverts commit f2622c4.

Signed-off-by: Tim Vaillancourt <[email protected]>
@timvaillancourt timvaillancourt force-pushed the arthur/add-forced-demotion branch from 714a20a to a805c4a on November 17, 2025 13:25
@timvaillancourt timvaillancourt self-requested a review November 17, 2025 13:50
@timvaillancourt timvaillancourt linked an issue Nov 18, 2025 that may be closed by this pull request
var primaryStatus *replicationdatapb.PrimaryStatus

primaryStatus, err = tmc.DemotePrimary(groupCtx, tabletInfo.Tablet)
primaryStatus, err = tmc.DemotePrimary(groupCtx, tabletInfo.Tablet, durability.HasSemiSync() /* force */)

Just noting that this impacts PRS as well as ERS.

}

func (mysqld *Mysqld) IsSemiSyncBlocked(ctx context.Context) (bool, error) {
conn, err := getPoolReconnect(ctx, mysqld.dbaPool)

That's fair. I don't think we should use DBA everywhere here, but it does line up with the other uses. And using a single connection for all of the work might be a good idea here too. Not related though so we can investigate this more separately.

return conn.Conn.SemiSyncExtensionLoaded()
}

func (mysqld *Mysqld) IsSemiSyncBlocked(ctx context.Context) (bool, error) {

I don't like that we have another method to check this when the monitor exposes methods for this as well.
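
For context, a blocked-detection check like IsSemiSyncBlocked can be sketched as follows. The wait_sessions status counter (Rpl_semi_sync_source_wait_sessions, or Rpl_semi_sync_master_wait_sessions for the legacy plugin) reports how many sessions are currently waiting for an ack; the injected query function and this particular heuristic are assumptions, not the PR's actual implementation:

```go
package main

import (
	"fmt"
	"strconv"
)

// isSemiSyncBlocked reports whether any session is stalled in a semi-sync
// wait, based on the plugin's wait_sessions status counter. showStatus
// stands in for running SHOW GLOBAL STATUS LIKE <pattern> and returning
// the matching variable names and values.
func isSemiSyncBlocked(showStatus func(pattern string) (map[string]string, error)) (bool, error) {
	status, err := showStatus("Rpl_semi_sync_%_wait_sessions")
	if err != nil {
		return false, err
	}
	for _, v := range status {
		n, err := strconv.Atoi(v)
		if err != nil {
			return false, fmt.Errorf("unexpected status value %q: %w", v, err)
		}
		if n > 0 {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Fake SHOW GLOBAL STATUS result for a primary with two stalled commits.
	stalled := func(pattern string) (map[string]string, error) {
		return map[string]string{"Rpl_semi_sync_source_wait_sessions": "2"}, nil
	}
	blocked, _ := isSemiSyncBlocked(stalled)
	fmt.Println("blocked:", blocked)
}
```

A nice property of a status-counter check is that it works even when the blocked sessions themselves can't be killed or queried.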

arthurschreiber and others added 5 commits November 19, 2025 17:19
Signed-off-by: Tim Vaillancourt <[email protected]>
Setting this to be based on the durability policy might actually end up being wrong, because that reflects what the durability policy should be _after_ the failover, but `DemotePrimary` needs to operate based on the _current_ state.

Signed-off-by: Arthur Schreiber <[email protected]>
…/add-forced-demotion

Signed-off-by: Arthur Schreiber <[email protected]>
@mattlord mattlord left a comment

LGTM! Just the one minor comment. Thank you for working on this @arthurschreiber and @timvaillancourt ! ❤️

var primaryStatus *replicationdatapb.PrimaryStatus

primaryStatus, err = tmc.DemotePrimary(groupCtx, tabletInfo.Tablet)
primaryStatus, err = tmc.DemotePrimary(groupCtx, tabletInfo.Tablet, true /* force */)

Today, stopReplicationAndBuildStatusMaps appears to be used only for ERS. The function does not appear to be specific to ERS though. So IMO it would be better to make this explicitly tied to ERS. Meaning:

  1. This function takes a new bool for forceDemotion
  2. It then passes that on to DemotePrimary

It's not required, but might potentially prevent future unintended behavior.

Member Author

Alternatively, we could move this function from replication.go to emergency_reparenter.go, as that's the only place it's used from.

Member Author

I'll do that in a follow up PR, just to make sure the diffs are easier to read (and so this doesn't require another review round on this PR).

…/add-forced-demotion

Signed-off-by: Arthur Schreiber <[email protected]>
@arthurschreiber arthurschreiber enabled auto-merge (squash) November 26, 2025 15:52
@arthurschreiber arthurschreiber merged commit 1f49de4 into main Nov 26, 2025
106 of 111 checks passed
@arthurschreiber arthurschreiber deleted the arthur/add-forced-demotion branch November 26, 2025 17:14
siddharth16396 pushed a commit to siddharth16396/postpone-complete that referenced this pull request Dec 3, 2025
… blocked on waiting for semi-sync acks (vitessio#18714)

Signed-off-by: Arthur Schreiber <[email protected]>
Signed-off-by: Tim Vaillancourt <[email protected]>
Co-authored-by: Tim Vaillancourt <[email protected]>
Signed-off-by: siddharth16396 <[email protected]>

Labels

Component: TabletManager Type: Enhancement Logical improvement (somewhere between a bug and feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: DemotePrimary always fails during ERS

6 participants