
Bug Report: vtorc can cause a primary lockup when the keyspace durability policy is changed #18712

@arthurschreiber

Description


Overview of the Issue

When the keyspace durability policy is changed from "none" to any setting that requires semi-sync, vtorc notices the change and attempts to "fix" the semi-sync settings on the primary by issuing an UndoDemotePrimary call against the primary vttablet.

This call enables semi-sync on the primary, potentially before any of the replicas have semi-sync enabled, so the primary locks up waiting for acknowledgements that no replica can send. Even once semi-sync is enabled on the replicas, they can't acknowledge the changes that are already pending on the primary, so the whole shard locks up.
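The ordering problem can be illustrated with a toy model. This is a minimal sketch, not MySQL's or Vitess's implementation: the `replica` and `primary` types, `commit`, and `simulate` are hypothetical, and the rule that a transaction can only be acked by replicas that had semi-sync enabled when it was shipped (with commits completing strictly in order) is a simplifying assumption that mirrors the behavior described above rather than MySQL's actual acknowledgement protocol.

```go
package main

import "fmt"

// replica is a hypothetical toy replica that only tracks whether
// replica-side semi-sync is enabled.
type replica struct {
	semiSync bool
}

// primary is a hypothetical toy primary. With semi-sync on, each commit needs
// requiredAcks acknowledgements from semi-sync replicas before it completes.
type primary struct {
	semiSync     bool
	requiredAcks int
	nextID       int
	pending      []int // IDs of commits stuck waiting for acknowledgements
}

// commit ships one transaction and reports whether it completed. Simplifying
// assumption: only replicas that already have semi-sync enabled when the
// transaction is shipped can ack it, and commits complete in order, so one
// stuck commit blocks every later one.
func (p *primary) commit(replicas []*replica) bool {
	p.nextID++
	if !p.semiSync {
		return true // asynchronous replication: no ack needed
	}
	acks := 0
	for _, r := range replicas {
		if r.semiSync {
			acks++
		}
	}
	if len(p.pending) == 0 && acks >= p.requiredAcks {
		return true
	}
	// Not enough acks, or an earlier commit is still stuck ahead of this one.
	p.pending = append(p.pending, p.nextID)
	return false
}

// simulate enables semi-sync on the primary and on the replicas in the given
// order, with one write arriving between the two steps and one afterwards.
// It returns the number of commits left pending.
func simulate(primaryFirst bool) int {
	replicas := []*replica{{}, {}}
	p := &primary{requiredAcks: 1}
	enableReplicas := func() {
		for _, r := range replicas {
			r.semiSync = true
		}
	}
	if primaryFirst {
		p.semiSync = true
	} else {
		enableReplicas()
	}
	p.commit(replicas) // a write lands between the two steps
	if primaryFirst {
		enableReplicas()
	} else {
		p.semiSync = true
	}
	p.commit(replicas)
	return len(p.pending)
}

func main() {
	fmt.Printf("primary first: %d pending\n", simulate(true))   // primary first: 2 pending
	fmt.Printf("replicas first: %d pending\n", simulate(false)) // replicas first: 0 pending
}
```

In the primary-first ordering, the first write can never be acknowledged, and the later write queues behind it even though the replicas are fixed by then; in the replicas-first ordering, both writes complete.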

I believe vtorc notices that the primary is locked up and may attempt an ERS (EmergencyReparentShard), but ERS does not currently handle semi-sync lockups well, and there seems to be no way to get out of this broken state.

Reproduction Steps

N/A

Binary Version

N/A

Operating System and Environment details

N/A

Log Fragments

N/A
