Description
Overview of the Issue
Problem
On at least v19, and probably later versions, EmergencyReparentShard (which relies on calling the StopReplicationAndGetStatus tabletmanager RPC on all tablets) fails when any tablet in a shard has MySQL down:
rpc error: code = Unknown desc = TabletManager.StopReplicationAndGetStatus on REDACTED-0171543854: before status failed: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused (errno 2002) (sqlstate HY000)
The net.Dial error is vttablet attempting to connect to a downed MySQL server on the tablet. In our experience this scenario can happen when:
- MySQL/InnoDB has crashed/coredumped and cannot come back up
  - In our production environment we intentionally don't let MySQL restart after a crash
  - There are also cases where MySQL crashes and cannot start back up
- MySQL is stopped manually for whatever reason
Today the ERS code attempts the following for EVERY tablet (no matter what):
- Stop Replication and get GTID positions
- Wait for relay logs to apply
- Pick a most advanced candidate
This approach carefully determines which tablet has the most relay log changes, so that data loss and errant GTIDs are not created. However, a failure of the StopReplicationAndGetStatus RPC on any tablet halts the ERS at the first step above, which is quite dangerous for availability
Solution
Before I propose a solution here, I'll start with some known limitations: what I'm about to propose won't work for tablets with remote MySQL. It's much harder to be certain a network dial error means MySQL is "down"
Now, back to the error we receive: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused (errno 2002) (sqlstate HY000). On a tablet with local MySQL this is a "pretty-good" signal that MySQL is down, but we could be even more certain by checking the PID (we know this via the pidfile) and perhaps other details
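As a rough illustration of the pidfile check mentioned above, here is a minimal Go sketch. The function name mysqldRunning and the way the pidfile path is passed in are assumptions made for this example, not existing Vitess code. It reads a PID from the pidfile and probes the process with signal 0, which only checks for existence. PID reuse can still produce a false "running" answer, so this would be one signal among several.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// mysqldRunning reports whether the process recorded in pidFile is alive.
// Hypothetical helper for illustration only; Unix-only (uses syscall.Kill).
func mysqldRunning(pidFile string) (bool, error) {
	data, err := os.ReadFile(pidFile)
	if errors.Is(err, os.ErrNotExist) {
		return false, nil // assumption: no pidfile means mysqld is not running
	}
	if err != nil {
		return false, err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil || pid <= 0 {
		return false, fmt.Errorf("invalid pid in %s: %q", pidFile, data)
	}
	// Signal 0 delivers nothing; it only checks that the process exists.
	err = syscall.Kill(pid, 0)
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, syscall.EPERM):
		return true, nil // process exists but belongs to another user
	case errors.Is(err, syscall.ESRCH):
		return false, nil // no such process
	default:
		return false, err
	}
}

func main() {
	// Demo: a pidfile holding our own PID should be reported as running.
	f, err := os.CreateTemp("", "mysqld-*.pid")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	fmt.Fprintln(f, os.Getpid())
	f.Close()

	running, err := mysqldRunning(f.Name())
	fmt.Println("own pid running:", running, "err:", err)

	running, err = mysqldRunning("/nonexistent/mysqld.pid")
	fmt.Println("missing pidfile running:", running, "err:", err)
}
```

Combining this with the dial error gives two independent hints (socket refuses connections, recorded PID is gone) before a tablet is treated as having MySQL down.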
So let's imagine we have a strong signal MySQL is down on a tablet that still responds to tabletmanager RPCs. With the caveat that we can't always be 100% certain, if MySQL is down on the tablet, I feel we can infer (perhaps optionally):
- The MySQL on this tablet is no longer a semi-sync ack'er, because it's down
- The MySQL on this tablet cannot be the most advanced, because it's down (can't query its positions)
  - There are really odd edge cases where this might not be true, but I'd argue it should cover most regular Vitess users
So TL;DR: let's give the EmergencyReparentShard logic the context of which tablets have MySQL down or up, so we don't wait for them fruitlessly and fail the reparent. We'll assume a down mysqld means the tablet cannot be most-advanced
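To make the proposed behavior concrete, here is a simplified Go sketch of candidate selection that skips tablets whose mysqld is known to be down instead of aborting. Every name in it (stopResult, errMySQLDown, pickCandidate) is hypothetical, and an integer stands in for a GTID position, which in reality is a set and needs a proper comparison. It is not the actual ERS code.

```go
package main

import (
	"errors"
	"fmt"
)

// errMySQLDown is a hypothetical sentinel meaning "vttablet answered the
// RPC, but its local mysqld is down".
var errMySQLDown = errors.New("mysqld is down")

// stopResult is a simplified stand-in for one tablet's
// StopReplicationAndGetStatus outcome.
type stopResult struct {
	alias    string
	position int // simplification: real GTID positions are sets, not integers
	err      error
}

// pickCandidate returns the most advanced tablet among those that reported
// a position. Tablets with mysqld down are skipped and returned separately;
// any other error still aborts, as it does today.
func pickCandidate(results []stopResult) (winner string, skipped []string, err error) {
	var best *stopResult
	for i := range results {
		r := &results[i]
		switch {
		case r.err == nil:
			if best == nil || r.position > best.position {
				best = r
			}
		case errors.Is(r.err, errMySQLDown):
			skipped = append(skipped, r.alias)
		default:
			return "", nil, fmt.Errorf("tablet %s: %w", r.alias, r.err)
		}
	}
	if best == nil {
		return "", skipped, errors.New("no tablet with a running mysqld reported a position")
	}
	return best.alias, skipped, nil
}

func main() {
	results := []stopResult{
		{alias: "zone1-0001", position: 10},
		{alias: "zone1-0002", err: errMySQLDown},
		{alias: "zone1-0003", position: 12},
	}
	winner, skipped, err := pickCandidate(results)
	fmt.Println("winner:", winner, "skipped:", skipped, "err:", err)
}
```

The skipped list is returned rather than dropped so the caller can log it or leave those tablets for VTOrc to repair later.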
This could be implemented in a few ways:
- An in-vttablet "MySQL Monitor" + modify the RPC response for StopReplicationAndGetStatus to include this state
  - Today we get a nil response because the RPC errored, so we'd need to adjust that to still return a response, and perhaps include the error in the response
- The ERS code has a function that reliably matches mysqld-down errors, eg: IsMySQLDown(err error) bool
  - This could become a game of whack-a-mole, because we're matching on error strings (I believe)
  - This logic would need to match on dial errors (mysqld is down) and potentially ignore "timeout" errors, because this doesn't mean mysqld is really down
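For the second option, a string-matching IsMySQLDown could look roughly like the Go sketch below. The substrings it matches are assumptions based on the single error shown in this issue plus MySQL client error 2002 (CR_CONNECTION_ERROR); treating a missing socket file ("no such file or directory") as down is a further assumption. A real implementation would ideally inspect typed errors or error codes rather than message text.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isMySQLDownMessage applies the string heuristics to an error message.
// The matched substrings are assumptions taken from the error in this issue.
func isMySQLDownMessage(msg string) bool {
	msg = strings.ToLower(msg)
	// Timeouts do not prove mysqld is down, so never match them.
	for _, s := range []string{"timeout", "timed out", "deadline exceeded"} {
		if strings.Contains(msg, s) {
			return false
		}
	}
	// Require the MySQL client connection error code.
	if !strings.Contains(msg, "errno 2002") {
		return false
	}
	// Refused connection, or the socket file is gone entirely.
	return strings.Contains(msg, "connection refused") ||
		strings.Contains(msg, "no such file or directory")
}

// IsMySQLDown reports whether err looks like a local mysqld-down dial error.
func IsMySQLDown(err error) bool {
	return err != nil && isMySQLDownMessage(err.Error())
}

func main() {
	err := errors.New("net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: " +
		"dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused " +
		"(errno 2002) (sqlstate HY000)")
	fmt.Println("mysqld down:", IsMySQLDown(err))
}
```

Checking for timeouts first keeps the matcher conservative: an ambiguous error falls back to today's behavior rather than silently excluding a tablet.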
This definitely relies on VTOrc existing in a cluster, to fix tablets with the wrong primary if anything fails here
Your thoughts are appreciated!
Reproduction Steps
- Set up a shard with many tablets
- kill -9 $(pidof mysqld) on one tablet, but ensure the host/pod + vttablet remains alive
- Run an EmergencyReparentShard on the given shard
- Notice the ERS fails on the mysql-down error for a single tablet
Binary Version
v19, probably versions above v19
Operating System and Environment details
Linux
Log Fragments
rpc error: code = Unknown desc = TabletManager.StopReplicationAndGetStatus on REDACTED-0171543854: before status failed: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused (errno 2002) (sqlstate HY000)