Bug Report/RFC: EmergencyReparentShard fails when mysqld is down on any tablet in a shard #18528

@timvaillancourt

Overview of the Issue

Problem

On at least v19, and probably later versions, EmergencyReparentShard (which relies on calling the StopReplicationAndGetStatus tabletmanager RPC on all tablets) fails when any tablet in a shard has MySQL down:

rpc error: code = Unknown desc = TabletManager.StopReplicationAndGetStatus on REDACTED-0171543854: before status failed: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused (errno 2002) (sqlstate HY000)

The net.Dial error is vttablet attempting to connect to a downed MySQL server on the tablet. In our experience this scenario can happen when:

  1. MySQL/InnoDB has crashed/coredumped and cannot come back up
    • In our production environment we intentionally don't let MySQL restart after a crash
    • There are also cases where MySQL can crash and cannot start back up
  2. MySQL is stopped manually for whatever reason

Today the ERS code attempts the following for EVERY tablet (no matter what):

  1. Stop Replication and get GTID positions
  2. Wait for relay logs to apply
  3. Pick a most advanced candidate

This approach is deliberately careful about finding which tablet has the most relay log changes, so that data loss and errant GTIDs are not created. However, a failure of the StopReplicationAndGetStatus RPC on any tablet halts the ERS at step 1 above, which is quite dangerous for availability

Solution

Before I propose a solution here, I'll start with some known limitations: what I'm about to propose won't work for tablets with remote MySQL. It's much harder to be certain a network dial error means MySQL is "down"

Now, back to the error we receive: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused (errno 2002) (sqlstate HY000). On a tablet with local MySQL this is a "pretty-good" signal that MySQL is down, but we could be even more certain by checking the PID (we know this via the pidfile) and perhaps other details

So let's imagine we have a strong signal MySQL is down on a tablet that still responds to tabletmanager RPCs. With the caveat that we can't always be 100% certain, if MySQL is down on the tablet, I feel we can infer (perhaps optionally):

  1. The MySQL on this tablet is no longer a semi-sync ack'er, because it's down
  2. The MySQL on this tablet cannot be the most advanced, because it's down (can't query its positions)
    • There are really odd edge cases where this might not be true, but I'd argue it should cover most regular Vitess users

So TL;DR: let's give the EmergencyReparentShard logic the context of which tablets have MySQL down or up, so we don't wait on them fruitlessly and fail the reparent. We'll assume a down mysqld means the tablet cannot be most-advanced

This could be implemented in a few ways:

  1. An in-vttablet "MySQL Monitor" + modify the RPC response for StopReplicationAndGetStatus to include this state
    • Today we get a nil response because the RPC errored, so we'd need to adjust that to still return a response, and perhaps include the error in the response
  2. The ERS code has a function that reliably matches mysqld-down errors, eg: IsMySQLDown(err error) bool
    • This could become a game of whack-a-mole, because we're matching on error strings (I believe)
    • This logic would need to match on dial errors (mysqld is down) and potentially ignore "timeout" errors, because a timeout doesn't mean mysqld is really down

This definitely relies on VTOrc existing in a cluster, to fix tablets with the wrong primary if anything fails here

Your thoughts are appreciated!

Reproduction Steps

  1. Set up a shard with many tablets
  2. kill -9 $(pidof mysqld) on one tablet, but ensure the host/pod + vttablet remains alive
  3. Run an EmergencyReparentShard on the given shard
  4. Notice the ERS fails on the mysql-down error for a single tablet

Binary Version

v19, probably versions above v19

Operating System and Environment details

Linux

Log Fragments

rpc error: code = Unknown desc = TabletManager.StopReplicationAndGetStatus on REDACTED-0171543854: before status failed: net.Dial(/mnt/vitess/mysql/datadir/mysql.sock) to local server failed: dial unix /mnt/vitess/mysql/datadir/mysql.sock: connect: connection refused (errno 2002) (sqlstate HY000)
