# Online Unsafe Recovery

- Tracking Issue: https://github.com/tikv/tikv/issues/10483

## Summary

Make a TiKV cluster capable of recovering Raft groups from majority failure online.

## Motivation

Currently, we have a tool to recover Raft groups from majority failure:
```
$ tikv-ctl unsafe-recover remove-fail-stores -s <store_id, store_id> …
$ tikv-ctl recreate-region -r <region_id> ...
```

The first command is used to recover Raft groups of which the majority, but not all, replicas failed. It works by removing the failed stores from the membership configurations of those Raft groups.

The second command is used to recover Raft groups of which all replicas failed. Before running it we need to find those Raft groups, which is not a documented routine, so no user can perform this recovery without our DBAs' help.

Both commands must be run on stopped TiKV instances, which makes the recovery process so long and complex that not all users can perform every step correctly.

There are also corner cases that can lead to inconsistent membership configurations among the replicas. If the Region splits later, this mistake leads to inconsistent Region boundaries, which is very hard to locate and fix.

So we need to re-design and re-implement a new feature to recover a TiKV cluster from majority failure. The feature should be:

- able to be executed online
- able to avoid all inconsistencies
- a feature instead of a tool, focused on Raft majority failure; it shouldn't be used for other situations
- easy to use, requiring at most one or two steps or commands

## Spec

We must accept that there is no silver bullet for Raft majority failure: something is inevitably lost after a TiKV cluster recovers from it. Consider the following integrity and consistency properties:

- Committed writes can be lost, so the linearizability provided by Raft can be broken
- Some prewrites of a given transaction can be lost, so transaction Atomicity and/or Consistency can be broken
- Commits of a given transaction can be lost, so transaction Consistency and/or Durability can be broken
- For a given table, records and/or indices can be lost, so Data Consistency can be broken
- A given transaction can be partially committed, so Data Integrity can be broken

## Interface

The entry point of this feature is on the PD side, triggered by pd-ctl. There are two pd-ctl commands for it:
```
$ pd-ctl unsafe remove-failed-stores <store-id1, store-id2, store-id3...> --timeout=300
$ pd-ctl unsafe remove-failed-stores show
```

The first command registers a task in PD. It can be called again before the previous task finishes, in which case PD just rejects the later task. `--timeout` specifies the maximum execution time for the recovery; if the recovery doesn't finish within the timeout, it fails directly.
The second command shows the progress of the recovery, including all the operations that have been done and the operations currently being executed.
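
The single-task and timeout semantics can be made concrete with a small sketch. The following Go snippet is purely illustrative (the type and method names are hypothetical, not PD's actual API): it rejects a new task while one is still running and marks the task as failed once the deadline passes.

```go
package recovery

import (
	"errors"
	"sync"
	"time"
)

// Controller is a hypothetical stand-in for PD's unsafe-recovery controller.
type Controller struct {
	mu       sync.Mutex
	running  bool
	deadline time.Time
}

var ErrTaskRunning = errors.New("another unsafe recovery task is still running")

// RemoveFailedStores registers a recovery task. Calling it again before the
// previous task finishes is rejected, mirroring the pd-ctl behavior above.
func (c *Controller) RemoveFailedStores(failedStores []uint64, timeout time.Duration) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.running {
		return ErrTaskRunning
	}
	c.running = true
	c.deadline = time.Now().Add(timeout)
	// The actual recovery work is driven later by store heartbeats.
	return nil
}

// checkTimeout fails the task directly once the deadline has passed.
func (c *Controller) checkTimeout() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.running && time.Now().After(c.deadline) {
		c.running = false
		return true
	}
	return false
}
```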

## Detailed design

The changes fall into both PD and TiKV. A recovery controller is introduced in PD to drive the recovery process. Much like other PD schedulers, it first collects region information and then dispatches the corresponding operations to TiKV. Since regions that have lost their leader don't trigger region heartbeats, the peers' information is collected through store heartbeat requests, and the recovery operations are dispatched through store heartbeat responses.

Moreover, given that there is no leader for the regions that have lost a majority of peers, how can the remaining peers of a region execute the recovery operations consistently? To solve that, we introduce force leader, which is elaborated in a later section. For now, all you need to know is that it forces Raft to make one peer the leader even without a quorum. With a force leader, the Raft group can continue to drive proposals to be committed and applied, so we can propose a conf change that demotes the failed voters to learners. After the conf change is applied, a quorum can be formed again. Then the force leader steps down and a new leader is elected through the normal Raft process. At that point, the recovery of the region is finished, and it is entirely online.

Here is the overall process (a sketch of the PD-side loop follows the list):

1. pd-ctl registers a recovery task with the recovery controller in PD
2. The recovery controller asks all TiKVs, through store heartbeat responses, to send their peers' information
3. Each TiKV collects the information of all peers on itself and sends it, as a store report, in the store heartbeat request
4. PD receives the store reports and, once it has received all of them, generates the force leader recovery plan
5. PD dispatches the force leader operations to TiKV through store heartbeat responses
6. TiKV executes the force leader operations and sends the latest store report
7. PD receives the store reports and, once it has received all of them, generates the demote-failed-voters recovery plan
8. TiKV executes the demote-failed-voters recovery operations and sends the latest store report
9. PD receives the store reports and, once it has received all of them, generates the create-empty-region recovery plan
10. TiKV executes the create-empty-region recovery operations and sends the latest store report
11. PD receives the store reports and finishes the recovery task
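
To make the report/plan round trip above more concrete, here is a hedged Go sketch of the heartbeat-driven loop on the PD side. All names are hypothetical and the plan generation itself is elided; it only illustrates the rhythm of collecting one report per alive store and then dispatching the next round of plans.

```go
package recovery

// StoreReport and RecoveryPlan stand in for the protobuf messages defined in
// the Protocol section below; only the fields this sketch needs are included.
type StoreReport struct{ Step uint64 }

type RecoveryPlan struct{ Step uint64 }

type controller struct {
	step        uint64
	aliveStores map[uint64]struct{}      // stores that are not marked as failed
	reports     map[uint64]*StoreReport  // latest report per alive store for the current step
	plans       map[uint64]*RecoveryPlan // plan to piggyback on each store's next heartbeat response
}

// HandleStoreHeartbeat records the report carried by a store heartbeat request
// and returns the plan (if any) to attach to the heartbeat response.
func (c *controller) HandleStoreHeartbeat(storeID uint64, report *StoreReport) *RecoveryPlan {
	if _, alive := c.aliveStores[storeID]; !alive {
		return nil // failed stores take no part in the recovery
	}
	if report != nil && report.Step == c.step {
		c.reports[storeID] = report // stale reports from earlier steps are ignored
	}
	if len(c.reports) == len(c.aliveStores) {
		c.advanceStage() // all reports for this step are in; generate the next round of plans
	}
	return c.plans[storeID]
}

// advanceStage would generate the plans for the next stage (force leader, then
// demote voter, then create region) and bump the step so that the next reports
// from TiKV can be matched against it; the plan generation itself is elided.
func (c *controller) advanceStage() {
	c.step++
	c.reports = make(map[uint64]*StoreReport)
	c.plans = make(map[uint64]*RecoveryPlan) // filled per store by the stage logic
}
```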

### Protocol

Add store report in `StoreHeartbeatRequest`

```protobuf
message StoreHeartbeatRequest {
    ...

    // Detailed store report that is only filled up on PD's demand for online unsafe recovery.
    StoreReport store_report = 3;
}

message PeerReport {
    raft_serverpb.RaftLocalState raft_state = 1;
    raft_serverpb.RegionLocalState region_state = 2;
    bool is_force_leader = 3;
    // The peer has proposed but uncommitted commit merge.
    bool has_commit_merge = 4;
}

message StoreReport {
    repeated PeerReport peer_reports = 1;
    uint64 step = 2;
}
```

Add recovery plan in `StoreHeartbeatResponse`

```protobuf
message StoreHeartbeatResponse {
    ...

    // Operations of recovery. After the plan is executed, TiKV should attach the
    // store report in store heartbeat.
    RecoveryPlan recovery_plan = 5;
}

message DemoteFailedVoters {
    uint64 region_id = 1;
    repeated metapb.Peer failed_voters = 2;
}

message ForceLeader {
    // The store ids of the failed stores, TiKV uses it to decide if a peer is alive.
    repeated uint64 failed_stores = 1;
    // The region ids of the peer which is to be force leader.
    repeated uint64 enter_force_leaders = 2;
}

message RecoveryPlan {
    // Create empty regions to fill the key range hole.
    repeated metapb.Region creates = 1;
    // Update the meta of the regions, including peer lists, epoch and key range.
    repeated metapb.Region updates = 2 [deprecated=true];
    // Tombstone the peers on the store locally.
    repeated uint64 tombstones = 3;
    // Issue conf change that demote voters on failed stores to learners on the regions.
    repeated DemoteFailedVoters demotes = 4;
    // Make the peers to be force leaders.
    ForceLeader force_leader = 5;
    // Step is an increasing number to note the round of recovery,
    // It should be filled in the corresponding store report.
    uint64 step = 6;
}
```
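
As an illustration of how these messages fit together, here is a hedged Go sketch that builds a demote plan for one region. It assumes the Go bindings generated from the messages above (with the usual protobuf-to-Go field naming, as in kvproto's `pdpb` and `metapb` packages) and is not taken from PD's actual code; joint-consensus roles are ignored for brevity.

```go
package recovery

import (
	"github.com/pingcap/kvproto/pkg/metapb"
	"github.com/pingcap/kvproto/pkg/pdpb"
)

// buildDemotePlan creates a RecoveryPlan that demotes the voters living on
// failed stores to learners for a single region. The step must match the
// current recovery round so that TiKV echoes it back in its store report.
func buildDemotePlan(region *metapb.Region, failedStores map[uint64]struct{}, step uint64) *pdpb.RecoveryPlan {
	demote := &pdpb.DemoteFailedVoters{RegionId: region.GetId()}
	for _, peer := range region.GetPeers() {
		_, failed := failedStores[peer.GetStoreId()]
		if failed && peer.GetRole() == metapb.PeerRole_Voter {
			demote.FailedVoters = append(demote.FailedVoters, peer)
		}
	}
	if len(demote.FailedVoters) == 0 {
		return nil // nothing to demote for this region
	}
	return &pdpb.RecoveryPlan{
		Demotes: []*pdpb.DemoteFailedVoters{demote},
		Step:    step,
	}
}
```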

### PD Side

The PD recovery controller is organized as a state machine with multiple stages, and the recovery plan is generated based on the current stage. Due to unfinished splits and merges, or retries caused by network issues or TiKV crashes, the stages may change back and forth multiple times. The stage transitions look like this:

```
              +-----------+
              |           |
              |   idle    |
              |           |
              +-----------+
                    |
                    |
                    v
              +-----------+     +-----------+        +-----------+           +-----------+
              |           |     |   force   |        |           |           |           |
              |  collect  |---->| LeaderFor |------->|   force   |           |  failed   |
              |  Report   |     |CommitMerge|  +-----|  Leader   |-----+---->|           |
              +-----------+     +-----------+  |     +-----------+     |     +-----------+
                    |                          |         |   ^         |
                    |                          |         |   |         |
                    |                          |         v   |         |
                    |                          |     +-----------+     |
                    |                          |     |           |     |
                    |                          |     |  demote   |     |
                    |                          +-----|   Voter   |-----+
                    |                          |     +-----------+     |
                    |                          |         |   ^         |
                    |                          |         |   |         |
                    |                          |         v   |         |
                    |                          |     +-----------+     |
+-----------+       |                          |     |           |     |
|           |       |                          |     |  create   |     |
| finished  |       |                          |     |  Region   |-----+
|           |<------+--------------------------+-----|           |
+-----------+                                         +-----------+
```

The stages are:
- idle: the initial stage, meaning there is no recovery task to do
- collectReport: collect the store reports from all the stores to get a picture of the unhealthy regions
- forceLeader: force leaders on the unhealthy regions that have lost the majority of their peers
- forceLeaderForCommitMerge: same as the forceLeader stage, but it only forces leaders on the unhealthy regions that have an unfinished commit merge, to make sure the target region is forced to be leader ahead of the source region so that the target region can catch up the log for the source region successfully
- demoteVoter: propose conf changes to demote the failed voters to learners for the regions that have a force leader
- createRegion: recreate empty regions to fill the key range hole for the case that all peers of a region are lost
- failed: the recovery task has aborted with an error
- finished: the recovery task has been executed successfully

The recovery process starts from the collectReport stage. Apart from the idle, finished and failed stages, each stage first collects store reports and then tries to generate a plan for each kind of operation, in order: force leader first, then demote voter, and finally create region. If there is a plan, the controller transitions into the corresponding stage and dispatches the plan to TiKV. If there is no plan to execute, the recovery process transitions into the failed or finished stage. A sketch of this priority order follows.
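
The priority order described above can be sketched in Go as follows; the stage names mirror the diagram, while everything else (in particular how each flag is computed from the store reports) is hypothetical and elided.

```go
package recovery

type stage int

const (
	idle stage = iota
	collectReport
	forceLeaderForCommitMerge
	forceLeader
	demoteVoter
	createRegion
	failed
	finished
)

// planSummary says which kinds of operations the freshly collected store
// reports still call for; how each flag is derived is elided here.
type planSummary struct {
	needForceLeaderForCommitMerge bool
	needForceLeader               bool
	needDemoteVoter               bool
	needCreateRegion              bool
	unrecoverable                 bool // e.g. the task has timed out
}

// nextStage picks the next stage once all store reports of the current step
// are in, trying the plan kinds in priority order: force leader (commit-merge
// regions first), then demote voter, then create region.
func nextStage(p planSummary) stage {
	switch {
	case p.needForceLeaderForCommitMerge:
		return forceLeaderForCommitMerge
	case p.needForceLeader:
		return forceLeader
	case p.needDemoteVoter:
		return demoteVoter
	case p.needCreateRegion:
		return createRegion
	case p.unrecoverable:
		return failed
	default:
		return finished // nothing left to do
	}
}
```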

#### Pause conf-change, split and merge

Note that the healthy regions are still providing service, so there may be splits, merges and conf-changes going on, which make region ranges and epochs change from time to time. Since the store reports are collected from different stores at different points in time, they don't form a globally consistent snapshot of any single point in time. To solve that, during the recovery all schedulers and checkers in PD are paused, and `AskBatchSplitRequest` is rejected to pause region splits as well.

After the stage turns into finished or failed, all schedulers and checkers resume, and `AskBatchSplitRequest` is no longer rejected.
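
A minimal sketch of this guard, with hypothetical names: while a recovery task is running, split requests are rejected (pausing and resuming the schedulers and checkers would live in the same place).

```go
package recovery

import (
	"errors"
	"sync/atomic"
)

// recoveryRunning is flipped when a recovery task starts and when it reaches
// the finished or failed stage.
var recoveryRunning atomic.Bool

var errRecoveryInProgress = errors.New("online unsafe recovery is in progress, split is rejected")

// checkSplitAllowed is a hypothetical hook consulted before serving an
// AskBatchSplitRequest; rejecting splits keeps region boundaries stable so the
// store reports collected from different stores stay mutually consistent.
func checkSplitAllowed() error {
	if recoveryRunning.Load() {
		return errRecoveryInProgress
	}
	return nil
}
```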

### TiKV Side

When TiKV receives a recovery plan from PD, it dispatches the related operations to the corresponding peer fsms and broadcasts a message to all peers telling them to send a report after they have applied up to the latest index.

Here are the operations:
- WaitApply: to get the latest peer information, wait until the apply index reaches at least the commit index
- FillOutReport: get the region local state and raft local state of the peer
- Demote: propose the conf change that demotes the failed voters to learners, after which the region is able to elect a leader normally. If the region is already in a joint consensus state, exit the joint state first. After the demotion, exit the force leader state.
- Create: create a peer of a new region on the store locally
- Tombstone: tombstone the local peer on the store

There are three possible workflows:
- report phase (the recovery plan is empty)
    1. WaitApply
    2. FillOutReport
- force leader phase (the recovery plan only has force leader)
    1. EnterForceLeader
    2. WaitApply
    3. FillOutReport
- plan execution phase (the recovery plan has demotes/creates/tombstones)
    1. Demotes/Creates/Tombstones
    2. ExitForceLeader
    3. WaitApply
    4. FillOutReport

As you can see, in every case TiKV performs the recovery operations (if any), waits for apply, and then fills out and sends the store report. More details about the force leader follow.

#### Force leader

The peer to be force leader is chosen by the PD recovery controller in the same way the Raft algorithm chooses: the one whose log is most up to date, i.e. with the largest last_term, and the largest last_index as the tie-breaker.
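
The "most up-to-date log wins" rule is the same comparison Raft uses in its election restriction. A small illustrative Go sketch (hypothetical types, not PD's code):

```go
package recovery

// peerLog is the last log position of a surviving peer, taken from the
// RaftLocalState in its store report.
type peerLog struct {
	peerID    uint64
	lastTerm  uint64
	lastIndex uint64
}

// pickForceLeader returns the peer whose log is most up to date: a larger
// last term wins, and a larger last index breaks ties, just as in Raft.
func pickForceLeader(peers []peerLog) (uint64, bool) {
	if len(peers) == 0 {
		return 0, false
	}
	best := peers[0]
	for _, p := range peers[1:] {
		if p.lastTerm > best.lastTerm ||
			(p.lastTerm == best.lastTerm && p.lastIndex > best.lastIndex) {
			best = p
		}
	}
	return best.peerID, true
}
```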

Force leader is an in-memory state on the leader peer; it is not replicated through Raft to the other peers. In the force leader state, the peer rejects read and write requests and only accepts conf-change admin commands.

The process of becoming force leader is (a sketch of the changed commit rule follows the list):
1. Wait some ticks until the election timeout is triggered, to make sure the original leader's lease has expired
2. Run a pre-force-leader check that sends request-vote messages to the rest of the alive voters and expects all of the received responses to be vote grants
3. Enter the force leader state, and forcibly forward the commit index once logs are replicated to all of the remaining alive voters
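
Step 3 changes the commit rule: instead of a quorum, an entry is committed only once every remaining alive voter has replicated it. A hedged Go sketch of that rule (illustrative only; TiKV's actual implementation lives in its Raft code):

```go
package recovery

import "math"

// forcedCommitIndex returns the commit index a force leader may advance to:
// the highest index replicated to every remaining alive voter (the force
// leader itself included), rather than to a quorum as in normal Raft.
func forcedCommitIndex(matched map[uint64]uint64, aliveVoters []uint64) uint64 {
	if len(aliveVoters) == 0 {
		return 0
	}
	commit := uint64(math.MaxUint64)
	for _, id := range aliveVoters {
		idx, ok := matched[id]
		if !ok {
			return 0 // an alive voter has not reported any progress yet
		}
		if idx < commit {
			commit = idx
		}
	}
	return commit
}
```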

Note:
- The peer with the most up-to-date log may be a learner, so a learner can also become the force leader.
- After a peer becomes force leader, the commit index is advanced, and some admin commands, e.g. split and merge, may be applied after that. A newly split peer doesn't inherit the force leader state. If the newly split region can't elect a leader either, it will be dispatched a force leader operation in the next stage of the recovery process.
