RFC: Enhance Slow Store Scheduler to resolve network jitter problem #119
Signed-off-by: okJiang <[email protected]>
Slow store scheduler is an existing scheduler in PD. Its current function is to detect whether a TiKV disk has failed and quickly schedule the leaders away. Here are its steps:

1. Each TiKV collects the timed-out I/O requests locally (during each inspect-interval)
Does TiKV collect the timed-out I/O requests for all other TiKV peers?
They are not collected in the first delivery version; they will be collected in the second delivery version.
#### TiKV <-> TiKV health check

When a TiKV has a network problem, other TiKVs will also experience request timeouts when accessing the problematic TiKV. In this case, the slow score of a normal TiKV may also increase; if the situation lasts long enough, the score may even reach 100, triggering the slow store mechanism.

To avoid this problem, we can send the collected health information and its corresponding store_id to PD. If the `TimeoutRatio` of a node exceeds `RatioMaxThresh`, the node's network is considered problematic, and PD will filter those request timeouts out of the normal nodes' statistics before calculating their `TimeoutRatio`.
Could you explain how PD will filter the slow score elevation on the normal nodes?
In the TiKV <-> TiKV health check mechanism, each node calculates a score for every health check target and feeds all the scores back to PD. PD knows which pair of nodes each score comes from, so based on the distribution of all the scores, PD is able to distinguish which node has a problem.
ref tikv/pd#9359