@okJiang (Member) commented May 27, 2025

Signed-off-by: okJiang <[email protected]>
okJiang added 6 commits May 27, 2025 17:22

The slow store scheduler is an existing scheduler in PD. Its current function is to detect TiKV disk failures and quickly schedule leaders away from the affected store. Its steps are:

1. Each TiKV collects timed-out I/O requests locally (during each inspect-interval)

Does TiKV collect the timed-out I/O requests for all other TiKV peers?

Member Author

Not in the first delivery version. In the second delivery version, they will be collected.

#### TiKV <-> TiKV health check

When a TiKV node has a network problem, other TiKV nodes also experience request timeouts when accessing it. In this case, the slow score of a normal TiKV may also increase, and if the problem lasts long enough, the score may even reach 100 and trigger the slow store mechanism.
To avoid this problem, each TiKV can send its collected health information, together with the corresponding store_id, to PD. If the `TimeoutRatio` of a node exceeds `RatioMaxThresh`, PD considers that node's network to be faulty. PD then filters out the request timeouts from normal nodes before calculating the `TimeoutRatio`.

Could you explain how PD will filter the slow score elevation on the normal nodes?

Member Author

In the TiKV <-> TiKV health check mechanism, each node calculates a score for every peer it health-checks and reports all of these scores to PD. PD knows which pair of nodes each score comes from, so based on the distribution of all the scores it can distinguish which node has a problem.
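
To make the idea concrete, here is a rough sketch of how pairwise reports could be attributed to a single store. It is not the PD implementation: `PeerReport`, `suspectStores`, `filteredTimeoutRatio`, and `sampleReports` are hypothetical names, the sketch uses raw timeout counts in place of the scores mentioned above, and the threshold value `0.5` is an arbitrary placeholder for `RatioMaxThresh`.

```go
package main

import "fmt"

// PeerReport is a hypothetical record: one store's health-check result
// for one peer during a single inspect interval.
type PeerReport struct {
	From, To uint64 // store IDs of the reporter and of the checked peer
	Timeouts int    // timed-out requests sent from From to To
	Total    int    // all requests sent from From to To
}

// ratioMaxThresh stands in for RatioMaxThresh; 0.5 is an assumed value.
const ratioMaxThresh = 0.5

// suspectStores aggregates the reports by target store and returns the
// stores whose inbound timeout ratio exceeds the threshold.
func suspectStores(reports []PeerReport) map[uint64]bool {
	timeouts := map[uint64]int{}
	totals := map[uint64]int{}
	for _, r := range reports {
		timeouts[r.To] += r.Timeouts
		totals[r.To] += r.Total
	}
	suspects := map[uint64]bool{}
	for id, total := range totals {
		if total > 0 && float64(timeouts[id])/float64(total) > ratioMaxThresh {
			suspects[id] = true
		}
	}
	return suspects
}

// filteredTimeoutRatio computes the timeout ratio of the store `from`
// while ignoring every request it sent to a suspect store.
func filteredTimeoutRatio(reports []PeerReport, from uint64, suspects map[uint64]bool) float64 {
	timeouts, total := 0, 0
	for _, r := range reports {
		if r.From != from || suspects[r.To] {
			continue
		}
		timeouts += r.Timeouts
		total += r.Total
	}
	if total == 0 {
		return 0
	}
	return float64(timeouts) / float64(total)
}

// sampleReports builds a four-store example in which store 4 has a
// network problem: 90 of every 100 requests to or from it time out.
func sampleReports() []PeerReport {
	var reports []PeerReport
	for from := uint64(1); from <= 4; from++ {
		for to := uint64(1); to <= 4; to++ {
			if from == to {
				continue
			}
			timeouts := 0
			if from == 4 || to == 4 {
				timeouts = 90
			}
			reports = append(reports, PeerReport{From: from, To: to, Timeouts: timeouts, Total: 100})
		}
	}
	return reports
}

func main() {
	reports := sampleReports()
	suspects := suspectStores(reports)
	fmt.Println("suspect stores:", suspects)
	fmt.Println("store 1 filtered ratio:", filteredTimeoutRatio(reports, 1, suspects))
	fmt.Println("store 4 filtered ratio:", filteredTimeoutRatio(reports, 4, suspects))
}
```

In this example only store 4 is flagged, because every peer reports timeouts toward it, while stores 1 to 3 each receive timeouts from a single peer. Once requests to store 4 are excluded, the ratio for store 1 drops to zero, whereas store 4 keeps a high ratio because its timeouts involve peers that are not suspects.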
