RFC: Enhance Slow Store Scheduler to resolve network jitter problem #119
Signed-off-by: okJiang <[email protected]>
Slow store scheduler is an existing scheduler in PD. Its current function is to detect whether a TiKV disk has failed and quickly schedule the leaders away. Here are its steps:

1. Each TiKV collects the timed-out I/O requests locally (during each inspect-interval)
Does TiKV collect the timed-out I/O requests for all other TiKV peers?
They are not collected in the first delivery version; they will be collected in the second delivery version.
#### TiKV <-> TiKV health check

When a TiKV has a network problem, other TiKVs will also experience request timeouts when accessing the problematic TiKV. In this case, the slow score of a normal TiKV may also increase; if the situation lasts long enough, the score may even reach 100, triggering the slow store mechanism.

To avoid this problem, we can send the collected health information and its corresponding store_id to PD. If the `TimeoutRatio` of a node exceeds `RatioMaxThresh`, the node's network is considered problematic, and PD will filter those request timeouts out of the normal nodes' statistics before calculating their `TimeoutRatio`.
Could you explain how PD will filter the slow score elevation on the normal nodes?
In the TiKV <-> TiKV health check mechanism, each node calculates a score for every health check target and feeds all the scores back to PD. PD knows which pair of nodes each score comes from, so based on the distribution of all the scores, PD is able to distinguish which node has a problem.
ref tikv/pd#9359