
Conversation

@elias-dbx

A large number of client leases can cause cascading failures within the etcd cluster. Currently, when the keepalive stream hits an error, we always wait 500ms and then try to recreate the stream with LeaseKeepAlive(). Since there is no backoff or jitter, if the lease streams originally broke because the servers were overloaded, the retries put even more load on the servers and can cause a cascading failure.

We can back off with jitter -- similar to what is done for watch streams -- to alleviate server load when leases are the source of the overload.

Related to: #20717
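
For illustration, here is a minimal sketch of what stream recreation with exponential backoff and jitter could look like. The package layout, names (retryMinBackoff, retryMaxBackoff, retryWithBackoff, recreate), and values are placeholders for this explanation, not the code added in this PR.

```go
// Sketch only: exponential backoff with jitter for recreating a broken
// keepalive stream. Names and values are illustrative assumptions.
package keepalive

import (
	"context"
	"math/rand"
	"time"
)

const (
	retryMinBackoff = 500 * time.Millisecond // assumed starting backoff
	retryMaxBackoff = 20 * time.Second       // assumed cap on the backoff
)

// retryWithBackoff calls recreate until it succeeds or ctx is cancelled,
// doubling the wait after each failure and adding random jitter so that
// many clients do not retry in lockstep.
func retryWithBackoff(ctx context.Context, recreate func(context.Context) error) error {
	backoff := retryMinBackoff
	for {
		if err := recreate(ctx); err == nil {
			return nil
		}
		// Sleep for the current backoff plus up to 50% jitter.
		wait := backoff + time.Duration(rand.Int63n(int64(backoff/2)))
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
		}
		// Grow exponentially, but never beyond retryMaxBackoff.
		backoff *= 2
		if backoff > retryMaxBackoff {
			backoff = retryMaxBackoff
		}
	}
}
```

Capping the backoff keeps recovery latency bounded once the servers are healthy again, while the jitter spreads retries from many clients over time.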

@k8s-ci-robot

Hi @elias-dbx. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


```go
// retryConnWait is how long to wait before retrying request due to an error
retryConnWait = 500 * time.Millisecond
// retryConnMinBackoff is the starting backoff when retrying a request due to an error
```
Member

How were these values chosen?

Author

They were chosen to be in line with the default exponential backoff parameters of other widely used client-side libraries. For example, the aws-sdk-go-v2 library has a default max backoff of 20 seconds: https://github.com/aws/aws-sdk-go-v2/blob/main/aws/retry/standard.go#L31

I kept the initial backoff at 500ms as that is the current backoff time.
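
For concreteness, a constant block consistent with that reasoning might look like the sketch below; retryConnMinBackoff keeps the existing 500ms as the floor, while the retryConnMaxBackoff name and its 20-second value are assumptions borrowed from the aws-sdk-go-v2 comparison, not necessarily what this PR merges.

```go
import "time"

const (
	// retryConnMinBackoff keeps the existing 500ms wait as the starting backoff.
	retryConnMinBackoff = 500 * time.Millisecond
	// retryConnMaxBackoff caps exponential growth; the 20s value mirrors
	// aws-sdk-go-v2's default maximum backoff and is an assumption here.
	retryConnMaxBackoff = 20 * time.Second
)
```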

@ronaldngounou
Member

/ok-to-test

@codecov

codecov bot commented Oct 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.28%. Comparing base (8a4955b) to head (84e84de).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
| Files with missing lines | Coverage Δ |
|---|---|
| client/v3/lease.go | 91.06% <100.00%> (+0.12%) ⬆️ |
| client/v3/utils.go | 100.00% <100.00%> (ø) |

... and 18 files with indirect coverage changes

```diff
@@            Coverage Diff             @@
##             main   #20718      +/-   ##
==========================================
+ Coverage   69.12%   69.28%   +0.16%
==========================================
  Files         422      422
  Lines       34826    34838      +12
==========================================
+ Hits        24073    24138      +65
+ Misses       9352     9307      -45
+ Partials     1401     1393       -8
```


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elias-dbx, ronaldngounou
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@elias-dbx
Author

/retest-required

