Observing high read latencies (300ms+) and "took too long" warnings despite good P99 fsync latency #20892
mensylisir
started this conversation in General
Replies: 1 comment
Please note you are running a 5-year-old etcd release. That limits the support you can get from the community, as most people will not remember which issues were reported and fixed. I recommend upgrading to see if that helps.
Environment
Observed Behavior
We are operating a large-scale Kubernetes cluster and are observing persistent performance issues related to etcd. The primary symptom is a high volume of "took too long" warnings in the etcd logs for read-only range requests.
Phenomenon 1: High Read Latency in etcd Logs
The etcd leader's log shows thousands of warnings per hour for simple read requests, with latencies frequently exceeding 300ms.
A count of these warnings over a 60-minute period shows over 2,300 occurrences.
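For reference, a minimal Go sketch of how such a count can be produced from the leader's log: it tallies lines containing "took too long" from stdin and buckets the reported execution times. The duration extraction is an assumption about the message layout, which varies between etcd versions and log encoders, so treat it as a starting point rather than the exact format.

```go
// countwarn.go: count etcd "took too long" warnings read from stdin and
// bucket the reported execution times. The duration extraction below is an
// assumption about the message layout (the duration appears shortly after
// the phrase); adjust it to the exact log format of your etcd version.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strings"
	"time"
)

func main() {
	durRe := regexp.MustCompile(`\d+(?:\.\d+)?(?:µs|us|ms|s)`)

	buckets := map[string]int{}
	total := 0

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024) // tolerate long log lines
	for sc.Scan() {
		line := sc.Text()
		idx := strings.Index(line, "took too long")
		if idx < 0 {
			continue
		}
		total++
		// Take the first duration token after the phrase, e.g. "(377.1ms)".
		m := durRe.FindString(line[idx:])
		if m == "" {
			continue
		}
		d, err := time.ParseDuration(m)
		if err != nil {
			continue
		}
		switch {
		case d < 300*time.Millisecond:
			buckets["<300ms"]++
		case d < time.Second:
			buckets["300ms-1s"]++
		default:
			buckets[">=1s"]++
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}

	fmt.Printf("total \"took too long\" warnings: %d\n", total)
	for _, b := range []string{"<300ms", "300ms-1s", ">=1s"} {
		fmt.Printf("  %-9s %d\n", b, buckets[b])
	}
}
```

Running it over a 60-minute slice of the leader's log (e.g. `go run countwarn.go < leader.log`, file name illustrative) gives both the total and a rough latency distribution.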
Phenomenon 2: Volatile Storage Performance in fio Benchmarks
To investigate the underlying storage, we ran fio to measure fsync latency. We observed highly variable performance between test runs.
FIO Test Run A:
This run shows a low P99 latency but a significantly higher P99.99 and max latency.
The observed IOPS on a single machine fluctuate, with test results ranging from 400+ to 700+.
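To cross-check fio's numbers, a small Go probe along the same lines (sequential small writes, each followed by a sync, then percentile reporting) is sketched below. The block size, iteration count, and file name are assumptions chosen to roughly mirror the commonly cited fio disk check for etcd, and os.File.Sync issues a full fsync rather than the fdatasync used by fio's --fdatasync=1, so the numbers are only loosely comparable.

```go
// fsynclat.go: rough write+fsync latency probe in the spirit of the fio
// check commonly used for etcd disks. The 2300-byte block size, iteration
// count, and file name are illustrative assumptions, not a canonical
// benchmark.
package main

import (
	"fmt"
	"math"
	"os"
	"sort"
	"time"
)

func main() {
	const (
		blockSize = 2300  // bytes per write (assumed; roughly a small etcd WAL entry)
		writes    = 10000 // number of write+sync iterations
	)

	path := "fsync-probe.dat" // run this on the same filesystem as the etcd data dir
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o644)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer os.Remove(path)
	defer f.Close()

	buf := make([]byte, blockSize)
	lat := make([]time.Duration, 0, writes)

	for i := 0; i < writes; i++ {
		start := time.Now()
		if _, err := f.Write(buf); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		if err := f.Sync(); err != nil { // full fsync on each write
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		lat = append(lat, time.Since(start))
	}

	// Nearest-rank percentiles over the sorted latencies.
	sort.Slice(lat, func(i, j int) bool { return lat[i] < lat[j] })
	pct := func(p float64) time.Duration {
		idx := int(math.Ceil(p*float64(len(lat)))) - 1
		if idx < 0 {
			idx = 0
		}
		return lat[idx]
	}
	fmt.Printf("p50=%v p99=%v p99.99=%v max=%v\n",
		pct(0.50), pct(0.99), pct(0.9999), lat[len(lat)-1])
}
```

Running this on the same filesystem as the etcd data directory, ideally while the cluster is under its normal load, should show whether the P99.99/max spikes reproduce outside of fio.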
Phenomenon 3: Standard Kubernetes API Workload
To understand the load on etcd, we analyzed the Kubernetes API audit logs. The analysis indicates that the write traffic is primarily generated by core Kubernetes system components performing routine operations, such as managing leases and endpoint slices.
Top Write Operations by User Agent:
Top Write Operations by Resource Type:
The workload appears to be consistent with the expected behavior of a large Kubernetes cluster, without any obvious application-level traffic storms.
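For context, the audit-log aggregation behind the two breakdowns above can be approximated with a short Go sketch that reads audit.k8s.io/v1 events (one JSON object per line) and tallies write verbs by userAgent and by objectRef.resource. The choice of which verbs count as writes and the ResponseComplete stage filter are assumptions here rather than a canonical definition.

```go
// auditwrites.go: tally Kubernetes write operations from an audit log
// (one audit.k8s.io/v1 Event per line) by user agent and by resource.
// The set of "write" verbs and the ResponseComplete stage filter are
// assumptions about what counts as a write operation.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"sort"
)

type auditEvent struct {
	Stage     string `json:"stage"`
	Verb      string `json:"verb"`
	UserAgent string `json:"userAgent"`
	ObjectRef *struct {
		Resource string `json:"resource"`
	} `json:"objectRef"`
}

func main() {
	writeVerbs := map[string]bool{"create": true, "update": true, "patch": true, "delete": true}

	byAgent := map[string]int{}
	byResource := map[string]int{}

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 4*1024*1024) // audit events can be large
	for sc.Scan() {
		var ev auditEvent
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
			continue // skip malformed lines
		}
		if ev.Stage != "ResponseComplete" || !writeVerbs[ev.Verb] {
			continue
		}
		byAgent[ev.UserAgent]++
		if ev.ObjectRef != nil {
			byResource[ev.ObjectRef.Resource]++
		}
	}

	printTop("Top write operations by user agent", byAgent)
	printTop("Top write operations by resource type", byResource)
}

func printTop(title string, counts map[string]int) {
	type kv struct {
		k string
		v int
	}
	rows := make([]kv, 0, len(counts))
	for k, v := range counts {
		rows = append(rows, kv{k, v})
	}
	sort.Slice(rows, func(i, j int) bool { return rows[i].v > rows[j].v })
	fmt.Println(title + ":")
	for i, r := range rows {
		if i >= 10 {
			break
		}
		fmt.Printf("  %8d  %s\n", r.v, r.k)
	}
}
```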
Request for Discussion
We are trying to understand the relationship between these observed phenomena. Specifically, we are seeking insights from the community on the following points:
We would appreciate any thoughts or similar experiences the community could share to help us better understand this behavior. Thank you.