Fixing Peak Sample Count Per Series #604
Conversation
Signed-off-by: Naman-B-Parlecha <[email protected]>
@yeya24 @MichaHoffmann PTAL!
I think it is better to do #573 (comment).
Sure, I'll refactor!
Thanks for the work on this! Before we finalize the implementation, I'd like to clarify what we mean by peak samples in the context of the Thanos engine. In the Prometheus engine, peak samples is defined as the highest number of samples held in memory at any moment during query execution, across all operators. This is used to enforce the --query.max-samples limit and prevent memory exhaustion.

A few open questions for alignment: having a shared definition with Prometheus helps make this limit consistent and easier to reason about. If we're diverging, it would be good to document how and why. Curious to hear thoughts.
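To make the Prometheus-style definition concrete, here is a minimal sketch of a high-water-mark sample tracker. The type and method names are invented for illustration and are not the actual Prometheus engine API; the point is only that "peak" is the maximum of the *currently held* sample count, which both rises and falls during execution, and that the same counter is what a --query.max-samples-style limit would check.

```go
package main

import "fmt"

// limitingSampleTracker is a hypothetical illustration (not Prometheus
// source code): it tracks samples currently held in memory, remembers
// the high-water mark, and fails once a configured limit is exceeded.
type limitingSampleTracker struct {
	current    int // samples currently in memory
	peak       int // highest value current has ever reached
	maxSamples int // configured limit; 0 means unlimited
}

// Add accounts for n newly buffered samples and enforces the limit.
func (t *limitingSampleTracker) Add(n int) error {
	t.current += n
	if t.current > t.peak {
		t.peak = t.current
	}
	if t.maxSamples > 0 && t.current > t.maxSamples {
		return fmt.Errorf("query would load too many samples into memory")
	}
	return nil
}

// Release accounts for n samples being freed after a step completes.
func (t *limitingSampleTracker) Release(n int) { t.current -= n }

func main() {
	t := &limitingSampleTracker{maxSamples: 1000}
	t.Add(400)
	t.Release(400) // step buffers freed before the next step
	t.Add(700)
	// Peak is the high-water mark (700), not the total processed (1100).
	fmt.Println(t.peak)
}
```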
Hey @harry671003, thanks for the clarification. I believe @yeya24 is more qualified to answer this, as he has worked closely with the engine.
I thought a little bit more about this. Let's define peak samples first: peak samples for an operator is the total number of samples the operator processed in a single Next() call. One way to implement this is to use the telemetry operator. Wdyt @yeya24 @Naman-B-Parlecha?
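A rough sketch of what the telemetry-operator approach could look like. The interface and types below are simplified stand-ins, not the actual Thanos engine API: a decorator counts the samples returned by each Next() call of the wrapped operator and keeps the largest batch as that operator's peak.

```go
package main

import "fmt"

// operator is a simplified stand-in for an engine operator: each Next()
// returns one batch of samples, or nil when the operator is exhausted.
type operator interface {
	Next() []float64
}

// telemetryOperator decorates an inner operator with sample accounting.
type telemetryOperator struct {
	inner       operator
	peakSamples int
}

// Next forwards to the inner operator and records the largest batch seen.
func (t *telemetryOperator) Next() []float64 {
	out := t.inner.Next()
	if len(out) > t.peakSamples {
		t.peakSamples = len(out) // peak = samples in the biggest Next() call
	}
	return out
}

// sliceOperator replays fixed batches, standing in for a real selector.
type sliceOperator struct{ batches [][]float64 }

func (s *sliceOperator) Next() []float64 {
	if len(s.batches) == 0 {
		return nil
	}
	out := s.batches[0]
	s.batches = s.batches[1:]
	return out
}

func main() {
	op := &telemetryOperator{inner: &sliceOperator{
		batches: [][]float64{make([]float64, 3), make([]float64, 7), make([]float64, 5)},
	}}
	for out := op.Next(); out != nil; out = op.Next() {
	}
	fmt.Println(op.peakSamples) // the largest single batch had 7 samples
}
```

The nice property of this shape is that the accounting lives entirely in the wrapper, so the inner operators need no changes.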
Great, I was looking into how Prometheus updates its peak.
Yes, what you're doing is fundamentally the same. This is just a cleaner way of doing it.
Thanks for the clarification 🙌
Use the same PR. |
harry671003
left a comment
Great work! Just some comments.
| | | | | |---[concurrent(buff=2)]: max_series: 1 total_samples: 0 peak_samples: 0
| | | | | | |---[matrixSelector] rate({[__name__="http_requests_total"]}[10m0s] 0 mod 2): max_series: 1 total_samples: 1010 peak_samples: 200
| | | | | |---[concurrent(buff=2)]: max_series: 1 total_samples: 0 peak_samples: 0
| | | | | | |---[matrixSelector] rate({[__name__="http_requests_total"]}[10m0s] 1 mod 2): max_series: 1 total_samples: 1010 peak_samples: 200
Is the peak samples correct here?
For the selector {[__name__="http_requests_total"]}[10m0s]
Conditions:
- Steps batch is 10.
- Selector shards = 2.

Calculation:
- Each Next() call has 10 steps.
- Each step has 1 series (due to sharding).
- For each step:
  - Each series has a sample every 30 seconds.
  - In a [10m] range, there will be 10*60s/30s = 20 samples.
- Total samples in each Next() call = 10 steps * 1 series * 20 samples = 200.
Looks correct.
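The reviewer's arithmetic above can be reproduced in a few lines; the variable names are just labels for the quantities in the bullet list:

```go
package main

import "fmt"

func main() {
	// One Next() call of the sharded selector
	// {__name__="http_requests_total"}[10m], per the review comment.
	steps := 10          // steps batch
	seriesPerStep := 1   // 2 shards split the series, 1 per shard
	rangeSeconds := 600  // [10m] range
	scrapeInterval := 30 // one sample every 30 seconds

	samplesPerSeries := rangeSeconds / scrapeInterval // 20 samples
	total := steps * seriesPerStep * samplesPerSeries

	// Matches peak_samples: 200 in the analyze output above.
	fmt.Println(total)
}
```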
harry671003
left a comment
LGTM! Great work. Thank you.
Thanks for helping me out @harry671003 @yeya24 🙌
yeya24
left a comment
Thanks @Naman-B-Parlecha and @harry671003! I like how this is implemented now and it seems clean.
Let's fix lint. Unit tests also failed, so it may need another look.
@yeya24 PTAL
Resolve: #573
Currently, the peak sample count is incorrectly reported as 1 because currStepSamples is reset to 0 at every step iteration, preventing proper accumulation of peak samples across series. To fix this, we can maintain a vector that tracks the sample count per step across all series and add up the samples at each step. Once all samples per step across all series have been processed, we can update the telemetry for all series.
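A condensed sketch of the fix described above, with hypothetical names rather than the actual PR code: instead of resetting a counter for each series, keep one accumulator per step, add every series' samples into it, and compute the peak from the per-step totals once all series are processed.

```go
package main

import "fmt"

// peakSamples illustrates the per-step accumulation described in the PR
// description (names are hypothetical). samplesPerStep[i][j] is the
// number of samples series i contributes at step j.
func peakSamples(samplesPerStep [][]int, numSteps int) int {
	// One accumulator per step, shared across all series -- this is the
	// "vector" from the description. Resetting it per series was the bug.
	stepTotals := make([]int, numSteps)
	for _, series := range samplesPerStep {
		for step, n := range series {
			stepTotals[step] += n
		}
	}
	// Only after all series are folded in do we take the maximum.
	peak := 0
	for _, total := range stepTotals {
		if total > peak {
			peak = total
		}
	}
	return peak
}

func main() {
	// Two series over three steps: step totals are 3, 5, 4 -> peak 5.
	// Per-series tracking would instead have reported at most 3.
	fmt.Println(peakSamples([][]int{{1, 2, 3}, {2, 3, 1}}, 3))
}
```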