stats: optimize build topn and histogram #63285
|
Hi @Tristan1900. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Codecov Report
@@ Coverage Diff @@
## master #63285 +/- ##
================================================
+ Coverage 72.7810% 73.4747% +0.6936%
================================================
Files 1835 1870 +35
Lines 496885 509348 +12463
================================================
+ Hits 361638 374242 +12604
+ Misses 113284 112317 -967
- Partials 21963 22789 +826
|
|
Some preliminary results comparing the previous implementation with the improved one: roughly 19x faster, with about 50% of the memory saved. |
Force-pushed: 69b3885 to 175670c
|
Ran the benchmark again once the PR was ready and confirmed the performance gain. |
0xPoe
left a comment
Great Work!
Thank you!
pkg/util/generic/bounded_min_heap.go
Outdated
// is better than the worst item, it replaces the worst item.
func (h *BoundedMinHeap[T]) Add(item T) {
	// handle zero capacity case
	if h.maxSize <= 0 {
Should we panic early if maxSize is less than or equal to 0?
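To make the question concrete, here is a minimal sketch of validating the capacity at construction time rather than silently ignoring items in Add. The names boundedHeap and newBoundedHeap are hypothetical stand-ins and are not taken from the PR; whether a panic or a returned error is preferable is a design choice for the authors.

```go
package main

import "fmt"

// boundedHeap is a hypothetical stand-in for the PR's BoundedMinHeap;
// only the capacity check is sketched here.
type boundedHeap[T any] struct {
	maxSize int
	items   []T
}

// newBoundedHeap panics on a non-positive capacity so that misuse is
// caught when the heap is created instead of being ignored later in Add.
func newBoundedHeap[T any](maxSize int) *boundedHeap[T] {
	if maxSize <= 0 {
		panic(fmt.Sprintf("boundedHeap: maxSize must be positive, got %d", maxSize))
	}
	return &boundedHeap[T]{maxSize: maxSize, items: make([]T, 0, maxSize)}
}

func main() {
	h := newBoundedHeap[int](3)
	fmt.Println(h.maxSize, cap(h.items))
}
```

With this shape, Add no longer needs a zero-capacity branch, at the cost of requiring callers to guarantee a positive size.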
pkg/statistics/builder.go
Outdated
const (
	// defaultTopNValue is the default value for numTopN parameter
	defaultTopNValue = 100
Thank you for extracting them as constants. However, I believe we placed these magic numbers somewhere else. Could you please try to reuse them?
// TopN reduced from 500 to 100 due to concerns over large number of TopN values collected for customers with many tables.
// 100 is more inline with other databases. 100-256 is also common for NumBuckets with other databases.
var analyzeOptionDefaultV2 = map[ast.AnalyzeOptionType]uint64{
ast.AnalyzeOptNumBuckets: 256,
ast.AnalyzeOptNumTopN: 100,
ast.AnalyzeOptCMSketchWidth: 2048,
ast.AnalyzeOptCMSketchDepth: 5,
ast.AnalyzeOptNumSamples: 0,
ast.AnalyzeOptSampleRate: math.Float64bits(-1),
}
In a separate PR, we should make them variables so that customers can adjust them if needed.
Created an issue to track this: #63432
pkg/statistics/builder.go
Outdated
for i := int64(1); i < sampleNum; i++ {
processedCount := int64(1) // we've processed the first sample

// Note: Start from firstSampleIdx+1 because we have already processed the first non-skipped sample.
- // Note: Start from firstSampleIdx+1 because we have already processed the first non-skipped sample.
+ // Start from firstSampleIdx + 1 since the first non-skipped sample has already been processed when the range checker is not null.
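For readers unfamiliar with the pattern the comment describes, the following is a rough sketch of a checker that skips index ranges while samples are visited in increasing order. The type seqRangeChecker, the half-open range layout, and the helper firstKept are all assumptions made for illustration; the PR's SequentialRangeChecker may be structured differently.

```go
package main

import "fmt"

// indexRange is a half-open interval [start, end) of sample indices.
type indexRange struct{ start, end int64 }

// seqRangeChecker answers "is idx inside any range?" for queries that
// arrive in non-decreasing order. ranges must be sorted and non-overlapping.
type seqRangeChecker struct {
	ranges []indexRange
	cur    int // first range that may still contain a future index
}

func (c *seqRangeChecker) contains(idx int64) bool {
	// Ranges ending at or before idx can never match a later query.
	for c.cur < len(c.ranges) && c.ranges[c.cur].end <= idx {
		c.cur++
	}
	return c.cur < len(c.ranges) && c.ranges[c.cur].start <= idx
}

// firstKept returns the first index in [0, n) not covered by a range, or -1.
func firstKept(n int64, c *seqRangeChecker) int64 {
	for i := int64(0); i < n; i++ {
		if !c.contains(i) {
			return i
		}
	}
	return -1
}

func main() {
	c := &seqRangeChecker{ranges: []indexRange{{0, 2}, {4, 5}}}
	first := firstKept(6, c)
	kept := []int64{first}
	// Start from first+1: the first non-skipped sample is already handled.
	for i := first + 1; i < 6; i++ {
		if !c.contains(i) {
			kept = append(kept, i)
		}
	}
	fmt.Println(kept) // [2 3 5]
}
```

Because the cursor only moves forward, the whole pass costs time linear in the number of samples plus the number of ranges.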
	rangeChecker *SequentialRangeChecker, // optional range checker for skipping TopN indices
) (corrXYSum float64, err error) {
	sampleNum := int64(len(samples))
	// As we use samples to build the histogram, the bucket number and repeat should multiply a factor.
I guess we don't need to delete this comment?
It seems this comment hasn’t been restored.
pkg/statistics/builder.go
Outdated
// If a numTopn value other than default is passed in, we assume it's a value that the user wants us to honor
allowPruning := true
- if numTopN != 100 {
+ if numTopN != defaultTopNValue {
👍🏼 Love this const, I hate magic numbers!
|
/retest |
|
@Tristan1900: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message. |
|
/retest |
|
@Tristan1900: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message. |
|
/hold
One test fails for a legitimate reason. |
|
/retest
It is actually a flaky test that predates this PR. |
|
@Tristan1900: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message. |
|
/ok-to-test |
Signed-off-by: Wenqi Mou <[email protected]>
|
/retest |
0xPoe
left a comment
Thanks!
/hold
For the comment issue. Feel free to unhold it after you address it.
|
Add comment to clarify sample factor calculation.
|
/unhold |
|
/retest |
|
/retest |
pkg/util/generic/bounded_min_heap.go
Outdated
sort.Slice(result, func(i, j int) bool {
	return h.cmpFunc(result[i], result[j]) > 0
})
- slices.SortFunc is a newer replacement for it. In most cases, we should use the slices package instead of the sort package.
- I think using -h.cmpFunc(result[i], result[j]) would be more explicit that the order here is reversed.
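A short sketch of both suggestions together, assuming Go 1.21 or later for the slices and cmp packages. The item type, byCount, and sortedDesc are hypothetical names used only for illustration.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// item is a hypothetical element type with a frequency count.
type item struct {
	Value string
	Count int
}

// byCount orders items ascending by Count.
func byCount(a, b item) int { return cmp.Compare(a.Count, b.Count) }

// sortedDesc returns a copy of in sorted from largest to smallest.
// Negating the comparator result makes the reversed order explicit.
func sortedDesc[T any](in []T, cmpFunc func(a, b T) int) []T {
	out := slices.Clone(in)
	slices.SortFunc(out, func(a, b T) int { return -cmpFunc(a, b) })
	return out
}

func main() {
	in := []item{{"a", 1}, {"b", 3}, {"c", 2}}
	fmt.Println(sortedDesc(in, byCount)) // [{b 3} {c 2} {a 1}]
}
```

Negation is safe here because cmp.Compare only returns -1, 0, or 1; for a comparator that may return arbitrary integers, swapping the arguments (cmpFunc(b, a)) avoids any overflow concern.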
pkg/util/generic/bounded_min_heap.go
Outdated
// Less compares two items in the heap. We use a min-heap for efficient bounded operations.
func (h *BoundedMinHeap[T]) Less(i, j int) bool {
	return h.cmpFunc(h.items[i], h.items[j]) < 0
}

// Swap swaps two items in the heap.
func (h *BoundedMinHeap[T]) Swap(i, j int) {
	h.items[i], h.items[j] = h.items[j], h.items[i]
}

// Push adds an item to the heap.
func (h *BoundedMinHeap[T]) Push(x any) {
	h.items = append(h.items, x.(T))
}

// Pop removes and returns the smallest item from the heap.
func (h *BoundedMinHeap[T]) Pop() any {
	old := h.items
	n := len(old)
	item := old[n-1]
	h.items = old[0 : n-1]
	return item
}
I think we need to reconsider the API design of this struct.
These methods are not expected to be used externally and should be unexported. I think the heap from the std lib should be an internal data type.
Yes, that's the heap interface we must comply with: https://pkg.go.dev/container/heap
I think that's the standard library heap, and I looked at other implementations in the codebase and they all follow the same pattern. Do you mind sharing which std lib you are referring to?
Yes. I understand that you implemented BoundedMinHeap based on the container/heap from the std lib.
I mean the container/heap here should be an internal data type and should not be accessed externally. For example, you don't want other people to do something like NewBoundedMinHeap().Push(val).
What I have in mind is that the implementation should be something like:
type internalHeap[T any] struct {
	....
}

func (h *internalHeap[T]) Len() int {...}
func (h *internalHeap[T]) Less(i, j int) bool {...}
func (h *internalHeap[T]) Swap(i, j int) {...}
func (h *internalHeap[T]) Push(x any) {...}
func (h *internalHeap[T]) Pop() any {...}

type BoundedMinHeap[T any] struct {
	data *internalHeap[T]
	...
}

func (h *BoundedMinHeap[T]) Add(item T) {...}
func (h *BoundedMinHeap[T]) ToSortedSlice() []T {...}
Oh, I see the idea. Yeah, that's a great suggestion. Let me do that.
pkg/util/generic/bounded_min_heap.go
Outdated
// BoundedMinHeap implements a min-heap for maintaining the best N items efficiently.
// It keeps the N items with the highest values according to the comparison function.
// The root of the heap is always the smallest item, making it easy to remove when adding better items.
I think we need to unify the terminology: best/worst or high/low.
pkg/statistics/builder.go
Outdated
if a.Count < b.Count {
	return -1 // min-heap: smaller counts at root
} else if a.Count > b.Count {
	return 1
}
return 0
cmp.Compare(a.Count, b.Count)
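As a quick illustration of the suggestion, the sketch below places the hand-written three-way comparison next to the cmp.Compare version (Go 1.21 or later). The topNItem type is a hypothetical stand-in with only a Count field.

```go
package main

import (
	"cmp"
	"fmt"
)

// topNItem is a hypothetical stand-in for the element being compared.
type topNItem struct {
	Count uint64
}

// compareManual is the hand-written three-way comparison.
func compareManual(a, b topNItem) int {
	if a.Count < b.Count {
		return -1 // min-heap: smaller counts at root
	} else if a.Count > b.Count {
		return 1
	}
	return 0
}

// compareStd expresses the same ordering with the standard library.
func compareStd(a, b topNItem) int {
	return cmp.Compare(a.Count, b.Count)
}

func main() {
	a, b := topNItem{Count: 2}, topNItem{Count: 5}
	fmt.Println(compareManual(a, b), compareStd(a, b)) // -1 -1
	fmt.Println(compareManual(b, a), compareStd(b, a)) // 1 1
	fmt.Println(compareManual(a, a), compareStd(a, a)) // 0 0
}
```

For integer fields the two functions return identical results, so the shorter form can be swapped in without changing the heap order.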
Signed-off-by: Wenqi Mou <[email protected]>
|
/retest |
|
/hold |
Signed-off-by: Wenqi Mou <[email protected]>
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: 0xPoe, terry1purcell, time-and-fate
The full list of commands accepted by this bot can be found here. The pull request process is described here.
|
/unhold |
Signed-off-by: ti-chi-bot <[email protected]>
|
In response to a cherrypick label: new pull request created to branch |
Signed-off-by: ti-chi-bot <[email protected]>
|
In response to a cherrypick label: new pull request created to branch |
What problem does this PR solve?
Issue Number: close #63286
Problem Summary:
What changed and how does it work?