HYPERFLEET-556 - doc - Sentinel resource profiling result and recommendation #86
pnguyen44 wants to merge 6 commits into openshift-hyperfleet:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the corresponding setting.
Walkthrough

A new docs/resource-profiling.md page titled "Sentinel Profiling Results" was added. It describes the profiling goal (validate configured CPU/memory requests & limits) and the experiment setup (5s poll, 15-minute scenarios, 60s sample interval).

Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/resource-profiling.md`:
- Around line 12-13: Update the profiling guidance that currently lists
"Duration" and "Sample interval" so sampling is denser and runs longer: change
the Sample interval from 60s to a finer resolution (e.g., 5–10s) and increase
Duration from 5 min to a longer window (e.g., 15–30 min or multiple runs) to
capture transients; add a requirement to record percentile metrics (p95, p99) in
addition to average/peak and document how to aggregate multiple runs; apply
these changes to the sections that mention "Duration" and "Sample interval" and
to the sizing guidance that references averages/peaks so recommendations use
percentiles and longer, higher-resolution sampling.
- Around line 33-34: Update the documentation to add an explicit mitigation for
poll-interval overrun noted in the "At 5000 clusters with a 5s poll interval"
sentence: add a recommendation entry in the existing recommendation table (and
mirror it where lines 39-47 discuss scaling) that suggests concrete operational
controls such as increasing the poll interval, sharding workloads across
multiple Sentinel instances, and introducing a horizontal-scaling trigger (e.g.,
queue length or CPU threshold) to prevent continuous back-to-back runs; ensure
the new table row references the "poll interval" condition and gives brief
actionable steps and thresholds for large-scale deployments.
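Taken together, the two prompts above could be sketched as follows, assuming a Helm-values-style layout; every field name below is illustrative rather than the real Sentinel or profiling-doc schema, and only the concrete numbers (60s sample interval, 5 min duration, 5s poll, p95/p99 targets) come from this thread:

```yaml
# Illustrative sketch only -- field names are assumptions, not a
# confirmed Sentinel or profiling configuration schema.
profiling:
  sample_interval: 10s    # was 60s; finer resolution captures transients
  duration: 30m           # was 5 min; longer window, or aggregate multiple runs
  report_metrics: [avg, peak, p95, p99]
sentinel:
  poll_interval: 10s      # up from the profiled 5s default, so cycles
                          # finish with idle time at large scale
```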
… with doc standards
… with 15-min data
Actionable comments posted: 1
🧹 Nitpick comments (1)
docs/resource-profiling.md (1)
41-45: Recommendations need concrete per-tier sizing values.

This section stays generic (“increase resource limits”) even though the doc presents scale-specific profiling. Add explicit request/limit recommendations per tier so operators can apply sizing directly and consistently.
Suggested doc enhancement
```diff
 ## Recommendations

-Current defaults (CPU 100m/500m, memory 128Mi/512Mi) support up to ~1000 clusters per Sentinel instance. Beyond 1000 clusters, operators should increase resource limits or shard across multiple instances using `resource_selector`. The results table above can be used to guide sizing.
+Current defaults (CPU 100m/500m, memory 128Mi/512Mi) support up to ~1000 clusters per Sentinel instance. Beyond 1000 clusters, operators should increase resource limits or shard across multiple instances using `resource_selector`.
+
+Suggested starting tiers from this profiling run:
+
+| Tier | Cluster count | CPU request / limit | Memory request / limit |
+| ---- | ------------- | ------------------- | ---------------------- |
+| Small | up to 100 | 100m / 500m | 128Mi / 512Mi |
+| Medium | up to 1000 | 100m / 500m | 128Mi / 512Mi |
+| Large | up to 5000 | 150m / 500m | 192Mi / 512Mi |
+
+Re-profile after changing poll interval, selectors, or workload shape. A Vertical Pod Autoscaler (VPA) could automatically adjust requests to match actual usage, removing the need to manually tune. Most useful if cluster count drifts over time.
```

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
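For operators applying the suggested table directly, the "Large" tier would map onto a standard Kubernetes `resources` block along these lines (values taken from the table above; the enclosing Deployment manifest is omitted):

```yaml
# "Large" tier (up to 5000 clusters) from the suggested table.
resources:
  requests:
    cpu: 150m       # raised from the 100m default
    memory: 192Mi   # raised from the 128Mi default
  limits:
    cpu: 500m       # unchanged
    memory: 512Mi   # unchanged
```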
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/resource-profiling.md` around lines 41 - 45, Update the "Recommendations" section to replace the generic guidance with concrete per-tier CPU and memory request/limit values tied to cluster-count tiers; specifically, add a small/medium/large/xlarge table or list that maps cluster ranges (e.g., 0–250, 251–500, 501–1000, >1000) to explicit requests/limits (starting from the current defaults CPU 100m/500m, memory 128Mi/512Mi and increasing appropriately per tier), include actionable guidance for sharding with resourceSelector and when to enable VPA, and ensure the wording under the "Recommendations" heading clearly states the numeric values so operators can apply them directly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/resource-profiling.md`:
- Line 43: The docs reference the wrong config key name for sharding
guidance—replace all occurrences of the camelCase key "resourceSelector" with
the actual YAML key "resource_selector" (including the sentence that currently
reads "shard across multiple instances using `resourceSelector`" and any
examples/tables that show the key) so operators follow the correct config;
verify consistency across the document and update any related example snippets,
headings, or cross-references that mention the camelCase variant.
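A minimal sketch of the corrected sharding key; only the snake_case name `resource_selector` is confirmed by this comment, and the label-selector body below is an assumption about its shape:

```yaml
# Instance A watches only its shard of the fleet; a second instance
# would use a different label value. Selector structure is assumed.
resource_selector:        # snake_case, not resourceSelector
  matchLabels:
    sentinel-shard: "a"
```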
Force-pushed from 17bf6e4 to aacce8c (Compare)
♻️ Duplicate comments (1)
docs/resource-profiling.md (1)
39-43: ⚠️ Potential issue | 🟡 Minor

Add poll interval adjustment as an explicit mitigation option.
Line 39 identifies that Sentinel is saturated at 5000 clusters with the 5s poll interval, but the recommendation in line 43 doesn't mention increasing the poll interval as a mitigation option. Including this would give operators a simpler alternative to sharding or resource scaling.
📝 Suggested addition
```diff
-Current defaults (CPU 100m/500m, memory 128Mi/512Mi) support up to ~1000 clusters per Sentinel instance. Beyond 1000 clusters, operators should increase resource requests and limits or shard across multiple instances using `resourceSelector`. The results table above can be used to guide sizing.
+Current defaults (CPU 100m/500m, memory 128Mi/512Mi) support up to ~1000 clusters per Sentinel instance. Beyond 1000 clusters, operators should increase resource requests and limits, increase the poll interval (e.g., 10s) to allow idle time between cycles, or shard across multiple instances using `resourceSelector`. The results table above can be used to guide sizing.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/resource-profiling.md` around lines 39 - 43, The recommendations section for Sentinel capacity omits increasing the poll interval as a mitigation; update the text in the "Recommendations" paragraph that currently references defaults (CPU 100m/500m, memory 128Mi/512Mi) and `resourceSelector` to explicitly include increasing the poll interval (e.g., from 5s to a higher value) as an alternative to sharding or raising resource requests/limits, and mention the tradeoff that longer intervals reduce CPU load but increase detection latency so operators can choose the correct balance.
… at 1000 clusters, add multi-instance deployment link
🧹 Nitpick comments (1)
docs/resource-profiling.md (1)
41-45: Add explicit poll-interval option for cycle-time saturation scenario.

Line 39 identifies that at 5000 clusters, "cycle time exceeds the interval, so Sentinel runs back-to-back with no idle time." However, the recommendations on line 43 mention "increase resource requests and limits or shard" but omit the third mitigation option: adjusting the poll interval.
For the 5000-cluster saturation case, increasing resource limits won't resolve the cycle-time bottleneck (CPU is already well under the 500m limit at peak 172m). The viable options are:
- Increase poll interval (e.g., from 5s to 10s) to allow cycles to complete with idle time
- Shard workload across multiple instances using `resourceSelector`

This guidance was specifically requested in previous reviews. Consider adding the poll-interval option explicitly.
📝 Suggested documentation enhancement
```diff
-Current defaults (CPU 100m/500m, memory 128Mi/512Mi) support up to ~1000 clusters per Sentinel instance. At this scale the CPU request (100m) is tight with peaks exceeding it, but the 500m limit provides sufficient burst headroom. Beyond 1000 clusters, operators should increase resource requests and limits or shard across multiple instances using `resourceSelector`. The results table above can be used to guide sizing.
+Current defaults (CPU 100m/500m, memory 128Mi/512Mi) support up to ~1000 clusters per Sentinel instance. At this scale the CPU request (100m) is tight with peaks exceeding it, but the 500m limit provides sufficient burst headroom. Beyond 1000 clusters, operators should increase resource requests and limits or shard across multiple instances using `resourceSelector`. At 5000 clusters, cycle-time saturation occurs; increasing the poll interval or sharding are the primary mitigations (resource increases alone won't resolve the cycle-time bottleneck). The results table above can be used to guide sizing.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/resource-profiling.md` around lines 41 - 45, The Recommendations section omits advising to increase the poll interval as a mitigation for the 5000-cluster "cycle time exceeds the interval" saturation case; update the "Recommendations" paragraph (the text discussing CPU/memory defaults, 1000 vs 5000 clusters, and sharding with resourceSelector) to explicitly add a third mitigation: increase the poll interval (e.g., from 5s to 10s) so cycles have idle time, and mention this is the appropriate fix when CPU usage is below limits but cycle time itself is the bottleneck; reference the 5000-cluster saturation example and include the concrete suggestion to adjust poll interval as an alternative to changing resource requests or sharding with resourceSelector.
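The cycle-time saturation discussed above can be stated compactly; treating per-cluster processing time as roughly constant is a simplifying assumption:

$$T_{\text{cycle}} \approx N \cdot t_{\text{cluster}}, \qquad \text{saturated when } T_{\text{cycle}} \ge T_{\text{poll}}$$

Raising $T_{\text{poll}}$ or sharding (which divides $N$ across instances) attacks the inequality directly, whereas raising CPU limits only shrinks $t_{\text{cluster}}$ if the process is throttled; the review notes peak CPU (172m) is already well under the 500m limit.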
/lgtm
Summary
Jira ticket: HYPERFLEET-556
Test Plan
- make test-all passes
- make lint passes
- make test-helm (if applicable)

Summary by CodeRabbit