-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feat: support selective resource dimension checking in capacity plugin #4696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Summary of ChangesHello @dafu-wu, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the scheduler's capacity plugin by introducing a mechanism for selective resource dimension checking. This allows administrators to configure queues to enforce limits only on specific resource types, such as GPUs, while allowing other resources like CPU and memory to be more flexibly allocated. This feature is particularly beneficial for heterogeneous clusters and specialized workloads, improving resource utilization and simplifying queue management by preventing overly restrictive checks on all resource dimensions. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable feature for selective resource dimension checking in the capacity plugin, which is particularly useful for specialized queues like GPU-only ones. The implementation is mostly solid, with a new comparison function and integration into the capacity plugin's allocation and enqueueing logic. The accompanying design document is clear and comprehensive.
My review has identified one high-severity issue where the simulation logic was not updated to use the new feature, which could lead to scheduling inconsistencies. I've also included a medium-severity suggestion to improve code conciseness. Addressing these points will help ensure the feature is robust and maintainable.
| func (cp *capacityPlugin) queueAllocatable(queue *api.QueueInfo, candidate *api.TaskInfo) bool { | ||
| attr := cp.queueOpts[queue.UID] | ||
| return queueAllocatable(attr, candidate, queue) | ||
| return cp.queueAllocatableWithCheck(attr, candidate, queue) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change correctly updates queueAllocatable to use the new selective dimension checking logic. However, the simulation logic for predicates in SimulateAllocatableFn (defined around line 300) still seems to use the old queueAllocatable function, which does not respect the checkQueueDimensionsOnly flag.
This can lead to inconsistencies between the scheduler's simulation phase and the actual allocation phase, potentially causing scheduling failures or unexpected behavior when checkQueueDimensionsOnly is enabled.
To fix this, the simulateQueueAllocatable closure within OnSessionOpen should be updated to use cp.queueAllocatableWithCheck to ensure simulation results are consistent with the actual allocation logic.
| lessEqualFunc := func(l, r, diff float64) bool { | ||
| if l < r || math.Abs(l-r) < diff { | ||
| return true | ||
| } | ||
| return false | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add checkQueueDimensionsOnly option to only validate resources explicitly defined in queue capability. Enables GPU-only queues and specialized resource management. Signed-off-by: dafu <[email protected]>
|
Hello @dafu-wu, What I don't understand is that why should this be a capacity plugin argument? This should be the default how the plugin shall work. If we don't specify a resource dimension we shouldn't care, so I consider this to be a bug. Of course you can specify as many arguments as you want. |
Sounds good, We will conduct tests to see if 4659 truly solves the problem. |
Add checkQueueDimensionsOnly option for selective resource checking
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds
checkQueueDimensionsOnlyconfiguration option to capacity plugin, enabling selective resource dimension checking based on queue's capability definition.Use case: GPU-only queues where only GPU resources need enforcement while CPU/memory remain flexible.
Implementation:
LessEqualWithSpecifiedDimensions()function for selective dimension comparisonWhich issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Configuration Example
Files Changed
pkg/scheduler/api/resource_info.go: +67 lines (new comparison function)pkg/scheduler/plugins/capacity/capacity.go: +58 -6 lines (plugin logic)docs/design/capacity-check-queue-dimensions-only.md: new file (documentation)