@dafu-wu
Contributor

@dafu-wu dafu-wu commented Oct 26, 2025

Add checkQueueDimensionsOnly option for selective resource checking

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a checkQueueDimensionsOnly configuration option to the capacity plugin, enabling selective resource dimension checking based on the queue's capability definition.

Use case: GPU-only queues where only GPU resources need enforcement while CPU/memory remain flexible.

Implementation:

  • New LessEqualWithSpecifiedDimensions() function for selective dimension comparison
  • Updated capacity plugin to honor the configuration switch
  • Applied to both task allocation and job enqueueing logic
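The idea behind the selective comparison can be sketched as follows. This is a minimal illustration, not the PR's code: the flat `Resource` map, the `minDiff` tolerance value, and the helper names `LessEqualAll` and `LessEqualOnDimensions` are all assumptions made for the example, and the real `LessEqualWithSpecifiedDimensions()` may have a different signature.

```go
package main

import "math"

// Resource is a simplified, hypothetical stand-in for the scheduler's
// resource type: a flat map from dimension name to quantity.
type Resource map[string]float64

// minDiff is an assumed tolerance for floating-point comparison.
const minDiff = 0.1

// lessEqual reports whether l is below r or within minDiff of it.
func lessEqual(l, r float64) bool {
	return l < r || math.Abs(l-r) < minDiff
}

// LessEqualAll checks every dimension requested in req against limit.
// In this sketch a dimension missing from limit counts as zero.
func LessEqualAll(req, limit Resource) bool {
	for name, want := range req {
		if !lessEqual(want, limit[name]) {
			return false
		}
	}
	return true
}

// LessEqualOnDimensions checks only the dimensions named in dims.
// Dimensions of req that dims does not mention are ignored, and a nil
// or empty dims enforces nothing because the loop body never runs.
func LessEqualOnDimensions(req, limit, dims Resource) bool {
	for name := range dims {
		if !lessEqual(req[name], limit[name]) {
			return false
		}
	}
	return true
}
```

With a capability of `{"nvidia.com/gpu": 16}` and a request that also asks for CPU and memory, `LessEqualAll` rejects the request in this model while `LessEqualOnDimensions(req, capability, capability)` accepts it as long as the GPU count fits.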

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

  • Backward compatible: feature is opt-in (default: false)
  • Nil-safe: multiple safety checks prevent nil pointer dereference
  • Tested: code compiles successfully

Does this PR introduce a user-facing change?

Add a checkQueueDimensionsOnly option to the capacity plugin for selective resource checking. When enabled, only the resource dimensions defined in a queue's capability field are validated, allowing flexible allocation of other resources.

Configuration Example

# Scheduler config
tiers:
- plugins:
  - name: capacity
    arguments:
      checkQueueDimensionsOnly: true

# Queue with GPU-only constraint
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: gpu-queue
spec:
  capability:
    nvidia.com/gpu: "16"  # Only GPU checked, CPU/memory flexible
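How the switch might change the outcome for the gpu-queue above can be sketched like this. Everything here is illustrative: the `Arguments` map, `getBool`, and `withinCapability` are hypothetical helpers, not the plugin's API, and the sketch assumes that in strict mode a dimension absent from the capability counts as zero, which may not match how the real plugin treats unspecified dimensions.

```go
package main

// Arguments is a hypothetical stand-in for the plugin arguments parsed
// from the scheduler configuration.
type Arguments map[string]interface{}

// checkQueueDimensionsOnlyKey is the option name from the example config.
const checkQueueDimensionsOnlyKey = "checkQueueDimensionsOnly"

// getBool returns the boolean stored under key, or def when the key is
// missing or holds a non-boolean value.
func getBool(args Arguments, key string, def bool) bool {
	v, ok := args[key]
	if !ok {
		return def
	}
	b, ok := v.(bool)
	if !ok {
		return def
	}
	return b
}

// withinCapability reports whether used+req stays within capability.
// With dimsOnly set, only dimensions listed in capability are checked;
// otherwise every requested dimension is checked and a dimension that
// capability does not list is treated as zero (an assumption).
func withinCapability(used, req, capability map[string]float64, dimsOnly bool) bool {
	if dimsOnly {
		for name, limit := range capability {
			if used[name]+req[name] > limit {
				return false
			}
		}
		return true
	}
	for name, want := range req {
		if used[name]+want > capability[name] {
			return false
		}
	}
	return true
}
```

Defaulting `getBool` to `false` mirrors the opt-in behavior described in the reviewer notes: a configuration that omits the option keeps the strict check.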

Files Changed

  • pkg/scheduler/api/resource_info.go: +67 lines (new comparison function)
  • pkg/scheduler/plugins/capacity/capacity.go: +58 -6 lines (plugin logic)
  • docs/design/capacity-check-queue-dimensions-only.md: new file (documentation)

@volcano-sh-bot volcano-sh-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 26, 2025
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign lowang-bh
You can assign the PR to them by writing /assign @lowang-bh in a comment when ready.

@gemini-code-assist

Summary of Changes

Hello @dafu-wu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the scheduler's capacity plugin by introducing a mechanism for selective resource dimension checking. This allows administrators to configure queues to enforce limits only on specific resource types, such as GPUs, while allowing other resources like CPU and memory to be more flexibly allocated. This feature is particularly beneficial for heterogeneous clusters and specialized workloads, improving resource utilization and simplifying queue management by preventing overly restrictive checks on all resource dimensions.

Highlights

  • Selective Resource Dimension Checking: Introduced a new configuration option, checkQueueDimensionsOnly, to the capacity plugin. When enabled, it allows the scheduler to validate only the resource dimensions explicitly defined in a queue's capability field, providing more flexibility for specialized workloads.
  • New Resource Comparison Function: Added a new Go function, LessEqualWithSpecifiedDimensions, in pkg/scheduler/api/resource_info.go. This function enables comparing resources by only considering a specified subset of dimensions, which is crucial for the new selective checking feature.
  • Capacity Plugin Integration: The capacity plugin (pkg/scheduler/plugins/capacity/capacity.go) has been updated to honor the checkQueueDimensionsOnly setting. This new logic is applied to both task allocation (queueAllocatableWithCheck) and job enqueueing (jobEnqueueable) processes.
  • Documentation and Examples: Comprehensive documentation has been added in docs/design/capacity-check-queue-dimensions-only.md, detailing the feature's design, configuration, use cases (e.g., GPU-only queues), implementation, migration guide, and best practices.

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 26, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable feature for selective resource dimension checking in the capacity plugin, which is particularly useful for specialized queues like GPU-only ones. The implementation is mostly solid, with a new comparison function and integration into the capacity plugin's allocation and enqueueing logic. The accompanying design document is clear and comprehensive.

My review has identified one high-severity issue where the simulation logic was not updated to use the new feature, which could lead to scheduling inconsistencies. I've also included a medium-severity suggestion to improve code conciseness. Addressing these points will help ensure the feature is robust and maintainable.

Comment on lines 839 to +857
 func (cp *capacityPlugin) queueAllocatable(queue *api.QueueInfo, candidate *api.TaskInfo) bool {
 	attr := cp.queueOpts[queue.UID]
-	return queueAllocatable(attr, candidate, queue)
+	return cp.queueAllocatableWithCheck(attr, candidate, queue)
 }

Severity: high

This change correctly updates queueAllocatable to use the new selective dimension checking logic. However, the simulation logic for predicates in SimulateAllocatableFn (defined around line 300) still seems to use the old queueAllocatable function, which does not respect the checkQueueDimensionsOnly flag.

This can lead to inconsistencies between the scheduler's simulation phase and the actual allocation phase, potentially causing scheduling failures or unexpected behavior when checkQueueDimensionsOnly is enabled.

To fix this, the simulateQueueAllocatable closure within OnSessionOpen should be updated to use cp.queueAllocatableWithCheck to ensure simulation results are consistent with the actual allocation logic.
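One way to keep the two paths from drifting apart is to route both through a single check, as in the sketch below. The types and method names (`plugin`, `queueAttr`, `Allocatable`, `SimulateAllocatable`, `allocatableWithCheck`) are hypothetical simplifications and do not reproduce the capacity plugin's actual structures; the strict branch again assumes unlisted dimensions count as zero.

```go
package main

// Resource is a simplified, hypothetical resource map.
type Resource map[string]float64

// queueAttr holds the per-queue state this sketch needs.
type queueAttr struct {
	allocated  Resource
	capability Resource
}

// plugin is a hypothetical stand-in for the capacity plugin.
type plugin struct {
	checkQueueDimensionsOnly bool
	queueOpts                map[string]*queueAttr
}

// allocatableWithCheck is the single place that honors the switch.
// An unknown queue (nil attr) is rejected.
func (p *plugin) allocatableWithCheck(attr *queueAttr, req Resource) bool {
	if attr == nil {
		return false
	}
	if p.checkQueueDimensionsOnly {
		// Selective mode: only dimensions listed in capability count.
		for name, limit := range attr.capability {
			if attr.allocated[name]+req[name] > limit {
				return false
			}
		}
		return true
	}
	// Strict mode: every requested dimension is checked.
	for name, want := range req {
		if attr.allocated[name]+want > attr.capability[name] {
			return false
		}
	}
	return true
}

// Allocatable is the real allocation path.
func (p *plugin) Allocatable(queueID string, req Resource) bool {
	return p.allocatableWithCheck(p.queueOpts[queueID], req)
}

// SimulateAllocatable is the simulation path. It works on a snapshot of
// queue state but reuses the same check, so both paths always agree.
func (p *plugin) SimulateAllocatable(snapshot map[string]*queueAttr, queueID string, req Resource) bool {
	return p.allocatableWithCheck(snapshot[queueID], req)
}
```

Because the flag is consulted in only one function, a later change to the checking rule cannot be applied to allocation and missed in simulation.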

Comment on lines +549 to +554
lessEqualFunc := func(l, r, diff float64) bool {
	if l < r || math.Abs(l-r) < diff {
		return true
	}
	return false
}

Severity: medium

For better readability and conciseness, this function can be simplified. An if-true-return-true-else-return-false pattern can be replaced by directly returning the boolean expression.

lessEqualFunc := func(l, r, diff float64) bool {
	return l < r || math.Abs(l-r) < diff
}
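
As a quick sanity check that the simplification preserves behavior, the two forms can be compared side by side. The function names `lessEqualVerbose`, `lessEqualConcise`, and `formsAgree` are made up for this sketch; only the function bodies come from the review thread.

```go
package main

import "math"

// lessEqualVerbose is the original if/return-true/return-false form.
func lessEqualVerbose(l, r, diff float64) bool {
	if l < r || math.Abs(l-r) < diff {
		return true
	}
	return false
}

// lessEqualConcise returns the boolean expression directly.
func lessEqualConcise(l, r, diff float64) bool {
	return l < r || math.Abs(l-r) < diff
}

// formsAgree reports whether both forms give the same answer for every
// {l, r, diff} triple in cases.
func formsAgree(cases [][3]float64) bool {
	for _, c := range cases {
		if lessEqualVerbose(c[0], c[1], c[2]) != lessEqualConcise(c[0], c[1], c[2]) {
			return false
		}
	}
	return true
}
```

Note that both forms accept a left value slightly above the right one as long as the gap is smaller than `diff`, so the comparison is "less than or approximately equal" rather than a strict `<=`.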

Add checkQueueDimensionsOnly option to only validate resources explicitly
defined in queue capability. Enables GPU-only queues and specialized resource management.

Signed-off-by: dafu <[email protected]>
@hajnalmt
Contributor

Hello @dafu-wu,
I see the issue and I already implemented something like this in this PR:
#4659

What I don't understand is why this should be a capacity plugin argument. This should be how the plugin works by default. If we don't specify a resource dimension, we shouldn't care about it, so I consider this a bug. Of course, you can specify as many arguments as you want.

@dafu-wu
Contributor Author

dafu-wu commented Oct 31, 2025

> Hello @dafu-wu, I see the issue and I already implemented something like this in this PR: #4659
>
> What I don't understand is why this should be a capacity plugin argument. This should be how the plugin works by default. If we don't specify a resource dimension, we shouldn't care about it, so I consider this a bug. Of course, you can specify as many arguments as you want.

Sounds good. We will run tests to see whether #4659 actually solves the problem.


3 participants