Description
We currently use hierarchical queues to achieve resource isolation in multi-tenant scenarios. We recently encountered a scenario that can lead to significant stability issues.
For example, we have the following hierarchical queue:
root (capacity: 32 H100 GPUs = cluster total; 4 nodes × 8 cards each)
├── child-queue-a (capacity: 32)
└── child-queue-b (capacity: 10)
└── subchild-queue-a (capacity: 5)
....
The quotas for the child queues and subchild queues are declared by tenants based on the total capacity of the resource pool.
At first, the whole system works well. At a certain point, one of the physical nodes experiences a GPU card failure, a fairly common fault scenario. This causes the GPU resources reported by that node to decrease from 8 to 7. Because the root queue's capacity is calculated from node capacity, the root queue's cap drops from 32 to 31. The validation in checkHierarchicalQueue then fails, since the declared child capacities no longer fit under the shrunken root, and the capacity plugin blocks the now-invalid tree, which can prevent all tenants' computing workloads from being scheduled. From a business perspective, child-queue-b and subchild-queue-a should still work normally even after a single GPU card failure, because the resource pool as a whole still has sufficient GPU resources.
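To make the failure mode concrete, here is a minimal, self-contained Go sketch; the Queue type, checkTree function, and single-scalar GPU accounting are illustrative stand-ins for Volcano's queue attributes and checkHierarchicalQueue, not the real implementation:

```go
package main

import "fmt"

// Queue is a simplified stand-in for a Volcano queue; Capability is the
// user-declared capacity, counted here as a single GPU scalar.
type Queue struct {
	Name       string
	Capability int
	Children   []*Queue
}

// checkTree mimics the hierarchical validation: a queue whose declared
// capability exceeds its parent's effective capacity fails the whole tree.
func checkTree(q *Queue, parentCap int) error {
	if q.Capability > parentCap {
		return fmt.Errorf("queue %s: capability %d exceeds parent capacity %d",
			q.Name, q.Capability, parentCap)
	}
	for _, c := range q.Children {
		if err := checkTree(c, q.Capability); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	root := &Queue{Name: "root", Capability: 32, Children: []*Queue{
		{Name: "child-queue-a", Capability: 32},
		{Name: "child-queue-b", Capability: 10, Children: []*Queue{
			{Name: "subchild-queue-a", Capability: 5},
		}},
	}}

	// realCapacity is summed from node reports: 4 nodes x 8 GPUs, but one
	// node now reports only 7 after a card failure.
	realCapacity := 8 + 8 + 8 + 7 // 31

	// Today the root is validated against realCapacity, so the declared 32
	// no longer fits and every queue in the tree stops admitting workloads.
	fmt.Println(checkTree(root, realCapacity))
	// queue root: capability 32 exceeds parent capacity 31
}
```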
There are two possible solutions:
- Currently, the child queues and subchild queues are declared by the business itself, so they can be maintained independently by the business's own controller to keep their tree structure stable. The root queue is different: it is constrained by the node resource-reporting chain, which involves components such as physical devices and device plugins. One option is to special-case the root queue, for example by skipping the root queue check in checkHierarchicalQueue and several other functions, but I don't think that is a good solution.
- The core issue is that we currently allow users to set the Capacity of the root queue themselves, yet checkHierarchicalQueue still uses realCapacity, which in turn comes from the sum of the actual node resources. I would like to add an option to use the declared Cap directly as realCapacity in these calculations; a sketch follows this list.
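A minimal sketch of the proposed option; the function and flag names are hypothetical, invented here to illustrate the idea rather than taken from Volcano's actual code:

```go
package main

import "fmt"

// rootRealCapacity returns the effective root capacity used to validate the
// queue tree. trustDeclared models the proposed option (the flag name is
// hypothetical): when enabled, the user-declared root Cap is used directly
// as realCapacity, so a node-level fault cannot invalidate the whole tree.
func rootRealCapacity(declaredCap, nodeSum int, trustDeclared bool) int {
	if trustDeclared {
		return declaredCap // stays 32 even when one node reports 7 GPUs
	}
	if nodeSum < declaredCap {
		return nodeSum // current behavior: drops to 31 after the card failure
	}
	return declaredCap
}

func main() {
	fmt.Println(rootRealCapacity(32, 31, false)) // 31: tree validation fails
	fmt.Println(rootRealCapacity(32, 31, true))  // 32: tree stays valid
}
```

The obvious trade-off is that trusting the declared Cap can overcommit a pool that has physically shrunk; the assumption here is that keeping the whole queue tree schedulable through a common node-level fault is worth that risk.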
Steps to reproduce the issue
Describe the results you received and expected
Expected: tasks can be enqueued normally as long as the resource pool's overall resources are sufficient. Received: after a single GPU card failure, workloads across the entire queue tree can no longer be scheduled.
What version of Volcano are you using?
master
Any other relevant information
No response