Description
We currently use hierarchical queues to achieve resource isolation in multi-tenant scenarios. We recently encountered a scenario that can lead to significant stability issues.
For example, we have the following hierarchical queue:
root (capacity: 32 H100 GPUs = cluster total; 4 nodes × 8 cards each)
├── child-queue-a (capacity: 32)
└── child-queue-b (capacity: 10)
└── subchild-queue-a (capacity: 5)
....
The quotas for the child queues and subchild queues are declared by tenants based on the total capacity of the resource pool.
At first, the whole system works well. At a certain point, one of the physical nodes experiences a GPU card failure, a fairly common fault scenario. This causes the GPU resources reported by that node to decrease from 8 to 7. Because the root queue's capacity is calculated from node capacity, the root queue's cap drops from 32 to 31. The validation in checkHierarchicalQueue then fails, since the declared child capacities no longer fit under the shrunken root, and the capacity plugin blocks the now-invalid tree, which can prevent all tenants' computing workloads from being scheduled. From a business perspective, child-queue-b and subchild-queue-a should still work normally even after a single GPU card failure, because the resource pool as a whole still has sufficient GPU resources.
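To make the failure mode concrete, here is a minimal, self-contained Go sketch; the Queue type, checkTree function, and single-scalar GPU accounting are illustrative stand-ins for Volcano's queue attributes and checkHierarchicalQueue, not the real implementation:

```go
package main

import "fmt"

// Queue is a simplified stand-in for a Volcano queue; Capability is the
// user-declared capacity, counted here as a single GPU scalar.
type Queue struct {
	Name       string
	Capability int
	Children   []*Queue
}

// checkTree mimics the hierarchical validation: a queue whose declared
// capability exceeds its parent's effective capacity fails the whole tree.
func checkTree(q *Queue, parentCap int) error {
	if q.Capability > parentCap {
		return fmt.Errorf("queue %s: capability %d exceeds parent capacity %d",
			q.Name, q.Capability, parentCap)
	}
	for _, c := range q.Children {
		if err := checkTree(c, q.Capability); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	root := &Queue{Name: "root", Capability: 32, Children: []*Queue{
		{Name: "child-queue-a", Capability: 32},
		{Name: "child-queue-b", Capability: 10, Children: []*Queue{
			{Name: "subchild-queue-a", Capability: 5},
		}},
	}}

	// realCapacity is summed from node reports: 4 nodes x 8 GPUs, but one
	// node now reports only 7 after a card failure.
	realCapacity := 8 + 8 + 8 + 7 // 31

	// Today the root is validated against realCapacity, so the declared 32
	// no longer fits and every queue in the tree stops admitting workloads.
	fmt.Println(checkTree(root, realCapacity))
	// queue root: capability 32 exceeds parent capacity 31
}
```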
There are two possible solutions:
- Currently, the child queues and subchild queues are declared by the business itself, so they can be maintained independently by the business's own controller to keep their tree structure stable. The root queue is different: it is constrained by the node resource-reporting chain, which involves components such as physical devices and device plugins. One option is to special-case the root queue, for example by skipping the root queue check in checkHierarchicalQueue and several other functions, but I don't think that is a good solution.
- The core issue is that we currently allow users to set the Capacity of the root queue themselves, yet checkHierarchicalQueue still uses realCapacity, which in turn comes from the sum of the actual node resources. I would like to add an option to use the declared Cap directly as realCapacity in these calculations; a sketch follows this list.
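A minimal sketch of the proposed option; the function and flag names are hypothetical, invented here to illustrate the idea rather than taken from Volcano's actual code:

```go
package main

import "fmt"

// rootRealCapacity returns the effective root capacity used to validate the
// queue tree. trustDeclared models the proposed option (the flag name is
// hypothetical): when enabled, the user-declared root Cap is used directly
// as realCapacity, so a node-level fault cannot invalidate the whole tree.
func rootRealCapacity(declaredCap, nodeSum int, trustDeclared bool) int {
	if trustDeclared {
		return declaredCap // stays 32 even when one node reports 7 GPUs
	}
	if nodeSum < declaredCap {
		return nodeSum // current behavior: drops to 31 after the card failure
	}
	return declaredCap
}

func main() {
	fmt.Println(rootRealCapacity(32, 31, false)) // 31: tree validation fails
	fmt.Println(rootRealCapacity(32, 31, true))  // 32: tree stays valid
}
```

The obvious trade-off is that trusting the declared Cap can overcommit a pool that has physically shrunk; the assumption here is that keeping the whole queue tree schedulable through a common node-level fault is worth that risk.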
Steps to reproduce the issue
Describe the results you received and expected
Expected: tasks can be enqueued normally as long as the resource pool's overall resources are sufficient. Received: after a single GPU card failure, workloads across the entire queue tree can no longer be scheduled.
What version of Volcano are you using?
master
Any other relevant information
No response