Open
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
Description
What is the problem you're trying to solve
Currently, vcjob and PodGroup support setting a workload-level network topology constraint. We should also support task-level network topology settings.
Describe the solution you'd like
In both training and inference scenarios, we don't necessarily need all tasks within the entire job to be restricted to the same HyperNode.
- In training scenarios with pipeline parallelism (pp) and data parallelism (dp), it is sufficient for the data parallel (dp) tasks to be distributed within one topology domain.
- In inference scenarios, such as vllm, the requirement is usually only that the workers be deployed within the same topology domain, while the leader has no topology constraints.
Additional context
- The Volcano scheduler should be modified to support task-level topology.
- A webhook should be added to validate the settings, e.g., a task-level HighestTierAllowed should not be greater than the job-level HighestTierAllowed.
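
The validation rule above could be sketched roughly as follows. This is only an illustration, not the actual Volcano webhook code: the `NetworkTopology`, `TaskSpec`, and `JobSpec` types and the `validateTaskTopology` function are simplified, hypothetical stand-ins, and only the `HighestTierAllowed` field name comes from this issue. The sketch also assumes that a task without its own setting, or a job without a job-level setting, needs no comparison.

```go
package main

import "fmt"

// NetworkTopology is a hypothetical, simplified stand-in for a
// network topology setting; only HighestTierAllowed is modeled.
type NetworkTopology struct {
	HighestTierAllowed int
}

// TaskSpec is a hypothetical task definition. A nil NetworkTopology
// means the task declares no task-level constraint (e.g. a vllm leader).
type TaskSpec struct {
	Name            string
	NetworkTopology *NetworkTopology
}

// JobSpec is a hypothetical job definition with an optional
// job-level constraint and a list of tasks.
type JobSpec struct {
	NetworkTopology *NetworkTopology
	Tasks           []TaskSpec
}

// validateTaskTopology returns one error per task whose
// HighestTierAllowed is greater than the job-level value.
func validateTaskTopology(job JobSpec) []error {
	var errs []error
	if job.NetworkTopology == nil {
		// Assumption: without a job-level bound there is nothing to compare.
		return errs
	}
	for _, task := range job.Tasks {
		if task.NetworkTopology == nil {
			continue // no task-level constraint to check
		}
		if task.NetworkTopology.HighestTierAllowed > job.NetworkTopology.HighestTierAllowed {
			errs = append(errs, fmt.Errorf(
				"task %q: HighestTierAllowed %d is greater than job-level HighestTierAllowed %d",
				task.Name,
				task.NetworkTopology.HighestTierAllowed,
				job.NetworkTopology.HighestTierAllowed,
			))
		}
	}
	return errs
}
```

With a job-level HighestTierAllowed of 2, a worker task set to 1 and a leader with no setting would pass this check, while a worker task set to 3 would be rejected.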
Status
Done