Support task level network topology constraint #4188

@Monokaix

Description

What is the problem you're trying to solve

Currently, vcjob and PodGroup support setting a network topology constraint only at the workload level. We should also support setting a network topology constraint at the task level.

Describe the solution you'd like

In both training and inference scenarios, we don't necessarily need all tasks within the entire job to be restricted to the same HyperNode.

  1. In training scenarios, with pipeline parallelism (pp) and data parallelism (dp), it's sufficient for the data parallel (dp) tasks to be distributed within one topology domain.

  2. In inference scenarios, such as vLLM, the requirement is usually only that the workers are deployed within the same topology domain, while the leader has no topology constraint.
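
To make the two scenarios above concrete, here is a minimal sketch of a per-task placement check. It is not Volcano code: `NodeDomains`, `task_fits`, and the node and HyperNode names are hypothetical, and the topology is simplified to a map from each node to the HyperNode it belongs to at each tier. A task with no constraint (such as the leader) is accepted anywhere, while a constrained task (such as the workers or the dp tasks) must fit in one HyperNode at some tier no higher than its limit.

```python
from typing import Dict, List, Optional

# Hypothetical topology model: node name -> {tier: HyperNode name at that tier}.
NodeDomains = Dict[str, Dict[int, str]]


def task_fits(nodes: List[str], highest_tier: Optional[int], domains: NodeDomains) -> bool:
    """Return True if all pods of one task share a HyperNode at some tier <= highest_tier."""
    if highest_tier is None or not nodes:
        # Unconstrained task (e.g. the leader) may be placed on any nodes.
        return True
    for tier in range(1, highest_tier + 1):
        names = {domains.get(node, {}).get(tier) for node in nodes}
        # Exactly one known HyperNode means every pod is inside the same domain.
        if len(names) == 1 and None not in names:
            return True
    return False


# Assumed example cluster: node-a and node-b share tier-1 HyperNode s0,
# node-c sits under s1, and all three share tier-2 HyperNode s2.
domains: NodeDomains = {
    "node-a": {1: "s0", 2: "s2"},
    "node-b": {1: "s0", 2: "s2"},
    "node-c": {1: "s1", 2: "s2"},
}

workers_ok = task_fits(["node-a", "node-b"], 1, domains)
workers_split = task_fits(["node-a", "node-c"], 1, domains)
leader_ok = task_fits(["node-c"], None, domains)
```

With this model, only the constrained tasks limit where the job can land; the rest of the job can use any free nodes.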

Additional context

  1. The Volcano scheduler should be modified to support task level topology.
  2. A validating webhook should be added, e.g., to check that the task level HighestTierAllowed is not greater than the job level HighestTierAllowed.
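
The webhook rule in item 2 could look roughly like the sketch below. This is an illustration only: `JobSpec`, `task_tiers`, and `validate_task_tiers` are hypothetical names, not the actual Volcano API or admission code, and the job is reduced to the tier fields relevant to the check.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class JobSpec:
    # Hypothetical, simplified stand-in for a vcjob spec.
    highest_tier_allowed: Optional[int] = None  # job level limit, None if unset
    # task name -> task level HighestTierAllowed (None means no task constraint)
    task_tiers: Dict[str, Optional[int]] = field(default_factory=dict)


def validate_task_tiers(job: JobSpec) -> List[str]:
    """Return one error message per task whose tier exceeds the job level tier."""
    errors: List[str] = []
    if job.highest_tier_allowed is None:
        # No job level limit, so there is nothing to compare against.
        return errors
    for name, tier in job.task_tiers.items():
        if tier is not None and tier > job.highest_tier_allowed:
            errors.append(
                f"task {name!r}: HighestTierAllowed {tier} is greater than "
                f"job level HighestTierAllowed {job.highest_tier_allowed}"
            )
    return errors


valid_job = JobSpec(highest_tier_allowed=2, task_tiers={"worker": 1, "leader": None})
invalid_job = JobSpec(highest_tier_allowed=1, task_tiers={"worker": 2})
```

Returning a list of messages rather than failing on the first violation lets the webhook report every offending task in a single rejection.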

Metadata

Labels

kind/feature: Categorizes issue or PR as related to a new feature.

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
