@3sunny
Contributor

@3sunny 3sunny commented Oct 16, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add support for task-level network topology constraints, including updates to the deployment files, the admission webhooks, and the controller.

Which issue(s) this PR fixes:

Fixes #4188

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Example YAML for vcJob after modification

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: network-topology-job
spec:
  minAvailable: 6
  schedulerName: volcano
  networkTopology:
    mode: hard
    highestTierAllowed: 2
  tasks:
    - replicas: 6
      name: "task"
      partitionPolicy:
        totalPartitions: 2
        partitionSize: 3
        networkTopology:
          mode: hard
          highestTierAllowed: 1
      template:
        metadata:
          name: task
        spec:
          containers:
            - image: ubuntu
              imagePullPolicy: IfNotPresent
              name: task
              resources:
                requests:
                  cpu: "2"
                  memory: 2Gi
          restartPolicy: OnFailure
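
The vcJob example above splits six replicas into two partitions of three pods each. A minimal Python sketch of that arithmetic follows. It assumes pods are assigned to partitions contiguously by replica index, which the PR text does not state, and `partition_ids` is a hypothetical helper, not a Volcano function.

```python
from typing import List


def partition_ids(replicas: int, partition_size: int) -> List[int]:
    """Return a partition id for each replica index.

    Assumption (not confirmed by the PR): replicas are grouped
    contiguously, so indices 0..partition_size-1 form partition 0,
    the next partition_size indices form partition 1, and so on.
    """
    if replicas < 0:
        raise ValueError("replicas must not be negative")
    if partition_size <= 0:
        raise ValueError("partition_size must be greater than 0")
    return [index // partition_size for index in range(replicas)]


if __name__ == "__main__":
    # Values taken from the vcJob example: replicas=6, partitionSize=3.
    print(partition_ids(6, 3))
```

With the values from the example, each group of three pods would share one partition id and therefore one task-level network topology constraint (`highestTierAllowed: 1`), while the job as a whole stays within tier 2.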

Example YAML for podGroup after modification

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: network-topology-podgroup
spec:
  minMember: 6
  networkTopology:
    mode: hard 
    highestTierAllowed: 2
  subGroupPolicy: 
    - subGroupSize: 3
      name: task
      matchPolicy:
        - labelKey: volcano.sh/task-bunch-id
      networkTopology:
        mode: hard 
        highestTierAllowed: 1
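
Read side by side, the two examples suggest how a task's `partitionPolicy` relates to a `subGroupPolicy` entry on the PodGroup: `partitionSize` becomes `subGroupSize`, the task name is reused, and the nested `networkTopology` is carried over. The sketch below illustrates that mapping on plain dictionaries. It is inferred only from the two YAML examples, `sub_group_policy_for_task` is a hypothetical name, and the label key pattern `volcano.sh/<taskName>-bunch-id` is an assumption based on the PodGroup example; the controller's real conversion may differ.

```python
from typing import Optional


def sub_group_policy_for_task(task: dict) -> Optional[dict]:
    """Build a PodGroup subGroupPolicy entry from a vcJob task dict.

    Returns None when the task has no partitionPolicy. The field
    mapping is an assumption derived from the example YAML only.
    """
    policy = task.get("partitionPolicy")
    if policy is None:
        return None

    entry = {
        "name": task["name"],
        "subGroupSize": policy["partitionSize"],
        # Assumed label key pattern: volcano.sh/<taskName>-bunch-id
        "matchPolicy": [{"labelKey": "volcano.sh/%s-bunch-id" % task["name"]}],
    }
    if "networkTopology" in policy:
        # Copy so later changes to the task do not alter the entry.
        entry["networkTopology"] = dict(policy["networkTopology"])
    return entry


if __name__ == "__main__":
    example_task = {
        "name": "task",
        "replicas": 6,
        "partitionPolicy": {
            "totalPartitions": 2,
            "partitionSize": 3,
            "networkTopology": {"mode": "hard", "highestTierAllowed": 1},
        },
    }
    print(sub_group_policy_for_task(example_task))
```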

@volcano-sh-bot
Contributor

Welcome @3sunny!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot volcano-sh-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 16, 2025
@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 16, 2025
@gemini-code-assist

Summary of Changes

Hello @3sunny, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new feature into Volcano that enables users to specify network topology constraints for individual tasks within a job. This capability is crucial for optimizing the performance of distributed workloads by ensuring that related tasks are scheduled on nodes that meet specific network proximity criteria. The implementation involves comprehensive updates to CRD schemas, modifications to the core controller logic for managing these new policies, and the addition of robust validation mechanisms to prevent misconfigurations.

Highlights

  • Task-Level Network Topology Constraints: Introduced support for defining network topology constraints at the task level within Volcano jobs, allowing for more granular control over pod placement based on network proximity.
  • CRD Updates: Modified existing CRDs (JobFlow, JobTemplate, CronJob, Job) to include a new partitionPolicy field within TaskSpec, which encapsulates networkTopology settings. The PodGroup CRD was also updated with a bunchPolicy field to manage groups of pods with specific network topology requirements.
  • Controller Logic Enhancements: Updated the job controller to manage BunchInfo for tasks, which includes PartitionGroup, MatchPolicy, and NetworkTopology configurations. This involves changes to how job information is cloned, set, and how pods are added, updated, and deleted within the cache.
  • Pod Creation and Labeling: The createJobPod function was enhanced to apply specific labels (volcano.sh/<taskName>-bunch-id and volcano.sh/partition-id) to pods based on the defined PartitionPolicy, facilitating the grouping of pods according to network topology.
  • Webhook Validation: Added validation logic within the admission webhooks to ensure that PartitionPolicy fields (TotalPartitions, PartitionSize) are correctly configured (positive values) and consistent with the task's Replicas count.
  • New Action Type: Introduced RestartPartitionAction and PartitionGroupAction to enable actions and cleanup operations at the partition group level, extending the job controller's state management capabilities.
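
The webhook validation highlight above can be illustrated with a short Python sketch. It is not the webhook code from this PR. The summary only says that the values must be positive and consistent with `Replicas`; the sketch assumes that "consistent" means `totalPartitions * partitionSize == replicas`, which matches the vcJob example (2 × 3 = 6). The function name and error messages are hypothetical.

```python
from typing import List


def validate_partition_policy(
    replicas: int, total_partitions: int, partition_size: int
) -> List[str]:
    """Return a list of validation errors; an empty list means valid.

    Assumed consistency rule (not confirmed by the PR text):
    total_partitions * partition_size must equal replicas.
    """
    errors = []
    if total_partitions <= 0:
        errors.append("totalPartitions must be greater than 0")
    if partition_size <= 0:
        errors.append("partitionSize must be greater than 0")
    # Only check the product once both factors are known to be positive.
    if not errors and total_partitions * partition_size != replicas:
        errors.append(
            "totalPartitions * partitionSize (%d) must equal replicas (%d)"
            % (total_partitions * partition_size, replicas)
        )
    return errors


if __name__ == "__main__":
    # Values from the vcJob example: replicas=6, 2 partitions of 3 pods.
    print(validate_partition_policy(6, 2, 3))
    # A mismatched configuration yields one error.
    print(validate_partition_policy(6, 2, 4))
```

Returning a list rather than raising on the first problem lets a caller report every misconfigured field in a single rejection message.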

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for task-level network topology constraints by adding a partitionPolicy to various CRDs and implementing the corresponding controller logic. The changes are extensive, touching CRD definitions, API types, controller actions, and admission webhooks.

Overall, the implementation looks solid, but I've identified a few issues related to correctness and maintainability. These include a bug in the Clone method for JobInfo, potential for stale data in UpdatePod, insufficient error handling, and some code duplication in the validation logic. I've also pointed out areas in the tests that could be improved for better coverage and correctness. Addressing these points will enhance the robustness and quality of the new feature.

@3sunny 3sunny closed this Oct 17, 2025
@3sunny 3sunny reopened this Nov 3, 2025
@3sunny 3sunny force-pushed the master_punch_affi branch 4 times, most recently from 5b852be to 0d3928e, November 14, 2025 01:37
@kingeasternsun
Contributor

Related discussion: Volcano Weekly Meeting, 2025-11-07 (jump to 31:16): https://www.bilibili.com/video/BV1st23BdEfh/?share_source=copy_web&vd_source=984444f47d5c4830f6101d7f27154d98&t=1876

@3sunny 3sunny force-pushed the master_punch_affi branch 3 times, most recently from 2c05715 to 3ac3b5d, November 14, 2025 08:46
@ouyangshengjia
Contributor

please update the example in the PR description.

@3sunny 3sunny force-pushed the master_punch_affi branch 3 times, most recently from 3a29053 to 7cf2be8, November 14, 2025 10:26
@3sunny
Contributor Author

3sunny commented Nov 14, 2025

> please update the example in the PR description.

done

@3sunny 3sunny force-pushed the master_punch_affi branch 3 times, most recently from 0c6038a to 70cf826, November 15, 2025 03:10
@3sunny 3sunny force-pushed the master_punch_affi branch 3 times, most recently from 740881f to 6f27f6b, November 17, 2025 11:18
@JesseStutler
Member

/lgtm
Thanks

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 17, 2025
@JesseStutler
Member

/cc @wangyang0616

@wangyang0616
Member

/lgtm
/approve

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wangyang0616

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 18, 2025
@volcano-sh-bot volcano-sh-bot merged commit 557edce into volcano-sh:master Nov 18, 2025
22 checks passed

Development

Successfully merging this pull request may close these issues.

Support task level network topology constrain

6 participants