Skip to content

Conversation

@hansongChina
Copy link
Contributor

Signed-off-by: hansong [email protected]

What type of PR is this?

This is a elastic job training architecture design

What this PR does / why we need it:

To complete the proposal for elastic job training, add an architecture design

Which issue(s) this PR fixes:

Fixes #4666

Special notes for your reviewer:

Does this PR introduce a user-facing change?


@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign hwdef
You can assign the PR to them by writing /assign @hwdef in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Copy link

Summary of Changes

Hello @hansongChina, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a detailed architectural design for elastic job scheduling, primarily aimed at optimizing resource utilization for AI training workloads. It proposes mechanisms for tasks to dynamically expand or shrink based on cluster resource availability, while also enhancing preemption logic to manage resource reclamation effectively. The design outlines new API annotations for defining elastic job properties, introduces a dedicated 'elastic' scheduling action, and refines existing Allocate and Reclaim actions to support these new capabilities, ultimately providing more controllable and efficient resource management.

Highlights

  • New Elastic Job Design: Introduces a comprehensive design for elastic job scheduling, particularly for AI training scenarios, allowing tasks to dynamically scale based on resource availability and predefined strategies.
  • Flexible Resource Management: Enables users to define jobs as 'elastic' with specific scaling strategies (e.g., power of 2) and cooldown times, improving cluster utilization by leveraging idle resources.
  • Enhanced Preemption Logic: Modifies the Reclaim action to include job-dimension-based scoring and sorting, prioritizing elastic tasks for preemption while ensuring normal tasks stay within their Deserved limits in elastic mode.
  • New elastic Action and API: Introduces a dedicated elastic action for scheduling elastic tasks and new vcjob annotations for configuring elastic behavior, such as volcano.sh/elastic-train and volcano.sh/elastic-train-power-of-two.
  • Configurable Reclaim Modes: Adds capacity.reclaim.mode parameter, allowing administrators to choose between a general reclaim mode or an elastic mode that specifically targets elastic tasks for preemption.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 9, 2025
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a good start for the design document. I've identified several areas that need clarification, especially regarding the core concepts of resource guarantees (min), preemption modes (reclaim vs. elastic), and the API design for elastic strategies. Addressing these points will make the design much clearer and more robust. I've also pointed out a few typos and grammatical issues to improve readability.


## Proposal

The scheduling mode of elastic tasks supports the capacity plugin, but does not support proportion. Based on the AI elastic scenario, scheduling can be used together with reclaim mode, meaning normal tasks with min may also be reclaimed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This sentence states that "normal tasks with min may also be reclaimed", which seems to contradict the purpose of min as a guaranteed resource level. It also conflicts with the statement on line 31, which says "there is a guarantee for min normal tasks" and "only elastic resources will be preempted" in elastic mode.

Please clarify the preemption and reclamation behavior for min tasks in different modes (reclaim vs. elastic). The guarantee provided by min should be explicitly defined. For example:

  • Are min tasks ever preemptible?
  • If so, under what specific conditions?
  • How does this interact with the reclaim and elastic modes?

A clear definition is crucial for users to understand the reliability of their jobs.

## Design detail

### Elastic task scheduling architecture diagram
<div align="center"><img width="800" height="400" src="image/elastic.png" /></div>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The architecture diagram image/elastic.png is referenced here, but it has not been added to the repository in this pull request. A design document is incomplete without key diagrams. Please add the image file to the PR.

Comment on lines +69 to +78
volcano.sh/cooldown-time: 5s
# How long can a flexible task be killed
volcano.sh/elastic-cooldown-time: 10m0s
# The time interval between each scaling operation.
volcano.sh/elastic-train: "true"
# This task is an elastic task.
volcano.sh/elastic-train-power-of-two: "true"
# Elastic strategy, is it a power of 2
volcano.sh/elastic-train-factor: "2"
# Elastic strategy, with the expansion capacity specified as a multiple of the value given by elastic-train-factor. 1 means expanding when there are resources available.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The annotations in the API design and their corresponding comments are confusing and need clarification.

  • volcano.sh/cooldown-time: This annotation is missing a comment explaining its purpose.
  • volcano.sh/elastic-cooldown-time: There are two comments that seem to relate to this:
    1. # How long can a flexible task be killed (line 70) - This is ambiguous. Does it mean minimum uptime before a task is eligible for preemption?
    2. # The time interval between each scaling operation. (line 72) - This is a more standard definition of a cooldown.
      Please clarify the exact meaning of this annotation and provide a single, clear comment.
  • volcano.sh/elastic-train-power-of-two and volcano.sh/elastic-train-factor: The relationship between these two annotations is not explained.
    • If elastic-train-power-of-two: "true", is elastic-train-factor ignored?
    • What are the valid values and behavior for elastic-train-factor? The comment mentions "1 means expanding when there are resources available," which seems to imply a different strategy.

Please revise this section to clearly define each annotation, its purpose, its valid values, and how the annotations interact with each other.

- Plugin capacity update
- **arg**
- Add capacity.reclaim.mode:reclaim/elastic reclaim mode parameter
- reclaim mode is consistent with the existing elastic scenarios. And it supports elastic mode.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This sentence is contradictory. It states that reclaim mode "is consistent with the existing elastic scenarios" and also "supports elastic mode". It's unclear how one mode can also support another mode. Please clarify the relationship and behavior of these modes. Does reclaim mode encompass the functionality of elastic mode, or are they mutually exclusive?

@@ -0,0 +1,107 @@
# Elastic scheduling based on AI scenarios

[@hansongChina](hhttps://github.com/hansongChina)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There's a typo in the URL for the author's GitHub profile. It starts with hhttps:// instead of https://, which makes the link broken.

Suggested change
[@hansongChina](hhttps://github.com/hansongChina)
[@hansongChina](https://github.com/hansongChina)


### API design

Adding annotation tags on vcjob is an elastic task

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The phrasing of this sentence is a bit awkward. It could be improved for better readability.

Suggested change
Adding annotation tags on vcjob is an elastic task
A vcjob is marked as an elastic task by adding specific annotations.

volcano.sh/elastic-cooldown-time: 10m0s
# The time interval between each scaling operation.
volcano.sh/elastic-train: "true"
# This task is an elastic task.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment says "This task is an elastic task." However, the annotation is on a vcjob. For clarity and precision, it would be better to refer to it as a "job".

Suggested change
# This task is an elastic task.
# This job is an elastic job.


### Main process

KKey function needed to add:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a typo "KKey". It should be "Key".

Suggested change
KKey function needed to add:
Key functions needed to add:

Comment on lines +87 to +91
- update `pickUpQueuesAndJobs`function Add to check if this job has a judgment for allocate processing scheduling
- update `allocateResources` function Add task.ElasticTrain judgment, if it is an elastic task scheduling, then hand it over to elasitc's action to handle
- **Reclaim:**
- add Add elastic task preemption logic, score and preempt based on job dimension, and the number of preempted tasks should meet the scaling down strategy
- update The original reclaim logic has been enhanced to include scoring and preemption based on job dimension, supporting gang preemption.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The descriptions of the main process changes are unclear and contain typos.

  • Line 87: The phrasing is awkward. It could be "update pickUpQueuesAndJobs function to filter out elastic jobs, which will be handled by the elastic action."
  • Line 88: Typo elasitc's. The phrasing is also verbose. It could be "update allocateResources function to delegate scheduling of elastic tasks to the elastic action."
  • Lines 90-91: The term "job dimension" is vague. Please clarify what this means. Is it based on job priority, resource usage, or something else?

Please revise this section for clarity, correctness, and to provide more specific details on the proposed logic.

Comment on lines +106 to +107
Elastic action is executed at the end of configuration, can configure `capacity` as elastic mode, for preempted tasks display controllable,
Also can configure reclaim mode, which meets the elastic task preemption and normal task exceeding `Deserved` recovery mechanism. No newline at end of file

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This section is written as a single, long run-on sentence, which makes it difficult to understand. Please break it down into separate sentences for clarity.

For example:
"The elastic action should be executed at the end of the action chain. The capacity plugin can be configured in elastic mode to make preemption of tasks controllable. Alternatively, it can be configured in reclaim mode, which supports both preemption of elastic tasks and reclamation of resources from normal tasks that exceed their Deserved allocation."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

retest-not-required-docs-only size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Elastic scheduling based on AI scenarios

2 participants