-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feat: elastic job feature tdd #4667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: hansong <[email protected]>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Summary of ChangesHello @hansongChina, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a detailed architectural design for elastic job scheduling, primarily aimed at optimizing resource utilization for AI training workloads. It proposes mechanisms for tasks to dynamically expand or shrink based on cluster resource availability, while also enhancing preemption logic to manage resource reclamation effectively. The design outlines new API annotations for defining elastic job properties, introduces a dedicated 'elastic' scheduling action, and refines existing Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This is a good start for the design document. I've identified several areas that need clarification, especially regarding the core concepts of resource guarantees (min), preemption modes (reclaim vs. elastic), and the API design for elastic strategies. Addressing these points will make the design much clearer and more robust. I've also pointed out a few typos and grammatical issues to improve readability.
|
|
||
| ## Proposal | ||
|
|
||
| The scheduling mode of elastic tasks supports the capacity plugin, but does not support proportion. Based on the AI elastic scenario, scheduling can be used together with reclaim mode, meaning normal tasks with min may also be reclaimed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence states that "normal tasks with min may also be reclaimed", which seems to contradict the purpose of min as a guaranteed resource level. It also conflicts with the statement on line 31, which says "there is a guarantee for min normal tasks" and "only elastic resources will be preempted" in elastic mode.
Please clarify the preemption and reclamation behavior for min tasks in different modes (reclaim vs. elastic). The guarantee provided by min should be explicitly defined. For example:
- Are
mintasks ever preemptible? - If so, under what specific conditions?
- How does this interact with the
reclaimandelasticmodes?
A clear definition is crucial for users to understand the reliability of their jobs.
| ## Design detail | ||
|
|
||
| ### Elastic task scheduling architecture diagram | ||
| <div align="center"><img width="800" height="400" src="image/elastic.png" /></div> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| volcano.sh/cooldown-time: 5s | ||
| # How long can a flexible task be killed | ||
| volcano.sh/elastic-cooldown-time: 10m0s | ||
| # The time interval between each scaling operation. | ||
| volcano.sh/elastic-train: "true" | ||
| # This task is an elastic task. | ||
| volcano.sh/elastic-train-power-of-two: "true" | ||
| # Elastic strategy, is it a power of 2 | ||
| volcano.sh/elastic-train-factor: "2" | ||
| # Elastic strategy, with the expansion capacity specified as a multiple of the value given by elastic-train-factor. 1 means expanding when there are resources available. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The annotations in the API design and their corresponding comments are confusing and need clarification.
volcano.sh/cooldown-time: This annotation is missing a comment explaining its purpose.volcano.sh/elastic-cooldown-time: There are two comments that seem to relate to this:# How long can a flexible task be killed(line 70) - This is ambiguous. Does it mean minimum uptime before a task is eligible for preemption?# The time interval between each scaling operation.(line 72) - This is a more standard definition of a cooldown.
Please clarify the exact meaning of this annotation and provide a single, clear comment.
volcano.sh/elastic-train-power-of-twoandvolcano.sh/elastic-train-factor: The relationship between these two annotations is not explained.- If
elastic-train-power-of-two: "true", iselastic-train-factorignored? - What are the valid values and behavior for
elastic-train-factor? The comment mentions "1 means expanding when there are resources available," which seems to imply a different strategy.
- If
Please revise this section to clearly define each annotation, its purpose, its valid values, and how the annotations interact with each other.
| - Plugin capacity update | ||
| - **arg** | ||
| - Add capacity.reclaim.mode:reclaim/elastic reclaim mode parameter | ||
| - reclaim mode is consistent with the existing elastic scenarios. And it supports elastic mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence is contradictory. It states that reclaim mode "is consistent with the existing elastic scenarios" and also "supports elastic mode". It's unclear how one mode can also support another mode. Please clarify the relationship and behavior of these modes. Does reclaim mode encompass the functionality of elastic mode, or are they mutually exclusive?
| @@ -0,0 +1,107 @@ | |||
| # Elastic scheduling based on AI scenarios | |||
|
|
|||
| [@hansongChina](hhttps://github.com/hansongChina) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| ### API design | ||
|
|
||
| Adding annotation tags on vcjob is an elastic task |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| volcano.sh/elastic-cooldown-time: 10m0s | ||
| # The time interval between each scaling operation. | ||
| volcano.sh/elastic-train: "true" | ||
| # This task is an elastic task. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
|
||
| ### Main process | ||
|
|
||
| KKey function needed to add: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| - update `pickUpQueuesAndJobs`function Add to check if this job has a judgment for allocate processing scheduling | ||
| - update `allocateResources` function Add task.ElasticTrain judgment, if it is an elastic task scheduling, then hand it over to elasitc's action to handle | ||
| - **Reclaim:** | ||
| - add Add elastic task preemption logic, score and preempt based on job dimension, and the number of preempted tasks should meet the scaling down strategy | ||
| - update The original reclaim logic has been enhanced to include scoring and preemption based on job dimension, supporting gang preemption. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The descriptions of the main process changes are unclear and contain typos.
- Line 87: The phrasing is awkward. It could be "update
pickUpQueuesAndJobsfunction to filter out elastic jobs, which will be handled by theelasticaction." - Line 88: Typo
elasitc's. The phrasing is also verbose. It could be "updateallocateResourcesfunction to delegate scheduling of elastic tasks to theelasticaction." - Lines 90-91: The term "job dimension" is vague. Please clarify what this means. Is it based on job priority, resource usage, or something else?
Please revise this section for clarity, correctness, and to provide more specific details on the proposed logic.
| Elastic action is executed at the end of configuration, can configure `capacity` as elastic mode, for preempted tasks display controllable, | ||
| Also can configure reclaim mode, which meets the elastic task preemption and normal task exceeding `Deserved` recovery mechanism. No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section is written as a single, long run-on sentence, which makes it difficult to understand. Please break it down into separate sentences for clarity.
For example:
"The elastic action should be executed at the end of the action chain. The capacity plugin can be configured in elastic mode to make preemption of tasks controllable. Alternatively, it can be configured in reclaim mode, which supports both preemption of elastic tasks and reclamation of resources from normal tasks that exceed their Deserved allocation."
Signed-off-by: hansong [email protected]
What type of PR is this?
This is a elastic job training architecture design
What this PR does / why we need it:
To complete the proposal for elastic job training, add an architecture design
Which issue(s) this PR fixes:
Fixes #4666
Special notes for your reviewer:
Does this PR introduce a user-facing change?