[Feature] Layer-wise MoE auxiliary loss (split finalize) and optional async router D2H offload #1528
Open
tina-wen wants to merge 2 commits into InternLM:main from
Conversation
pppppM reviewed Mar 3, 2026
Collaborator
@claude review
Contributor
Claude:
Summary
Adds Domino EP support and layer-wise MoE balance loss computation for InternS1 Pro VL training. The layer-wise approach splits the expert balance loss per layer to reduce the peak memory from accumulating all router outputs, with async CPU offloading for router tensors.
Issues
Critical
Warning
Nit
Verdict
REQUEST_CHANGES
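For context outside the diff, the offload pattern the review refers to can be sketched roughly as below. This is an illustrative class, not the actual router_offload.py API: the D2H copy is enqueued on a side CUDA stream into pinned CPU memory, and the host blocks only when the CPU data is first read.

```python
import torch

class OffloadedRouterTensor:
    """Minimal sketch of lazy async D2H offload (illustrative name, not the
    PR's API): enqueue a non-blocking copy to pinned CPU memory on a side
    stream, then synchronize only when the CPU data is actually read."""

    def __init__(self, gpu_tensor: torch.Tensor, stream: torch.cuda.Stream):
        # The side stream must see the producer's work before copying.
        stream.wait_stream(torch.cuda.current_stream())
        self._cpu = torch.empty(
            gpu_tensor.shape, dtype=gpu_tensor.dtype, pin_memory=True
        )
        self._ready = torch.cuda.Event()
        with torch.cuda.stream(stream):
            self._cpu.copy_(gpu_tensor, non_blocking=True)
            self._ready.record()
        # Keep the caching allocator from reusing the GPU memory mid-copy.
        gpu_tensor.record_stream(stream)

    def get(self) -> torch.Tensor:
        self._ready.synchronize()  # lazy wait: block only on first real access
        return self._cpu
```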
HAOCHENYE reviewed Apr 27, 2026
Collaborator
HAOCHENYE left a comment
Just hold, the train engine refactor will optimize this.
Summary
Layer-wise split finalize for MoE auxiliary losses and optional async router D2H offload.
What changed
Implements layer-wise accumulation + finalize for balancing and z-loss in layer_moe_loss.py (a rough sketch follows this list).
Adds lazy async D2H offload for router tensors in router_offload.py and integrates it in moe.py.
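A rough sketch of the split-finalize shape (names are illustrative, not the real layer_moe_loss.py API): each layer folds its router logits into small per-layer loss terms at accumulate time, so finalize only combines scalars and full router outputs never have to be retained across all layers at once.

```python
import torch
import torch.nn.functional as F

class LayerMoEAuxLoss:
    """Illustrative split-finalize accumulator: each layer is reduced to
    scalar terms immediately, so logits are never kept for all layers."""

    def __init__(self, num_experts: int, balance_coef: float = 1e-2,
                 z_coef: float = 1e-3):
        self.num_experts = num_experts
        self.balance_coef = balance_coef
        self.z_coef = z_coef
        self._balance_terms = []
        self._z_terms = []

    def accumulate(self, router_logits: torch.Tensor,
                   topk_idx: torch.Tensor) -> None:
        # router_logits: (tokens, num_experts); topk_idx: (tokens, k), int64.
        probs = router_logits.softmax(dim=-1)
        # Switch-style balance term: fraction of tokens dispatched to each
        # expert times the mean router probability for that expert.
        dispatch = F.one_hot(topk_idx, self.num_experts).float().sum(1).mean(0)
        importance = probs.mean(dim=0)
        self._balance_terms.append(
            self.num_experts * (dispatch * importance).sum())
        # z-loss: penalize large logits via squared logsumexp.
        self._z_terms.append(
            torch.logsumexp(router_logits, dim=-1).square().mean())

    def finalize(self) -> torch.Tensor:
        # Only small per-layer scalars are combined here; the per-layer
        # router logits were released long ago.
        loss = (self.balance_coef * torch.stack(self._balance_terms).mean()
                + self.z_coef * torch.stack(self._z_terms).mean())
        self._balance_terms.clear()
        self._z_terms.clear()
        return loss
```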
Benefits
Reduces peak GPU memory by offloading router logits/weights to pinned CPU while keeping tensor semantics.
Async offload is lazy (waits only when CPU data is actually needed), enabling overlap of D2H copy and GPU work to reduce wall-clock overhead.
Enabling router_async_offload=True lowers memory at the cost of possible host-device transfer latency.
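To make the lazy semantics concrete, here is a hypothetical integration building on the OffloadedRouterTensor sketch in the review thread above. Only the flag name router_async_offload comes from this PR; the loop structure and per-layer return shape are assumptions, and the real implementation's gradient plumbing is not shown.

```python
import torch

def moe_forward(hidden, moe_layers, router_async_offload: bool = True):
    """Hypothetical forward loop: each layer's D2H copy starts as soon as
    its router logits exist and overlaps with later layers' GPU compute."""
    side_stream = torch.cuda.Stream()
    offloaded = []
    for layer in moe_layers:
        hidden, router_logits = layer(hidden)  # assumed per-layer API
        if router_async_offload:
            offloaded.append(OffloadedRouterTensor(router_logits, side_stream))
    return hidden, offloaded

# The host blocks per tensor only when the CPU copy is actually read,
# e.g. for logging router statistics after the forward pass:
# max_logit = max(t.get().abs().max().item() for t in offloaded)
```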