
[Feature] Layer-wise MoE auxiliary loss (split finalize) and optional async router D2H offload#1528

Open
tina-wen wants to merge 2 commits into InternLM:main from tina-wen:split_bal_loss

Conversation

@tina-wen (Contributor) commented Mar 3, 2026

Summary

Layer-wise split finalize for MoE auxiliary losses and optional async router D2H offload.

What changed

Implements layer-wise accumulation + finalize for the balancing and z-losses in layer_moe_loss.py.
Adds lazy async D2H offload for router tensors in router_offload.py and integrates it in moe.py.

Benefits

Reduces peak GPU memory by offloading router logits/weights to pinned CPU while keeping tensor semantics.
Async offload is lazy (waits only when CPU data is actually needed), enabling overlap of D2H copy and GPU work to reduce wall-clock overhead.
Enabling router_async_offload=True lowers memory at the cost of possible host-device transfer latency.
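The lazy-wait pattern described above can be sketched as follows. This is a minimal illustration of the technique, not the PR's actual router_offload.py code; the helper names `maybe_offload` / `wait_offload` are hypothetical:

```python
import torch


def maybe_offload(t: torch.Tensor):
    """Start an async device-to-host copy of `t`; return (host_buffer, event).

    The copy is only enqueued, not awaited, so the GPU can keep working
    while the D2H transfer runs in the background.
    """
    if not t.is_cuda:  # nothing to offload on CPU-only runs
        return t, None
    # Pinned (page-locked) host memory is required for a truly async D2H copy.
    host = torch.empty(t.shape, dtype=t.dtype, pin_memory=True)
    host.copy_(t, non_blocking=True)
    event = torch.cuda.Event()
    event.record()  # marks the copy's completion point on the current stream
    return host, event


def wait_offload(host: torch.Tensor, event) -> torch.Tensor:
    """Block only when the CPU data is actually needed (lazy wait)."""
    if event is not None:
        event.synchronize()
    return host
```

The key design point is that `event.synchronize()` is deferred to the consumer: as long as the loss finalize step runs after enough GPU work has been enqueued, the transfer latency is hidden.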

Comment thread xtuner/v1/model/moe/moe.py Outdated
@tina-wen tina-wen force-pushed the split_bal_loss branch 3 times, most recently from fb3ae25 to 95e9d62 Compare April 7, 2026 07:41
@HAOCHENYE (Collaborator) commented:

@claude review

Outdated comment threads: xtuner/v1/model/compose/qwen3_vl/modeling_qwen3_vl.py (×3), xtuner/v1/loss/aux_loss.py, xtuner/v1/model/moe/moe.py (×3)

claude Bot commented Apr 13, 2026

Summary

Adds Domino EP support and layer-wise MoE balance loss computation for InternS1 Pro VL training. The layer-wise approach splits expert balance loss per-layer to reduce peak memory from accumulating all router outputs, with async CPU offloading for router tensors.
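The layer-wise split-finalize idea can be sketched as follows. This is a hypothetical illustration of the technique, not the code in layer_moe_loss.py; it uses the standard Switch-Transformer-style load-balancing term, and the class name is invented:

```python
import torch


class LayerAuxLossAccumulator:
    """Sketch: accumulate the MoE balance loss one layer at a time instead of
    stacking every layer's router outputs and computing the loss once at the
    end, so per-layer router tensors can be freed (or offloaded) immediately."""

    def __init__(self):
        self._total = None
        self._layers = 0

    def accumulate(self, router_probs, expert_mask, num_experts: int):
        # Standard load-balancing term for one layer:
        # num_experts * sum(fraction_of_tokens_per_expert * mean_router_prob).
        tokens_per_expert = expert_mask.float().mean(dim=0)  # (num_experts,)
        mean_probs = router_probs.float().mean(dim=0)        # (num_experts,)
        layer_loss = num_experts * torch.sum(tokens_per_expert * mean_probs)
        self._total = layer_loss if self._total is None else self._total + layer_loss
        self._layers += 1

    def finalize(self):
        # Average over layers; no per-layer router tensor was retained.
        return self._total / self._layers
```

Peak memory drops because only the running scalar survives between layers, at the cost of a finalize step that must run after the last MoE layer.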

Issues

Critical

  • moe.py:825 — non_pad_token=router_weights.shape[1] uses the wrong tensor when layer balancing is enabled. router_weights holds the last MoE layer's raw output (seq, num_experts), so shape[1] = num_experts, not the non-padding token count. Produces an incorrect loss in non-distributed mode. Fix: non_pad_token=int(seq_ctx.mask.sum().item()).
  • modeling_qwen3_vl.py:141-146 — _prepare_llm_inputs return type is annotated as MoEModelOutputs but actually returns torch.Tensor.

Warning

  • modeling_qwen3_vl.py:207-236 — Replaced seq_ctx.copy(...) with explicit SequenceContext(...) construction, silently dropping fields like device (defaults to "cpu"), block_table, image_grid_thw, etc. Use copy() to preserve all original fields.
  • moe.py:753 — Z-loss is silently disabled when layer_balancing_loss is enabled. If both are configured, users get no warning that z-loss is being skipped.
  • layer_moe_loss.py:220-225 — maybe_offload_tensor synchronously waits on the async D2H copy, so the async machinery provides no overlap benefit. Either defer the wait to maybe_wait_offload_tensor, or simplify to plain .cpu().
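For reference, the router z-loss that the second warning refers to is, in its standard form (as introduced in the ST-MoE paper), the mean squared log-sum-exp of each token's router logits. A minimal sketch, assuming the repo follows that standard definition:

```python
import torch


def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Standard router z-loss: penalizes large router logit magnitudes by
    squaring each token's log-sum-exp over experts and averaging.

    router_logits: (num_tokens, num_experts)
    """
    z = torch.logsumexp(router_logits.float(), dim=-1)  # (num_tokens,)
    return (z ** 2).mean()
```

Because this term is cheap and depends only on the same per-layer logits as the balance loss, it can be accumulated in the same layer-wise pass rather than silently dropped when layer balancing is enabled.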

Nit

  • modeling_qwen3_vl.py:200 — forward signature exceeds the 119-char line limit.
  • moe.py:460-464 — mask_list is a tensor, not a list; naming is inconsistent with the cat_hidden_states / cat_position_ids convention.

Verdict

REQUEST_CHANGES

@tina-wen tina-wen force-pushed the split_bal_loss branch 4 times, most recently from 290dead to 2244c01 Compare April 23, 2026 12:39
@tina-wen tina-wen changed the title [Feature] Domino EP support and training optimizations for InternS1 Pro VL [Feature] Layer-wise MoE auxiliary loss (split finalize) and optional async router D2H offload Apr 23, 2026
@tina-wen tina-wen force-pushed the split_bal_loss branch 6 times, most recently from 1968ecc to 298ebf4 Compare April 27, 2026 07:31
@HAOCHENYE (Collaborator) left a comment:


Just hold; the upcoming train engine refactor will optimize this.

Outdated comment threads: xtuner/v1/loss/aux_loss.py (×8), xtuner/v1/loss/moe_loss.py (×2)

3 participants