
[Feature] Layer-wise MoE auxiliary loss (split finalize) and optional async router D2H offload#1528

Open
tina-wen wants to merge 2 commits into InternLM:main from tina-wen:split_bal_loss

Conversation

@tina-wen (Contributor) commented Mar 3, 2026

Summary

Layer-wise split finalize for MoE auxiliary losses and optional async router D2H offload.

What changed

Implements layer-wise accumulation + finalize for the balancing and z-losses in layer_moe_loss.py.
Adds lazy async D2H offload for router tensors in router_offload.py and integrates it in moe.py.

Benefits

Reduces peak GPU memory by offloading router logits/weights to pinned CPU while keeping tensor semantics.
Async offload is lazy (waits only when CPU data is actually needed), enabling overlap of D2H copy and GPU work to reduce wall-clock overhead.
Enabling router_async_offload=True lowers memory at the cost of possible host-device transfer latency.
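The lazy-wait pattern described above can be sketched as follows. This is a minimal illustration of the technique, not the PR's actual router_offload.py code; the helper names `maybe_offload` / `wait_offload` are hypothetical:

```python
import torch


def maybe_offload(t: torch.Tensor):
    """Start an async device-to-host copy of `t`; return (host_buffer, event).

    The copy is only enqueued, not awaited, so the GPU can keep working
    while the D2H transfer runs in the background.
    """
    if not t.is_cuda:  # nothing to offload on CPU-only runs
        return t, None
    # Pinned (page-locked) host memory is required for a truly async D2H copy.
    host = torch.empty(t.shape, dtype=t.dtype, pin_memory=True)
    host.copy_(t, non_blocking=True)
    event = torch.cuda.Event()
    event.record()  # marks the copy's completion point on the current stream
    return host, event


def wait_offload(host: torch.Tensor, event) -> torch.Tensor:
    """Block only when the CPU data is actually needed (lazy wait)."""
    if event is not None:
        event.synchronize()
    return host
```

The key design point is that `event.synchronize()` is deferred to the consumer: as long as the loss finalize step runs after enough GPU work has been enqueued, the transfer latency is hidden.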

Comment thread xtuner/v1/model/moe/moe.py Outdated
@tina-wen tina-wen force-pushed the split_bal_loss branch 3 times, most recently from fb3ae25 to 95e9d62 Compare April 7, 2026 07:41
@HAOCHENYE (Collaborator) commented:

@claude review

Outdated comment threads: xtuner/v1/model/compose/qwen3_vl/modeling_qwen3_vl.py (×3), xtuner/v1/loss/aux_loss.py, xtuner/v1/model/moe/moe.py (×3)

claude Bot commented Apr 13, 2026

Summary

Adds Domino EP support and layer-wise MoE balance loss computation for InternS1 Pro VL training. The layer-wise approach splits expert balance loss per-layer to reduce peak memory from accumulating all router outputs, with async CPU offloading for router tensors.
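The layer-wise split-finalize idea can be sketched as follows. This is a hypothetical illustration of the technique, not the code in layer_moe_loss.py; it uses the standard Switch-Transformer-style load-balancing term, and the class name is invented:

```python
import torch


class LayerAuxLossAccumulator:
    """Sketch: accumulate the MoE balance loss one layer at a time instead of
    stacking every layer's router outputs and computing the loss once at the
    end, so per-layer router tensors can be freed (or offloaded) immediately."""

    def __init__(self):
        self._total = None
        self._layers = 0

    def accumulate(self, router_probs, expert_mask, num_experts: int):
        # Standard load-balancing term for one layer:
        # num_experts * sum(fraction_of_tokens_per_expert * mean_router_prob).
        tokens_per_expert = expert_mask.float().mean(dim=0)  # (num_experts,)
        mean_probs = router_probs.float().mean(dim=0)        # (num_experts,)
        layer_loss = num_experts * torch.sum(tokens_per_expert * mean_probs)
        self._total = layer_loss if self._total is None else self._total + layer_loss
        self._layers += 1

    def finalize(self):
        # Average over layers; no per-layer router tensor was retained.
        return self._total / self._layers
```

Peak memory drops because only the running scalar survives between layers, at the cost of a finalize step that must run after the last MoE layer.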

Issues

Critical

  • moe.py:825 — non_pad_token=router_weights.shape[1] uses the wrong tensor when layer balancing is enabled. router_weights holds the last MoE layer's raw output (seq, num_experts), so shape[1] = num_experts, not the non-padding token count. Produces an incorrect loss in non-distributed mode. Fix: non_pad_token=int(seq_ctx.mask.sum().item()).
  • modeling_qwen3_vl.py:141-146 — _prepare_llm_inputs return type is annotated as MoEModelOutputs but actually returns torch.Tensor.

Warning

  • modeling_qwen3_vl.py:207-236 — Replaced seq_ctx.copy(...) with explicit SequenceContext(...) construction, silently dropping fields like device (defaults to "cpu"), block_table, image_grid_thw, etc. Use copy() to preserve all original fields.
  • moe.py:753 — Z-loss is silently disabled when layer_balancing_loss is enabled. If both are configured, users get no warning that z-loss is being skipped.
  • layer_moe_loss.py:220-225 — maybe_offload_tensor synchronously waits on the async D2H copy, so the async machinery provides no overlap benefit. Either defer the wait to maybe_wait_offload_tensor, or simplify to plain .cpu().
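For reference, the router z-loss that the second warning refers to is, in its standard form (as introduced in the ST-MoE paper), the mean squared log-sum-exp of each token's router logits. A minimal sketch, assuming the repo follows that standard definition:

```python
import torch


def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Standard router z-loss: penalizes large router logit magnitudes by
    squaring each token's log-sum-exp over experts and averaging.

    router_logits: (num_tokens, num_experts)
    """
    z = torch.logsumexp(router_logits.float(), dim=-1)  # (num_tokens,)
    return (z ** 2).mean()
```

Because this term is cheap and depends only on the same per-layer logits as the balance loss, it can be accumulated in the same layer-wise pass rather than silently dropped when layer balancing is enabled.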

Nit

  • modeling_qwen3_vl.py:200 — forward signature exceeds the 119-char line limit.
  • moe.py:460-464 — mask_list is a tensor, not a list; naming is inconsistent with the cat_hidden_states / cat_position_ids convention.

Verdict

REQUEST_CHANGES

@tina-wen tina-wen force-pushed the split_bal_loss branch 4 times, most recently from 290dead to 2244c01 Compare April 23, 2026 12:39
@tina-wen tina-wen changed the title [Feature] Domino EP support and training optimizations for InternS1 Pro VL [Feature] Layer-wise MoE auxiliary loss (split finalize) and optional async router D2H offload Apr 23, 2026
@tina-wen tina-wen force-pushed the split_bal_loss branch 6 times, most recently from 1968ecc to 298ebf4 Compare April 27, 2026 07:31
@HAOCHENYE (Collaborator) left a comment:


Just hold; the upcoming train engine refactor will optimize this.

Outdated comment threads: xtuner/v1/loss/aux_loss.py (×8), xtuner/v1/loss/moe_loss.py (×2)

3 participants