fix: correct LoRA initialization and forward pass under tensor parallelism #150
Open
chen2021673 wants to merge 2 commits into master from
Conversation
…ence Inline base and LoRA matmuls, add locally, then issue a single AllGather/AllReduce instead of two separate collective ops. The prior two-collective approach caused floating-point divergence in DDP loss. Also fix LoadLoRAWeights to slice sharded tensors by tp_rank when the checkpoint shape differs from the partitioned model shape.
Summary
Broadcast lora_A init from TP rank 0
LoRAColumnParallelLinear::lora_A is replicated across TP ranks but was independently random-initialized on each rank. That only yields identical weights if every rank happens to share the same RNG state, a fragile implicit assumption already flagged as broken in init.cc. The fix: only TP rank 0 generates random values, all other ranks zero-initialize, and an AllReduce(sum) then propagates rank 0's values to the whole group. This is a no-op when TP size == 1.
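A minimal sketch of the pattern, with MPI standing in for the project's TP communicator — the buffer name, sizes, seed, and the use of MPI here are illustrative assumptions, not the PR's actual code:

```cpp
#include <mpi.h>

#include <cstdio>
#include <random>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int tp_rank = 0, tp_size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &tp_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &tp_size);

  const int kNumel = 8;                     // stand-in for r * in_features
  std::vector<float> lora_a(kNumel, 0.0f);  // non-zero ranks stay all-zero

  if (tp_rank == 0) {
    // Only TP rank 0 draws random values (Kaiming-uniform in typical LoRA code).
    std::mt19937 gen(42);
    std::uniform_real_distribution<float> dist(-0.1f, 0.1f);
    for (float& v : lora_a) v = dist(gen);
  }

  if (tp_size > 1) {
    // rank0_values + 0 + ... + 0 == rank0_values, so AllReduce(sum) over
    // zero-initialized buffers replicates rank 0's init to the whole group.
    MPI_Allreduce(MPI_IN_PLACE, lora_a.data(), kNumel, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);
  }

  std::printf("rank %d lora_A[0] = %f\n", tp_rank, lora_a[0]);
  MPI_Finalize();
  return 0;
}
```

AllReduce(sum) over zero-initialized buffers is mathematically a broadcast from rank 0, which presumably lets the fix reuse an existing collective path rather than adding a separate Broadcast primitive.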
Fuse base+LoRA matmuls before collective
The previous forward pass ran the base module and the LoRA branch independently, each triggering its own collective (AllGather for Column, AllReduce for Row). Summing the results of two separate collectives imposes a different addition order than summing locally first, and because floating-point addition is not associative, the outputs diverged enough to shift the DDP loss. The fix inlines both matmuls locally, adds their results before the collective, and then issues a single collective op, matching the approach used in the base parallel linear layers.
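To make the ordering concrete, here is a sketch of the RowParallel case (single AllReduce after local accumulation); the helper name, shapes, and plain-loop matmul are assumptions made for the sketch, not the repository's API:

```cpp
#include <mpi.h>

#include <vector>

// y[m x n] += x[m x k] * w[k x n], row-major dense.
static void matmul_add(const std::vector<float>& x, const std::vector<float>& w,
                       std::vector<float>& y, int m, int k, int n) {
  for (int i = 0; i < m; ++i)
    for (int p = 0; p < k; ++p)
      for (int j = 0; j < n; ++j) y[i * n + j] += x[i * k + p] * w[p * n + j];
}

// RowParallel: the input and base weight are sharded along the contraction
// dim (k), as is lora_A; lora_B [r x n] is replicated. Each rank computes its
// full partial output, then ONE AllReduce sums across the TP group.
std::vector<float> row_parallel_lora_forward(
    const std::vector<float>& x_local,  // [m x k_local]
    const std::vector<float>& w_local,  // [k_local x n]
    const std::vector<float>& a_local,  // [k_local x r]
    const std::vector<float>& b,        // [r x n]
    int m, int k_local, int n, int r, MPI_Comm tp_comm) {
  std::vector<float> y(m * n, 0.0f);
  matmul_add(x_local, w_local, y, m, k_local, n);  // base partial: x_l @ W_l

  std::vector<float> h(m * r, 0.0f);
  matmul_add(x_local, a_local, h, m, k_local, r);  // LoRA down-proj partial
  matmul_add(h, b, y, m, r, n);                    // up-proj, fused into y

  // Single collective over the locally summed result, instead of one per branch.
  MPI_Allreduce(MPI_IN_PLACE, y.data(), m * n, MPI_FLOAT, MPI_SUM, tp_comm);
  return y;
}
```

Fusing is valid because reduction is linear, sum_i(x_i W_i + x_i A_i B) = x W + x A B, and a single reduction fixes one summation order for both branches instead of two independent ones.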
Also includes a LoadLoRAWeights fix to correctly slice sharded tensors (e.g. lora_B in ColumnParallel) by tp_rank when the checkpoint stores full-size tensors but the model parameter is partitioned under TP.
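A hypothetical illustration of that slicing logic, shown for a row-major lora_B [out_features x r] sharded along out_features; the function name, layout, and checks are assumptions for the sketch:

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

void load_lora_b_shard(const std::vector<float>& ckpt,  // [out_full x r]
                       std::vector<float>& param,  // [out_local x r], pre-sized
                       int out_full, int out_local, int r, int tp_rank) {
  if (out_full == out_local) {  // checkpoint already matches the model shape
    param = ckpt;
    return;
  }
  if (out_full % out_local != 0)
    throw std::runtime_error("checkpoint shape not divisible by TP partition");
  // Rows [tp_rank*out_local, (tp_rank+1)*out_local) belong to this rank, and
  // they are contiguous in row-major layout, so a single memcpy suffices.
  const std::size_t offset =
      static_cast<std::size_t>(tp_rank) * out_local * r;
  std::memcpy(param.data(), ckpt.data() + offset,
              sizeof(float) * static_cast<std::size_t>(out_local) * r);
}
```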