
fix: correct LoRA initialization and forward pass under tensor parallelism #150

Open

chen2021673 wants to merge 2 commits into master from lora_ddp_loss

Conversation

Contributor

chen2021673 commented Apr 30, 2026

Summary

  1. Broadcast lora_A init from TP rank 0
    LoRAColumnParallelLinear::lora_A is replicated across TP ranks but was independently random-initialized on each rank. This relied on every rank having an identical RNG state, a fragile implicit assumption flagged as broken in init.cc. The fix: only TP rank 0 generates random values, all other ranks zero-initialize, and an AllReduce(sum) then propagates rank 0's values to the whole group. This is a no-op when TP size == 1 (a minimal sketch of the pattern follows this list).

  2. Fuse base+LoRA matmuls before collective
    The previous forward pass ran the base module and the LoRA branch independently, each triggering its own collective (AllGather for ColumnParallel, AllReduce for RowParallel). Running two separate collectives on floating-point results caused loss divergence, because floating-point addition is not associative across the reduction boundary. The fix performs both matmuls locally, adds the partial results, and then issues a single collective op, matching the approach used in the base parallel linear layers (see the forward-pass sketch after the commit message below).
    Also includes a LoadLoRAWeights fix to correctly slice sharded tensors (e.g. lora_B in ColumnParallel) when loading checkpoints under TP; a slicing sketch also appears below.
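
A minimal sketch of the rank-0 broadcast pattern from item 1, using MPI as a stand-in for the project's TP collective layer (the real code presumably goes through the framework's own communicator, not MPI directly); the shapes, seed, and variable names are illustrative, not the ones in init.cc.

```cpp
// Replicate lora_A across the TP group without relying on identical RNG state.
#include <mpi.h>
#include <cstdio>
#include <random>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int tp_rank = 0, tp_size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &tp_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &tp_size);

  const int lora_rank = 8, in_features = 16;           // illustrative shapes
  std::vector<float> lora_a(lora_rank * in_features, 0.0f);

  // Only TP rank 0 draws random values; every other rank keeps zeros.
  if (tp_rank == 0) {
    std::mt19937 gen(42);
    std::normal_distribution<float> dist(0.0f, 0.02f);
    for (auto& v : lora_a) v = dist(gen);
  }

  // AllReduce(sum) over {rank 0's values, zeros, zeros, ...} copies rank 0's
  // tensor to every rank in the group. When tp_size == 1 this is skipped.
  if (tp_size > 1) {
    MPI_Allreduce(MPI_IN_PLACE, lora_a.data(), static_cast<int>(lora_a.size()),
                  MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
  }

  std::printf("rank %d: lora_a[0] = %f\n", tp_rank, lora_a[0]);
  MPI_Finalize();
  return 0;
}
```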

…ence

Inline base and LoRA matmuls, add locally, then issue a single
AllGather/AllReduce instead of two separate collective ops. The prior
two-collective approach caused floating-point divergence in DDP loss.

Also fix LoadLoRAWeights to slice sharded tensors by tp_rank when the
checkpoint shape differs from the partitioned model shape.
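
Below is a minimal sketch of the fused forward for the row-parallel case (AllReduce), where reduction order matters most; the matvec helper, shard layout, and placeholder values are assumptions for illustration, and MPI again stands in for the project's TP collectives.

```cpp
// Fused forward for a RowParallel linear with a LoRA branch: both partial
// matmuls are summed locally, then a single AllReduce produces the output.
#include <mpi.h>
#include <cstdio>
#include <vector>

// y[m] += A[m x k] * x[k]  (naive dense matvec, for illustration only)
static void matvec(const std::vector<float>& A, const std::vector<float>& x,
                   std::vector<float>& y, int m, int k) {
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < k; ++j)
      y[i] += A[i * k + j] * x[j];
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int tp_rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &tp_rank);

  const int out_features = 4, in_per_rank = 8, lora_rank = 2;

  // Per-rank shards: W and lora_A are split along input features,
  // lora_B is replicated. Contents are placeholders.
  std::vector<float> w_shard(out_features * in_per_rank, 0.01f * (tp_rank + 1));
  std::vector<float> lora_a_shard(lora_rank * in_per_rank, 0.02f);
  std::vector<float> lora_b(out_features * lora_rank, 0.03f);
  std::vector<float> x_shard(in_per_rank, 1.0f);        // this rank's input slice

  // Base branch: partial y = W_shard * x_shard.
  std::vector<float> y_partial(out_features, 0.0f);
  matvec(w_shard, x_shard, y_partial, out_features, in_per_rank);

  // LoRA branch: partial y += lora_B * (lora_A_shard * x_shard),
  // added locally BEFORE any collective.
  std::vector<float> lora_mid(lora_rank, 0.0f);
  matvec(lora_a_shard, x_shard, lora_mid, lora_rank, in_per_rank);
  matvec(lora_b, lora_mid, y_partial, out_features, lora_rank);

  // One AllReduce over the fused partial instead of one per branch, so the
  // floating-point reduction order matches the base layer's single collective.
  MPI_Allreduce(MPI_IN_PLACE, y_partial.data(), out_features,
                MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

  if (tp_rank == 0) std::printf("y[0] = %f\n", y_partial[0]);
  MPI_Finalize();
  return 0;
}
```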
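
And a sketch of the checkpoint-slicing idea behind the LoadLoRAWeights fix: when a full (unsharded) tensor is loaded but the parameter is partitioned, take this rank's slice along the sharded dimension. The helper name, row-major layout, and dim-0 sharding are assumptions, not the actual LoadLoRAWeights interface.

```cpp
// Slice a full checkpoint tensor down to this TP rank's shard along dim 0
// (e.g. lora_B's output dimension in a ColumnParallel layer).
#include <stdexcept>
#include <vector>

std::vector<float> shard_dim0(const std::vector<float>& full, int full_rows,
                              int cols, int partitioned_rows, int tp_rank) {
  // Checkpoint already matches the partitioned shape: nothing to slice.
  if (full_rows == partitioned_rows) return full;
  if (full_rows % partitioned_rows != 0)
    throw std::runtime_error("checkpoint shape incompatible with TP partitioning");

  // Copy rows [tp_rank * partitioned_rows, (tp_rank + 1) * partitioned_rows).
  std::vector<float> shard(static_cast<size_t>(partitioned_rows) * cols);
  const int row_offset = tp_rank * partitioned_rows;
  for (int r = 0; r < partitioned_rows; ++r)
    for (int c = 0; c < cols; ++c)
      shard[static_cast<size_t>(r) * cols + c] =
          full[static_cast<size_t>(row_offset + r) * cols + c];
  return shard;
}
```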