[EPLB] Weight rearrangement optimization #28562
In this PR we refactor the EPLB `rearrange_expert_weights` phase. Instead of issuing many small p2p operations per layer, we take a group of layers and pack all of their p2p ops together. This requires additional send and receive buffers, but it reduces communication costs.
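The packing idea can be sketched as follows. This is a minimal illustration of the technique, not the actual vLLM implementation: the helper names (`pack_layers`, `unpack_layers`) and the flat-buffer layout are assumptions, and NumPy stands in for the real device tensors and p2p sends.

```python
import numpy as np

def pack_layers(weights_per_layer):
    """Pack per-layer expert weight slices into one contiguous send buffer.

    Returns the flat buffer plus the original shapes needed to unpack it.
    """
    buf = np.concatenate([w.ravel() for w in weights_per_layer])
    shapes = [w.shape for w in weights_per_layer]
    return buf, shapes

def unpack_layers(buf, shapes):
    """Split a received flat buffer back into per-layer weight tensors."""
    out, offset = [], 0
    for shape in shapes:
        n = int(np.prod(shape))
        out.append(buf[offset:offset + n].reshape(shape))
        offset += n
    return out

# One simulated "send" covering a group of 3 layers,
# instead of 3 separate per-layer sends.
layers = [np.random.rand(4, 8).astype(np.float32) for _ in range(3)]
buf, shapes = pack_layers(layers)
received = unpack_layers(buf, shapes)  # what the peer would reconstruct
assert all(np.array_equal(a, b) for a, b in zip(layers, received))
```

In the real distributed setting, the packed buffer would be transferred with a single send/recv pair per peer rather than one pair per layer, which is where the kernel-count reduction comes from.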
Benchmarking

Isolated `rearrange_expert_weights` microbenchmark: QwenNext 80B with 128 redundant experts on 4 H100 GPUs.

Total kernel duration for `ncclSendRecv`:
- Before: 1.7 s
- After: 0.16 s

This is a more than 10x reduction in communication kernel time.
Purpose
EPLB weights distribution optimization
Test Plan

This PR also refactors the tests, giving nicer output and making test failures surface without hangs.

tests/distributed/test_eplb_execute.py

Test Result

Tests passed.