@ilmarkov (Contributor) commented Nov 12, 2025

This PR refactors the EPLB `rearrange_expert_weights` phase.

Instead of issuing many small p2p operations per layer, we take a group of layers and pack all of their p2p ops together. This requires additional send and receive buffers, but it reduces communication cost.
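The packing idea can be sketched as follows. This is a minimal illustration, not the actual vLLM implementation: the function names (`pack_layer_group`, `unpack_layer_group`) and shapes are hypothetical, and the real code issues the packed buffer through batched NCCL send/recv ops rather than the local copy shown here.

```python
import numpy as np

def pack_layer_group(layer_weights, expert_ids):
    """Pack the selected experts' weights from a group of layers into one
    contiguous send buffer, so one batched p2p op can replace many small
    per-layer sends (hypothetical sketch)."""
    return np.concatenate([w[expert_ids].reshape(-1) for w in layer_weights])

def unpack_layer_group(buf, layer_weights, expert_ids):
    """Scatter a received buffer back into the per-layer weight tensors."""
    offset = 0
    for w in layer_weights:
        n = len(expert_ids) * w.shape[1]
        w[expert_ids] = buf[offset:offset + n].reshape(len(expert_ids), -1)
        offset += n

# Toy usage: 3 layers, 8 experts each, hidden size 4; move experts 1 and 5.
src = [np.random.rand(8, 4) for _ in range(3)]
ids = [1, 5]
buf = pack_layer_group(src, ids)          # one buffer instead of 3 sends
dst = [np.zeros((8, 4)) for _ in range(3)]
unpack_layer_group(buf, dst, ids)
assert all(np.array_equal(d[ids], s[ids]) for d, s in zip(dst, src))
```

With per-layer transfers, each layer costs a separate kernel launch and its own latency; packing a group of layers amortizes that overhead over one larger transfer, which is where the kernel-time reduction below comes from.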

Benchmarking

Isolated `rearrange_expert_weights` microbenchmark: QwenNext 80B with 128 redundant experts on 4 H100 GPUs.

Total kernel duration for ncclSendRecv:
Before: 1.7s
After: 0.16s

This is more than a 10x reduction in communication kernel time.

Purpose

EPLB weights distribution optimization

Test Plan

This PR also refactors the tests, producing cleaner output and reporting test failures without hangs.

tests/distributed/test_eplb_execute.py

Test Result

Tests passed.

Signed-off-by: ilmarkov <[email protected]>
