[Feature] Per Expert Overlap (PEO) #492
Background
The key insight of this work is to group the experts so that the communication of some experts can overlap with the computation of other experts. We call this approach Per Expert Overlap (PEO). Compared to existing methods, PEO has the following advantages:
1. Performance:
   - Compared to Non-overlap, PEO performs better at all batch sizes.
   - Compared to PR #390 ([Feat] Single Batch Overlap (SBO): Overlaping of Down GEMM with Combine Send #390), PEO also performs better.
2. Usability:
In short, during the dispatch phase we change the order of communication (by modifying DeepEP) so that some experts receive their tokens first. During the GEMM phase we change the order of computation (by modifying how the inference engine calls DeepGEMM) so that some experts compute first. In the combine phase, we let some experts send their tokens first. Overall, this allows the communication of some experts to overlap with the computation of others.
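The resulting pipeline can be sketched with CUDA streams. This is a rough illustration only: `dispatch_round`, `expert_gemms`, and `combine_round` are hypothetical stand-ins, since the actual PR achieves the reordering inside DeepEP and via the engine's DeepGEMM calls rather than with extra streams:

```python
import torch

num_rounds = 4                        # assumed number of expert groups
tokens_per_round, hidden = 256, 1024  # illustrative sizes

# Hypothetical stand-ins for one round of dispatch recv, expert GEMMs,
# and combine send.
def dispatch_round(r):
    return torch.randn(tokens_per_round, hidden, device="cuda")

def expert_gemms(r, x):
    w = torch.randn(hidden, hidden, device="cuda")
    return x @ w                      # stands in for UP GEMM + DOWN GEMM

def combine_round(r, y):
    pass                              # stands in for the combine send

recv_stream = torch.cuda.Stream()     # dispatch traffic
comp_stream = torch.cuda.Stream()     # expert GEMMs
send_stream = torch.cuda.Stream()     # combine traffic

for r in range(num_rounds):
    recv_done, gemm_done = torch.cuda.Event(), torch.cuda.Event()
    with torch.cuda.stream(recv_stream):
        x = dispatch_round(r)              # round r's tokens arrive
        recv_done.record()
    with torch.cuda.stream(comp_stream):
        comp_stream.wait_event(recv_done)  # round r computes as soon as its
        y = expert_gemms(r, x)             # tokens land, overlapping with
        gemm_done.record()                 # round r+1's dispatch
    with torch.cuda.stream(send_stream):
        send_stream.wait_event(gemm_done)
        combine_round(r, y)                # round r's combine overlaps with
                                           # round r+1's computation
```

Streams only express ordering; for true concurrency the communication and GEMM kernels also need disjoint SM budgets, which is what the SM-count parameters listed in the Design section below control.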
Design
In the original DeepEP, each communication unit consists of `num_experts` or `num_local_experts` experts. That is, during the dispatch phase, each rank sends tokens to `num_experts` experts. During the combine phase, each rank sends tokens from `num_local_experts` experts to `num_ranks` ranks.

This solution modifies DeepEP by dividing the experts into `num_rounds` groups, so the communication is divided into `num_rounds` rounds. In each dispatch round, each rank sends tokens to `num_experts // num_rounds` experts. In each combine round, each rank sends tokens from `num_local_experts // num_rounds` local experts to `num_ranks` ranks.

The per-round partitioning is sketched below:
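This is a minimal sketch of the partitioning, assuming experts are grouped into contiguous, equally sized slices (the actual grouping inside DeepEP may differ):

```python
num_experts, num_ranks = 256, 32  # illustrative EP deployment
num_rounds = 4

num_local_experts = num_experts // num_ranks       # experts hosted per rank
experts_per_round = num_experts // num_rounds      # dispatch granularity
local_per_round = num_local_experts // num_rounds  # combine granularity

for r in range(num_rounds):
    # Dispatch round r targets this slice of global experts ...
    d_lo, d_hi = r * experts_per_round, (r + 1) * experts_per_round
    # ... and combine round r sends from this slice of local experts.
    c_lo, c_hi = r * local_per_round, (r + 1) * local_per_round
    print(f"round {r}: dispatch experts [{d_lo}, {d_hi}), "
          f"combine local experts [{c_lo}, {c_hi})")
```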
Because model parameters, deployment scale, and batch size vary across scenarios, this solution exposes the following tunable parameters so that the best overlap can be achieved in each case:
Parameters for Overlap:
- Overlap method
- `num_rounds`: Number of rounds for splitting dispatch/combine.
- `deepep_send_num_sms`: Number of SMs used for dispatch/combine send.
- `deepep_recv_num_sms`: Number of SMs used for dispatch/combine recv.
- `up_deepgemm_num_sms`: Number of SMs used for the UP GEMM.
- `down_deepgemm_num_sms`: Number of SMs used for the DOWN GEMM.
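For illustration, a configuration could look like the following. The parameter names mirror the list above, but the values and the exact integration point (engine config, CLI flags, etc.) are assumptions, not part of this PR:

```python
# Illustrative values for a 132-SM GPU (e.g., H800); tune per scenario.
peo_overlap_config = {
    "num_rounds": 4,              # more rounds -> finer-grained overlap,
                                  # but smaller, less efficient GEMMs
    "deepep_send_num_sms": 8,     # SMs for dispatch/combine send kernels
    "deepep_recv_num_sms": 8,     # SMs for dispatch/combine recv kernels
    "up_deepgemm_num_sms": 58,    # SMs for the UP GEMM
    "down_deepgemm_num_sms": 58,  # SMs for the DOWN GEMM
}
```

The trade-off is between overlap granularity and raw GEMM efficiency: more rounds and more communication SMs hide more latency but shrink each GEMM, so the best split depends on the model, deployment scale, and batch size, as noted above.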
Performance
Configuration:
Comparison Methods:
Conclusion:
For both DPSK and QWEN, PEO delivers the best overlap performance at almost all batch sizes. For DPSK, PEO achieves a maximum improvement of 31% at batch size 128; for QWEN, a maximum improvement of 50% at batch size 16.

