
Conversation


@k-ling3 commented on Nov 13, 2025

Background

The insight of this work is that we group the experts, allowing the communication of some experts to overlap with the computation of other experts. We call this approach Per Expert Overlap (PEO). Compared to existing methods, our approach has the following advantages:

1. Performance:

  • Compared to Non-overlap, PEO performs better at all batch sizes.

    • For the DPSK model:
      • At batch size 4, PEO achieves an 11% improvement.
      • At batch size 128, PEO achieves a 31% improvement. The larger the batch size, the more significant the gain.
    • For the QWEN model, PEO achieves up to a 51% improvement.
  • Compared to PR 390 ([Feat] Single Batch Overlap (SBO): Overlaping of Down GEMM with Combine Send #390), PEO also performs better.

2. Usability

  • Compared to PR 390, PEO only modifies DeepEP and does not change DeepGEMM, making it easier to use.

In short, during the dispatch phase, we change the order of communication (by modifying DeepEP) to allow some experts to receive tokens first. During the GEMM phase, we change the order of computation (by modifying how the inference engine calls DeepGEMM) to allow some experts to compute first. In the combine phase, we let some experts send tokens first. Overall, this allows the communication of some experts to overlap with the computation of others.
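
As a rough sketch of this ordering (roughly the "overlap-1" style described in the Design section below), the per-group pipeline might look like the following. Every function name here is a hypothetical placeholder standing in for the modified DeepEP / DeepGEMM calls, not their actual APIs:

```python
# Sketch of Per Expert Overlap (PEO). All functions below are hypothetical
# placeholders for the modified DeepEP / inference-engine calls; this only
# illustrates the ordering, not DeepEP's real interface.

def moe_layer_with_peo(tokens, expert_groups):
    recv_chunks, outputs = [], []

    # Dispatch phase: issue sends group by group, so the first groups can
    # start receiving while later groups are still being sent.
    for g, experts in enumerate(expert_groups):
        dispatch_send(tokens, experts, group_idx=g)

    for g, experts in enumerate(expert_groups):
        # Receive tokens for group g; its GEMM overlaps with the remaining
        # groups' communication.
        recv_chunks.append(dispatch_recv(experts, group_idx=g))
        outputs.append(expert_gemm(experts, recv_chunks[g]))

        # Combine phase: as soon as group g finishes its GEMM, send its
        # results back, overlapping with the computation of group g + 1.
        combine_send(outputs[g], group_idx=g)

    # Gather the combined results from all groups.
    return combine_recv(expert_groups)
```

In practice the exact send/recv/GEMM ordering is one of the tunable "overlap methods" listed under Design.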


Design

In the original DeepEP, each communication unit consists of num_experts or num_local_experts experts. That is, during the dispatch phase, each rank sends tokens to num_experts experts. During the combine phase, each rank sends tokens from num_local_experts experts to num_ranks ranks.

This solution modifies DeepEP to divide the experts into num_rounds groups, so that the communication is split into num_rounds rounds; a small arithmetic sketch follows the list below.

  • During the dispatch phase, in each round, each rank sends tokens to num_experts // num_rounds experts.
  • During the combine phase, in each round, each rank sends tokens from num_local_experts // num_rounds local experts to num_ranks ranks.
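
For concreteness, a minimal runnable example of the round-splitting arithmetic (the expert and rank counts are illustrative values, not taken from this PR):

```python
# Round-splitting arithmetic for PEO (all counts are illustrative).
num_experts = 256      # total routed experts in the model
num_ranks = 16         # e.g. an EP16 deployment
num_rounds = 4         # tunable: number of expert groups / communication rounds

num_local_experts = num_experts // num_ranks           # 16 experts per rank

# Dispatch: per round, each rank sends tokens to this many experts.
dispatch_experts_per_round = num_experts // num_rounds            # 64

# Combine: per round, each rank sends tokens from this many local experts
# back to all num_ranks ranks.
combine_experts_per_round = num_local_experts // num_rounds       # 4

print(dispatch_experts_per_round, combine_experts_per_round)
```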

The process is shown in the diagram below (per-round dispatch → GEMM → combine pipeline).

Because model parameters, deployment scale, and batch size differ, this solution exposes the following tunable parameters so that the best overlap can be achieved in each scenario (a hypothetical configuration sketch follows the list):

Parameters for Overlap:

  • Overlap method

    • We tested several overlap methods and found that their effects differ. Consider the following options:
      • overlap-1: Complete all dispatch sends first, then perform dispatch recv and GEMM.
        overlap-1 diagram
      • overlap-2: Immediately after each dispatch send, perform recv + GEMM.
        overlap-2 diagram
      • overlap-3: Immediately after each dispatch send, perform recv + GEMM, and additionally allow DeepEP's send and recv to overlap with each other.
        overlap-3 diagram
      • overlap-4: No overlap between dispatch and GEMM.
        overlap-4 diagram
  • num_rounds: Number of rounds for splitting dispatch/combine.

  • deepep_send_num_sms: Number of SMs used for dispatch/combine send.

  • deepep_recv_num_sms: Number of SMs used for dispatch/combine recv.

  • up_deepgemm_num_sms: Number of SMs used for UP GEMM.

  • down_deepgemm_num_sms: Number of SMs used for DOWN GEMM.
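
As a rough illustration of how these knobs could be grouped in the calling engine, here is a hypothetical configuration container with a simple SM-budget sanity check. The field names mirror the parameters above, but the class, its default values, and the check are assumptions for illustration, not DeepEP's actual interface:

```python
from dataclasses import dataclass

@dataclass
class PEOOverlapConfig:
    """Hypothetical container for the tunable PEO parameters.

    Field names mirror the knobs listed above, but this class (and its
    sanity check) is illustrative, not DeepEP's real configuration API.
    """
    overlap_method: int = 2          # 1..4, matching overlap-1 .. overlap-4 above
    num_rounds: int = 4              # number of expert groups / communication rounds
    deepep_send_num_sms: int = 16    # SMs reserved for dispatch/combine send
    deepep_recv_num_sms: int = 16    # SMs reserved for dispatch/combine recv
    up_deepgemm_num_sms: int = 32    # SMs for the UP GEMM
    down_deepgemm_num_sms: int = 32  # SMs for the DOWN GEMM

    def check(self, total_sms: int) -> None:
        # Assumed constraint: overlapping communication and GEMM kernels
        # occupy disjoint SM partitions, so together they should fit into
        # the GPU's SM budget.
        comm = self.deepep_send_num_sms + self.deepep_recv_num_sms
        gemm = max(self.up_deepgemm_num_sms, self.down_deepgemm_num_sms)
        if comm + gemm > total_sms:
            raise ValueError(
                f"SM budget exceeded: {comm} (comm) + {gemm} (GEMM) > {total_sms}"
            )

# Example: per-deployment tuning; the numbers are placeholders.
cfg = PEOOverlapConfig(num_rounds=2)
cfg.check(total_sms=78)  # adjust to the actual device's SM count
```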


Performance

Configuration:

  • H20, EP16, QWEN, DPSK

Comparison Methods: the non-overlap baseline, SBO (PR #390), and PEO.

Conclusion:

For both DPSK and QWEN, PEO performs best at almost all batch sizes. For DPSK, PEO achieves a maximum improvement of 31% at batch size 128. For QWEN, PEO achieves a maximum improvement of 50% at batch size 16.

@rubbberrabbit

Hi, to make use of the overlap parameters, should we modify the sglang forward pass and launch GEMM multiple times (to compute the different expert groups)?
