docs/advanced_features/expert_parallelism.md (5 additions & 5 deletions)
@@ -1,4 +1,4 @@
-# Expert Parallelism in SGLang
+# Expert Parallelism
 
 Expert Parallelism (EP) in SGLang distributes expert weights across multiple devices in Mixture-of-Experts (MoE) models, addressing memory bottlenecks and enabling efficient scaling for high-performance inference. It is particularly vital for serving large-scale MoE models where tokens are dynamically routed to specialized experts across GPUs. By leveraging optimized all-to-all communication and grouped matrix multiplications (GEMMs), EP reduces latency, boosts throughput, and minimizes idle GPU time. SGLang's EP offers strong extensibility through its modular framework, allowing seamless integration of custom kernels, backends, and optimizations without refactoring core logic, supporting diverse hardware and quantization schemes.
 
@@ -25,13 +25,13 @@ Currently, DeepEP and Mooncake only support cases where `ep_size = tp_size`. For
 |**`auto` (default)**| Automatically selects the optimal backend based on model architecture, hardware (e.g., NVIDIA architecture like Ampere, Hopper, Blackwell), quantization scheme (e.g., FP8, FP4), and runtime conditions. | General-purpose deployments; ensures compatibility and performance without user intervention. |
-|`triton`| Triton-based implementation for grouped GEMMs, providing flexible kernel fusion and custom optimizations. | Custom kernel development or scenarios requiring high extensibility with Torch compilation support. |
+|`triton`| Triton-based implementation for grouped GEMMs. To achieve higher performance, it's highly recommended to create [tuned configurations](https://github.com/sgl-project/sglang/blob/main/benchmark/kernels/fused_moe_triton/README.md). | Custom kernel development or scenarios requiring high extensibility with Torch compilation support. |
 |`deep_gemm`| DeepGEMM backend optimized for MoE matrix multiplications, supporting contiguous layouts for prefill and masked layouts for decode; often JIT-compiled for performance. | Large-scale EP deployments with FP8 block-wise quantization. |
 |`cutlass`| CUTLASS-based backend for efficient GEMMs. | NVIDIA architectures with CUTLASS support. |
-|`flashinfer_trtllm`| FlashInfer integrated with TensorRT-LLM for accelerated MoE computations, supporting FP4 communication operators and high-performance GEMMs. | NVIDIA architectures with TRT-LLM. |
-|`flashinfer_cutlass`| FlashInfer combined with CUTLASS for high-performance grouped GEMMs in MoE layers, handling FP4/FP8 quantization efficiently. | Optimized for Blackwell (e.g., B200) and FP4/FP8 models. |
+|`flashinfer_trtllm`| FlashInfer integrated with TensorRT-LLM for accelerated MoE computations, supporting FP4 communication operators and high-performance GEMMs. | SM100+ with TRT-LLM. |
+|`flashinfer_cutlass`| FlashInfer combined with CUTLASS for high-performance grouped GEMMs in MoE layers, handling FP4/FP8 quantization efficiently. | SM100+ with FP4/FP8 models. |
 |`flashinfer_mxfp4`| FlashInfer variant optimized for MXFP4 (mixed FP4) quantization in MoE runners, focusing on memory-efficient low-precision inference. | Low-precision models with MXFP4. |
-|`flashinfer_cutedsl`| FlashInfer with a custom DSL for flexible and efficient MoE kernel generation, integrated with modelopt quantization. | Low-precision models with NVFP4. |
+|`flashinfer_cutedsl`| FlashInfer with a custom DSL for flexible and efficient MoE kernel generation, integrated with ModelOpt FP4 quantization. | Low-precision models with NVFP4. |
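
For anyone wanting to try the updated backend table locally, below is a minimal launch sketch. The `--moe-runner-backend`, `--ep-size`, and `--tp-size` flag names and the example model path are assumptions based on SGLang's server arguments and this page, not part of the diff; any backend value from the table above could be substituted for `triton`.

```bash
# Minimal sketch, not a definitive command: serve a MoE model with expert
# parallelism and an explicitly chosen MoE runner backend instead of `auto`.
# Flag names and the model path are assumptions; adjust to your deployment.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 8 \
  --ep-size 8 \
  --moe-runner-backend triton
```

If `triton` is the chosen backend, generating the tuned configurations linked in the table first is worthwhile, since the default Triton configs can leave grouped-GEMM performance untapped.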