
Commit e642010

sync attention doc and ep doc to doctree (#14257)

Co-authored-by: Brayden Zhong <[email protected]>
1 parent: c9e2090

5 files changed: +10 -9 lines changed

docs/advanced_features/attention_backend.md
Lines changed: 2 additions & 2 deletions

@@ -17,7 +17,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
 |---------------------------------|-----------------------------|------------------|-----------------|-----------------|--------------------|----------------|
 | **FlashInfer** |||||||
 | **FA3 (FlashAttention 3)** |||||||
-| **FA4 (FlashAttention 4)** | 128 ||||||
+| **FA4 (FlashAttention 4)** | ||||||
 | **Triton** |||||||
 | **Torch Native (SDPA)** |||||||
 | **FlexAttention (PyTorch)** |||||||
@@ -33,7 +33,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
 | **Backend** | **Native Page Sizes** | **FP8 KV Cache** | **Chunked Prefix Cache** | **Spec topk=1** | **Spec topk>1** |
 |----------------------------|---------------------------|------------------|--------------------------|-----------------|-----------------|
 | **FlashInfer MLA** | 1 |||||
-| **FlashMLA** | 64 | ||||
+| **FlashMLA** | 64 | ||||
 | **Cutlass MLA** | 128 |||||
 | **TRTLLM MLA (Blackwell)** | 32 or 64 |||||
 | **FA3 (FlashAttention 3)** | n/a |||| ⚠️ (page_size=1 only) |
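For orientation, here is a minimal launch sketch that selects one of the MHA backends from the matrix above. The `--attention-backend` flag name, its `fa3` value, and the model path are assumptions for illustration, not taken from this diff:

```bash
# A minimal sketch, assuming the standard SGLang launch entry point and an
# --attention-backend flag; the model path is an example placeholder.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend fa3
```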

docs/advanced_features/expert_parallelism.md
Lines changed: 5 additions & 5 deletions

@@ -1,4 +1,4 @@
-# Expert Parallelism in SGLang
+# Expert Parallelism

 Expert Parallelism (EP) in SGLang distributes expert weights across multiple devices in Mixture-of-Experts (MoE) models, addressing memory bottlenecks and enabling efficient scaling for high-performance inference. It is particularly vital for serving large-scale MoE models where tokens are dynamically routed to specialized experts across GPUs. By leveraging optimized all-to-all communication and grouped matrix multiplications (GEMMs), EP reduces latency, boosts throughput, and minimizes idle GPU time. SGLang's EP offers strong extensibility through its modular framework, allowing seamless integration of custom kernels, backends, and optimizations without refactoring core logic, supporting diverse hardware and quantization schemes.

@@ -25,13 +25,13 @@ Currently, DeepEP and Mooncake only support cases where `ep_size = tp_size`. For
 | Backend | Description | Use Cases |
 |--------------------------|-----------------------------------------------------------------------------|------------------------------------|
 | **`auto` (default)** | Automatically selects the optimal backend based on model architecture, hardware (e.g., NVIDIA architecture like Ampere, Hopper, Blackwell), quantization scheme (e.g., FP8, FP4), and runtime conditions. | General-purpose deployments; ensures compatibility and performance without user intervention. |
-| `triton` | Triton-based implementation for grouped GEMMs, providing flexible kernel fusion and custom optimizations. | Custom kernel development or scenarios requiring high extensibility with Torch compilation support. |
+| `triton` | Triton-based implementation for grouped GEMMs. To achieve higher performance, it's highly recommended to create [tuned configurations](https://github.com/sgl-project/sglang/blob/main/benchmark/kernels/fused_moe_triton/README.md). | Custom kernel development or scenarios requiring high extensibility with Torch compilation support. |
 | `deep_gemm` | DeepGEMM backend optimized for MoE matrix multiplications, supporting contiguous layouts for prefill and masked layouts for decode; often JIT-compiled for performance. | Large-scale EP deployments with FP8 block-wise quantization. |
 | `cutlass` | CUTLASS-based backend for efficient GEMMs. | NVIDIA architectures with CUTLASS support. |
-| `flashinfer_trtllm` | FlashInfer integrated with TensorRT-LLM for accelerated MoE computations, supporting FP4 communication operators and high-performance GEMMs. | NVIDIA architectures with TRT-LLM. |
-| `flashinfer_cutlass` | FlashInfer combined with CUTLASS for high-performance grouped GEMMs in MoE layers, handling FP4/FP8 quantization efficiently. | Optimized for Blackwell (e.g., B200) and FP4/FP8 models. |
+| `flashinfer_trtllm` | FlashInfer integrated with TensorRT-LLM for accelerated MoE computations, supporting FP4 communication operators and high-performance GEMMs. | SM100+ with TRT-LLM. |
+| `flashinfer_cutlass` | FlashInfer combined with CUTLASS for high-performance grouped GEMMs in MoE layers, handling FP4/FP8 quantization efficiently. | SM100+ with FP4/FP8 models. |
 | `flashinfer_mxfp4` | FlashInfer variant optimized for MXFP4 (mixed FP4) quantization in MoE runners, focusing on memory-efficient low-precision inference. | Low-precision models with MXFP4. |
-| `flashinfer_cutedsl` | FlashInfer with a custom DSL for flexible and efficient MoE kernel generation, integrated with modelopt quantization. | Low-precision models with NVFP4. |
+| `flashinfer_cutedsl` | FlashInfer with a custom DSL for flexible and efficient MoE kernel generation, integrated with ModelOpt FP4 quantization. | Low-precision models with NVFP4. |

 ### Examples

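To complement the backend table, a minimal expert-parallel launch sketch follows. The `--ep-size` flag and the `ep_size = tp_size` pairing come from this commit's docs; the `--moe-runner-backend` flag name and the model path are assumptions for illustration:

```bash
# A minimal sketch: EP across 8 GPUs with the deep_gemm runner backend.
# --ep-size is documented in this commit; --moe-runner-backend is an assumed
# flag name, and the model path is an example placeholder.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp-size 8 \
  --ep-size 8 \
  --moe-runner-backend deep_gemm
```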
docs/advanced_features/server_arguments.md
Lines changed: 1 addition & 1 deletion

@@ -272,7 +272,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
 | `--speculative-ngram-branch-length` | The branch length for ngram speculative decoding. | `18` | Type: int |
 | `--speculative-ngram-capacity` | The cache capacity for ngram speculative decoding. | `10000000` | Type: int |

-## Expert parallelism
+## MoE
 | Argument | Description | Defaults | Options |
 | --- | --- | --- | --- |
 | `--expert-parallel-size`<br>`--ep-size`<br>`--ep` | The expert parallelism size. | `1` | Type: int |
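Since the argument column above lists three spellings for the same setting, a quick sketch of the equivalent invocations (the model path is a placeholder):

```bash
# Equivalent invocations per the alias list above; <model> is a placeholder.
python -m sglang.launch_server --model-path <model> --expert-parallel-size 4
python -m sglang.launch_server --model-path <model> --ep-size 4
python -m sglang.launch_server --model-path <model> --ep 4
```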

docs/basic_usage/popular_model_usage.rst
Lines changed: 1 addition & 1 deletion

@@ -7,6 +7,6 @@ Popular Model Usage (DeepSeek, GPT-OSS, Llama, Qwen, and more)
 deepseek_v3.md
 deepseek_v32.md
 gpt_oss.md
-llama4.md
 qwen3.md
 qwen3_vl.md
+llama4.md

docs/index.rst
Lines changed: 1 addition & 0 deletions

@@ -41,6 +41,7 @@ Its core features include:
 advanced_features/tool_parser.ipynb
 advanced_features/separate_reasoning.ipynb
 advanced_features/quantization.md
+advanced_features/expert_parallelism.md
 advanced_features/lora.ipynb
 advanced_features/pd_disaggregation.md
 advanced_features/hicache.rst
