Commit f938d03

saturley-hall, BenHamm, and athreesh authored
docs: add Kubernetes deployment guidance to KV router docs (#3828) (#3838)
Signed-off-by: Ben Hamm <[email protected]>
Signed-off-by: Harrison King Saturley-Hall <[email protected]>
Co-authored-by: Ben Hamm <[email protected]>
Co-authored-by: Anish <[email protected]>
1 parent a8a6ce0 commit f938d03

1 file changed: +108 -2 lines changed

docs/router/README.md

Lines changed: 108 additions & 2 deletions
@@ -9,7 +9,9 @@ SPDX-License-Identifier: Apache-2.0

 The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.

-## KV Router Quick Start
+## Quick Start
+
+### Python / CLI Deployment

 To launch the Dynamo frontend with the KV Router:

@@ -27,10 +29,53 @@ Backend workers register themselves using the `register_llm` API, after which th
 - Makes routing decisions based on KV cache overlap
 - Balances load across available workers

-### Important Arguments
+### Kubernetes Deployment
+
+To enable the KV Router in a Kubernetes deployment, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-deployment
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: my-namespace
+      componentType: frontend
+      replicas: 1
+      envs:
+        - name: DYN_ROUTER_MODE
+          value: kv  # Enable KV Smart Router
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
+    Worker:
+      # ... worker configuration ...
+```
+
+**Key Points:**
+- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
+- Workers automatically report KV cache events to the router
+- No worker-side configuration changes needed
+
+**Complete K8s Examples:**
+- [TRT-LLM aggregated router example](../../components/backends/trtllm/deploy/agg_router.yaml)
+- [vLLM aggregated router example](../../components/backends/vllm/deploy/agg_router.yaml)
+- [SGLang aggregated router example](../../components/backends/sglang/deploy/agg_router.yaml)
+- [Distributed inference tutorial](../../examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)
+
+**For A/B Testing and Advanced K8s Setup:**
+See the comprehensive [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
+
+## Configuration Options
+
+### CLI Arguments (Python Deployment)

 The KV Router supports several key configuration options:

+- **`--router-mode kv`**: Enable KV cache-aware routing (required)
+
 - **`--kv-cache-block-size <size>`**: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration.

 - **`--router-temperature <float>`**: Controls routing randomness (default: 0.0)
@@ -42,11 +87,72 @@ The KV Router supports several key configuration options:
 - `--kv-events`: Uses real-time events from workers for accurate cache tracking
 - `--no-kv-events`: Uses approximation based on routing decisions (lower overhead, less accurate)

+- **`--kv-overlap-score-weight <float>`**: Balance between prefill and decode optimization (default: 1.0)
+  - Higher values (> 1.0): Prioritize reducing prefill cost (better TTFT)
+  - Lower values (< 1.0): Prioritize decode performance (better ITL)
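
The effect of the overlap weight and the router temperature can be illustrated with a toy model. This is a minimal sketch, not Dynamo's actual scoring code: it assumes a per-worker cost of `overlap_weight * prefill_blocks + active_blocks`, and it assumes that a non-zero temperature means softmax sampling over the negated costs. The function names, worker names, and block counts are all hypothetical.

```python
import math
import random
from typing import Dict, Optional


def worker_cost(prefill_blocks: int, active_blocks: int, overlap_weight: float = 1.0) -> float:
    """Assumed cost model: weighted prefill work plus current decode load."""
    return overlap_weight * prefill_blocks + active_blocks


def pick_worker(costs: Dict[str, float], temperature: float = 0.0,
                rng: Optional[random.Random] = None) -> str:
    """Pick the cheapest worker, or sample via softmax when temperature > 0."""
    if temperature <= 0.0:
        return min(costs, key=costs.get)
    rng = rng or random.Random()
    lowest = min(costs.values())
    # Subtract the minimum cost before exponentiating for numerical stability.
    weights = [math.exp(-(c - lowest) / temperature) for c in costs.values()]
    return rng.choices(list(costs), weights=weights, k=1)[0]


# Hypothetical state: worker "a" has a warm cache but is busy decoding,
# worker "b" has a cold cache but is nearly idle.
state = {"a": (2, 10), "b": (8, 1)}  # (blocks to prefill, active decode blocks)

for weight in (0.5, 1.0, 2.0):
    costs = {w: worker_cost(p, a, weight) for w, (p, a) in state.items()}
    print(weight, costs, pick_worker(costs))
```

With these made-up numbers, weights of 0.5 and 1.0 select the idle worker `b`, while a weight of 2.0 selects the cache-warm worker `a`. This matches the direction described in the bullets: a higher weight favors less prefill work.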
+
 For a complete list of available options:
 ```bash
 python -m dynamo.frontend --help
 ```

+### Kubernetes Environment Variables
+
+All CLI arguments can be configured via environment variables in Kubernetes deployments. Use the `DYN_` prefix with uppercase parameter names:
+
+| CLI Argument | K8s Environment Variable | Default | Description |
+|--------------|-------------------------|---------|-------------|
+| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round_robin` | Enable KV router |
+| `--router-temperature <float>` | `DYN_ROUTER_TEMPERATURE=<float>` | `0.0` | Routing randomness |
+| `--kv-cache-block-size <size>` | `DYN_KV_CACHE_BLOCK_SIZE=<size>` | Backend-specific | KV cache block size |
+| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` | Disable KV event tracking |
+| `--kv-overlap-score-weight <float>` | `DYN_KV_OVERLAP_SCORE_WEIGHT=<float>` | `1.0` | Prefill vs decode weight |
+| `--http-port <port>` | `DYN_HTTP_PORT=<port>` | `8000` | HTTP server port |
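
The naming pattern in the table can be written as a small helper. This is a sketch inferred only from the six rows above; `cli_flag_to_env` is a hypothetical function, not part of Dynamo, and whether every frontend flag follows the same rule is an assumption. It needs Python 3.9 or later for `str.removeprefix`.

```python
def cli_flag_to_env(flag: str, value: str = "true") -> tuple[str, str]:
    """Map a CLI flag to the (env name, env value) pair suggested by the table."""
    name = flag.removeprefix("--")
    if name.startswith("no-"):
        # Negated boolean flags such as --no-kv-events become NAME=false.
        name, value = name.removeprefix("no-"), "false"
    return "DYN_" + name.replace("-", "_").upper(), value


for flag, value in [
    ("--router-mode", "kv"),
    ("--router-temperature", "0.5"),
    ("--no-kv-events", "true"),
]:
    print(flag, "->", "=".join(cli_flag_to_env(flag, value)))
```

The `--no-` prefix is handled separately because the table maps `--no-kv-events` to `DYN_KV_EVENTS=false` rather than to a variable containing `NO`.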
+
+### Example with Advanced Configuration
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-deployment
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: my-namespace
+      componentType: frontend
+      replicas: 1
+      envs:
+        - name: DYN_ROUTER_MODE
+          value: kv
+        - name: DYN_ROUTER_TEMPERATURE
+          value: "0.5"  # Add some randomness to prevent worker saturation
+        - name: DYN_KV_OVERLAP_SCORE_WEIGHT
+          value: "1.5"  # Prioritize TTFT over ITL
+        - name: DYN_KV_CACHE_BLOCK_SIZE
+          value: "16"
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
+```
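
The numeric values in the manifest above are quoted because the `value` field of a Kubernetes environment variable is a string. When the `envs` list is generated from a script, the conversion can be done in one place. The following is a minimal sketch; `to_env_entries` is a hypothetical helper, and the settings are copied from the example rather than recommended values.

```python
import json


def to_env_entries(settings: dict) -> list:
    """Turn {name: value} into Kubernetes-style env entries with string values."""
    entries = []
    for name, value in settings.items():
        if isinstance(value, bool):
            text = "true" if value else "false"  # lowercase, as in the table
        else:
            text = str(value)
        entries.append({"name": name, "value": text})
    return entries


settings = {
    "DYN_ROUTER_MODE": "kv",
    "DYN_ROUTER_TEMPERATURE": 0.5,
    "DYN_KV_OVERLAP_SCORE_WEIGHT": 1.5,
    "DYN_KV_CACHE_BLOCK_SIZE": 16,
}

# JSON is valid YAML, so this output can be pasted under `envs:` if desired.
print(json.dumps(to_env_entries(settings), indent=2))
```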
+
+### Alternative: Using Command Args in K8s
+
+You can also pass CLI arguments directly in the container command:
+
+```yaml
+extraPodSpec:
+  mainContainer:
+    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
+    command:
+      - /bin/sh
+      - -c
+    args:
+      - "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000"
+```
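
Because the whole command is passed to `/bin/sh -c` as a single string, any option value containing spaces or shell metacharacters has to be quoted. If the string is assembled programmatically, the standard-library `shlex.join` handles that. This is a sketch only; `frontend_command` is a hypothetical helper, and the flags are the ones shown in the snippet above.

```python
import shlex


def frontend_command(options: dict) -> str:
    """Build a shell-safe `python3 -m dynamo.frontend ...` command string."""
    argv = ["python3", "-m", "dynamo.frontend"]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    # shlex.join quotes each argument only when the shell would need it.
    return shlex.join(argv)


print(frontend_command({"router-mode": "kv", "router-temperature": 0.5, "http-port": 8000}))
```

For the three options used here, the result is the same string as in the `args` entry above.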
+
+**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns.
+
 ## KV Router Architecture

 The KV Router tracks two key metrics for each worker:
