The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
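As a rough illustration of the cost evaluation described above, the sketch below scores each worker by adding a decode cost (its active blocks) to a prefill cost (the request blocks it would have to compute anew), then picks the cheapest worker. The `WorkerState` type, the `prefill_weight` parameter, and the linear formula are assumptions made for illustration; they are not taken from Dynamo's implementation.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    """Hypothetical per-worker snapshot; field names are illustrative only."""
    name: str
    active_blocks: int   # blocks held by in-flight (decoding) requests
    cached_blocks: int   # leading blocks of the new request already cached on this worker


def routing_cost(worker: WorkerState, request_blocks: int, prefill_weight: float = 1.0) -> float:
    """Assumed linear cost: weighted prefill blocks plus decode blocks."""
    # Blocks not found in the worker's cache must be computed during prefill.
    new_blocks = max(request_blocks - worker.cached_blocks, 0)
    return prefill_weight * new_blocks + worker.active_blocks


def pick_worker(workers: list[WorkerState], request_blocks: int, prefill_weight: float = 1.0) -> WorkerState:
    """Return the worker with the lowest estimated cost."""
    return min(workers, key=lambda w: routing_cost(w, request_blocks, prefill_weight))


workers = [
    WorkerState("worker-a", active_blocks=10, cached_blocks=0),
    WorkerState("worker-b", active_blocks=14, cached_blocks=8),
]
best = pick_worker(workers, request_blocks=10)
print(best.name)  # worker-b: cost 2 + 14 = 16 beats worker-a's 10 + 10 = 20
```

In this toy setup the busier worker still wins because its cache overlap removes most of the prefill work, which is the trade-off the router is weighing.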
-## KV Router Quick Start
+## Quick Start
+
+### Python / CLI Deployment

To launch the Dynamo frontend with the KV Router:
@@ -27,10 +29,53 @@ Backend workers register themselves using the `register_llm` API, after which th
- Makes routing decisions based on KV cache overlap
- Balances load across available workers

-### Important Arguments
+### Kubernetes Deployment
+
+To enable the KV Router in a Kubernetes deployment, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
+See the comprehensive [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
+
+## Configuration Options
+
+### CLI Arguments (Python Deployment)

The KV Router supports several key configuration options:
- **`--kv-cache-block-size <size>`**: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration.
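
To make the granularity trade-off concrete, here is a minimal sketch that assumes cache overlap is detected only in whole blocks, so any shared tokens past the last full block are not counted as reusable. The helper `matched_prefix_tokens` is hypothetical and not part of Dynamo.

```python
def matched_prefix_tokens(shared_prefix_tokens: int, block_size: int) -> int:
    """Tokens of a shared prefix that count as overlap when only full blocks match."""
    full_blocks = shared_prefix_tokens // block_size  # partial trailing block is ignored
    return full_blocks * block_size


# Two requests share a 100-token prefix; compare a few block sizes.
for block_size in (16, 64, 128):
    reused = matched_prefix_tokens(100, block_size)
    print(f"block_size={block_size}: {reused} of 100 shared tokens detected as overlap")
```

Under this assumption, a block size of 16 detects 96 of the 100 shared tokens, 64 detects 64, and 128 detects none, which is why larger blocks give coarser overlap detection.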