Commit a11b753: edit

Signed-off-by: Kuntai Du <[email protected]>
1 parent 444eb53

File tree

1 file changed (+6 -5 lines changed)


_posts/2025-07-31-cachegen.md

Lines changed: 6 additions & 5 deletions
@@ -7,21 +7,21 @@ author: Kuntai Du
 image: /assets/img/cachegen.png
 ---
 
-**TL;DR:** 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them *way* faster than recomputing! It compresses your KV cache up to **3× smaller than quantization** while keeping response quality high. Stop wasting compute—get instant first-token times and smooth LLM serving at cloud scale.
+**TL;DR:** 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them *way* faster than recomputing! It compresses your KV cache up to **3× smaller than quantization**, so you can load your KV cache blazingly fast while keeping response quality high. Stop wasting compute --- use CacheGen to fully utilize your storage and get an instant first-token speedup!
 
 <div align="center">
 <img src="/assets/img/cachegen.png" alt="comparison" style="width: 97%; vertical-align:middle;">
-<p><em>CacheGen slashes KV cache loading time from disk.</em></p>
+<p><em>CacheGen reduces KV cache loading time from disk.</em></p>
 </div>
 
 ---
 
 ## Why CacheGen?
 
 Modern LLMs use long contexts, but reprocessing these every time is slow and resource-intensive.
-While engines like vLLM (and LMCache) can cache contexts in GPU and CPU memory, that’s not enough for many chat or agent workloads—**hot contexts quickly outgrow memory**.
+While engines like vLLM (and LMCache) can cache contexts in GPU and CPU memory, that’s not enough for many chat or agent workloads—**there are too many hot contexts for GPU and CPU memory alone** --- we need disk and even S3 to store all the KV caches.
 
-Storing and loading KV caches from disk or S3 is usually even slower than recomputing them from text!
+However, storing and loading KV caches from disk or S3 is usually even slower than recomputing them from text!
 **CacheGen fixes this**: you can persist KV caches to any storage (S3, disk, etc.) and reload them *much* faster than a fresh prefill. Perfect for keeping valuable context for all your users and agents—without the cold-start penalty.
 
 ---
@@ -58,9 +58,10 @@ lmcache_server localhost 65434
 
 # Start vLLM+LMCache server (using CacheGen)
 LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=2 vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.8 --port 8020 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
+```
 
 
-example.yaml
+example.yaml:
 ```yaml
 chunk_size: 2048
 local_cpu: False
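
The diff cuts off example.yaml after its first two keys. As a point of reference only, here is a minimal sketch of what a CacheGen-enabled config could look like; the `remote_url` and `remote_serde` keys and their values are assumptions based on LMCache's config schema, not part of this commit:

```yaml
# Hypothetical sketch of example.yaml; not the file from this commit.
# remote_url and remote_serde are assumed from LMCache's config schema.
chunk_size: 2048                    # tokens per KV cache chunk
local_cpu: False                    # skip the CPU-memory cache tier
remote_url: "lm://localhost:65434"  # the lmcache_server started above
remote_serde: "cachegen"            # compress KV chunks with CacheGen
```

Once the server is up, `vllm serve` exposes an OpenAI-compatible API on the chosen port, so a quick smoke test might look like this (the prompt is illustrative; re-sending the same long prompt should hit the CacheGen-backed cache instead of redoing the prefill):

```bash
# Illustrative request against the OpenAI-compatible endpoint on port 8020.
# The first request builds and stores the KV cache; repeating the same long
# prompt should reload it rather than recompute it.
curl http://localhost:8020/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "<a long shared context goes here>",
        "max_tokens": 32
      }'
```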
