Commit a11b753: edit

Signed-off-by: Kuntai Du <[email protected]>
1 parent 444eb53

File tree

1 file changed (+6 -5 lines changed)


_posts/2025-07-31-cachegen.md

Lines changed: 6 additions & 5 deletions
@@ -7,21 +7,21 @@ author: Kuntai Du
 image: /assets/img/cachegen.png
 ---
 
-**TL;DR:** 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them *way* faster than recomputing! It compresses your KV cache up to **3× smaller than quantization** while keeping response quality high. Stop wasting compute—get instant first-token times and smooth LLM serving at cloud scale.
+**TL;DR:** 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them *way* faster than recomputing! It compresses your KV cache up to **3× smaller than quantization**, so you can load your KV cache blazingly fast while keeping response quality high. Stop wasting compute --- use CacheGen to fully utilize your storage and get an instant first-token speedup!
 
 <div align="center">
 <img src="/assets/img/cachegen.png" alt="comparison" style="width: 97%; vertical-align:middle;">
-<p><em>CacheGen slashes KV cache loading time from disk.</em></p>
+<p><em>CacheGen reduces KV cache loading time from disk.</em></p>
 </div>
 
 ---
 
 ## Why CacheGen?
 
 Modern LLMs use long contexts, but reprocessing these every time is slow and resource-intensive.
-While engines like vLLM (and LMCache) can cache contexts in GPU and CPU memory, that’s not enough for many chat or agent workloads—**hot contexts quickly outgrow memory**.
+While engines like vLLM (and LMCache) can cache contexts in GPU and CPU memory, that’s not enough for many chat or agent workloads—**there are too many hot contexts for GPU and CPU memory alone** --- we need disk and even S3 to store all the KV caches.
 
-Storing and loading KV caches from disk or S3 is usually even slower than recomputing them from text!
+However, storing and loading KV caches from disk or S3 is usually even slower than recomputing them from text!
 **CacheGen fixes this**: you can persist KV caches to any storage (S3, disk, etc.) and reload them *much* faster than a fresh prefill. Perfect for keeping valuable context for all your users and agents—without the cold-start penalty.
 
 ---
@@ -58,9 +58,10 @@ lmcache_server localhost 65434
 
 # Start vLLM+LMCache server (using CacheGen)
 LMCACHE_CONFIG_FILE=example.yaml CUDA_VISIBLE_DEVICES=2 vllm serve meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.8 --port 8020 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
+```
 
 
-example.yaml
+example.yaml:
 ```yaml
 chunk_size: 2048
 local_cpu: False
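
The diff cuts off example.yaml after its first two keys. As a point of reference only, here is a minimal sketch of what a CacheGen-enabled config could look like; the `remote_url` and `remote_serde` keys and their values are assumptions based on LMCache's config schema, not part of this commit:

```yaml
# Hypothetical sketch of example.yaml; not the file from this commit.
# remote_url and remote_serde are assumed from LMCache's config schema.
chunk_size: 2048                    # tokens per KV cache chunk
local_cpu: False                    # skip the CPU-memory cache tier
remote_url: "lm://localhost:65434"  # the lmcache_server started above
remote_serde: "cachegen"            # compress KV chunks with CacheGen
```

Once the server is up, `vllm serve` exposes an OpenAI-compatible API on the chosen port, so a quick smoke test might look like this (the prompt is illustrative; re-sending the same long prompt should hit the CacheGen-backed cache instead of redoing the prefill):

```bash
# Illustrative request against the OpenAI-compatible endpoint on port 8020.
# The first request builds and stores the KV cache; repeating the same long
# prompt should reload it rather than recompute it.
curl http://localhost:8020/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "<a long shared context goes here>",
        "max_tokens": 32
      }'
```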
