_posts/2025-07-31-cachegen.md
+6 -5 (6 additions & 5 deletions)
@@ -7,21 +7,21 @@ author: Kuntai Du
7
7
image: /assets/img/cachegen.png
8
8
---
9
9
10
-
**TL;DR:** 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them *way* faster than recomputing! It compresses your KV cache up to **3× smaller than quantization** while keeping response quality high. Stop wasting compute—get instant first-token times and smooth LLM serving at cloud scale.
10
+
**TL;DR:** 🚀 CacheGen lets you store KV caches on disk or AWS S3 and load them *way* faster than recomputing! It compresses your KV cache up to **3× smaller than quantization**, so you can load it blazingly fast while keeping response quality high. Stop wasting compute: use CacheGen to make full use of your storage and cut your time to first token!
<p><em>CacheGen slashes KV cache loading time from disk.</em></p>
14
+
<p><em>CacheGen reduces KV cache loading time from disk.</em></p>
15
15
</div>
16
16
17
17
---
18
18
19
19
## Why CacheGen?
20
20
21
21
Modern LLMs use long contexts, but reprocessing these every time is slow and resource-intensive.
22
-
While engines like vLLM (and LMCache) can cache contexts in GPU and CPU memory, that’s not enough for many chat or agent workloads—**hot contexts quickly outgrow memory**.
22
+
While engines like vLLM (and LMCache) can cache contexts in GPU and CPU memory, many chat and agent workloads have **more hot contexts than GPU and CPU memory can hold**, so we need disk and even S3 to store all the KV caches.
23
23
24
-
Storing and loading KV caches from disk or S3 is usually even slower than recomputing them from text!
24
+
However, storing and loading KV caches from disk or S3 is usually even slower than recomputing them from text!
25
25
**CacheGen fixes this**: you can persist KV caches to any storage (S3, disk, etc.) and reload them *much* faster than a fresh prefill. Perfect for keeping valuable context for all your users and agents—without the cold-start penalty.
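
To see why loading an uncompressed KV cache from storage can be slow, here is a rough back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the bandwidth figures are illustrative assumptions, not numbers from CacheGen, and the helper names `kv_cache_bytes` and `load_seconds` are hypothetical.

```python
"""Back-of-the-envelope estimate of KV cache size and load time.

All model and bandwidth numbers below are illustrative assumptions,
not measurements from CacheGen or LMCache.
"""


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Uncompressed KV cache size: one K and one V vector per token, layer and KV head."""
    return 2 * num_tokens * num_layers * num_kv_heads * head_dim * bytes_per_elem


def load_seconds(size_bytes, bandwidth_bytes_per_s, compression_ratio=1.0):
    """Transfer time when the stored cache is `compression_ratio` times smaller."""
    return size_bytes / compression_ratio / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Assumed shape of a hypothetical 8B-class model with grouped-query attention.
    size = kv_cache_bytes(num_tokens=32_768, num_layers=32, num_kv_heads=8, head_dim=128)
    print(f"KV cache: {size / 2**30:.1f} GiB")

    # Assumed storage bandwidths; real disks and S3 setups vary widely.
    for name, bandwidth in [("local disk", 2 * 2**30), ("object store", 2**29)]:
        print(f"{name}: {load_seconds(size, bandwidth):.1f} s uncompressed")
```

Under these assumptions each token costs 128 KiB of KV cache, so a 32K-token context is 4 GiB. Transfer time grows linearly with context length and shrinks in proportion to how much smaller the stored cache is, which is why compression matters for disk and S3 loading.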