---
layout: post
title: "LMCache supports gpt-oss (20B/120B) on Day 1"
subtitle: "Complete integration guide and performance benchmarks"
date: 2025-08-05 12:00:00 -0400
background: '/assets/img/bgimage.png'
author: "Yihua, Kobe"
---

LMCache now supports OpenAI's newly released GPT-OSS models (20B and 120B parameters) from day one! This post provides a complete guide to setting up vLLM with LMCache for GPT-OSS models and demonstrates significant performance improvements through our CPU offloading capabilities.

![LMCache GPT-OSS Integration](/assets/img/gpt-oss-vllm-lmcache.png)
## Step 1: Install the vLLM GPT-OSS Version

### Installation

```bash
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match
```

### Test the Installation

Start the server:

```bash
vllm serve openai/gpt-oss-120b \
    --max-model-len 32768 \
    --disable-hybrid-kv-cache-manager
```

Then send a request (the server listens on port 8000 by default):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {
        "role": "user",
        "content": "Hello how are you today"
      }
    ],
    "temperature": 0.7
  }'
```
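
If you prefer Python, the same request can go through the OpenAI client library, since vLLM exposes an OpenAI-compatible API. This is a minimal sketch, assuming the `openai` package is installed and the server is on the default port 8000; the `api_key` value is just a placeholder because vLLM does not require one by default:

```python
# Minimal sketch: query the vLLM OpenAI-compatible server with the official
# openai client (assumes `pip install openai`; the API key is a placeholder).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello how are you today"}],
    temperature=0.7,
)
print(response.choices[0].message.content)
```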

## Step 2: Install LMCache from Source

### Why Install from Source?

vLLM requires a nightly PyTorch build to serve the GPT-OSS models. To ensure compatibility, we highly recommend building LMCache from source against the PyTorch version already in your virtual environment.

### Installation Process

Install LMCache from source (the editable install may take a few minutes due to CUDA kernel compilation):

```bash
# In your virtual environment
git clone https://github.com/LMCache/LMCache.git
cd LMCache
ENABLE_CXX11_ABI=1 uv pip install -e . --no-build-isolation
```

### Test the Installation

```bash
python3 -c "import torch; import lmcache; import lmcache.c_ops"
```
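
Because the whole point of building from source is to match the PyTorch already in your environment, it can also help to print which versions actually got picked up. An optional sketch (the `lmcache.__version__` attribute is an assumption and may not exist in every release, hence the `getattr` fallback):

```python
# Optional: confirm LMCache and its compiled CUDA ops import cleanly,
# and report the versions the build actually linked against.
import torch
import lmcache
import lmcache.c_ops  # compiled kernels; this import fails if the source build is broken

print("torch:", torch.__version__)
print("lmcache:", getattr(lmcache, "__version__", "unknown"))
print("CUDA available:", torch.cuda.is_available())
```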

## Step 3: Run vLLM with LMCache

### LMCache Configuration

Create a configuration file `backend_cpu.yaml` for CPU offloading:

```yaml
# Create a CPU offloading buffer of 80 GB
chunk_size: 256          # tokens per KV cache chunk
local_cpu: True          # enable offloading to local CPU memory
max_local_cpu_size: 80   # CPU buffer size in GB
```
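
Since `max_local_cpu_size` reserves host memory in gigabytes, it is worth checking that the machine really has 80 GB to spare before launching. A rough, Linux-only sketch:

```python
# Rough sanity check (Linux): compare available host RAM against the
# 80 GB CPU offloading buffer configured in backend_cpu.yaml.
with open("/proc/meminfo") as f:
    meminfo = {line.split(":")[0]: int(line.split()[1]) for line in f}

available_gb = meminfo["MemAvailable"] / (1024 * 1024)  # kB -> GB
print(f"MemAvailable: {available_gb:.1f} GB")
if available_gb < 80:
    print("Less than 80 GB available; consider lowering max_local_cpu_size.")
```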

### Launch vLLM with LMCache

```bash
LMCACHE_CONFIG_FILE="./backend_cpu.yaml" \
LMCACHE_USE_EXPERIMENTAL=True \
CUDA_VISIBLE_DEVICES=6,7 \
vllm serve \
    openai/gpt-oss-120b \
    --max-model-len 32768 \
    --disable-log-requests \
    --disable-hybrid-kv-cache-manager \
    --kv-transfer-config \
    '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
```

## Step 4: Benchmark Results

### Use Case: Long Document Q&A

- **Input**: 20 different documents with an average length of 20K tokens each
- **Output**: 50 tokens per query

The benchmark runs in two phases (a simplified sketch of this loop appears after the list):

1. **Phase 1**: Send all documents to the serving engine to warm up the KV cache
2. **Phase 2**: Shuffle the queries and send them again, measuring TTFT and total finish time
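
To make the methodology concrete, here is a heavily simplified sketch of that two-phase loop using the OpenAI Python client, with streaming used to approximate TTFT. It is illustrative only, not the actual `long-doc-qa.py` script linked at the end of this post; the synthetic documents, prompt wording, port, and placeholder API key are all assumptions:

```python
# Illustrative two-phase long-document Q&A loop (not the official
# long-doc-qa.py script): warm up the KV cache, then replay shuffled
# queries and approximate TTFT via streaming.
import random
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "openai/gpt-oss-120b"

# Synthetic stand-ins for the 20 long documents (exact token count
# depends on the tokenizer; keep well under --max-model-len).
documents = [f"Document {i}: " + "lorem ipsum dolor sit amet " * 2000 for i in range(20)]


def query(doc: str) -> float:
    """Ask one question over a document and return the observed TTFT in seconds."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": doc + "\n\nSummarize the document."}],
        max_tokens=50,
        stream=True,
    )
    ttft = None
    for chunk in stream:
        if ttft is None and chunk.choices and chunk.choices[0].delta.content:
            ttft = time.perf_counter() - start  # first generated token arrived
    return ttft if ttft is not None else time.perf_counter() - start


# Phase 1: send every document once to warm up the KV cache.
for doc in documents:
    query(doc)

# Phase 2: shuffle and replay, measuring TTFT and total finish time.
random.shuffle(documents)
phase2_start = time.perf_counter()
ttfts = [query(doc) for doc in documents]
total = time.perf_counter() - phase2_start
print(f"Average TTFT: {sum(ttfts) / len(ttfts):.2f}s; total finish time: {total:.2f}s")
```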

### Performance Results

The Phase 2 results show substantial improvements:

| Setup | Average TTFT (secs) | Time to finish all queries (secs) |
|-------|---------------------|-----------------------------------|
| Vanilla vLLM | 1.20 | 15.70 |
| vLLM + LMCache | **0.39** | **7.73** |

### Why the Performance Gain?

When a single A100/H100 serves gpt-oss-120b, the GPU buffer available for KV cache is typically less than 10 GB. With LMCache's CPU offloading buffer, vLLM can store and reuse the KV cache for many more prefixes, resulting in:

- **67% reduction** in Time to First Token (TTFT)
- **51% reduction** in total query completion time

### Running the Benchmark

You can reproduce these results using our benchmark script:

```bash
python long-doc-qa.py --num-documents 20 \
    --document-length 20000 --output-len 50 \
    --repeat-count 1 --repeat-mode random \
    --shuffle-seed 0
```

The complete benchmark script is available at [https://github.com/LMCache/LMCache/blob/dev/benchmarks/long-doc-qa/long-doc-qa.py](https://github.com/LMCache/LMCache/blob/dev/benchmarks/long-doc-qa/long-doc-qa.py).