Skip to content

Commit a21ab8c

Browse files
committed
init
1 parent d18baf6 commit a21ab8c

File tree

1 file changed

+176
-0
lines changed

1 file changed

+176
-0
lines changed

docs/evaluations/evals.md

Lines changed: 176 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,176 @@
1+
# Reproduce
2+
3+
## Environment
4+
5+
TPU v6e-4.
6+
7+
## Commands
8+
9+
## deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
10+
11+
### Commands
12+
13+
**launch server**
14+
15+
```bash
16+
# main: 8315531c7bb852b37934611deee051e22726a0ce
17+
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \
18+
--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
19+
--trust-remote-code \
20+
--tp-size=4 \
21+
--device=tpu \
22+
--mem-fraction-static=0.8 \
23+
--chunked-prefill-size=2048 \
24+
--download-dir=/tmp \
25+
--dtype=bfloat16 \
26+
--max-running-requests 256 \
27+
--skip-server-warmup \
28+
--page-size=128 \
29+
--disable-radix-cache
30+
```
31+
32+
**eval**
33+
```bash
34+
# MATH-500 Pass@1
35+
# Note:
36+
# 1. Sampling parameters refer to https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
37+
# 2. For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
38+
# 3. It costs about 1 hour.
39+
40+
evalscope eval \
41+
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
42+
--api-url http://127.0.0.1:30000/v1/chat/completions \
43+
--api-key EMPTY \
44+
--eval-type service \
45+
--datasets math_500 \
46+
--eval-batch-size 64 \
47+
--dataset-args '{"math_500":{"metric_list":["Pass@1"], "few_shot_num": 4}}' \
48+
--generation-config '{"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95, "n":64}' \
49+
--timeout 120000
50+
51+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
52+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
53+
+===============================+===========+==========+==========+=======+=========+=========+
54+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.8837 | default |
55+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
56+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.8222 | default |
57+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
58+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.7619 | default |
59+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
60+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.6328 | default |
61+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
62+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.4328 | default |
63+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
64+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.662 | - |
65+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
66+
```
67+
68+
```bash
69+
# main: 8315531c7bb852b37934611deee051e22726a0ce
70+
# 3. It costs about 1.5 hour.
71+
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \
72+
--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
73+
--trust-remote-code \
74+
--tp-size=4 \
75+
--device=tpu \
76+
--mem-fraction-static=0.8 \
77+
--chunked-prefill-size=2048 \
78+
--download-dir=/tmp \
79+
--dtype=bfloat16 \
80+
--max-running-requests 256 \
81+
--skip-server-warmup \
82+
--page-size=128 \
83+
--disable-radix-cache \
84+
--precompile-token-paddings 2048 \
85+
--precompile-bs-paddings 256
86+
87+
evalscope eval \
88+
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
89+
--api-url http://127.0.0.1:30000/v1/chat/completions \
90+
--api-key EMPTY \
91+
--eval-type service \
92+
--datasets math_500 \
93+
--eval-batch-size 64 \
94+
--dataset-args '{"math_500":{"metric_list":["Pass@1"], "few_shot_num": 4}}' \
95+
--generation-config '{"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95, "n":64}' \
96+
--timeout 120000
97+
98+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
99+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
100+
+===============================+===========+==========+==========+=======+=========+=========+
101+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.814 | default |
102+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
103+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.8111 | default |
104+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
105+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.8476 | default |
106+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
107+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7266 | default |
108+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
109+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.4328 | default |
110+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
111+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.696 | - |
112+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
113+
114+
```
115+
116+
```bash
117+
# main: 4fa24afba321579fb17cc813883f8ea9614b4c36
118+
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \
119+
--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
120+
--trust-remote-code \
121+
--tp-size=4 \
122+
--device=tpu \
123+
--mem-fraction-static=0.8 \
124+
--chunked-prefill-size=2048 \
125+
--download-dir=/tmp \
126+
--dtype=bfloat16 \
127+
--max-running-requests 256 \
128+
--skip-server-warmup \
129+
--page-size=128 \
130+
--disable-radix-cache \
131+
--precompile-token-paddings 2048 \
132+
--precompile-bs-paddings 256
133+
134+
evalscope eval \
135+
--model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
136+
--api-url http://127.0.0.1:30000/v1/chat/completions \
137+
--api-key EMPTY \
138+
--eval-type service \
139+
--datasets math_500 \
140+
--eval-batch-size 64 \
141+
--dataset-args '{"math_500":{"metric_list":["Pass@1"], "few_shot_num": 4}}' \
142+
--generation-config '{"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95, "n":64}' \
143+
--timeout 120000
144+
145+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
146+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
147+
+===============================+===========+==========+==========+=======+=========+=========+
148+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9302 | default |
149+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
150+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9222 | default |
151+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
152+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9143 | default |
153+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
154+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8047 | default |
155+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
156+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.7164 | default |
157+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
158+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.836 | - |
159+
+-------------------------------+-----------+----------+----------+-------+---------+---------+
160+
161+
```
162+
163+
164+
165+
166+
167+
168+
169+
# Temp
170+
171+
```bash
172+
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --tp-size=4 --device=tpu --mem-fraction-static=0.8 --chunked-prefill-size=2048 --download-dir=/tmp --dtype=bfloat16 --max-running-requests 256 --skip-server-warmup --page-size=128 --disable-radix-cache --precompile-token-paddings 2048 --precompile-bs-paddings 256
173+
174+
175+
evalscope eval --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --api-url http://127.0.0.1:30000/v1/chat/completions --api-key EMPTY --eval-type service --datasets math_500 --eval-batch-size 64 --dataset-args '{"math_500":{"metric_list":["Pass@1"], "few_shot_num": 4}}' --generation-config '{"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95, "n":64}' --timeout 120000 --limit 15
176+
```

0 commit comments

Comments
 (0)