
Commit aea2127

add MATH-500 reproduction instructions and results under bfloat16 and float32
1 parent 3e3de0c commit aea2127

File tree

2 files changed: +257 -7 lines changed


README.md

Lines changed: 8 additions & 7 deletions
@@ -47,13 +47,14 @@ For more features and usage details, please read the documents in the [`docs`](h

 SGL-JAX is designed for easy extension to new model architectures. It currently provides first-class, optimized support for:

-- **Qwen**
-- **Qwen 2**
-- **Qwen 2 MOE**
-- **Qwen 3**
-- **Qwen 3 MoE**
-- **Llama**
-- **Bailing MoE**
+- **Qwen**: Performance still needs improvement.
+- **Qwen 2**: Performance still needs improvement.
+- **Qwen 2 MoE**: Performance still needs improvement.
+- **Qwen 3**: This series currently achieves our best performance.
+- **Qwen 3 MoE**: Apart from very large models such as Qwen3-Coder-480B, this series achieves our best performance.
+- **Llama**: Performance still needs improvement.
+- **Bailing MoE**: Performance still needs improvement.

 ## Performance and Benchmarking

docs/evaluations/evaluations.md

Lines changed: 249 additions & 0 deletions
@@ -0,0 +1,249 @@
# deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

## MATH-500

### Introduction

Environment: TPU v6e-4.
Version: main-51e4987a7942ac936bc0e58d77b78174b71eefa5
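
To reproduce the results on the same code version, pin the source tree to the commit above. A minimal sketch, assuming the code lives in the `sgl-project/sglang-jax` repository referenced by the issue link below (the clone location is up to you):

```bash
# Check out the exact commit the results in this document were produced with.
git clone https://github.com/sgl-project/sglang-jax.git
cd sglang-jax
git checkout 51e4987a7942ac936bc0e58d77b78174b71eefa5
```
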
### Instructions

```bash
# Launch the server, precision = bfloat16
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --trust-remote-code \
    --tp-size=4 \
    --device=tpu \
    --mem-fraction-static=0.8 \
    --chunked-prefill-size=2048 \
    --download-dir=/tmp \
    --dtype=bfloat16 \
    --max-running-requests 256 \
    --skip-server-warmup \
    --page-size=128 \
    --disable-radix-cache \
    --use-sort-for-toppk-minp

# Launch the server, precision = float32
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --trust-remote-code \
    --tp-size=4 \
    --device=tpu \
    --mem-fraction-static=0.8 \
    --chunked-prefill-size=2048 \
    --download-dir=/tmp \
    --dtype=float32 \
    --max-running-requests 256 \
    --skip-server-warmup \
    --page-size=128 \
    --disable-radix-cache \
    --use-sort-for-toppk-minp

# Eval: evalscope 0.17.1
## Sampling parameters follow https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
## The eval generation config follows https://evalscope.readthedocs.io/zh-cn/latest/best_practice/eval_qwq.html#id5.
## Note: `n` in --generation-config does not take effect due to https://github.com/sgl-project/sglang-jax/issues/296,
## so run the evaluation multiple times and compute the mean score manually (see the averaging sketch under Evaluation Results).
evalscope eval --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --api-url http://127.0.0.1:30000/v1/chat/completions \
    --api-key EMPTY \
    --eval-type service \
    --datasets math_500 \
    --eval-batch-size 64 \
    --dataset-args '{"math_500":{"metric_list":["Pass@1"]}}' \
    --generation-config '{"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95}' \
    --timeout 120000
```
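
Before starting a run, it can help to confirm that the server answers on the OpenAI-compatible endpoint that evalscope will call. A minimal, optional sanity check (the prompt and `max_tokens` value below are arbitrary and not part of the benchmark):

```bash
# Optional: verify the launched server responds before starting the benchmark.
# Assumes the server started above is listening on the default port 30000.
curl -s http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        "messages": [{"role": "user", "content": "What is 1 + 1?"}],
        "max_tokens": 64,
        "temperature": 0.6,
        "top_p": 0.95
      }'
```
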
### Evaluation Results

- bfloat16: 0.818, 0.820, 0.826
- float32: 0.810, 0.796, 0.808

#### Details

Note:

- Each run under bfloat16 takes about 35 minutes.
- Each run under float32 takes about 40 minutes.

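Because `n` in the generation config is ignored (see the note in the eval command above), each precision was evaluated three times and the scores are averaged by hand. A small sketch for that averaging, using the overall Pass@1 values listed above:

```bash
# Average the three overall Pass@1 scores reported above.
python3 -c 'runs = [0.818, 0.820, 0.826]; print(round(sum(runs) / len(runs), 3))'  # bfloat16 -> 0.821
python3 -c 'runs = [0.810, 0.796, 0.808]; print(round(sum(runs) / len(runs), 3))'  # float32  -> 0.805
```
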
```text
###############################################################################################
#################################### Precision: bfloat16 ######################################
###############################################################################################
#################################### First time result ######################################
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| Model                         | Dataset   | Metric   | Subset   | Num   | Score   | Cat.0   |
+===============================+===========+==========+==========+=======+=========+=========+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 1  | 43    | 0.907   | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 2  | 90    | 0.9889  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 3  | 105   | 0.8857  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 4  | 128   | 0.8125  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 5  | 134   | 0.6269  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | OVERALL  | 500   | 0.818   | -       |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
#################################### Second time result #####################################
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| Model                         | Dataset   | Metric   | Subset   | Num   | Score   | Cat.0   |
+===============================+===========+==========+==========+=======+=========+=========+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 1  | 43    | 0.9302  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 2  | 90    | 0.9444  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 3  | 105   | 0.9048  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 4  | 128   | 0.8125  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 5  | 134   | 0.6418  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | OVERALL  | 500   | 0.82    | -       |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
#################################### Third time result ######################################
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| Model                         | Dataset   | Metric   | Subset   | Num   | Score   | Cat.0   |
+===============================+===========+==========+==========+=======+=========+=========+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 1  | 43    | 0.9535  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 2  | 90    | 0.9444  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 3  | 105   | 0.9048  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 4  | 128   | 0.7969  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 5  | 134   | 0.6716  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | OVERALL  | 500   | 0.826   | -       |
+-------------------------------+-----------+----------+----------+-------+---------+---------+


###############################################################################################
#################################### Precision: float32 #######################################
###############################################################################################
#################################### First time result ######################################
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| Model                         | Dataset   | Metric   | Subset   | Num   | Score   | Cat.0   |
+===============================+===========+==========+==========+=======+=========+=========+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 1  | 43    | 0.907   | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 2  | 90    | 0.9444  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 3  | 105   | 0.9143  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 4  | 128   | 0.8047  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 5  | 134   | 0.6119  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | OVERALL  | 500   | 0.81    | -       |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
#################################### Second time result #####################################
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| Model                         | Dataset   | Metric   | Subset   | Num   | Score   | Cat.0   |
+===============================+===========+==========+==========+=======+=========+=========+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 1  | 43    | 0.907   | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 2  | 90    | 0.9111  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 3  | 105   | 0.9238  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 4  | 128   | 0.7422  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 5  | 134   | 0.6343  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | OVERALL  | 500   | 0.796   | -       |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
#################################### Third time result ######################################
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| Model                         | Dataset   | Metric   | Subset   | Num   | Score   | Cat.0   |
+===============================+===========+==========+==========+=======+=========+=========+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 1  | 43    | 0.9535  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 2  | 90    | 0.9444  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 3  | 105   | 0.8857  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 4  | 128   | 0.7812  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 5  | 134   | 0.6343  | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | OVERALL  | 500   | 0.808   | -       |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
```

#### Complete Evaluation Configuration

```yaml
analysis_report: false
api_key: EMPTY
api_url: http://127.0.0.1:30000/v1/chat/completions
chat_template: null
dataset_args:
  math_500:
    dataset_id: AI-ModelScope/MATH-500
    description: MATH-500 is a benchmark for evaluating mathematical reasoning capabilities
      of AI models. It consists of 500 diverse math problems across five levels of
      difficulty, designed to test a model's ability to solve complex mathematical
      problems by generating step-by-step solutions and providing the correct final
      answer.
    eval_split: test
    extra_params: {}
    few_shot_num: 0
    few_shot_random: false
    filters: null
    metric_list:
    - Pass@1
    model_adapter: generation
    name: math_500
    output_types:
    - generation
    pretty_name: MATH-500
    prompt_template: '{query}

      Please reason step by step, and put your final answer within \boxed{{}}.'
    query_template: null
    subset_list:
    - Level 1
    - Level 2
    - Level 3
    - Level 4
    - Level 5
    system_prompt: null
    tags:
    - Mathematics
    train_split: null
dataset_dir: /home/gcpuser/.cache/modelscope/hub/datasets
dataset_hub: modelscope
datasets:
- math_500
debug: false
dry_run: false
eval_backend: Native
eval_batch_size: 64
eval_config: null
eval_type: service
generation_config:
  max_tokens: 32768
  temperature: 0.6
  top_p: 0.95
ignore_errors: false
judge_model_args: {}
judge_strategy: auto
judge_worker_num: 1
limit: null
mem_cache: false
model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
model_args: {}
model_id: DeepSeek-R1-Distill-Qwen-1.5B
model_task: text_generation
outputs: null
seed: 42
stage: all
stream: false
template_type: null
timeout: 120000.0
use_cache: null
```
