Commit 1e3e697

[docs] add evaluations math (#286)
1 parent db105ba commit 1e3e697

File tree

2 files changed: +156 / -7 lines

README.md

Lines changed: 8 additions & 7 deletions
@@ -47,13 +47,14 @@ For more features and usage details, please read the documents in the [`docs`](h

SGL-JAX is designed for easy extension to new model architectures. It currently provides first-class, optimized support for:

Removed:

- **Qwen**
- **Qwen 2**
- **Qwen 2 MOE**
- **Qwen 3**
- **Qwen 3 MoE**
- **Llama**
- **Bailing MoE**

Added:

- **Qwen**: Performance needs improvement.
- **Qwen 2**: Performance needs improvement.
- **Qwen 2 MoE**: Performance needs improvement.
- **Qwen 3**: This series currently achieves our best performance.
- **Qwen 3 MoE**: Aside from very large models such as Qwen3-Coder-480B, this series achieves our best performance.
- **Llama**: Performance needs improvement.
- **Bailing MoE**: Performance needs improvement.

## Performance and Benchmarking

docs/evaluations/evaluations.md

Lines changed: 148 additions & 0 deletions
# deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

## Math-500

### Introduction

Environment: TPU v6e-1.

Version:

```bash
flax       0.12.1
jax        0.8.1
jaxlib     0.8.1
libtpu     0.0.30
sglang-jax main-5fc4fa54a12ea0cbf05c4e304f0f69595e556aa7
```

### Instructions

```bash
# launch server, precision = bfloat16
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --trust-remote-code \
    --tp-size=1 \
    --device=tpu \
    --mem-fraction-static=0.8 \
    --chunked-prefill-size=2048 \
    --download-dir=/tmp \
    --dtype=bfloat16 \
    --max-running-requests 256 \
    --skip-server-warmup \
    --page-size=128 \
    --disable-radix-cache \
    --use-sort-for-toppk-minp

# launch server, precision = float32
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --trust-remote-code \
    --tp-size=1 \
    --device=tpu \
    --mem-fraction-static=0.8 \
    --chunked-prefill-size=2048 \
    --download-dir=/tmp \
    --dtype=float32 \
    --max-running-requests 256 \
    --skip-server-warmup \
    --page-size=128 \
    --disable-radix-cache \
    --use-sort-for-toppk-minp

# eval: evalscope = 0.17.1
## Sampling parameters refer to https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
## Eval generation config refers to https://evalscope.readthedocs.io/zh-cn/latest/best_practice/eval_qwq.html#id5.
evalscope eval --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --api-url http://127.0.0.1:30000/v1/chat/completions \
    --api-key EMPTY \
    --eval-type service \
    --datasets math_500 \
    --eval-batch-size 64 \
    --dataset-args '{"math_500":{"metric_list":["Pass@1"]}}' \
    --generation-config '{"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95}' \
    --timeout 120000
```
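Before launching the full evalscope run, it can help to sanity-check the server with a single chat-completions request. The sketch below only builds a request body matching the sampling parameters used in the eval command above; `build_request` is a hypothetical helper, not part of sglang-jax or evalscope.

```python
import json

# Hypothetical helper: assemble an OpenAI-compatible chat-completions payload
# using the same sampling parameters as the evalscope command above.
def build_request(question: str) -> dict:
    return {
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 32768,
        "temperature": 0.6,
        "top_p": 0.95,
    }

payload = build_request("What is 7 * 8? Please reason step by step.")
print(json.dumps(payload, indent=2))
```

Once the server is up, POST this JSON to `http://127.0.0.1:30000/v1/chat/completions` (for example with `curl`) and confirm a completion comes back before starting the batch eval.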

### Evaluation Results

- bfloat16: average is **0.808**
  - 0.82, 0.804, 0.822, 0.814, 0.796, 0.798, 0.802, 0.796, 0.798, 0.816, 0.812, 0.83, 0.792, 0.796, 0.82
  - 15 runs
- float32: average is **0.804**
  - 0.806, 0.806, 0.81, 0.818, 0.79, 0.8, 0.806, 0.822, 0.78
  - 9 runs

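The reported averages can be reproduced directly from the per-run Pass@1 scores listed above:

```python
# Per-run Pass@1 scores copied from the results above.
bf16_runs = [0.82, 0.804, 0.822, 0.814, 0.796, 0.798, 0.802, 0.796,
             0.798, 0.816, 0.812, 0.83, 0.792, 0.796, 0.82]
fp32_runs = [0.806, 0.806, 0.81, 0.818, 0.79, 0.8, 0.806, 0.822, 0.78]

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(bf16_runs), 3))  # 0.808
print(round(mean(fp32_runs), 3))  # 0.804
```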
#### Complete Evaluation Configuration
```yaml
analysis_report: false
api_key: EMPTY
api_url: http://127.0.0.1:30000/v1/chat/completions
chat_template: null
dataset_args:
  math_500:
    dataset_id: AI-ModelScope/MATH-500
    description: MATH-500 is a benchmark for evaluating mathematical reasoning capabilities
      of AI models. It consists of 500 diverse math problems across five levels of
      difficulty, designed to test a model's ability to solve complex mathematical
      problems by generating step-by-step solutions and providing the correct final
      answer.
    eval_split: test
    extra_params: {}
    few_shot_num: 0
    few_shot_random: false
    filters: null
    metric_list:
    - Pass@1
    model_adapter: generation
    name: math_500
    output_types:
    - generation
    pretty_name: MATH-500
    prompt_template: '{query}

      Please reason step by step, and put your final answer within \boxed{{}}.'
    query_template: null
    subset_list:
    - Level 1
    - Level 2
    - Level 3
    - Level 4
    - Level 5
    system_prompt: null
    tags:
    - Mathematics
    train_split: null
dataset_dir: /home/gcpuser/.cache/modelscope/hub/datasets
dataset_hub: modelscope
datasets:
- math_500
debug: false
dry_run: false
eval_backend: Native
eval_batch_size: 64
eval_config: null
eval_type: service
generation_config:
  max_tokens: 32768
  temperature: 0.6
  top_p: 0.95
ignore_errors: false
judge_model_args: {}
judge_strategy: auto
judge_worker_num: 1
limit: null
mem_cache: false
model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
model_args: {}
model_id: DeepSeek-R1-Distill-Qwen-1.5B
model_task: text_generation
outputs: null
seed: 42
stage: all
stream: false
template_type: null
timeout: 120000.0
use_cache: null
work_dir: ./outputs/20251121_043226
```
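As a quick consistency check, the sampling values in `generation_config` above can be compared against the values recommended on the DeepSeek-R1-Distill-Qwen-1.5B model card (temperature 0.6, top_p 0.95). A minimal sketch, restating the YAML values as a Python dict rather than parsing the file:

```python
# Values copied from generation_config in the YAML above.
generation_config = {"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95}

# Sampling recommended for DeepSeek-R1 distills (per the model card linked above).
recommended = {"temperature": 0.6, "top_p": 0.95}

# Collect any keys where the eval config deviates from the recommendation.
mismatches = {k: (generation_config[k], v)
              for k, v in recommended.items()
              if generation_config[k] != v}
print(mismatches)  # {} -- the eval used the recommended sampling
```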
