Commit 1e3e697

[docs] add evaluations math (#286)
1 parent db105ba commit 1e3e697

File tree

2 files changed: +156 / -7 lines

README.md

Lines changed: 8 additions & 7 deletions
@@ -47,13 +47,14 @@ For more features and usage details, please read the documents in the [`docs`](h

SGL-JAX is designed for easy extension to new model architectures. It currently provides first-class, optimized support for:

Removed:

- **Qwen**
- **Qwen 2**
- **Qwen 2 MOE**
- **Qwen 3**
- **Qwen 3 MoE**
- **Llama**
- **Bailing MoE**

Added:

- **Qwen**: Performance needs improvement.
- **Qwen 2**: Performance needs improvement.
- **Qwen 2 MoE**: Performance needs improvement.
- **Qwen 3**: This series currently achieves our best performance.
- **Qwen 3 MoE**: Aside from very large models such as Qwen3-Coder-480B, this series achieves our best performance.
- **Llama**: Performance needs improvement.
- **Bailing MoE**: Performance needs improvement.

## Performance and Benchmarking

docs/evaluations/evaluations.md

Lines changed: 148 additions & 0 deletions
# deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

## Math-500

### Introduction

Environment: TPU v6e-1.

Version:

```bash
flax       0.12.1
jax        0.8.1
jaxlib     0.8.1
libtpu     0.0.30
sglang-jax main-5fc4fa54a12ea0cbf05c4e304f0f69595e556aa7
```

### Instructions

```bash
# launch server, precision = bfloat16
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --trust-remote-code \
    --tp-size=1 \
    --device=tpu \
    --mem-fraction-static=0.8 \
    --chunked-prefill-size=2048 \
    --download-dir=/tmp \
    --dtype=bfloat16 \
    --max-running-requests 256 \
    --skip-server-warmup \
    --page-size=128 \
    --disable-radix-cache \
    --use-sort-for-toppk-minp

# launch server, precision = float32
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --trust-remote-code \
    --tp-size=1 \
    --device=tpu \
    --mem-fraction-static=0.8 \
    --chunked-prefill-size=2048 \
    --download-dir=/tmp \
    --dtype=float32 \
    --max-running-requests 256 \
    --skip-server-warmup \
    --page-size=128 \
    --disable-radix-cache \
    --use-sort-for-toppk-minp

# eval: evalscope = 0.17.1
## Sampling parameters refer to https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
## Eval generation config refers to https://evalscope.readthedocs.io/zh-cn/latest/best_practice/eval_qwq.html#id5.
evalscope eval --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --api-url http://127.0.0.1:30000/v1/chat/completions \
    --api-key EMPTY \
    --eval-type service \
    --datasets math_500 \
    --eval-batch-size 64 \
    --dataset-args '{"math_500":{"metric_list":["Pass@1"]}}' \
    --generation-config '{"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95}' \
    --timeout 120000
```
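Before launching the full evalscope run, it can help to sanity-check the server with a single chat-completions request. The sketch below only builds a request body matching the sampling parameters used in the eval command above; `build_request` is a hypothetical helper, not part of sglang-jax or evalscope.

```python
import json

# Hypothetical helper: assemble an OpenAI-compatible chat-completions payload
# using the same sampling parameters as the evalscope command above.
def build_request(question: str) -> dict:
    return {
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 32768,
        "temperature": 0.6,
        "top_p": 0.95,
    }

payload = build_request("What is 7 * 8? Please reason step by step.")
print(json.dumps(payload, indent=2))
```

Once the server is up, POST this JSON to `http://127.0.0.1:30000/v1/chat/completions` (for example with `curl`) and confirm a completion comes back before starting the batch eval.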

### Evaluation Results

- bfloat16: average is **0.808**
  - 0.82, 0.804, 0.822, 0.814, 0.796, 0.798, 0.802, 0.796, 0.798, 0.816, 0.812, 0.83, 0.792, 0.796, 0.82
  - 15 runs
- float32: average is **0.804**
  - 0.806, 0.806, 0.81, 0.818, 0.79, 0.8, 0.806, 0.822, 0.78
  - 9 runs

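The reported averages can be reproduced directly from the per-run Pass@1 scores listed above:

```python
# Per-run Pass@1 scores copied from the results above.
bf16_runs = [0.82, 0.804, 0.822, 0.814, 0.796, 0.798, 0.802, 0.796,
             0.798, 0.816, 0.812, 0.83, 0.792, 0.796, 0.82]
fp32_runs = [0.806, 0.806, 0.81, 0.818, 0.79, 0.8, 0.806, 0.822, 0.78]

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(bf16_runs), 3))  # 0.808
print(round(mean(fp32_runs), 3))  # 0.804
```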
#### Complete Evaluation Configuration
```yaml
analysis_report: false
api_key: EMPTY
api_url: http://127.0.0.1:30000/v1/chat/completions
chat_template: null
dataset_args:
  math_500:
    dataset_id: AI-ModelScope/MATH-500
    description: MATH-500 is a benchmark for evaluating mathematical reasoning capabilities
      of AI models. It consists of 500 diverse math problems across five levels of
      difficulty, designed to test a model's ability to solve complex mathematical
      problems by generating step-by-step solutions and providing the correct final
      answer.
    eval_split: test
    extra_params: {}
    few_shot_num: 0
    few_shot_random: false
    filters: null
    metric_list:
    - Pass@1
    model_adapter: generation
    name: math_500
    output_types:
    - generation
    pretty_name: MATH-500
    prompt_template: '{query}

      Please reason step by step, and put your final answer within \boxed{{}}.'
    query_template: null
    subset_list:
    - Level 1
    - Level 2
    - Level 3
    - Level 4
    - Level 5
    system_prompt: null
    tags:
    - Mathematics
    train_split: null
dataset_dir: /home/gcpuser/.cache/modelscope/hub/datasets
dataset_hub: modelscope
datasets:
- math_500
debug: false
dry_run: false
eval_backend: Native
eval_batch_size: 64
eval_config: null
eval_type: service
generation_config:
  max_tokens: 32768
  temperature: 0.6
  top_p: 0.95
ignore_errors: false
judge_model_args: {}
judge_strategy: auto
judge_worker_num: 1
limit: null
mem_cache: false
model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
model_args: {}
model_id: DeepSeek-R1-Distill-Qwen-1.5B
model_task: text_generation
outputs: null
seed: 42
stage: all
stream: false
template_type: null
timeout: 120000.0
use_cache: null
work_dir: ./outputs/20251121_043226
```
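As a quick consistency check, the sampling values in `generation_config` above can be compared against the values recommended on the DeepSeek-R1-Distill-Qwen-1.5B model card (temperature 0.6, top_p 0.95). A minimal sketch, restating the YAML values as a Python dict rather than parsing the file:

```python
# Values copied from generation_config in the YAML above.
generation_config = {"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95}

# Sampling recommended for DeepSeek-R1 distills (per the model card linked above).
recommended = {"temperature": 0.6, "top_p": 0.95}

# Collect any keys where the eval config deviates from the recommendation.
mismatches = {k: (generation_config[k], v)
              for k, v in recommended.items()
              if generation_config[k] != v}
print(mismatches)  # {} -- the eval used the recommended sampling
```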
