From 1e7001b22129a6dc5c8462c3606fa4d286c66568 Mon Sep 17 00:00:00 2001 From: aolemila Date: Fri, 31 Oct 2025 14:19:40 +0800 Subject: [PATCH 1/6] add math-500 reproduce instructions and result under bfloat16 and float32 --- README.md | 15 +- docs/evaluations/evaluations.md | 249 ++++++++++++++++++++++++++++++++ 2 files changed, 257 insertions(+), 7 deletions(-) create mode 100644 docs/evaluations/evaluations.md diff --git a/README.md b/README.md index c8520d063..c48670fe9 100644 --- a/README.md +++ b/README.md @@ -47,13 +47,14 @@ For more features and usage details, please read the documents in the [`docs`](h SGL-JAX is designed for easy extension to new model architectures. It currently provides first-class, optimized support for: -- **Qwen** -- **Qwen 2** -- **Qwen 2 MOE** -- **Qwen 3** -- **Qwen 3 MoE** -- **Llama** -- **Bailing MoE** +- **Qwen**: Performance needs to improve. +- **Qwen 2**: Performance needs to improve. +- **Qwen 2 MOE**: Performance needs to improve. +- **Qwen 3**: Currently these series have achieved our best performance. +- **Qwen 3 MoE**: Apart from models like Qwen-coder3-480B with large parameters, these series have achieved our best performance. +- **Llama**: Performance needs to improve. +- **Bailing MoE**: Performance needs to improve. + ## Performance and Benchmarking diff --git a/docs/evaluations/evaluations.md b/docs/evaluations/evaluations.md new file mode 100644 index 000000000..599c89516 --- /dev/null +++ b/docs/evaluations/evaluations.md @@ -0,0 +1,249 @@ +# deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B + +## Math-500 + +### Introduction + +Environment: TPU v6e-4. +Version: main-51e4987a7942ac936bc0e58d77b78174b71eefa5 + +### Instructions + +```bash +# launch server, precision = bfloat16 +JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ +--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ +--trust-remote-code \ +--tp-size=4 \ +--device=tpu \ +--mem-fraction-static=0.8 \ +--chunked-prefill-size=2048 \ +--download-dir=/tmp \ +--dtype=bfloat16 \ +--max-running-requests 256 \ +--skip-server-warmup \ +--page-size=128 \ +--disable-radix-cache \ +--use-sort-for-toppk-minp + +# launch server, precision = float32 +JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ +--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ +--trust-remote-code \ +--tp-size=4 \ +--device=tpu \ +--mem-fraction-static=0.8 \ +--chunked-prefill-size=2048 \ +--download-dir=/tmp \ +--dtype=float32 \ +--max-running-requests 256 \ +--skip-server-warmup \ +--page-size=128 \ +--disable-radix-cache \ +--use-sort-for-toppk-minp + +# eval: evalscope = 0.17.1 +## Sampling parameters refer to https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. +## Eval generation config refers to https://evalscope.readthedocs.io/zh-cn/latest/best_practice/eval_qwq.html#id5. +## Note: n in generation-config does not take effect due to https://github.com/sgl-project/sglang-jax/issues/296. So please get mean grade manually. +evalscope eval --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ +--api-url http://127.0.0.1:30000/v1/chat/completions \ +--api-key EMPTY \ +--eval-type service \ +--datasets math_500 \ +--eval-batch-size 64 \ +--dataset-args '{"math_500":{"metric_list":["Pass@1"]}}' --generation-config '{"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95}' \ +--timeout 120000 +``` + +### Evaluation Results + +- bloat16: 0.818, 0.82, 0.826. +- float32: 0.81, 0.796, 0.808. 
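+
+The three numbers per precision above are the per-run OVERALL scores; because `n` in the generation config does not take effect (see the note in the eval command above), the mean across runs has to be computed by hand. Below is a minimal sketch of that step using plain shell arithmetic; the numbers shown are the bfloat16 runs listed above, and the float32 numbers can be substituted the same way:
+
+```bash
+# Average the per-run OVERALL Pass@1 scores reported above.
+echo "0.818 0.82 0.826" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "mean Pass@1 over %d runs: %.4f\n", NF, s / NF }'
+```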
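+
+Separately, as a quick sanity check before starting a full evaluation run, a single request against the same OpenAI-compatible endpoint that evalscope targets confirms the server launched above is responding. This is only a sketch: the prompt is arbitrary, and the sampling parameters simply mirror the ones used for the eval.
+
+```bash
+# Send one chat completion request to the locally launched server.
+curl -s http://127.0.0.1:30000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
+        "messages": [{"role": "user", "content": "What is 1 + 1? Please reason step by step."}],
+        "max_tokens": 64,
+        "temperature": 0.6,
+        "top_p": 0.95
+      }'
+```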
+ +#### Details + +Note: +- Every test under bfloat16 costs about 35 minutes. +- Every test under float32 costs about 40 minutes. + +```bash +############################################################################################### +#################################### Precision: bfloat16 ###################################### +############################################################################################### +#################################### First time result ###################################### ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | ++===============================+===========+==========+==========+=======+=========+=========+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.907 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9889 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.8857 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8125 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6269 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.818 | - | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +#################################### Second time result ##################################### ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | ++===============================+===========+==========+==========+=======+=========+=========+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9302 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9048 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8125 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6418 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.82 | - | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +#################################### Third time result ###################################### ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| 
Model | Dataset | Metric | Subset | Num | Score | Cat.0 | ++===============================+===========+==========+==========+=======+=========+=========+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9535 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9048 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7969 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6716 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.826 | - | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ + + + +############################################################################################### +#################################### Precision: float32 ####################################### +############################################################################################### +#################################### First time result ###################################### ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | ++===============================+===========+==========+==========+=======+=========+=========+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.907 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9143 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8047 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6119 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.81 | - | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +#################################### Second time result ##################################### ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | ++===============================+===========+==========+==========+=======+=========+=========+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.907 | default | 
++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9111 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9238 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7422 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6343 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.796 | - | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +#################################### Third time result ###################################### ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | ++===============================+===========+==========+==========+=======+=========+=========+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9535 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.8857 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7812 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6343 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.808 | - | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +``` + +#### Complete Evaluation Configuration +```yaml +analysis_report: false +api_key: EMPTY +api_url: http://127.0.0.1:30000/v1/chat/completions +chat_template: null +dataset_args: + math_500: + dataset_id: AI-ModelScope/MATH-500 + description: MATH-500 is a benchmark for evaluating mathematical reasoning capabilities + of AI models. It consists of 500 diverse math problems across five levels of + difficulty, designed to test a model's ability to solve complex mathematical + problems by generating step-by-step solutions and providing the correct final + answer. + eval_split: test + extra_params: {} + few_shot_num: 0 + few_shot_random: false + filters: null + metric_list: + - Pass@1 + model_adapter: generation + name: math_500 + output_types: + - generation + pretty_name: MATH-500 + prompt_template: '{query} + + Please reason step by step, and put your final answer within \boxed{{}}.' 
+ query_template: null + subset_list: + - Level 1 + - Level 2 + - Level 3 + - Level 4 + - Level 5 + system_prompt: null + tags: + - Mathematics + train_split: null +dataset_dir: /home/gcpuser/.cache/modelscope/hub/datasets +dataset_hub: modelscope +datasets: +- math_500 +debug: false +dry_run: false +eval_backend: Native +eval_batch_size: 64 +eval_config: null +eval_type: service +generation_config: + max_tokens: 32768 + temperature: 0.6 + top_p: 0.95 +ignore_errors: false +judge_model_args: {} +judge_strategy: auto +judge_worker_num: 1 +limit: null +mem_cache: false +model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B +model_args: {} +model_id: DeepSeek-R1-Distill-Qwen-1.5B +model_task: text_generation +outputs: null +seed: 42 +stage: all +stream: false +template_type: null +timeout: 120000.0 +use_cache: null +``` From 6cc0b94e02d529290713a317c88324cfebd724bb Mon Sep 17 00:00:00 2001 From: aolemila Date: Thu, 20 Nov 2025 20:51:40 +0800 Subject: [PATCH 2/6] update --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c48670fe9..031f36e05 100644 --- a/README.md +++ b/README.md @@ -49,7 +49,7 @@ SGL-JAX is designed for easy extension to new model architectures. It currently - **Qwen**: Performance needs to improve. - **Qwen 2**: Performance needs to improve. -- **Qwen 2 MOE**: Performance needs to improve. +- **Qwen 2 MoE**: Performance needs to improve. - **Qwen 3**: Currently these series have achieved our best performance. - **Qwen 3 MoE**: Apart from models like Qwen-coder3-480B with large parameters, these series have achieved our best performance. - **Llama**: Performance needs to improve. From 9f234427a3942b46b7ca71fb6fb48ab15d3c5239 Mon Sep 17 00:00:00 2001 From: aolemila Date: Fri, 21 Nov 2025 14:15:19 +0800 Subject: [PATCH 3/6] update version and try to retest --- docs/evaluations/evaluations.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/docs/evaluations/evaluations.md b/docs/evaluations/evaluations.md index 599c89516..6a500e7f7 100644 --- a/docs/evaluations/evaluations.md +++ b/docs/evaluations/evaluations.md @@ -4,17 +4,18 @@ ### Introduction -Environment: TPU v6e-4. -Version: main-51e4987a7942ac936bc0e58d77b78174b71eefa5 +Environment: TPU v6e-1. 
+Version: main-5fc4fa54a12ea0cbf05c4e304f0f69595e556aa7 ### Instructions ```bash +# sky-31d4-pseudonym # launch server, precision = bfloat16 JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ --trust-remote-code \ ---tp-size=4 \ +--tp-size=1 \ --device=tpu \ --mem-fraction-static=0.8 \ --chunked-prefill-size=2048 \ @@ -26,11 +27,12 @@ JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ --disable-radix-cache \ --use-sort-for-toppk-minp +# sky-495d-pseudonym # launch server, precision = float32 JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ --trust-remote-code \ ---tp-size=4 \ +--tp-size=1 \ --device=tpu \ --mem-fraction-static=0.8 \ --chunked-prefill-size=2048 \ @@ -246,4 +248,5 @@ stream: false template_type: null timeout: 120000.0 use_cache: null +work_dir: ./outputs/20251121_043226 ``` From a40bdbfd6f3c450bcac241ab5d345f00b474f119 Mon Sep 17 00:00:00 2001 From: aolemila Date: Fri, 21 Nov 2025 22:58:22 +0800 Subject: [PATCH 4/6] update math-500 with latest main --- docs/evaluations/evaluations.md | 123 ++------------------------------ 1 file changed, 6 insertions(+), 117 deletions(-) diff --git a/docs/evaluations/evaluations.md b/docs/evaluations/evaluations.md index 6a500e7f7..f6f81d715 100644 --- a/docs/evaluations/evaluations.md +++ b/docs/evaluations/evaluations.md @@ -47,7 +47,6 @@ JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ # eval: evalscope = 0.17.1 ## Sampling parameters refer to https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. ## Eval generation config refers to https://evalscope.readthedocs.io/zh-cn/latest/best_practice/eval_qwq.html#id5. -## Note: n in generation-config does not take effect due to https://github.com/sgl-project/sglang-jax/issues/296. So please get mean grade manually. evalscope eval --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ --api-url http://127.0.0.1:30000/v1/chat/completions \ --api-key EMPTY \ @@ -60,122 +59,12 @@ evalscope eval --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ ### Evaluation Results -- bloat16: 0.818, 0.82, 0.826. -- float32: 0.81, 0.796, 0.808. - -#### Details - -Note: -- Every test under bfloat16 costs about 35 minutes. -- Every test under float32 costs about 40 minutes. 
- -```bash -############################################################################################### -#################################### Precision: bfloat16 ###################################### -############################################################################################### -#################################### First time result ###################################### -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | -+===============================+===========+==========+==========+=======+=========+=========+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.907 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9889 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.8857 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8125 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6269 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.818 | - | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -#################################### Second time result ##################################### -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | -+===============================+===========+==========+==========+=======+=========+=========+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9302 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9048 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8125 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6418 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.82 | - | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -#################################### Third time result ###################################### -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | 
-+===============================+===========+==========+==========+=======+=========+=========+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9535 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9048 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7969 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6716 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.826 | - | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ - - - -############################################################################################### -#################################### Precision: float32 ####################################### -############################################################################################### -#################################### First time result ###################################### -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | -+===============================+===========+==========+==========+=======+=========+=========+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.907 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9143 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8047 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6119 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.81 | - | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -#################################### Second time result ##################################### -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | -+===============================+===========+==========+==========+=======+=========+=========+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.907 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | 
Pass@1 | Level 2 | 90 | 0.9111 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9238 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7422 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6343 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.796 | - | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -#################################### Third time result ###################################### -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | -+===============================+===========+==========+==========+=======+=========+=========+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9535 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.8857 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7812 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6343 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.808 | - | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -``` +- bloat16: average is **0.808** + - 0.82,0.804,0.822,0.814,0.796,0.798,0.802,0.796,0.798,0.816,0.812,0.83,0.792,0.796,0.82 + - 15 turns +- float32: average is **0.804** + - 0.806,0.806,0.81,0.818,0.79,0.8,0.806,0.822,0.78 + - 9 turns #### Complete Evaluation Configuration ```yaml From ed384c6ac7d3487bd7283b4c2c34f0f1a068349f Mon Sep 17 00:00:00 2001 From: aolemila Date: Fri, 21 Nov 2025 23:00:39 +0800 Subject: [PATCH 5/6] add versions --- docs/evaluations/evaluations.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/docs/evaluations/evaluations.md b/docs/evaluations/evaluations.md index f6f81d715..5c721dbc3 100644 --- a/docs/evaluations/evaluations.md +++ b/docs/evaluations/evaluations.md @@ -5,7 +5,16 @@ ### Introduction Environment: TPU v6e-1. 
-Version: main-5fc4fa54a12ea0cbf05c4e304f0f69595e556aa7 + +Version: +```bash +flax 0.12.1 +jax 0.8.1 +jaxlib 0.8.1 +libtpu 0.0.30 +sglang-jax main-5fc4fa54a12ea0cbf05c4e304f0f69595e556aa7 +``` + ### Instructions From f35b6d5e4145abacccc91db6d449adb068fddba1 Mon Sep 17 00:00:00 2001 From: aolemila Date: Tue, 25 Nov 2025 17:00:15 +0800 Subject: [PATCH 6/6] remove useless annotations --- docs/evaluations/evaluations.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/evaluations/evaluations.md b/docs/evaluations/evaluations.md index 5c721dbc3..f12402f71 100644 --- a/docs/evaluations/evaluations.md +++ b/docs/evaluations/evaluations.md @@ -19,7 +19,6 @@ sglang-jax main-5fc4fa54a12ea0cbf05c4e304f0f69595e556aa7 ### Instructions ```bash -# sky-31d4-pseudonym # launch server, precision = bfloat16 JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ @@ -36,7 +35,6 @@ JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ --disable-radix-cache \ --use-sort-for-toppk-minp -# sky-495d-pseudonym # launch server, precision = float32 JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \