From 1e7001b22129a6dc5c8462c3606fa4d286c66568 Mon Sep 17 00:00:00 2001 From: aolemila Date: Fri, 31 Oct 2025 14:19:40 +0800 Subject: [PATCH 1/6] add math-500 reproduce instructions and result under bfloat16 and float32 --- README.md | 15 +- docs/evaluations/evaluations.md | 249 ++++++++++++++++++++++++++++++++ 2 files changed, 257 insertions(+), 7 deletions(-) create mode 100644 docs/evaluations/evaluations.md diff --git a/README.md b/README.md index c8520d063..c48670fe9 100644 --- a/README.md +++ b/README.md @@ -47,13 +47,14 @@ For more features and usage details, please read the documents in the [`docs`](h SGL-JAX is designed for easy extension to new model architectures. It currently provides first-class, optimized support for: -- **Qwen** -- **Qwen 2** -- **Qwen 2 MOE** -- **Qwen 3** -- **Qwen 3 MoE** -- **Llama** -- **Bailing MoE** +- **Qwen**: Performance needs to improve. +- **Qwen 2**: Performance needs to improve. +- **Qwen 2 MOE**: Performance needs to improve. +- **Qwen 3**: Currently these series have achieved our best performance. +- **Qwen 3 MoE**: Apart from models like Qwen-coder3-480B with large parameters, these series have achieved our best performance. +- **Llama**: Performance needs to improve. +- **Bailing MoE**: Performance needs to improve. + ## Performance and Benchmarking diff --git a/docs/evaluations/evaluations.md b/docs/evaluations/evaluations.md new file mode 100644 index 000000000..599c89516 --- /dev/null +++ b/docs/evaluations/evaluations.md @@ -0,0 +1,249 @@ +# deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B + +## Math-500 + +### Introduction + +Environment: TPU v6e-4. +Version: main-51e4987a7942ac936bc0e58d77b78174b71eefa5 + +### Instructions + +```bash +# launch server, precision = bfloat16 +JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ +--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ +--trust-remote-code \ +--tp-size=4 \ +--device=tpu \ +--mem-fraction-static=0.8 \ +--chunked-prefill-size=2048 \ +--download-dir=/tmp \ +--dtype=bfloat16 \ +--max-running-requests 256 \ +--skip-server-warmup \ +--page-size=128 \ +--disable-radix-cache \ +--use-sort-for-toppk-minp + +# launch server, precision = float32 +JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ +--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ +--trust-remote-code \ +--tp-size=4 \ +--device=tpu \ +--mem-fraction-static=0.8 \ +--chunked-prefill-size=2048 \ +--download-dir=/tmp \ +--dtype=float32 \ +--max-running-requests 256 \ +--skip-server-warmup \ +--page-size=128 \ +--disable-radix-cache \ +--use-sort-for-toppk-minp + +# eval: evalscope = 0.17.1 +## Sampling parameters refer to https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. +## Eval generation config refers to https://evalscope.readthedocs.io/zh-cn/latest/best_practice/eval_qwq.html#id5. +## Note: n in generation-config does not take effect due to https://github.com/sgl-project/sglang-jax/issues/296. So please get mean grade manually. +evalscope eval --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ +--api-url http://127.0.0.1:30000/v1/chat/completions \ +--api-key EMPTY \ +--eval-type service \ +--datasets math_500 \ +--eval-batch-size 64 \ +--dataset-args '{"math_500":{"metric_list":["Pass@1"]}}' --generation-config '{"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95}' \ +--timeout 120000 +``` + +### Evaluation Results + +- bloat16: 0.818, 0.82, 0.826. +- float32: 0.81, 0.796, 0.808. 
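+
+The three numbers per precision above are the per-run OVERALL scores; because `n` in the generation config does not take effect (see the note in the eval command above), the mean across runs has to be computed by hand. Below is a minimal sketch of that step using plain shell arithmetic; the numbers shown are the bfloat16 runs listed above, and the float32 numbers can be substituted the same way:
+
+```bash
+# Average the per-run OVERALL Pass@1 scores reported above.
+echo "0.818 0.82 0.826" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "mean Pass@1 over %d runs: %.4f\n", NF, s / NF }'
+```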
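+
+Separately, as a quick sanity check before starting a full evaluation run, a single request against the same OpenAI-compatible endpoint that evalscope targets confirms the server launched above is responding. This is only a sketch: the prompt is arbitrary, and the sampling parameters simply mirror the ones used for the eval.
+
+```bash
+# Send one chat completion request to the locally launched server.
+curl -s http://127.0.0.1:30000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
+        "messages": [{"role": "user", "content": "What is 1 + 1? Please reason step by step."}],
+        "max_tokens": 64,
+        "temperature": 0.6,
+        "top_p": 0.95
+      }'
+```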
+ +#### Details + +Note: +- Every test under bfloat16 costs about 35 minutes. +- Every test under float32 costs about 40 minutes. + +```bash +############################################################################################### +#################################### Precision: bfloat16 ###################################### +############################################################################################### +#################################### First time result ###################################### ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | ++===============================+===========+==========+==========+=======+=========+=========+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.907 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9889 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.8857 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8125 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6269 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.818 | - | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +#################################### Second time result ##################################### ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | ++===============================+===========+==========+==========+=======+=========+=========+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9302 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9048 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8125 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6418 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.82 | - | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +#################################### Third time result ###################################### ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| 
Model | Dataset | Metric | Subset | Num | Score | Cat.0 | ++===============================+===========+==========+==========+=======+=========+=========+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9535 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9048 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7969 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6716 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.826 | - | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ + + + +############################################################################################### +#################################### Precision: float32 ####################################### +############################################################################################### +#################################### First time result ###################################### ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | ++===============================+===========+==========+==========+=======+=========+=========+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.907 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9143 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8047 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6119 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.81 | - | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +#################################### Second time result ##################################### ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | ++===============================+===========+==========+==========+=======+=========+=========+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.907 | default | 
++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9111 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9238 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7422 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6343 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.796 | - | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +#################################### Third time result ###################################### ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | ++===============================+===========+==========+==========+=======+=========+=========+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9535 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.8857 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7812 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6343 | default | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.808 | - | ++-------------------------------+-----------+----------+----------+-------+---------+---------+ +``` + +#### Complete Evaluation Configuration +```yaml +analysis_report: false +api_key: EMPTY +api_url: http://127.0.0.1:30000/v1/chat/completions +chat_template: null +dataset_args: + math_500: + dataset_id: AI-ModelScope/MATH-500 + description: MATH-500 is a benchmark for evaluating mathematical reasoning capabilities + of AI models. It consists of 500 diverse math problems across five levels of + difficulty, designed to test a model's ability to solve complex mathematical + problems by generating step-by-step solutions and providing the correct final + answer. + eval_split: test + extra_params: {} + few_shot_num: 0 + few_shot_random: false + filters: null + metric_list: + - Pass@1 + model_adapter: generation + name: math_500 + output_types: + - generation + pretty_name: MATH-500 + prompt_template: '{query} + + Please reason step by step, and put your final answer within \boxed{{}}.' 
+ query_template: null + subset_list: + - Level 1 + - Level 2 + - Level 3 + - Level 4 + - Level 5 + system_prompt: null + tags: + - Mathematics + train_split: null +dataset_dir: /home/gcpuser/.cache/modelscope/hub/datasets +dataset_hub: modelscope +datasets: +- math_500 +debug: false +dry_run: false +eval_backend: Native +eval_batch_size: 64 +eval_config: null +eval_type: service +generation_config: + max_tokens: 32768 + temperature: 0.6 + top_p: 0.95 +ignore_errors: false +judge_model_args: {} +judge_strategy: auto +judge_worker_num: 1 +limit: null +mem_cache: false +model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B +model_args: {} +model_id: DeepSeek-R1-Distill-Qwen-1.5B +model_task: text_generation +outputs: null +seed: 42 +stage: all +stream: false +template_type: null +timeout: 120000.0 +use_cache: null +``` From 6cc0b94e02d529290713a317c88324cfebd724bb Mon Sep 17 00:00:00 2001 From: aolemila Date: Thu, 20 Nov 2025 20:51:40 +0800 Subject: [PATCH 2/6] update --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c48670fe9..031f36e05 100644 --- a/README.md +++ b/README.md @@ -49,7 +49,7 @@ SGL-JAX is designed for easy extension to new model architectures. It currently - **Qwen**: Performance needs to improve. - **Qwen 2**: Performance needs to improve. -- **Qwen 2 MOE**: Performance needs to improve. +- **Qwen 2 MoE**: Performance needs to improve. - **Qwen 3**: Currently these series have achieved our best performance. - **Qwen 3 MoE**: Apart from models like Qwen-coder3-480B with large parameters, these series have achieved our best performance. - **Llama**: Performance needs to improve. From 9f234427a3942b46b7ca71fb6fb48ab15d3c5239 Mon Sep 17 00:00:00 2001 From: aolemila Date: Fri, 21 Nov 2025 14:15:19 +0800 Subject: [PATCH 3/6] update version and try to retest --- docs/evaluations/evaluations.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/docs/evaluations/evaluations.md b/docs/evaluations/evaluations.md index 599c89516..6a500e7f7 100644 --- a/docs/evaluations/evaluations.md +++ b/docs/evaluations/evaluations.md @@ -4,17 +4,18 @@ ### Introduction -Environment: TPU v6e-4. -Version: main-51e4987a7942ac936bc0e58d77b78174b71eefa5 +Environment: TPU v6e-1. 
+Version: main-5fc4fa54a12ea0cbf05c4e304f0f69595e556aa7 ### Instructions ```bash +# sky-31d4-pseudonym # launch server, precision = bfloat16 JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ --trust-remote-code \ ---tp-size=4 \ +--tp-size=1 \ --device=tpu \ --mem-fraction-static=0.8 \ --chunked-prefill-size=2048 \ @@ -26,11 +27,12 @@ JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ --disable-radix-cache \ --use-sort-for-toppk-minp +# sky-495d-pseudonym # launch server, precision = float32 JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ --trust-remote-code \ ---tp-size=4 \ +--tp-size=1 \ --device=tpu \ --mem-fraction-static=0.8 \ --chunked-prefill-size=2048 \ @@ -246,4 +248,5 @@ stream: false template_type: null timeout: 120000.0 use_cache: null +work_dir: ./outputs/20251121_043226 ``` From a40bdbfd6f3c450bcac241ab5d345f00b474f119 Mon Sep 17 00:00:00 2001 From: aolemila Date: Fri, 21 Nov 2025 22:58:22 +0800 Subject: [PATCH 4/6] update math-500 with latest main --- docs/evaluations/evaluations.md | 123 ++------------------------------ 1 file changed, 6 insertions(+), 117 deletions(-) diff --git a/docs/evaluations/evaluations.md b/docs/evaluations/evaluations.md index 6a500e7f7..f6f81d715 100644 --- a/docs/evaluations/evaluations.md +++ b/docs/evaluations/evaluations.md @@ -47,7 +47,6 @@ JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ # eval: evalscope = 0.17.1 ## Sampling parameters refer to https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B. ## Eval generation config refers to https://evalscope.readthedocs.io/zh-cn/latest/best_practice/eval_qwq.html#id5. -## Note: n in generation-config does not take effect due to https://github.com/sgl-project/sglang-jax/issues/296. So please get mean grade manually. evalscope eval --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ --api-url http://127.0.0.1:30000/v1/chat/completions \ --api-key EMPTY \ @@ -60,122 +59,12 @@ evalscope eval --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ ### Evaluation Results -- bloat16: 0.818, 0.82, 0.826. -- float32: 0.81, 0.796, 0.808. - -#### Details - -Note: -- Every test under bfloat16 costs about 35 minutes. -- Every test under float32 costs about 40 minutes. 
- -```bash -############################################################################################### -#################################### Precision: bfloat16 ###################################### -############################################################################################### -#################################### First time result ###################################### -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | -+===============================+===========+==========+==========+=======+=========+=========+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.907 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9889 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.8857 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8125 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6269 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.818 | - | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -#################################### Second time result ##################################### -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | -+===============================+===========+==========+==========+=======+=========+=========+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9302 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9048 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8125 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6418 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.82 | - | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -#################################### Third time result ###################################### -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | 
-+===============================+===========+==========+==========+=======+=========+=========+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9535 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9048 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7969 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6716 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.826 | - | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ - - - -############################################################################################### -#################################### Precision: float32 ####################################### -############################################################################################### -#################################### First time result ###################################### -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | -+===============================+===========+==========+==========+=======+=========+=========+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.907 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9143 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.8047 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6119 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.81 | - | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -#################################### Second time result ##################################### -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | -+===============================+===========+==========+==========+=======+=========+=========+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.907 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | 
Pass@1 | Level 2 | 90 | 0.9111 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.9238 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7422 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6343 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.796 | - | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -#################################### Third time result ###################################### -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| Model | Dataset | Metric | Subset | Num | Score | Cat.0 | -+===============================+===========+==========+==========+=======+=========+=========+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 1 | 43 | 0.9535 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 2 | 90 | 0.9444 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 3 | 105 | 0.8857 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 4 | 128 | 0.7812 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | Level 5 | 134 | 0.6343 | default | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -| DeepSeek-R1-Distill-Qwen-1.5B | math_500 | Pass@1 | OVERALL | 500 | 0.808 | - | -+-------------------------------+-----------+----------+----------+-------+---------+---------+ -``` +- bloat16: average is **0.808** + - 0.82,0.804,0.822,0.814,0.796,0.798,0.802,0.796,0.798,0.816,0.812,0.83,0.792,0.796,0.82 + - 15 turns +- float32: average is **0.804** + - 0.806,0.806,0.81,0.818,0.79,0.8,0.806,0.822,0.78 + - 9 turns #### Complete Evaluation Configuration ```yaml From ed384c6ac7d3487bd7283b4c2c34f0f1a068349f Mon Sep 17 00:00:00 2001 From: aolemila Date: Fri, 21 Nov 2025 23:00:39 +0800 Subject: [PATCH 5/6] add versions --- docs/evaluations/evaluations.md | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/docs/evaluations/evaluations.md b/docs/evaluations/evaluations.md index f6f81d715..5c721dbc3 100644 --- a/docs/evaluations/evaluations.md +++ b/docs/evaluations/evaluations.md @@ -5,7 +5,16 @@ ### Introduction Environment: TPU v6e-1. 
-Version: main-5fc4fa54a12ea0cbf05c4e304f0f69595e556aa7 + +Version: +```bash +flax 0.12.1 +jax 0.8.1 +jaxlib 0.8.1 +libtpu 0.0.30 +sglang-jax main-5fc4fa54a12ea0cbf05c4e304f0f69595e556aa7 +``` + ### Instructions From f35b6d5e4145abacccc91db6d449adb068fddba1 Mon Sep 17 00:00:00 2001 From: aolemila Date: Tue, 25 Nov 2025 17:00:15 +0800 Subject: [PATCH 6/6] remove useless annotations --- docs/evaluations/evaluations.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/evaluations/evaluations.md b/docs/evaluations/evaluations.md index 5c721dbc3..f12402f71 100644 --- a/docs/evaluations/evaluations.md +++ b/docs/evaluations/evaluations.md @@ -19,7 +19,6 @@ sglang-jax main-5fc4fa54a12ea0cbf05c4e304f0f69595e556aa7 ### Instructions ```bash -# sky-31d4-pseudonym # launch server, precision = bfloat16 JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ @@ -36,7 +35,6 @@ JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ --disable-radix-cache \ --use-sort-for-toppk-minp -# sky-495d-pseudonym # launch server, precision = float32 JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \ --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \