Conversation

@aolemila (Collaborator) commented Nov 3, 2025

Tests

Results

case1: Profile

# launch server
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache \
uv run python -u -m sgl_jax.launch_server \
--model-path Qwen/Qwen3-32B \
--trust-remote-code  \
--tp-size=4 \
--device=tpu \
--mem-fraction-static=0.95 \
--chunked-prefill-size=2048 \
--download-dir=/tmp \
--dtype=bfloat16 \
--max-running-requests 256 \
--skip-server-warmup \
--page-size=128  \
--disable-radix-cache

# profile
uv run python -m sgl_jax.bench_serving --backend sglang-oai --dataset-name random --num-prompts 384 --random-input-len 4096 --random-output-len 1 --max-concurrency 128 --random-range-ratio 1  --disable-ignore-eos --warmup-requests 0

There is no all_gather in sampling. We add a lax.cond for the penalty, so there are two branches (with and without penalty).

[Profiler trace screenshot]
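For reference, a minimal sketch of the lax.cond pattern described above, assuming a simplified presence-penalty function; the names and shapes are illustrative, not the sampler's actual code:

# Illustrative sketch only: gate the penalty path with lax.cond so both
# branches are compiled into the same program and no Python-level branching
# is needed at sampling time.
import jax

def apply_penalty(logits, output_counts, presence_penalty):
    # Hypothetical penalty: push down tokens that already appeared in the output.
    return logits - presence_penalty * (output_counts > 0)

def maybe_penalize(logits, output_counts, presence_penalty, need_penalty):
    # need_penalty is a traced boolean, so lax.cond selects the branch on device.
    return jax.lax.cond(
        need_penalty,
        lambda x: apply_penalty(x, output_counts, presence_penalty),
        lambda x: x,
        logits,
    )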

case2: Benchmark

Note:

  • With input length 4096 and batch size 128, the median TTFT on this branch is 25672.42 ms, compared with 25731.12 ms in the blog.
  • Server launch and benchmark steps follow Instructions for Qwen3-32B on TPU v6e-4 (#270).
  • Performance varies across tpu-v6e-4 machines, so commit af32f095880ff676ed23eec19bc79584b5e20717 was benchmarked again on the same machine used to test the current branch.

Conclusion: the current branch does not perform worse.

Same Machine

[Benchmark charts: qwen3-32B_input4096_output1024, qwen3-32B_input8192_output1024]

Different Machines

[Benchmark charts: qwen3-32B_input4096_output1024, qwen3-32B_input8192_output1024]

case3: Math-500

# launch server
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--trust-remote-code  \
--tp-size=4 \
--device=tpu \
--mem-fraction-static=0.8 \
--chunked-prefill-size=2048 \
--download-dir=/tmp \
--dtype=bfloat16 \
--max-running-requests 256 \
--skip-server-warmup \
--page-size=128  \
--disable-radix-cache \
--use-sort-for-toppk-minp

# eval (evalscope==0.17.1); more configuration details are in the Complete eval config.
## Sampling parameters refer to https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
## Eval generation config refers to https://evalscope.readthedocs.io/zh-cn/latest/best_practice/eval_qwq.html#id5.
evalscope eval  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--api-url http://127.0.0.1:30000/v1/chat/completions \
--api-key EMPTY \
--eval-type service \
--datasets math_500  \
--eval-batch-size 64 \
--dataset-args '{"math_500":{"metric_list":["Pass@1"]}}' --generation-config '{"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95, "n": 1}' \
--timeout 120000 \
--model-args precision=jnp.bfloat16
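For context on the --use-sort-for-toppk-minp flag used in the launch command above, here is a minimal single-sequence sketch of sort-based top-k / top-p / min-p filtering. This is an assumption about what the flag refers to, not sgl-jax's actual kernel:

# Hedged sketch: sort-based filtering combining top-k, top-p, and min-p for a
# single logits vector.  Illustrative only.
import jax
import jax.numpy as jnp

def filter_logits_1d(logits, top_k=20, top_p=0.95, min_p=0.0):
    probs = jax.nn.softmax(logits)
    order = jnp.argsort(-probs)                      # token ids, most probable first
    sorted_probs = probs[order]
    cum = jnp.cumsum(sorted_probs)

    keep = jnp.arange(logits.shape[0]) < top_k       # top-k cutoff
    keep &= (cum - sorted_probs) < top_p             # nucleus (top-p) cutoff
    keep &= sorted_probs >= min_p * sorted_probs[0]  # min-p relative cutoff

    keep_vocab = jnp.zeros_like(keep).at[order].set(keep)  # back to vocab order
    return jnp.where(keep_vocab, logits, -jnp.inf)

# Usage: sample from the filtered distribution.
# token = jax.random.categorical(key, filter_logits_1d(logits))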

Scores

+-------------------------------+-----------+----------+----------+-------+---------+---------+
| Model                         | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+===============================+===========+==========+==========+=======+=========+=========+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 1  |    43 |  0.9535 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 2  |    90 |  0.9444 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 3  |   105 |  0.8667 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 4  |   128 |  0.7969 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 5  |   134 |  0.5896 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | OVERALL  |   500 |  0.796  | -       |
+-------------------------------+-----------+----------+----------+-------+---------+---------+

+-------------------------------+-----------+----------+----------+-------+---------+---------+
| Model                         | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+===============================+===========+==========+==========+=======+=========+=========+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 1  |    43 |  0.9302 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 2  |    90 |  0.9222 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 3  |   105 |  0.8857 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 4  |   128 |  0.7812 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 5  |   134 |  0.6418 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | OVERALL  |   500 |  0.804  | -       |
+-------------------------------+-----------+----------+----------+-------+---------+---------+

case4: Temperature & Penalty

# launch server
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache \
python3 -u -m sgl_jax.launch_server \
--model-path Qwen/Qwen3-8B \
--trust-remote-code  \
--tp-size=4 \
--device=tpu \
--mem-fraction-static=0.8 \
--chunked-prefill-size=2048 \
--download-dir=/tmp \
--dtype=bfloat16 \
--max-running-requests 256 \
--skip-server-warmup \
--page-size=128  \
--disable-radix-cache


# eval (evalscope==0.17.1)
## eval_without_temperature
evalscope eval  --model Qwen/Qwen3-8B --api-url http://127.0.0.1:30000/v1/chat/completions --api-key EMPTY --eval-type service --datasets gsm8k --eval-batch-size 64
## eval_with_temperature_penalty
## Note: Sampling parameters follow the non-thinking-mode settings recommended at https://huggingface.co/Qwen/Qwen3-8B.
evalscope eval  --model Qwen/Qwen3-8B --api-url http://127.0.0.1:30000/v1/chat/completions --api-key EMPTY --eval-type service --datasets gsm8k --eval-batch-size 64 --generation-config '{"temperature": 0.7,"top_p":0.8,"top_k":20,"min_p":0.0,"presence_penalty":0.5}'

eval_without_temperature

Compared with #245, 0.9083 vs 0.9083.

+----------+-----------+-----------------+----------+-------+---------+---------+
| Model    | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+==========+===========+=================+==========+=======+=========+=========+
| Qwen3-8B | gsm8k     | AverageAccuracy | main     |  1319 |  0.9083 | default |
+----------+-----------+-----------------+----------+-------+---------+---------+

eval_with_temperature_penalty

Compared with #245, 0.9462 vs 0.953.

+----------+-----------+-----------------+----------+-------+---------+---------+
| Model    | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+==========+===========+=================+==========+=======+=========+=========+
| Qwen3-8B | gsm8k     | AverageAccuracy | main     |  1319 |   0.953 | default |
+----------+-----------+-----------------+----------+-------+---------+---------+

@aolemila aolemila requested a review from JamesBrianD November 3, 2025 11:24
@gemini-code-assist commented

Summary of Changes

Hello @aolemila, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the sampler.py module by eliminating a specific sharding constraint that was previously applied to logits during the regular sampling process. This change simplifies the code by removing a potentially redundant or unnecessary sharding operation, aiming for cleaner and potentially more efficient execution without altering the fundamental sampling logic.

Highlights

  • Sharding Constraint Removal: The lax.with_sharding_constraint call has been removed from the _regular_sampling function in sampler.py.
  • Import Cleanup: Unused imports for NamedSharding and PartitionSpec from jax.sharding have been removed, cleaning up the module's dependencies.
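A rough illustration of the kind of change described above, under assumed names (not the actual diff):

# Illustrative only; mesh, axis names, and the function body are assumptions.
import jax

def _regular_sampling(logits, key):
    # Previously the logits carried an explicit constraint along these lines:
    #   logits = jax.lax.with_sharding_constraint(
    #       logits, NamedSharding(mesh, PartitionSpec(None, "tensor")))
    # After the change, sharding is left to propagate from the upstream matmul,
    # and the NamedSharding / PartitionSpec imports are no longer needed.
    return jax.random.categorical(key, logits, axis=-1)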

@aolemila aolemila linked an issue Nov 4, 2025 that may be closed by this pull request
@JamesBrianD (Collaborator) left a comment

LGTM

@aolemila aolemila merged commit ea02f58 into sgl-project:main Nov 4, 2025
4 checks passed

Development

Successfully merging this pull request may close these issues.

[Bug] Current TTFT in main is worse than it in Blog
