Conversation

@aolemila (Collaborator) commented Nov 3, 2025

Tests

Results

case1: Profile

# launch server
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache \
uv run python -u -m sgl_jax.launch_server \
--model-path Qwen/Qwen3-32B \
--trust-remote-code  \
--tp-size=4 \
--device=tpu \
--mem-fraction-static=0.95 \
--chunked-prefill-size=2048 \
--download-dir=/tmp \
--dtype=bfloat16 \
--max-running-requests 256 \
--skip-server-warmup \
--page-size=128  \
--disable-radix-cache

# profile
uv run python -m sgl_jax.bench_serving --backend sglang-oai --dataset-name random --num-prompts 384 --random-input-len 4096 --random-output-len 1 --max-concurrency 128 --random-range-ratio 1  --disable-ignore-eos --warmup-requests 0

There is no all_gather in sampling. We add a lax.cond for the penalty, so there are two branches (with and without penalty).

[Profiler trace screenshot]
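For reference, a minimal sketch of the lax.cond pattern described above, assuming a simplified presence-penalty function; the names and shapes are illustrative, not the sampler's actual code:

# Illustrative sketch only: gate the penalty path with lax.cond so both
# branches are compiled into the same program and no Python-level branching
# is needed at sampling time.
import jax

def apply_penalty(logits, output_counts, presence_penalty):
    # Hypothetical penalty: push down tokens that already appeared in the output.
    return logits - presence_penalty * (output_counts > 0)

def maybe_penalize(logits, output_counts, presence_penalty, need_penalty):
    # need_penalty is a traced boolean, so lax.cond selects the branch on device.
    return jax.lax.cond(
        need_penalty,
        lambda x: apply_penalty(x, output_counts, presence_penalty),
        lambda x: x,
        logits,
    )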

case2: Benchmark

Note:

  • With input length 4096 and batch size 128, the median TTFT on this branch is 25672.42 ms, compared with 25731.12 ms in the blog.
  • Server launch and benchmark steps follow Instructions for Qwen3-32B on TPU v6e-4 (#270).
  • Performance varies across tpu-v6e-4 machines, so commit af32f095880ff676ed23eec19bc79584b5e20717 was benchmarked again on the same machine used to test the current branch.

Conclusion: the current branch does not perform worse.

Same Machine

[Benchmark charts: qwen3-32B_input4096_output1024, qwen3-32B_input8192_output1024]

Different Machines

[Benchmark charts: qwen3-32B_input4096_output1024, qwen3-32B_input8192_output1024]

case3: Math-500

# launch server
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache python3 -u -m sgl_jax.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--trust-remote-code  \
--tp-size=4 \
--device=tpu \
--mem-fraction-static=0.8 \
--chunked-prefill-size=2048 \
--download-dir=/tmp \
--dtype=bfloat16 \
--max-running-requests 256 \
--skip-server-warmup \
--page-size=128  \
--disable-radix-cache \
--use-sort-for-toppk-minp

# eval (evalscope==0.17.1); more configuration details are in the Complete eval config.
## Sampling parameters refer to https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
## Eval generation config refers to https://evalscope.readthedocs.io/zh-cn/latest/best_practice/eval_qwq.html#id5.
evalscope eval  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--api-url http://127.0.0.1:30000/v1/chat/completions \
--api-key EMPTY \
--eval-type service \
--datasets math_500  \
--eval-batch-size 64 \
--dataset-args '{"math_500":{"metric_list":["Pass@1"]}}' --generation-config '{"max_tokens": 32768, "temperature": 0.6, "top_p": 0.95, "n": 1}' \
--timeout 120000 \
--model-args precision=jnp.bfloat16
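For context on the --use-sort-for-toppk-minp flag used in the launch command above, here is a minimal single-sequence sketch of sort-based top-k / top-p / min-p filtering. This is an assumption about what the flag refers to, not sgl-jax's actual kernel:

# Hedged sketch: sort-based filtering combining top-k, top-p, and min-p for a
# single logits vector.  Illustrative only.
import jax
import jax.numpy as jnp

def filter_logits_1d(logits, top_k=20, top_p=0.95, min_p=0.0):
    probs = jax.nn.softmax(logits)
    order = jnp.argsort(-probs)                      # token ids, most probable first
    sorted_probs = probs[order]
    cum = jnp.cumsum(sorted_probs)

    keep = jnp.arange(logits.shape[0]) < top_k       # top-k cutoff
    keep &= (cum - sorted_probs) < top_p             # nucleus (top-p) cutoff
    keep &= sorted_probs >= min_p * sorted_probs[0]  # min-p relative cutoff

    keep_vocab = jnp.zeros_like(keep).at[order].set(keep)  # back to vocab order
    return jnp.where(keep_vocab, logits, -jnp.inf)

# Usage: sample from the filtered distribution.
# token = jax.random.categorical(key, filter_logits_1d(logits))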

Scores

+-------------------------------+-----------+----------+----------+-------+---------+---------+
| Model                         | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+===============================+===========+==========+==========+=======+=========+=========+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 1  |    43 |  0.9535 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 2  |    90 |  0.9444 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 3  |   105 |  0.8667 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 4  |   128 |  0.7969 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 5  |   134 |  0.5896 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | OVERALL  |   500 |  0.796  | -       |
+-------------------------------+-----------+----------+----------+-------+---------+---------+

+-------------------------------+-----------+----------+----------+-------+---------+---------+
| Model                         | Dataset   | Metric   | Subset   |   Num |   Score | Cat.0   |
+===============================+===========+==========+==========+=======+=========+=========+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 1  |    43 |  0.9302 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 2  |    90 |  0.9222 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 3  |   105 |  0.8857 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 4  |   128 |  0.7812 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | Level 5  |   134 |  0.6418 | default |
+-------------------------------+-----------+----------+----------+-------+---------+---------+
| DeepSeek-R1-Distill-Qwen-1.5B | math_500  | Pass@1   | OVERALL  |   500 |  0.804  | -       |
+-------------------------------+-----------+----------+----------+-------+---------+---------+

case4: Temperature & Penalty

# launch server
JAX_COMPILATION_CACHE_DIR=/tmp/jit_cache \
python3 -u -m sgl_jax.launch_server \
--model-path Qwen/Qwen3-8B \
--trust-remote-code  \
--tp-size=4 \
--device=tpu \
--mem-fraction-static=0.8 \
--chunked-prefill-size=2048 \
--download-dir=/tmp \
--dtype=bfloat16 \
--max-running-requests 256 \
--skip-server-warmup \
--page-size=128  \
--disable-radix-cache


# eval (evalscope==0.17.1)
## eval_without_temperature
evalscope eval  --model Qwen/Qwen3-8B --api-url http://127.0.0.1:30000/v1/chat/completions --api-key EMPTY --eval-type service --datasets gsm8k --eval-batch-size 64
## eval_with_temperature_penalty
## Note: Sampling parameters follow the non-thinking-mode settings recommended at https://huggingface.co/Qwen/Qwen3-8B.
evalscope eval  --model Qwen/Qwen3-8B --api-url http://127.0.0.1:30000/v1/chat/completions --api-key EMPTY --eval-type service --datasets gsm8k --eval-batch-size 64 --generation-config '{"temperature": 0.7,"top_p":0.8,"top_k":20,"min_p":0.0,"presence_penalty":0.5}'

eval_without_temperature

Compared with #245, 0.9083 vs 0.9083.

+----------+-----------+-----------------+----------+-------+---------+---------+
| Model    | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+==========+===========+=================+==========+=======+=========+=========+
| Qwen3-8B | gsm8k     | AverageAccuracy | main     |  1319 |  0.9083 | default |
+----------+-----------+-----------------+----------+-------+---------+---------+

eval_with_temperature_penalty

Compared with #245, 0.9462 vs 0.953.

+----------+-----------+-----------------+----------+-------+---------+---------+
| Model    | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+==========+===========+=================+==========+=======+=========+=========+
| Qwen3-8B | gsm8k     | AverageAccuracy | main     |  1319 |   0.953 | default |
+----------+-----------+-----------------+----------+-------+---------+---------+

@aolemila aolemila requested a review from JamesBrianD November 3, 2025 11:24
@gemini-code-assist commented

Summary of Changes

Hello @aolemila, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the sampler.py module by eliminating a specific sharding constraint that was previously applied to logits during the regular sampling process. This change simplifies the code by removing a potentially redundant or unnecessary sharding operation, aiming for cleaner and potentially more efficient execution without altering the fundamental sampling logic.

Highlights

  • Sharding Constraint Removal: The lax.with_sharding_constraint call has been removed from the _regular_sampling function in sampler.py.
  • Import Cleanup: Unused imports for NamedSharding and PartitionSpec from jax.sharding have been removed, cleaning up the module's dependencies.
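A rough illustration of the kind of change described above, under assumed names (not the actual diff):

# Illustrative only; mesh, axis names, and the function body are assumptions.
import jax

def _regular_sampling(logits, key):
    # Previously the logits carried an explicit constraint along these lines:
    #   logits = jax.lax.with_sharding_constraint(
    #       logits, NamedSharding(mesh, PartitionSpec(None, "tensor")))
    # After the change, sharding is left to propagate from the upstream matmul,
    # and the NamedSharding / PartitionSpec imports are no longer needed.
    return jax.random.categorical(key, logits, axis=-1)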

@aolemila aolemila linked an issue Nov 4, 2025 that may be closed by this pull request
@JamesBrianD (Collaborator) left a comment

LGTM

@aolemila aolemila merged commit ea02f58 into sgl-project:main Nov 4, 2025
4 checks passed

Development

Successfully merging this pull request may close these issues.

[Bug] Current TTFT in main is worse than it in Blog
