
Conversation

@Kipsora Kipsora commented Oct 14, 2025

This PR adds support for the xAI Grok-2 model. However, execution of the Grok model currently fails with an OOM error, which appears to be caused by the model not being correctly sharded across multiple hosts/TPUs.

Because of the OOM error, we currently set the number of hidden layers to 1 and load dummy weights for fast development:

# TODO (chhzh123): remove this
config.num_hidden_layers = 1
print(f"config: {config}")

The following commands can be used to run the model directly on tpu-v6e-32:

pip3 install -e python
cd python/sgl_jax
python3 bench_one_batch.py \
  --model-path xai-org/grok-2 \
  --tokenizer-path Xenova/grok-1-tokenizer \
  --correct \
  --tp-size 32 \
  --mem-fraction-static 0.4 \
  --download-dir /mnt \
  --load-format dummy

Example Outputs:

...
TPU_24(process=6,(0,6,0,0)) {'num_allocs': 71, 'bytes_in_use': 12892489984, 'peak_bytes_in_use': 14976714112, 'largest_alloc_size': 2147483648, 'bytes_limit': 33550221312, 'bytes_reserved': 67371008, 'peak_bytes_reserved': 67371008, 'bytes_reservable_limit': 29253468032, 'largest_free_block_bytes': 16301195136}
TPU_25(process=6,(1,6,0,0)) {'num_allocs': 67, 'bytes_in_use': 12892451584, 'peak_bytes_in_use': 14976678272, 'largest_alloc_size': 2147483648, 'bytes_limit': 33550221312, 'bytes_reserved': 67371008, 'peak_bytes_reserved': 67371008, 'bytes_reservable_limit': 29253542016, 'largest_free_block_bytes': 16301269120}
TPU_28(process=6,(0,7,0,0)) {'num_allocs': 67, 'bytes_in_use': 12892451584, 'peak_bytes_in_use': 14976678272, 'largest_alloc_size': 2147483648, 'bytes_limit': 33550221312, 'bytes_reserved': 67371008, 'peak_bytes_reserved': 67371008, 'bytes_reservable_limit': 29253542016, 'largest_free_block_bytes': 16301269120}
TPU_29(process=6,(1,7,0,0)) {'num_allocs': 67, 'bytes_in_use': 12892451584, 'peak_bytes_in_use': 14976678272, 'largest_alloc_size': 2147483648, 'bytes_limit': 33550221312, 'bytes_reserved': 67371008, 'peak_bytes_reserved': 67371008, 'bytes_reservable_limit': 29253542016, 'largest_free_block_bytes': 16301269120}
...

Clearly, each device has 12892451584 / 1024 / 1024 / 1024 ≈ 12 GiB in use, which is about the same as the model's total memory size. If the weights were properly sharded across 32 devices, each device should only hold a small fraction of them; see the sketch below for one way to check this.
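
A minimal sketch of how the sharding of the loaded parameters could be inspected directly in JAX. Here `params` is a placeholder for whatever pytree of weights the model runner ends up holding (an assumption about sgl-jax internals); `tree_leaves_with_path`, `keystr`, and the `.sharding` attribute are standard JAX APIs:

import jax

def report_sharding(params, max_entries=8):
    # `params`: placeholder name for the loaded weight pytree (assumption).
    for path, leaf in jax.tree_util.tree_leaves_with_path(params)[:max_entries]:
        # A fully replicated array shows the same full-shape shard on every device;
        # a correctly TP-sharded weight shows per-device slices smaller than leaf.shape.
        print(jax.tree_util.keystr(path), leaf.shape, leaf.sharding)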

To reproduce this behavior on a smaller TPU machine like tpu-v6e-4, use the following commands instead (essentially changing --tp-size 32 to --tp-size 4):

pip3 install -e python
cd python/sgl_jax
python3 bench_one_batch.py \
  --model-path xai-org/grok-2 \
  --tokenizer-path Xenova/grok-1-tokenizer \
  --correct \
  --tp-size 4 \
  --mem-fraction-static 0.4 \
  --download-dir /mnt \
  --load-format dummy

You can also compare this with a single-TPU setup by setting --tp-size 1, which produces:

TPU_0(process=0,(0,0,0,0)) {'num_allocs': 65, 'bytes_in_use': 14874966656, 'peak_bytes_in_use': 14874967168, 'largest_alloc_size': 2147483648, 'bytes_limit': 33550235648, 'bytes_reserved': 67371008, 'peak_bytes_reserved': 67371008, 'bytes_reservable_limit': 29253612416, 'largest_free_block_bytes': 16301339520}
TPU_1(process=0,(1,0,0,0)) {'num_allocs': 2, 'bytes_in_use': 32384, 'peak_bytes_in_use': 32384, 'largest_alloc_size': 30720, 'bytes_limit': 33550235648, 'bytes_reserved': 0, 'peak_bytes_reserved': 0, 'bytes_reservable_limit': 33550235648, 'largest_free_block_bytes': 33550203264}
TPU_2(process=0,(0,1,0,0)) {'num_allocs': 2, 'bytes_in_use': 32384, 'peak_bytes_in_use': 32384, 'largest_alloc_size': 30720, 'bytes_limit': 33550235648, 'bytes_reserved': 0, 'peak_bytes_reserved': 0, 'bytes_reservable_limit': 33550235648, 'largest_free_block_bytes': 33550203264}
TPU_3(process=0,(1,1,0,0)) {'num_allocs': 2, 'bytes_in_use': 32384, 'peak_bytes_in_use': 32384, 'largest_alloc_size': 30720, 'bytes_limit': 33550235648, 'bytes_reserved': 0, 'peak_bytes_reserved': 0, 'bytes_reservable_limit': 33550235648, 'largest_free_block_bytes': 33550203264}

Here TPU_0 uses 14874966656 / 1024 / 1024 / 1024 ≈ 13.9 GiB while all the other TPUs are almost untouched. With tp_size > 1 we would expect much lower per-device memory usage than with tp_size = 1.
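
For reference, per-device numbers in the format shown above can be collected with JAX's public device API; bench_one_batch.py presumably reports something similar, but that is an assumption:

import jax

for device in jax.local_devices():
    stats = device.memory_stats()  # dict with 'bytes_in_use', 'peak_bytes_in_use', ...
    print(device, f"{stats['bytes_in_use'] / 1024 ** 3:.2f} GiB in use")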

@Prayer3th

@Kipsora Kipsora requested a review from Prayer3th October 14, 2025 02:37

@Kipsora Kipsora changed the title Partial support for xAI Grok with OOM errors Feat: Partial support for xAI Grok with OOM errors Oct 14, 2025
@Prayer3th (Collaborator) commented:

I'm wondering if this TPU memory usage figure includes the KV cache size, since it might be affected by the mem-fraction-static parameter.
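
For what it's worth, a back-of-envelope check (assuming the static pool is sized as mem-fraction-static times the per-device bytes_limit, which is an assumption about how sgl-jax interprets the flag) lands in the same ballpark as the ~12 GiB observed per device:

bytes_limit = 33550221312      # per-device limit from the logs above
mem_fraction_static = 0.4      # value passed on the command line
print(f"{bytes_limit * mem_fraction_static / 1024 ** 3:.1f} GiB")  # ~12.5 GiB vs ~12.0 GiB in use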
