Optimize Runtime Perf #806
base: main
Conversation
W: (H,)
"""
Y, X, RSTD, BLOCK_SIZE, num_warps, casting_mode = rms_norm_forward(X, W, eps, offset, casting_mode, row_mode)
num_stages = calculate_num_stages()
nit: We can just return num_stages from rms_norm_forward(), like num_warps, to avoid calling it again. A rough sketch follows.
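A minimal sketch of that suggestion, assuming the call signature shown in the diff; the extra return value and the stub bodies are hypothetical, not the PR's actual code:

```python
# Hypothetical sketch: compute num_stages once inside the forward pass and
# return it alongside num_warps, so callers don't call calculate_num_stages()
# a second time. The real kernel launch is elided.
def calculate_num_stages():
    return 4  # placeholder; the real heuristic queries device properties

def rms_norm_forward(X, W, eps, offset, casting_mode, row_mode):
    BLOCK_SIZE, num_warps = 65536, 32    # stand-in values for illustration
    num_stages = calculate_num_stages()  # computed once, here
    Y, RSTD = X.clone(), None            # stand-in outputs
    return Y, X, RSTD, BLOCK_SIZE, num_warps, num_stages, casting_mode
```

The caller would then unpack num_stages from the returned tuple instead of recomputing it.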
}

def calculate_num_stages():
Is there a table where we can look up these properties?
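For reference, a minimal sketch of how such a lookup could be done directly from the runtime instead of a hand-kept table; the thresholds below are illustrative, not taken from the PR:

```python
import torch

def calculate_num_stages():
    # Read the properties from the runtime instead of a hard-coded table,
    # keying the stage count off the device's compute capability.
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    if props.major >= 9:   # Hopper and newer
        return 4
    if props.major >= 8:   # Ampere / Ada
        return 3
    return 2               # older architectures
```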
),
],
)
def test_large_64k_softmax_correctness(dtype, atol, rtol):
Are there any considerations against just adding a test case to the original tests?
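For illustration, one way the 64K width could be folded into the existing parametrization instead of a separate test; the sizes and tolerances below are hypothetical, not copied from the PR:

```python
import pytest
import torch

@pytest.mark.parametrize("n_cols", [1024, 8192, 65536])  # 64K joins the existing sizes
@pytest.mark.parametrize(
    "dtype, atol, rtol",
    [
        (torch.float32, 1e-5, 1e-5),
        (torch.bfloat16, 1e-2, 1e-2),
    ],
)
def test_softmax_correctness(n_cols, dtype, atol, rtol):
    # Body elided; only the parametrization shape matters here.
    ...
```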
(GemmaRMSNorm, 1.0, "gemma"),
],
)
def test_large_64k_correctness(dtype, atol, rtol, reference, offset, casting_mode):
ditto
device = torch.cuda.current_device()
torch_device_props = torch.cuda.get_device_properties(device)
We should make it XPU-compatible:
https://docs.pytorch.org/docs/stable/generated/torch.xpu.get_device_properties.html#torch.xpu.get_device_properties
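A minimal sketch of a backend-agnostic lookup, assuming the XPU path mirrors the CUDA one; the helper name is made up:

```python
import torch

def get_accel_device_properties():
    # Dispatch on whichever accelerator backend is present so the same
    # property lookup works on CUDA and XPU builds of PyTorch.
    if torch.cuda.is_available():
        return torch.cuda.get_device_properties(torch.cuda.current_device())
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.xpu.get_device_properties(torch.xpu.current_device())
    raise RuntimeError("No supported accelerator (CUDA or XPU) found")
```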
num_warps = 4
if BLOCK_SIZE >= 32768:
if BLOCK_SIZE >= 65536:
By the way, I've always wondered why we don't take element_size into account. Do you have any idea?
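Purely as an illustration of that question, a heuristic keyed on bytes per row rather than element count; the thresholds and helper name are hypothetical:

```python
def pick_num_warps(X, BLOCK_SIZE):
    # Scale the warp count by the bytes actually processed per row
    # (BLOCK_SIZE * element size), so fp32 and bf16 rows of the same width
    # are treated differently.
    bytes_per_row = BLOCK_SIZE * X.element_size()
    if bytes_per_row >= 256 * 1024:
        return 32
    if bytes_per_row >= 128 * 1024:
        return 16
    return 4
```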
Can you show some plots comparing the perf before/after?
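A small sketch of how such before/after numbers could be gathered with triton.testing.do_bench; torch.softmax stands in for the two kernel versions only to keep the snippet self-contained:

```python
import torch
import triton

# In practice the two entries would be the kernel from main and from this branch.
candidates = {
    "before (main)": lambda x: torch.softmax(x, dim=-1),
    "after (this PR)": lambda x: torch.softmax(x, dim=-1),
}

x = torch.randn(4, 65536, device="cuda", dtype=torch.bfloat16)
for name, fn in candidates.items():
    ms = triton.testing.do_bench(lambda: fn(x))  # runtime in milliseconds
    print(f"{name}: {ms:.3f} ms")
```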
Summary
Optimize Softmax and RMSNorm runtime performance for hidden_size >= 64K.
Testing Done
Added large tests for 64K dim
make test to ensure correctness
make checkstyle to ensure code style
make test-convergence to ensure convergence