MXFP4 8w Optimizations Dynamic by adedespirlet · Pull Request #1211 · iree-org/wave

adedespirlet · 2026-03-31T00:34:47Z

This PR:

Optimizes the MXFP4 8w schedule with respect to counters and memory ops.
Includes a handwritten MLIR snippet that performs a swizzle and stores dwords to global memory (instead of ushorts). This optimization brings an improvement of approximately 7%.
Unrolls the kernel loop by a factor of 2 to remove the forced vmcnt(0) + v_mov copies at the end of the loop.
Without unrolling, scale loads and scale consumption overlap within the same iteration, forcing the compiler to load into temporary VGPRs and copy them back to the registers of the loop iter_args at the end of the loop (vmcnt(0) + v_mov). With 2x unrolling, odd and even iterations alternate between two scale register sets, so loads target already "dead" registers directly. This avoids the copies and the vmcnt(0), both of which hurt performance. Scale waits now happen right before the MFMAs, maximizing latency hiding.
Adds tkw.assumptions constraints to optimize dynamic kernels.
In this case, the assumptions allow the compiler to omit masking logic when generating dynamic kernels. Specifically, the assumption states that the shape dimension is an exact multiple of the tile size. With this guarantee, the compiler can safely eliminate the bounds checks associated with gather_to_lds operations. This avoids inserting costly masking logic in the dynamic case and improves performance.
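
The unrolling argument above can be sketched with a toy model in plain Python. This is not the actual schedule or compiler output: the register set names (`A`, `B`, `TMP`), the integer stand-ins for scale values, and both function names are assumptions made for illustration. The model only counts how often a full wait and a copy would be needed per loop trip.

```python
def rolled(num_iters):
    """Toy model of the loop without unrolling.

    The loop-carried scales must live in set A at the start of every trip,
    so the prefetch goes to a temporary set and is copied back at the end.
    """
    regs = {"A": 0, "TMP": None}      # A holds the scales for iteration 0
    consumed, copies, full_waits = [], 0, 0
    for i in range(num_iters):
        regs["TMP"] = i + 1           # prefetch next scales; A is still live
        consumed.append(regs["A"])    # MFMAs read the current scales
        full_waits += 1               # stand-in for the forced vmcnt(0)
        regs["A"] = regs["TMP"]       # stand-in for the v_mov copies
        copies += 1
    return consumed, copies, full_waits


def unrolled_2x(num_iters):
    """Toy model of the 2x unrolled loop: even/odd halves swap register sets."""
    if num_iters % 2:
        raise ValueError("this sketch assumes an even iteration count")
    regs = {"A": 0, "B": None}
    consumed, copies, full_waits = [], 0, 0
    for i in range(0, num_iters, 2):
        regs["B"] = i + 1             # even half: B is dead, load into it
        consumed.append(regs["A"])
        regs["A"] = i + 2             # odd half: A is dead now, load into it
        consumed.append(regs["B"])    # (the final prefetch is unused here)
    return consumed, copies, full_waits


if __name__ == "__main__":
    print(rolled(8))
    print(unrolled_2x(8))
```

Both variants consume the same sequence of values; only the rolled one pays a copy and a full wait on every trip.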

TODO: implement a pass that automatically emits the optimized epilogue storing logic.
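
As a rough illustration of why dword stores help in the epilogue, the sketch below packs pairs of 16-bit values into 32-bit words in plain Python. It assumes the output elements are 16 bits wide and little-endian; the swizzle itself is not modeled, and the function names are hypothetical rather than taken from the PR.

```python
import struct


def pack_u16_pairs(values):
    """Pack consecutive 16-bit values into 32-bit dwords (low half first)."""
    if len(values) % 2:
        raise ValueError("need an even number of 16-bit values")
    dwords = []
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (0 <= lo <= 0xFFFF and 0 <= hi <= 0xFFFF):
            raise ValueError("values must fit in 16 bits")
        dwords.append(lo | (hi << 16))
    return dwords


def bytes_from_u16(values):
    """Memory image produced by storing each value as a separate ushort."""
    return struct.pack(f"<{len(values)}H", *values)


def bytes_from_u32(dwords):
    """Memory image produced by storing each packed dword."""
    return struct.pack(f"<{len(dwords)}I", *dwords)


if __name__ == "__main__":
    vals = [0x3F80, 0x4000, 0xBF80, 0x0000]
    packed = pack_u16_pairs(vals)
    print([hex(d) for d in packed])
    print(bytes_from_u16(vals) == bytes_from_u32(packed))
```

The memory image is identical in both cases, but the packed form needs half as many store operations.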

xintin · 2026-04-15T20:25:26Z

"TODO: implement a pass that automatically emits the optimized epilogue storing logic."

Nice. Let's add this TODO in the code.

xintin · 2026-04-15T20:36:17Z

As discussed, test on: shape=(1792, 5376, 4096), block=(256, 192, 256), dynamic=True to ensure no race condition.
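
A quick sanity check of that shape against the divisibility assumption, sketched in plain Python. It assumes the shape and block tuples are ordered the same way, dimension by dimension; the helper names are hypothetical and do not come from the wave code base.

```python
def tile_counts(shape, block):
    """Number of tiles per dimension, rounding up."""
    return tuple(-(-dim // blk) for dim, blk in zip(shape, block))


def divides_evenly(shape, block):
    """True when every dimension is an exact multiple of its tile size."""
    return all(dim % blk == 0 for dim, blk in zip(shape, block))


def last_tile_mask(dim, blk):
    """In-bounds mask for the elements of the final tile along one dimension."""
    start = (-(-dim // blk) - 1) * blk
    return [start + i < dim for i in range(blk)]


if __name__ == "__main__":
    shape, block = (1792, 5376, 4096), (256, 192, 256)
    print(tile_counts(shape, block))
    print(divides_evenly(shape, block))
    # With exact divisibility the mask is all True, so it can be dropped.
    print(all(last_tile_mask(1792, 256)))
    # A non-multiple dimension leaves a partially filled last tile.
    print(sum(last_tile_mask(1800, 256)))
```

When every dimension divides evenly, the mask for the last tile is entirely true, which is the case in which bounds checks carry no information.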

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

ftynse · 2026-04-16T11:44:10Z

    )
    options.specialize = True
    options.use_buffer_ops = True
    options.minimize_shared_allocs = False
    options.linearize_shared_access = True
-
+    options.wave_runtime = True
+    # options.override_mlir = mlir_256x192


Is the override currently disabled? Meaning the big string above is basically dead code?

ftynse · 2026-04-16T11:44:47Z

+    UNROLL_FACTOR = tkl.sym.UNROLL_FACTOR
+    options.subs[UNROLL_FACTOR] = 2
+    options.postprocess = """


Is the dance with subs needed or can we just substitute 2 into the string below at the Python level?
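
For reference, a Python-level substitution along the lines suggested here could look like the sketch below. The template text is a placeholder, not the postprocess script from the PR, and the names `POSTPROCESS_TEMPLATE` and `unroll_factor` are assumptions. `string.Template` is used because MLIR text is full of `%` and braces, which would need escaping with `%`-formatting or `str.format`.

```python
from string import Template

UNROLL_FACTOR = 2

# Placeholder transform script line; the real postprocess string is not
# reproduced here.
POSTPROCESS_TEMPLATE = Template(
    'transform.loop.unroll %loop { factor = $unroll_factor } '
    ': !transform.op<"scf.for">'
)

# Substitute the factor directly in Python instead of going through a symbol.
postprocess = POSTPROCESS_TEMPLATE.substitute(unroll_factor=UNROLL_FACTOR)

if __name__ == "__main__":
    print(postprocess)
```

Whether this can fully replace the `options.subs` entry depends on whether anything else in the kernel refers to the `UNROLL_FACTOR` symbol, which the thread does not settle.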

ftynse · 2026-04-16T11:45:51Z

    options.specialize = True
    options.use_buffer_ops = True
-    options.minimize_shared_allocs = True
+    options.minimize_shared_allocs = False


A comment as to why this is turned off is welcome

ftynse · 2026-04-16T11:46:01Z

+    UNROLL_FACTOR = tkl.sym.UNROLL_FACTOR
+    options.subs[UNROLL_FACTOR] = 2


Same as above

xintin · 2026-04-16T12:48:34Z

 ]


-_DYNAMIC_ALLOWED_PRESHUFFLE_8WAVE_BLOCKS = {


question: why are we removing these tests? If it is covered elsewhere, then it is fine.

I’ve streamlined the tests because the previous distinction between dynamic and static block tiles is no longer necessary after the optimizations implemented here; both now run on the same tile sizes.

I remember simplifying the shape coverage because the 8-wave ping-pong schedule didn't support the remaining shapes. While we could add support for them, it would require more effort, which is not a priority anymore.

If it was working before this PR, that means that although this optimization works, it is not generic until we have the pass in place.
Removing these tests means that coverage will simply be lost.

What we can do instead:
xfail and document the regression (in the code) with the reason, e.g.:
pytest.param((1024, 1024, 8192), (128, 128, 256), marks=pytest.mark.xfail(reason=" ...")),


xintin

LGTM! Left two more minor comments.
Once the CI is green, we can merge it.


adedespirlet force-pushed the dynamic_divisibility_assumption branch from 006ceb1 to 12caa2a March 31, 2026 02:02

adedespirlet force-pushed the dynamic_divisibility_assumption branch from f676f92 to eb2597e April 10, 2026 18:19

adedespirlet requested a review from xintin April 14, 2026 20:33

adedespirlet added 12 commits April 15, 2026 22:02

add divisbilty assumption for dynamic kernels

4e39932

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

updated ping pong 8w kernel

730e53d

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

replace all llvm ops with arith

7141382

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

remove support for some tiles

577fd4d

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

fix tests

822be24

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

add transposed kernel using dwordx4 stores

fc675e7

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

remove hardcoded counters

368d24d

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

add tests for divisibility assumption and no masked load

6b2e107

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

remove extra barrier

6b9538d

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

clean

05be51c

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

nfc cleaning

cbce1fb

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

add todo comment

02ee665

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adedespirlet force-pushed the dynamic_divisibility_assumption branch from 23c6006 to 02ee665 April 15, 2026 22:06

ftynse approved these changes Apr 16, 2026


xintin requested changes Apr 16, 2026


adedespirlet added 3 commits April 16, 2026 14:57

extract inline MLIR strings to examples/python/mlir/

5566620

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

add explanation for safe

87443ac

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

clean tests

78b0ea5

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

xintin self-requested a review April 16, 2026 18:41

xintin approved these changes Apr 16, 2026


adedespirlet added 2 commits April 16, 2026 19:44

pre commit applied

748e233

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

add tests back

a77b86f

Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>

adedespirlet merged commit a69ed25 into iree-org:main Apr 17, 2026
18 of 19 checks passed



3 participants
