MXFP4 8w Optimizations Dynamic #1211
Nice. Let's add this TODO in the code.
As discussed, test on:
Signed-off-by: Aurore De Spirlet <aurore.despirlet@amd.com>
)
options.specialize = True
options.use_buffer_ops = True
options.minimize_shared_allocs = False
options.linearize_shared_access = True

options.wave_runtime = True
# options.override_mlir = mlir_256x192
Is the override currently disabled? Meaning the big string above is basically dead code?
UNROLL_FACTOR = tkl.sym.UNROLL_FACTOR
options.subs[UNROLL_FACTOR] = 2
options.postprocess = """
Is the dance with subs needed or can we just substitute 2 into the string below at the Python level?
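For reference, substituting the factor at the Python level could look roughly like the sketch below. The template text is a placeholder only (the real postprocess string is not reproduced here), and `POSTPROCESS_TEMPLATE` and `unroll_factor` are hypothetical names. `string.Template` is used rather than an f-string so that any literal braces in the script would not need escaping.

```python
from string import Template

UNROLL_FACTOR = 2

# Placeholder text only: this stands in for the real postprocess script.
POSTPROCESS_TEMPLATE = Template(
    "<postprocess script using unroll factor ${unroll_factor}>"
)

# The factor is baked into the string before it is handed to the compiler,
# so no symbol substitution through options.subs would be required.
postprocess = POSTPROCESS_TEMPLATE.substitute(unroll_factor=UNROLL_FACTOR)
print(postprocess)
```

Whether this is preferable depends on whether anything else in the compile options needs to see the factor as a symbol.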
options.specialize = True
options.use_buffer_ops = True
options.minimize_shared_allocs = True
options.minimize_shared_allocs = False
A comment as to why this is turned off is welcome
UNROLL_FACTOR = tkl.sym.UNROLL_FACTOR
options.subs[UNROLL_FACTOR] = 2

]

_DYNAMIC_ALLOWED_PRESHUFFLE_8WAVE_BLOCKS = {
question: why are we removing these tests? If it is covered elsewhere, then it is fine.
I’ve streamlined the tests because the previous distinction between dynamic and static block tiles is no longer necessary following the optimizations that were implemented; both now run on the same tile sizes.
I remember simplifying the coverage of shapes because the 8wave pingpong schedule didn't support the remaining shapes. While we could add support for them, it would require more effort, which is not a priority anymore.
If it was working before this PR, that means that although this optimization works, it is not generic until we have the pass in place.
Removing these tests means that coverage is simply lost.
What we can do instead:
xfail and document the regression (in the code) with the reason, e.g.:
pytest.param((1024, 1024, 8192), (128, 128, 256), marks=pytest.mark.xfail(reason=" ...")),
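A fuller sketch of this suggestion, assuming a parametrized pytest test, might look like the following. The shape and the unsupported block tile are taken from the line above. The supported tile `(256, 192, 256)`, the helper `schedule_supports`, the test name, and the reason string are all hypothetical stand-ins for illustration.

```python
import pytest

# Hypothetical stand-in for the block tiles the 8-wave schedule supports.
SUPPORTED_BLOCKS = {(256, 192, 256)}


def schedule_supports(block):
    """Toy check standing in for compiling and running the real kernel."""
    return block in SUPPORTED_BLOCKS


CASES = [
    pytest.param((1024, 1024, 8192), (256, 192, 256), id="supported-tile"),
    # Kept in the suite and marked xfail so the regression stays visible.
    pytest.param(
        (1024, 1024, 8192),
        (128, 128, 256),
        marks=pytest.mark.xfail(
            reason="illustrative: block tile not supported by the 8-wave schedule"
        ),
        id="unsupported-tile",
    ),
]


@pytest.mark.parametrize("shape, block", CASES)
def test_dynamic_preshuffle_8wave(shape, block):
    assert schedule_supports(block)
```

With this layout the unsupported shape is still collected and reported as an expected failure, and the reason string records why, so the case is easy to re-enable once support lands.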
xintin left a comment
LGTM! Left two more minor comments.
Once the CI is green, we can merge it.
This PR:
Optimizes the MXFP4 8w schedule with respect to counters and memory ops.
Includes a handwritten MLIR snippet that performs the swizzle and stores dwords to global memory (instead of ushorts). This optimization brings an improvement of approximately 7%.
Unrolls the kernel twice to remove the forced vmcnt(0) + v_mov copies at the end of the loop.
Without unrolling, scale loads and scale consumption overlap within the same iteration, forcing the compiler to load into temporary VGPRs and copy them back into the registers of the loop's iter_args at the end of the loop (vmcnt(0) + v_mov). With 2x unrolling, odd and even iterations alternate between scale register sets, so loads target already "dead" registers directly. This avoids the copies and the vmcnt(0), which hurt performance. Scale waits now happen right before the MFMAs, maximizing latency hiding.
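The register alternation described above can be illustrated with a toy Python model. This is only a sketch of the idea, not the generated code: the function names are hypothetical, plain values stand in for scale registers, and a counter stands in for the vmcnt(0) + v_mov copy-back.

```python
def run_rolled(scales):
    """Single-iteration body: the next scales land in a temporary and are copied back."""
    copies = 0
    current = scales[0]  # prologue load
    consumed = []
    for i in range(len(scales)):
        # The load for the next iteration goes into a temporary register.
        nxt = scales[i + 1] if i + 1 < len(scales) else None
        consumed.append(current)  # the MFMAs consume the current scales
        if nxt is not None:
            current = nxt  # models the vmcnt(0) + v_mov copy into iter_args
            copies += 1
    return consumed, copies


def run_unrolled_2x(scales):
    """Two register sets: each load targets the set that is no longer live."""
    regs = [scales[0], None]  # prologue load into set 0
    copies = 0
    consumed = []
    for i in range(len(scales)):
        live = i % 2      # even iterations read set 0, odd iterations read set 1
        dead = 1 - live   # this set was consumed in the previous iteration
        if i + 1 < len(scales):
            regs[dead] = scales[i + 1]  # load directly, no temporary needed
        consumed.append(regs[live])
    return consumed, copies


print(run_rolled([10, 11, 12, 13]))
print(run_unrolled_2x([10, 11, 12, 13]))
```

Both variants consume the same values in the same order; only the rolled one pays a copy per iteration in this model.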
Adds tkw.assumptions constraints to optimize dynamic kernels.
The assumptions allow the compiler to omit masking logic when generating dynamic kernels. Specifically, each assumption states that a shape dimension is an exact multiple of the tile size. With this guarantee, the compiler can safely eliminate the bounds checks associated with gather_to_lds operations, which avoids costly masking logic in the dynamic case and improves performance.
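The arithmetic behind the divisibility assumption can be checked with a small plain-Python sketch. It does not use the Wave API: `tile_masks` and `needs_masking` are hypothetical helpers that model the per-element bounds check a masked load would perform.

```python
def tile_masks(dim, tile):
    """Per-tile in-bounds masks for a dimension of size dim split into tiles."""
    n_tiles = -(-dim // tile)  # ceiling division
    return [
        [(t * tile + i) < dim for i in range(tile)]
        for t in range(n_tiles)
    ]


def needs_masking(dim, tile):
    """True if any tile contains an out-of-bounds element."""
    return any(not all(mask) for mask in tile_masks(dim, tile))


# When dim is an exact multiple of tile, every mask is all-True,
# so the bounds checks carry no information and can be dropped.
print(needs_masking(1024, 128))
print(needs_masking(1000, 128))
```

In the dynamic case the compiler cannot evaluate this check at compile time, which is why the guarantee has to be stated explicitly as an assumption.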
TODO: implement a pass that automatically emits the optimized epilogue storing logic.