
Conversation

@EnricoDeg
Contributor

Proposed changes

Summary:

  • Add padding support for wave transfer with transpose:
    • If the loading index falls in the padding region, read the data at index 0 (which is always valid), so that the global load with transpose can still be used at wave level.
    • Before writing to LDS, set the register data to 0 if the loading index was in the padding region.
  • There are still some validity restrictions with transpose, which are checked before dispatching the kernel (specific to wave transfer):
    • For 16-bit types, each 8x8 subtile must lie fully in either the valid or the padding region.
    • For 8-bit types, each 8x16 subtile must lie fully in either the valid or the padding region.
  • New test cases added for gemm universal to check the new validity restrictions.

Wave transfer can now be applied when both the vector size for loading from Vmem and the vector size for storing to LDS are equal to 8.

Next step: integrate wave transfer in convolution when it maps to explicit gemm (for default convolution, the thread transfer will still be used)

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@EnricoDeg force-pushed the streamhpc/remove_cshuffle branch from 64462d1 to 7d685e7 on January 9, 2026 15:46
@EnricoDeg requested a review from a team as a code owner on January 9, 2026 15:46
@EnricoDeg force-pushed the streamhpc/padding_support_wave_transfer branch from baad16f to 2d789e2 on January 9, 2026 16:00
@EnricoDeg force-pushed the streamhpc/remove_cshuffle branch from 7d685e7 to ad8995e on January 13, 2026 08:34
@EnricoDeg force-pushed the streamhpc/padding_support_wave_transfer branch from 2d789e2 to 1af4574 on January 13, 2026 09:08
Base automatically changed from streamhpc/remove_cshuffle to develop on January 14, 2026 10:02
@EnricoDeg force-pushed the streamhpc/padding_support_wave_transfer branch from 1af4574 to 6b0420c on January 14, 2026 10:12
Comment on lines 454 to 456
false,
false,
true>;
Contributor
Can we add comments here to indicate the names of the template parameters? It can be a bit hard to tell with this many bools in a row. Same for the CTranspose version.

Contributor Author

Done

@krithalith
Contributor

krithalith commented Jan 20, 2026

Looks good! I had one small comment and also I was wondering if we still need to force threadTileTransfer for the convolution implementations. It seems that we still set this to true for all of them, with the exception of a small handful of special Fwd instances without CTranspose.

krithalith previously approved these changes Jan 20, 2026
Contributor

@ErwinTerpstra left a comment

Nice improvements! I had some small questions and comments, but nothing major. I also have to admit I didn't fully grok the changes in the tensor slice transfer, so couldn't comment on that too much.

@EnricoDeg
Contributor Author

Looks good! I had one small comment and also I was wondering if we still need to force threadTileTransfer for the convolution implementations. It seems that we still set this to true for all of them, with the exception of a small handful of special Fwd instances without CTranspose.

In order to have better support in convolution, we need to change the handling of grid descriptors as was done in conv fwd: create M,K grid descriptors on the host and then modify them on the device to K0,M,K1 for thread transfer (and to something more complicated for wave transfer). This is already work in progress for conv bwd.

@EnricoDeg force-pushed the streamhpc/padding_support_wave_transfer branch from 6b0420c to 865d70d on January 22, 2026 10:00
@EnricoDeg force-pushed the streamhpc/padding_support_wave_transfer branch from 865d70d to 38a4fe6 on January 22, 2026 10:17
@EnricoDeg force-pushed the streamhpc/padding_support_wave_transfer branch from 38a4fe6 to e4ab092 on January 23, 2026 17:04