
Conversation

@EnricoDeg
Contributor

Proposed changes

Summary:

  • Add padding support for wave transfer with transpose:
    • If the loading index falls in the padding region, read the data at index 0 (which is always valid), so that the global load with transpose can still be used at wave level.
    • Before writing to LDS, set the register data to 0 if the loading index was in the padding region.
  • There are still some validity restrictions with transpose, which are checked before dispatching the kernel (specific to wave transfer):
    • For 16-bit types, each 8x8 subtile must lie fully in either the valid or the padding region.
    • For 8-bit types, each 8x16 subtile must lie fully in either the valid or the padding region.
  • New test cases added for gemm universal to check the new validity restrictions.

Wave transfer can now be applied when both the vector size for loading from Vmem and the vector size for storing to LDS are equal to 8.

Next step: integrate wave transfer in convolution when it maps to explicit gemm (for default convolution, the thread transfer will still be used)

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@EnricoDeg force-pushed the streamhpc/remove_cshuffle branch from 64462d1 to 7d685e7 on January 9, 2026 15:46
@EnricoDeg requested a review from a team as a code owner on January 9, 2026 15:46
@EnricoDeg force-pushed the streamhpc/padding_support_wave_transfer branch from baad16f to 2d789e2 on January 9, 2026 16:00
@EnricoDeg force-pushed the streamhpc/remove_cshuffle branch from 7d685e7 to ad8995e on January 13, 2026 08:34
@EnricoDeg force-pushed the streamhpc/padding_support_wave_transfer branch from 2d789e2 to 1af4574 on January 13, 2026 09:08
Base automatically changed from streamhpc/remove_cshuffle to develop on January 14, 2026 10:02
@EnricoDeg force-pushed the streamhpc/padding_support_wave_transfer branch from 1af4574 to 6b0420c on January 14, 2026 10:12
Comment on lines 454 to 456
false,
false,
true>;
Contributor
Can we add comments here to indicate the names of the template parameters? It can be a bit hard to tell with this many bools in a row. Same for the CTranspose version.

Contributor Author

Done

@krithalith
Contributor

krithalith commented Jan 20, 2026

Looks good! I had one small comment and also I was wondering if we still need to force threadTileTransfer for the convolution implementations. It seems that we still set this to true for all of them, with the exception of a small handful of special Fwd instances without CTranspose.

krithalith previously approved these changes Jan 20, 2026
Contributor

@ErwinTerpstra left a comment

Nice improvements! I had some small questions and comments, but nothing major. I also have to admit I didn't fully grok the changes in the tensor slice transfer, so couldn't comment on that too much.

@EnricoDeg
Contributor Author

Looks good! I had one small comment and also I was wondering if we still need to force threadTileTransfer for the convolution implementations. It seems that we still set this to true for all of them, with the exception of a small handful of special Fwd instances without CTranspose.

In order to have better support in convolution, we need to change the handling of grid descriptors as was done in conv fwd: create M,K grid descriptors on the host and then modify them on the device to K0,M,K1 for thread transfer (and to something more complicated for wave transfer). This is already work in progress for conv bwd.

@EnricoDeg force-pushed the streamhpc/padding_support_wave_transfer branch from 6b0420c to 865d70d on January 22, 2026 10:00
@EnricoDeg force-pushed the streamhpc/padding_support_wave_transfer branch from 865d70d to 38a4fe6 on January 22, 2026 10:17
@EnricoDeg force-pushed the streamhpc/padding_support_wave_transfer branch from 38a4fe6 to e4ab092 on January 23, 2026 17:04