[CK_TILE] Fix alignment in Stream-K workspace buffer #3625

ecamartins · 2026-01-21T21:17:48Z

Proposed changes

Recently, a Stream-K reduction unit test failed; these tests were temporarily disabled in #3559 since the failure was difficult to reproduce (i.e., the test only failed ~once every 8,000-10,000 runs on one machine). After debugging, the issue was narrowed down to an alignment issue in Stream-K's workspace buffer that resulted in stale data being read by a workgroup. See the Discussion section for more details.

Hence, this PR makes the following changes:

Pads the flags portion of the workspace buffer to be 128B-aligned
Adds unit tests to test this updated functionality in Stream-K's tile partitioner's get_flags_buffer_size class method.
Re-enabled the test_ck_tile_streamk_reduction unit tests.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

The Stream-K workspace buffer is a single buffer where the first partition stores flags for the workgroups and the second partition holds each workgroups partials (i.e., partial results for a macro tile in the output tensor) as shown in the following diagram:

But, in this scenario, there is no guarantee that flags will span an entire cache line. So, we could end up with something like this:

In the Stream-K normal and tree reductions, we use cache modifiers to skip cache in certain cases (see #3371 for details). Workgroups skip cache when reading and writing to flags and when writing to partials. But, the cache is not skipped when reading from partials. Using the example above, when a workgroup reads from flags, the entire cache line, which may contain unfinalized partials data, gets stored in cache. Since workgroups don't skip cache to read from partials, they may end up reading incorrect partials data from cache, leading to incorrect results.

While debugging, I ran various experiments to confirm the alignment issue was the cause. The strongest evidence was as follows:

The number of incorrect values in the output tensor equaled the number of partials elements that fit on the same cache line as flags (i.e., the portion labelled beta in the diagram).
Padding the flags buffer by 1 resulted in 1 less incorrect entry in the output tensor

While one solution is to create separate buffers for partials and flags (rather than a single workspace buffer), this option would involve an interface change. Instead, we opted to pad the flags portion of the workspace buffer to be 128B-aligned since this does not involve any interface changes. Hence, the resulting workspace buffer looks something like this:

Copilot

Pull request overview

This PR fixes an alignment issue in the Stream-K workspace buffer that caused stale data to be read due to cache line conflicts between flags and partials data. The fix ensures the flags portion of the buffer is 128-byte aligned to prevent cache coherency issues.

Changes:

Modified get_flags_buffer_size() to pad the flags buffer to 128-byte alignment
Added three new unit tests to verify correct buffer sizing for different flag array sizes
Re-enabled previously disabled Stream-K reduction tests

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
streamk_gemm_tile_partitioner_impl.hpp	Implements 128-byte alignment padding in `get_flags_buffer_size()`
streamk_gemm_tile_partitioner.hpp	Updates documentation for `get_flags_buffer_size()` to reflect alignment requirement
test_streamk_tile_partitioner_common.hpp	Adds three test configuration structs for testing buffer size calculations
test_streamk_tile_partitioner.cpp	Adds unit tests for aligned buffer sizing and updates existing test expectations
CMakeLists.txt	Re-enables the `test_ck_tile_streamk_reduction` test suite

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

In CK Tile Stream-K, the workspace buffer is used to hold flags and partials, where the first i bytes holds the flags and the remaining bytes hold partials. This change adds padding to the flags prefix of the workspace buffer to ensure the number of bytes is 128B-aligned. Without this alignment, since workgroups do not skip cache when reading from partials, they may read stale partials data in cache, leading to incorrect results. The added padding avoids the stale data reading. This change also re-enables the test_ck_tile_streamk_reduction tests.

…ng time

ecamartins self-assigned this Jan 21, 2026

ecamartins requested review from Snektron, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway, tenpercent, vidyasagar-amd and vpietila-amd as code owners January 21, 2026 21:17

afagaj requested a review from Copilot January 21, 2026 22:04

Copilot AI reviewed Jan 21, 2026

View reviewed changes

ecamartins added 2 commits January 22, 2026 23:28

Compute reference GEMM on GPU for test verification to decrease testi…

309c253

…ng time

ecamartins force-pushed the emimarti/ck_tile/fix-streamk-alignment branch from 58a8384 to 309c253 Compare January 22, 2026 23:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CK_TILE] Fix alignment in Stream-K workspace buffer #3625

[CK_TILE] Fix alignment in Stream-K workspace buffer #3625

ecamartins commented Jan 21, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[CK_TILE] Fix alignment in Stream-K workspace buffer #3625

Are you sure you want to change the base?

[CK_TILE] Fix alignment in Stream-K workspace buffer #3625

Conversation

ecamartins commented Jan 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed changes

Checklist

Discussion

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ecamartins commented Jan 21, 2026 •

edited

Loading