Skip to content

Conversation

@ecamartins
Copy link
Collaborator

@ecamartins ecamartins commented Jan 21, 2026

Proposed changes

Recently, a Stream-K reduction unit test failed; these tests were temporarily disabled in #3559 since the failure was difficult to reproduce (i.e., the test only failed ~once every 8,000-10,000 runs on one machine). After debugging, the issue was narrowed down to an alignment issue in Stream-K's workspace buffer that resulted in stale data being read by a workgroup. See the Discussion section for more details.

Hence, this PR makes the following changes:

  • Pads the flags portion of the workspace buffer to be 128B-aligned
  • Adds unit tests to test this updated functionality in Stream-K's tile partitioner's get_flags_buffer_size class method.
  • Re-enabled the test_ck_tile_streamk_reduction unit tests.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

The Stream-K workspace buffer is a single buffer where the first partition stores flags for the workgroups and the second partition holds each workgroups partials (i.e., partial results for a macro tile in the output tensor) as shown in the following diagram:
image

But, in this scenario, there is no guarantee that flags will span an entire cache line. So, we could end up with something like this:
image

In the Stream-K normal and tree reductions, we use cache modifiers to skip cache in certain cases (see #3371 for details). Workgroups skip cache when reading and writing to flags and when writing to partials. But, the cache is not skipped when reading from partials. Using the example above, when a workgroup reads from flags, the entire cache line, which may contain unfinalized partials data, gets stored in cache. Since workgroups don't skip cache to read from partials, they may end up reading incorrect partials data from cache, leading to incorrect results.

While debugging, I ran various experiments to confirm the alignment issue was the cause. The strongest evidence was as follows:

  • The number of incorrect values in the output tensor equaled the number of partials elements that fit on the same cache line as flags (i.e., the portion labelled beta in the diagram).
  • Padding the flags buffer by 1 resulted in 1 less incorrect entry in the output tensor

While one solution is to create separate buffers for partials and flags (rather than a single workspace buffer), this option would involve an interface change. Instead, we opted to pad the flags portion of the workspace buffer to be 128B-aligned since this does not involve any interface changes. Hence, the resulting workspace buffer looks something like this:
image

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an alignment issue in the Stream-K workspace buffer that caused stale data to be read due to cache line conflicts between flags and partials data. The fix ensures the flags portion of the buffer is 128-byte aligned to prevent cache coherency issues.

Changes:

  • Modified get_flags_buffer_size() to pad the flags buffer to 128-byte alignment
  • Added three new unit tests to verify correct buffer sizing for different flag array sizes
  • Re-enabled previously disabled Stream-K reduction tests

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
streamk_gemm_tile_partitioner_impl.hpp Implements 128-byte alignment padding in get_flags_buffer_size()
streamk_gemm_tile_partitioner.hpp Updates documentation for get_flags_buffer_size() to reflect alignment requirement
test_streamk_tile_partitioner_common.hpp Adds three test configuration structs for testing buffer size calculations
test_streamk_tile_partitioner.cpp Adds unit tests for aligned buffer sizing and updates existing test expectations
CMakeLists.txt Re-enables the test_ck_tile_streamk_reduction test suite

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

In CK Tile Stream-K, the workspace buffer is used to hold flags and
partials, where the first i bytes holds the flags and the remaining
bytes hold partials. This change adds padding to the flags prefix of the
workspace buffer to ensure the number of bytes is 128B-aligned. Without
this alignment, since workgroups do not skip cache when reading from
partials, they may read stale partials data in cache, leading to
incorrect results. The added padding avoids the stale data reading.

This change also re-enables the test_ck_tile_streamk_reduction tests.
@ecamartins ecamartins force-pushed the emimarti/ck_tile/fix-streamk-alignment branch from 58a8384 to 309c253 Compare January 22, 2026 23:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants