Hi,
I am trying to understand the thread remapping logic in the shared memory variants of load_strided / store_strided in block_io_generic_strided.hpp, in the mathdx\25.12\example\cufftdx\05_fft_Xd folder.
My current understanding of the natural thread assignment (used in the non-shared-memory variants and during FFT computation for 1D FFT in introduction_example.cu) is:
threadIdx.x → thread's position within its FFT
threadIdx.y → local_fft_id
blockIdx.x → block's batch offset
However, in the shared memory variants, the code performs the following remapping before doing the global memory I/O:
const unsigned int tid = threadIdx.x + FFT::working_group::block_dim().x * threadIdx.y;
const unsigned int tidx = tid / FFT::working_group::block_dim().y;
const unsigned int tidy = tid % FFT::working_group::block_dim().y;
unsigned int smem_index = tidx + tidy * FFT::working_group::block_dim().x;
I tried to work through it with AI, but I have yet to arrive at a clear explanation of this logical mapping.
What I still don't fully understand is:
- What is the concrete motivation for performing this logical mapping at all? In other words, what problem does it solve compared to just using
threadIdx.x / threadIdx.y directly for global memory indexing as in the non-smem variants?
- After the transpose, the smem index is computed as
tidx + tidy * block_dim().x. Then, after __syncthreads(), the read phase uses threadIdx.x + threadIdx.y * block_dim().x, i.e. the same formula. Could you walk through concretely what data each thread writes vs. reads, and why indexing the same smem layout with those two different identities (tidx/tidy vs. threadIdx.x/threadIdx.y) performs the intended transpose?
- Is my understanding of blockIdx.y correct — is it unused / always 0 in this kernel launch configuration, or does it carry meaning?
Thanks in advance for your time!