- CUDA Version: 13.0
- OS: Ubuntu 22.04.5 LTS
- K in GEMM: 7168
- N in GEMM: 160
Hello, I am trying to run FP32 GEMM on an SM100 GPU. I’ve noticed that when M is small (e.g., M = 128), cuBLASLt still seems to select an SM80 kernel, for example:
cutlass::Kernel2<cutlass_80_simt_sgemm_64x64_8x5_nn_align1>
Moreover, it is even slower than running the same FP32 matmul on an SM90 GPU. However, when M becomes larger (e.g., M = 2000), cuBLASLt switches to an SM100 kernel, such as:
cutlass3x_sm100_simt_sgemm_f32_f32_f32_f32_f32_64x32x16_1x1x1_3_nnn_align1_bias_f32_relu
My question is: are there any specific cuBLASLt heuristic or search configurations I should set to ensure that FP32 GEMM uses SM100 kernels even when M is small, in order to achieve better performance?
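For context, this is roughly how I query the heuristic. This is a minimal sketch (no error checking, no device allocations, column-major NN layouts assumed, and the problem sizes are the ones from this post); the workspace size is just an example value, since a larger `CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES` is one knob I understand can change which kernels the heuristic returns:

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Sizes from this post: small M, K = 7168, N = 160.
    int M = 128, N = 160, K = 7168;

    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    // FP32 GEMM: FP32 compute type, FP32 scale type.
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    // Column-major layouts for A (MxK), B (KxN), C (MxN), NN case.
    cublasLtMatrixLayout_t a, b, c;
    cublasLtMatrixLayoutCreate(&a, CUDA_R_32F, M, K, M);
    cublasLtMatrixLayoutCreate(&b, CUDA_R_32F, K, N, K);
    cublasLtMatrixLayoutCreate(&c, CUDA_R_32F, M, N, M);

    // Preference object; the workspace limit here (32 MiB) is an
    // example value, not a recommendation from the docs.
    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    size_t ws = 32u * 1024 * 1024;
    cublasLtMatmulPreferenceSetAttribute(
        pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES, &ws, sizeof(ws));

    // Ask the heuristic for several candidate algorithms instead of one.
    const int kRequested = 8;
    cublasLtMatmulHeuristicResult_t results[kRequested];
    int found = 0;
    cublasLtMatmulAlgoGetHeuristic(handle, op, a, b, c, c,
                                   pref, kRequested, results, &found);
    printf("heuristic returned %d candidate algo(s)\n", found);

    // One could then time each candidate with cublasLtMatmul and keep
    // the fastest, rather than taking results[0].

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(c);
    cublasLtMatrixLayoutDestroy(b);
    cublasLtMatrixLayoutDestroy(a);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(handle);
    return 0;
}
```

Timing the returned candidates myself does let me pick a different kernel than `results[0]`, but I’d like to know whether there is a preference attribute or configuration that makes the heuristic itself prefer the SM100 kernels for small M.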