Question about SM100 fp32 GEMM kernel in cublasLtMatmul #309

@Wanglongzhi2001

Description

  • CUDA Version: 13.0
  • OS: Ubuntu 22.04.5 LTS
  • K in GEMM: 7168
  • N in GEMM: 160

Hello, I am trying to run FP32 GEMM on an SM100 GPU. I’ve noticed that when M is small (e.g., M = 128), cuBLASLt still seems to select an SM80 kernel, for example:

cutlass::Kernel2<cutlass_80_simt_sgemm_64x64_8x5_nn_align1>

Moreover, it is even slower than running the same FP32 matmul on an SM90 GPU. However, when M becomes larger (e.g., M = 2000), cuBLASLt switches to an SM100 kernel, such as:

cutlass3x_sm100_simt_sgemm_f32_f32_f32_f32_f32_64x32x16_1x1x1_3_nnn_align1_bias_f32_relu

My question is: are there any specific cuBLASLt heuristic or search configurations I should set so that FP32 GEMM uses SM100 kernels even when M is small, in order to achieve better performance?
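For reference, one way to inspect what the heuristic is doing is to request several candidate algorithms from `cublasLtMatmulAlgoGetHeuristic` and pass a chosen `algo` to `cublasLtMatmul` explicitly, rather than accepting the top-ranked default. The sketch below (error handling elided, shapes taken from the question above, 32 MiB workspace as an arbitrary assumption) shows the enumeration; each returned candidate could then be timed to pick the fastest kernel for this small-M shape:

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Shape from the question: M = 128, N = 160, K = 7168, FP32, "nn" layout.
    const int M = 128, N = 160, K = 7168;

    cublasLtHandle_t lt;
    cublasLtCreate(&lt);

    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    // Column-major layouts; leading dimensions equal the row counts.
    cublasLtMatrixLayout_t a, b, c;
    cublasLtMatrixLayoutCreate(&a, CUDA_R_32F, M, K, M);
    cublasLtMatrixLayoutCreate(&b, CUDA_R_32F, K, N, K);
    cublasLtMatrixLayoutCreate(&c, CUDA_R_32F, M, N, M);

    // A larger workspace limit lets the heuristic consider more kernels;
    // 32 MiB here is just an illustrative choice.
    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    size_t ws = 32ull << 20;
    cublasLtMatmulPreferenceSetAttribute(
        pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES, &ws, sizeof(ws));

    // Ask for several candidates instead of only the top-ranked one.
    const int kRequested = 8;
    cublasLtMatmulHeuristicResult_t results[kRequested];
    int returned = 0;
    cublasLtMatmulAlgoGetHeuristic(lt, op, a, b, c, c, pref,
                                   kRequested, results, &returned);

    printf("heuristic returned %d candidate algos\n", returned);
    // Each results[i].algo can be passed as the `algo` argument of
    // cublasLtMatmul and benchmarked; the winner can then be cached
    // and used for this problem shape.

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(c);
    cublasLtMatrixLayoutDestroy(b);
    cublasLtMatrixLayoutDestroy(a);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(lt);
    return 0;
}
```

This only changes which candidates are surfaced and tried; whether an SM100 SIMT kernel appears among them for small M still depends on the library's internal ranking.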
