- CUDA Version: 13.0
- OS: Ubuntu 22.04.5 LTS
- K in GEMM: 7168
- N in GEMM: 160
Hello, I am trying to run FP32 GEMM on an SM100 GPU. I’ve noticed that when M is small (e.g., M = 128), cuBLASLt still seems to select an SM80 kernel, for example:
cutlass::Kernel2<cutlass_80_simt_sgemm_64x64_8x5_nn_align1>
Moreover, it is even slower than running the same FP32 matmul on an SM90 GPU. However, when M becomes larger (e.g., M = 2000), cuBLASLt switches to an SM100 kernel, such as:
cutlass3x_sm100_simt_sgemm_f32_f32_f32_f32_f32_64x32x16_1x1x1_3_nnn_align1_bias_f32_relu
My question is: are there any specific cuBLASLt heuristic or search configurations I should set to ensure that FP32 GEMM uses SM100 kernels even when M is small, in order to achieve better performance?
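For context, this is roughly how I query the heuristic. This is a minimal sketch (no error checking, no device allocations, column-major NN layouts assumed, and the problem sizes are the ones from this post); the workspace size is just an example value, since a larger `CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES` is one knob I understand can change which kernels the heuristic returns:

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Sizes from this post: small M, K = 7168, N = 160.
    int M = 128, N = 160, K = 7168;

    cublasLtHandle_t handle;
    cublasLtCreate(&handle);

    // FP32 GEMM: FP32 compute type, FP32 scale type.
    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    // Column-major layouts for A (MxK), B (KxN), C (MxN), NN case.
    cublasLtMatrixLayout_t a, b, c;
    cublasLtMatrixLayoutCreate(&a, CUDA_R_32F, M, K, M);
    cublasLtMatrixLayoutCreate(&b, CUDA_R_32F, K, N, K);
    cublasLtMatrixLayoutCreate(&c, CUDA_R_32F, M, N, M);

    // Preference object; the workspace limit here (32 MiB) is an
    // example value, not a recommendation from the docs.
    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    size_t ws = 32u * 1024 * 1024;
    cublasLtMatmulPreferenceSetAttribute(
        pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES, &ws, sizeof(ws));

    // Ask the heuristic for several candidate algorithms instead of one.
    const int kRequested = 8;
    cublasLtMatmulHeuristicResult_t results[kRequested];
    int found = 0;
    cublasLtMatmulAlgoGetHeuristic(handle, op, a, b, c, c,
                                   pref, kRequested, results, &found);
    printf("heuristic returned %d candidate algo(s)\n", found);

    // One could then time each candidate with cublasLtMatmul and keep
    // the fastest, rather than taking results[0].

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(c);
    cublasLtMatrixLayoutDestroy(b);
    cublasLtMatrixLayoutDestroy(a);
    cublasLtMatmulDescDestroy(op);
    cublasLtDestroy(handle);
    return 0;
}
```

Timing the returned candidates myself does let me pick a different kernel than `results[0]`, but I’d like to know whether there is a preference attribute or configuration that makes the heuristic itself prefer the SM100 kernels for small M.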