[Feature]: Compiler option for expanding s_waitcnt instructions or don’t merge them in the first place #67

@bbiiggppiigg

Suggestion Description

In the kernels we have studied, it is common for a single s_waitcnt instruction to wait for multiple load instructions to finish loading their values.

Once the PC-sampling feature is enabled on these kernels, we expect a non-negligible number of PC samples to be reported at the s_waitcnt instructions.

To figure out which load instruction might be the bottleneck, it would be nice to have a compiler option that expands a single s_waitcnt instruction with value N into a series of s_waitcnt instructions with decreasing values N+k, N+k-1, …, N, when the original instruction is waiting for k loads to complete.

Please note that we are NOT looking for the existing compiler option -amdgpu-waitcnt-forcezero that adds an s_waitcnt(0) after every instruction, as we still want to hide the memory load latency with compute instructions as much as possible.
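
To make the requested expansion concrete, here is a minimal Python sketch of the sequence such an option might emit. It is only an illustration of the idea, not an existing compiler pass: the function name `expand_waitcnt` is hypothetical, and the sketch assumes the loads are tracked by the `vmcnt` counter.

```python
def expand_waitcnt(n: int, k: int, counter: str = "vmcnt") -> list[str]:
    """Return the expanded wait sequence for one merged wait.

    n       -- value of the original (merged) s_waitcnt
    k       -- number of loads that the merged wait covers
    counter -- counter name; "vmcnt" is an assumption for vector memory loads
    """
    if n < 0 or k < 0:
        raise ValueError("n and k must be non-negative")
    # Count down from n + k to n, following the sequence described above.
    return [f"s_waitcnt {counter}({value})" for value in range(n + k, n - 1, -1)]


# Example: a merged "s_waitcnt vmcnt(0)" that waits for 3 loads.
for line in expand_waitcnt(n=0, k=3):
    print(line)
# s_waitcnt vmcnt(3)
# s_waitcnt vmcnt(2)
# s_waitcnt vmcnt(1)
# s_waitcnt vmcnt(0)
```

With one wait per outstanding load, PC samples that land on a particular s_waitcnt could be attributed to the load it is waiting on, while the compute instructions scheduled before the first wait still overlap with the memory latency.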

Operating System

No response

GPU

MI200 / MI250 / MI300

ROCm Component

No response

Metadata

Labels

Under Investigation; generic (Build error, or some other issue not caused by an LLVM bug)
