@jayshah1819
Contributor

  • Removed the old sortBump counter because its atomicAdd was slowing the
    kernel down a lot. Now the workgroup ID just comes from builtinsUniform.wgid.x,
    so we don’t hit any global atomics. Also clear hist/passHist before each run.

  • This improves the bandwidth shown in the graph below.

  • Fixed the plot's stroke=timing setting so the GPU and CPU lines render properly; expanded the input range and the number of trials.

[Image: Screenshot 2025-12-01 at 5 12 10 PM]

@jowens
Collaborator

jowens commented Dec 1, 2025

> Now the workgroup ID just comes from builtinsUniform.wgid.x, so we don’t hit any global atomics.

We can't do this; it could lock up the machine. We have no guarantee that the order in which workgroups are issued is correlated in any way with wgid.

@jowens
Collaborator

jowens commented Dec 1, 2025

Consider the case where the hardware issued all the workgroups with the largest wgids first, and the workgroups with small wgids had not been issued yet and were waiting for the larger ones to finish. If we didn't have a fallback in this code, it would lock up for sure (because the resident workgroups would be waiting for workgroup 0 to finish, but workgroup 0 would not have launched yet). This particular code will not lock up, because we do have a fallback, but if we fall back, performance will be terrible.
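
To make the scenario above concrete, here is a toy model in Python (not WGSL, and not code from this PR). It assumes a strictly chained pass in which tile i can only retire after tile i-1 has retired, a fixed number of resident slots, and no fallback path. The function name `run_chained`, the slot count, and the scheduler model are all hypothetical. With `dynamic_ids=True`, each workgroup takes its tile ID from a shared counter when it launches, which is a rough stand-in for what an atomicAdd-based counter such as sortBump provides.

```python
from collections import deque


def run_chained(issue_order, max_resident, dynamic_ids):
    """Toy model: tile i may retire only after tile i-1 has retired.

    issue_order  -- order in which the (hypothetical) scheduler launches wgids
    max_resident -- number of workgroups that can be resident at once
    dynamic_ids  -- True: tile ID is a ticket taken from a shared counter at
                    launch (models an atomicAdd); False: tile ID is the wgid
    Returns True if every tile retires, False if the model deadlocks.
    """
    pending = deque(issue_order)
    resident = set()   # tile IDs currently holding a slot
    finished = set()   # tile IDs that have retired
    next_ticket = 0    # stand-in for the global atomic counter
    total = len(issue_order)

    while len(finished) < total:
        # Launch workgroups until all resident slots are full.
        while pending and len(resident) < max_resident:
            wgid = pending.popleft()
            if dynamic_ids:
                tile = next_ticket
                next_ticket += 1
            else:
                tile = wgid
            resident.add(tile)

        # A resident tile can retire if it is tile 0 or its predecessor retired.
        ready = [t for t in resident if t == 0 or (t - 1) in finished]
        if not ready:
            # Every resident workgroup waits on a tile that has not launched,
            # and no slot will ever free up to launch it.
            return False
        for t in ready:
            resident.remove(t)
            finished.add(t)

    return True


if __name__ == "__main__":
    largest_first = list(reversed(range(8)))
    print("static IDs, largest wgid first:", run_chained(largest_first, 3, False))
    print("ticket IDs, largest wgid first:", run_chained(largest_first, 3, True))
```

In this model, static IDs only make progress when the issue order happens to cooperate, while ticket IDs make progress under any issue order, because the first workgroup to become resident is by construction tile 0.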

@jowens
Collaborator

jowens commented Dec 1, 2025

tl;dr: We need to make this work with sortBump.

@jowens
Collaborator

jowens commented Dec 1, 2025

See: first paragraph of 2.3.3: https://escholarship.org/uc/item/0bk9z4bt
