[Not-for-merge] Try to implement kernels along axis with one kernel for IndexAxis0 #668
qindazhu wants to merge 5 commits into k2-fsa:master
    return ans;
  }
  mgpu::context_t *mgpu_context = GetModernGpuAllocator(c);
  auto lambda_set_ans = [=] __device__(int32_t index, int32_t seg,
It would be nice to have some documentation of the args of this lambda.
The documentation of transform_lbs at
https://moderngpu.github.io/doc/api.html
unfortunately doesn't seem to explain this.
I know this is not for merge, but some explanation would still be nice.
Sure, mainly:
index is the index of the element (i.e. idx01),
seg is the row id (i.e. idx0),
rank is the index within the current seg/row (i.e. idx1).
I have also added the documentation in the code, thanks.
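To make the three lambda arguments concrete, here is a minimal host-side C++ sketch of the mapping described above. It is a CPU emulation written for illustration, not moderngpu code: `LbsArgs` and `EmulateTransformLbs` are hypothetical names, and the sketch assumes `row_splits` is non-empty, starts at 0, and has one more entry than there are rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record of the three arguments the lambda receives.
struct LbsArgs {
  int32_t index;  // position in the flattened elements (idx01)
  int32_t seg;    // row that contains the element (idx0)
  int32_t rank;   // position of the element inside that row (idx1)
};

// CPU emulation of the (index, seg, rank) triples for a ragged array.
// Assumption: row_splits is non-empty, row_splits[0] == 0, and
// row_splits.back() is the total number of elements.
std::vector<LbsArgs> EmulateTransformLbs(
    const std::vector<int32_t> &row_splits) {
  std::vector<LbsArgs> out;
  const int32_t count = row_splits.back();
  for (int32_t index = 0; index < count; ++index) {
    // Find the last row whose start is <= index; empty rows are skipped
    // because upper_bound moves past every entry equal to index.
    auto it = std::upper_bound(row_splits.begin(), row_splits.end(), index);
    const int32_t seg = static_cast<int32_t>(it - row_splits.begin()) - 1;
    const int32_t rank = index - row_splits[seg];
    out.push_back({index, seg, rank});
  }
  return out;
}
```

For example, with `row_splits = {0, 2, 2, 5}` (rows of length 2, 0 and 3), element 2 is the first element of row 2, so the emulation yields `seg == 2` and `rank == 0` for `index == 2`.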
OK, interesting.
This PR just shows the idea of implementing such kernels (e.g. in Append/Stack) with one kernel. There are still a lot of TODOs; I am opening the PR so that we can fix them later, as this task is not a P1 task currently.
Here are the benchmarks:
The header is
IndexAxis0_(New)_NumAxes_Dim0_NumElements_..._AverageTime
We can see that for large sizes the new approach is much slower than the old one. One main reason is that we do not process multiple elements in one thread as the old approach did (which saves the time spent indexing new_offsets and old_offsets). ModernGpu does not support this, so we would need to write the kernel ourselves. Another reason is that there are a few if branches in the kernel.
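The cost difference described above can be sketched on the CPU. This is a simplified illustration under stated assumptions, not the k2 kernels: `GatherPerElement` and `GatherPerRow` are hypothetical names, `old_offsets[seg]` is assumed to hold the start of output row `seg` in the source values, `new_offsets` is assumed to be the row splits of the output, and a loop iteration stands in for a GPU thread. The first variant reads both offset arrays for every element; the second reads them once and then copies several elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One "thread" per output element: each element reads new_offsets and
// old_offsets again, as in the single-kernel approach.
std::vector<int32_t> GatherPerElement(const std::vector<int32_t> &old_values,
                                      const std::vector<int32_t> &old_offsets,
                                      const std::vector<int32_t> &new_offsets) {
  std::vector<int32_t> ans(static_cast<std::size_t>(new_offsets.back()));
  int32_t seg = 0;
  for (int32_t index = 0; index < new_offsets.back(); ++index) {
    // Stand-in for the load-balancing search: advance to the row that
    // contains index, skipping empty rows.
    while (new_offsets[seg + 1] <= index) ++seg;
    const int32_t rank = index - new_offsets[seg];      // read new_offsets
    ans[index] = old_values[old_offsets[seg] + rank];   // read old_offsets
  }
  return ans;
}

// One "thread" per row: the offsets are read once and then reused for
// every element of the row, which amortizes the lookups.
std::vector<int32_t> GatherPerRow(const std::vector<int32_t> &old_values,
                                  const std::vector<int32_t> &old_offsets,
                                  const std::vector<int32_t> &new_offsets) {
  std::vector<int32_t> ans(static_cast<std::size_t>(new_offsets.back()));
  const int32_t num_rows = static_cast<int32_t>(new_offsets.size()) - 1;
  for (int32_t seg = 0; seg < num_rows; ++seg) {
    const int32_t new_begin = new_offsets[seg];
    const int32_t new_end = new_offsets[seg + 1];
    const int32_t old_begin = old_offsets[seg];
    for (int32_t i = new_begin; i < new_end; ++i)
      ans[i] = old_values[old_begin + (i - new_begin)];
  }
  return ans;
}
```

Both functions produce the same output; the second only changes how often the offsets are read. On a GPU the real trade-off also depends on load balance across rows of different lengths, which this sketch does not model.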