[Feature Request] Is there any plan to provide a Python wrapper for the CUDA kernels?

Hi, it's great that the kernels support prefill and generation in the same round, and I would expect that to give better performance.

However, since most inference/serving frameworks are Python-based, the C++-only architecture limits where the project can be used. Is there any plan to wrap it with pybind11 so that the kernels can be called from PyTorch?