Hi, it's great that the kernels support prefill and generation in the same round, and I'd expect that to give better performance.
However, since most inference/serving frameworks are Python-based, the C++-only architecture limits how widely the project can be adopted. Is there any plan to wrap it with pybind11 so that the kernels can be used from PyTorch?