diff --git a/docs/source/builder/agents-guide.md b/docs/source/builder/agents-guide.md
index 7753cdcc..20e9214b 100644
--- a/docs/source/builder/agents-guide.md
+++ b/docs/source/builder/agents-guide.md
@@ -1,6 +1,6 @@
 # Develop kernels with agents
 
-Code agents are a good fit to build custom kernels because the hard part is not just writing in Domain Specific Language (DSLs) like CUDA. You also need the right project layout, PyTorch bindings, architecture-specific choices, model-specific integration, and trustworthy benchmarks. 
+Code agents are a good fit for building custom kernels because the hard part is not just writing in Domain Specific Languages (DSLs) like CUDA. You also need the right project layout, PyTorch bindings, architecture-specific choices, model-specific integration, and trustworthy benchmarks.
 
 Kernels on Hugging Face are compatible with agents via skills and the `hf` CLI. The `cuda-kernels`, `rocm-kernels`, `xpu-kernels`, and `cpu-kernels` skills contain knowledge so an agent can generate and publish a complete kernel project, instead of isolated snippets.
 
@@ -10,7 +10,7 @@ This guide is for **authoring new kernels**. If you only want to **load an exist
 
 You need:
 
-- a coding agent that supports skills, such as Claude Code, Codex, Cursor, or OpenCode  
+- a coding agent that supports skills, such as Claude Code, Codex, Cursor, or OpenCode
 - a clear target: library, model, operation, GPU, dtype, and representative shapes
 
 The skill currently focuses on NVIDIA GPUs such as **H100**, **A100**, and **T4**, and on integration patterns for **transformers** and **diffusers**.
@@ -31,14 +31,14 @@ kernel-builder skills add --claude
 
 Writing kernels is a hard problem, so be specific to agents. A robust prompt will declare all core attributes, including:
 
-- the library, for example `transformers` or `diffusers`  
-- the model id, for example `Qwen3-8B` or `LTX-Video`  
-- the operation, for example `RMSNorm`, attention, RoPE, `GEGLU`, or `AdaLN`  
-- the target GPU, for example `H100`, `A100`, or `T4`  
-- the dtype, for example `bfloat16`, `float16`, or `float32`  
+- the library, for example `transformers` or `diffusers`
+- the model id, for example `Qwen3-8B` or `LTX-Video`
+- the operation, for example `RMSNorm`, attention, RoPE, `GEGLU`, or `AdaLN`
+- the target GPU, for example `H100`, `A100`, or `T4`
+- the dtype, for example `bfloat16`, `float16`, or `float32`
 - the outputs you expect: kernel code, bindings, tests, and benchmarks
 
-In practice, you can often skip some of these and the agent will infer based on common practice, but if you know a detail declare it. 
+In practice, you can often skip some of these and the agent will infer them from common practice, but if you know a detail, declare it.
 
 For example:
 
@@ -69,7 +69,7 @@ examples/your_model/
 │   └── torch_binding.cpp       # PyTorch C++ bindings
 ├── benchmark_rmsnorm.py        # Micro-benchmark script
 ├── build.toml                  # kernel-builder config
-├── setup.py                    # pip install -e .
+├── setup.py                    # python setup.py build_kernel
 └── pyproject.toml
 ```
 
@@ -105,15 +105,15 @@ cuda-capabilities = ["9.0"]  # H100
 
 First check that:
 
-- `backends = ["cuda"]` is correct for your project  
-- the kernel source files are listed correctly  
-- the Torch binding sources are included under `[torch]`  
+- `backends = ["cuda"]` is correct for your project
+- the kernel source files are listed correctly
+- the Torch binding sources are included under `[torch]`
 - `cuda-capabilities` is only set when the kernel truly targets specific architectures
 
 For architecture-specific kernels, typical capability values are:
 
-- H100: `9.0`  
-- A100: `8.0`  
+- H100: `9.0`
+- A100: `8.0`
 - T4: `7.5`
 
 If the kernel does **not** require a specific capability, the kernels docs recommend leaving `cuda-capabilities` unset so the builder can target all supported capabilities. In practice, you can prompt your agent to review the `build.toml` for excessive definitions. Agents have a tendency to over-specify capabilities.
@@ -126,7 +126,7 @@ The kernel should be registered as Torch ops in `torch-ext/torch_binding.cpp`, w
 
 Make sure the integration matches the library:
 
-- **transformers**: patch the target modules directly, often RMSNorm modules whose class name contains `RMSNorm`  
+- **transformers**: patch the target modules directly, often RMSNorm modules whose class name contains `RMSNorm`
 - **diffusers**: inspect the actual pipeline structure before patching, because modules and attention processors can differ across pipelines
 
 > [!NOTE]
@@ -134,9 +134,9 @@ Make sure the integration matches the library:
 
 A few patterns matter in practice for the integration code:
 
-- In **transformers**, RMSNorm modules generally have weights, but epsilon may be exposed as `variance_epsilon` or `eps` depending on the model.  
-- In **diffusers**, some RMSNorm modules may have `weight=None`, so the integration code needs to handle both weighted and unweighted cases.  
-- In **diffusers**, checking `type(module).__name__` is often more reliable than `isinstance(...)` for matching RMSNorm modules across implementations.  
+- In **transformers**, RMSNorm modules generally have weights, but epsilon may be exposed as `variance_epsilon` or `eps` depending on the model.
+- In **diffusers**, some RMSNorm modules may have `weight=None`, so the integration code needs to handle both weighted and unweighted cases.
+- In **diffusers**, checking `type(module).__name__` is often more reliable than `isinstance(...)` for matching RMSNorm modules across implementations.
 - If a diffusers pipeline uses CPU offloading, inject custom kernels **before** enabling offload.
 
 For attention, prefer the model library's existing optimized path when one already exists. For example, in `transformers`, Flash Attention 2 is usually the right baseline for attention, while custom kernels are especially useful for operations like RMSNorm and other targeted hotspots.
@@ -160,22 +160,22 @@ nix run nixpkgs#cachix -- use huggingface
 
 There are two main benchmarks to consider:
 
-1. an isolated kernel micro-benchmark  
+1. an isolated kernel micro-benchmark
 2. an end-to-end benchmark in the real model or pipeline
 
 The agent will generate both benchmarks based on the agent skills examples. Typically as a script called `benchmark_example.py`. If you have access to the target hardware, you can run it to verify the kernel works. For example, the agent will generat a table like this:
 
 ```markdown
-| Shape | Custom (ms) | PyTorch (ms) | Speedup |
-| :---- | :---: | :---: | :---: |
-| [1x128x4096] | 0.040 | 0.062 | **1.58x** |
-| [1x512x4096] | 0.038 | 0.064 | **1.69x** |
-| [1x1024x4096] | 0.037 | 0.071 | **1.90x** |
-| [1x2048x4096] | 0.045 | 0.091 | **2.03x** |
-| [1x4096x4096] | 0.071 | 0.150 | **2.12x** |
-| [4x512x4096] | 0.056 | 0.093 | **1.67x** |
-| [8x256x4096] | 0.045 | 0.092 | **2.06x** |
-| [1x8192x4096] | 0.109 | 0.269 | **2.47x** |
+| Shape         | Custom (ms) | PyTorch (ms) |  Speedup  |
+| :------------ | :---------: | :----------: | :-------: |
+| [1x128x4096]  |    0.040    |    0.062     | **1.58x** |
+| [1x512x4096]  |    0.038    |    0.064     | **1.69x** |
+| [1x1024x4096] |    0.037    |    0.071     | **1.90x** |
+| [1x2048x4096] |    0.045    |    0.091     | **2.03x** |
+| [1x4096x4096] |    0.071    |    0.150     | **2.12x** |
+| [4x512x4096]  |    0.056    |    0.093     | **1.67x** |
+| [8x256x4096]  |    0.045    |    0.092     | **2.06x** |
+| [1x8192x4096] |    0.109    |    0.269     | **2.47x** |
 ```
 
 Interpret the results carefully. A kernel can show a large isolated speedup but only a modest end-to-end gain if that operation is a small fraction of total runtime. In the LTX-Video example from [the blog we wrote](https://huggingface.co/blog/custom-cuda-kernels-agent-skills), the generated RMSNorm kernel improved the isolated benchmark by about **1.88x** on average, but end-to-end video generation improved by about **6%**, which matched the fact that RMSNorm accounted for only a small share of total compute.
@@ -186,7 +186,7 @@ Once the project is correct and benchmarked, you can build Hub-compatible artifa
 
 ```shell
 # install the hf CLI tool
-hf skills add 
+hf skills add
 
 # Authenticate
 hf auth login
@@ -216,4 +216,5 @@ from kernels import get_kernel
 kernel = get_kernel("your-org/your-kernel", version=1)
 ```
 
-Well done! You have now built a custom kernel and published it to the Hub.
\ No newline at end of file
+Well done! You have now built a custom kernel and published it to the Hub.
+
diff --git a/docs/source/builder/build.md b/docs/source/builder/build.md
index 8632eff9..cc2c033d 100644
--- a/docs/source/builder/build.md
+++ b/docs/source/builder/build.md
@@ -69,25 +69,27 @@ for monitoring the build. The compiled kernel will then be in the local
 
 `kernel-builder` provides shells for developing kernels. In such a shell,
 all required dependencies are available, as well as `kernel-builder` for generating
-project files. For example:
+project files. For example, you can use the development shell to build an
+arch (AOT-compiled) kernel:
 
 ```bash
 $ kernel-builder devshell
 # A devshell is opened in which you can run the following commands:
 $ kernel-builder create-pyproject
 $ cmake -B build-ext
-$ cmake --build build-ext
+$ cmake --build build-ext --target local_install
 ```
 
-If you want to test the kernel as a Python package, you can do so.
-`kernel-builder devshell` will automatically create a virtual environment in
-the `.venv` and activate it. You can install the kernel as a regular
-Python package in this virtual environment:
+This builds the kernel and puts the output in the `build` directory, where
+it can be used with the `kernels` library.
+
+Noarch (JIT-compiled) kernels do not use CMake. For this reason, we also
+create a `setup.py` that works for both arch and noarch kernels:
 
 ```bash
 $ kernel-builder devshell
 $ kernel-builder create-pyproject
-$ pip install --no-build-isolation -e .
+$ python setup.py build_kernel
 ```
 
 Development shells are available for every build configuration. For
diff --git a/docs/source/builder/local-dev.md b/docs/source/builder/local-dev.md
index df07b847..b23dd528 100644
--- a/docs/source/builder/local-dev.md
+++ b/docs/source/builder/local-dev.md
@@ -27,12 +27,24 @@ $ kernel-builder create-pyproject -f
 The `-f` flag is optional and instructs `kernel-builder` to overwrite
 existing files.
 
-It is recommended to do an editable install of the generated project into
-your Python virtual environment for development:
+You can build the kernel with:
 
 ```bash
-$ pip install wheel # Needed once to enable bdist_wheel.
-$ pip install --no-build-isolation -e .
+$ python setup.py build_kernel
+```
+
+This builds the kernel and puts the variant that is compatible with the build
+environment in `build`. The build can then be loaded directly with `kernels`:
+
+```shell
+$ python -c 'import pathlib; import kernels; k = kernels.get_local_kernel(pathlib.Path("build")); print(k)'
+```
+
+For AOT kernels, if you want to skip the CMake configuration step in subsequent
+builds, you can also run Ninja directly to do incremental builds:
+
+```shell
+$ ninja -C _cmake_build local_install
 ```
 
 You can also create a Python project for a kernel in another directory:
diff --git a/docs/source/builder/writing-kernels.md b/docs/source/builder/writing-kernels.md
index 666cb739..db610ca6 100644
--- a/docs/source/builder/writing-kernels.md
+++ b/docs/source/builder/writing-kernels.md
@@ -409,10 +409,35 @@ def relu_fwd_fake(input: torch.Tensor) -> torch.Tensor:
 
 ## Kernel tests
 
-Kernel tests are stored in the `tests` directory. Since running all
-kernel tests in CI may be prohibitively expensive, the `pyproject.toml`
-generated by the builder adds support for the special `kernels_ci`
-PyTest marker that can be used as follows:
+### Use `get_kernel` in tests
+
+Kernel tests are stored in the `tests` directory. Tests must not import the
+kernel directly; instead, load it with `get_kernel` so that the kernel is
+tested the way it will be used. For example:
+
+```python
+import kernels
+import torch
+import torch.nn.functional as F
+
+relu = kernels.get_kernel("kernels-community/relu", version=1)
+
+def test_relu():
+    x = torch.randn(1024, 1024, dtype=torch.float32, device=torch.device("cuda"))
+    y = relu.relu(x, torch.empty_like(x))
+    y_ref = F.relu(x)
+    torch.testing.assert_close(y_ref, y)
+```
+
+Development shells (`kernel-builder devshell`/`kernel-builder testshell`)
+set the `LOCAL_KERNELS` environment variable so that the kernel is loaded
+from the development environment.
+
+### Mark CI tests
+
+Since running all kernel tests in CI may be prohibitively expensive, the
+`pyproject.toml` generated by the builder adds support for the special
+`kernels_ci` PyTest marker that can be used as follows:
 
 ```python
 import pytest
diff --git a/kernel-builder/src/pyproject/templates/torch/setup.py b/kernel-builder/src/pyproject/templates/torch/setup.py
index 57376b94..f7e7fe24 100644
--- a/kernel-builder/src/pyproject/templates/torch/setup.py
+++ b/kernel-builder/src/pyproject/templates/torch/setup.py
@@ -6,6 +6,7 @@
 from pathlib import Path
 
 from setuptools import Extension, find_packages, setup
+from setuptools.command.build import build
 from setuptools.command.build_ext import build_ext
 
 logger = logging.getLogger(__name__)
@@ -39,6 +40,78 @@ def is_ninja_available() -> bool:
     return which("ninja") is not None
 
 
+def _make_cmake_args(cfg: str) -> tuple[list[str], list[str]]:
+    """Build CMake and build arguments from the current environment."""
+    cmake_generator = os.environ.get("CMAKE_GENERATOR", "")
+
+    cmake_args = [
+        f"-DPython3_EXECUTABLE={sys.executable}",
+        f"-DCMAKE_BUILD_TYPE={cfg}",  # not used on MSVC, but no harm
+    ]
+    build_args: list[str] = []
+
+    if "CMAKE_ARGS" in os.environ:
+        cmake_args += [item for item in os.environ["CMAKE_ARGS"].split(" ") if item]
+
+    if not cmake_generator or cmake_generator == "Ninja":
+        try:
+            import ninja
+
+            ninja_executable_path = Path(ninja.BIN_DIR) / "ninja"
+            cmake_args += [
+                "-GNinja",
+                f"-DCMAKE_MAKE_PROGRAM:FILEPATH={ninja_executable_path}",
+            ]
+        except ImportError:
+            pass
+
+    if is_sccache_available():
+        cmake_args += [
+            "-DCMAKE_C_COMPILER_LAUNCHER=sccache",
+            "-DCMAKE_CXX_COMPILER_LAUNCHER=sccache",
+            "-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache",
+            "-DCMAKE_HIP_COMPILER_LAUNCHER=sccache",
+            "-DCMAKE_OBJC_COMPILER_LAUNCHER=sccache",
+            "-DCMAKE_OBJCXX_COMPILER_LAUNCHER=sccache",
+        ]
+    elif is_ccache_available():
+        cmake_args += [
+            "-DCMAKE_C_COMPILER_LAUNCHER=ccache",
+            "-DCMAKE_CXX_COMPILER_LAUNCHER=ccache",
+            "-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache",
+            "-DCMAKE_HIP_COMPILER_LAUNCHER=ccache",
+            "-DCMAKE_OBJC_COMPILER_LAUNCHER=ccache",
+            "-DCMAKE_OBJCXX_COMPILER_LAUNCHER=ccache",
+        ]
+
+    num_jobs = os.getenv("MAX_JOBS", None)
+    if num_jobs is not None:
+        num_jobs = int(num_jobs)
+        logger.info("Using MAX_JOBS=%d as the number of jobs.", num_jobs)
+    else:
+        try:
+            # os.sched_getaffinity() isn't universally available, so fall
+            #  back to os.cpu_count() if we get an error here.
+            num_jobs = len(os.sched_getaffinity(0))
+        except AttributeError:
+            num_jobs = os.cpu_count()
+
+    nvcc_threads = os.getenv("NVCC_THREADS", None)
+    if nvcc_threads is not None:
+        nvcc_threads = int(nvcc_threads)
+        logger.info(
+            "Using NVCC_THREADS=%d as the number of nvcc threads.", nvcc_threads
+        )
+        num_jobs = max(1, num_jobs // nvcc_threads)
+        cmake_args += ["-DNVCC_THREADS={}".format(nvcc_threads)]
+
+    build_args += [f"-j{num_jobs}"]
+    if sys.platform == "win32":
+        build_args += ["--config", cfg]
+
+    return cmake_args, build_args
+
+
 class CMakeExtension(Extension):
     def __init__(self, name: str, sourcedir: str = "") -> None:
         super().__init__(name, sources=[], py_limited_api=True)
@@ -53,85 +126,20 @@ def build_extension(self, ext: CMakeExtension) -> None:
         debug = int(os.environ.get("DEBUG", 0)) if self.debug is None else self.debug
         cfg = "Debug" if debug else "Release"
 
-        cmake_generator = os.environ.get("CMAKE_GENERATOR", "")
-
-        # Set Python3_EXECUTABLE instead if you use PYBIND11_FINDPYTHON
-        # EXAMPLE_VERSION_INFO shows you how to pass a value into the C++ code
-        # from Python.
-        cmake_args = [
-            f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={extdir}{os.sep}",
-            f"-DPython3_EXECUTABLE={sys.executable}",
-            f"-DCMAKE_BUILD_TYPE={cfg}",  # not used on MSVC, but no harm
-        ]
-        build_args = []
-        if "CMAKE_ARGS" in os.environ:
-            cmake_args += [item for item in os.environ["CMAKE_ARGS"].split(" ") if item]
-
-        if not cmake_generator or cmake_generator == "Ninja":
-            try:
-                import ninja
-
-                ninja_executable_path = Path(ninja.BIN_DIR) / "ninja"
-                cmake_args += [
-                    "-GNinja",
-                    f"-DCMAKE_MAKE_PROGRAM:FILEPATH={ninja_executable_path}",
-                ]
-            except ImportError:
-                pass
-
-        if is_sccache_available():
-            cmake_args += [
-                "-DCMAKE_C_COMPILER_LAUNCHER=sccache",
-                "-DCMAKE_CXX_COMPILER_LAUNCHER=sccache",
-                "-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache",
-                "-DCMAKE_HIP_COMPILER_LAUNCHER=sccache",
-                "-DCMAKE_OBJC_COMPILER_LAUNCHER=sccache",
-                "-DCMAKE_OBJCXX_COMPILER_LAUNCHER=sccache",
-            ]
-        elif is_ccache_available():
-            cmake_args += [
-                "-DCMAKE_C_COMPILER_LAUNCHER=ccache",
-                "-DCMAKE_CXX_COMPILER_LAUNCHER=ccache",
-                "-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache",
-                "-DCMAKE_HIP_COMPILER_LAUNCHER=ccache",
-                "-DCMAKE_OBJC_COMPILER_LAUNCHER=ccache",
-                "-DCMAKE_OBJCXX_COMPILER_LAUNCHER=ccache",
-            ]
-
-        num_jobs = os.getenv("MAX_JOBS", None)
-        if num_jobs is not None:
-            num_jobs = int(num_jobs)
-            logger.info("Using MAX_JOBS=%d as the number of jobs.", num_jobs)
-        else:
-            try:
-                # os.sched_getaffinity() isn't universally available, so fall
-                #  back to os.cpu_count() if we get an error here.
-                num_jobs = len(os.sched_getaffinity(0))
-            except AttributeError:
-                num_jobs = os.cpu_count()
-
-        nvcc_threads = os.getenv("NVCC_THREADS", None)
-        if nvcc_threads is not None:
-            nvcc_threads = int(nvcc_threads)
-            logger.info(
-                "Using NVCC_THREADS=%d as the number of nvcc threads.", nvcc_threads
-            )
-            num_jobs = max(1, num_jobs // nvcc_threads)
-            cmake_args += ["-DNVCC_THREADS={}".format(nvcc_threads)]
-
-        build_args += [f"-j{num_jobs}"]
-        if sys.platform == "win32":
-            build_args += ["--config", cfg]
+        cmake_args, build_args = _make_cmake_args(cfg)
+        cmake_args = [f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={extdir}{os.sep}"] + cmake_args
 
         build_temp = Path(self.build_temp) / ext.name
         if not build_temp.exists():
             build_temp.mkdir(parents=True)
 
         subprocess.run(
-            ["cmake", ext.sourcedir, *cmake_args], cwd=build_temp, check=True
+            ["cmake", "-S", ext.sourcedir, "-B", str(build_temp), *cmake_args],
+            cwd=build_temp,
+            check=True,
         )
         subprocess.run(
-            ["cmake", "--build", ".", *build_args], cwd=build_temp, check=True
+            ["cmake", "--build", str(build_temp), *build_args], cwd=build_temp, check=True
         )
 
         if sys.platform == "win32":
@@ -140,6 +148,41 @@ def build_extension(self, ext: CMakeExtension) -> None:
                 move(extdir / cfg / filename, extdir / filename)
 
 
+class BuildKernel(build):
+    """Custom command to build and locally install the kernel."""
+
+    description = "Build the kernel and install via the local_install CMake target"
+    user_options = []
+
+    def initialize_options(self) -> None:
+        super().initialize_options()
+
+    def finalize_options(self) -> None:
+        super().finalize_options()
+
+    def run(self) -> None:
+        project_root = Path(__file__).parent
+
+        debug = int(os.environ.get("DEBUG", 0))
+        cfg = "Debug" if debug else "Release"
+
+        cmake_args, build_args = _make_cmake_args(cfg)
+
+        build_temp = project_root / "_cmake_build"
+        build_temp.mkdir(parents=True, exist_ok=True)
+
+        subprocess.run(
+            ["cmake", "-S", str(project_root), "-B", str(build_temp), *cmake_args],
+            cwd=project_root,
+            check=True,
+        )
+        subprocess.run(
+            ["cmake", "--build", str(build_temp), "--target", "local_install", *build_args],
+            cwd=project_root,
+            check=True,
+        )
+
+
 backend = get_backend()
 ops_name = f"_{{ kernel_name }}_{backend}_{{ kernel_unique_id }}"
 
@@ -148,7 +191,7 @@ def build_extension(self, ext: CMakeExtension) -> None:
     # The version is just a stub, it's not used by the final build artefact.
     version="0.1.0",
     ext_modules=[CMakeExtension(f"{{ python_name }}.{ops_name}")],
-    cmdclass={"build_ext": CMakeBuild},
+    cmdclass={"build_ext": CMakeBuild, "build_kernel": BuildKernel},
     packages=find_packages(where="torch-ext", include=["{{ python_name }}*"]),
     package_dir={"": "torch-ext"},
 {% if data_globs %}
diff --git a/kernel-builder/src/pyproject/templates/tvm_ffi/setup.py b/kernel-builder/src/pyproject/templates/tvm_ffi/setup.py
index b4aaf2d8..ae19466b 100644
--- a/kernel-builder/src/pyproject/templates/tvm_ffi/setup.py
+++ b/kernel-builder/src/pyproject/templates/tvm_ffi/setup.py
@@ -7,6 +7,7 @@
 from pathlib import Path
 
 from setuptools import Extension, find_packages, setup
+from setuptools.command.build import build
 from setuptools.command.build_ext import build_ext
 
 logger = logging.getLogger(__name__)
@@ -34,6 +35,78 @@ def is_ninja_available() -> bool:
     return which("ninja") is not None
 
 
+def _make_cmake_args(cfg: str) -> tuple[list[str], list[str]]:
+    """Build CMake and build arguments from the current environment."""
+    cmake_generator = os.environ.get("CMAKE_GENERATOR", "")
+
+    cmake_args = [
+        f"-DPython3_EXECUTABLE={sys.executable}",
+        f"-DCMAKE_BUILD_TYPE={cfg}",  # not used on MSVC, but no harm
+    ]
+    build_args: list[str] = []
+
+    if "CMAKE_ARGS" in os.environ:
+        cmake_args += [item for item in os.environ["CMAKE_ARGS"].split(" ") if item]
+
+    if not cmake_generator or cmake_generator == "Ninja":
+        try:
+            import ninja
+
+            ninja_executable_path = Path(ninja.BIN_DIR) / "ninja"
+            cmake_args += [
+                "-GNinja",
+                f"-DCMAKE_MAKE_PROGRAM:FILEPATH={ninja_executable_path}",
+            ]
+        except ImportError:
+            pass
+
+    if is_sccache_available():
+        cmake_args += [
+            "-DCMAKE_C_COMPILER_LAUNCHER=sccache",
+            "-DCMAKE_CXX_COMPILER_LAUNCHER=sccache",
+            "-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache",
+            "-DCMAKE_HIP_COMPILER_LAUNCHER=sccache",
+            "-DCMAKE_OBJC_COMPILER_LAUNCHER=sccache",
+            "-DCMAKE_OBJCXX_COMPILER_LAUNCHER=sccache",
+        ]
+    elif is_ccache_available():
+        cmake_args += [
+            "-DCMAKE_C_COMPILER_LAUNCHER=ccache",
+            "-DCMAKE_CXX_COMPILER_LAUNCHER=ccache",
+            "-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache",
+            "-DCMAKE_HIP_COMPILER_LAUNCHER=ccache",
+            "-DCMAKE_OBJC_COMPILER_LAUNCHER=ccache",
+            "-DCMAKE_OBJCXX_COMPILER_LAUNCHER=ccache",
+        ]
+
+    num_jobs = os.getenv("MAX_JOBS", None)
+    if num_jobs is not None:
+        num_jobs = int(num_jobs)
+        logger.info("Using MAX_JOBS=%d as the number of jobs.", num_jobs)
+    else:
+        try:
+            # os.sched_getaffinity() isn't universally available, so fall
+            #  back to os.cpu_count() if we get an error here.
+            num_jobs = len(os.sched_getaffinity(0))
+        except AttributeError:
+            num_jobs = os.cpu_count()
+
+    nvcc_threads = os.getenv("NVCC_THREADS", None)
+    if nvcc_threads is not None:
+        nvcc_threads = int(nvcc_threads)
+        logger.info(
+            "Using NVCC_THREADS=%d as the number of nvcc threads.", nvcc_threads
+        )
+        num_jobs = max(1, num_jobs // nvcc_threads)
+        cmake_args += ["-DNVCC_THREADS={}".format(nvcc_threads)]
+
+    build_args += [f"-j{num_jobs}"]
+    if sys.platform == "win32":
+        build_args += ["--config", cfg]
+
+    return cmake_args, build_args
+
+
 class CMakeExtension(Extension):
     def __init__(self, name: str, sourcedir: str = "") -> None:
         super().__init__(name, sources=[], py_limited_api=False)
@@ -48,85 +121,20 @@ def build_extension(self, ext: CMakeExtension) -> None:
         debug = int(os.environ.get("DEBUG", 0)) if self.debug is None else self.debug
         cfg = "Debug" if debug else "Release"
 
-        cmake_generator = os.environ.get("CMAKE_GENERATOR", "")
-
-        # Set Python3_EXECUTABLE instead if you use PYBIND11_FINDPYTHON
-        # EXAMPLE_VERSION_INFO shows you how to pass a value into the C++ code
-        # from Python.
-        cmake_args = [
-            f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={extdir}{os.sep}",
-            f"-DPython3_EXECUTABLE={sys.executable}",
-            f"-DCMAKE_BUILD_TYPE={cfg}",  # not used on MSVC, but no harm
-        ]
-        build_args = []
-        if "CMAKE_ARGS" in os.environ:
-            cmake_args += [item for item in os.environ["CMAKE_ARGS"].split(" ") if item]
-
-        if not cmake_generator or cmake_generator == "Ninja":
-            try:
-                import ninja
-
-                ninja_executable_path = Path(ninja.BIN_DIR) / "ninja"
-                cmake_args += [
-                    "-GNinja",
-                    f"-DCMAKE_MAKE_PROGRAM:FILEPATH={ninja_executable_path}",
-                ]
-            except ImportError:
-                pass
-
-        if is_sccache_available():
-            cmake_args += [
-                "-DCMAKE_C_COMPILER_LAUNCHER=sccache",
-                "-DCMAKE_CXX_COMPILER_LAUNCHER=sccache",
-                "-DCMAKE_CUDA_COMPILER_LAUNCHER=sccache",
-                "-DCMAKE_HIP_COMPILER_LAUNCHER=sccache",
-                "-DCMAKE_OBJC_COMPILER_LAUNCHER=sccache",
-                "-DCMAKE_OBJCXX_COMPILER_LAUNCHER=sccache",
-            ]
-        elif is_ccache_available():
-            cmake_args += [
-                "-DCMAKE_C_COMPILER_LAUNCHER=ccache",
-                "-DCMAKE_CXX_COMPILER_LAUNCHER=ccache",
-                "-DCMAKE_CUDA_COMPILER_LAUNCHER=ccache",
-                "-DCMAKE_HIP_COMPILER_LAUNCHER=ccache",
-                "-DCMAKE_OBJC_COMPILER_LAUNCHER=ccache",
-                "-DCMAKE_OBJCXX_COMPILER_LAUNCHER=ccache",
-            ]
-
-        num_jobs = os.getenv("MAX_JOBS", None)
-        if num_jobs is not None:
-            num_jobs = int(num_jobs)
-            logger.info("Using MAX_JOBS=%d as the number of jobs.", num_jobs)
-        else:
-            try:
-                # os.sched_getaffinity() isn't universally available, so fall
-                #  back to os.cpu_count() if we get an error here.
-                num_jobs = len(os.sched_getaffinity(0))
-            except AttributeError:
-                num_jobs = os.cpu_count()
-
-        nvcc_threads = os.getenv("NVCC_THREADS", None)
-        if nvcc_threads is not None:
-            nvcc_threads = int(nvcc_threads)
-            logger.info(
-                "Using NVCC_THREADS=%d as the number of nvcc threads.", nvcc_threads
-            )
-            num_jobs = max(1, num_jobs // nvcc_threads)
-            cmake_args += ["-DNVCC_THREADS={}".format(nvcc_threads)]
-
-        build_args += [f"-j{num_jobs}"]
-        if sys.platform == "win32":
-            build_args += ["--config", cfg]
+        cmake_args, build_args = _make_cmake_args(cfg)
+        cmake_args = [f"-DCMAKE_LIBRARY_OUTPUT_DIRECTORY={extdir}{os.sep}"] + cmake_args
 
         build_temp = Path(self.build_temp) / ext.name
         if not build_temp.exists():
             build_temp.mkdir(parents=True)
 
         subprocess.run(
-            ["cmake", ext.sourcedir, *cmake_args], cwd=build_temp, check=True
+            ["cmake", "-S", ext.sourcedir, "-B", str(build_temp), *cmake_args],
+            cwd=build_temp,
+            check=True,
         )
         subprocess.run(
-            ["cmake", "--build", ".", *build_args], cwd=build_temp, check=True
+            ["cmake", "--build", str(build_temp), *build_args], cwd=build_temp, check=True
         )
 
         if sys.platform == "win32":
@@ -134,18 +142,54 @@ def build_extension(self, ext: CMakeExtension) -> None:
             for filename in os.listdir(extdir / cfg):
                 move(extdir / cfg / filename, extdir / filename)
 
-    def get_ext_filename(self, ext_name):
+    def get_ext_filename(self, ext_name):  # type: ignore[override]
         # The dynamic library is not a real Python extension, so it does not have
         # the usual Python platform information.
         suffix = sysconfig.get_config_var("SHLIB_SUFFIX")
         return f"{ext_name.replace(".", "/")}/{ops_name}{suffix}"
 
+
+class BuildKernel(build):
+    """Custom command to build and locally install the kernel."""
+
+    description = "Build the kernel and install via the local_install CMake target"
+    user_options = []
+
+    def initialize_options(self) -> None:
+        super().initialize_options()
+
+    def finalize_options(self) -> None:
+        super().finalize_options()
+
+    def run(self) -> None:
+        project_root = Path(__file__).parent
+
+        debug = int(os.environ.get("DEBUG", 0))
+        cfg = "Debug" if debug else "Release"
+
+        cmake_args, build_args = _make_cmake_args(cfg)
+
+        build_temp = project_root / "_cmake_build"
+        build_temp.mkdir(parents=True, exist_ok=True)
+
+        subprocess.run(
+            ["cmake", "-S", str(project_root), "-B", str(build_temp), *cmake_args],
+            cwd=project_root,
+            check=True,
+        )
+        subprocess.run(
+            ["cmake", "--build", str(build_temp), "--target", "local_install", *build_args],
+            cwd=project_root,
+            check=True,
+        )
+
+
 setup(
     name="{{ python_name }}",
     # The version is just a stub, it's not used by the final build artefact.
     version="0.1.0",
     ext_modules=[CMakeExtension(f"{{ python_name }}.{ops_name}")],
-    cmdclass={"build_ext": CMakeBuild},
+    cmdclass={"build_ext": CMakeBuild, "build_kernel": BuildKernel},
     packages=find_packages(where="tvm-ffi-ext", include=["{{ python_name }}*"]),
     package_dir={"": "tvm-ffi-ext"},
 {% if data_globs %}
diff --git a/nix-builder/lib/build.nix b/nix-builder/lib/build.nix
index 4b49871d..fcc3bc88 100644
--- a/nix-builder/lib/build.nix
+++ b/nix-builder/lib/build.nix
@@ -274,6 +274,7 @@ rec {
     }:
     let
       kernelConfig = readKernelConfig path;
+      repoId = lib.attrByPath [ "toml" "general" "hub" "repo-id" ] null kernelConfig;
       shellForBuildSet =
         { path, rev }:
         buildSet:
@@ -303,6 +304,7 @@ rec {
                 ++ pythonCheckInputs ps
                 ++ [
                   buildSet.torch
+                  kernels
                   pytest
                 ]
                 ++ pythonCheckInputs ps
@@ -320,6 +322,9 @@ rec {
               # make testing as pure as possible.
               unset LD_LIBRARY_PATH
               export PYTHONPATH=${extension}/${buildSet.variants.kernelVariant kernelConfig}
+            ''
+            + lib.optionalString (repoId != null) ''
+              export LOCAL_KERNELS="${repoId}=${extension}"
             '';
           };
         };
@@ -392,6 +397,7 @@ rec {
     }:
     let
       kernelConfig = readKernelConfig path;
+      repoId = lib.attrByPath [ "toml" "general" "hub" "repo-id" ] null kernelConfig;
       shellForBuildSet =
         buildSet:
         let
@@ -416,6 +422,7 @@ rec {
               ++ [
                 buildSet.torch
                 kernels
+                ninja
                 pip
                 pytest
               ]
@@ -457,6 +464,9 @@ rec {
               fi
               source "${venvDir}/bin/activate"
               unset LD_LIBRARY_PATH
+            ''
+            + lib.optionalString (repoId != null) ''
+              export LOCAL_KERNELS="${repoId}=$(pwd)/build"
             '';
           };
         };