nix-builder: redesign kernel testing #676
Changes from all commits
893d529
372b904
10ddaed
928757b
e59839e
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,6 @@ | ||
| # Develop kernels with agents | ||
|
|
||
| Code agents are a good fit to build custom kernels because the hard part is not just writing in Domain Specific Language (DSLs) like CUDA. You also need the right project layout, PyTorch bindings, architecture-specific choices, model-specific integration, and trustworthy benchmarks. | ||
| Code agents are a good fit to build custom kernels because the hard part is not just writing in Domain Specific Language (DSLs) like CUDA. You also need the right project layout, PyTorch bindings, architecture-specific choices, model-specific integration, and trustworthy benchmarks. | ||
|
|
||
| Kernels on Hugging Face are compatible with agents via skills and the `hf` CLI. The `cuda-kernels`, `rocm-kernels`, `xpu-kernels`, and `cpu-kernels` skills contain knowledge so an agent can generate and publish a complete kernel project, instead of isolated snippets. | ||
|
|
||
|
|
@@ -10,7 +10,7 @@ This guide is for **authoring new kernels**. If you only want to **load an exist | |
|
|
||
| You need: | ||
|
|
||
| - a coding agent that supports skills, such as Claude Code, Codex, Cursor, or OpenCode | ||
| - a coding agent that supports skills, such as Claude Code, Codex, Cursor, or OpenCode | ||
| - a clear target: library, model, operation, GPU, dtype, and representative shapes | ||
|
|
||
| The skill currently focuses on NVIDIA GPUs such as **H100**, **A100**, and **T4**, and on integration patterns for **transformers** and **diffusers**. | ||
|
|
@@ -31,14 +31,14 @@ kernel-builder skills add --claude | |
|
|
||
| Writing kernels is a hard problem, so be specific when prompting agents. A robust prompt will declare all core attributes, including: | ||
|
|
||
| - the library, for example `transformers` or `diffusers` | ||
| - the model id, for example `Qwen3-8B` or `LTX-Video` | ||
| - the operation, for example `RMSNorm`, attention, RoPE, `GEGLU`, or `AdaLN` | ||
| - the target GPU, for example `H100`, `A100`, or `T4` | ||
| - the dtype, for example `bfloat16`, `float16`, or `float32` | ||
| - the library, for example `transformers` or `diffusers` | ||
| - the model id, for example `Qwen3-8B` or `LTX-Video` | ||
| - the operation, for example `RMSNorm`, attention, RoPE, `GEGLU`, or `AdaLN` | ||
| - the target GPU, for example `H100`, `A100`, or `T4` | ||
| - the dtype, for example `bfloat16`, `float16`, or `float32` | ||
| - the outputs you expect: kernel code, bindings, tests, and benchmarks | ||
|
|
||
| In practice, you can often skip some of these and the agent will infer based on common practice, but if you know a detail declare it. | ||
| In practice, you can often skip some of these and the agent will infer based on common practice, but if you know a detail declare it. | ||
|
|
||
| For example: | ||
|
|
||
|
|
@@ -69,7 +69,7 @@ examples/your_model/ | |
| │ └── torch_binding.cpp # PyTorch C++ bindings | ||
| ├── benchmark_rmsnorm.py # Micro-benchmark script | ||
| ├── build.toml # kernel-builder config | ||
| ├── setup.py # pip install -e . | ||
| ├── setup.py # python setup.py build_kernel | ||
|
Member
Does this mean that the existing skill files need to be updated?
Member
Author
I think so, I'll make a separate PR for that to move this forward.
||
| └── pyproject.toml | ||
| ``` | ||
|
|
||
|
|
@@ -105,15 +105,15 @@ cuda-capabilities = ["9.0"] # H100 | |
|
|
||
| First check that: | ||
|
|
||
| - `backends = ["cuda"]` is correct for your project | ||
| - the kernel source files are listed correctly | ||
| - the Torch binding sources are included under `[torch]` | ||
| - `backends = ["cuda"]` is correct for your project | ||
| - the kernel source files are listed correctly | ||
| - the Torch binding sources are included under `[torch]` | ||
| - `cuda-capabilities` is only set when the kernel truly targets specific architectures | ||
|
|
||
| For architecture-specific kernels, typical capability values are: | ||
|
|
||
| - H100: `9.0` | ||
| - A100: `8.0` | ||
| - H100: `9.0` | ||
| - A100: `8.0` | ||
| - T4: `7.5` | ||
|
|
||
| If the kernel does **not** require a specific capability, the kernels docs recommend leaving `cuda-capabilities` unset so the builder can target all supported capabilities. In practice, you can prompt your agent to review the `build.toml` for excessive definitions. Agents have a tendency to over-specify capabilities. | ||
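To make the review of `cuda-capabilities` concrete, here is a minimal Python sketch that compares the capabilities declared in a `build.toml` against the GPUs a kernel actually targets. The helper names (`capabilities_for`, `over_specified`) are hypothetical and not part of kernel-builder; the mapping only covers the three GPUs listed above.

```python
# Hypothetical helpers (not part of kernel-builder): compare declared
# capabilities against the GPUs the kernel is meant to run on.
KNOWN_CAPABILITIES = {"H100": "9.0", "A100": "8.0", "T4": "7.5"}


def capabilities_for(gpus):
    """Return sorted, de-duplicated capability strings for the given GPU names."""
    unknown = [g for g in gpus if g.upper() not in KNOWN_CAPABILITIES]
    if unknown:
        raise ValueError(f"no capability listed for: {unknown}")
    return sorted({KNOWN_CAPABILITIES[g.upper()] for g in gpus}, key=float)


def over_specified(declared, targets):
    """Return declared capabilities that none of the target GPUs need."""
    needed = set(capabilities_for(targets))
    return sorted(set(declared) - needed, key=float)
```

For instance, `over_specified(["7.5", "8.0", "9.0"], ["H100"])` flags `7.5` and `8.0` as candidates to remove. If the kernel does not need a specific architecture at all, the simpler fix is still to leave `cuda-capabilities` unset.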
|
|
@@ -126,17 +126,17 @@ The kernel should be registered as Torch ops in `torch-ext/torch_binding.cpp`, w | |
|
|
||
| Make sure the integration matches the library: | ||
|
|
||
| - **transformers**: patch the target modules directly, often RMSNorm modules whose class name contains `RMSNorm` | ||
| - **transformers**: patch the target modules directly, often RMSNorm modules whose class name contains `RMSNorm` | ||
| - **diffusers**: inspect the actual pipeline structure before patching, because modules and attention processors can differ across pipelines | ||
|
|
||
| > [!NOTE] | ||
| > One common issue is that the agent will not integrate the kernel at all, typically because the project's context is so long. | ||
|
|
||
| A few patterns matter in practice for the integration code: | ||
|
|
||
| - In **transformers**, RMSNorm modules generally have weights, but epsilon may be exposed as `variance_epsilon` or `eps` depending on the model. | ||
| - In **diffusers**, some RMSNorm modules may have `weight=None`, so the integration code needs to handle both weighted and unweighted cases. | ||
| - In **diffusers**, checking `type(module).__name__` is often more reliable than `isinstance(...)` for matching RMSNorm modules across implementations. | ||
| - In **transformers**, RMSNorm modules generally have weights, but epsilon may be exposed as `variance_epsilon` or `eps` depending on the model. | ||
| - In **diffusers**, some RMSNorm modules may have `weight=None`, so the integration code needs to handle both weighted and unweighted cases. | ||
| - In **diffusers**, checking `type(module).__name__` is often more reliable than `isinstance(...)` for matching RMSNorm modules across implementations. | ||
| - If a diffusers pipeline uses CPU offloading, inject custom kernels **before** enabling offload. | ||
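The matching patterns above can be sketched in plain Python. This is an illustration under assumptions, not the skill's actual integration code: the function names are hypothetical, the fallback `default_eps=1e-6` is an assumed value, and the functions only inspect attributes, so they work on any iterable of `(name, module)` pairs such as the one returned by `model.named_modules()` in PyTorch.

```python
def is_rmsnorm(module):
    """Match by class name instead of isinstance, so variants across libraries match."""
    return "RMSNorm" in type(module).__name__


def rmsnorm_params(module, default_eps=1e-6):
    """Return (weight, eps) for an RMSNorm-like module.

    eps may be stored as `variance_epsilon` or `eps`; weight may be None.
    `default_eps` is an assumed fallback, not a value taken from any library.
    """
    eps = getattr(module, "variance_epsilon", None)
    if eps is None:
        eps = getattr(module, "eps", None)
    if eps is None:
        eps = default_eps
    weight = getattr(module, "weight", None)  # callers must handle None
    return weight, eps


def find_rmsnorm(named_modules):
    """Filter (name, module) pairs down to RMSNorm-like modules."""
    return [(name, m) for name, m in named_modules if is_rmsnorm(m)]
```

A patching loop would then iterate over `find_rmsnorm(model.named_modules())` and replace each module's forward pass with a call into the custom kernel, branching on whether `weight` is `None`.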
|
|
||
| For attention, prefer the model library's existing optimized path when one already exists. For example, in `transformers`, Flash Attention 2 is usually the right baseline for attention, while custom kernels are especially useful for operations like RMSNorm and other targeted hotspots. | ||
|
|
@@ -160,22 +160,22 @@ nix run nixpkgs#cachix -- use huggingface | |
|
|
||
| There are two main benchmarks to consider: | ||
|
|
||
| 1. an isolated kernel micro-benchmark | ||
| 1. an isolated kernel micro-benchmark | ||
| 2. an end-to-end benchmark in the real model or pipeline | ||
|
|
||
| The agent will generate both benchmarks based on the agent skills examples, typically as a script called `benchmark_example.py`. If you have access to the target hardware, you can run it to verify the kernel works. For example, the agent will generate a table like this: | ||
|
|
||
| ```markdown | ||
| | Shape | Custom (ms) | PyTorch (ms) | Speedup | | ||
| | :---- | :---: | :---: | :---: | | ||
| | [1x128x4096] | 0.040 | 0.062 | **1.58x** | | ||
| | [1x512x4096] | 0.038 | 0.064 | **1.69x** | | ||
| | [1x1024x4096] | 0.037 | 0.071 | **1.90x** | | ||
| | [1x2048x4096] | 0.045 | 0.091 | **2.03x** | | ||
| | [1x4096x4096] | 0.071 | 0.150 | **2.12x** | | ||
| | [4x512x4096] | 0.056 | 0.093 | **1.67x** | | ||
| | [8x256x4096] | 0.045 | 0.092 | **2.06x** | | ||
| | [1x8192x4096] | 0.109 | 0.269 | **2.47x** | | ||
| | Shape | Custom (ms) | PyTorch (ms) | Speedup | | ||
| | :------------ | :---------: | :----------: | :-------: | | ||
| | [1x128x4096] | 0.040 | 0.062 | **1.58x** | | ||
| | [1x512x4096] | 0.038 | 0.064 | **1.69x** | | ||
| | [1x1024x4096] | 0.037 | 0.071 | **1.90x** | | ||
| | [1x2048x4096] | 0.045 | 0.091 | **2.03x** | | ||
| | [1x4096x4096] | 0.071 | 0.150 | **2.12x** | | ||
| | [4x512x4096] | 0.056 | 0.093 | **1.67x** | | ||
| | [8x256x4096] | 0.045 | 0.092 | **2.06x** | | ||
| | [1x8192x4096] | 0.109 | 0.269 | **2.47x** | | ||
| ``` | ||
|
|
||
| Interpret the results carefully. A kernel can show a large isolated speedup but only a modest end-to-end gain if that operation is a small fraction of total runtime. In the LTX-Video example from [the blog we wrote](https://huggingface.co/blog/custom-cuda-kernels-agent-skills), the generated RMSNorm kernel improved the isolated benchmark by about **1.88x** on average, but end-to-end video generation improved by about **6%**, which matched the fact that RMSNorm accounted for only a small share of total compute. | ||
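The gap between isolated and end-to-end gains follows from Amdahl's law. The sketch below is a worked illustration, assuming that "about 6%" means an end-to-end speedup of roughly 1.06x; the function names are hypothetical and the resulting fraction is an estimate, not a figure reported in the blog post.

```python
def end_to_end_speedup(fraction, kernel_speedup):
    """Amdahl's law: overall speedup when `fraction` of runtime is sped up."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def implied_fraction(end_to_end, kernel_speedup):
    """Invert Amdahl's law: share of runtime the accelerated op must occupy."""
    return (1.0 - 1.0 / end_to_end) / (1.0 - 1.0 / kernel_speedup)


# Assumed inputs: 1.88x isolated speedup, 1.06x end-to-end speedup.
share = implied_fraction(1.06, 1.88)
print(f"implied share of runtime: {share:.1%}")
```

Under that reading, the operation would account for roughly 12% of total runtime, which is consistent with a large isolated speedup producing only a modest end-to-end gain.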
|
|
@@ -186,7 +186,7 @@ Once the project is correct and benchmarked, you can build Hub-compatible artifa | |
|
|
||
| ```shell | ||
| # install the hf CLI tool | ||
| hf skills add | ||
| hf skills add | ||
|
|
||
| # Authenticate | ||
| hf auth login | ||
|
|
@@ -216,4 +216,5 @@ from kernels import get_kernel | |
| kernel = get_kernel("your-org/your-kernel", version=1) | ||
| ``` | ||
|
|
||
| Well done! You have now built a custom kernel and published it to the Hub. | ||
| Well done! You have now built a custom kernel and published it to the Hub. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -409,10 +409,35 @@ def relu_fwd_fake(input: torch.Tensor) -> torch.Tensor: | |
|
|
||
| ## Kernel tests | ||
|
|
||
| Kernel tests are stored in the `tests` directory. Since running all | ||
| kernel tests in CI may be prohibitively expensive, the `pyproject.toml` | ||
| generated by the builder adds support for the special `kernels_ci` | ||
| PyTest marker that can be used as follows: | ||
| ### Use `get_kernel` in tests | ||
|
|
||
| Kernel tests are stored in the `tests` directory. Tests must not use direct | ||
| imports, but instead use `get_kernel` to test the kernel as it will be used. | ||
| For example: | ||
|
|
||
| ```python | ||
| import kernels | ||
| import torch | ||
| import torch.nn.functional as F | ||
|
|
||
| relu = kernels.get_kernel("kernels-community/relu", version=1) | ||
|
Member
This assumes that the kernel has been pushed to the Hub and the tests require loading remotely and NOT locally? Should we perhaps make that point a bit clearer?
Member
Author
It doesn't, see the note about
||
|
|
||
| def test_relu(): | ||
| x = torch.randn(1024, 1024, dtype=torch.float32, device=torch.device("cuda")) | ||
| y = relu.relu(x, torch.empty_like(x)) | ||
| y_ref = F.relu(x) | ||
| torch.testing.assert_close(y_ref, y) | ||
| ``` | ||
|
|
||
| Development shells (`kernel-builder devshell`/`kernel-builder testshell`) | ||
| will set the `LOCAL_KERNELS` variable to ensure that the kernel will be | ||
| loaded from the development environment. | ||
|
|
||
| ### Mark CI tests | ||
|
|
||
| Since running all kernel tests in CI may be prohibitively expensive, the | ||
| `pyproject.toml` generated by the builder adds support for the special | ||
| `kernels_ci` PyTest marker that can be used as follows: | ||
|
|
||
| ```python | ||
| import pytest | ||
|
|
||
How were doc related formatting changes started showing up? Any version updates?
Possibly my editor reformatting these superfluous EOL whitespaces. If it's compatible with hf doc builder, we could consider using `prettier` to standardize formatting.