Summary
scripts/run_benchmark.sh reports layernorm ≈ 1.69 TB/s at 32768x8192 bf16 on the gfx950 runner, while softmax/rmsnorm are ~5.8 TB/s at the same shape. The base layernorm kernel is healthy (~5.6 TB/s) — this is a benchmark-output parsing bug, not a kernel regression.
Root cause
The norm branch of _py_parse_and_emit keeps the last Bandwidth: line:
for m_bw in re.finditer(r"Bandwidth:\s*([0-9.]+)\s*GB/s", txt):
    pass  # loop body is empty, so m_bw ends up bound to the final match
Since #549, test_layernorm.py's __main__ runs six benchmarks in sequence — base layernorm, then fused_add / dynamicquant / smoothquant / fused_add_dynamicquant / fused_add_smoothquant — each printing its own Bandwidth: line. The last is the fully-scalar fused_add_smoothquant path, so the parser reports that as "layernorm". softmax/rmsnorm are unaffected because their __main__ runs a single base test (first == last).
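The last-match behaviour can be reproduced in isolation. The snippet below is a minimal sketch, not the actual parser: `SAMPLE` and `last_bandwidth` are hypothetical names, and the sample text only assumes that each benchmark prints a line matching the regex quoted above. The values are the six bandwidths listed under Evidence.

```python
import re

# Hypothetical stand-in for the test_layernorm.py stdout. Only the
# "Bandwidth: <n> GB/s" shape is taken from the regex in this issue.
SAMPLE = """\
Bandwidth: 5617 GB/s
Bandwidth: 2982 GB/s
Bandwidth: 1852 GB/s
Bandwidth: 1608 GB/s
Bandwidth: 1768 GB/s
Bandwidth: 1660 GB/s
"""

BW_RE = re.compile(r"Bandwidth:\s*([0-9.]+)\s*GB/s")


def last_bandwidth(txt):
    """Mimic the empty-body loop: return the last match in GB/s, or None."""
    m_bw = None
    for m_bw in BW_RE.finditer(txt):
        pass  # nothing is recorded per iteration; only the final binding survives
    return float(m_bw.group(1)) if m_bw else None


print(last_bandwidth(SAMPLE))
```

With a single `Bandwidth:` line in the input the function returns that line's value, which is why the softmax/rmsnorm numbers stayed correct.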
Evidence
Running the exact CI command on MI350X / gfx950 (32768x8192 bf16) prints all six bandwidths:
LayerNorm (base, 128b vectorized) ........ 5617 GB/s <- the real number
FusedAdd LayerNorm ....................... 2982 GB/s
LayerNorm DynamicQuant ................... 1852 GB/s
LayerNorm SmoothQuant .................... 1608 GB/s
FusedAdd DynamicQuant .................... 1768 GB/s
FusedAdd SmoothQuant ..................... 1660 GB/s <- parser keeps this one
Per-commit CI benchmark history pins the step exactly at #549 (5.5 → 1.69 TB/s), with softmax/rmsnorm flat across the same range. The kernel never changed: git diff on kernels/layernorm_kernel.py at #549 is @@ -314,3 +314,607 @@ — purely appended quant/fused builders; build_layernorm_module is byte-identical.
Secondary: the regression gate can't catch this
compare_benchmark.py (current vs main) compares against a main baseline that is itself mislabeled — 1.69 vs 1.69 — so the 3× discrepancy slips through silently. Worth considering a per-op-tagged bandwidth line or an absolute roofline sanity floor.
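Both suggestions could look roughly like the sketch below. Everything in it is an assumption for illustration: the `Bandwidth[<op>]:` line format, the names `parse_tagged` and `below_floor`, and the peak and fraction values are hypothetical and are not taken from the repository or from vendor specifications.

```python
import re

# Hypothetical tagged format: "Bandwidth[<op>]: <n> GB/s".
TAGGED_RE = re.compile(
    r"Bandwidth\[(?P<op>[\w-]+)\]:\s*(?P<gbps>[0-9.]+)\s*GB/s"
)


def parse_tagged(txt):
    """Map op name -> bandwidth in GB/s, so line order no longer matters."""
    return {m["op"]: float(m["gbps"]) for m in TAGGED_RE.finditer(txt)}


def below_floor(gbps, peak_gbps, min_fraction):
    """True if a result falls under an absolute floor of peak_gbps * min_fraction."""
    return gbps < peak_gbps * min_fraction


out = (
    "Bandwidth[layernorm]: 5617 GB/s\n"
    "Bandwidth[fused_add_smoothquant]: 1660 GB/s\n"
)
results = parse_tagged(out)

# Placeholder peak and fraction; real values would have to come from the runner's specs.
PEAK_GBPS = 8000.0
MIN_FRACTION = 0.5
print(results["layernorm"], below_floor(results["layernorm"], PEAK_GBPS, MIN_FRACTION))
```

An absolute floor catches the case described here, where current and baseline are both wrong in the same way, because it does not depend on the baseline at all.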
Fix
Report the first Bandwidth: match (the base op is always benchmarked first). PR: #654.
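A first-match parser can be sketched with `re.search`, which stops at the first occurrence. This is an illustration of the approach and not the diff in #654; `first_bandwidth` is a hypothetical name.

```python
import re

BW_RE = re.compile(r"Bandwidth:\s*([0-9.]+)\s*GB/s")


def first_bandwidth(txt):
    """Return the first "Bandwidth: <n> GB/s" value in GB/s, or None if absent."""
    m = BW_RE.search(txt)  # search() returns the leftmost match only
    return float(m.group(1)) if m else None


print(first_bandwidth("Bandwidth: 5617 GB/s\nBandwidth: 1660 GB/s\n"))
```

This relies on the base op being benchmarked first, as stated above. If the order in `__main__` changes, the reported number would silently change with it.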