
run_benchmark mislabels layernorm at ~1.69 TB/s (parser keeps last Bandwidth = scalar smoothquant variant); base layernorm is ~5.6 TB/s #655

@jhinpan


Summary

scripts/run_benchmark.sh reports layernorm ≈ 1.69 TB/s at 32768x8192 bf16 on the gfx950 runner, while softmax/rmsnorm are ~5.8 TB/s at the same shape. The base layernorm kernel is healthy (~5.6 TB/s) — this is a benchmark-output parsing bug, not a kernel regression.

Root cause

The norm branch of _py_parse_and_emit keeps the last Bandwidth: line:

for m_bw in re.finditer(r"Bandwidth:\s*([0-9.]+)\s*GB/s", txt):
    pass  # empty body: after the loop, m_bw is bound to the last match

Since #549, test_layernorm.py's __main__ runs six benchmarks in sequence — base layernorm, then fused_add / dynamicquant / smoothquant / fused_add_dynamicquant / fused_add_smoothquant — each printing its own Bandwidth: line. The last is the fully-scalar fused_add_smoothquant path, so the parser reports that as "layernorm". softmax/rmsnorm are unaffected because their __main__ runs a single base test (first == last).
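A minimal reproduction of the last-match behaviour, as a sketch rather than the actual parser code. The regex is the one quoted above; the sample text and the helper name `last_bandwidth_gbps` are hypothetical, and the real log lines may be labelled differently.

```python
import re

BW_RE = re.compile(r"Bandwidth:\s*([0-9.]+)\s*GB/s")

# Hypothetical log text. Only the "Bandwidth: <n> GB/s" shape is taken from
# the parser's regex; the bracketed labels are made up for illustration.
SAMPLE = """\
[layernorm] Bandwidth: 5617 GB/s
[fused_add] Bandwidth: 2982 GB/s
[fused_add_smoothquant] Bandwidth: 1660 GB/s
"""


def last_bandwidth_gbps(txt):
    """Mimic the loop-and-pass pattern: return the last match, or None."""
    m_bw = None
    for m_bw in BW_RE.finditer(txt):
        pass  # nothing to do; we only want the final binding of m_bw
    return float(m_bw.group(1)) if m_bw else None
```

With one `Bandwidth:` line per run (the softmax/rmsnorm case) this pattern is harmless; it only goes wrong once several benchmarks share one output stream.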

Evidence

Exact CI command on MI350X / gfx950 (32768x8192 bf16) prints all six:

LayerNorm (base, 128b vectorized) ........ 5617 GB/s   <- the real number
FusedAdd LayerNorm ....................... 2982 GB/s
LayerNorm DynamicQuant ................... 1852 GB/s
LayerNorm SmoothQuant .................... 1608 GB/s
FusedAdd DynamicQuant .................... 1768 GB/s
FusedAdd SmoothQuant ..................... 1660 GB/s   <- parser keeps this one

Per-commit CI benchmark history pins the step exactly at #549 (5.5 → 1.69 TB/s), with softmax/rmsnorm flat across the same range. The kernel never changed: git diff on kernels/layernorm_kernel.py at #549 is @@ -314,3 +314,607 @@ — purely appended quant/fused builders; build_layernorm_module is byte-identical.

Secondary: the regression gate can't catch this

compare_benchmark.py (current vs main) compares against a main baseline that is itself mislabeled — 1.69 vs 1.69 — so the 3× discrepancy slips through silently. Worth considering a per-op-tagged bandwidth line or an absolute roofline sanity floor.
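One way the two ideas could look, as a rough sketch only. The tagged line format `<op> Bandwidth: <n> GB/s`, the helper names, and the floor values are all assumptions for illustration; none of them exist in the repo today, and a real floor would need to be derived from the target GPU's peak bandwidth.

```python
import re

# Hypothetical per-op-tagged format: "<op> Bandwidth: <n> GB/s", one per line.
TAGGED_RE = re.compile(
    r"^(?P<op>[A-Za-z0-9_]+)\s+Bandwidth:\s*(?P<gbps>[0-9.]+)\s*GB/s",
    re.MULTILINE,
)


def parse_tagged(txt):
    """Map each op tag to its reported bandwidth in GB/s."""
    return {m.group("op"): float(m.group("gbps")) for m in TAGGED_RE.finditer(txt)}


def below_floor(results, floors_gbps):
    """Return ops whose bandwidth is missing or under its absolute floor."""
    return sorted(
        op for op, floor in floors_gbps.items() if results.get(op, 0.0) < floor
    )
```

Because the floor is absolute, it still fires when the baseline on main carries the same wrong number, which is the case the relative current-vs-main comparison misses.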

Fix

Report the first Bandwidth: match (the base op is always benchmarked first). PR: #654.
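A sketch of what taking the first match could look like, assuming the same regex as the current parser. The function name is hypothetical and this is not necessarily how #654 implements it.

```python
import re

BW_RE = re.compile(r"Bandwidth:\s*([0-9.]+)\s*GB/s")


def first_bandwidth_gbps(txt):
    """Return the first reported bandwidth in GB/s, or None if none is found."""
    m_bw = BW_RE.search(txt)  # search() stops at the first match
    return float(m_bw.group(1)) if m_bw else None
```

This relies on the base op staying first in `__main__`; if the benchmark order ever changes, the reported number changes with it.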
