run_benchmark mislabels layernorm at ~1.69 TB/s (parser keeps last Bandwidth = scalar smoothquant variant); base layernorm is ~5.6 TB/s #655

## Summary

`scripts/run_benchmark.sh` reports **layernorm ≈ 1.69 TB/s** at `32768x8192` bf16 on the gfx950 runner, while `softmax`/`rmsnorm` are ~5.8 TB/s at the same shape. **The base layernorm kernel is healthy (~5.6 TB/s)** — this is a benchmark-output **parsing bug**, not a kernel regression.

## Root cause

The norm branch of `_py_parse_and_emit` keeps the **last** `Bandwidth:` line:

```python
# finditer yields every match; the empty loop body leaves m_bw bound to the LAST one.
for m_bw in re.finditer(r"Bandwidth:\s*([0-9.]+)\s*GB/s", txt):
    pass
```

Since #549, `test_layernorm.py`'s `__main__` runs **six** benchmarks in sequence — base layernorm, then `fused_add` / `dynamicquant` / `smoothquant` / `fused_add_dynamicquant` / `fused_add_smoothquant` — each printing its own `Bandwidth:` line. The last is the fully-scalar `fused_add_smoothquant` path, so the parser reports *that* as "layernorm". `softmax`/`rmsnorm` are unaffected because their `__main__` runs a single base test (first == last).
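The behaviour can be reproduced in isolation. The sketch below is not the actual `_py_parse_and_emit` code: `SAMPLE_OUTPUT`, `last_bandwidth`, and `all_bandwidths` are hypothetical names, and the line format of the sample text is assumed. Only the regex and the keep-the-last-match loop are taken from the snippet above, and the two bandwidth figures are the base and `fused_add_smoothquant` numbers reported in this issue.

```python
import re

# Same pattern as the norm branch of the parser.
BANDWIDTH_RE = re.compile(r"Bandwidth:\s*([0-9.]+)\s*GB/s")

# Assumed stand-in for the test's stdout: base op first, scalar variant last.
SAMPLE_OUTPUT = """\
LayerNorm (base)      Bandwidth: 5617 GB/s
FusedAdd SmoothQuant  Bandwidth: 1660 GB/s
"""


def last_bandwidth(txt):
    """Mimic the parser: walk every match and keep whichever came last."""
    m_bw = None
    for m_bw in BANDWIDTH_RE.finditer(txt):
        pass
    return float(m_bw.group(1)) if m_bw else None


def all_bandwidths(txt):
    """Return every Bandwidth value in order of appearance."""
    return [float(v) for v in BANDWIDTH_RE.findall(txt)]


if __name__ == "__main__":
    print("all matches:", all_bandwidths(SAMPLE_OUTPUT))
    print("parser keeps:", last_bandwidth(SAMPLE_OUTPUT))
```

With a single `Bandwidth:` line in the input, `last_bandwidth` returns the only value present, which is why the same loop was harmless for `softmax` and `rmsnorm`.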

## Evidence

Exact CI command on MI350X / gfx950 (`32768x8192` bf16) prints all six:

```
LayerNorm (base, 128b vectorized) ........ 5617 GB/s   <- the real number
FusedAdd LayerNorm ....................... 2982 GB/s
LayerNorm DynamicQuant ................... 1852 GB/s
LayerNorm SmoothQuant .................... 1608 GB/s
FusedAdd DynamicQuant .................... 1768 GB/s
FusedAdd SmoothQuant ..................... 1660 GB/s   <- parser keeps this one
```

Per-commit CI benchmark history pins the step exactly at #549 (5.5 → 1.69 TB/s), with softmax/rmsnorm flat across the same range. The kernel never changed: `git diff` on `kernels/layernorm_kernel.py` at #549 is `@@ -314,3 +314,607 @@` — purely appended quant/fused builders; `build_layernorm_module` is byte-identical.

## Secondary: the regression gate can't catch this

`compare_benchmark.py` (current vs main) compares against a `main` baseline that is *itself* mislabeled (1.69 vs 1.69), so the relative change is zero and the 3× discrepancy slips through silently. Worth considering a per-op-tagged bandwidth line or an absolute roofline sanity floor.
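One possible shape for the absolute floor, as a minimal sketch: flag any op whose reported bandwidth falls below a fixed fraction of the device's peak memory bandwidth, independent of the `main` baseline. Nothing here exists in `compare_benchmark.py`; `below_roofline_floor`, `ops_below_floor`, the `0.5` fraction, and the `8000.0` GB/s peak used in the example are all assumptions for illustration and should be replaced with the documented peak for the runner's GPU.

```python
def below_roofline_floor(measured_gbps, peak_gbps, min_fraction=0.5):
    """True if a measurement is under min_fraction of the peak bandwidth."""
    if peak_gbps <= 0:
        raise ValueError("peak_gbps must be positive")
    return measured_gbps < min_fraction * peak_gbps


def ops_below_floor(results, peak_gbps, min_fraction=0.5):
    """Return the sorted names of ops whose bandwidth is under the floor.

    `results` maps op name -> reported bandwidth in GB/s.
    """
    return sorted(
        op
        for op, gbps in results.items()
        if below_roofline_floor(gbps, peak_gbps, min_fraction)
    )


if __name__ == "__main__":
    # Reported figures from this issue; the peak is a placeholder assumption.
    reported = {"softmax": 5800.0, "rmsnorm": 5800.0, "layernorm": 1690.0}
    print(ops_below_floor(reported, peak_gbps=8000.0))
```

A check of this kind would have flagged the mislabeled layernorm number even though current and `main` agreed with each other. The trade-off is that the floor has to be tuned per device and per op class, since legitimately slower kernels (such as the scalar quant variants) would trip a floor chosen for the vectorized base ops.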

## Fix

Report the **first** `Bandwidth:` match (the base op is always benchmarked first). PR: ROCm/FlyDSL#654.
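A minimal sketch of what first-match parsing could look like, for illustration only: this is not the diff from the linked PR, and `first_bandwidth_gbps` is a hypothetical helper name. It reuses the parser's regex but calls `re.search`, which stops at the first occurrence instead of iterating to the last.

```python
import re

BANDWIDTH_RE = re.compile(r"Bandwidth:\s*([0-9.]+)\s*GB/s")


def first_bandwidth_gbps(txt):
    """Return the first Bandwidth value in txt as a float, or None if absent."""
    m_bw = BANDWIDTH_RE.search(txt)  # search() returns the earliest match
    return float(m_bw.group(1)) if m_bw else None


if __name__ == "__main__":
    # Assumed output format; figures are the base and last variants from this issue.
    txt = "base Bandwidth: 5617 GB/s\nfused_add_smoothquant Bandwidth: 1660 GB/s\n"
    print(first_bandwidth_gbps(txt))
```

This relies on the ordering assumption stated above, that the base op is always benchmarked first. If the test order ever changes, the reported number would silently change with it.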

