moe_blockscale reference is wrong at inter_dim=2048 (kernels agree cos 0.997; the fp32 reference is the outlier) #6

Description

@jhinpan

After fixing the f16-overflow data-prep (ROCm/FlyDSL#642), moe_blockscale passes on the inter_dim=256 shapes (4/12) but still fails the correctness check on all 6 inter_dim=2048 shapes (Dim2=4096).

Following the #642 lesson, I checked kernel-vs-kernel agreement on M32/E256/Dim1=7168/Dim2=4096:

ref max_abs=2.703   fly max_abs=2.688   aiter max_abs=2.688
flydsl vs aiter:  max_abs_err=1.023  cos=0.9972   <- the two independent kernels AGREE
flydsl vs ref:    max_abs_err=2.884  cos=0.5634   <- ref is the outlier
aiter  vs ref:    max_abs_err=2.884  cos=0.5661

FlyDSL and aiter's CK kernel agree with each other (cos 0.997), and both disagree with our fp32 MoeBlockscaleOp.reference by the same amount (max_abs_err 2.884, cos ~0.56). When two independent kernels deviate from the reference in the same way, the reference is the far more likely culprit than the kernels, so these 6 failures are most likely false negatives.
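The agreement check described above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the benchmark's actual comparison code: `compare`, `pairwise_report`, `likely_outlier`, the 0.99 threshold, and the toy vectors are all hypothetical, and real outputs would be flattened tensors rather than short lists.

```python
import math


def compare(a, b):
    """Return (max_abs_err, cosine similarity) for two equal-length flat sequences."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    max_abs_err = max(abs(x - y) for x, y in zip(a, b))
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos = dot / (norm_a * norm_b) if norm_a and norm_b else float("nan")
    return max_abs_err, cos


def pairwise_report(outputs):
    """Compare every pair of named outputs; keys are (name_a, name_b) tuples."""
    names = list(outputs)
    report = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            report[(first, second)] = compare(outputs[first], outputs[second])
    return report


def likely_outlier(report, names, cos_threshold=0.99):
    """Return the one name that disagrees with all others while the others agree, else None."""
    for candidate in names:
        involving = [cos for pair, (_, cos) in report.items() if candidate in pair]
        others = [cos for pair, (_, cos) in report.items() if candidate not in pair]
        if (involving and others
                and all(c < cos_threshold for c in involving)
                and all(c >= cos_threshold for c in others)):
            return candidate
    return None


# Toy stand-ins for the three outputs (hypothetical values, not the real tensors).
outputs = {
    "ref": [4.0, -1.0, 2.0, 0.5],
    "flydsl": [1.0, 2.0, 3.0, 4.0],
    "aiter": [1.01, 1.98, 3.02, 3.97],
}
report = pairwise_report(outputs)
outlier = likely_outlier(report, list(outputs))
for (first, second), (err, cos) in report.items():
    print(f"{first} vs {second}: max_abs_err={err:.3f} cos={cos:.4f}")
print("likely outlier:", outlier)
```

Agreement between two kernels is strong evidence rather than proof: both could share a bug, for example a common data-prep step, so this check narrows down the suspect but does not replace root-causing the reference.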

The bug is specific to inter_dim=2048: at inter_dim=256 the reference matches the kernels. The block-quant helpers (_block_quant_dequant_a / _block_quant_dequant_w_expert) look structurally correct for any block count, so the root cause has not been identified yet. The most likely suspect is the inter-stage block-scale handling, which has to cover 16 K-blocks at inter_dim=2048 versus 2 at inter_dim=256.
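To make the 16-versus-2 block count concrete, here is a minimal sketch of a per-block quantize/dequantize round trip over the K dimension. It is not the code of `_block_quant_dequant_a`. The block size of 128 is an assumption inferred from the block counts quoted above (2048/128 = 16, 256/128 = 2), `QMAX = 448.0` assumes the OCP fp8 e4m3 range (the actual kernels may use a different fp8 variant), and rounding to an integer grid is only a crude stand-in for fp8 rounding.

```python
import math

BLOCK = 128    # assumed block size along K (2048 / 128 = 16, 256 / 128 = 2)
QMAX = 448.0   # assumed fp8 e4m3 max magnitude; other fp8 variants differ


def num_k_blocks(k, block=BLOCK):
    """Number of scale blocks needed to cover a K dimension of length k."""
    return math.ceil(k / block)


def block_quant_dequant(row, block=BLOCK, qmax=QMAX):
    """Quantize and dequantize a flat row with one scale per K-block.

    Returns (dequantized_row, scales). Integer rounding stands in for fp8 rounding.
    """
    dequantized = []
    scales = []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        dequantized.extend(round(v / scale) * scale for v in chunk)
    return dequantized, scales


# Hypothetical inter-stage activation row with a different magnitude per block.
row = [math.sin(i) * (1 + i // BLOCK) for i in range(2048)]
deq, scales = block_quant_dequant(row)
max_err = max(abs(d - v) for d, v in zip(deq, row))
print(f"K-blocks at 2048: {num_k_blocks(2048)}, at 256: {num_k_blocks(256)}")
print(f"scales: {len(scales)}, max round-trip error: {max_err:.5f}")
```

One check this suggests is verifying that the reference produces and consumes exactly one scale per K-block at both inter_dim values. An indexing or broadcasting slip in the scales could go unnoticed with 2 blocks and still corrupt the result with 16, though that is a hypothesis to test, not a confirmed cause.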

Impact: moe_blockscale shows 4/12 correct on the dashboard, but 10/12 appear to be kernel-correct (4 verified against the reference + 6 by kernel agreement; the remaining 2 are OOM). Fixing the reference should recover the 6.

Verified-correct fixes already landed (PR #5): the f16-overflow data-prep in both the FlyDSL and aiter providers, and the fp8 MoE tolerance.
