After fixing the f16-overflow data-prep (ROCm/FlyDSL#642), moe_blockscale passes on the inter_dim=256 shapes (4/12) but is still incorrect on all 6 inter_dim=2048 shapes (Dim2=4096).
Following the #642 lesson, I checked kernel-vs-kernel agreement on M32/E256/Dim1=7168/Dim2=4096:
ref max_abs=2.703 fly max_abs=2.688 aiter max_abs=2.688
flydsl vs aiter: max_abs_err=1.023 cos=0.9972 <- the two independent kernels AGREE
flydsl vs ref: max_abs_err=2.884 cos=0.5634 <- ref is the outlier
aiter vs ref: max_abs_err=2.884 cos=0.5661
FlyDSL and aiter's CK kernel agree with each other (cos 0.997), and both disagree with our fp32 MoeBlockscaleOp.reference by almost exactly the same amount (max_abs_err 2.884, cos ~0.56). When two independent vendor kernels disagree with the reference in the same way, the reference is wrong, not the kernels, so these 6 failures are false negatives.
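The three-way comparison above can be sketched as a small NumPy helper. This is a minimal sketch, not the harness actually used here: the function names `agreement`, `pairwise_report`, and `find_outlier` are hypothetical, and the 0.99 cosine threshold is an assumed cutoff rather than a project setting.

```python
import numpy as np


def agreement(a, b):
    """Return (max_abs_err, cosine similarity) between two output tensors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    max_abs_err = float(np.max(np.abs(a - b)))
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    cos = float(a @ b / denom) if denom > 0 else float("nan")
    return max_abs_err, cos


def pairwise_report(outputs):
    """Compare every pair in a {name: array} dict."""
    names = list(outputs)
    report = {}
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            report[(x, y)] = agreement(outputs[x], outputs[y])
    return report


def find_outlier(outputs, cos_threshold=0.99):
    """Return the one name that disagrees with all others while the others
    agree among themselves, or None if there is no such name.

    cos_threshold is an assumed cutoff for this sketch.
    """
    if len(outputs) < 3:
        return None  # two outputs cannot identify which side is wrong
    report = pairwise_report(outputs)
    for name in outputs:
        with_name = [c for pair, (_, c) in report.items() if name in pair]
        without = [c for pair, (_, c) in report.items() if name not in pair]
        if all(c < cos_threshold for c in with_name) and all(
            c >= cos_threshold for c in without
        ):
            return name
    return None
```

Calling `find_outlier({"ref": ref_out, "flydsl": fly_out, "aiter": aiter_out})` on the numbers above would be expected to return `"ref"`, since only the flydsl/aiter pair clears the threshold.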
The bug is specific to inter_dim=2048 (at inter_dim=256 the reference matches the kernels). The block-quant helpers (_block_quant_dequant_a / _block_quant_dequant_w_expert) look structurally correct for any block count, so the cause has not been root-caused yet. The most likely suspect is the inter-stage block-scale handling at 16 K-blocks vs 2.
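One way to narrow this down is to check block-scale indexing in isolation at both block counts. The sketch below is an assumption-laden illustration, not the code in MoeBlockscaleOp.reference: it assumes a K-block size of 128 (implied by 256 -> 2 blocks and 2048 -> 16 blocks), per-row 1x128 activation scales, and 128x128 weight scales, and all function names are hypothetical. It computes the same dequantized GEMM two ways (scaling each K-block's partial sum, and expanding the scales to full matrices first) and reports the difference.

```python
import numpy as np

BLOCK = 128  # assumed block size: 256 / 2 == 2048 / 16 == 128


def dequant_gemm_blockwise(a_q, a_scale, w_q, w_scale, block=BLOCK):
    """Accumulate per-K-block partial products, applying scales per block.

    Assumed layouts: a_q (M, K), a_scale (M, K // block),
    w_q (N, K), w_scale (N // block, K // block).
    """
    m, k = a_q.shape
    n = w_q.shape[0]
    out = np.zeros((m, n), dtype=np.float64)
    for kb in range(k // block):
        ks = slice(kb * block, (kb + 1) * block)
        partial = a_q[:, ks] @ w_q[:, ks].T          # (M, N)
        w_s = np.repeat(w_scale[:, kb], block)       # (N,)
        out += partial * a_scale[:, kb:kb + 1] * w_s[None, :]
    return out


def dequant_gemm_expanded(a_q, a_scale, w_q, w_scale, block=BLOCK):
    """Expand scales to full shape, dequantize, then do one plain GEMM."""
    a = a_q * np.repeat(a_scale, block, axis=1)
    w = w_q * np.repeat(np.repeat(w_scale, block, axis=0), block, axis=1)
    return a @ w.T


def check(k, m=4, n=256, seed=0):
    """Max abs difference between the two formulations for a given K."""
    rng = np.random.default_rng(seed)
    a_q = rng.standard_normal((m, k))
    w_q = rng.standard_normal((n, k))
    a_scale = rng.uniform(0.5, 2.0, size=(m, k // BLOCK))
    w_scale = rng.uniform(0.5, 2.0, size=(n // BLOCK, k // BLOCK))
    diff = dequant_gemm_blockwise(a_q, a_scale, w_q, w_scale) \
        - dequant_gemm_expanded(a_q, a_scale, w_q, w_scale)
    return float(np.max(np.abs(diff)))
```

Running `check(256)` and `check(2048)` exercises the 2-block and 16-block cases. If the real reference were swapped in for one of the two formulations, a mismatch that appears only at K=2048 would point at scale indexing rather than at the quantization itself.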
Impact: moe_blockscale shows 4/12 correct on the dashboard, but ~10/12 are actually kernel-correct (4 verified + 6 kernel-agreement; 2 are OOM). Fix the reference to recover the 6.
Verified-correct fixes already landed (PR #5): the f16-overflow data-prep in both the FlyDSL and aiter providers, and the fp8 MoE tolerance.