Iris-based MoE Implementation [initial] #397

Merged: neoblizz merged 10 commits into main from neoblizz/moe on Feb 27, 2026
Conversation

@neoblizz (Member) commented Feb 25, 2026

Motivation

This PR expands the distributed MoE work from the original triton_kernels tests (credit to them: https://github.com/triton-lang/triton/blob/main/python/triton_kernels/tests/test_distributed.py) to use Iris' symmetric memory abstraction, and works toward exposing fine-grained overlap.

Initial Benchmarking

| bpe | ntok | unfused | fused | ref (1 GPU) | fused/unf |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 4 | 128 | 9.821ms | 7.101ms | 14.7ms | 0.72x |
| 5 | 160 | 7.843ms | 10.726ms | 18.1ms | 1.37x |
| 6 | 192 | 7.133ms | 6.968ms | 20.8ms | 0.98x |
| 7 | 224 | 6.071ms | 6.426ms | 23.3ms | 1.06x |
| 8 | 256 | 7.845ms | 8.331ms | 26.1ms | 1.06x |
| 10 | 320 | 6.684ms | 7.774ms | 32.2ms | 1.16x |
| 12 | 384 | 7.097ms | 6.841ms | 38.0ms | 0.96x |
| 14 | 448 | 9.275ms | 7.753ms | 43.7ms | 0.84x |
| 16 | 512 | 7.090ms | 8.886ms | 48.6ms | 1.25x |
| 20 | 640 | 7.497ms | 7.619ms | 59.6ms | 1.02x |
| 24 | 768 | 6.031ms | 7.803ms | 71.9ms | 1.29x |
| 28 | 896 | 7.774ms | 6.793ms | 84.2ms | 0.87x |
| 32 | 1024 | 6.594ms | 8.324ms | 96.1ms | 1.26x |
| 40 | 1280 | 7.654ms | 7.827ms | 117.8ms | 1.02x |
| 48 | 1536 | 7.103ms | 5.913ms | 144.2ms | 0.83x |
| 56 | 1792 | 7.865ms | 8.619ms | 165.4ms | 1.10x |
| 64 | 2048 | 7.856ms | 8.600ms | 188.3ms | 1.09x |
| 80 | 2560 | 6.210ms | 8.843ms | 233.2ms | 1.42x |
| 96 | 3072 | 8.163ms | 8.455ms | 284.8ms | 1.04x |
| 112 | 3584 | 7.770ms | 8.254ms | 325.7ms | 1.06x |
| 128 | 4096 | 8.473ms | 8.332ms | 370.6ms | 0.98x |
| 160 | 5120 | 7.444ms | 7.133ms | 461.8ms | 0.96x |
| 192 | 6144 | 8.985ms | 9.675ms | 559.9ms | 1.08x |
| 224 | 7168 | 9.990ms | 9.327ms | 648.3ms | 0.93x |
| 256 | 8192 | 9.098ms | 9.346ms | 739.2ms | 1.03x |
| 288 | 9216 | 9.459ms | 9.502ms | 829.8ms | 1.00x |
| 320 | 10240 | 10.252ms | 10.887ms | 937.4ms | 1.06x |
| 352 | 11264 | 9.868ms | 11.134ms | 1012.1ms | 1.13x |
| 384 | 12288 | 10.983ms | 10.885ms | 1104.4ms | 0.99x |
| 416 | 13312 | 11.734ms | 11.559ms | 1192.1ms | 0.99x |
| 448 | 14336 | 12.003ms | 11.794ms | 1282.6ms | 0.98x |
| 480 | 15360 | 12.640ms | 13.014ms | 1421.3ms | 1.03x |
| 512 | 16384 | 12.086ms | 12.852ms | 1473.0ms | 1.06x |
| 544 | 17408 | 13.663ms | 13.385ms | 1559.6ms | 0.98x |
| 576 | 18432 | 13.659ms | 13.762ms | 1654.1ms | 1.01x |
| 608 | 19456 | 13.424ms | 14.260ms | 1736.3ms | 1.06x |
| 640 | 20480 | 14.087ms | 14.317ms | 1835.6ms | 1.02x |
| 672 | 21504 | 14.335ms | 15.182ms | 1924.2ms | 1.06x |
| 704 | 22528 | 15.207ms | 15.244ms | 2092.8ms | 1.00x |
| 736 | 23552 | 15.754ms | 16.270ms | 2112.5ms | 1.03x |
| 768 | 24576 | 15.961ms | 16.309ms | 2200.6ms | 1.02x |
| 800 | 25600 | 16.514ms | 17.049ms | 2298.0ms | 1.03x |
| 832 | 26624 | 15.863ms | 17.130ms | 2385.9ms | 1.08x |
| 864 | 27648 | 17.351ms | 17.319ms | 2481.3ms | 1.00x |
| 896 | 28672 | 17.625ms | 17.654ms | 2568.7ms | 1.00x |
| 928 | 29696 | 17.612ms | 17.443ms | 2664.6ms | 0.99x |
| 960 | 30720 | 18.070ms | 18.658ms | 2755.0ms | 1.03x |
| 992 | 31744 | 20.183ms | 18.783ms | 2845.0ms | 0.93x |
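
For reference, the fused/unf column is the fused time divided by the unfused time, so values below 1.0x mean the fused path is faster. A quick check against the bpe=4 and bpe=5 rows:

```python
# fused/unf ratio as reported in the table: fused time over unfused time,
# rounded to two decimal places (sub-1.0 means the fused kernel wins).
def fused_over_unfused(unfused_ms: float, fused_ms: float) -> float:
    return round(fused_ms / unfused_ms, 2)

ratio = fused_over_unfused(9.821, 7.101)  # bpe=4 row -> 0.72
```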

Submission Checklist

@github-actions github-actions bot added in-progress We are working on it iris Iris project issue labels Feb 25, 2026
@neoblizz neoblizz marked this pull request as ready for review February 26, 2026 18:22
@neoblizz neoblizz requested a review from mawad-amd as a code owner February 26, 2026 18:22
Copilot AI review requested due to automatic review settings February 26, 2026 18:22
@neoblizz neoblizz requested a review from BKP as a code owner February 26, 2026 18:22
@mawad-amd (Collaborator)
Can we move the dispatch and combine code into iris.x? These are more like standard APIs now.

@neoblizz (Member, Author)

> Can we move the dispatch and combine code into iris.x? These are more like standard APIs now.

Not yet; we need to see how these patterns are used in practice first. For now, everything will live under examples. A follow-up PR could extract the useful pieces and wrap them under ops and x.

Copilot AI (Contributor) left a comment
Pull request overview

Introduces an initial Iris-based expert-sharded MoE forward implementation (ported/simplified from Triton distributed MoE tests) along with a correctness test and a benchmark driver.

Changes:

  • Adds a new distributed expert-sharded MoE example pipeline (routing/top-k, DP↔EP data movement via Iris, grouped matmul, reduce, and an optional fused matmul+combine kernel).
  • Adds an example runner and a benchmark sweep script for MoE configurations.
  • Adds a distributed pytest that compares the sharded pipeline against the single-device reference.
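
As a mental model of the pipeline these changes implement, here is a minimal single-device sketch of a reference MoE forward pass (top-k routing, grouped per-expert matmul, gated combine). All names, shapes, and the gating function are illustrative assumptions, not code from this PR:

```python
# Minimal single-device MoE reference sketch: route each token to its
# top-k experts, run one grouped matmul per expert, then combine the
# gated outputs. Purely illustrative; the PR's kernels are in Triton.
import numpy as np

def moe_forward_reference(x, w_experts, k=2):
    """x: (n_tokens, d_model); w_experts: (n_experts, d_model, d_out)."""
    # Routing: a stand-in gate (dot product with each expert's mean column).
    logits = x @ w_experts.mean(axis=2).T            # (n_tokens, n_experts)
    topk_idx = np.argsort(-logits, axis=1)[:, :k]    # top-k expert ids
    topk_gate = np.take_along_axis(logits, topk_idx, axis=1)
    # Softmax over the k selected experts only.
    topk_gate = np.exp(topk_gate - topk_gate.max(axis=1, keepdims=True))
    topk_gate /= topk_gate.sum(axis=1, keepdims=True)
    out = np.zeros((x.shape[0], w_experts.shape[2]))
    for e in range(w_experts.shape[0]):
        # Grouped computation: gather the tokens routed to expert e,
        # run one matmul, and scatter the gated results back (combine).
        rows, slots = np.nonzero(topk_idx == e)
        if rows.size:
            out[rows] += topk_gate[rows, slots, None] * (x[rows] @ w_experts[e])
    return out
```

In the distributed version, the gather step becomes the DP→EP dispatch across ranks and the scatter-back becomes the EP→DP combine.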

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| `tests/examples/test_expert_sharded_moe.py` | New distributed correctness test for the expert-sharded MoE pipeline. |
| `examples/31_expert_sharded_moe/topk.py` | Top-k routing + metadata construction for dispatch/combine indexing. |
| `examples/31_expert_sharded_moe/reduce.py` | Triton reduce kernel used for the final combine step (masked sum). |
| `examples/31_expert_sharded_moe/ragged_metadata.py` | Ragged metadata helpers for grouped expert computation. |
| `examples/31_expert_sharded_moe/moe.py` | Reference and Iris-based distributed MoE pipeline + Iris all-gather helper. |
| `examples/31_expert_sharded_moe/grouped_matmul.py` | Triton grouped/ragged GEMM for per-expert matmul. |
| `examples/31_expert_sharded_moe/fused_exp_matmul_ep_to_dp.py` | Fused expert matmul + EP→DP combine via Iris stores. |
| `examples/31_expert_sharded_moe/expert_assignment.py` | Expert-to-rank assignment/bitmask metadata. |
| `examples/31_expert_sharded_moe/example_run.py` | Standalone runner to validate sharded vs reference. |
| `examples/31_expert_sharded_moe/dispatch.py` | DP→EP dispatch implementation using Iris symmetric heap stores. |
| `examples/31_expert_sharded_moe/combine.py` | EP→DP combine implementation using Iris symmetric heap stores. |
| `benchmark/examples/benchmark_moe.py` | Benchmark/validation harness for the distributed MoE example. |
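
To make the dispatch/combine division of labor concrete, here is a host-side Python analogy (no Iris, no GPUs): dispatch routes each DP rank's tokens to the EP rank that owns the chosen expert, and combine scatters the expert outputs back to each token's original slot. The function names and the expert-to-rank map are illustrative assumptions, not the PR's API:

```python
# Host-side analogy of DP->EP dispatch and EP->DP combine. In the actual
# example, the "inbox" writes are stores into the destination rank's
# Iris symmetric heap; here they are plain Python list appends.

def dispatch(tokens_per_rank, expert_of_token, expert_to_rank, n_ranks):
    """Return a per-EP-rank inbox of (src_rank, src_slot, token) entries."""
    inbox = [[] for _ in range(n_ranks)]
    for src_rank, tokens in enumerate(tokens_per_rank):
        for slot, tok in enumerate(tokens):
            dst = expert_to_rank[expert_of_token[src_rank][slot]]
            inbox[dst].append((src_rank, slot, tok))  # "store" to dst's heap
    return inbox

def combine(inbox, results, tokens_per_rank):
    """Scatter per-expert results back to each token's original slot."""
    out = [[None] * len(t) for t in tokens_per_rank]
    for dst_rank, entries in enumerate(inbox):
        for (src_rank, slot, _), res in zip(entries, results[dst_rank]):
            out[src_rank][slot] = res
    return out
```

The (src_rank, src_slot) bookkeeping is what the topk/ragged-metadata helpers maintain so the combine step knows where each result belongs.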

@mawad-amd (Collaborator) left a comment

Do we need to add a license here or something? Please merge whenever you are ready -- CI is down at the moment and it will take a bit to bring back up.

@neoblizz (Member, Author)

> Do we need to add a license here or something? Please merge whenever you are ready -- CI is down at the moment and it will take a bit to bring back up.

I need to fix the fused implementation. Will merge after.

@neoblizz (Member, Author)

I will run all tests locally before I merge, no worries! @mawad-amd

@neoblizz (Member, Author)

| bpe | n_tokens | Single-GPU (ms) | Unfused 8-GPU (ms) | Fused 8-GPU (ms) | Unfused Speedup | Fused Speedup |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 4 | 128 | 14.94 | 6.605 | 5.719 | 2.26x | 2.61x |
| 5 | 160 | 18.97 | 7.046 | 6.298 | 2.69x | 3.01x |
| 6 | 192 | 21.63 | 8.660 | 8.286 | 2.50x | 2.61x |
| 7 | 224 | 23.65 | 8.757 | 6.510 | 2.70x | 3.63x |
| 8 | 256 | 25.34 | 7.009 | 6.196 | 3.62x | 4.09x |
| 10 | 320 | 31.73 | 9.068 | 7.003 | 3.50x | 4.53x |
| 12 | 384 | 36.26 | 6.459 | 7.872 | 5.61x | 4.61x |
| 14 | 448 | 41.37 | 6.154 | 6.655 | 6.72x | 6.22x |
| 16 | 512 | 46.45 | 6.899 | 6.786 | 6.73x | 6.84x |
| 20 | 640 | 56.66 | 6.017 | 6.837 | 9.42x | 8.29x |
| 24 | 768 | 66.79 | 6.067 | 8.716 | 11.01x | 7.66x |
| 28 | 896 | 76.97 | 7.125 | 6.708 | 10.80x | 11.47x |
| 32 | 1024 | 87.78 | 8.090 | 8.285 | 10.85x | 10.60x |
| 40 | 1280 | 110.12 | 7.500 | 7.592 | 14.68x | 14.50x |
| 48 | 1536 | 128.66 | 8.401 | 7.073 | 15.32x | 18.19x |
| 56 | 1792 | 150.08 | 7.915 | 7.799 | 18.96x | 19.24x |
| 64 | 2048 | 170.72 | 7.660 | 8.361 | 22.29x | 20.42x |
| 80 | 2560 | 214.28 | 6.596 | 7.137 | 32.49x | 30.02x |
| 96 | 3072 | 253.85 | 6.038 | 8.411 | 42.04x | 30.18x |
| 112 | 3584 | 294.47 | 7.991 | 7.263 | 36.85x | 40.54x |
| 128 | 4096 | 337.00 | 7.316 | 7.550 | 46.07x | 44.64x |
| 160 | 5120 | 429.95 | 6.725 | 7.595 | 63.93x | 56.61x |
| 192 | 6144 | 502.26 | 7.725 | 7.746 | 65.03x | 64.84x |
| 224 | 7168 | 585.83 | 8.256 | 7.472 | 70.95x | 78.40x |
| 256 | 8192 | 671.03 | 7.725 | 8.088 | 86.86x | 82.97x |
| 288 | 9216 | 768.47 | 7.669 | 8.426 | 100.20x | 91.21x |
| 320 | 10240 | 835.26 | 8.769 | 8.934 | 95.26x | 93.50x |
| 352 | 11264 | 919.45 | 8.311 | 8.998 | 110.62x | 102.18x |
| 384 | 12288 | 997.85 | 8.747 | 8.176 | 114.08x | 122.05x |
| 416 | 13312 | 1081.64 | 8.944 | 8.949 | 120.93x | 120.86x |
| 448 | 14336 | 1188.36 | 8.657 | 8.796 | 137.28x | 135.10x |
| 480 | 15360 | 1255.78 | 8.768 | 14.419 | 143.23x | 87.09x |
| 512 | 16384 | 1336.75 | 8.887 | 10.193 | 150.42x | 131.15x |
| 544 | 17408 | 1422.48 | 8.587 | 9.371 | 165.67x | 151.80x |
| 576 | 18432 | 1500.20 | 9.444 | 9.776 | 158.85x | 153.47x |
| 608 | 19456 | 1575.70 | 10.337 | 10.550 | 152.44x | 149.36x |
| 640 | 20480 | 1656.15 | 10.016 | 10.714 | 165.36x | 154.58x |
| 672 | 21504 | 1795.87 | 10.925 | 10.617 | 164.38x | 169.16x |
| 704 | 22528 | 1824.75 | 10.301 | 10.620 | 177.14x | 171.82x |
| 736 | 23552 | 1918.18 | 10.499 | 13.054 | 182.70x | 146.94x |
| 768 | 24576 | 1991.40 | 10.008 | 11.141 | 199.00x | 178.75x |
| 800 | 25600 | 2080.69 | 9.566 | 12.788 | 217.52x | 162.71x |
| 832 | 26624 | 2164.62 | 11.085 | 11.627 | 195.28x | 186.18x |
| 864 | 27648 | 2251.40 | 10.900 | 12.332 | 206.55x | 182.57x |
| 896 | 28672 | 2341.37 | 12.565 | 12.317 | 186.31x | 190.12x |
| 928 | 29696 | 2420.40 | 11.618 | 12.062 | 208.34x | 200.66x |
| 960 | 30720 | 2500.33 | 11.865 | 13.101 | 210.77x | 190.85x |
| 992 | 31744 | 2572.05 | 12.414 | 13.954 | 207.19x | 184.29x |
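
The speedup columns divide the single-GPU reference time by the corresponding 8-GPU time. A quick check against the bpe=8 row:

```python
# Speedup vs. the single-GPU reference, rounded to two decimal places
# as in the table above.
def speedup(single_gpu_ms: float, multi_gpu_ms: float) -> float:
    return round(single_gpu_ms / multi_gpu_ms, 2)

unfused = speedup(25.34, 7.009)  # bpe=8 row, unfused -> 3.62
fused = speedup(25.34, 6.196)    # bpe=8 row, fused   -> 4.09
```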

@neoblizz neoblizz merged commit 360bf96 into main Feb 27, 2026
@neoblizz neoblizz deleted the neoblizz/moe branch February 27, 2026 03:45


Development

Successfully merging this pull request may close these issues.

  • [Feature]: Fine-grained Exp-Sharded MoE Support
  • [Feature]: Implement Multi-GPU MoE as an Example

3 participants