Iris-based MoE Implementation [initial] #397

Merged: neoblizz merged 10 commits into main from neoblizz/moe on Feb 27, 2026
Conversation

@neoblizz (Member) commented Feb 25, 2026

Motivation

This PR expands the distributed MoE work from the original triton_kernels tests (credit to them: https://github.com/triton-lang/triton/blob/main/python/triton_kernels/tests/test_distributed.py) to use Iris' symmetric memory abstraction, and works toward exposing fine-grained overlap.

Initial Benchmarking

| bpe | ntok | unfused | fused | ref (1 GPU) | fused/unf |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 4 | 128 | 9.821ms | 7.101ms | 14.7ms | 0.72x |
| 5 | 160 | 7.843ms | 10.726ms | 18.1ms | 1.37x |
| 6 | 192 | 7.133ms | 6.968ms | 20.8ms | 0.98x |
| 7 | 224 | 6.071ms | 6.426ms | 23.3ms | 1.06x |
| 8 | 256 | 7.845ms | 8.331ms | 26.1ms | 1.06x |
| 10 | 320 | 6.684ms | 7.774ms | 32.2ms | 1.16x |
| 12 | 384 | 7.097ms | 6.841ms | 38.0ms | 0.96x |
| 14 | 448 | 9.275ms | 7.753ms | 43.7ms | 0.84x |
| 16 | 512 | 7.090ms | 8.886ms | 48.6ms | 1.25x |
| 20 | 640 | 7.497ms | 7.619ms | 59.6ms | 1.02x |
| 24 | 768 | 6.031ms | 7.803ms | 71.9ms | 1.29x |
| 28 | 896 | 7.774ms | 6.793ms | 84.2ms | 0.87x |
| 32 | 1024 | 6.594ms | 8.324ms | 96.1ms | 1.26x |
| 40 | 1280 | 7.654ms | 7.827ms | 117.8ms | 1.02x |
| 48 | 1536 | 7.103ms | 5.913ms | 144.2ms | 0.83x |
| 56 | 1792 | 7.865ms | 8.619ms | 165.4ms | 1.10x |
| 64 | 2048 | 7.856ms | 8.600ms | 188.3ms | 1.09x |
| 80 | 2560 | 6.210ms | 8.843ms | 233.2ms | 1.42x |
| 96 | 3072 | 8.163ms | 8.455ms | 284.8ms | 1.04x |
| 112 | 3584 | 7.770ms | 8.254ms | 325.7ms | 1.06x |
| 128 | 4096 | 8.473ms | 8.332ms | 370.6ms | 0.98x |
| 160 | 5120 | 7.444ms | 7.133ms | 461.8ms | 0.96x |
| 192 | 6144 | 8.985ms | 9.675ms | 559.9ms | 1.08x |
| 224 | 7168 | 9.990ms | 9.327ms | 648.3ms | 0.93x |
| 256 | 8192 | 9.098ms | 9.346ms | 739.2ms | 1.03x |
| 288 | 9216 | 9.459ms | 9.502ms | 829.8ms | 1.00x |
| 320 | 10240 | 10.252ms | 10.887ms | 937.4ms | 1.06x |
| 352 | 11264 | 9.868ms | 11.134ms | 1012.1ms | 1.13x |
| 384 | 12288 | 10.983ms | 10.885ms | 1104.4ms | 0.99x |
| 416 | 13312 | 11.734ms | 11.559ms | 1192.1ms | 0.99x |
| 448 | 14336 | 12.003ms | 11.794ms | 1282.6ms | 0.98x |
| 480 | 15360 | 12.640ms | 13.014ms | 1421.3ms | 1.03x |
| 512 | 16384 | 12.086ms | 12.852ms | 1473.0ms | 1.06x |
| 544 | 17408 | 13.663ms | 13.385ms | 1559.6ms | 0.98x |
| 576 | 18432 | 13.659ms | 13.762ms | 1654.1ms | 1.01x |
| 608 | 19456 | 13.424ms | 14.260ms | 1736.3ms | 1.06x |
| 640 | 20480 | 14.087ms | 14.317ms | 1835.6ms | 1.02x |
| 672 | 21504 | 14.335ms | 15.182ms | 1924.2ms | 1.06x |
| 704 | 22528 | 15.207ms | 15.244ms | 2092.8ms | 1.00x |
| 736 | 23552 | 15.754ms | 16.270ms | 2112.5ms | 1.03x |
| 768 | 24576 | 15.961ms | 16.309ms | 2200.6ms | 1.02x |
| 800 | 25600 | 16.514ms | 17.049ms | 2298.0ms | 1.03x |
| 832 | 26624 | 15.863ms | 17.130ms | 2385.9ms | 1.08x |
| 864 | 27648 | 17.351ms | 17.319ms | 2481.3ms | 1.00x |
| 896 | 28672 | 17.625ms | 17.654ms | 2568.7ms | 1.00x |
| 928 | 29696 | 17.612ms | 17.443ms | 2664.6ms | 0.99x |
| 960 | 30720 | 18.070ms | 18.658ms | 2755.0ms | 1.03x |
| 992 | 31744 | 20.183ms | 18.783ms | 2845.0ms | 0.93x |
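
For reference, the fused/unf column is the fused time divided by the unfused time, so values below 1.0x mean the fused path is faster. A quick check against the bpe=4 and bpe=5 rows:

```python
# fused/unf ratio as reported in the table: fused time over unfused time,
# rounded to two decimal places (sub-1.0 means the fused kernel wins).
def fused_over_unfused(unfused_ms: float, fused_ms: float) -> float:
    return round(fused_ms / unfused_ms, 2)

ratio = fused_over_unfused(9.821, 7.101)  # bpe=4 row -> 0.72
```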

Submission Checklist

@github-actions github-actions bot added in-progress We are working on it iris Iris project issue labels Feb 25, 2026
@neoblizz neoblizz marked this pull request as ready for review February 26, 2026 18:22
@neoblizz neoblizz requested a review from mawad-amd as a code owner February 26, 2026 18:22
Copilot AI review requested due to automatic review settings February 26, 2026 18:22
@neoblizz neoblizz requested a review from BKP as a code owner February 26, 2026 18:22
@mawad-amd (Collaborator)
Can we move the dispatch and combine code into iris.x? These are more like standard APIs now.

@neoblizz (Member, Author)

> Can we move the dispatch and combine code into iris.x? These are more like standard APIs now.

Not yet; we need to see how these patterns are used in practice first. For now, everything will live under examples. A follow-up PR could extract the useful pieces and wrap them under ops and x.

Copilot AI (Contributor) left a comment
Pull request overview

Introduces an initial Iris-based expert-sharded MoE forward implementation (ported/simplified from Triton distributed MoE tests) along with a correctness test and a benchmark driver.

Changes:

  • Adds a new distributed expert-sharded MoE example pipeline (routing/top-k, DP↔EP data movement via Iris, grouped matmul, reduce, and an optional fused matmul+combine kernel).
  • Adds an example runner and a benchmark sweep script for MoE configurations.
  • Adds a distributed pytest that compares the sharded pipeline against the single-device reference.
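
As a mental model of the pipeline these changes implement, here is a minimal single-device sketch of a reference MoE forward pass (top-k routing, grouped per-expert matmul, gated combine). All names, shapes, and the gating function are illustrative assumptions, not code from this PR:

```python
# Minimal single-device MoE reference sketch: route each token to its
# top-k experts, run one grouped matmul per expert, then combine the
# gated outputs. Purely illustrative; the PR's kernels are in Triton.
import numpy as np

def moe_forward_reference(x, w_experts, k=2):
    """x: (n_tokens, d_model); w_experts: (n_experts, d_model, d_out)."""
    # Routing: a stand-in gate (dot product with each expert's mean column).
    logits = x @ w_experts.mean(axis=2).T            # (n_tokens, n_experts)
    topk_idx = np.argsort(-logits, axis=1)[:, :k]    # top-k expert ids
    topk_gate = np.take_along_axis(logits, topk_idx, axis=1)
    # Softmax over the k selected experts only.
    topk_gate = np.exp(topk_gate - topk_gate.max(axis=1, keepdims=True))
    topk_gate /= topk_gate.sum(axis=1, keepdims=True)
    out = np.zeros((x.shape[0], w_experts.shape[2]))
    for e in range(w_experts.shape[0]):
        # Grouped computation: gather the tokens routed to expert e,
        # run one matmul, and scatter the gated results back (combine).
        rows, slots = np.nonzero(topk_idx == e)
        if rows.size:
            out[rows] += topk_gate[rows, slots, None] * (x[rows] @ w_experts[e])
    return out
```

In the distributed version, the gather step becomes the DP→EP dispatch across ranks and the scatter-back becomes the EP→DP combine.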

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| `tests/examples/test_expert_sharded_moe.py` | New distributed correctness test for the expert-sharded MoE pipeline. |
| `examples/31_expert_sharded_moe/topk.py` | Top-k routing + metadata construction for dispatch/combine indexing. |
| `examples/31_expert_sharded_moe/reduce.py` | Triton reduce kernel used for the final combine step (masked sum). |
| `examples/31_expert_sharded_moe/ragged_metadata.py` | Ragged metadata helpers for grouped expert computation. |
| `examples/31_expert_sharded_moe/moe.py` | Reference and Iris-based distributed MoE pipeline + Iris all-gather helper. |
| `examples/31_expert_sharded_moe/grouped_matmul.py` | Triton grouped/ragged GEMM for per-expert matmul. |
| `examples/31_expert_sharded_moe/fused_exp_matmul_ep_to_dp.py` | Fused expert matmul + EP→DP combine via Iris stores. |
| `examples/31_expert_sharded_moe/expert_assignment.py` | Expert-to-rank assignment/bitmask metadata. |
| `examples/31_expert_sharded_moe/example_run.py` | Standalone runner to validate sharded vs reference. |
| `examples/31_expert_sharded_moe/dispatch.py` | DP→EP dispatch implementation using Iris symmetric heap stores. |
| `examples/31_expert_sharded_moe/combine.py` | EP→DP combine implementation using Iris symmetric heap stores. |
| `benchmark/examples/benchmark_moe.py` | Benchmark/validation harness for the distributed MoE example. |
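
To make the dispatch/combine division of labor concrete, here is a host-side Python analogy (no Iris, no GPUs): dispatch routes each DP rank's tokens to the EP rank that owns the chosen expert, and combine scatters the expert outputs back to each token's original slot. The function names and the expert-to-rank map are illustrative assumptions, not the PR's API:

```python
# Host-side analogy of DP->EP dispatch and EP->DP combine. In the actual
# example, the "inbox" writes are stores into the destination rank's
# Iris symmetric heap; here they are plain Python list appends.

def dispatch(tokens_per_rank, expert_of_token, expert_to_rank, n_ranks):
    """Return a per-EP-rank inbox of (src_rank, src_slot, token) entries."""
    inbox = [[] for _ in range(n_ranks)]
    for src_rank, tokens in enumerate(tokens_per_rank):
        for slot, tok in enumerate(tokens):
            dst = expert_to_rank[expert_of_token[src_rank][slot]]
            inbox[dst].append((src_rank, slot, tok))  # "store" to dst's heap
    return inbox

def combine(inbox, results, tokens_per_rank):
    """Scatter per-expert results back to each token's original slot."""
    out = [[None] * len(t) for t in tokens_per_rank]
    for dst_rank, entries in enumerate(inbox):
        for (src_rank, slot, _), res in zip(entries, results[dst_rank]):
            out[src_rank][slot] = res
    return out
```

The (src_rank, src_slot) bookkeeping is what the topk/ragged-metadata helpers maintain so the combine step knows where each result belongs.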

@mawad-amd (Collaborator) left a comment

Do we need to add a license here or something? Please merge whenever you are ready -- CI is down at the moment and it will take a bit to bring back up.

@neoblizz (Member, Author)

> Do we need to add a license here or something? Please merge whenever you are ready -- CI is down at the moment and it will take a bit to bring back up.

I need to fix the fused implementation. Will merge after.

@neoblizz (Member, Author)

I will run all tests locally before I merge, no worries! @mawad-amd

@neoblizz (Member, Author)

| bpe | n_tokens | Single-GPU (ms) | Unfused 8-GPU (ms) | Fused 8-GPU (ms) | Unfused Speedup | Fused Speedup |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| 4 | 128 | 14.94 | 6.605 | 5.719 | 2.26x | 2.61x |
| 5 | 160 | 18.97 | 7.046 | 6.298 | 2.69x | 3.01x |
| 6 | 192 | 21.63 | 8.660 | 8.286 | 2.50x | 2.61x |
| 7 | 224 | 23.65 | 8.757 | 6.510 | 2.70x | 3.63x |
| 8 | 256 | 25.34 | 7.009 | 6.196 | 3.62x | 4.09x |
| 10 | 320 | 31.73 | 9.068 | 7.003 | 3.50x | 4.53x |
| 12 | 384 | 36.26 | 6.459 | 7.872 | 5.61x | 4.61x |
| 14 | 448 | 41.37 | 6.154 | 6.655 | 6.72x | 6.22x |
| 16 | 512 | 46.45 | 6.899 | 6.786 | 6.73x | 6.84x |
| 20 | 640 | 56.66 | 6.017 | 6.837 | 9.42x | 8.29x |
| 24 | 768 | 66.79 | 6.067 | 8.716 | 11.01x | 7.66x |
| 28 | 896 | 76.97 | 7.125 | 6.708 | 10.80x | 11.47x |
| 32 | 1024 | 87.78 | 8.090 | 8.285 | 10.85x | 10.60x |
| 40 | 1280 | 110.12 | 7.500 | 7.592 | 14.68x | 14.50x |
| 48 | 1536 | 128.66 | 8.401 | 7.073 | 15.32x | 18.19x |
| 56 | 1792 | 150.08 | 7.915 | 7.799 | 18.96x | 19.24x |
| 64 | 2048 | 170.72 | 7.660 | 8.361 | 22.29x | 20.42x |
| 80 | 2560 | 214.28 | 6.596 | 7.137 | 32.49x | 30.02x |
| 96 | 3072 | 253.85 | 6.038 | 8.411 | 42.04x | 30.18x |
| 112 | 3584 | 294.47 | 7.991 | 7.263 | 36.85x | 40.54x |
| 128 | 4096 | 337.00 | 7.316 | 7.550 | 46.07x | 44.64x |
| 160 | 5120 | 429.95 | 6.725 | 7.595 | 63.93x | 56.61x |
| 192 | 6144 | 502.26 | 7.725 | 7.746 | 65.03x | 64.84x |
| 224 | 7168 | 585.83 | 8.256 | 7.472 | 70.95x | 78.40x |
| 256 | 8192 | 671.03 | 7.725 | 8.088 | 86.86x | 82.97x |
| 288 | 9216 | 768.47 | 7.669 | 8.426 | 100.20x | 91.21x |
| 320 | 10240 | 835.26 | 8.769 | 8.934 | 95.26x | 93.50x |
| 352 | 11264 | 919.45 | 8.311 | 8.998 | 110.62x | 102.18x |
| 384 | 12288 | 997.85 | 8.747 | 8.176 | 114.08x | 122.05x |
| 416 | 13312 | 1081.64 | 8.944 | 8.949 | 120.93x | 120.86x |
| 448 | 14336 | 1188.36 | 8.657 | 8.796 | 137.28x | 135.10x |
| 480 | 15360 | 1255.78 | 8.768 | 14.419 | 143.23x | 87.09x |
| 512 | 16384 | 1336.75 | 8.887 | 10.193 | 150.42x | 131.15x |
| 544 | 17408 | 1422.48 | 8.587 | 9.371 | 165.67x | 151.80x |
| 576 | 18432 | 1500.20 | 9.444 | 9.776 | 158.85x | 153.47x |
| 608 | 19456 | 1575.70 | 10.337 | 10.550 | 152.44x | 149.36x |
| 640 | 20480 | 1656.15 | 10.016 | 10.714 | 165.36x | 154.58x |
| 672 | 21504 | 1795.87 | 10.925 | 10.617 | 164.38x | 169.16x |
| 704 | 22528 | 1824.75 | 10.301 | 10.620 | 177.14x | 171.82x |
| 736 | 23552 | 1918.18 | 10.499 | 13.054 | 182.70x | 146.94x |
| 768 | 24576 | 1991.40 | 10.008 | 11.141 | 199.00x | 178.75x |
| 800 | 25600 | 2080.69 | 9.566 | 12.788 | 217.52x | 162.71x |
| 832 | 26624 | 2164.62 | 11.085 | 11.627 | 195.28x | 186.18x |
| 864 | 27648 | 2251.40 | 10.900 | 12.332 | 206.55x | 182.57x |
| 896 | 28672 | 2341.37 | 12.565 | 12.317 | 186.31x | 190.12x |
| 928 | 29696 | 2420.40 | 11.618 | 12.062 | 208.34x | 200.66x |
| 960 | 30720 | 2500.33 | 11.865 | 13.101 | 210.77x | 190.85x |
| 992 | 31744 | 2572.05 | 12.414 | 13.954 | 207.19x | 184.29x |
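
The speedup columns divide the single-GPU reference time by the corresponding 8-GPU time. A quick check against the bpe=8 row:

```python
# Speedup vs. the single-GPU reference, rounded to two decimal places
# as in the table above.
def speedup(single_gpu_ms: float, multi_gpu_ms: float) -> float:
    return round(single_gpu_ms / multi_gpu_ms, 2)

unfused = speedup(25.34, 7.009)  # bpe=8 row, unfused -> 3.62
fused = speedup(25.34, 6.196)    # bpe=8 row, fused   -> 4.09
```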

@neoblizz neoblizz merged commit 360bf96 into main Feb 27, 2026
@neoblizz neoblizz deleted the neoblizz/moe branch February 27, 2026 03:45


Development

Successfully merging this pull request may close these issues.

  • [Feature]: Fine-grained Exp-Sharded MoE Support
  • [Feature]: Implement Multi-GPU MoE as an Example

3 participants