Conversation
Can we move the dispatch and combine code into
Not yet; we have to see how these patterns are actually used first. For now, everything will live under examples. Maybe a later PR can extract the useful pieces and wrap them under ops and x.
Pull request overview
Introduces an initial Iris-based expert-sharded MoE forward implementation (ported/simplified from Triton distributed MoE tests) along with a correctness test and a benchmark driver.
Changes:
- Adds a new distributed expert-sharded MoE example pipeline (routing/top-k, DP↔EP data movement via Iris, grouped matmul, reduce, and an optional fused matmul+combine kernel).
- Adds an example runner and a benchmark sweep script for MoE configurations.
- Adds a distributed pytest that compares the sharded pipeline against the single-device reference.
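To make the pipeline stages above concrete, here is a minimal, dependency-free sketch of the single-device reference path (top-k routing followed by a weighted combine of expert outputs). The function names and list-based representation are illustrative only and are not the PR's actual API; the real implementation uses Triton kernels over tensors.

```python
import math

def topk_route(logits, k):
    """Pick the top-k experts per token and renormalize their softmax weights.
    `logits` is a list of per-token router scores (one inner list per token)."""
    routes = []
    for row in logits:
        # Numerically stable softmax over all experts.
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        probs = [e / z for e in exps]
        # Indices of the k largest probabilities.
        idx = sorted(range(len(row)), key=lambda i: -probs[i])[:k]
        w = [probs[i] for i in idx]
        s = sum(w)
        # Renormalize so the selected experts' weights sum to 1.
        routes.append([(i, wi / s) for i, wi in zip(idx, w)])
    return routes

def moe_forward(tokens, routes, experts):
    """Reference combine: weighted sum of the selected experts' outputs."""
    out = []
    for tok, route in zip(tokens, routes):
        acc = [0.0] * len(tok)
        for eid, w in route:
            y = experts[eid](tok)
            acc = [a + w * yi for a, yi in zip(acc, y)]
        out.append(acc)
    return out
```

In the sharded version, the tokens selected by `topk_route` would be dispatched DP→EP to the rank owning each expert before the per-expert matmul, and the weighted sum would happen after the EP→DP combine.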
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| tests/examples/test_expert_sharded_moe.py | New distributed correctness test for the expert-sharded MoE pipeline. |
| examples/31_expert_sharded_moe/topk.py | Top-k routing + metadata construction for dispatch/combine indexing. |
| examples/31_expert_sharded_moe/reduce.py | Triton reduce kernel used for the final combine step (masked sum). |
| examples/31_expert_sharded_moe/ragged_metadata.py | Ragged metadata helpers for grouped expert computation. |
| examples/31_expert_sharded_moe/moe.py | Reference and Iris-based distributed MoE pipeline + Iris all-gather helper. |
| examples/31_expert_sharded_moe/grouped_matmul.py | Triton grouped/ragged GEMM for per-expert matmul. |
| examples/31_expert_sharded_moe/fused_exp_matmul_ep_to_dp.py | Fused expert matmul + EP→DP combine via Iris stores. |
| examples/31_expert_sharded_moe/expert_assignment.py | Expert-to-rank assignment/bitmask metadata. |
| examples/31_expert_sharded_moe/example_run.py | Standalone runner to validate sharded vs reference. |
| examples/31_expert_sharded_moe/dispatch.py | DP→EP dispatch implementation using Iris symmetric heap stores. |
| examples/31_expert_sharded_moe/combine.py | EP→DP combine implementation using Iris symmetric heap stores. |
| benchmark/examples/benchmark_moe.py | Benchmark/validation harness for the distributed MoE example. |
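The ragged-metadata and dispatch files in the table revolve around one idea: histogram the routed tokens per expert, take an exclusive prefix sum to get each expert's row offset, and pack tokens into a contiguous ragged buffer for the grouped GEMM. A hand-rolled sketch of that indexing follows; all names here are hypothetical, not the helpers defined in `ragged_metadata.py` or `dispatch.py`.

```python
def expert_histogram(expert_ids, num_experts):
    """Count how many routed tokens each expert receives (dispatch sizes)."""
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    return counts

def ragged_offsets(counts):
    """Exclusive prefix sum: start offset of each expert's row block
    in the packed (ragged) buffer consumed by the grouped GEMM."""
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    return offsets, total

def pack_tokens(tokens, expert_ids, num_experts):
    """Stable-pack tokens by expert so each expert sees a contiguous slice.
    `gather` records each packed row's original position, which is exactly
    the index map the combine step needs to scatter results back."""
    counts = expert_histogram(expert_ids, num_experts)
    offsets, total = ragged_offsets(counts)
    cursor = list(offsets)
    packed = [None] * total
    gather = [0] * total
    for row, (tok, e) in enumerate(zip(tokens, expert_ids)):
        packed[cursor[e]] = tok
        gather[cursor[e]] = row
        cursor[e] += 1
    return packed, gather, offsets, counts
```

In the distributed setting, the same counts/offsets drive the Iris symmetric-heap stores: each DP rank writes its tokens into the owning EP rank's slice at the computed offsets, and the combine step inverts `gather` to scatter expert outputs back to their source rows.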
mawad-amd left a comment
Do we need to add a license here or something? Please merge whenever you are ready; CI is down at the moment and it will take a bit to bring back up.
I need to fix the fused implementation. Will merge after.
I will run all tests locally before I merge, no worries! @mawad-amd
Motivation
Expands the distributed MoE work from the original triton_kernels (credit to them: https://github.com/triton-lang/triton/blob/main/python/triton_kernels/tests/test_distributed.py) to use Iris' symmetric-memory abstraction, and works toward exposing fine-grained compute/communication overlap.
Initial Benchmarking
Submission Checklist