[TLERaw][Mega] Add Triton TLE NVSHMEM MegaMoE correctness case by Zhang-kg · Pull Request #712 · flagos-ai/FlagTree

Zhang-kg · 2026-06-23T05:32:51Z

Summary

This PR adds a Triton TLE + raw NVSHMEM MegaMoE integration case under:

python/test/tle/integration/megamoe/

The goal of this PR is to demonstrate that the current FlagTree/Triton TLE stack can run a small end-to-end MegaMoE-style operator with:

raw NVSHMEM dispatch / receiver
TLE warp specialization
FP8 L1/L2 TensorCore tl.dot compute
remote combine staging
local y reduce
repeated launch with workspace cleanup

This is a correctness/capability integration case, not yet a production-performance replacement for UserHopperMegaMoE.

Environment

Validated locally on:

GPU: 8x NVIDIA H100 80GB HBM3
CUDA: 12.8
Python: 3.10
NVSHMEM: 3.4.5 (nvidia-nvshmem-cu12==3.4.5)
MPI launcher: Open MPI 4.1.2 (mpirun)
Triton: FlagTree PR682-based Triton with TLE raw NVSHMEM support

The runnable instructions are documented in:

python/test/tle/integration/megamoe/RUNBOOK_CN.md

Implementation Status

Added files include:

python/test/tle/integration/megamoe/
  megamoe_operator/
    triton_tle_megamoe_operator.py
    triton_tle_megamoe_runtime.py
    ws_userhopper_dispatch_receiver_device.cu
    ws_userhopper_dispatch_receiver_extern_call.py
    ws_userhopper_dispatch_receiver_host.cu
  tests/
    megamoe_local_harness.py
    run_isolated_operator.py
  perf/
  test_logs/
  RUNBOOK_CN.md

Current main kernel path: _single_kernel_dispatch_receiver_l1_l2_tile_split_multi_cta_tldot_kernel

The implementation currently validates a merged dispatch/receiver/compute/combine path for small controlled shapes. The 8-rank H256 path uses a tile-split multi-CTA compute structure.

A file lock and atomic temporary output path were added around local host helper compilation to avoid concurrent mpirun ranks corrupting the generated .so.
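
The lock-plus-atomic-publish pattern described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `build_once` and the `build_fn` callback are hypothetical names, and the sketch assumes a POSIX system (it uses `fcntl.flock`) with the temporary file created in the same directory as the final output so that `os.replace` is atomic.

```python
import fcntl
import os
import tempfile
from pathlib import Path


def build_once(out_path, build_fn):
    """Build out_path under an exclusive lock and publish it atomically.

    build_fn(tmp_path) is a hypothetical callback that writes the artifact
    (for example by invoking a compiler) to tmp_path.
    """
    out_path = Path(out_path)
    lock_path = out_path.with_name(out_path.name + ".lock")
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock (POSIX only).
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if out_path.exists():
                return out_path  # another rank already built it
            # Temp file lives next to the target so os.replace stays on
            # one filesystem and readers never see a partial file.
            fd, tmp_name = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
            os.close(fd)
            try:
                build_fn(tmp_name)
                os.replace(tmp_name, out_path)
            finally:
                # Clean up only if the build failed before the rename.
                if os.path.exists(tmp_name):
                    os.remove(tmp_name)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return out_path
```

With this shape, the first rank to take the lock compiles the helper, and every later rank finds the finished file and returns without rebuilding.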

Verified Cases

Case 1: 2-rank H128

Config:

world_size = 2
H = 128
I = 128
experts = 2
topk = 1
tokens/rank = 2
repeats = 3
cleanup = 1

Result:

PASS printed on both ranks
finalize still hangs after PASS, so the command does not exit on its own and is expected to be terminated by a timeout
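
Because finalize hangs in this case, the run has to be bounded by a timeout. One way to do that from Python is sketched below, assuming the launcher is started as a child process. `run_with_timeout` is a hypothetical helper and not part of this PR. Killing the launcher may not reap every rank, so a real harness might need extra cleanup.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (timed_out, stdout); the child is killed on timeout."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        return False, done.stdout
    except subprocess.TimeoutExpired as exc:
        # Partial output may arrive as bytes or None depending on platform.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return True, out


if __name__ == "__main__":
    # Stand-in for a process that prints PASS and then never exits.
    hang = [sys.executable, "-c",
            "import time; print('PASS', flush=True); time.sleep(30)"]
    timed_out, out = run_with_timeout(hang, timeout_s=2)
    print("timed out:", timed_out)
```

A caller can then treat "PASS seen in the output, then timeout" as the expected outcome for this case until the hang is fixed.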

Latest local timing from test_logs/local_2rank_h128_tokens2_repeats3.log:

rank 0 steady_avg_us = 4920.624, steady_max_us = 6031.936
rank 1 steady_avg_us = 3580.512, steady_max_us = 3926.080

Case 2: 8-rank H256 masked tile-split

Config:

world_size = 8
H = 256
I = 128
experts = 16
topk = 8
route_mode = masked
tokens/rank = 1
repeats = 2
cleanup = 1
compute_order = expert_wave_multi_cta_l1_l2_tile_split
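
For reference, the two verified configurations can be written down as a small data structure. This is only a sketch: `CaseConfig` and its derived properties are hypothetical, and the even sharding of experts across ranks is an assumption that the PR description does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseConfig:
    world_size: int
    hidden: int           # H
    intermediate: int     # I
    experts: int
    topk: int
    tokens_per_rank: int
    repeats: int

    @property
    def experts_per_rank(self):
        # Assumption: experts are sharded evenly across ranks.
        assert self.experts % self.world_size == 0
        return self.experts // self.world_size

    @property
    def max_routed_pairs(self):
        # Upper bound on (token, expert) pairs, before any route masking.
        return self.world_size * self.tokens_per_rank * self.topk


CASE_1 = CaseConfig(world_size=2, hidden=128, intermediate=128,
                    experts=2, topk=1, tokens_per_rank=2, repeats=3)
CASE_2 = CaseConfig(world_size=8, hidden=256, intermediate=128,
                    experts=16, topk=8, tokens_per_rank=1, repeats=2)

print(CASE_1.experts_per_rank, CASE_1.max_routed_pairs)
print(CASE_2.experts_per_rank, CASE_2.max_routed_pairs)
```

Under the even-sharding assumption, Case 1 places one expert on each rank and Case 2 places two.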

Result:

PASS on all 8 ranks
process exits normally
checked = 6
counts = [3, 3]

Latest local timing from test_logs/local_8rank_h256_topk8_masked_tokens1_tile_split_repeats2.log:

steady_min_us across ranks ~= 16207.552
steady_max_us across ranks ~= 16471.872
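
As a quick sanity check on the two figures above, the spread between the slowest and fastest rank can be computed directly. This is plain arithmetic on the reported numbers, not an additional measurement.

```python
steady_min_us = 16207.552
steady_max_us = 16471.872

# Absolute and relative gap between the fastest and slowest rank.
spread_us = steady_max_us - steady_min_us
spread_pct = 100.0 * spread_us / steady_min_us

print(f"spread = {spread_us:.3f} us ({spread_pct:.2f}% of min)")
```

The gap is about 264 us, roughly 1.6% of the minimum, so the ranks are closely aligned in this run.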

Current Gap vs UserHopperMegaMoE

This PR does not claim production equivalence with UserHopperMegaMoE yet.

Known gaps:

Only small correctness/capability shapes are covered.
Production MoE shapes are not supported yet.
The current H256 path relies on tile-split multi-CTA structure.
It does not yet implement UserHopper’s full persistent scheduler.
It does not yet implement the mature UserHopper TMA pipeline / mbarrier structure.
It does not yet implement production-grade expert-wave scheduling and combine path.
The 2-rank H128 case still has a finalize hang after PASS.
Performance numbers are only for sanity/capability tracking, not for speedup claims.

Next Steps

Planned follow-up work:

Remove the 2-rank finalize hang.
Convert the current manual runner into a more standard gated integration test.
Expand supported shapes beyond the current small H128/H256 cases.
Move closer to UserHopper’s production structure:
- persistent scheduling
- expert-wave execution
- TMA-style pipeline
- finer synchronization and arrival tracking
- production combine/reduce path
Add comparable benchmark cases against existing MegaMoE implementations once shapes and semantics are aligned.

CLAassistant · 2026-06-23T05:32:59Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ lizhangyu258
❌ Zhang-kg

lizhangyu258 added 7 commits June 10, 2026 00:37

support nvshmem

011ab66

add nvshmem/example

9588e13

add macro define parameters

fd24dca

merge tle_raw.call and libdevice.call

396925c

refactor cuda jit nvcc compile

c56b174

register make_cubin during the first initialization

2f7ef63

add test case

8cea58e

github-actions bot added the nvidia and triton_v3.6.x labels on Jun 23, 2026

Zhang-kg force-pushed the triton-tle-megamoe-integration branch from bea0c04 to 3e95f07 on June 23, 2026 06:24

Add Triton TLE NVSHMEM MegaMoE integration case

3e95f07

i3wanna2 changed the title from "[KMCompiler] [TLERaw] Add Triton TLE NVSHMEM MegaMoE correctness case" to "[TLERaw][Mega] Add Triton TLE NVSHMEM MegaMoE correctness case" on Jun 23, 2026

Zhang-kg wants to merge 8 commits into flagos-ai:triton_v3.6.x from Zhang-kg:triton-tle-megamoe-integration

3 participants