
[TLERaw][Mega] Add Triton TLE NVSHMEM MegaMoE correctness case #712

Draft
Zhang-kg wants to merge 8 commits into flagos-ai:triton_v3.6.x from Zhang-kg:triton-tle-megamoe-integration

@Zhang-kg

Summary

This PR adds a Triton TLE + raw NVSHMEM MegaMoE integration case under:

python/test/tle/integration/megamoe/

The goal of this PR is to demonstrate that the current FlagTree/Triton TLE stack can run a small end-to-end MegaMoE-style operator with:

  • raw NVSHMEM dispatch / receiver
  • TLE warp specialization
  • FP8 L1/L2 TensorCore tl.dot compute
  • remote combine staging
  • local y reduce
  • repeated launch with workspace cleanup

This is a correctness/capability integration case, not yet a production-performance replacement for UserHopperMegaMoE.

Environment

Validated locally on:

GPU: 8x NVIDIA H100 80GB HBM3
CUDA: 12.8
Python: 3.10
NVSHMEM: 3.4.5 (nvidia-nvshmem-cu12==3.4.5)
MPI launcher: Open MPI 4.1.2 (mpirun)
Triton: FlagTree PR682-based Triton with TLE raw NVSHMEM support

The runnable instructions are documented in:

python/test/tle/integration/megamoe/RUNBOOK_CN.md

Implementation Status

Added files include:

python/test/tle/integration/megamoe/
  megamoe_operator/
    triton_tle_megamoe_operator.py
    triton_tle_megamoe_runtime.py
    ws_userhopper_dispatch_receiver_device.cu
    ws_userhopper_dispatch_receiver_extern_call.py
    ws_userhopper_dispatch_receiver_host.cu
  tests/
    megamoe_local_harness.py
    run_isolated_operator.py
  perf/
  test_logs/
  RUNBOOK_CN.md

Current main kernel path: _single_kernel_dispatch_receiver_l1_l2_tile_split_multi_cta_tldot_kernel

The implementation currently validates a merged dispatch/receiver/compute/combine path for small controlled shapes. The 8-rank H256 path uses a tile-split multi-CTA compute structure.

A file lock and an atomic temporary output path now guard the local host helper compilation, so that concurrent mpirun ranks cannot corrupt the generated .so.
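The lock-plus-atomic-rename pattern described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `build_once` and the `build_fn` callback are hypothetical names, and it assumes a POSIX host where `fcntl.flock` is available and the output directory is on a local filesystem.

```python
import fcntl
import os
import tempfile
from pathlib import Path


def build_once(out_path, build_fn):
    """Build out_path at most once, even when several ranks call this at once.

    build_fn(tmp_path) is a hypothetical callback that writes the artifact
    (for example, by invoking the compiler) to the path it is given.
    """
    out_path = Path(out_path)
    lock_path = out_path.with_name(out_path.name + ".lock")
    with open(lock_path, "w") as lock_file:
        # Exclusive lock: other ranks block here until the builder is done.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if out_path.exists():
                return out_path  # another rank already produced the artifact
            # The temp file lives in the same directory so that os.replace
            # is an atomic rename on the same filesystem.
            fd, tmp_name = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
            os.close(fd)
            try:
                build_fn(tmp_name)
                os.replace(tmp_name, out_path)
            except BaseException:
                # Never leave a half-written file behind.
                if os.path.exists(tmp_name):
                    os.unlink(tmp_name)
                raise
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return out_path
```

The lock serializes the ranks, and the rename guarantees that a reader only ever sees either no file or a complete one, even if a builder is killed midway.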

Verified Cases

Case 1: 2-rank H128

Config:

world_size = 2
H = 128
I = 128
experts = 2
topk = 1
tokens/rank = 2
repeats = 3
cleanup = 1

Result:

PASS is printed on both ranks
finalize still hangs afterwards, so the command is expected to be terminated by a timeout
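Until the finalize hang is fixed, a wrapper can treat "PASS on every rank, then killed by timeout" as the expected outcome. The sketch below is an assumption about how such a wrapper might look, not the runner shipped in this PR: `run_case` is a hypothetical helper, and it only counts output lines containing `PASS`. Note that killing the launcher does not necessarily kill processes it spawned, so a real mpirun wrapper may need extra cleanup.

```python
import subprocess
import sys


def run_case(cmd, timeout_s, expected_pass_lines):
    """Run cmd and kill it after timeout_s seconds.

    Returns (ok, timed_out): ok is True when at least expected_pass_lines
    lines containing "PASS" were printed before exit or timeout.
    """
    timed_out = False
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
        output = proc.stdout
    except subprocess.TimeoutExpired as exc:
        timed_out = True
        partial = exc.stdout or ""
        # Partial output may arrive as bytes even when text=True was set.
        output = partial.decode() if isinstance(partial, bytes) else partial
    passes = sum(1 for line in output.splitlines() if "PASS" in line)
    return passes >= expected_pass_lines, timed_out


if __name__ == "__main__":
    # Stand-in child process: prints PASS twice, then hangs like finalize.
    child = ("import time; print('rank 0 PASS', flush=True); "
             "print('rank 1 PASS', flush=True); time.sleep(30)")
    print(run_case([sys.executable, "-c", child], 3, 2))
```

Reporting `timed_out` separately keeps the hang visible in logs, so the case does not silently turn green once the hang is expected to be gone.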

Latest local timing from test_logs/local_2rank_h128_tokens2_repeats3.log:

rank 0 steady_avg_us = 4920.624, steady_max_us = 6031.936
rank 1 steady_avg_us = 3580.512, steady_max_us = 3926.080
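For readers unfamiliar with the `steady_*` fields, one plausible way to derive them is to drop the first launch as warm-up and summarize the rest. This is an assumption about the metric's definition, not taken from the harness; `steady_stats` and its `warmup` parameter are hypothetical, and the latencies in the usage note are made-up illustrative values.

```python
def steady_stats(latencies_us, warmup=1):
    """Summarize per-launch latencies after discarding warm-up launches.

    Assumes "steady" means every launch after the first `warmup` ones.
    """
    steady = list(latencies_us)[warmup:]
    if not steady:
        raise ValueError("need more launches than warm-up iterations")
    return {
        "steady_avg_us": sum(steady) / len(steady),
        "steady_min_us": min(steady),
        "steady_max_us": max(steady),
    }
```

With three repeats such as `[9000.0, 4000.0, 6000.0]`, the first launch is discarded and the remaining two give an average of 5000.0 us and a maximum of 6000.0 us.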

Case 2: 8-rank H256 masked tile-split

Config:

world_size = 8
H = 256
I = 128
experts = 16
topk = 8
route_mode = masked
tokens/rank = 1
repeats = 2
cleanup = 1
compute_order = expert_wave_multi_cta_l1_l2_tile_split

Result:

PASS on all 8 ranks
process exits normally
checked = 6
counts = [3, 3]

Latest local timing from test_logs/local_8rank_h256_topk8_masked_tokens1_tile_split_repeats2.log:

steady_min_us across ranks ~= 16207.552
steady_max_us across ranks ~= 16471.872

Current Gap vs UserHopperMegaMoE

This PR does not claim production equivalence with UserHopperMegaMoE yet.

Known gaps:

  • Only small correctness/capability shapes are covered.
  • Production MoE shapes are not supported yet.
  • The current H256 path relies on a tile-split multi-CTA structure.
  • It does not yet implement UserHopper’s full persistent scheduler.
  • It does not yet implement the mature UserHopper TMA pipeline / mbarrier structure.
  • It does not yet implement production-grade expert-wave scheduling and combine path.
  • The 2-rank H128 case still has a finalize hang after PASS.
  • Performance numbers are only for sanity/capability tracking, not for speedup claims.

Next Steps

Planned follow-up work:

  1. Remove the 2-rank finalize hang.

  2. Convert the current manual runner into a more standard gated integration test.

  3. Expand supported shapes beyond the current small H128/H256 cases.

  4. Move closer to UserHopper’s production structure:

    • persistent scheduling
    • expert-wave execution
    • TMA-style pipeline
    • finer synchronization and arrival tracking
    • production combine/reduce path
  5. Add comparable benchmark cases against existing MegaMoE implementations once shapes and semantics are aligned.

@CLAassistant

CLAassistant commented Jun 23, 2026

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ lizhangyu258
❌ Zhang-kg

i3wanna2 changed the title from "[KMCompiler] [TLERaw] Add Triton TLE NVSHMEM MegaMoE correctness case" to "[TLERaw][Mega] Add Triton TLE NVSHMEM MegaMoE correctness case" on Jun 23, 2026