[TLERaw][Mega] Add Triton TLE NVSHMEM MegaMoE correctness case#712
Draft
Zhang-kg wants to merge 8 commits into
Summary
This PR adds a Triton TLE + raw NVSHMEM MegaMoE integration case under:
The goal of this PR is to demonstrate that the current FlagTree/Triton TLE stack can run a small end-to-end MegaMoE-style operator with:
This is a correctness/capability integration case, not yet a production-performance replacement for UserHopperMegaMoE.
Environment
Validated locally on:
GPU: 8x NVIDIA H100 80GB HBM3
CUDA: 12.8
Python: 3.10
NVSHMEM: 3.4.5 (nvidia-nvshmem-cu12==3.4.5)
MPI launcher: Open MPI 4.1.2 (mpirun)
Triton: FlagTree PR682-based Triton with TLE raw NVSHMEM support
The runnable instructions are documented in:
python/test/tle/integration/megamoe/RUNBOOK_CN.md
Implementation Status
Added files include:
Current main kernel path:
_single_kernel_dispatch_receiver_l1_l2_tile_split_multi_cta_tldot_kernel
The implementation currently validates a merged dispatch/receiver/compute/combine path for small controlled shapes. The 8-rank H256 path uses a tile-split multi-CTA compute structure.
A file lock and atomic temporary output path were added around local host helper compilation to avoid concurrent mpirun ranks corrupting the generated .so.
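The locking pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `build_once` and the `build` callback are hypothetical, and it assumes a POSIX host where `fcntl.flock` is available.

```python
import fcntl
import os
import tempfile
from pathlib import Path


def build_once(output: Path, build) -> Path:
    """Create `output` at most once, even if several ranks call this concurrently.

    `build` is a hypothetical callback that writes the artifact to the path it is given.
    """
    output.parent.mkdir(parents=True, exist_ok=True)
    lock_path = output.with_name(output.name + ".lock")
    with open(lock_path, "w") as lock_file:
        # Exclusive advisory lock: other ranks block here until we release it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if output.exists():
                # Another rank already produced the artifact while we waited.
                return output
            # Build into a temporary file in the same directory so the final
            # rename stays on one filesystem and is atomic.
            fd, tmp_name = tempfile.mkstemp(
                dir=output.parent, prefix=output.name + ".", suffix=".tmp"
            )
            os.close(fd)
            try:
                build(Path(tmp_name))
                os.replace(tmp_name, output)
            except BaseException:
                if os.path.exists(tmp_name):
                    os.unlink(tmp_name)
                raise
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return output
```

Because the final step is a rename, a rank that opens the output path sees either no file or a complete one, never a partially written `.so`.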
Verified Cases
Case 1: 2-rank H128
Config:
Result:
PASS printed on both ranks
finalize still hangs; the command is expected to be terminated by the timeout
Latest local timing from test_logs/local_2rank_h128_tokens2_repeats3.log:
rank 0 steady_avg_us = 4920.624, steady_max_us = 6031.936
rank 1 steady_avg_us = 3580.512, steady_max_us = 3926.080
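Since the 2-rank run only ends when the timeout fires, a wrapper has to tell "printed PASS, then hung in finalize" apart from a genuine failure. The sketch below shows one way to do that with the standard library. The helper name `run_with_timeout` is hypothetical, and the real launch command (documented in the runbook) is not reproduced here.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run `cmd`; return (timed_out, combined stdout/stderr text captured so far)."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
        return False, done.stdout
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the direct child before re-raising. The partial
        # output may arrive as bytes even with text=True, so normalize it.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return True, out


def count_pass_lines(output):
    """Count lines containing PASS; the caller compares this to the rank count."""
    return sum(1 for line in output.splitlines() if "PASS" in line)
```

Note that `subprocess.run` only kills the process it started directly. With a launcher such as mpirun, the rank processes may need separate cleanup, for example by starting the launcher in its own process group.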
Case 2: 8-rank H256 masked tile-split
Config:
Result:
PASS on all 8 ranks
process exits normally
checked = 6
counts = [3, 3]
Latest local timing from test_logs/local_8rank_h256_topk8_masked_tokens1_tile_split_repeats2.log:
steady_min_us across ranks ~= 16207.552
steady_max_us across ranks ~= 16471.872
Current Gap vs UserHopperMegaMoE
This PR does not claim production equivalence with UserHopperMegaMoE yet.
Known gaps:
Next Steps
Planned follow-up work:
Remove the 2-rank finalize hang.
Convert the current manual runner into a more standard gated integration test.
Expand supported shapes beyond the current small H128/H256 cases.
Move closer to the production structure of UserHopperMegaMoE:
Add comparable benchmark cases against existing MegaMoE implementations once shapes and semantics are aligned.
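As a rough illustration of the "gated integration test" item above, the sketch below skips the test unless the environment opts in. Everything in it is an assumption for illustration: the environment variable `TLE_MEGAMOE_INTEGRATION`, the helper `skip_reason`, and the test class are hypothetical, and the actual launch step is deliberately left out.

```python
import os
import shutil
import unittest


def skip_reason(env, which=shutil.which):
    """Return a human-readable reason to skip, or None if the test may run."""
    # Hypothetical opt-in switch; the real gating variable is not defined in this PR.
    if env.get("TLE_MEGAMOE_INTEGRATION") != "1":
        return "set TLE_MEGAMOE_INTEGRATION=1 to enable"
    if which("mpirun") is None:
        return "mpirun not found on PATH"
    return None


class MegaMoEIntegrationTest(unittest.TestCase):
    def setUp(self):
        reason = skip_reason(os.environ)
        if reason:
            self.skipTest(reason)

    def test_two_rank_h128(self):
        # The real launch command lives in the runbook and is not reproduced here.
        self.skipTest("launch step intentionally omitted from this sketch")
```

Keeping the gate in a plain function makes the skip conditions checkable without GPUs or an MPI installation.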