feat: Fp8 moe rollout #1446
base: main
Conversation
📝 Walkthrough

This pull request extends FP8 quantization support to MoE (Mixture of Experts) models within the generation pipeline. Changes include a new GRPO configuration file enabling FP8 inference with Megatron parallelism, enhancements to FP8 weight processing for FusedMoE modules, modified module traversal logic, and an additional training metric display.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Load as Model Loading
    participant Check as FP8 Detection
    participant Route as Module Routing
    participant Process as Weight Processing
    Load->>Check: Load model with FP8 enabled
    Check->>Route: Identify FusedMoE modules
    alt FusedMoE Module Found
        Route->>Process: Route to process_weights_after_loading_moe
        Process->>Process: Extract w13_weight, w2_weight
        Process->>Process: Pad tensors to block_size alignment
        Process->>Process: Cast to FP8 (blockwise)
        Process->>Process: Unpad results
    else Linear Module
        Route->>Process: Route to standard Linear processing
        Process->>Process: Cast weights to FP8
    end
    Process-->>Load: Ready for inference
```
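To make the FusedMoE branch concrete, here is a minimal sketch of the pad, blockwise cast, and unpad flow from the diagram, assuming square 128x128 quantization blocks and the FP8 E4M3 format. The function name and signature are invented for illustration; the real logic lives in nemo_rl/models/generation/fp8.py and may differ.

```python
# Minimal sketch of the pad -> blockwise FP8 cast -> unpad step.
import torch
import torch.nn.functional as F

def blockwise_fp8_cast(weight: torch.Tensor, block_size: int = 128):
    """Quantize a 2D weight to FP8 with one scale per block, padding as needed."""
    rows, cols = weight.shape
    pad_r, pad_c = (-rows) % block_size, (-cols) % block_size
    # Pad so both dims are divisible by block_size (required for block quant).
    padded = F.pad(weight, (0, pad_c, 0, pad_r))
    r_blk, c_blk = padded.shape[0] // block_size, padded.shape[1] // block_size
    tiles = padded.reshape(r_blk, block_size, c_blk, block_size)
    # One scale per tile, chosen so the tile's absmax maps to the E4M3 max.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scales = amax / torch.finfo(torch.float8_e4m3fn).max
    q = (tiles / scales).to(torch.float8_e4m3fn)
    # Unpad back to the original shape.
    q = q.reshape(padded.shape)[:rows, :cols]
    return q, scales.squeeze(3).squeeze(1)

w_fp8, w_scales = blockwise_fp8_cast(torch.randn(3000, 4100))
```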
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning) | ✅ Passed checks (2 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml (1 hunk)
- nemo_rl/algorithms/grpo.py (1 hunk)
- nemo_rl/models/generation/fp8.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
nemo_rl/algorithms/grpo.py
nemo_rl/models/generation/fp8.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
nemo_rl/algorithms/grpo.py
nemo_rl/models/generation/fp8.py
examples/configs/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml
Files:
examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml
🧠 Learnings (3)
📓 Common learnings
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.
Applied to files:
examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a (Repo: NVIDIA-NeMo/RL, PR: 1440, File: examples/configs/sft_automodel.yaml:48-58); same DTensor MoE learning as quoted above.
Applied to files:
nemo_rl/models/generation/fp8.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Lint check
- GitHub Check: Post submodule check comment / Comment on PR
- GitHub Check: Post automodel integration comment / Comment on PR
```python
print(f" • Generation KL Error: {metrics['gen_kl_error']:.4f}")
if master_config["grpo"]["use_dynamic_sampling"]:
```
Guard the new Generation KL Error print
metrics['gen_kl_error'] is not guaranteed to exist—metrics is seeded locally without that key and only receives whatever train_results["all_mb_metrics"] happens to provide. On steps where the backend does not emit gen_kl_error (e.g., existing configs that pre-date this PR), this line will raise a KeyError and abort training. Please gate the print behind a membership check or default.
```diff
- print(f" • Generation KL Error: {metrics['gen_kl_error']:.4f}")
+ if "gen_kl_error" in metrics:
+     print(f" • Generation KL Error: {metrics['gen_kl_error']:.4f}")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-print(f" • Generation KL Error: {metrics['gen_kl_error']:.4f}")
+if "gen_kl_error" in metrics:
+    print(f" • Generation KL Error: {metrics['gen_kl_error']:.4f}")
 if master_config["grpo"]["use_dynamic_sampling"]:
```
🤖 Prompt for AI Agents
In nemo_rl/algorithms/grpo.py around lines 1279-1280, the print of
metrics['gen_kl_error'] assumes that key exists and will raise KeyError when
missing; change the code to first check for 'gen_kl_error' in metrics (or use
metrics.get('gen_kl_error')) and only print the formatted Generation KL Error
when the value is present (or not None), otherwise skip the print (or print a
default/placeholder); ensure this guard is applied before the surrounding
conditional on master_config["grpo"]["use_dynamic_sampling"] so missing keys no
longer abort training.
@terrykong I added a unit test that uses 8 GPUs because I want to test a fine-grained MoE model, and the smallest one I know is Moonlight-16B, which needs 8 GPUs. Please check my changes to the
Any ideas why I can't create a placement group when I require 8 GPUs?
@guyueh1 did you observe that on our GitHub CI? If so, we only have 2 GPUs available for unit tests.
@terrykong yes, the L0 GitHub CI failed. I couldn't find an MoE model for testing that uses just 2 GPUs. Should I remove the unit test for now, or should I add an end-to-end test to L2?
Out of curiosity, do any of these work? https://huggingface.co/hf-internal-testing/models
@terrykong I took a quick look at hf-internal-testing but haven't found a suitable model; they are either too large, or their hidden size is not divisible by 128, which makes them unusable for FP8 block-quant testing.
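For context on that constraint, here is a hedged sketch of how one might screen candidate test models; it assumes transformers is available, that the relevant block size is 128, and the helper and commented-out model id are hypothetical:

```python
# Hypothetical screen for FP8 block-quant compatibility: with 128-wide
# quantization blocks, the model's hidden size must be divisible by 128.
from transformers import AutoConfig

def fp8_block_compatible(model_name: str, block_size: int = 128) -> bool:
    cfg = AutoConfig.from_pretrained(model_name)
    hidden = getattr(cfg, "hidden_size", None)
    return hidden is not None and hidden % block_size == 0

# Example (model id is illustrative, not verified):
# print(fp8_block_compatible("hf-internal-testing/tiny-random-MixtralForCausalLM"))
```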
What does this PR do?
Support FP8 precision in the generation phase for MoE models.
Issues
List issues that this PR closes (syntax):
Usage
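A plausible invocation with the config added in this PR, assuming the repo's existing GRPO math entry point (the script name and flag are assumptions, not verified against this branch): `uv run python examples/run_grpo_math.py --config examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml`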
Before your PR is "Ready for review"
Pre checks:
Additional Information