
Conversation

@guyueh1 (Contributor) commented Oct 29, 2025

What does this PR do?

Support FP8 precision in the generation phase for MoE models.

Issues

List issues that this PR closes:

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
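A hedged sketch of one possible invocation (the launcher script name and the uv entry point are assumptions; only the YAML path comes from this PR's changed files):

import subprocess

# Hypothetical launch of GRPO training with the FP8 MoE config added here.
# Assumption: the repo's math-GRPO example is started via run_grpo_math.py;
# the config path below is the file added in this PR.
subprocess.run(
    [
        "uv", "run", "python", "examples/run_grpo_math.py",
        "--config", "examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml",
    ],
    check=True,
)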

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added new GRPO configuration with FP8 quantization support for optimized performance.
    • Generation KL Error metric is now displayed in the training results.
    • Extended FP8 quantization support to MoE (Mixture of Experts) models.

@guyueh1 guyueh1 requested review from a team as code owners October 29, 2025 18:03
coderabbitai bot commented Nov 2, 2025

📝 Walkthrough

This pull request extends FP8 quantization support to MoE (Mixture of Experts) models within the generation pipeline. Changes include a new GRPO configuration file enabling FP8 inference with Megatron parallelism, enhancements to FP8 weight processing for FusedMoE modules, modified module traversal logic, and an additional training metric display.

Changes

  • Configuration for FP8-enabled GRPO (examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml): New configuration file setting up GRPO training with FP8 quantization, Megatron parallelism (tensor parallel size 1, pipeline parallel size 2), and vLLM inference with FP8 precision and DeepGEMM support.
  • Training metrics enhancement (nemo_rl/algorithms/grpo.py): Added a generation KL error metric printout to the training results display alongside existing loss metrics.
  • FP8 support for MoE modules (nemo_rl/models/generation/fp8.py): Extended the FP8 quantization path to handle FusedMoE weight processing: added a FusedMoE module import and type checks, introduced a process_weights_after_loading_moe handler for MoE-specific weight conversion, updated _get_module_from_param_name and _is_fp8_weight to recognize FusedMoE instances, relaxed the MoE validation assertion, enhanced tensor padding in cast_tensor_to_fp8_blockwise, and patched Fp8MoEMethod.process_weights_after_loading.
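As a rough illustration of the patching described above, the sketch below monkey-patches vLLM's Fp8MoEMethod.process_weights_after_loading; the import path matches current vLLM but may drift across versions, and the handler body is a placeholder, not this PR's implementation:

import torch
from vllm.model_executor.layers.quantization.fp8 import Fp8MoEMethod

def process_weights_after_loading_moe(layer: torch.nn.Module) -> None:
    # Placeholder for the MoE-specific conversion named in the walkthrough:
    # pad w13_weight/w2_weight to the block size, cast blockwise to FP8,
    # then unpad (see the sequence diagram below).
    ...

def apply_fp8_moe_patch() -> None:
    # Route FusedMoE weight finalization through the custom handler.
    Fp8MoEMethod.process_weights_after_loading = (
        lambda self, layer: process_weights_after_loading_moe(layer)
    )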

Sequence Diagram(s)

sequenceDiagram
    participant Load as Model Loading
    participant Check as FP8 Detection
    participant Route as Module Routing
    participant Process as Weight Processing
    
    Load->>Check: Load model with FP8 enabled
    Check->>Route: Identify FusedMoE modules
    
    alt FusedMoE Module Found
        Route->>Process: Route to process_weights_after_loading_moe
        Process->>Process: Extract w13_weight, w2_weight
        Process->>Process: Pad tensors to block_size alignment
        Process->>Process: Cast to FP8 (blockwise)
        Process->>Process: Unpad results
    else Linear Module
        Route->>Process: Route to standard Linear processing
        Process->>Process: Cast weights to FP8
    end
    
    Process-->>Load: Ready for inference
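To make the pad, blockwise cast, and unpad steps in the diagram concrete, here is a minimal standalone sketch assuming 128x128 scaling tiles and torch.float8_e4m3fn; it is illustrative only and not the PR's cast_tensor_to_fp8_blockwise:

import torch

def cast_to_fp8_blockwise(w: torch.Tensor, block_size: int = 128):
    # Pad both dims up to a multiple of block_size so every tile is full.
    rows, cols = w.shape
    pad_r = (-rows) % block_size
    pad_c = (-cols) % block_size
    w_pad = torch.nn.functional.pad(w, (0, pad_c, 0, pad_r))
    pr, pc = w_pad.shape
    # One scale per (block_size x block_size) tile, from the tile's abs max.
    tiles = w_pad.view(pr // block_size, block_size, pc // block_size, block_size)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / torch.finfo(torch.float8_e4m3fn).max
    q = (tiles / scale).to(torch.float8_e4m3fn)
    # Unpad back to the original shape; scales stay at tile granularity.
    return q.view(pr, pc)[:rows, :cols], scale.squeeze(1).squeeze(-1)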

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • FusedMoE weight processing logic: New process_weights_after_loading_moe function introduces MoE-specific tensor handling with padding/unpadding, requiring careful validation of tensor shape manipulations.
  • Module traversal changes: Early-return logic in _get_module_from_param_name and extended weight classification in _is_fp8_weight affect type detection across the model tree; edge cases with nested MoE structures need validation (see the traversal sketch after this list).
  • Patching behavior refinement: Changes to apply_fp8_patches affect both Fp8LinearMethod and Fp8MoEMethod and require verifying that patch ordering and execution flow behave correctly.
  • Assertion removal: Removal of the MoE num_experts zero-check could affect backward compatibility; verify intended scope.
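A sketch of the early-return traversal flagged in the module-traversal bullet above (hedged: the FusedMoE import path is vLLM-version-dependent, and the helper's exact signature is a guess):

from vllm.model_executor.layers.fused_moe import FusedMoE

def _get_module_from_param_name(model, param_name: str):
    # Walk the dotted parameter name and return early once a FusedMoE module
    # is reached, so expert weights such as w13_weight and w2_weight are
    # attributed to the MoE module rather than a deeper leaf.
    module = model
    for part in param_name.split(".")[:-1]:
        module = getattr(module, part)
        if isinstance(module, FusedMoE):
            return module
    return module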

Suggested labels

CI:L1

Suggested reviewers

  • parthchadha
  • terrykong

Pre-merge checks

❌ Failed checks (1 warning)
Test Results For Major Changes: ⚠️ Warning

Explanation: This PR introduces a major new feature adding FP8 precision support for Mixture-of-Experts (MoE) models during the generation phase. The changes include significant modifications to nemo_rl/models/generation/fp8.py (marked as "High" review effort), new configuration files, and weight-processing logic. Since FP8 is a different precision format, these changes could affect numerical behavior and convergence. However, according to the PR objectives, the PR description explicitly states that "A placeholder for a usage example and a Python snippet is included but not populated" and, critically, "none of these checklist items [adding tests, running unit/functional tests, and updating documentation] are marked complete." The PR lacks any documented test results, convergence/regression demonstrations, or performance comparisons required for changes of this magnitude.

Resolution: To pass this check, the PR description must be updated to include: (1) test results demonstrating the feature works correctly with FP8 MoE models, (2) evidence that there is no numerical regression or convergence degradation compared to non-FP8 baselines, and (3) performance metrics (before-and-after numbers) showing the impact of FP8 quantization on generation speed and model quality. Additionally, the contributor checklist items should be marked complete, example code should be populated, and documentation should be updated.
✅ Passed checks (2 passed)
Description Check: ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.

Title Check: ✅ Passed. The title "feat: Fp8 moe rollout" is directly related to the primary objective and main changes in this pull request. The PR's core contribution is adding FP8 (floating point 8) precision support for Mixture-of-Experts models during generation, which is precisely what the title communicates. The largest and most complex changes are in nemo_rl/models/generation/fp8.py, where comprehensive FP8 support for FusedMoE modules is implemented, including weight processing, module traversal, and patching updates. The title is concise, specific, and would allow a teammate reviewing commit history to quickly understand that this changeset introduces FP8 support for MoE models. While secondary changes include a new configuration file and a metric printout, the title appropriately focuses on the main technical contribution.


coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd2e645 and 13439ad.

📒 Files selected for processing (3)
  • examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml (1 hunks)
  • nemo_rl/algorithms/grpo.py (1 hunks)
  • nemo_rl/models/generation/fp8.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/models/generation/fp8.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/models/generation/fp8.py
examples/configs/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml

Files:

  • examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml
🧠 Learnings (3)
📓 Common learnings
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.

Applied to files:

  • examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • nemo_rl/models/generation/fp8.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

Comment on lines +1279 to 1280
print(f" • Generation KL Error: {metrics['gen_kl_error']:.4f}")
if master_config["grpo"]["use_dynamic_sampling"]:

⚠️ Potential issue | 🔴 Critical

Guard the new Generation KL Error print
metrics['gen_kl_error'] is not guaranteed to exist—metrics is seeded locally without that key and only receives whatever train_results["all_mb_metrics"] happens to provide. On steps where the backend does not emit gen_kl_error (e.g., existing configs that pre-date this PR), this line will raise a KeyError and abort training. Please gate the print behind a membership check or default.

-            print(f"  • Generation KL Error: {metrics['gen_kl_error']:.4f}")
+            if "gen_kl_error" in metrics:
+                print(f"  • Generation KL Error: {metrics['gen_kl_error']:.4f}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-            print(f"  • Generation KL Error: {metrics['gen_kl_error']:.4f}")
+            if "gen_kl_error" in metrics:
+                print(f"  • Generation KL Error: {metrics['gen_kl_error']:.4f}")
             if master_config["grpo"]["use_dynamic_sampling"]:
🤖 Prompt for AI Agents
In nemo_rl/algorithms/grpo.py around lines 1279-1280, the print of
metrics['gen_kl_error'] assumes that key exists and will raise KeyError when
missing; change the code to first check for 'gen_kl_error' in metrics (or use
metrics.get('gen_kl_error')) and only print the formatted Generation KL Error
when the value is present (or not None), otherwise skip the print (or print a
default/placeholder); ensure this guard is applied before the surrounding
conditional on master_config["grpo"]["use_dynamic_sampling"] so missing keys no
longer abort training.

Signed-off-by: Guyue Huang <[email protected]>
@guyueh1 guyueh1 linked an issue Nov 3, 2025 that may be closed by this pull request
@guyueh1 guyueh1 requested a review from a team as a code owner November 9, 2025 21:25
@guyueh1 guyueh1 self-assigned this Nov 9, 2025
@guyueh1 guyueh1 added the CI:L2 (Run doctests, unit tests, functional tests, and convergence tests) label Nov 9, 2025
Signed-off-by: Guyue Huang <[email protected]>
@guyueh1 guyueh1 requested a review from a team as a code owner November 9, 2025 21:38
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Nov 9, 2025
@guyueh1 guyueh1 removed the documentation (Improvements or additions to documentation) and CI:L2 (Run doctests, unit tests, functional tests, and convergence tests) labels Nov 9, 2025
@guyueh1 guyueh1 added the CI:L2 (Run doctests, unit tests, functional tests, and convergence tests) label Nov 9, 2025
Signed-off-by: Guyue Huang <[email protected]>
@guyueh1 guyueh1 added and removed the CI:L2 (Run doctests, unit tests, functional tests, and convergence tests) label Nov 9, 2025
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Nov 9, 2025
guyueh1 commented Nov 9, 2025

@terrykong I added a unit test that uses 8 GPUs because I want to test a fine-grained MoE model, and the smallest one I know of is Moonlight-16B, which needs 8 GPUs. Please check my changes to tests/unit/models/generation/test_vllm_generation.py.
But I got:

>       raise ResourceInsufficientError(
            f"Maximum number of retries reached ({max_retries}). Cluster resources may be insufficient or cluster itself is highly unstable. Please check your cluster configuration and your cluster logs."
        )
E       nemo_rl.distributed.virtual_cluster.ResourceInsufficientError: Maximum number of retries reached (6). Cluster resources may be insufficient or cluster itself is highly unstable. Please check your cluster configuration and your cluster logs.

Any ideas why the placement group cannot be created when I request 8 GPUs?

@guyueh1 guyueh1 linked an issue Nov 9, 2025 that may be closed by this pull request
@guyueh1 guyueh1 added the CI:L0 (Run doctests and unit tests) label and removed the CI:L2 (Run doctests, unit tests, functional tests, and convergence tests) label Nov 9, 2025
@terrykong (Contributor) commented:
@guyueh1 did you observe that on our GitHub CI? If so, we only have 2 GPUs available for unit tests.

guyueh1 commented Nov 10, 2025

@terrykong yes, the L0 GitHub CI failed. I couldn't find an MoE model for testing that uses just 2 GPUs; should I remove the unit test for now, or should I add an end-to-end test to L2?

@terrykong (Contributor) commented:
Out of curiosity, do any of these work? https://huggingface.co/hf-internal-testing/models

guyueh1 commented Nov 10, 2025

@terrykong I did a quick look at hf-internal-testing but haven't found a suitable model; they are either too large, or their hidden size is not divisible by 128, which rules them out for FP8 block-quant testing.
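The screening constraint mentioned above is easy to check mechanically; a hedged sketch (helper name hypothetical):

from transformers import AutoConfig

def usable_for_fp8_block_quant(model_id: str, block_size: int = 128) -> bool:
    # Blockwise FP8 with 128x128 tiles needs the hidden size to be a
    # multiple of the block size (unless padding is applied, as this PR
    # does in cast_tensor_to_fp8_blockwise).
    cfg = AutoConfig.from_pretrained(model_id)
    return cfg.hidden_size % block_size == 0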


Labels

  • CI:L0: Run doctests and unit tests
  • documentation: Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Improve DeepSeek Generation Speed with FP8 Quantization
  • FP8 generation in vLLM for MoEs

2 participants