
Conversation

@guyueh1 (Contributor) commented Oct 29, 2025

What does this PR do?

Support FP8 precision in the generation phase for MoE models.

Issues

List issues that this PR closes:

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
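A hedged sketch of one possible invocation (the launcher script name and the uv entry point are assumptions; only the YAML path comes from this PR's changed files):

import subprocess

# Hypothetical launch of GRPO training with the FP8 MoE config added here.
# Assumption: the repo's math-GRPO example is started via run_grpo_math.py;
# the config path below is the file added in this PR.
subprocess.run(
    [
        "uv", "run", "python", "examples/run_grpo_math.py",
        "--config", "examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml",
    ],
    check=True,
)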

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features
    • Added new GRPO configuration with FP8 quantization support for optimized performance.
    • Generation KL Error metric is now displayed in the training results.
    • Extended FP8 quantization support to MoE (Mixture of Experts) models.

@guyueh1 guyueh1 requested review from a team as code owners October 29, 2025 18:03
coderabbitai bot commented Nov 2, 2025

📝 Walkthrough

This pull request extends FP8 quantization support to MoE (Mixture of Experts) models within the generation pipeline. Changes include a new GRPO configuration file enabling FP8 inference with Megatron parallelism, enhancements to FP8 weight processing for FusedMoE modules, modified module traversal logic, and an additional training metric display.

Changes

  • Configuration for FP8-enabled GRPO (examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml): New configuration file setting up GRPO training with FP8 quantization, Megatron parallelism (tensor parallel size 1, pipeline parallel size 2), and vLLM inference with FP8 precision and DeepGEMM support.
  • Training metrics enhancement (nemo_rl/algorithms/grpo.py): Added a generation KL error metric printout to the training results display alongside existing loss metrics.
  • FP8 support for MoE modules (nemo_rl/models/generation/fp8.py): Extended the FP8 quantization path to handle FusedMoE weight processing: added a FusedMoE module import and type checks, introduced a process_weights_after_loading_moe handler for MoE-specific weight conversion, updated _get_module_from_param_name and _is_fp8_weight to recognize FusedMoE instances, relaxed the MoE validation assertion, enhanced tensor padding in cast_tensor_to_fp8_blockwise, and patched Fp8MoEMethod.process_weights_after_loading.
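As a rough illustration of the patching described above, the sketch below monkey-patches vLLM's Fp8MoEMethod.process_weights_after_loading; the import path matches current vLLM but may drift across versions, and the handler body is a placeholder, not this PR's implementation:

import torch
from vllm.model_executor.layers.quantization.fp8 import Fp8MoEMethod

def process_weights_after_loading_moe(layer: torch.nn.Module) -> None:
    # Placeholder for the MoE-specific conversion named in the walkthrough:
    # pad w13_weight/w2_weight to the block size, cast blockwise to FP8,
    # then unpad (see the sequence diagram below).
    ...

def apply_fp8_moe_patch() -> None:
    # Route FusedMoE weight finalization through the custom handler.
    Fp8MoEMethod.process_weights_after_loading = (
        lambda self, layer: process_weights_after_loading_moe(layer)
    )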

Sequence Diagram(s)

sequenceDiagram
    participant Load as Model Loading
    participant Check as FP8 Detection
    participant Route as Module Routing
    participant Process as Weight Processing
    
    Load->>Check: Load model with FP8 enabled
    Check->>Route: Identify FusedMoE modules
    
    alt FusedMoE Module Found
        Route->>Process: Route to process_weights_after_loading_moe
        Process->>Process: Extract w13_weight, w2_weight
        Process->>Process: Pad tensors to block_size alignment
        Process->>Process: Cast to FP8 (blockwise)
        Process->>Process: Unpad results
    else Linear Module
        Route->>Process: Route to standard Linear processing
        Process->>Process: Cast weights to FP8
    end
    
    Process-->>Load: Ready for inference
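To make the pad, blockwise cast, and unpad steps in the diagram concrete, here is a minimal standalone sketch assuming 128x128 scaling tiles and torch.float8_e4m3fn; it is illustrative only and not the PR's cast_tensor_to_fp8_blockwise:

import torch

def cast_to_fp8_blockwise(w: torch.Tensor, block_size: int = 128):
    # Pad both dims up to a multiple of block_size so every tile is full.
    rows, cols = w.shape
    pad_r = (-rows) % block_size
    pad_c = (-cols) % block_size
    w_pad = torch.nn.functional.pad(w, (0, pad_c, 0, pad_r))
    pr, pc = w_pad.shape
    # One scale per (block_size x block_size) tile, from the tile's abs max.
    tiles = w_pad.view(pr // block_size, block_size, pc // block_size, block_size)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / torch.finfo(torch.float8_e4m3fn).max
    q = (tiles / scale).to(torch.float8_e4m3fn)
    # Unpad back to the original shape; scales stay at tile granularity.
    return q.view(pr, pc)[:rows, :cols], scale.squeeze(1).squeeze(-1)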

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • FusedMoE weight processing logic: New process_weights_after_loading_moe function introduces MoE-specific tensor handling with padding/unpadding, requiring careful validation of tensor shape manipulations.
  • Module traversal changes: Early-return logic in _get_module_from_param_name and extended weight classification in _is_fp8_weight affect type detection across the model tree; edge cases with nested MoE structures need validation (see the traversal sketch after this list).
  • Patching behavior refinement: Changes to apply_fp8_patches affect both Fp8LinearMethod and Fp8MoEMethod and require verifying that patch ordering and execution flow behave correctly.
  • Assertion removal: Removal of the MoE num_experts zero-check could affect backward compatibility; verify intended scope.
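A sketch of the early-return traversal flagged in the module-traversal bullet above (hedged: the FusedMoE import path is vLLM-version-dependent, and the helper's exact signature is a guess):

from vllm.model_executor.layers.fused_moe import FusedMoE

def _get_module_from_param_name(model, param_name: str):
    # Walk the dotted parameter name and return early once a FusedMoE module
    # is reached, so expert weights such as w13_weight and w2_weight are
    # attributed to the MoE module rather than a deeper leaf.
    module = model
    for part in param_name.split(".")[:-1]:
        module = getattr(module, part)
        if isinstance(module, FusedMoE):
            return module
    return module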

Suggested labels

CI:L1

Suggested reviewers

  • parthchadha
  • terrykong

Pre-merge checks

❌ Failed checks (1 warning)
Test Results For Major Changes: ⚠️ Warning

Explanation: This PR introduces a major new feature adding FP8 precision support for Mixture-of-Experts (MoE) models during the generation phase. The changes include significant modifications to nemo_rl/models/generation/fp8.py (marked as "High" review effort), new configuration files, and weight-processing logic. Since FP8 is a different precision format, these changes could affect numerical behavior and convergence. However, according to the PR objectives, the PR description explicitly states that "A placeholder for a usage example and a Python snippet is included but not populated" and, critically, "none of these checklist items [adding tests, running unit/functional tests, and updating documentation] are marked complete." The PR lacks any documented test results, convergence/regression demonstrations, or performance comparisons required for changes of this magnitude.

Resolution: To pass this check, the PR description must be updated to include: (1) test results demonstrating the feature works correctly with FP8 MoE models, (2) evidence that there is no numerical regression or convergence degradation compared to non-FP8 baselines, and (3) performance metrics (before-and-after numbers) showing the impact of FP8 quantization on generation speed and model quality. Additionally, the contributor checklist items should be marked complete, example code should be populated, and documentation should be updated.
✅ Passed checks (2 passed)
Description Check: ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.

Title Check: ✅ Passed. The title "feat: Fp8 moe rollout" is directly related to the primary objective and main changes in this pull request. The PR's core contribution is adding FP8 (floating point 8) precision support for Mixture-of-Experts models during generation, which is precisely what the title communicates. The largest and most complex changes are in nemo_rl/models/generation/fp8.py, where comprehensive FP8 support for FusedMoE modules is implemented, including weight processing, module traversal, and patching updates. The title is concise, specific, and would allow a teammate reviewing commit history to quickly understand that this changeset introduces FP8 support for MoE models. While secondary changes include a new configuration file and a metric printout, the title appropriately focuses on the main technical contribution.


coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd2e645 and 13439ad.

📒 Files selected for processing (3)
  • examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml (1 hunks)
  • nemo_rl/algorithms/grpo.py (1 hunks)
  • nemo_rl/models/generation/fp8.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/models/generation/fp8.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/grpo.py
  • nemo_rl/models/generation/fp8.py
examples/configs/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/*.yaml: Exemplar configs under examples/configs/*.yaml must include documented defaults
When adding a new config key, reflect its recommended default in exemplar YAMLs under examples/configs/*.yaml

Files:

  • examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml
🧠 Learnings (3)
📓 Common learnings
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.
📚 Learning: 2025-09-18T14:57:31.003Z
Learnt from: zpqiu
Repo: NVIDIA-NeMo/RL PR: 1006
File: nemo_rl/algorithms/distillation.py:312-354
Timestamp: 2025-09-18T14:57:31.003Z
Learning: The distillation algorithm's cluster setup logic is designed to follow the same patterns used in GRPO for handling distributed training clusters and resource allocation.

Applied to files:

  • examples/configs/grpo_math_qwen30ba3b_megatron_fp8.yaml
📚 Learning: 2025-10-30T20:50:44.126Z
Learnt from: adil-a
Repo: NVIDIA-NeMo/RL PR: 1440
File: examples/configs/sft_automodel.yaml:48-58
Timestamp: 2025-10-30T20:50:44.126Z
Learning: In DTensor configurations for MoE (Mixture of Experts) models, expert_parallel_size and data_parallel_size can be applied together without multiplying the GPU requirements. Expert Parallelism (EP) only applies to MoE layers, while Data Parallelism/FSDP applies to non-MoE layers. Therefore, configurations like expert_parallel_size: 8 and data_parallel_size: 8 are valid on an 8-GPU cluster for MoE models.

Applied to files:

  • nemo_rl/models/generation/fp8.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Lint check
  • GitHub Check: Post submodule check comment / Comment on PR
  • GitHub Check: Post automodel integration comment / Comment on PR

Comment on lines +1279 to 1280
print(f" • Generation KL Error: {metrics['gen_kl_error']:.4f}")
if master_config["grpo"]["use_dynamic_sampling"]:

⚠️ Potential issue | 🔴 Critical

Guard the new Generation KL Error print
metrics['gen_kl_error'] is not guaranteed to exist—metrics is seeded locally without that key and only receives whatever train_results["all_mb_metrics"] happens to provide. On steps where the backend does not emit gen_kl_error (e.g., existing configs that pre-date this PR), this line will raise a KeyError and abort training. Please gate the print behind a membership check or default.

-            print(f"  • Generation KL Error: {metrics['gen_kl_error']:.4f}")
+            if "gen_kl_error" in metrics:
+                print(f"  • Generation KL Error: {metrics['gen_kl_error']:.4f}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-            print(f"  • Generation KL Error: {metrics['gen_kl_error']:.4f}")
+            if "gen_kl_error" in metrics:
+                print(f"  • Generation KL Error: {metrics['gen_kl_error']:.4f}")
             if master_config["grpo"]["use_dynamic_sampling"]:
🤖 Prompt for AI Agents
In nemo_rl/algorithms/grpo.py around lines 1279-1280, the print of
metrics['gen_kl_error'] assumes that key exists and will raise KeyError when
missing; change the code to first check for 'gen_kl_error' in metrics (or use
metrics.get('gen_kl_error')) and only print the formatted Generation KL Error
when the value is present (or not None), otherwise skip the print (or print a
default/placeholder); ensure this guard is applied before the surrounding
conditional on master_config["grpo"]["use_dynamic_sampling"] so missing keys no
longer abort training.

Signed-off-by: Guyue Huang <[email protected]>
@guyueh1 guyueh1 linked an issue Nov 3, 2025 that may be closed by this pull request
@guyueh1 guyueh1 requested a review from a team as a code owner November 9, 2025 21:25
@guyueh1 guyueh1 self-assigned this Nov 9, 2025
@guyueh1 guyueh1 added the CI:L2 (Run doctests, unit tests, functional tests, and convergence tests) label Nov 9, 2025
Signed-off-by: Guyue Huang <[email protected]>
@guyueh1 guyueh1 requested a review from a team as a code owner November 9, 2025 21:38
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Nov 9, 2025
@guyueh1 guyueh1 removed the documentation (Improvements or additions to documentation) and CI:L2 (Run doctests, unit tests, functional tests, and convergence tests) labels Nov 9, 2025
@guyueh1 guyueh1 added the CI:L2 (Run doctests, unit tests, functional tests, and convergence tests) label Nov 9, 2025
Signed-off-by: Guyue Huang <[email protected]>
@guyueh1 guyueh1 added and removed the CI:L2 (Run doctests, unit tests, functional tests, and convergence tests) label Nov 9, 2025
@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Nov 9, 2025
guyueh1 commented Nov 9, 2025

@terrykong I added a unit test that uses 8 GPUs because I want to test a fine-grained MoE model, and the smallest one I know of is Moonlight-16B, which needs 8 GPUs. Please check my changes to tests/unit/models/generation/test_vllm_generation.py.
But I got:

>       raise ResourceInsufficientError(
            f"Maximum number of retries reached ({max_retries}). Cluster resources may be insufficient or cluster itself is highly unstable. Please check your cluster configuration and your cluster logs."
        )
E       nemo_rl.distributed.virtual_cluster.ResourceInsufficientError: Maximum number of retries reached (6). Cluster resources may be insufficient or cluster itself is highly unstable. Please check your cluster configuration and your cluster logs.

Any ideas why the placement group cannot be created when I request 8 GPUs?

@guyueh1 guyueh1 linked an issue Nov 9, 2025 that may be closed by this pull request
@guyueh1 guyueh1 added the CI:L0 (Run doctests and unit tests) label and removed the CI:L2 (Run doctests, unit tests, functional tests, and convergence tests) label Nov 9, 2025
@terrykong (Contributor) commented:
@guyueh1 did you observe that on our GitHub CI? If so, we only have 2 GPUs available for unit tests.

guyueh1 commented Nov 10, 2025

@terrykong yes, the L0 GitHub CI failed. I couldn't find an MoE model for testing that uses just 2 GPUs; should I remove the unit test for now, or should I add an end-to-end test to L2?

@terrykong (Contributor) commented:
Out of curiosity, do any of these work? https://huggingface.co/hf-internal-testing/models

guyueh1 commented Nov 10, 2025

@terrykong I did a quick look at hf-internal-testing but haven't found a suitable model; they are either too large, or their hidden size is not divisible by 128, which rules them out for FP8 block-quant testing.
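The screening constraint mentioned above is easy to check mechanically; a hedged sketch (helper name hypothetical):

from transformers import AutoConfig

def usable_for_fp8_block_quant(model_id: str, block_size: int = 128) -> bool:
    # Blockwise FP8 with 128x128 tiles needs the hidden size to be a
    # multiple of the block size (unless padding is applied, as this PR
    # does in cast_tensor_to_fp8_blockwise).
    cfg = AutoConfig.from_pretrained(model_id)
    return cfg.hidden_size % block_size == 0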


Labels

  • CI:L0: Run doctests and unit tests
  • documentation: Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Improve DeepSeek Generation Speed with FP8 Quantization
  • FP8 generation in vLLM for MoEs

2 participants