@ffrujeri commented Nov 6, 2025

What does this PR do?

Adds support for the AceReason-Math dataset with a GRPO training recipe for 7B models with 16K context length.

This PR introduces a new dataset adapter for nvidia/AceReason-Math, a comprehensive GRPO training configuration for DeepSeek-R1-Distill-Qwen-7B, and corresponding test infrastructure for validating the training pipeline.

Issues

List issues that this PR closes:

Usage

Training with the AceReason-Math dataset:

```python
from nemo_rl.data.datasets.response_datasets import AceReasonMathDataset

# Initialize the dataset with train/validation splits
dataset = AceReasonMathDataset(seed=42)

# Access the formatted datasets
train_data = dataset.formatted_ds["train"]
val_data = dataset.formatted_ds["validation"]

# Get the task specification
task_spec = dataset.task_spec
```
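For context, the adapter's per-example formatting can be sketched roughly as follows. This is an illustrative, self-contained approximation of tagging each problem/answer pair with a `task_name` for GRPO reward routing; the field names here are assumptions, not the actual nemo_rl schema.

```python
# Hypothetical sketch of per-example formatting in a GRPO math dataset
# adapter: each raw problem/answer pair is tagged with a task_name so the
# trainer can route it to the matching reward function. Field names are
# illustrative only.

def format_math_example(raw: dict, task_name: str = "math") -> dict:
    """Attach the task name and normalize the prompt/answer fields."""
    return {
        "task_name": task_name,
        "problem": raw["problem"].strip(),
        "expected_answer": str(raw["answer"]).strip(),
    }

example = format_math_example({"problem": " What is 2 + 2? ", "answer": 4})
print(example["task_name"], example["expected_answer"])  # math 4
```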

Running GRPO training with the new recipe:

```sh
uv run examples/run_grpo_math.py \
    --config examples/configs/recipes/llm/grpo-acereason-math-7b-16K.yaml \
    grpo.max_num_steps=1000 \
    logger.wandb_enabled=True
```
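The trailing arguments above are dotted config overrides. A minimal sketch of how such overrides map onto a nested config dict is shown below; this mimics the Hydra/OmegaConf-style syntax and is not the actual NeMo RL parser.

```python
# Illustrative parser for dotted overrides like "grpo.max_num_steps=1000".
# It walks the key path, creating nested dicts as needed, and does a
# best-effort literal conversion for booleans and integers.

def apply_override(config: dict, override: str) -> dict:
    key_path, _, raw_value = override.partition("=")
    keys = key_path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    value: object = raw_value
    if raw_value in ("True", "False"):
        value = raw_value == "True"
    elif raw_value.isdigit():
        value = int(raw_value)
    node[keys[-1]] = value
    return config

cfg = {}
apply_override(cfg, "grpo.max_num_steps=1000")
apply_override(cfg, "logger.wandb_enabled=True")
print(cfg)  # {'grpo': {'max_num_steps': 1000}, 'logger': {'wandb_enabled': True}}
```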

Changes in this PR

  • New Dataset Adapter: nemo_rl/data/datasets/response_datasets/acereason_math.py

    • Implements AceReasonMathDataset class with train/validation splits
    • Uses nvidia/AceReason-Math for training and HuggingFaceH4/aime_2024 for validation
    • Formats data with proper task_name fields for GRPO compatibility
  • GRPO Recipe: examples/configs/recipes/llm/grpo-acereason-math-7b-16K.yaml

    • Configured for DeepSeek-R1-Distill-Qwen-7B with 16K context length
    • Uses context parallelism (CP=2) and tensor parallelism (TP=2) for efficient training
    • Dynamic batching with 32K logprob tokens and 16K training tokens
    • Optimized for 8 GPUs per node
  • Prompt Template: examples/prompts/acemath_qwen_cot.txt

    • Chain-of-thought prompt format for math problem solving
    • Instructs model to wrap final answer in \boxed{}
  • Test Suite: tests/test_suites/llm/grpo-acereason-math-7b-16K.sh

    • Automated testing pipeline with 1000 training steps
    • Includes checkpoint conversion to HuggingFace format
    • Validates training metrics and evaluation performance (baseline threshold: 0.30 score)

Before your PR is "Ready for review"

Pre-checks:

  • Make sure you read and followed Contributor guidelines

  • Did you write any new necessary tests?

  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests

  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • The AceReason-Math dataset is designed for training mathematical reasoning capabilities using GRPO
  • The recipe is optimized for 7B parameter models with extended 16K context length to handle complex reasoning chains
  • Validation uses AIME 2024 dataset following the pattern established by other math datasets in the repository
  • The test suite includes automated evaluation with a baseline score threshold that can be adjusted based on actual performance results
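The baseline gate mentioned above can be sketched as a simple threshold comparison. The metric name and reading logic here are hypothetical; only the 0.30 threshold comes from this PR.

```python
# Hypothetical sketch of the pass/fail gate the test suite applies:
# compare the measured evaluation score against the 0.30 baseline.

BASELINE_THRESHOLD = 0.30

def check_eval(score: float, threshold: float = BASELINE_THRESHOLD) -> bool:
    """Return True when the run meets or beats the baseline."""
    return score >= threshold

print(check_eval(0.35))  # True
print(check_eval(0.25))  # False
```

Keeping the threshold in one named constant makes it easy to adjust once real performance numbers are in, as the note above suggests.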

Signed-off-by: Felipe Vieira Frujeri <[email protected]>