@ffrujeri commented Nov 6, 2025

What does this PR do?

Adds support for the AceReason-Math dataset with a GRPO training recipe for 7B models with 16K context length.

This PR introduces a new dataset adapter for nvidia/AceReason-Math, a comprehensive GRPO training configuration for DeepSeek-R1-Distill-Qwen-7B, and corresponding test infrastructure for validating the training pipeline.

Issues

List issues that this PR closes:

Usage

Training with the AceReason-Math dataset:

```python
from nemo_rl.data.datasets.response_datasets import AceReasonMathDataset

# Initialize the dataset with train/validation splits
dataset = AceReasonMathDataset(seed=42)

# Access the formatted datasets
train_data = dataset.formatted_ds["train"]
val_data = dataset.formatted_ds["validation"]

# Get the task specification
task_spec = dataset.task_spec
```
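For context, the adapter's per-example formatting can be sketched roughly as follows. This is an illustrative, self-contained approximation of tagging each problem/answer pair with a `task_name` for GRPO reward routing; the field names here are assumptions, not the actual nemo_rl schema.

```python
# Hypothetical sketch of per-example formatting in a GRPO math dataset
# adapter: each raw problem/answer pair is tagged with a task_name so the
# trainer can route it to the matching reward function. Field names are
# illustrative only.

def format_math_example(raw: dict, task_name: str = "math") -> dict:
    """Attach the task name and normalize the prompt/answer fields."""
    return {
        "task_name": task_name,
        "problem": raw["problem"].strip(),
        "expected_answer": str(raw["answer"]).strip(),
    }

example = format_math_example({"problem": " What is 2 + 2? ", "answer": 4})
print(example["task_name"], example["expected_answer"])  # math 4
```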

Running GRPO training with the new recipe:

```sh
uv run examples/run_grpo_math.py \
    --config examples/configs/recipes/llm/grpo-acereason-math-7b-16K.yaml \
    grpo.max_num_steps=1000 \
    logger.wandb_enabled=True
```
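The trailing arguments above are dotted config overrides. A minimal sketch of how such overrides map onto a nested config dict is shown below; this mimics the Hydra/OmegaConf-style syntax and is not the actual NeMo RL parser.

```python
# Illustrative parser for dotted overrides like "grpo.max_num_steps=1000".
# It walks the key path, creating nested dicts as needed, and does a
# best-effort literal conversion for booleans and integers.

def apply_override(config: dict, override: str) -> dict:
    key_path, _, raw_value = override.partition("=")
    keys = key_path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    value: object = raw_value
    if raw_value in ("True", "False"):
        value = raw_value == "True"
    elif raw_value.isdigit():
        value = int(raw_value)
    node[keys[-1]] = value
    return config

cfg = {}
apply_override(cfg, "grpo.max_num_steps=1000")
apply_override(cfg, "logger.wandb_enabled=True")
print(cfg)  # {'grpo': {'max_num_steps': 1000}, 'logger': {'wandb_enabled': True}}
```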

Changes in this PR

  • New Dataset Adapter: nemo_rl/data/datasets/response_datasets/acereason_math.py

    • Implements AceReasonMathDataset class with train/validation splits
    • Uses nvidia/AceReason-Math for training and HuggingFaceH4/aime_2024 for validation
    • Formats data with proper task_name fields for GRPO compatibility
  • GRPO Recipe: examples/configs/recipes/llm/grpo-acereason-math-7b-16K.yaml

    • Configured for DeepSeek-R1-Distill-Qwen-7B with 16K context length
    • Uses context parallelism (CP=2) and tensor parallelism (TP=2) for efficient training
    • Dynamic batching with 32K logprob tokens and 16K training tokens
    • Optimized for 8 GPUs per node
  • Prompt Template: examples/prompts/acemath_qwen_cot.txt

    • Chain-of-thought prompt format for math problem solving
    • Instructs model to wrap final answer in \boxed{}
  • Test Suite: tests/test_suites/llm/grpo-acereason-math-7b-16K.sh

    • Automated testing pipeline with 1000 training steps
    • Includes checkpoint conversion to HuggingFace format
    • Validates training metrics and evaluation performance (baseline threshold: 0.30 score)

Before your PR is "Ready for review"

Pre-checks:

  • Make sure you read and followed Contributor guidelines

  • Did you write any new necessary tests?

  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests

  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • The AceReason-Math dataset is designed for training mathematical reasoning capabilities using GRPO
  • The recipe is optimized for 7B parameter models with extended 16K context length to handle complex reasoning chains
  • Validation uses AIME 2024 dataset following the pattern established by other math datasets in the repository
  • The test suite includes automated evaluation with a baseline score threshold that can be adjusted based on actual performance results
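The baseline gate mentioned above can be sketched as a simple threshold comparison. The metric name and reading logic here are hypothetical; only the 0.30 threshold comes from this PR.

```python
# Hypothetical sketch of the pass/fail gate the test suite applies:
# compare the measured evaluation score against the 0.30 baseline.

BASELINE_THRESHOLD = 0.30

def check_eval(score: float, threshold: float = BASELINE_THRESHOLD) -> bool:
    """Return True when the run meets or beats the baseline."""
    return score >= threshold

print(check_eval(0.35))  # True
print(check_eval(0.25))  # False
```

Keeping the threshold in one named constant makes it easy to adjust once real performance numbers are in, as the note above suggests.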

Signed-off-by: Felipe Vieira Frujeri <[email protected]>