RePro (Rectifying Process-level Reward) is a novel post-training framework that aligns Chain-of-Thought (CoT) reasoning with gradient descent optimization principles.
While long-CoT prompting facilitates thorough exploration, it frequently results in suboptimal behaviors such as overthinking, hallucination, and inefficient reasoning paths. RePro mitigates these issues by:
- 📉 Optimization Lens: Framing each reasoning step as a gradient update trajectory toward the optimal solution.
- ⚖️ Dual Scoring Mechanism: Introducing a surrogate objective function to quantify both the intensity and stability of the reasoning process.
- 🎯 Process-Level Reward: Integrating these metrics into Reinforcement Learning with Verifiable Rewards (RLVR) pipelines to guide model alignment.
Empirical evaluations across mathematics, science, and coding benchmarks demonstrate that RePro consistently enhances reasoning accuracy while significantly reducing redundancy.
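The exact surrogate objective is defined in the paper and is not reproduced here. Purely as a hypothetical illustration of the idea, the sketch below folds per-step "intensity" and "stability" scores into a scalar process-level reward that is added to a verifiable outcome reward; the function names, the embedding-based scoring, and the linear weighting are assumptions for illustration, not RePro's actual formulation.

```python
# Hypothetical illustration only: the scoring functions and weights below are
# placeholders, not the surrogate objective defined in the RePro paper.
import numpy as np

def process_level_reward(step_states, outcome_reward, alpha=0.5, beta=0.5):
    """Combine a verifiable outcome reward with a process-level score
    computed from intermediate reasoning states (e.g., step embeddings)."""
    states = np.asarray(step_states, dtype=float)
    # View consecutive reasoning states as points along an optimization path.
    deltas = np.diff(states, axis=0)
    step_sizes = np.linalg.norm(deltas, axis=1)

    # "Intensity": average progress made per reasoning step.
    intensity = step_sizes.mean()
    # "Stability": penalize erratic step sizes (oscillation / overthinking).
    stability = 1.0 / (1.0 + step_sizes.std())

    process_score = alpha * intensity + beta * stability
    return outcome_reward + process_score

# Example: 5 reasoning steps represented as 4-dimensional state vectors.
print(process_level_reward(np.random.randn(5, 4), outcome_reward=1.0))
```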
- Python: 3.10+
- CUDA: 11.8+
- Key Libraries: PyTorch, vLLM, VeRL
- Create a conda environment:

  ```bash
  conda create -n repro python=3.10
  conda activate repro
  ```

- Install the package in editable mode:

  ```bash
  pip install -e .
  ```
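After installation, a quick optional sanity check can confirm that PyTorch sees your GPUs and that vLLM imports cleanly. The snippet below is illustrative and not part of the repository:

```python
# Optional environment check (illustrative; not part of the repository).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())

import vllm
print("vLLM:", vllm.__version__)
```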
We provide a demonstration script for launching both single-node and multi-node training via GRPO.
Navigate to the project root and run the training script below. Make sure the paths in scripts/run_multinodes_repro_grpo.sh are configured correctly for your setup.
```bash
bash scripts/run_multinodes_repro_grpo.sh \
  <MODEL_PATH> \
  <NUM_NODES> \
  <GPUS_PER_NODE> \
  <TP_SIZE> \
  <VLLM_GPU_UTIL> \
  <RUN_NAME>
```

| Argument | Description |
|---|---|
| MODEL_PATH | HuggingFace model ID or local path (e.g., deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) |
| NUM_NODES | Total number of nodes (machines) used for training |
| GPUS_PER_NODE | Number of GPUs available per node |
| TP_SIZE | Tensor parallelism size |
| VLLM_GPU_UTIL | vLLM GPU memory utilization ratio (e.g., 0.7) |
| RUN_NAME | Unique identifier for experiment logging |
To train on a single machine with 8 GPUs:
```bash
bash scripts/run_multinodes_repro_grpo.sh \
  deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  1 \
  8 \
  1 \
  0.7 \
  repro-deepscale-r-exp
```

For multi-node setups, you must configure the distributed environment variables (NODE_RANK and MASTER_ADDR) on each node before execution.
Step 1: Export Variables
```bash
# On the master node (Rank 0):
export NODE_RANK=0
export MASTER_ADDR=xxx.xxx.xxx.xxx  # Replace with actual Master IP

# On worker nodes (Rank 1, 2, ...):
export NODE_RANK=1                  # Change based on node index
export MASTER_ADDR=xxx.xxx.xxx.xxx  # Same Master IP as above
```
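Optionally, before launching you can verify on each node that the variables are set and that the master address resolves. The small check below is an illustrative helper, not part of the repository:

```python
# Optional pre-flight check (illustrative; not part of the repository).
import os
import socket

node_rank = os.environ.get("NODE_RANK")
master_addr = os.environ.get("MASTER_ADDR")
assert node_rank is not None, "NODE_RANK is not set"
assert master_addr is not None, "MASTER_ADDR is not set"

# Resolve the master address to catch typos before training starts.
socket.gethostbyname(master_addr)
print(f"Node rank {node_rank} sees master at {master_addr}")
```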
Step 2: Launch Script (Run on ALL nodes)

```bash
bash scripts/run_multinodes_repro_grpo.sh \
  deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
  2 \
  8 \
  1 \
  0.7 \
  repro-multinode-exp
```

If you find this work or code useful in your research, please consider citing:
```bibtex
@article{author2025repro,
  title={Rectifying LLM Thought From Lens of Optimization},
  author={Author One and Author Two and Author Three},
  journal={arXiv preprint arXiv:2507.06920},
  year={2025}
}
```