
RePro: Rectifying LLM Thought From Lens of Optimization


📋 Introduction

RePro (Rectifying Process-level Reward) is a novel post-training framework that aligns Chain-of-Thought (CoT) reasoning with gradient descent optimization principles.

While long-CoT prompting facilitates thorough exploration, it frequently results in suboptimal behaviors such as overthinking, hallucination, and inefficient reasoning paths. RePro mitigates these issues by:

  • 📉 Optimization Lens: Framing the reasoning trace as an optimization trajectory in which each step acts as a gradient update toward the optimal solution.
  • ⚖️ Dual Scoring Mechanism: Introducing a surrogate objective function to quantify both the intensity and stability of the reasoning process.
  • 🎯 Process-Level Reward: Integrating these metrics into Reinforcement Learning with Verifiable Rewards (RLVR) pipelines to guide model alignment.

Empirical evaluations across mathematics, science, and coding benchmarks demonstrate that RePro consistently enhances reasoning accuracy while significantly reducing redundancy.

📦 Dependencies

  • Python: 3.10+
  • CUDA: 11.8+
  • Key Libraries: PyTorch, vLLM, VeRL
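
You can sanity-check the environment before installing. The snippet below is a minimal sketch, assuming the CUDA toolkit (nvcc) is on your PATH and that PyTorch and vLLM are already installed; adapt it to your setup.

python --version   # expect 3.10 or newer
nvcc --version     # expect CUDA 11.8 or newer
python -c "import torch, vllm; print(torch.__version__, vllm.__version__)"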

⚙️ Installation

  1. Create a conda environment:

    conda create -n repro python=3.10
    conda activate repro
  2. Install the package in editable mode (run from the repository root):

    pip install -e .
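
If you have not cloned the repository yet, the full sequence looks roughly like this (a sketch assuming the repository is hosted at the open-compass/RePro GitHub path; adjust the URL if you work from a fork or mirror):

git clone https://github.com/open-compass/RePro.git
cd RePro
conda create -n repro python=3.10
conda activate repro
pip install -e .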

🚀 Quick Start

🧠 GRPO Training

We provide a demo script for launching both single-node and multi-node GRPO training.

Usage Syntax

Navigate to the project root and run the training script. Make sure the paths inside scripts/run_multinodes_repro_grpo.sh are configured correctly for your environment.

bash scripts/run_multinodes_repro_grpo.sh \
    <MODEL_PATH> \
    <NUM_NODES> \
    <GPUS_PER_NODE> \
    <TP_SIZE> \
    <VLLM_GPU_UTIL> \
    <RUN_NAME>

Arguments

  • MODEL_PATH: HuggingFace model ID or local path (e.g., deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
  • NUM_NODES: Total number of nodes (machines) for training
  • GPUS_PER_NODE: Number of GPUs available per node
  • TP_SIZE: Tensor parallelism size
  • VLLM_GPU_UTIL: vLLM GPU memory utilization ratio (e.g., 0.7)
  • RUN_NAME: Unique identifier for experiment logging

1. Single-Node Example (DeepSeek-R1-Distill-Qwen-1.5B)

To train on a single machine with 8 GPUs:

bash scripts/run_multinodes_repro_grpo.sh \
    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    1 \
    8 \
    1 \
    0.7 \
    repro-deepscale-r-exp

2. Multi-Node Training Example

For multi-node setups, you must configure the distributed environment variables (NODE_RANK and MASTER_ADDR) on each node before execution.

Step 1: Export Variables

# On the master node (Rank 0):
export NODE_RANK=0
export MASTER_ADDR=xxx.xxx.xxx.xxx  # Replace with actual Master IP

# On worker nodes (Rank 1, 2, ...):
export NODE_RANK=1  # Change based on node index
export MASTER_ADDR=xxx.xxx.xxx.xxx  # Same Master IP as above

Step 2: Launch Script (Run on ALL nodes)

bash scripts/run_multinodes_repro_grpo.sh \
    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    2 \
    8 \
    1 \
    0.7 \
    repro-multinode-exp
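
Since the launch command must be issued on every node, it can be convenient to dispatch it from the master over SSH. The loop below is only a sketch: the hostnames (node0, node1), the repository path, and the master IP are placeholders, and it assumes password-less SSH with the repro conda environment activated in each remote shell.

MASTER_IP=xxx.xxx.xxx.xxx   # replace with the actual master node IP
RANK=0
for HOST in node0 node1; do   # placeholder hostnames, master node first
  ssh "$HOST" "cd /path/to/RePro && \
    export NODE_RANK=$RANK MASTER_ADDR=$MASTER_IP && \
    bash scripts/run_multinodes_repro_grpo.sh \
      deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B 2 8 1 0.7 repro-multinode-exp" &
  RANK=$((RANK + 1))
done
wait   # block until all node launches return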

📖 Citation

If you find this work or code useful in your research, please consider citing:

@article{author2025repro,
  title={Rectifying LLM Thought From Lens of Optimization},
  author={Author One and Author Two and Author Three},
  journal={arXiv preprint arXiv:2507.06920},
  year={2025}
}
