
RePro: Rectifying LLM Thought From Lens of Optimization


📋 Introduction

RePro (Rectifying Process-level Reward) is a novel post-training framework that aligns Chain-of-Thought (CoT) reasoning with gradient descent optimization principles.

While long-CoT prompting facilitates thorough exploration, it frequently results in suboptimal behaviors such as overthinking, hallucination, and inefficient reasoning paths. RePro mitigates these issues by:

  • 📉 Optimization Lens: Framing the reasoning trace as an optimization trajectory in which each step acts as a gradient update toward the optimal solution.
  • ⚖️ Dual Scoring Mechanism: Introducing a surrogate objective function to quantify both the intensity and stability of the reasoning process.
  • 🎯 Process-Level Reward: Integrating these metrics into Reinforcement Learning with Verifiable Rewards (RLVR) pipelines to guide model alignment.

Empirical evaluations across mathematics, science, and coding benchmarks demonstrate that RePro consistently enhances reasoning accuracy while significantly reducing redundancy.

📦 Dependencies

  • Python: 3.10+
  • CUDA: 11.8+
  • Key Libraries: PyTorch, vLLM, VeRL
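
You can sanity-check the environment before installing. The snippet below is a minimal sketch, assuming the CUDA toolkit (nvcc) is on your PATH and that PyTorch and vLLM are already installed; adapt it to your setup.

python --version   # expect 3.10 or newer
nvcc --version     # expect CUDA 11.8 or newer
python -c "import torch, vllm; print(torch.__version__, vllm.__version__)"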

⚙️ Installation

  1. Create a conda environment:

    conda create -n repro python=3.10
    conda activate repro
  2. Install the package in editable mode (run from the repository root):

    pip install -e .
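
If you have not cloned the repository yet, the full sequence looks roughly like this (a sketch assuming the repository is hosted at the open-compass/RePro GitHub path; adjust the URL if you work from a fork or mirror):

git clone https://github.com/open-compass/RePro.git
cd RePro
conda create -n repro python=3.10
conda activate repro
pip install -e .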

🚀 Quick Start

🧠 GRPO Training

We provide a demo script for launching both single-node and multi-node GRPO training.

Usage Syntax

Navigate to the project root and run the training script. Make sure the paths inside scripts/run_multinodes_repro_grpo.sh are configured correctly for your environment.

bash scripts/run_multinodes_repro_grpo.sh \
    <MODEL_PATH> \
    <NUM_NODES> \
    <GPUS_PER_NODE> \
    <TP_SIZE> \
    <VLLM_GPU_UTIL> \
    <RUN_NAME>

Arguments

  • MODEL_PATH: HuggingFace model ID or local path (e.g., deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
  • NUM_NODES: Total number of nodes (machines) for training
  • GPUS_PER_NODE: Number of GPUs available per node
  • TP_SIZE: Tensor parallelism size
  • VLLM_GPU_UTIL: vLLM GPU memory utilization ratio (e.g., 0.7)
  • RUN_NAME: Unique identifier for experiment logging

1. Single-Node Example (DeepSeek-R1-Distill-Qwen-1.5B)

To train on a single machine with 8 GPUs:

bash scripts/run_multinodes_repro_grpo.sh \
    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    1 \
    8 \
    1 \
    0.7 \
    repro-deepscale-r-exp

2. Multi-Node Training Example

For multi-node setups, you must configure the distributed environment variables (NODE_RANK and MASTER_ADDR) on each node before execution.

Step 1: Export Variables

# On the master node (Rank 0):
export NODE_RANK=0
export MASTER_ADDR=xxx.xxx.xxx.xxx  # Replace with actual Master IP

# On worker nodes (Rank 1, 2, ...):
export NODE_RANK=1  # Change based on node index
export MASTER_ADDR=xxx.xxx.xxx.xxx  # Same Master IP as above

Step 2: Launch Script (Run on ALL nodes)

bash scripts/run_multinodes_repro_grpo.sh \
    deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    2 \
    8 \
    1 \
    0.7 \
    repro-multinode-exp
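
Since the launch command must be issued on every node, it can be convenient to dispatch it from the master over SSH. The loop below is only a sketch: the hostnames (node0, node1), the repository path, and the master IP are placeholders, and it assumes password-less SSH with the repro conda environment activated in each remote shell.

MASTER_IP=xxx.xxx.xxx.xxx   # replace with the actual master node IP
RANK=0
for HOST in node0 node1; do   # placeholder hostnames, master node first
  ssh "$HOST" "cd /path/to/RePro && \
    export NODE_RANK=$RANK MASTER_ADDR=$MASTER_IP && \
    bash scripts/run_multinodes_repro_grpo.sh \
      deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B 2 8 1 0.7 repro-multinode-exp" &
  RANK=$((RANK + 1))
done
wait   # block until all node launches return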

📖 Citation

If you find this work or code useful in your research, please consider citing:

@article{author2025repro,
  title={Rectifying LLM Thought From Lens of Optimization},
  author={Author One and Author Two and Author Three},
  journal={arXiv preprint arXiv:2507.06920},
  year={2025}
}
