Update-Free On-Policy Steering Via Verifiers

This repository extends Diffusion Policy with classifier-guided sampling for robomimic tasks. The added pipeline lives in diffusion_policy/uf_ops/ and supports two guidance modes:

BoN (Best-of-N) — sample N action candidates per step and pick the argmax under a learned scoring model.
cg (classifier guidance) — apply a Tweedie-formula gradient from the scoring model to the predicted x0 inside the noise scheduler.
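
As a rough illustration of the BoN mode, the sketch below draws N candidates and keeps the one with the highest score. The names `best_of_n`, `sample_action`, and `score` are hypothetical stand-ins for the diffusion policy sampler and the learned scoring model; they are not taken from this repository.

```python
import random
from typing import Callable, List, Sequence

Action = List[float]


def best_of_n(
    sample_action: Callable[[], Action],
    score: Callable[[Sequence[float]], float],
    n: int,
) -> Action:
    """Draw n candidate actions and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_action() for _ in range(n)]
    # argmax over the candidates under the scoring model
    return max(candidates, key=score)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins (assumptions): a 2-D Gaussian "policy" and a scorer
    # that prefers actions close to a fixed goal.
    goal = [1.0, -1.0]
    sample = lambda: [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    scorer = lambda a: -sum((x - g) ** 2 for x, g in zip(a, goal))
    print(best_of_n(sample, scorer, n=16))
```

Because selection happens only at sampling time, this mode needs no gradient from the scoring model and leaves the base policy weights untouched.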

Two scoring models are supported:

ContrastiveClassifier — a binary success classifier (BCE + dissimilar-contrastive loss).
Time2Success — a regression model that fits the discounted return of a trajectory.
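
For intuition about the Time2Success target, here is a minimal sketch of a discounted return-to-go computation. The function name, the discount `gamma`, and the sparse terminal reward in the example are assumptions for illustration; the actual target construction in this repository may differ.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate backwards so each step reuses the return of the next step.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Assumed sparse reward: 1.0 only on the final (successful) step.
    print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5))
    # [0.125, 0.25, 0.5, 1.0]
```

Under a sparse success reward like the one assumed here, the target grows as a state gets closer in time to success, so a regressor fitted to it ranks actions by how soon they are expected to succeed.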

Both are defined in diffusion_policy/uf_ops/models.py. The guided noise scheduler is in diffusion_policy/uf_ops/tweedie_guided_ddpm.py, and the guided env runners and policy wrappers live in diffusion_policy/env_runner/ and diffusion_policy/policy/ as guided_*.
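
The cg mode can be pictured in two steps: estimate the clean sample x0 from the noisy sample using the standard DDPM epsilon-parameterization identity, then shift that estimate along the gradient of the scoring model. The toy sketch below uses an analytic gradient of a quadratic score. `predict_x0`, `guide_x0`, `guidance_scale`, and the quadratic score are illustrative assumptions rather than names from tweedie_guided_ddpm.py, and a real scheduler would presumably take the gradient of the learned model by automatic differentiation.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def predict_x0(x_t: Sequence[float], eps_hat: Sequence[float], alpha_bar_t: float) -> Vector:
    """Estimate the clean sample: (x_t - sqrt(1 - abar) * eps_hat) / sqrt(abar)."""
    scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise_scale * e) / scale for x, e in zip(x_t, eps_hat)]


def guide_x0(
    x0_hat: Sequence[float],
    score_grad: Callable[[Sequence[float]], Vector],
    guidance_scale: float,
) -> Vector:
    """Move the x0 estimate one gradient-ascent step on the score."""
    grad = score_grad(x0_hat)
    return [x + guidance_scale * g for x, g in zip(x0_hat, grad)]


if __name__ == "__main__":
    # Toy score (assumption): s(x) = -||x - target||^2, so grad s = -2 (x - target).
    target = [1.0, 0.0]
    toy_grad = lambda x: [-2.0 * (xi - ti) for xi, ti in zip(x, target)]

    x0_hat = predict_x0([0.5, 0.5], [0.0, 0.0], alpha_bar_t=0.25)
    print(x0_hat)                                          # [1.0, 1.0]
    print(guide_x0(x0_hat, toy_grad, guidance_scale=0.25))  # [1.0, 0.5]
```

How the guided estimate is folded back into the DDPM posterior step is left out of this sketch.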

Setup

Dependencies are managed with uv. One command builds the env from pyproject.toml + uv.lock:

uv sync

This creates .venv/ with Python 3.9, PyTorch 2.7.1 (cu126 wheels), and the rest of the stack pinned by uv.lock. Run commands with uv run python ... or activate the venv (source .venv/bin/activate).

A few non-PyPI bits are wired in transparently:

gym 0.21.0 is pulled from third_party/gym-0.21.0/, which is the upstream sdist with two metadata bugs patched (deprecated tests_require, and an invalid opencv-python>=3. specifier).
robosuite is installed from cheng-chi/robosuite@offline_study, which adds the TwoArmTransport / ToolHang envs needed by these benchmarks.
mujoco-py (a Cython sdist that no longer compiles) is excluded; we use free-mujoco-py instead, which ships a prebuilt binary under the same mujoco_py import name.
pytorch3d is built from source (v0.7.9 tag). First uv sync will take a few minutes for that compile; subsequent syncs reuse the cached wheel.

Robomimic datasets (square/, transport/ in ph/ and mh/ formats) need to be placed at the path each config expects (defaults: /data/bc_uncertainty/robomimic/datasets/...); update original_dataset_path and checkpoint_path in the affected configs if your layout differs.

Pipeline

The full pipeline takes a base diffusion policy checkpoint and produces a trained guidance model and an evaluation of the guided policy. It has five steps, all driven by Hydra configs under diffusion_policy/uf_ops/configs/.

1. Train a base diffusion policy (upstream)

Standard Diffusion Policy training from the root of the repo:

python train.py --config-dir=. --config-name=train_diffusion_unet_image_workspace \
  task=transport_image_abs

Available task names: square_lowdim_abs, square_image_abs, transport_lowdim_abs, transport_image_abs. Training produces latest.ckpt under the workspace's output dir.

2. Collect a rollout dataset

Roll out the trained policy in the env to gather (obs, action, reward, next_obs, done) transitions and write them to an HDF5 file.

cd diffusion_policy/uf_ops
python update_dataset.py        # lowdim, uses configs/make_new_dataset.yaml
python update_image_dataset.py  # image,  uses configs/make_new_image_dataset.yaml

Override the checkpoint and output paths via Hydra CLI as needed:

python update_dataset.py \
  checkpoint_path=/path/to/latest.ckpt \
  num_trajectories=3000 \
  output_dir=/path/to/output_dir/
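
The shape of the collected data can be sketched with a toy rollout loop. Everything here (`ToyEnv`, `collect_trajectory`, the constant policy) is hypothetical and only mirrors the gym 0.21 `reset`/`step` signature; the real scripts roll out the diffusion policy in robomimic and write the transitions to HDF5, which this sketch omits.

```python
from typing import Callable, Dict, List, Tuple

Transition = Dict[str, object]


class ToyEnv:
    """Hypothetical 1-D env with the gym 0.21 reset/step signature."""

    def __init__(self, horizon: int = 5) -> None:
        self.horizon = horizon
        self.t = 0
        self.state = 0.0

    def reset(self) -> float:
        self.t = 0
        self.state = 0.0
        return self.state

    def step(self, action: float) -> Tuple[float, float, bool, dict]:
        self.state += action
        self.t += 1
        success = self.state >= 1.0
        # Episode ends on success or when the horizon is reached.
        done = success or self.t >= self.horizon
        return self.state, float(success), done, {}


def collect_trajectory(env: ToyEnv, policy: Callable[[float], float]) -> List[Transition]:
    """Roll out one episode and record (obs, action, reward, next_obs, done)."""
    transitions: List[Transition] = []
    obs = env.reset()
    done = False
    while not done:
        action = policy(obs)
        next_obs, reward, done, _info = env.step(action)
        transitions.append(
            dict(obs=obs, action=action, reward=reward, next_obs=next_obs, done=done)
        )
        obs = next_obs
    return transitions


if __name__ == "__main__":
    for tr in collect_trajectory(ToyEnv(), policy=lambda obs: 0.5):
        print(tr)
```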

3. Compute dataset statistics

Calculates obs_mean, obs_std, action_mean, action_std for normalization at training and inference time.

python calculate_obs_stats.py \
  --config-path=./configs/transport_image --config-name=time2success_BoN

The script writes the statistics to the dataset_stats_path declared in the chosen config. (Hydra needs --config-path and --config-name as separate flags; the combined slash form is parsed as a config group.)
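
As a minimal sketch of what such statistics are used for, the snippet below computes a per-dimension mean and standard deviation and applies them to one vector. The function names, the use of the population standard deviation, and the `eps` floor on near-constant dimensions are assumptions, not a description of calculate_obs_stats.py.

```python
import math
from typing import List, Sequence, Tuple


def column_stats(
    rows: Sequence[Sequence[float]], eps: float = 1e-6
) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and population standard deviation, floored at eps."""
    n = len(rows)
    dims = len(rows[0])
    mean = [sum(r[d] for r in rows) / n for d in range(dims)]
    std = [
        # The floor avoids division by zero for constant dimensions.
        max(math.sqrt(sum((r[d] - mean[d]) ** 2 for r in rows) / n), eps)
        for d in range(dims)
    ]
    return mean, std


def normalize(x: Sequence[float], mean: Sequence[float], std: Sequence[float]) -> List[float]:
    """Standardize one vector with precomputed statistics."""
    return [(v - m) / s for v, m, s in zip(x, mean, std)]


if __name__ == "__main__":
    obs = [[0.0, 2.0], [2.0, 2.0]]
    obs_mean, obs_std = column_stats(obs)
    print(obs_mean, obs_std)                       # [1.0, 2.0] [1.0, 1e-06]
    print(normalize([2.0, 2.0], obs_mean, obs_std))  # [1.0, 0.0]
```

The same statistics have to be applied at training and at inference time, which is why they are written to a file rather than recomputed.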

4. Train the guidance model

Pick one of the two trainers depending on the scoring model:

# Contrastive binary classifier
python train_contrastive_classifier.py \
  --config-path=./configs/transport_image --config-name=contrastive_classifier_cg

# Time2Success reward regression
python train_time2success.py \
  --config-path=./configs/transport_image --config-name=time2success_BoN

The trained checkpoint is written to classifier_dir/best_model.pth.

5. Evaluate with guidance

# lowdim
python evaluate_classifier.py \
  --config-path=./configs/transport_lowdim --config-name=contrastive_classifier_BoN
# image
python evaluate_image_classifier.py \
  --config-path=./configs/transport_image --config-name=time2success_cg

The config's guidance_type field (BoN or cg) selects the guidance strategy, and is_time2success selects which of the two scoring models is used where either could apply.
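
The two fields can be read as a small dispatch table. The sketch below is illustrative only: `describe_guidance` is a hypothetical helper, and it assumes that is_time2success is a boolean whose false value means ContrastiveClassifier.

```python
from typing import Tuple

# Strategy labels keyed by the guidance_type values named in the configs.
STRATEGIES = {"BoN": "Best-of-N", "cg": "classifier guidance"}


def describe_guidance(guidance_type: str, is_time2success: bool) -> Tuple[str, str]:
    """Map the two config fields to (scoring model, guidance strategy)."""
    if guidance_type not in STRATEGIES:
        raise ValueError(f"guidance_type must be one of {sorted(STRATEGIES)}")
    # Assumption: False selects the contrastive classifier.
    model = "Time2Success" if is_time2success else "ContrastiveClassifier"
    return model, STRATEGIES[guidance_type]


if __name__ == "__main__":
    print(describe_guidance("cg", is_time2success=True))
    print(describe_guidance("BoN", is_time2success=False))
```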

Config layout

diffusion_policy/uf_ops/configs/
├── make_new_dataset.yaml              # rollout collection (lowdim)
├── make_new_image_dataset.yaml        # rollout collection (image)
├── square_lowdim/                     # \
├── square_image/                      #  } per-(task, modality) eval configs
├── transport_lowdim/                  #  |
└── transport_image/                   # /

Each {task}_{modality}/ directory contains four canonical configs:

| File | Scoring model | Guidance |
| --- | --- | --- |
| `contrastive_classifier_BoN.yaml` | `ContrastiveClassifier` | Best-of-N |
| `contrastive_classifier_cg.yaml` | `ContrastiveClassifier` | classifier guidance |
| `time2success_BoN.yaml` | `Time2Success` | Best-of-N |
| `time2success_cg.yaml` | `Time2Success` | classifier guidance |

The lowdim subdirs additionally hold ablation configs (mh_ph_ablation_*, ph_mh_ablation_*) for cross-demo-set generalization runs (train on one demo set, evaluate on the other).
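
The canonical configs follow a regular naming scheme, so a sweep over all of them can be generated rather than typed out. The helper below is a hypothetical convenience (not part of the repository) that builds the sixteen paths from the directory and file names listed above; it does not cover the ablation configs.

```python
from itertools import product
from pathlib import PurePosixPath
from typing import List

CONFIG_ROOT = PurePosixPath("diffusion_policy/uf_ops/configs")
TASKS = ("square", "transport")
MODALITIES = ("lowdim", "image")
SCORING = ("contrastive_classifier", "time2success")
GUIDANCE = ("BoN", "cg")


def canonical_configs() -> List[PurePosixPath]:
    """Enumerate {task}_{modality}/{scoring}_{guidance}.yaml for every combination."""
    return [
        CONFIG_ROOT / f"{task}_{modality}" / f"{scoring}_{guidance}.yaml"
        for task, modality, scoring, guidance in product(
            TASKS, MODALITIES, SCORING, GUIDANCE
        )
    ]


if __name__ == "__main__":
    for path in canonical_configs():
        print(path)
```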

Code map

| Path | What's there |
| --- | --- |
| `uf_ops/models.py` | `Time2Success`, `ContrastiveClassifier` + image variants |
| `uf_ops/tweedie_guided_ddpm.py` | `_BaseGuidedDDPMScheduler`, `Time2SuccessGuidedDPMScheduler`, `ClassifierGuidedDDPMScheduler` |
| `uf_ops/guided_classifier_dataset.py` | `RewardTrajectoryDataset`, `ContrastiveTrajectoryDataset`, `make_dataset` |
| `uf_ops/eval_policy_utils.py`, `uf_ops/eval_image_policy_utils.py` | `rollout_policy_BoN`, `rollout_policy_cg`, `rollout_policy_collect` (and image counterparts) |
| `env_runner/guided_robomimic_*_runner_{bon,collect}.py` | BoN and data-collection env runners (the `cg` path reuses the upstream `robomimic_*_runner.py`) |
| `policy/diffusion_unet_{lowdim,image}_policy.py` | Upstream policies, edited in place to dispatch to the guided schedulers when one is attached |

Citation

Please cite us if you liked the work!

@article{attarian2026update,
  title={Update-Free On-Policy Steering via Verifiers},
  author={Attarian, Maria and Vyse, Ian and Voelcker, Claas and Gerigk, Jasper and Opryshko, Evgenii and Almasri, Anas and Singh, Sumeet and Du, Yilun and Gilitschenski, Igor},
  journal={arXiv preprint arXiv:2603.10282},
  year={2026}
}
