uoft-isl/uf-ops

Update-Free On-Policy Steering Via Verifiers

Diffusion Policy extended with classifier-guided sampling for robomimic. The added pipeline lives in diffusion_policy/uf_ops/ and supports two guidance modes:

  • BoN (Best-of-N) — sample N action candidates per step and pick the argmax under a learned scoring model.
  • cg (classifier guidance) — apply a Tweedie-formula gradient from the scoring model to the predicted x0 inside the noise scheduler.
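
As a rough illustration of the BoN mode listed above, the sketch below draws N candidates and keeps the one the scorer rates highest. The names best_of_n, sample_action, and score are hypothetical stand-ins for the policy sampler and the learned scoring model; they are not the repository's API, and the toy sampler and scorer exist only to make the example runnable.

```python
import random
from typing import Callable, List, Sequence, Tuple

Action = Sequence[float]


def best_of_n(
    sample_action: Callable[[random.Random], Action],
    score: Callable[[Action], float],
    n: int,
    rng: random.Random,
) -> Tuple[Action, float]:
    """Draw n candidate actions and return the one with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[Action] = [sample_action(rng) for _ in range(n)]
    scores = [score(c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)  # argmax over candidates
    return candidates[best], scores[best]


# Toy stand-ins (assumptions): a Gaussian "policy" and a scorer that
# prefers actions close to the origin.
def toy_sampler(rng: random.Random) -> Action:
    return [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]


def toy_score(action: Action) -> float:
    return -sum(a * a for a in action)


action, value = best_of_n(toy_sampler, toy_score, n=16, rng=random.Random(0))
```

Larger n trades more scoring-model calls per step for a better chance of finding a high-scoring candidate.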

Two scoring models are supported:

  • ContrastiveClassifier — a binary success classifier (BCE + dissimilar-contrastive loss).
  • Time2Success — a regression model that fits the discounted return of a trajectory.
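
To make the Time2Success regression target concrete, here is a minimal sketch of computing a per-step discounted return from a reward sequence. The function name discounted_returns, the discount factor gamma, and the sparse terminal reward are assumptions for illustration; the actual reward layout and discount come from the repository's dataset code and configs.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A three-step trajectory that succeeds on its last step.
targets = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)  # [0.25, 0.5, 1.0]
```

With a sparse success reward, steps closer to success get a larger target, which is what lets the regressor rank candidate actions.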

Both are defined in diffusion_policy/uf_ops/models.py. The guided noise scheduler is in diffusion_policy/uf_ops/tweedie_guided_ddpm.py, and the guided env runners and policy wrappers live in diffusion_policy/env_runner/ and diffusion_policy/policy/ as guided_*.
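
For readers unfamiliar with the cg mode, the following is a minimal sketch of the general idea: estimate the clean sample x0 from the noisy sample (the DDPM form of Tweedie's formula), then shift that estimate along the gradient of a scoring function. It is not the code in tweedie_guided_ddpm.py; predict_x0, guide_x0, guidance_scale, and the quadratic toy score are all assumptions made for illustration.

```python
import math
from typing import Callable, List

Vector = List[float]


def predict_x0(x_t: Vector, eps_hat: Vector, alpha_bar_t: float) -> Vector:
    """Estimate the clean sample from x_t and the predicted noise.

    Inverts x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    """
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise_scale * e) / signal_scale for x, e in zip(x_t, eps_hat)]


def guide_x0(
    x0_hat: Vector,
    score_grad: Callable[[Vector], Vector],
    guidance_scale: float,
) -> Vector:
    """Shift the predicted x0 along the gradient of the scoring function."""
    grad = score_grad(x0_hat)
    return [x + guidance_scale * g for x, g in zip(x0_hat, grad)]


# Toy scoring function (assumption): score(x) = -0.5 * ||x - TARGET||^2,
# whose gradient with respect to x is (TARGET - x).
TARGET = [1.0, -1.0]


def toy_score_grad(x: Vector) -> Vector:
    return [t - xi for t, xi in zip(TARGET, x)]


x0_true = [0.0, 0.0]
eps = [0.5, -0.5]
alpha_bar = 0.64
x_t = [
    math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
    for x, e in zip(x0_true, eps)
]
x0_hat = predict_x0(x_t, eps, alpha_bar)  # recovers x0_true up to rounding
x0_guided = guide_x0(x0_hat, toy_score_grad, guidance_scale=0.1)
```

In a real scheduler the gradient would come from backpropagating through the learned scoring model rather than from a closed-form expression.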

Setup

Dependencies are managed with uv. One command builds the env from pyproject.toml + uv.lock:

uv sync

This creates .venv/ with Python 3.9, PyTorch 2.7.1 (cu126 wheels), and the rest of the stack pinned by uv.lock. Run commands with uv run python ... or activate the venv (source .venv/bin/activate).

A few non-PyPI bits are wired in transparently:

  • gym 0.21.0 is pulled from third_party/gym-0.21.0/, which is the upstream sdist with two metadata bugs patched (deprecated tests_require, and an invalid opencv-python>=3. specifier).
  • robosuite is installed from cheng-chi/robosuite@offline_study, which adds the TwoArmTransport / ToolHang envs needed by these benchmarks.
  • mujoco-py (a Cython sdist that no longer compiles) is excluded; we use free-mujoco-py instead, which ships a prebuilt binary under the same mujoco_py import name.
  • pytorch3d is built from source (v0.7.9 tag). First uv sync will take a few minutes for that compile; subsequent syncs reuse the cached wheel.
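
After uv sync finishes, a quick check like the one below can confirm that the packages mentioned above resolve inside the venv. This is a hypothetical helper, not a script shipped with the repository; the import names are the ones used in the notes above (mujoco_py being provided by free-mujoco-py).

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names taken from the setup notes; run this inside the uv-managed venv.
EXPECTED = ["torch", "gym", "robosuite", "mujoco_py", "pytorch3d"]

if __name__ == "__main__":
    missing = missing_modules(EXPECTED)
    print("missing:", missing if missing else "none")
```

Saved under a name of your choice, it can be run with uv run python. find_spec only locates the modules and does not import them, so the check stays fast even for heavy packages.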

Robomimic datasets (square/, transport/ in ph/ and mh/ formats) need to be placed at the path each config expects (defaults: /data/bc_uncertainty/robomimic/datasets/...); update original_dataset_path and checkpoint_path in the affected configs if your layout differs.

Pipeline

The full pipeline takes a base diffusion policy checkpoint and produces a guidance model + an evaluation. Five steps, all driven by Hydra configs under diffusion_policy/uf_ops/configs/.

1. Train a base diffusion policy (upstream)

Standard Diffusion Policy training from the root of the repo:

python train.py --config-dir=. --config-name=train_diffusion_unet_image_workspace \
  task=transport_image_abs

Available task names: square_lowdim_abs, square_image_abs, transport_lowdim_abs, transport_image_abs. Training writes latest.ckpt under the workspace's output dir.

2. Collect a rollout dataset

Roll out the trained policy in the env to gather (obs, action, reward, next_obs, done) transitions and write them to an HDF5 file.

cd diffusion_policy/uf_ops
python update_dataset.py        # lowdim, uses configs/make_new_dataset.yaml
python update_image_dataset.py  # image,  uses configs/make_new_image_dataset.yaml

Override the checkpoint and output paths via Hydra CLI as needed:

python update_dataset.py \
  checkpoint_path=/path/to/latest.ckpt \
  num_trajectories=3000 \
  output_dir=/path/to/output_dir/
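
The collection loop behind this step can be pictured roughly as follows. This is a sketch under stated assumptions: collect_trajectory, the toy 1-D environment, and the constant policy are hypothetical, and the transitions are kept in memory, whereas the real scripts roll out the trained diffusion policy in robomimic and write HDF5.

```python
from typing import Callable, Dict, List, Tuple

Transition = Dict[str, object]


def collect_trajectory(
    reset: Callable[[], float],
    step: Callable[[float, float], Tuple[float, float, bool]],
    policy: Callable[[float], float],
    max_steps: int,
) -> List[Transition]:
    """Roll out one episode and record (obs, action, reward, next_obs, done)."""
    obs = reset()
    transitions: List[Transition] = []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done = step(obs, action)
        transitions.append(
            {"obs": obs, "action": action, "reward": reward,
             "next_obs": next_obs, "done": done}
        )
        obs = next_obs
        if done:
            break
    return transitions


# Toy 1-D environment (assumption): the state moves by the action and the
# episode succeeds, with reward 1.0, once the state reaches 3.0.
def toy_reset() -> float:
    return 0.0


def toy_step(obs: float, action: float) -> Tuple[float, float, bool]:
    next_obs = obs + action
    done = next_obs >= 3.0
    return next_obs, (1.0 if done else 0.0), done


def toy_policy(obs: float) -> float:
    return 1.0


trajectory = collect_trajectory(toy_reset, toy_step, toy_policy, max_steps=10)
```

Each recorded dictionary corresponds to one transition tuple; repeating the loop num_trajectories times would yield the full dataset.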

3. Compute dataset statistics

Calculates obs_mean, obs_std, action_mean, action_std for normalization at training and inference time.

python calculate_obs_stats.py \
  --config-path=./configs/transport_image --config-name=time2success_BoN

The statistics are written to the dataset_stats_path declared in the chosen config. Pass --config-path and --config-name as separate flags: Hydra parses the combined slash form (directory and name joined in one value) as a config group.
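
As an illustration of what these statistics are used for, the sketch below computes per-dimension mean and standard deviation and applies them to one observation. The helper names, the use of the population standard deviation, and the eps guard against constant dimensions are assumptions; calculate_obs_stats.py may differ in all three.

```python
import statistics
from typing import List, Sequence, Tuple


def column_stats(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and population standard deviation."""
    columns = list(zip(*rows))  # transpose: one tuple per dimension
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def normalize(
    row: Sequence[float],
    means: Sequence[float],
    stds: Sequence[float],
    eps: float = 1e-6,
) -> List[float]:
    """Standardize one row; eps avoids division by zero on constant dimensions."""
    return [(x - m) / (s + eps) for x, m, s in zip(row, means, stds)]


observations = [[0.0, 10.0], [2.0, 10.0]]
obs_mean, obs_std = column_stats(observations)
normalized = normalize(observations[0], obs_mean, obs_std)
```

The same mean and std must be applied at training and inference time, which is why they are computed once and stored.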

4. Train the guidance model

Pick one of the two trainers depending on the scoring model:

# Contrastive binary classifier
python train_contrastive_classifier.py \
  --config-path=./configs/transport_image --config-name=contrastive_classifier_cg

# Time2Success reward regression
python train_time2success.py \
  --config-path=./configs/transport_image --config-name=time2success_BoN

The trained checkpoint is written to classifier_dir/best_model.pth.
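
The README describes the ContrastiveClassifier objective only as "BCE + dissimilar-contrastive loss". The sketch below shows one plausible reading: binary cross-entropy on the success probability plus the standard hinge term for dissimilar pairs, which penalizes a success embedding and a failure embedding that sit closer than a margin. The exact form, the weight, and the margin used in train_contrastive_classifier.py are not stated here, so all of them are assumptions.

```python
import math
from typing import Sequence


def bce(prob: float, label: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction, clamped for numerical safety."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def dissimilar_hinge(
    z_a: Sequence[float], z_b: Sequence[float], margin: float = 1.0
) -> float:
    """Penalize a dissimilar pair whose embeddings are closer than margin."""
    distance = math.dist(z_a, z_b)
    return max(0.0, margin - distance) ** 2


def combined_loss(
    prob: float,
    label: float,
    z_success: Sequence[float],
    z_failure: Sequence[float],
    weight: float = 0.5,
    margin: float = 1.0,
) -> float:
    """BCE plus a weighted hinge term on one (success, failure) embedding pair."""
    return bce(prob, label) + weight * dissimilar_hinge(z_success, z_failure, margin)


loss = combined_loss(0.5, 1.0, z_success=[0.0, 0.0], z_failure=[0.0, 0.0])
```

Once the two embeddings are at least margin apart, the hinge term vanishes and only the BCE term contributes.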

5. Evaluate with guidance

# lowdim
python evaluate_classifier.py \
  --config-path=./configs/transport_lowdim --config-name=contrastive_classifier_BoN
# image
python evaluate_image_classifier.py \
  --config-path=./configs/transport_image --config-name=time2success_cg

The config's guidance_type field (BoN or cg) selects the guidance strategy, and is_time2success selects which of the two scoring models is used wherever either one is applicable.
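
How those two fields combine can be pictured with the small sketch below. describe_run is a hypothetical helper, and treating a missing is_time2success as False is an assumption; the evaluation scripts may read the config differently.

```python
from typing import Dict, Tuple

VALID_GUIDANCE = ("BoN", "cg")


def describe_run(cfg: Dict[str, object]) -> Tuple[str, str]:
    """Map the two config fields to (guidance strategy, scoring model name)."""
    guidance = cfg["guidance_type"]
    if guidance not in VALID_GUIDANCE:
        raise ValueError(
            f"guidance_type must be one of {VALID_GUIDANCE}, got {guidance!r}"
        )
    # Assumption: a missing is_time2success is treated as False.
    uses_time2success = bool(cfg.get("is_time2success", False))
    scoring = "Time2Success" if uses_time2success else "ContrastiveClassifier"
    return str(guidance), scoring


run = describe_run({"guidance_type": "cg", "is_time2success": True})
```

Validating guidance_type up front turns a typo in the config into an immediate error instead of a silent fallback.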

Config layout

diffusion_policy/uf_ops/configs/
├── make_new_dataset.yaml              # rollout collection (lowdim)
├── make_new_image_dataset.yaml        # rollout collection (image)
├── square_lowdim/                     # \
├── square_image/                      #  } per-(task, modality) eval configs
├── transport_lowdim/                  #  |
└── transport_image/                   # /

Each {task}_{modality}/ directory contains four canonical configs:

| File | Scoring model | Guidance |
|---|---|---|
| contrastive_classifier_BoN.yaml | ContrastiveClassifier | Best-of-N |
| contrastive_classifier_cg.yaml | ContrastiveClassifier | classifier guidance |
| time2success_BoN.yaml | Time2Success | Best-of-N |
| time2success_cg.yaml | Time2Success | classifier guidance |
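
Because the layout is regular, the canonical config paths can be enumerated programmatically, for example to sweep every combination. The sketch below builds them from the directory and file names listed above; canonical_configs is a hypothetical helper, not part of the repository.

```python
from itertools import product
from pathlib import PurePosixPath
from typing import List

CONFIG_ROOT = PurePosixPath("diffusion_policy/uf_ops/configs")
TASKS = ("square", "transport")
MODALITIES = ("lowdim", "image")
SCORING = ("contrastive_classifier", "time2success")
GUIDANCE = ("BoN", "cg")


def canonical_configs() -> List[PurePosixPath]:
    """All {task}_{modality}/{scoring}_{guidance}.yaml paths, in a fixed order."""
    return [
        CONFIG_ROOT / f"{task}_{modality}" / f"{scoring}_{guidance}.yaml"
        for task, modality, scoring, guidance in product(
            TASKS, MODALITIES, SCORING, GUIDANCE
        )
    ]


paths = canonical_configs()  # 2 tasks x 2 modalities x 2 models x 2 modes
```

Each path splits naturally into the --config-path (its parent directory) and --config-name (its stem) used by the commands in the pipeline steps.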

The lowdim subdirs additionally hold ablation configs (mh_ph_ablation_*, ph_mh_ablation_*) for cross-demo-set generalization runs (train on one demo set, evaluate on the other).

Code map

| Path | What's there |
|---|---|
| uf_ops/models.py | Time2Success, ContrastiveClassifier + image variants |
| uf_ops/tweedie_guided_ddpm.py | _BaseGuidedDDPMScheduler, Time2SuccessGuidedDPMScheduler, ClassifierGuidedDDPMScheduler |
| uf_ops/guided_classifier_dataset.py | RewardTrajectoryDataset, ContrastiveTrajectoryDataset, make_dataset |
| uf_ops/eval_policy_utils.py, uf_ops/eval_image_policy_utils.py | rollout_policy_BoN, rollout_policy_cg, rollout_policy_collect (and image counterparts) |
| env_runner/guided_robomimic_*_runner_{bon,collect}.py | BoN and data-collection env runners (the cg path reuses the upstream robomimic_*_runner.py) |
| policy/diffusion_unet_{lowdim,image}_policy.py | Upstream policies, edited in place to dispatch to the guided schedulers when one is attached |

Citation

Please cite us if you liked the work!

@article{attarian2026update,
  title={Update-Free On-Policy Steering via Verifiers},
  author={Attarian, Maria and Vyse, Ian and Voelcker, Claas and Gerigk, Jasper and Opryshko, Evgenii and Almasri, Anas and Singh, Sumeet and Du, Yilun and Gilitschenski, Igor},
  journal={arXiv preprint arXiv:2603.10282},
  year={2026}
}
