Diffusion Policy extended with classifier-guided sampling for robomimic. The added pipeline lives in diffusion_policy/uf_ops/ and supports two guidance modes:
- BoN (Best-of-N) — sample N action candidates per step and pick the argmax under a learned scoring model.
- cg (classifier guidance) — apply a Tweedie-formula gradient from the scoring model to the predicted x0 inside the noise scheduler.
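As a rough illustration of the BoN mode, the sketch below draws N candidates and keeps the highest-scoring one. It is not the code in uf_ops: `best_of_n`, `toy_policy`, and `toy_score` are hypothetical names, and the toy scorer only stands in for a learned scoring model.

```python
import random
from typing import Callable, List

Action = List[float]


def best_of_n(
    sample_action: Callable[[], Action],
    score: Callable[[Action], float],
    n: int,
) -> Action:
    """Draw n candidate actions and return the one with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_action() for _ in range(n)]
    return max(candidates, key=score)


# Hypothetical stand-ins: a random 2-D "policy" and a scorer that prefers
# actions close to a fixed target. A real run would sample from the diffusion
# policy and score with the trained model instead.
rng = random.Random(0)
target = [0.5, -0.2]


def toy_policy() -> Action:
    return [rng.uniform(-1.0, 1.0) for _ in target]


def toy_score(action: Action) -> float:
    return -sum((a - t) ** 2 for a, t in zip(action, target))


best = best_of_n(toy_policy, toy_score, n=16)
```

Larger N gives the scorer more candidates to choose from, at the cost of N policy samples per step.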
Two scoring models are supported:
- ContrastiveClassifier: a binary success classifier (BCE + dissimilar-contrastive loss).
- Time2Success: a regression model that fits the discounted return of a trajectory.
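To make the Time2Success regression target concrete, here is a minimal sketch of computing per-step discounted returns for one trajectory. The function name `discounted_returns`, the sparse terminal reward, and the value of `gamma` are assumptions for illustration; the actual target construction in the repo may differ.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Assumed reward layout: zero everywhere except a success reward at the end.
targets = discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5)
```

With a sparse success reward like this, the target shrinks geometrically with the number of steps left until success, so steps closer to success get higher values.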
Both are defined in diffusion_policy/uf_ops/models.py; the guided noise scheduler is tweedie_guided_ddpm.py; the guided env runners and policy wrappers live in diffusion_policy/env_runner/ and diffusion_policy/policy/ as guided_*.
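The classifier-guidance mode can be pictured with the sketch below. It assumes the standard DDPM forward process, recovers the predicted x0 from a noisy sample (the Tweedie estimate), and then nudges it along the gradient of a score. The names `predict_x0`, `guide_x0`, and `toy_score_grad` are hypothetical, as is the additive update rule; the schedulers in tweedie_guided_ddpm.py may scale or apply the gradient differently.

```python
import math
from typing import Callable, List

Vector = List[float]


def predict_x0(x_t: Vector, eps_hat: Vector, alpha_bar_t: float) -> Vector:
    """Estimate the clean sample from x_t and the predicted noise.

    Assumes x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    """
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise_scale * e) / signal_scale for x, e in zip(x_t, eps_hat)]


def guide_x0(
    x0_hat: Vector,
    score_grad: Callable[[Vector], Vector],
    guidance_scale: float,
) -> Vector:
    """Move the predicted x0 one step along the score gradient (assumed rule)."""
    grad = score_grad(x0_hat)
    return [x + guidance_scale * g for x, g in zip(x0_hat, grad)]


# Toy score: -||x - goal||^2, whose gradient is known in closed form.
# With a learned scoring model this gradient would come from autograd.
goal = [1.0, 0.0]


def toy_score_grad(x: Vector) -> Vector:
    return [-2.0 * (xi - gi) for xi, gi in zip(x, goal)]


x0_hat = predict_x0(x_t=[0.6, 0.8], eps_hat=[0.0, 1.0], alpha_bar_t=0.64)
x0_guided = guide_x0(x0_hat, toy_score_grad, guidance_scale=0.25)
```

In this toy case the guided estimate ends up closer to `goal` than the unguided one, which is the effect the guidance scale controls.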
Dependencies are managed with uv. One command builds the env from pyproject.toml + uv.lock:
uv sync
This creates .venv/ with Python 3.9, PyTorch 2.7.1 (cu126 wheels), and the rest of the stack pinned by uv.lock. Run commands with uv run python ... or activate the venv (source .venv/bin/activate).
A few non-PyPI bits are wired in transparently:
- gym 0.21.0 is pulled from third_party/gym-0.21.0/, which is the upstream sdist with two metadata bugs patched (deprecated tests_require, and an invalid opencv-python>=3. specifier).
- robosuite is installed from cheng-chi/robosuite@offline_study, which adds the TwoArmTransport/ToolHang envs needed by these benchmarks.
- mujoco-py (a Cython sdist that no longer compiles) is excluded; we use free-mujoco-py instead, which ships a prebuilt binary under the same mujoco_py import name.
- pytorch3d is built from source (v0.7.9 tag). The first uv sync will take a few minutes for that compile; subsequent syncs reuse the cached wheel.
Robomimic datasets (square/, transport/ in ph/ and mh/ formats) need to be placed at the path each config expects (defaults: /data/bc_uncertainty/robomimic/datasets/...); update original_dataset_path and checkpoint_path in the affected configs if your layout differs.
The full pipeline takes a base diffusion policy checkpoint and produces a guidance model and an evaluation. It has five steps, all driven by Hydra configs under diffusion_policy/uf_ops/configs/.
Standard Diffusion Policy training from the root of the repo:
python train.py --config-dir=. --config-name=train_diffusion_unet_image_workspace \
task=transport_image_abs
Task names: square_lowdim_abs, square_image_abs, transport_lowdim_abs, transport_image_abs. Training produces latest.ckpt under the workspace's output dir.
Roll out the trained policy in the env to gather (obs, action, reward, next_obs, done) transitions and write them to an HDF5 file.
cd diffusion_policy/uf_ops
python update_dataset.py # lowdim, uses configs/make_new_dataset.yaml
python update_image_dataset.py # image, uses configs/make_new_image_dataset.yaml
Override the checkpoint and output paths via Hydra CLI as needed:
python update_dataset.py \
checkpoint_path=/path/to/latest.ckpt \
num_trajectories=3000 \
output_dir=/path/to/output_dir/
Compute obs_mean, obs_std, action_mean, and action_std, which are used for normalization at training and inference time:
python calculate_obs_stats.py \
--config-path=./configs/transport_image --config-name=time2success_BoN
This writes to the dataset_stats_path declared in the chosen config. (Hydra needs --config-path and --config-name as separate flags; the slash form is parsed as a config group.)
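For orientation, the statistics step amounts to a per-dimension mean and standard deviation. The sketch below shows that computation in plain Python. The helper names `column_stats` and `normalize`, the use of population (not sample) standard deviation, and the `eps` floor are assumptions, not details taken from calculate_obs_stats.py.

```python
import math
from typing import List, Sequence, Tuple


def column_stats(
    rows: Sequence[Sequence[float]], eps: float = 1e-6
) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and population std over a batch of vectors."""
    n = len(rows)
    dim = len(rows[0])
    mean = [sum(r[d] for r in rows) / n for d in range(dim)]
    var = [sum((r[d] - mean[d]) ** 2 for r in rows) / n for d in range(dim)]
    # Floor the std so constant dimensions do not cause division by zero.
    std = [max(math.sqrt(v), eps) for v in var]
    return mean, std


def normalize(x: Sequence[float], mean: Sequence[float], std: Sequence[float]) -> List[float]:
    """Standardize one vector with precomputed statistics."""
    return [(xi - m) / s for xi, m, s in zip(x, mean, std)]


# Three toy observations; the second dimension is constant on purpose.
obs = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
obs_mean, obs_std = column_stats(obs)
centered = normalize([2.0, 10.0], obs_mean, obs_std)
```

The same statistics have to be reused at inference time, which is why they are written to disk once instead of being recomputed per run.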
Pick one of the two trainers depending on the scoring model:
# Contrastive binary classifier
python train_contrastive_classifier.py \
--config-path=./configs/transport_image --config-name=contrastive_classifier_cg
# Time2Success reward regression
python train_time2success.py \
--config-path=./configs/transport_image --config-name=time2success_BoN
The trained checkpoint is written to classifier_dir/best_model.pth.
# lowdim
python evaluate_classifier.py \
--config-path=./configs/transport_lowdim --config-name=contrastive_classifier_BoN
# image
python evaluate_image_classifier.py \
--config-path=./configs/transport_image --config-name=time2success_cg
The config's guidance_type field (BoN or cg) selects the strategy, and is_time2success selects between the two scoring models when both make sense.
diffusion_policy/uf_ops/configs/
├── make_new_dataset.yaml # rollout collection (lowdim)
├── make_new_image_dataset.yaml # rollout collection (image)
├── square_lowdim/ # \
├── square_image/ # } per-(task, modality) eval configs
├── transport_lowdim/ # |
└── transport_image/ # /
Each {task}_{modality}/ directory contains four canonical configs:
| File | Scoring model | Guidance |
|---|---|---|
| contrastive_classifier_BoN.yaml | ContrastiveClassifier | Best-of-N |
| contrastive_classifier_cg.yaml | ContrastiveClassifier | classifier guidance |
| time2success_BoN.yaml | Time2Success | Best-of-N |
| time2success_cg.yaml | Time2Success | classifier guidance |
The lowdim subdirs additionally hold ablation configs (mh_ph_ablation_*, ph_mh_ablation_*) for cross-demo-set generalization runs (train on one demo set, evaluate on the other).
| Path | What's there |
|---|---|
| uf_ops/models.py | Time2Success, ContrastiveClassifier + image variants |
| uf_ops/tweedie_guided_ddpm.py | _BaseGuidedDDPMScheduler, Time2SuccessGuidedDPMScheduler, ClassifierGuidedDDPMScheduler |
| uf_ops/guided_classifier_dataset.py | RewardTrajectoryDataset, ContrastiveTrajectoryDataset, make_dataset |
| uf_ops/eval_policy_utils.py, uf_ops/eval_image_policy_utils.py | rollout_policy_BoN, rollout_policy_cg, rollout_policy_collect (and image counterparts) |
| env_runner/guided_robomimic_*_runner_{bon,collect}.py | BoN and data-collection env runners (the cg path reuses the upstream robomimic_*_runner.py) |
| policy/diffusion_unet_{lowdim,image}_policy.py | Upstream policies, edited in place to dispatch to the guided schedulers when one is attached |
Please cite us if you liked the work!
@article{attarian2026update,
title={Update-Free On-Policy Steering via Verifiers},
author={Attarian, Maria and Vyse, Ian and Voelcker, Claas and Gerigk, Jasper and Opryshko, Evgenii and Almasri, Anas and Singh, Sumeet and Du, Yilun and Gilitschenski, Igor},
journal={arXiv preprint arXiv:2603.10282},
year={2026}
}