Teli Ma<sup>1,2</sup>, Jia Zheng<sup>1,2</sup>, Zifan Wang<sup>1,2</sup>, Chunli Jiang<sup>1</sup>, Andy Cui<sup>1</sup>, Junwei Liang<sup>2,3,*</sup>, Shuo Yang<sup>1,*</sup>

<sup>1</sup>Mondo Robotics <sup>2</sup>HKUST(GZ) <sup>3</sup>HKUST <sup>*</sup>Corresponding authors
DiT4DiT is a Vision-Action-Model (VAM) framework that couples a video-generation transformer with flow-matching-based action prediction for generalizable robotic manipulation. It supports both tabletop and whole-body control for manipulation tasks. Notably, DiT4DiT is the first efficient VAM to achieve real-time whole-body control of humanoid robots.
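To make the flow-matching idea concrete: at inference time, actions are produced by integrating a learned velocity field from noise toward the action distribution. The sketch below is a toy illustration only, not the DiT4DiT model — it Euler-integrates the exact velocity field of a straight-line (rectified) flow toward a fixed target; `target_action`, the step count, and `toy_velocity` are made-up stand-ins for the learned components.

```python
def flow_match_sample(velocity_fn, x0, steps=10):
    """Euler-integrate dx/dt = v(x, t) from t=0 to t=1 (flow-matching sampling)."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy "learned" field: the exact velocity of a straight path to a fixed target.
target_action = [0.5, -0.2, 1.0]

def toy_velocity(x, t):
    return [(g - xi) / (1.0 - t) for g, xi in zip(target_action, x)]

sample = flow_match_sample(toy_velocity, x0=[0.0, 0.0, 0.0], steps=10)
print([round(a, 6) for a in sample])  # converges to target_action
```

With the exact straight-line field, each Euler step covers exactly 1/`steps` of the remaining path, so the sample lands on the target; a trained velocity network replaces `toy_velocity` in practice.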
- [2026-04-15] Initial release of DiT4DiT with training, evaluation, and deployment code.
- [2026-03-11] We release the arXiv paper.
Demo tasks (videos omitted): Stack Cups, Drawer Interaction, Pick and Place, Arrange Flower, Move Spoon, Insert Plate, Box Packing, Twist Cap.
- Release teleoperation, training, and deployment code for Unitree G1 tabletop tasks.
- Release teleoperation, training, and deployment code for Unitree G1 whole-body control tasks.
```
DiT4DiT/
├── DiT4DiT/                  # Core package
│   ├── config/               # Configurations
│   │   ├── deepseeds/        # DeepSpeed configs
│   │   ├── robocasa/         # RoboCasa experiment configs
│   │   └── real_robot/       # Real robot configs
│   ├── dataloader/           # Dataset loading (LeRobot)
│   ├── model/                # Model architecture
│   │   ├── framework/        # DiT4DiT framework
│   │   └── modules/          # Backbone & action model
│   └── training/             # Training scripts & utilities
├── deployment/               # WebSocket-based model server
├── docs/                     # Documentation
├── examples/
│   ├── Robocasa_tabletop/    # RoboCasa simulation example
│   │   ├── train_files/      # Training scripts
│   │   └── eval_files/       # Evaluation & simulation
│   └── Real_G1/              # Real Unitree G1 example
│       ├── train_files/      # Training scripts
│       └── eval_files/       # Evaluation
└── requirements.txt
```
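The `deployment/` directory serves the policy behind a WebSocket model server. The wire format is not documented in this README; purely as an illustration, the sketch below round-trips a hypothetical JSON request/response pair — the field names (`type`, `images`, `state`, `actions`) are assumptions, not the actual protocol.

```python
import json

def encode_observation(images, state):
    """Serialize one robot observation into a hypothetical JSON request."""
    return json.dumps({"type": "observation", "images": images, "state": state})

def decode_actions(message):
    """Parse a hypothetical action-chunk response from the server."""
    payload = json.loads(message)
    assert payload["type"] == "actions", "unexpected message type"
    return payload["actions"]

# Round-trip check without a live server.
request = encode_observation(images=["<jpeg-base64>"], state=[0.1, 0.2])
response = json.dumps({"type": "actions", "actions": [[0.0, 0.1], [0.1, 0.2]]})
print(decode_actions(response))  # [[0.0, 0.1], [0.1, 0.2]]
```

Consult the code under `deployment/` for the server's actual message schema and endpoint.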
- Python >= 3.10
- CUDA 12.4+
- 8 or more GPUs recommended for training
```shell
# Clone the repository
git clone https://github.com/Mondo-Robotics/DiT4DiT.git
cd DiT4DiT

# Create conda environment
conda create -n dit4dit python=3.10 -y
conda activate dit4dit

# Install PyTorch (CUDA 12.8 recommended)
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .
```

Download the Cosmos-Predict2.5-2B model from Hugging Face:
```shell
huggingface-cli download nvidia/Cosmos-Predict2.5-2B --revision diffusers/base/post-trained --local-dir /path/to/Cosmos-Predict2.5-2B
```

We release pretrained checkpoints to facilitate reproduction.
| Model | Description | Dataset | Success Rate (%) | Link |
|---|---|---|---|---|
| DiT4DiT-RoboCasa-GR1 | DiT4DiT for RoboCasa-GR1 tabletop tasks | RoboCasa-GR1 | 56.7 | 🤗 Hugging Face |
Note: More checkpoints will be released soon. Stay tuned!
- RoboCasa-GR1 Tabletop: See the full training and evaluation guide here.
Coming soon.
The following results were obtained with the default training parameters described in Configure Training. To demonstrate reproducibility, we report three independent evaluation runs of the same checkpoint; the average success rate exceeds 56% in every run.
| Task | Run 1 (%) | Run 2 (%) | Run 3 (%) |
|---|---|---|---|
| BottleToCabinetClose | 50.0 | 72.0 | 68.0 |
| CanToDrawerClose | 80.0 | 80.0 | 82.0 |
| CupToDrawerClose | 50.0 | 34.0 | 50.0 |
| MilkToMicrowaveClose | 58.0 | 60.0 | 38.0 |
| PotatoToMicrowaveClose | 40.0 | 40.0 | 36.0 |
| WineToCabinetClose | 60.0 | 48.0 | 60.0 |
| FromCuttingboardToBasket | 54.0 | 48.0 | 46.0 |
| FromCuttingboardToCardboardbox | 50.0 | 60.0 | 48.0 |
| FromCuttingboardToPan | 80.0 | 74.0 | 78.0 |
| FromCuttingboardToPot | 52.0 | 46.0 | 66.0 |
| FromCuttingboardToTieredbasket | 44.0 | 54.0 | 50.0 |
| FromPlacematToBasket | 58.0 | 40.0 | 44.0 |
| FromPlacematToBowl | 64.0 | 66.0 | 72.0 |
| FromPlacematToPlate | 66.0 | 62.0 | 64.0 |
| FromPlacematToTieredshelf | 44.0 | 48.0 | 40.0 |
| FromPlateToBowl | 64.0 | 74.0 | 54.0 |
| FromPlateToCardboardbox | 50.0 | 54.0 | 52.0 |
| FromPlateToPan | 58.0 | 68.0 | 70.0 |
| FromPlateToPlate | 62.0 | 64.0 | 72.0 |
| FromTrayToCardboardbox | 52.0 | 50.0 | 60.0 |
| FromTrayToPlate | 64.0 | 64.0 | 58.0 |
| FromTrayToPot | 68.0 | 70.0 | 66.0 |
| FromTrayToTieredbasket | 50.0 | 46.0 | 50.0 |
| FromTrayToTieredshelf | 42.0 | 36.0 | 28.0 |
| Average | 56.7 | 56.6 | 56.3 |
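The per-run averages in the last row can be recomputed directly from the table; the snippet below takes the mean of each run column over the 24 tasks (success rates copied from the table above).

```python
# Success rates (%) per task, copied from the results table: (Run 1, Run 2, Run 3).
runs = [
    (50.0, 72.0, 68.0), (80.0, 80.0, 82.0), (50.0, 34.0, 50.0),
    (58.0, 60.0, 38.0), (40.0, 40.0, 36.0), (60.0, 48.0, 60.0),
    (54.0, 48.0, 46.0), (50.0, 60.0, 48.0), (80.0, 74.0, 78.0),
    (52.0, 46.0, 66.0), (44.0, 54.0, 50.0), (58.0, 40.0, 44.0),
    (64.0, 66.0, 72.0), (66.0, 62.0, 64.0), (44.0, 48.0, 40.0),
    (64.0, 74.0, 54.0), (50.0, 54.0, 52.0), (58.0, 68.0, 70.0),
    (62.0, 64.0, 72.0), (52.0, 50.0, 60.0), (64.0, 64.0, 58.0),
    (68.0, 70.0, 66.0), (50.0, 46.0, 50.0), (42.0, 36.0, 28.0),
]
# Column-wise mean, rounded to one decimal as in the table.
averages = [round(sum(col) / len(col), 1) for col in zip(*runs)]
print(averages)  # [56.7, 56.6, 56.3]
```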
If you find this work useful, please consider citing:

```bibtex
@article{ma2026dit4dit,
  title={DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control},
  author={Ma, Teli and Zheng, Jia and Wang, Zifan and Jiang, Chunli and Cui, Andy and Liang, Junwei and Yang, Shuo},
  journal={arXiv preprint arXiv:2603.10448},
  year={2026}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
This project builds upon:
- StarVLA
- Cosmos-Predict2.5 by NVIDIA
- GR00T by NVIDIA
- Robocasa
- LeRobot by Hugging Face