DiT4DiT

Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control


Teli Ma1,2    Jia Zheng1,2    Zifan Wang1,2    Chunli Jiang1    Andy Cui1    Junwei Liang2,3,*    Shuo Yang1,*

1Mondo Robotics    2HKUST(GZ)    3HKUST    *Corresponding author


DiT4DiT is a Vision-Action-Model (VAM) framework that combines video generation transformers with flow-matching-based action prediction for generalizable robotic manipulation. It supports both tabletop and whole-body control for manipulation tasks. Notably, DiT4DiT is the first efficient VAM to achieve real-time whole-body control of humanoid robots.
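To make "flow-matching-based action prediction" concrete, here is a minimal NumPy sketch of the rectified-flow-style training target, not the repo's actual implementation; the shapes and the linear interpolation schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes (batch, action horizon, action dim) -- illustrative only.
x0 = rng.standard_normal((4, 16, 7))   # noise sample
x1 = rng.standard_normal((4, 16, 7))   # ground-truth action chunk

# Sample one timestep per example and interpolate linearly between
# noise and data, as in rectified-flow-style flow matching.
t = rng.uniform(size=(4, 1, 1))
x_t = (1.0 - t) * x0 + t * x1

# A network v(x_t, t, observation) would be trained with an MSE loss
# toward the constant velocity target x1 - x0; at inference, actions
# are produced by integrating v from t=0 to t=1.
velocity_target = x1 - x0
```

The appeal of this formulation for control is that the velocity target is constant along each interpolation path, so a few Euler steps at inference can suffice for real-time action generation.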

News

  • [2026-04-15] Initial release of DiT4DiT with training, evaluation, and deployment code.
  • [2026-03-11] We release the arXiv paper.

Whole-Body Control (all 1x speed)

  • Shelf Organization

Tabletop Manipulation (all 1x speed)

  • Stack Cups
  • Drawer Interaction
  • Pick and Place
  • Arrange Flower
  • Move Spoon
  • Insert Plate
  • Box Packing
  • Twist Cap

TODOs

  • Release teleoperation, training and deployment code for Unitree G1 tabletop tasks.
  • Release teleoperation, training and deployment code for Unitree G1 whole-body control tasks.

Project Structure

DiT4DiT/
├── DiT4DiT/                    # Core package
│   ├── config/                 # Configurations
│   │   ├── deepseeds/          # DeepSpeed configs
│   │   ├── robocasa/           # RoboCasa experiment configs
│   │   └── real_robot/         # Real robot configs
│   ├── dataloader/             # Dataset loading (LeRobot)
│   ├── model/                  # Model architecture
│   │   ├── framework/          # DiT4DiT framework
│   │   └── modules/            # Backbone & action model
│   └── training/               # Training scripts & utilities
├── deployment/                 # WebSocket-based model server
├── docs/                       # Documentation
├── examples/
│   ├── Robocasa_tabletop/      # RoboCasa simulation example
│   │   ├── train_files/        # Training scripts
│   │   └── eval_files/         # Evaluation & simulation
│   └── Real_G1/                # Real Unitree G1 example
│       ├── train_files/        # Training scripts
│       └── eval_files/         # Evaluation
└── requirements.txt
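The deployment/ directory is described as a WebSocket-based model server. As a hedged sketch of what a client-side message codec for such a server might look like (the JSON-over-WebSocket framing and the field names "image", "state", "shape", "dtype", and "actions" are assumptions, not the repo's actual wire protocol):

```python
import base64
import json

import numpy as np


def encode_observation(image, state):
    """Pack an observation into a JSON message for a model server.

    Field names and base64 framing are illustrative assumptions,
    not the repo's actual protocol.
    """
    return json.dumps({
        "image": base64.b64encode(image.tobytes()).decode("ascii"),
        "shape": list(image.shape),
        "dtype": str(image.dtype),
        "state": state.tolist(),
    })


def decode_actions(message):
    """Unpack an action chunk returned by the server (assumed schema)."""
    payload = json.loads(message)
    return np.asarray(payload["actions"], dtype=np.float32)
```

Sending raw tensors as base64 keeps the message self-describing (shape and dtype travel with the bytes), which is convenient when the robot client and the GPU server run different stacks.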

Installation

Prerequisites

  • Python >= 3.10
  • CUDA 12.4+
  • 8+ GPUs recommended for training

Setup

# Clone the repository
git clone https://github.com/Mondo-Robotics/DiT4DiT.git
cd DiT4DiT

# Create conda environment
conda create -n dit4dit python=3.10 -y
conda activate dit4dit

# Install PyTorch (CUDA 12.8 recommended)
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .

Download Pretrained Backbone

Download the Cosmos-Predict2.5-2B model from Hugging Face:

huggingface-cli download nvidia/Cosmos-Predict2.5-2B --revision diffusers/base/post-trained --local-dir /path/to/Cosmos-Predict2.5-2B
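After downloading, it can help to sanity-check that the expected files landed in the target directory. A minimal sketch; the default required file name `config.json` is a guess at the diffusers-format layout, so adjust it to what the download actually produces.

```python
from pathlib import Path


def verify_checkpoint_dir(ckpt_dir, required=("config.json",)):
    """Return the list of required files missing from ckpt_dir.

    The default `required` tuple is an assumption about the
    diffusers-format layout, not a guarantee.
    """
    ckpt = Path(ckpt_dir)
    return [name for name in required if not (ckpt / name).exists()]
```

Usage: `missing = verify_checkpoint_dir("/path/to/Cosmos-Predict2.5-2B")`; an empty list means all listed files are present.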

Model Zoo

We release pretrained checkpoints to facilitate reproduction.

Available Checkpoints

Model                  Description                               Dataset        Success Rate (%)   Link
DiT4DiT-RoboCasa-GR1   DiT4DiT for RoboCasa-GR1 tabletop tasks   RoboCasa-GR1   56.7               🤗 Hugging Face

Note: More checkpoints will be released soon. Stay tuned!

Quick Start

Simulation

  • RoboCasa-GR1 Tabletop: See the full training and evaluation guide here.

Real Robot

Coming soon.

Results

RoboCasa-GR1 Benchmark

The following results are obtained using the default training parameters described in Configure Training. We report three independent evaluation runs of the same checkpoint to demonstrate reproducibility. The model consistently achieves an average success rate above 56% across all runs.

Task Run 1 Run 2 Run 3
BottleToCabinetClose 50.0 72.0 68.0
CanToDrawerClose 80.0 80.0 82.0
CupToDrawerClose 50.0 34.0 50.0
MilkToMicrowaveClose 58.0 60.0 38.0
PotatoToMicrowaveClose 40.0 40.0 36.0
WineToCabinetClose 60.0 48.0 60.0
FromCuttingboardToBasket 54.0 48.0 46.0
FromCuttingboardToCardboardbox 50.0 60.0 48.0
FromCuttingboardToPan 80.0 74.0 78.0
FromCuttingboardToPot 52.0 46.0 66.0
FromCuttingboardToTieredbasket 44.0 54.0 50.0
FromPlacematToBasket 58.0 40.0 44.0
FromPlacematToBowl 64.0 66.0 72.0
FromPlacematToPlate 66.0 62.0 64.0
FromPlacematToTieredshelf 44.0 48.0 40.0
FromPlateToBowl 64.0 74.0 54.0
FromPlateToCardboardbox 50.0 54.0 52.0
FromPlateToPan 58.0 68.0 70.0
FromPlateToPlate 62.0 64.0 72.0
FromTrayToCardboardbox 52.0 50.0 60.0
FromTrayToPlate 64.0 64.0 58.0
FromTrayToPot 68.0 70.0 66.0
FromTrayToTieredbasket 50.0 46.0 50.0
FromTrayToTieredshelf 42.0 36.0 28.0
Average 56.7 56.6 56.3
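The reported averages can be reproduced directly from the per-task rates, as a quick consistency check:

```python
# Per-task success rates (%) copied from the table above, in row order.
runs = {
    "Run 1": [50, 80, 50, 58, 40, 60, 54, 50, 80, 52, 44, 58,
              64, 66, 44, 64, 50, 58, 62, 52, 64, 68, 50, 42],
    "Run 2": [72, 80, 34, 60, 40, 48, 48, 60, 74, 46, 54, 40,
              66, 62, 48, 74, 54, 68, 64, 50, 64, 70, 46, 36],
    "Run 3": [68, 82, 50, 38, 36, 60, 46, 48, 78, 66, 50, 44,
              72, 64, 40, 54, 52, 70, 72, 60, 58, 66, 50, 28],
}

# Each run averages 24 tasks; rounding to one decimal matches the table.
averages = {name: round(sum(v) / len(v), 1) for name, v in runs.items()}
# averages == {"Run 1": 56.7, "Run 2": 56.6, "Run 3": 56.3}
```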

Citation

If you find this work useful, please consider citing:

@article{ma2026dit4dit,
  title={DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control},
  author={Ma, Teli and Zheng, Jia and Wang, Zifan and Jiang, Chunli and Cui, Andy and Liang, Junwei and Yang, Shuo},
  journal={arXiv preprint arXiv:2603.10448},
  year={2026}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

This project builds upon:

  • Cosmos-Predict2.5 (pretrained video generation backbone)
  • LeRobot (dataset format and loading)
