Teli Ma<sup>1,2</sup>, Jia Zheng<sup>1,2</sup>, Zifan Wang<sup>1,2</sup>, Chunli Jiang<sup>1</sup>, Andy Cui<sup>1</sup>, Junwei Liang<sup>2,3,*</sup>, Shuo Yang<sup>1,*</sup>

<sup>1</sup>Mondo Robotics <sup>2</sup>HKUST(GZ) <sup>3</sup>HKUST <sup>*</sup>Corresponding authors
DiT4DiT is a Vision-Action-Model (VAM) framework that couples a video-generation transformer with flow-matching-based action prediction for generalizable robotic manipulation. It supports both tabletop and whole-body control for manipulation tasks. Notably, DiT4DiT is the first efficient VAM to achieve real-time whole-body control of humanoid robots.
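To make the flow-matching idea concrete: at inference time, actions are produced by integrating a learned velocity field from noise toward the action distribution. The sketch below is a toy illustration only, not the DiT4DiT model — it Euler-integrates the exact velocity field of a straight-line (rectified) flow toward a fixed target; `target_action`, the step count, and `toy_velocity` are made-up stand-ins for the learned components.

```python
def flow_match_sample(velocity_fn, x0, steps=10):
    """Euler-integrate dx/dt = v(x, t) from t=0 to t=1 (flow-matching sampling)."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy "learned" field: the exact velocity of a straight path to a fixed target.
target_action = [0.5, -0.2, 1.0]

def toy_velocity(x, t):
    return [(g - xi) / (1.0 - t) for g, xi in zip(target_action, x)]

sample = flow_match_sample(toy_velocity, x0=[0.0, 0.0, 0.0], steps=10)
print([round(a, 6) for a in sample])  # converges to target_action
```

With the exact straight-line field, each Euler step covers exactly 1/`steps` of the remaining path, so the sample lands on the target; a trained velocity network replaces `toy_velocity` in practice.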
- [2026-04-15] Initial release of DiT4DiT with training, evaluation, and deployment code.
- [2026-03-11] We release the arXiv paper.
Demo tasks (videos omitted): Stack Cups, Drawer Interaction, Pick and Place, Arrange Flower, Move Spoon, Insert Plate, Box Packing, Twist Cap.
- Release teleoperation, training, and deployment code for Unitree G1 tabletop tasks.
- Release teleoperation, training, and deployment code for Unitree G1 whole-body control tasks.
```
DiT4DiT/
├── DiT4DiT/                  # Core package
│   ├── config/               # Configurations
│   │   ├── deepseeds/        # DeepSpeed configs
│   │   ├── robocasa/         # RoboCasa experiment configs
│   │   └── real_robot/       # Real robot configs
│   ├── dataloader/           # Dataset loading (LeRobot)
│   ├── model/                # Model architecture
│   │   ├── framework/        # DiT4DiT framework
│   │   └── modules/          # Backbone & action model
│   └── training/             # Training scripts & utilities
├── deployment/               # WebSocket-based model server
├── docs/                     # Documentation
├── examples/
│   ├── Robocasa_tabletop/    # RoboCasa simulation example
│   │   ├── train_files/      # Training scripts
│   │   └── eval_files/       # Evaluation & simulation
│   └── Real_G1/              # Real Unitree G1 example
│       ├── train_files/      # Training scripts
│       └── eval_files/       # Evaluation
└── requirements.txt
```
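The `deployment/` directory serves the policy behind a WebSocket model server. The wire format is not documented in this README; purely as an illustration, the sketch below round-trips a hypothetical JSON request/response pair — the field names (`type`, `images`, `state`, `actions`) are assumptions, not the actual protocol.

```python
import json

def encode_observation(images, state):
    """Serialize one robot observation into a hypothetical JSON request."""
    return json.dumps({"type": "observation", "images": images, "state": state})

def decode_actions(message):
    """Parse a hypothetical action-chunk response from the server."""
    payload = json.loads(message)
    assert payload["type"] == "actions", "unexpected message type"
    return payload["actions"]

# Round-trip check without a live server.
request = encode_observation(images=["<jpeg-base64>"], state=[0.1, 0.2])
response = json.dumps({"type": "actions", "actions": [[0.0, 0.1], [0.1, 0.2]]})
print(decode_actions(response))  # [[0.0, 0.1], [0.1, 0.2]]
```

Consult the code under `deployment/` for the server's actual message schema and endpoint.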
- Python >= 3.10
- CUDA 12.4+
- 8 or more GPUs recommended for training
```shell
# Clone the repository
git clone https://github.com/Mondo-Robotics/DiT4DiT.git
cd DiT4DiT

# Create conda environment
conda create -n dit4dit python=3.10 -y
conda activate dit4dit

# Install PyTorch (CUDA 12.8 recommended)
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .
```

Download the Cosmos-Predict2.5-2B model from Hugging Face:
```shell
huggingface-cli download nvidia/Cosmos-Predict2.5-2B --revision diffusers/base/post-trained --local-dir /path/to/Cosmos-Predict2.5-2B
```

We release pretrained checkpoints to facilitate reproduction.
| Model | Description | Dataset | Success Rate (%) | Link |
|---|---|---|---|---|
| DiT4DiT-RoboCasa-GR1 | DiT4DiT for RoboCasa-GR1 tabletop tasks | RoboCasa-GR1 | 56.7 | 🤗 Hugging Face |
Note: More checkpoints will be released soon. Stay tuned!
- RoboCasa-GR1 Tabletop: See the full training and evaluation guide here.
Coming soon.
The following results were obtained with the default training parameters described in Configure Training. To demonstrate reproducibility, we report three independent evaluation runs of the same checkpoint; the average success rate exceeds 56% in every run.
| Task | Run 1 (%) | Run 2 (%) | Run 3 (%) |
|---|---|---|---|
| BottleToCabinetClose | 50.0 | 72.0 | 68.0 |
| CanToDrawerClose | 80.0 | 80.0 | 82.0 |
| CupToDrawerClose | 50.0 | 34.0 | 50.0 |
| MilkToMicrowaveClose | 58.0 | 60.0 | 38.0 |
| PotatoToMicrowaveClose | 40.0 | 40.0 | 36.0 |
| WineToCabinetClose | 60.0 | 48.0 | 60.0 |
| FromCuttingboardToBasket | 54.0 | 48.0 | 46.0 |
| FromCuttingboardToCardboardbox | 50.0 | 60.0 | 48.0 |
| FromCuttingboardToPan | 80.0 | 74.0 | 78.0 |
| FromCuttingboardToPot | 52.0 | 46.0 | 66.0 |
| FromCuttingboardToTieredbasket | 44.0 | 54.0 | 50.0 |
| FromPlacematToBasket | 58.0 | 40.0 | 44.0 |
| FromPlacematToBowl | 64.0 | 66.0 | 72.0 |
| FromPlacematToPlate | 66.0 | 62.0 | 64.0 |
| FromPlacematToTieredshelf | 44.0 | 48.0 | 40.0 |
| FromPlateToBowl | 64.0 | 74.0 | 54.0 |
| FromPlateToCardboardbox | 50.0 | 54.0 | 52.0 |
| FromPlateToPan | 58.0 | 68.0 | 70.0 |
| FromPlateToPlate | 62.0 | 64.0 | 72.0 |
| FromTrayToCardboardbox | 52.0 | 50.0 | 60.0 |
| FromTrayToPlate | 64.0 | 64.0 | 58.0 |
| FromTrayToPot | 68.0 | 70.0 | 66.0 |
| FromTrayToTieredbasket | 50.0 | 46.0 | 50.0 |
| FromTrayToTieredshelf | 42.0 | 36.0 | 28.0 |
| Average | 56.7 | 56.6 | 56.3 |
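The per-run averages in the last row can be recomputed directly from the table; the snippet below takes the mean of each run column over the 24 tasks (success rates copied from the table above).

```python
# Success rates (%) per task, copied from the results table: (Run 1, Run 2, Run 3).
runs = [
    (50.0, 72.0, 68.0), (80.0, 80.0, 82.0), (50.0, 34.0, 50.0),
    (58.0, 60.0, 38.0), (40.0, 40.0, 36.0), (60.0, 48.0, 60.0),
    (54.0, 48.0, 46.0), (50.0, 60.0, 48.0), (80.0, 74.0, 78.0),
    (52.0, 46.0, 66.0), (44.0, 54.0, 50.0), (58.0, 40.0, 44.0),
    (64.0, 66.0, 72.0), (66.0, 62.0, 64.0), (44.0, 48.0, 40.0),
    (64.0, 74.0, 54.0), (50.0, 54.0, 52.0), (58.0, 68.0, 70.0),
    (62.0, 64.0, 72.0), (52.0, 50.0, 60.0), (64.0, 64.0, 58.0),
    (68.0, 70.0, 66.0), (50.0, 46.0, 50.0), (42.0, 36.0, 28.0),
]
# Column-wise mean, rounded to one decimal as in the table.
averages = [round(sum(col) / len(col), 1) for col in zip(*runs)]
print(averages)  # [56.7, 56.6, 56.3]
```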
If you find this work useful, please consider citing:

```bibtex
@article{ma2026dit4dit,
  title={DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control},
  author={Ma, Teli and Zheng, Jia and Wang, Zifan and Jiang, Chunli and Cui, Andy and Liang, Junwei and Yang, Shuo},
  journal={arXiv preprint arXiv:2603.10448},
  year={2026}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
This project builds upon:
- StarVLA
- Cosmos-Predict2.5 by NVIDIA
- GR00T by NVIDIA
- Robocasa
- LeRobot by Hugging Face