- [2025.11] MVT-1.5 (RICE) accepted to ICCV 2025! [Paper] [Model]
- [2024.07] MVT-1.1 (MLCD) accepted to ECCV 2024! [Code]
- [2023.01] MVT-1.0 (UNICOM) accepted to ICLR 2023! [Code]
The Margin-based Vision Transformer (MVT) series is a family of state-of-the-art vision encoders designed for universal visual representation learning. The latest version, RICE (Region-based Cluster Discrimination), advances visual understanding by processing diverse semantic regions within an image in a single forward pass.
RICE introduces a novel approach to visual representation learning that jointly captures:
- General visual semantics (objects, scenes)
- OCR semantics (text within images)
- Unified representations seamlessly integrating both modalities
This enables superior performance across multiple vision tasks including image retrieval, visual question answering, and multimodal understanding.
Figure 1: RICE architecture efficiently processes diverse semantic regions within images using region-based cluster discrimination.
RICE demonstrates state-of-the-art performance across multiple vision benchmarks. Using the LLaVA-NeXT framework with a high-resolution tiling strategy (2×2+1 grid), RICE achieves superior results compared to existing vision encoders.
Table 1: Comprehensive performance comparison of RICE with state-of-the-art vision encoders. Each input image is divided into a 2×2+1 grid of crops matching the pre-training resolution (e.g., 336px, 378px, or 560px).
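As a rough illustration of the tiling strategy, the sketch below splits an image into a 2×2 grid of local crops plus one global view at the encoder resolution. This is a simplified approximation of the idea, not the exact LLaVA-NeXT preprocessing, and the 560px default is merely one of the pre-training resolutions listed above.

```python
from PIL import Image

def tile_2x2_plus_1(image: Image.Image, crop_size: int = 560) -> list[Image.Image]:
    """Simplified 2x2+1 tiling: four local crops plus one global view.

    Illustrative sketch of the high-resolution tiling idea, not the exact
    LLaVA-NeXT preprocessing; crop_size should match the encoder resolution.
    """
    # Global view: the full image resized to the encoder's input resolution.
    global_view = image.resize((crop_size, crop_size))

    # Local views: resize onto a 2x2 canvas, then cut it into four crops.
    canvas = image.resize((2 * crop_size, 2 * crop_size))
    local_views = [
        canvas.crop((x * crop_size, y * crop_size,
                     (x + 1) * crop_size, (y + 1) * crop_size))
        for y in range(2) for x in range(2)
    ]
    # Five crops in total; each one is encoded independently by the vision tower.
    return local_views + [global_view]
```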
# Install dependencies
# pip install torch transformers
# git clone https://github.com/deepglint/unicom
# cd unicom/mlcd
from vit_rope2d_hf import MLCDVisionModel
from transformers import CLIPImageProcessor
from PIL import Image
import requests
import torch
# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = CLIPImageProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
# Load and process an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
# Extract visual features
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state
print(f"Extracted features shape: {features.shape}")# pip install torch transformers>=4.51.3
from transformers import AutoProcessor, AutoModel
from PIL import Image
import requests
import torch
# Load model and processor
model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
# Load and process an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
# Extract visual features
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state[0]
print(f"Extracted features shape: {features.shape}")RICE maintains stable semantic focus across sequential frames, demonstrating robust visual understanding and tracking capabilities.
Figure 2: Semantic feature visualization using 2048-resolution images as input to ViT-B/16. Token features are projected onto RGB channels via PCA. Sequential frames (arranged vertically) show consistent attention on salient objects (ice skaters, deer, motorcyclists, cyclists), with stable color patterns maintained throughout the sequence.
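The projection described in the caption can be reproduced in a few lines. The sketch below is an illustrative reimplementation, not the authors' plotting code: it maps per-patch token features onto RGB channels via a 3-component PCA (for a 2048px input to a ViT-B/16, the patch grid is 128×128).

```python
import torch

def tokens_to_rgb(patch_tokens: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """Project patch-token features onto three PCA components and read them as RGB.

    patch_tokens: (num_patches, hidden_dim) features for one image, with any
    class token already removed. Returns a (grid_h, grid_w, 3) tensor in [0, 1].
    Illustrative reimplementation of the visualization, not the authors' code.
    """
    # Centre the features and take the top-3 principal directions.
    centered = patch_tokens - patch_tokens.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(centered, q=3)
    projected = centered @ v                     # (num_patches, 3)

    # Min-max normalize each component so it can be displayed as a color channel.
    mins = projected.min(dim=0).values
    maxs = projected.max(dim=0).values
    rgb = (projected - mins) / (maxs - mins + 1e-6)

    return rgb.reshape(grid_h, grid_w, 3)
```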
All models are available on Hugging Face for easy integration.
| Model | Resolution | Patch Size | Download |
|---|---|---|---|
| RICE-ViT-L-14 | 560px | 14 | 🤗 HuggingFace |
| MLCD-ViT-bigG-14 | 448px | 14 | 🤗 HuggingFace |
| MLCD-ViT-L-14 | 336px | 14 | 🤗 HuggingFace |
| MLCD-ViT-B-32 | 224px | 32 | 🤗 HuggingFace |
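For retrieval-style comparisons with any of the checkpoints above, the token features can be pooled into a single image embedding and scored with cosine similarity. The sketch below is only an illustration: mean pooling is a simple stand-in rather than the pooling used in the papers, and query.jpg / candidate.jpg are placeholder file names.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

def embed(image: Image.Image) -> torch.Tensor:
    """Return one L2-normalized embedding per image (mean pooling is illustrative)."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        tokens = model(**inputs).last_hidden_state[0]   # (num_tokens, hidden_dim)
    return F.normalize(tokens.mean(dim=0), dim=-1)

# Cosine similarity between two images (higher means more similar);
# the file names are placeholders for your own images.
score = embed(Image.open("query.jpg")) @ embed(Image.open("candidate.jpg"))
print(f"cosine similarity: {score.item():.3f}")
```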
- MVT-1.1 (MLCD): github.com/deepglint/unicom
- MVT-1.0 (UNICOM): github.com/deepglint/unicom
If you find this work useful, please cite our papers:
@inproceedings{yinxie_2025_rice,
title={Region-based Cluster Discrimination for Visual Representation Learning},
author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
booktitle={ICCV},
year={2025}
}

@inproceedings{anxiang_2024_mlcd,
title={Multi-label Cluster Discrimination for Visual Representation Learning},
author={An, Xiang and Yang, Kaicheng and Dai, Xiangzi and Feng, Ziyong and Deng, Jiankang},
booktitle={ECCV},
year={2024}
}

@inproceedings{anxiang_2023_unicom,
title={Unicom: Universal and Compact Representation Learning for Image Retrieval},
author={An, Xiang and Deng, Jiankang and Yang, Kaicheng and Li, Jiawei and Feng, Ziyong and Guo, Jia and Yang, Jing and Liu, Tongliang},
booktitle={ICLR},
year={2023}
}

@inproceedings{anxiang_2022_partialfc,
author={An, Xiang and Deng, Jiankang and Guo, Jia and Feng, Ziyong and Zhu, XuHan and Yang, Jing and Liu, Tongliang},
title={Killing Two Birds With One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC},
booktitle={CVPR},
year={2022}
}
@inproceedings{deng_2019_arcface,
title={Arcface: Additive angular margin loss for deep face recognition},
author={Deng, Jiankang and Guo, Jia and Xue, Niannan and Zafeiriou, Stefanos},
booktitle={CVPR},
year={2019}
}

