Margin-based Vision Transformer (MVT)



📰 News

  • [2025.11] 🎉 MVT-1.5 (RICE) accepted to ICCV 2025! [Paper] [Model]
  • [2024.07] 🎉 MVT-1.1 (MLCD) accepted to ECCV 2024! [Code]
  • [2023.01] 🎉 MVT-1.0 (UNICOM) accepted to ICLR 2023! [Code]

🔬 Introduction

The Margin-based Vision Transformer (MVT) series is a family of state-of-the-art vision encoders for universal visual representation learning. The latest version, RICE (Region-based Cluster Discrimination), advances visual understanding by processing the diverse semantic regions within an image in a single forward pass.

MVT-1.5: RICE (ICCV 2025)

RICE introduces a novel approach to visual representation learning that jointly captures:

  • General visual semantics (objects, scenes)
  • OCR semantics (text within images)
  • Unified representations seamlessly integrating both modalities

This enables superior performance across multiple vision tasks, including image retrieval, visual question answering, and multimodal understanding.
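
To make the region-level idea concrete, below is a minimal sketch of how the patch tokens from one forward pass could be mean-pooled over a bounding box to produce a region embedding for an object or an OCR text line. The pool_region helper and the example boxes are illustrative assumptions, not a released API; in practice the tokens come from last_hidden_state (see Usage below).

# Hedged sketch: one forward pass yields a token grid that can be pooled over
# arbitrary regions (object boxes, OCR text lines). pool_region and the boxes
# below are illustrative assumptions, not part of the released API.
import torch

def pool_region(patch_tokens: torch.Tensor, box, grid: int = 40) -> torch.Tensor:
    """Mean-pool patch tokens inside a normalized (x0, y0, x1, y1) box.

    patch_tokens: [grid*grid, dim] token grid with any class token removed
    (a 560px input with 14px patches gives a 40x40 grid).
    """
    x0, y0, x1, y1 = (round(v * grid) for v in box)
    tokens = patch_tokens.view(grid, grid, -1)[y0:y1, x0:x1]
    return tokens.reshape(-1, tokens.shape[-1]).mean(dim=0)

patch_tokens = torch.randn(40 * 40, 1024)                         # stand-in features
object_emb = pool_region(patch_tokens, (0.10, 0.20, 0.50, 0.80))  # an object box
ocr_emb = pool_region(patch_tokens, (0.60, 0.70, 0.95, 0.80))     # an OCR-line box
print(object_emb.shape, ocr_emb.shape)                            # torch.Size([1024]) each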

RICE Highlights

Figure 1: RICE architecture efficiently processes diverse semantic regions within images using region-based cluster discrimination.


📊 Experiments

RICE demonstrates state-of-the-art performance across multiple vision benchmarks. Using the LLaVA-NeXT framework with a high-resolution tiling strategy (2×2+1 grid), RICE achieves superior results compared to existing vision encoders.

Experimental Results

Table 1: Comprehensive performance comparison of RICE with state-of-the-art vision encoders. Each input image is divided into a 2×2+1 grid of crops matching the pre-training resolution (e.g., 336px, 378px, or 560px).
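
As a concrete illustration, here is a minimal sketch of one way to build the 2×2+1 crops: a 2×2 grid of local tiles plus one global resized view, each at the encoder's pre-training resolution. The helper name and the plain square resize are assumptions; the exact LLaVA-NeXT preprocessing (e.g., aspect-ratio handling and padding) may differ.

# Hedged sketch of a 2x2+1 high-resolution tiling: four local crops from a
# 2x2 grid plus one global resized view. Exact preprocessing may differ from
# the paper's LLaVA-NeXT pipeline.
from PIL import Image

def tile_2x2_plus_1(image: Image.Image, size: int = 560):
    """Return [global, top-left, top-right, bottom-left, bottom-right] crops."""
    tiles = [image.resize((size, size))]      # global view at encoder resolution
    big = image.resize((2 * size, 2 * size))  # upscale, then cut into a 2x2 grid
    for row in range(2):
        for col in range(2):
            left, top = col * size, row * size
            tiles.append(big.crop((left, top, left + size, top + size)))
    return tiles

crops = tile_2x2_plus_1(Image.open("example.jpg"), size=560)  # any local image
print([c.size for c in crops])                                # five 560x560 crops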


🚀 Usage

Standard Usage

# Install dependencies
# pip install torch transformers
# git clone https://github.com/deepglint/unicom
# cd unicom/mlcd

from vit_rope2d_hf import MLCDVisionModel
from transformers import CLIPImageProcessor
from PIL import Image
import requests
import torch

# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = CLIPImageProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

# Load and process an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Extract visual features
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state  # [batch_size, num_tokens, hidden_dim]

print(f"Extracted features shape: {features.shape}")

Using Hugging Face Transformers (≥4.51.3)

# pip install torch "transformers>=4.51.3"

from transformers import AutoProcessor, AutoModel
from PIL import Image
import requests
import torch

# Load model and processor
model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")

# Load and process an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Extract visual features
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state[0]  # tokens for the first (only) image in the batch

print(f"Extracted features shape: {features.shape}")
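
To turn the token features above into a single embedding per image (e.g., for retrieval), one simple option is to mean-pool the tokens and L2-normalize. This pooling choice is an illustrative assumption, not necessarily what the papers' evaluation pipelines use.

# Hedged follow-up: mean-pool tokens into one L2-normalized image embedding.
# The pooling choice is an illustrative assumption.
import torch
import torch.nn.functional as F

def embed(model, processor, image):
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        tokens = model(**inputs).last_hidden_state[0]  # [num_tokens, dim]
    return F.normalize(tokens.mean(dim=0), dim=-1)     # unit-length embedding

# With model/processor loaded as above, cosine similarity is a dot product:
# sim = embed(model, processor, img_a) @ embed(model, processor, img_b)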

🎨 Visualization

RICE maintains stable semantic focus across sequential frames, demonstrating robust visual understanding and tracking capabilities.

Qualitative Results

Figure 2: Semantic feature visualization using 2048-resolution images as input to ViT-B/16. Token features are projected onto RGB channels via PCA. Sequential frames (arranged vertically) show consistent attention on salient objects (ice skaters, deer, motorcyclists, cyclists), with stable color patterns maintained throughout each sequence.
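
For reference, a PCA-to-RGB projection like the one described in Figure 2 can be sketched as follows; the grid size, the SVD-based PCA, and the min-max normalization are illustrative assumptions rather than the exact visualization code.

# Hedged sketch of PCA-to-RGB feature visualization: project each patch token
# onto the top three principal components and render them as RGB channels.
import torch

def pca_rgb(patch_tokens: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """Map [grid_h*grid_w, dim] patch tokens to a [grid_h, grid_w, 3] RGB image."""
    x = patch_tokens - patch_tokens.mean(dim=0, keepdim=True)   # center tokens
    _, _, vh = torch.linalg.svd(x, full_matrices=False)         # principal axes
    rgb = x @ vh[:3].T                                          # [N, 3] projections
    rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-6)
    return rgb.view(grid_h, grid_w, 3)                          # values in [0, 1]

# e.g., ViT-B/16 on a 2048px image: grid = 2048 // 16 = 128 patches per side
tokens = torch.randn(128 * 128, 768)  # stand-in for real token features
image = pca_rgb(tokens, 128, 128)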


📦 Model Zoo

All models are available on Hugging Face for easy integration.

Model             Resolution  Patch Size  Download
RICE-ViT-L-14     560px       14          🤗 HuggingFace
MLCD-ViT-bigG-14  448px       14          🤗 HuggingFace
MLCD-ViT-L-14     336px       14          🤗 HuggingFace
MLCD-ViT-B-32     224px       32          🤗 HuggingFace

Related Repositories

  • unicom: https://github.com/deepglint/unicom


πŸ“ Citation

If you find this work useful, please cite our papers:

RICE (ICCV 2025)

@inproceedings{yinxie_2025_rice,
  title={Region-based Cluster Discrimination for Visual Representation Learning},
  author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
  booktitle={ICCV},
  year={2025}
}

MLCD (ECCV 2024)

@inproceedings{anxiang_2024_mlcd,
  title={Multi-label Cluster Discrimination for Visual Representation Learning},
  author={An, Xiang and Yang, Kaicheng and Dai, Xiangzi and Feng, Ziyong and Deng, Jiankang},
  booktitle={ECCV},
  year={2024}
}

UNICOM (ICLR 2023)

@inproceedings{anxiang_2023_unicom,
  title={Unicom: Universal and Compact Representation Learning for Image Retrieval},
  author={An, Xiang and Deng, Jiankang and Yang, Kaicheng and Li, Jiawei and Feng, Ziyong and Guo, Jia and Yang, Jing and Liu, Tongliang},
  booktitle={ICLR},
  year={2023}
}

Related Work

@inproceedings{anxiang_2022_partialfc,
  author={An, Xiang and Deng, Jiankang and Guo, Jia and Feng, Ziyong and Zhu, XuHan and Yang, Jing and Liu, Tongliang},
  title={Killing Two Birds With One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC},
  booktitle={CVPR},
  year={2022}
}

@inproceedings{deng_2019_arcface,
  title={Arcface: Additive angular margin loss for deep face recognition},
  author={Deng, Jiankang and Guo, Jia and Xue, Niannan and Zafeiriou, Stefanos},
  booktitle={CVPR},
  year={2019}
}
