Guillermo Pinto, Julián León, Brayan Quintero, Dana Villamizar and Hoover Rueda-Chacón
Research Group: Hands-on Computer Vision
Abstract: We propose a multimodal vision-language transformer designed to classify breast cancer from a single thermal image, irrespective of viewing angle, by integrating thermal features with structured clinical metadata. Our framework generates descriptive text prompts from patient-specific variables, which are embedded and jointly fused with image tokens via cross-attention blocks. Extensive experiments on the DMR-IR and Breast Thermography datasets show that our method achieves 97.5% accuracy on DMR-IR and 83.7% on Breast Thermography, surpassing state-of-the-art multimodal and visual-only baselines in sensitivity. By incorporating clinical cues, our approach offers a flexible, robust, and interpretable solution for early breast cancer detection with potential for clinical translation in diverse healthcare scenarios.
Our experiments utilize two thermal imaging datasets for comprehensive evaluation:
The DMR-IR (Database for Mastology Research with Infrared Image) dataset contains thermal images for breast cancer detection. It includes structured clinical metadata (patient age, symptoms, medical history), which is converted into descriptive text prompts for multimodal fusion. The original dataset was obtained from the Departamento de Ciência da Computação at Universidade Federal Fluminense, and our version of the dataset can be found on Hugging Face here (this work uses revision 69ffd6240b4a50bc4a05c59b70773f3a506054f2 of the dataset).
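As a minimal sketch of how such descriptive prompts could be built from the clinical fields, assuming hypothetical field names (`age`, `complaints`, `family_history`) that may differ from the dataset's actual schema:

```python
# Minimal sketch: convert structured clinical metadata into a descriptive text prompt.
# The field names ("age", "complaints", "family_history") are illustrative assumptions.
def build_prompt(record: dict) -> str:
    parts = []
    if record.get("age") is not None:
        parts.append(f"Patient is {record['age']} years old.")
    if record.get("complaints"):
        parts.append(f"Reported symptoms: {record['complaints']}.")
    if record.get("family_history"):
        parts.append(f"Family history: {record['family_history']}.")
    return " ".join(parts)

print(build_prompt({"age": 54, "complaints": "pain in the left breast"}))
# -> Patient is 54 years old. Reported symptoms: pain in the left breast.
```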
The Breast Thermography dataset contains thermography images of the female thorax region together with the results of pathology studies. The images were taken in a doctor's office at the Hospital San Juan de Dios - Sede Cali. The original dataset is available here, and our version of the dataset can be found on Hugging Face here (this work uses revision 6a84021f2a5b253d0da72f7948de93613fd9a788 of the dataset).
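For example, our DMR-IR copy can be loaded at the pinned revision with the Hugging Face `datasets` library (a minimal sketch; feature names and splits are not shown here):

```python
# Sketch: load our DMR-IR copy from the Hugging Face Hub at the pinned revision.
from datasets import load_dataset

dmr_ir = load_dataset(
    "SemilleroCV/DMR-IR",
    revision="69ffd6240b4a50bc4a05c59b70773f3a506054f2",  # revision used in this work
)
print(dmr_ir)
```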
Our paper implements a multimodal vision-language transformer with the following components:
- Vision Encoder: ViT implementation with configurable depths and attention heads
- Language Model: GatorTron-based text processing for multimodal fusion
- Cross-Attention: Facilitates interaction between visual and textual features
- Segmentation Module: Optional segmentation branch for spatial analysis
- Fusion Layer: Alpha-weighted combination of modalities
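A rough sketch of how the cross-attention and alpha-weighted fusion components listed above could be wired together (PyTorch; the class name, dimensions, and exact fusion rule are assumptions, not the repository's actual implementation):

```python
# Sketch of cross-attention fusion between image tokens and text embeddings.
# Names, dimensions, and the fusion rule are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha
        # Image tokens attend to text tokens (queries = image, keys/values = text).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(query=img_tokens, key=text_tokens, value=text_tokens)
        # Alpha-weighted combination of visual and cross-attended features.
        fused = self.alpha * img_tokens + (1 - self.alpha) * attended
        return self.norm(fused)

# Toy usage: one sample with 197 image tokens, 32 text tokens, embedding dim 768.
fused = CrossAttentionFusion()(torch.randn(1, 197, 768), torch.randn(1, 32, 768))
print(fused.shape)  # torch.Size([1, 197, 768])
```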
- Python 3.10+
- CUDA-compatible GPU (recommended)
- conda or pip package manager
- Clone the repository:

```bash
git clone https://github.com/semilleroCV/breastcatt.git
cd breastcatt
```

- Create and activate a conda environment:

```bash
conda create -n breastcatt python=3.10
conda activate breastcatt
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Download pre-trained checkpoints:

```bash
# MAE pre-trained weights will be downloaded automatically
```

Train a custom breastcatt model:
```bash
python train.py \
  --dataset_name "SemilleroCV/DMR-IR" \
  --vit_version "base" \
  --use_cross_attn True \
  --use_segmentation False \
  --num_train_epochs 10 \
  --learning_rate 5e-5 \
  --per_device_train_batch_size 8
```

Fine-tune from HuggingFace Hub models:
```bash
python train.py \
  --vit_version "pretrained" \
  --model_name_or_path "SemilleroCV/tfvit-base-text-2" \
  --dataset_name "SemilleroCV/DMR-IR" \
  --num_train_epochs 5 \
  --learning_rate 2e-5
```

For standard transformer fine-tuning:
```bash
python finetune.py \
  --model_name_or_path "google/vit-base-patch16-224-in21k" \
  --dataset_name "SemilleroCV/DMR-IR" \
  --num_train_epochs 10 \
  --learning_rate 3e-5
```

Choose from different model sizes:
| Model | Parameters | Embedding Dim | Attention Heads | Layers |
|---|---|---|---|---|
| Small | ~22M | 384 | 6 | 12 |
| Base | ~86M | 768 | 12 | 12 |
| Large | ~307M | 1024 | 16 | 24 |
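Read as a configuration mapping, the table corresponds roughly to the following hyperparameters (an illustrative sketch only; the variable name and keys are assumptions, not the repository's actual code):

```python
# Illustrative mapping from --vit_version to ViT hyperparameters (mirrors the table above).
VIT_CONFIGS = {
    "small": {"embed_dim": 384,  "num_heads": 6,  "depth": 12},  # ~22M params
    "base":  {"embed_dim": 768,  "num_heads": 12, "depth": 12},  # ~86M params
    "large": {"embed_dim": 1024, "num_heads": 16, "depth": 24},  # ~307M params
}
```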
Key parameters for customization:
- `--use_cross_attn`: Enable cross-attention between modalities
- `--use_segmentation`: Include segmentation head
- `--alpha`: Fusion weight for multimodal combination
- `--vit_version`: Model size (small/base/large/pretrained)
- `--checkpointing_steps`: Save frequency for model checkpoints
Our experiments demonstrate state-of-the-art performance on thermal breast cancer detection across two datasets:
Comparison with multimodal fusion methods on DMR-IR dataset:
| Method | Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|---|
| Sánchez-Cauce et al. | 0.940 | 1.000 | 0.670 | 1.000 |
| Mammoottil et al. | 0.938 | 0.941 | 0.889 | 0.967 |
| Tsietso et al. | 0.904 | 0.933 | 0.933 | 0.833 |
| Ours | 0.975 | 0.963 | 0.966 | 0.979 |
Classification results on Breast Thermography dataset:
| Model | Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|---|
| ViT-Base | 0.884 | 0.833 | 0.556 | 0.971 |
| MobileNetV2 | 0.837 | 0.750 | 0.333 | 0.971 |
| Swin-Base | 0.907 | 0.857 | 0.667 | 0.971 |
| Ours (from scratch) | 0.837 | 0.000 | 0.000 | 1.000 |
| Ours (pretrain & fine-tune) | 0.837 | 0.769 | 0.714 | 0.897 |
- Superior Sensitivity: Best sensitivity (96.6%) on DMR-IR among all compared methods
- Multimodal Advantage: Integration of clinical metadata with thermal imaging significantly improves performance
- Transfer Learning Benefits: The pre-training and fine-tuning approach improves sensitivity on the Breast Thermography dataset
Pre-trained models are available on the HuggingFace Hub (e.g., SemilleroCV/tfvit-base-text-2, used in the fine-tuning command above).
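A checkpoint can be downloaded locally with `huggingface_hub`; a minimal sketch (how the weights are then loaded into the model is repository-specific and not shown):

```python
# Sketch: download a pre-trained checkpoint from the Hugging Face Hub.
# The repo id is taken from the fine-tuning command above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("SemilleroCV/tfvit-base-text-2")
print(local_dir)  # path to the downloaded checkpoint files
```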
Explore our analysis and experiments:
- `demo.ipynb`: Simple demonstration of our model
- `segmentation-outputs.ipynb`: Segmentation analysis
- `prompt-linear-probing.ipynb`: Prompt linear probing
If you find our work useful in your research, please cite:
```bibtex
@INPROCEEDINGS{pintobreastcatt,
  author={Pinto, Guillermo and León, Julián and Quintero, Brayan and Villamizar, Dana and Rueda-Chacón, Hoover},
  booktitle={2025 IEEE Colombian Conference on Applications of Computational Intelligence (ColCACI)},
  title={Multimodal Vision-Language Transformer for Thermography Breast Cancer Classification},
  year={2025},
  volume={},
  number={},
  pages={1-6},
  keywords={Sensitivity;Translation;Mortality;Infrared imaging;Medical services;Metadata;Transformers;Breast cancer;Standards;Periodic structures;Breast cancer;deep learning;thermography;vision-language transformer;clinical metadata;cross-attention},
  doi={10.1109/ColCACI67437.2025.11230909}
}
```

We thank the researchers who provided the DMR-IR and Breast Thermography datasets and the HuggingFace team for their transformers library. Special recognition to the MAE authors for pre-trained vision transformer weights.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and collaborations:
- Guillermo Pinto: [email protected]
- Research Group: Hands-on Computer Vision

