Multimodal Vision-Language Transformer for Thermography Breast Cancer Classification


Guillermo Pinto, Julián León, Brayan Quintero, Dana Villamizar and Hoover Rueda-Chacón

Research Group: Hands-on Computer Vision

Abstract: We propose a multimodal vision-language transformer designed to classify breast cancer from a single thermal image, irrespective of viewing angle, by integrating thermal features with structured clinical metadata. Our framework generates descriptive text prompts from patient-specific variables, which are embedded and jointly fused with image tokens via cross-attention blocks. Extensive experiments on the DMR-IR and Breast Thermography datasets demonstrate that our method achieves 97.5% accuracy on DMR-IR and 83.7% on Breast Thermography, surpassing state-of-the-art multimodal and visual-only baselines in sensitivity. By combining clinical cues, our approach offers a flexible, robust, and interpretable solution for early breast cancer detection with potential for clinical translation in diverse healthcare scenarios.


[Figure: breastcatt architecture]

Dataset

Our experiments utilize two thermal imaging datasets for comprehensive evaluation:

DMR-IR Dataset

The DMR-IR (Database for Mastology Research with Infrared Image) dataset contains thermal images for breast cancer detection. It includes structured clinical metadata (patient age, symptoms, medical history), which is converted into descriptive text prompts for multimodal fusion. The original dataset was obtained from the Departamento de Ciência da Computação at the Universidade Federal Fluminense, and our version of the dataset can be found on Hugging Face here (this work uses revision 69ffd6240b4a50bc4a05c59b70773f3a506054f2 of the dataset).
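
As an illustration of the prompt-generation step, a minimal Python sketch; the field names and template wording here are hypothetical, not the exact prompt format used in the paper:

def build_prompt(age, symptoms, history):
    # Turn structured clinical metadata into a descriptive sentence
    # (hypothetical template; the paper's actual wording may differ).
    return (f"The patient is {age} years old, reports {symptoms}, "
            f"and has a medical history of {history}.")

print(build_prompt(52, "no symptoms", "no prior breast disease"))
# The patient is 52 years old, reports no symptoms, and has a medical history of no prior breast disease.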

Breast Thermography Dataset

A dataset of thermography images of the female thorax region, together with the results of pathology studies. The pictures were taken in a doctor's office at the Hospital San Juan de Dios - Sede Cali. The original dataset is available here, and our version of the dataset can be found on Hugging Face here (this work uses revision 6a84021f2a5b253d0da72f7948de93613fd9a788 of the dataset).
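
To work against the exact dataset snapshots used in this work, the pinned revisions above can be passed to datasets.load_dataset. A minimal sketch for DMR-IR (the Hub ID matches the training commands below; the Breast Thermography Hub ID is omitted here since it is not shown in this README):

from datasets import load_dataset

# Pin the dataset to the revision used in this work, for reproducibility.
dmr_ir = load_dataset(
    "SemilleroCV/DMR-IR",
    revision="69ffd6240b4a50bc4a05c59b70773f3a506054f2",
)
print(dmr_ir)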

Architecture

We implement a multimodal vision-language transformer with the following components:

  • Vision Encoder: ViT implementation with configurable depth and number of attention heads
  • Language Model: GatorTron-based text encoder for processing the clinical prompts
  • Cross-Attention: Facilitates interaction between visual and textual features
  • Segmentation Module: Optional segmentation branch for spatial analysis
  • Fusion Layer: Alpha-weighted combination of the two modalities (see the sketch below)

[Figure: breastcatt architecture]
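
To make the cross-attention fusion concrete, here is a minimal PyTorch sketch of the idea; class and argument names are illustrative, not the repository's actual modules:

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    # Illustrative fusion block: image tokens attend to text embeddings,
    # then the attended features are alpha-weighted with the originals.
    def __init__(self, dim=768, num_heads=12, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, text_tokens):
        # Queries come from the vision branch; keys/values from the text branch.
        attended, _ = self.cross_attn(image_tokens, text_tokens, text_tokens)
        fused = self.alpha * image_tokens + (1 - self.alpha) * attended
        return self.norm(fused)

# Example: 197 ViT-Base image tokens fused with 32 text token embeddings.
fusion = CrossAttentionFusion()
out = fusion(torch.randn(1, 197, 768), torch.randn(1, 32, 768))
print(out.shape)  # torch.Size([1, 197, 768])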

Installation

Prerequisites

  • Python 3.10+
  • CUDA-compatible GPU (recommended)
  • conda or pip package manager

Setup

  1. Clone the repository:
git clone https://github.com/semilleroCV/breastcatt.git
cd breastcatt
  2. Create and activate a conda environment:
conda create -n breastcatt python=3.10
conda activate breastcatt
  3. Install dependencies:
pip install -r requirements.txt
  4. Download pre-trained checkpoints:
# MAE pre-trained weights will be downloaded automatically

Usage

Training from Scratch

Train a custom breastcatt model:

python train.py \
    --dataset_name "SemilleroCV/DMR-IR" \
    --vit_version "base" \
    --use_cross_attn True \
    --use_segmentation False \
    --num_train_epochs 10 \
    --learning_rate 5e-5 \
    --per_device_train_batch_size 8

Fine-tuning Pre-trained Models

Fine-tune a pre-trained model from the HuggingFace Hub:

python train.py \
    --vit_version "pretrained" \
    --model_name_or_path "SemilleroCV/tfvit-base-text-2" \
    --dataset_name "SemilleroCV/DMR-IR" \
    --num_train_epochs 5 \
    --learning_rate 2e-5

Using the Finetune Script

For standard transformer fine-tuning:

python finetune.py \
    --model_name_or_path "google/vit-base-patch16-224-in21k" \
    --dataset_name "SemilleroCV/DMR-IR" \
    --num_train_epochs 10 \
    --learning_rate 3e-5
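
In Python terms, the standard fine-tuning performed by finetune.py amounts to loading a Hub checkpoint with a fresh classification head. A minimal sketch using the transformers API (illustrative, not the script's actual code):

from transformers import AutoImageProcessor, AutoModelForImageClassification

checkpoint = "google/vit-base-patch16-224-in21k"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(
    checkpoint,
    num_labels=2,  # binary thermography classification
)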

Model Variants

Choose from different model sizes:

Model  Parameters  Embedding Dim  Attention Heads  Layers
Small  ~22M        384            6                12
Base   ~86M        768            12               12
Large  ~307M       1024           16               24
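
In code, the table above corresponds to configurations along these lines (a hypothetical mapping from --vit_version to hyperparameters; the repository's actual config keys may differ):

VIT_CONFIGS = {
    "small": {"embed_dim": 384,  "num_heads": 6,  "depth": 12},
    "base":  {"embed_dim": 768,  "num_heads": 12, "depth": 12},
    "large": {"embed_dim": 1024, "num_heads": 16, "depth": 24},
}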

Configuration Options

Key parameters for customization:

  • --use_cross_attn: Enable cross-attention between modalities
  • --use_segmentation: Include segmentation head
  • --alpha: Fusion weight for multimodal combination
  • --vit_version: Model size (small/base/large/pretrained)
  • --checkpointing_steps: Save frequency for model checkpoints

Results

Our experiments demonstrate state-of-the-art accuracy on DMR-IR and the highest sensitivity among the compared methods on both datasets:

DMR-IR Dataset Results

Comparison with multimodal fusion methods on DMR-IR dataset:

Method                 Accuracy  Precision  Sensitivity  Specificity
Sánchez-Cauce et al.   0.940     1.000      0.670        1.000
Mammoottil et al.      0.938     0.941      0.889        0.967
Tsietso et al.         0.904     0.933      0.933        0.833
Ours                   0.975     0.963      0.966        0.979

Breast Thermography Dataset Results

Classification results on Breast Thermography dataset:

Model                        Accuracy  Precision  Sensitivity  Specificity
ViT-Base                     0.884     0.833      0.556        0.971
MobileNetV2                  0.837     0.750      0.333        0.971
Swin-Base                    0.907     0.857      0.667        0.971
Ours (from scratch)          0.837     0.000      0.000        1.000
Ours (pretrain & fine-tune)  0.837     0.769      0.714        0.897
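
The four reported metrics all derive from confusion-matrix counts. A minimal sketch with scikit-learn (illustrative, not the repository's evaluation code):

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # toy labels: 1 = sick, 0 = healthy
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
sensitivity = tp / (tp + fn)  # a.k.a. recall; the key metric for screening
specificity = tn / (tn + fp)
print(accuracy, precision, sensitivity, specificity)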

Key Findings

  • Superior Sensitivity: Best sensitivity (96.6%) on DMR-IR among all compared methods
  • Multimodal Advantage: Integration of clinical metadata with thermal imaging significantly improves performance
  • Transfer Learning Benefits: Pre-training and fine-tuning approach shows improved sensitivity on Breast Thermography dataset

Checkpoints

Pre-trained models are available on HuggingFace Hub:
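
Checkpoints can also be fetched programmatically. A minimal sketch with huggingface_hub, using the checkpoint ID that appears in the fine-tuning command above:

from huggingface_hub import snapshot_download

# Download all files of the checkpoint repository to the local HF cache.
local_dir = snapshot_download("SemilleroCV/tfvit-base-text-2")
print(local_dir)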

Notebooks

Explore our analysis and experiments:

Citation

If you find our work useful in your research, please cite:

@INPROCEEDINGS{pintobreastcatt,
  author={Pinto, Guillermo and León, Julián and Quintero, Brayan and Villamizar, Dana and Rueda-Chacón, Hoover},
  booktitle={2025 IEEE Colombian Conference on Applications of Computational Intelligence (ColCACI)}, 
  title={Multimodal Vision-Language Transformer for Thermography Breast Cancer Classification}, 
  year={2025},
  volume={},
  number={},
  pages={1-6},
  keywords={Sensitivity;Translation;Mortality;Infrared imaging;Medical services;Metadata;Transformers;Breast cancer;Standards;Periodic structures;Breast cancer;deep learning;thermography;vision-language transformer;clinical metadata;cross-attention},
  doi={10.1109/ColCACI67437.2025.11230909}}

Acknowledgements

We thank the researchers who provided the DMR-IR and Breast Thermography datasets and the HuggingFace team for their transformers library. Special recognition to the MAE authors for pre-trained vision transformer weights.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions and collaborations:
