v1.0 · 2026-06 — ColBERT-style late-interaction visual document retriever built on Qwen3-VL-4B-Instruct (Apache-2.0) via
colpali_engine.models.ColQwen3. Evaluated on the full ViDoRe V3 public benchmark: NDCG@10 = 0.5584 (8 subtasks, full corpus, all queries, bootstrap 95% CI).
ColTurk-VDR is a multi-vector late-interaction (ColBERT/MaxSim) visual document retriever: it embeds document page images and text queries into per-token 128-dim vectors and ranks pages by MaxSim — no OCR, layout-aware, multilingual-capable. It is trained with LoRA (r=32) on a single A100 80GB from the raw Qwen/Qwen3-VL-4B-Instruct base using the colpali-engine training stack (transformers v5 native).
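
The MaxSim ranking described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the colpali-engine implementation: the function names and the 2-dim toy vectors are assumptions made for readability (the model emits 128-dim vectors per token).

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token, take the best-matching
    page token, then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in page_vecs) for q in query_vecs)


def rank_pages(query_vecs: Sequence[Vector], pages: Sequence[Sequence[Vector]]) -> List[int]:
    """Return page indices sorted by descending MaxSim score."""
    scores = [maxsim(query_vecs, p) for p in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy data: two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]  # has a good match for both query tokens
page_b = [[1.0, 0.0], [1.0, 0.0]]  # matches only the first query token

print(maxsim(query, page_a), maxsim(query, page_b))
print(rank_pages(query, [page_a, page_b]))
```

Because every query token contributes its own maximum, a page that matches all query tokens moderately well can outrank a page that matches a single token perfectly.
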
The long-term goal is Turkish enterprise documents (e-invoices, KYC, legal, financial); v1.0 is the Stage-1 EN+FR foundation model, evaluated and submitted on ViDoRe V3.
Official: NDCG@10 = 0.5584 · NDCG@5 = 0.5287 · recall@10 = 0.6110 (checkpoint-1000, processor-default visual tokens, seeded bootstrap 95% CI; raw JSONs in eval/results/).
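
A seeded bootstrap 95% CI of the kind quoted above can be sketched as follows. This assumes a percentile bootstrap that resamples per-query scores with replacement; the exact resampling scheme of the eval harness is not stated here, so treat `bootstrap_ci` and its defaults as illustrative (the seed of 42 and 1000 resamples mirror values mentioned elsewhere in this README).

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(
    per_query_scores: List[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 42,
) -> Tuple[float, float, float]:
    """Percentile bootstrap over queries: returns (mean, lower, upper)."""
    rng = random.Random(seed)  # fixed seed, so reruns give the same interval
    n = len(per_query_scores)
    resampled_means = sorted(
        mean(rng.choices(per_query_scores, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(per_query_scores), lower, upper


# Hypothetical per-query NDCG@10 values, evenly spread over [0, 1].
scores = [i / 99 for i in range(100)]
point, lower, upper = bootstrap_ci(scores)
print(f"{point:.4f} [{lower:.3f}, {upper:.3f}]")
```

Seeding the generator is what makes the interval reproducible: two runs over the same per-query scores return identical bounds.
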
| Subtask | NDCG@10 | 95% CI |
|---|---|---|
| computer_science | 0.7306 | [0.718, 0.743] |
| energy | 0.6238 | [0.608, 0.638] |
| pharmaceuticals | 0.6156 | [0.602, 0.629] |
| finance_en | 0.5851 | [0.571, 0.601] |
| hr | 0.5463 | [0.532, 0.560] |
| industrial | 0.4624 | [0.445, 0.482] |
| physics | 0.4564 | [0.443, 0.471] |
| finance_fr | 0.4467 | [0.430, 0.463] |
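
As a quick consistency check, the unweighted mean of the eight rows above reproduces the headline NDCG@10. This is an observation from the table, not a documented aggregation rule of the harness.

```python
# Per-subtask NDCG@10 values copied from the table above.
subtask_ndcg10 = {
    "computer_science": 0.7306,
    "energy": 0.6238,
    "pharmaceuticals": 0.6156,
    "finance_en": 0.5851,
    "hr": 0.5463,
    "industrial": 0.4624,
    "physics": 0.4564,
    "finance_fr": 0.4467,
}

# Unweighted (macro) average across subtasks.
macro_ndcg10 = sum(subtask_ndcg10.values()) / len(subtask_ndcg10)
print(f"{macro_ndcg10:.4f}")  # prints 0.5584
```
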
Training: 1000 steps (effective batch 32, LR 5e-5 linear, ~0.3 epoch of the 108K-pair manu/colpali EN+FR set, num_negs=2, bf16, gradient checkpointing). The checkpoint curve peaks at step 1000 (500 → 0.5441, 1000 → 0.5584, 1500 → 0.5518 = overfit onset); checkpoint selection is eval-gated, not loss-gated.
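
The epoch fraction and the eval-gated checkpoint choice follow from the numbers in the paragraph above. A small worked sketch, taking "108K" as 108,000 pairs (an assumed rounding):

```python
steps = 1000
effective_batch = 32
dataset_pairs = 108_000  # "108K" taken as a round figure

pairs_seen = steps * effective_batch       # 32,000 training pairs
epoch_fraction = pairs_seen / dataset_pairs  # roughly 0.296

# Full-benchmark NDCG@10 per checkpoint, as reported above.
eval_ndcg10 = {500: 0.5441, 1000: 0.5584, 1500: 0.5518}

# Eval-gated selection: keep the checkpoint with the best benchmark score,
# regardless of what the training loss is doing.
best_step = max(eval_ndcg10, key=eval_ndcg10.get)

print(pairs_seen, round(epoch_fraction, 2), best_step)
```
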
Every additional lever was evaluated on the full benchmark with the same harness and dropped on evidence:
| Lever | Predicted | Measured | Verdict |
|---|---|---|---|
| num_negs 2→4 (more mined negatives) | positive | −0.016 (worse on 8/8 subtasks) | K=2 optimal |
| Diverse-run weight averaging (2 runs, different seed) | positive | −0.006 (linear blend, zero synergy) | seed change also changes LoRA init → different basins |
| Train-match visual-token cap (768) at eval | positive | −0.017 | more inference tokens = more detail; uncapped optimal |
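
For readers unfamiliar with the weight-averaging lever, a linear blend interpolates the parameters of two runs element by element. The sketch below uses plain Python lists in place of tensors; `blend`, the parameter names, and the 0.5 weight are hypothetical and are not taken from the repository's weight-averaging script.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]


def blend(a: StateDict, b: StateDict, weight_a: float = 0.5) -> StateDict:
    """Element-wise interpolation: weight_a * a + (1 - weight_a) * b."""
    if a.keys() != b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: [weight_a * x + (1.0 - weight_a) * y for x, y in zip(a[name], b[name])]
        for name in a
    }


# Hypothetical two-parameter checkpoints from two runs.
run_1 = {"proj.weight": [1.0, 2.0], "proj.bias": [0.0, 4.0]}
run_2 = {"proj.weight": [3.0, 2.0], "proj.bias": [2.0, 0.0]}

averaged = blend(run_1, run_2)
print(averaged)
```

Such a blend tends to help only when both runs sit in the same loss basin, which is the explanation the table gives for the measured regression.
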
Full analysis: STAGE1_VALIDITY_REPORT.md (causal control, leakage tripwires, pHash contamination scan, bootstrap CIs, reproducibility).
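
As a rough illustration of what a pHash contamination scan checks, near-duplicate detection reduces to a Hamming-distance threshold once perceptual hashes exist. The sketch assumes 64-bit hashes already computed by an image-hashing tool; the identifiers, hash values, and the threshold of 4 bits are hypothetical and do not describe the repository's scan.

```python
from typing import Dict, List, Tuple


def hamming(h1: int, h2: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(h1 ^ h2).count("1")


def find_near_duplicates(
    train_hashes: Dict[str, int],
    eval_hashes: Dict[str, int],
    max_distance: int = 4,  # assumed threshold, tune per dataset
) -> List[Tuple[str, str, int]]:
    """Return (eval_id, train_id, distance) for every suspiciously close pair."""
    hits = []
    for eval_id, eh in eval_hashes.items():
        for train_id, th in train_hashes.items():
            distance = hamming(eh, th)
            if distance <= max_distance:
                hits.append((eval_id, train_id, distance))
    return hits


# Hypothetical precomputed 64-bit perceptual hashes.
train = {"train_001": 0xF0F0F0F0F0F0F0F0, "train_002": 0x0123456789ABCDEF}
evaluation = {"eval_a": 0xF0F0F0F0F0F0F0F1, "eval_b": 0xFFFFFFFF00000000}

print(find_near_duplicates(train, evaluation))
```
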

```python
import torch
from colpali_engine.models import ColQwen3, ColQwen3Processor

model = ColQwen3.from_pretrained(
    "Verm1ion/ColTurk-VDR-Qwen3VL-4B-v1.0",  # merged full model, loads directly
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="sdpa",
).eval()
processor = ColQwen3Processor.from_pretrained("Verm1ion/ColTurk-VDR-Qwen3VL-4B-v1.0")

# documents = list of PIL page images; queries = list of strings
doc_batch = processor.process_images(documents).to(model.device)
qry_batch = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    doc_emb = model(**doc_batch)
    qry_emb = model(**qry_batch)

scores = processor.score_multi_vector(qry_emb, doc_emb)  # (n_queries, n_docs) MaxSim
```

Requirements: colpali-engine>=0.3.16, transformers>=5.0, torch>=2.5. The published repo is a merged full model (LoRA baked in): no PEFT loading step, no adapter key-prefix issues across transformers versions.


```bash
python scripts/eval/eval_colturk_checkpoint.py \
  --adapter Verm1ion/ColTurk-VDR-Qwen3VL-4B-v1.0 \
  --bootstrap 1000 --output eval/results/repro.json
```

The harness downloads the 8 ViDoRe V3 public subtasks (vidore/vidore_v3_*, split test), encodes the full corpus and all queries, scores with MaxSim, and reports NDCG@5/@10 + recall with a seeded bootstrap CI. Environment pins: REPRODUCIBILITY.md (seed 42; transformers 5.9 / peft 0.19 / colpali-engine 0.3.16 / torch 2.11).

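
The per-query metrics the harness reports can be sketched as below. This is a minimal illustration with linear gains and a log2 rank discount; the harness's exact gain convention is not stated here, so `ndcg_at_k` and `recall_at_k` are assumptions rather than its actual code.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount (ranks start at 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """Fraction of relevant pages that appear in the top-k ranking."""
    relevant = {doc_id for doc_id, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / len(relevant)


# Hypothetical query: two relevant pages, one retrieved at rank 2.
ranking = ["p3", "p1", "p7"]
qrels = {"p1": 1.0, "p9": 1.0}

print(ndcg_at_k(ranking, qrels), recall_at_k(ranking, qrels))
```

Averaging these per-query values over all queries of a subtask gives the subtask score, and the same per-query list is what a bootstrap would resample.
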
- `configs/qwen3/`: training configs (Stage-1 + ablation variants, all eval-gated)
- `scripts/training/`: launcher (resume + HF checkpoint push), weight-averaging, attention experiments
- `scripts/eval/`: ViDoRe V3 eval harness (full-corpus MaxSim + bootstrap CI), MTEB results builder
- `scripts/data/`: corpus tooling, pHash contamination scan, synthetic-data QC
- `src/inference/`: shared encode/MaxSim utilities
- `src/models/`: MTEB integration wrapper
- `eval/results/`: raw result JSONs (official numbers)

- Stage-1 EN+FR foundation (v1.0, this release) + ViDoRe V3 submission
- Stage-2 Turkish continual fine-tune (synthetic TR corpus pipeline is built; KVKK-compliant)
- ViDoRe-TR: public Turkish visual-retrieval split (BEIR format)
- Serving stack (FastAPI + Qdrant multi-vector + Docker Compose)
- Code: MIT — LICENSE
- Model weights: inherit Apache-2.0 from the base model; dataset/license matrix in LICENSE-NOTICE.md
- KVKK: no real PII in any published artifact; Turkish data work uses synthetic or public-domain sources only.
Mert Karatay — AI & Network Security Engineer, İstanbul · HuggingFace: Verm1ion · merttkaratayy@gmail.com

```bibtex
@misc{karatay2026colturkvdr,
  author = {Karatay, Mert},
  title  = {ColTurk-VDR: A Late-Interaction Visual Document Retriever on Qwen3-VL-4B},
  year   = {2026},
  url    = {https://github.com/Verm1lion/ColTurk-VDR}
}
```

- ColPali / colpali-engine: late-interaction visual retrieval framework (Faysse et al., ICLR 2025)
- ViDoRe V3 — benchmark and evaluation datasets
- Qwen3-VL — Apache-2.0 base vision-language model
- NVIDIA Nemotron ColEmbed v2 — training-recipe reference (hard negatives, K=2)
