Accepted to CLVision @ ICCV 2025
Authors: Daniel Csizmadia, Andrei Codreanu, Victor Alexander Sim, Vighnesh Prabhu, Kevin Zhu, Sean O'Brien
CLIP models are typically constrained by fixed image resolutions and limited context length, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, in which a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and the corresponding textual spans. Despite being trained on only ~67,500 samples, DCLIP achieves over 20% Recall@1 gains in text-to-image retrieval while retaining 94% of CLIP's zero-shot classification accuracy.
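The bidirectional cross-attention described above can be illustrated with a minimal sketch: each modality attends to the other, so region embeddings are enriched with textual context and span embeddings with visual context. This is a single-head, NumPy-only illustration under assumed shapes (the actual teacher uses a multi-head cross-modal transformer; the function and variable names here are hypothetical, not from the paper's code).

```python
import numpy as np

def cross_attention(queries, keys_values):
    # Scaled dot-product attention, single head (illustrative sketch).
    # queries:     (Q, d) embeddings doing the attending
    # keys_values: (K, d) embeddings being attended to
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)          # (Q, K) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
    return weights @ keys_values                           # (Q, d) enriched output

rng = np.random.default_rng(0)
regions = rng.standard_normal((5, 64))  # e.g. 5 YOLO-extracted region embeddings
spans = rng.standard_normal((7, 64))    # e.g. 7 textual-span embeddings

# Bidirectional: image attends to text, and text attends to image.
enriched_regions = cross_attention(regions, spans)  # shape (5, 64)
enriched_spans = cross_attention(spans, regions)    # shape (7, 64)
```

In the full teacher these two attention passes would be multi-headed, stacked, and followed by residual connections and feed-forward layers; the sketch only shows the core bidirectional information flow.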

