Accepted to CLVision @ ICCV 2025

Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation

Daniel Csizmadia, Andrei Codreanu, Victor Alexander Sim, Vighnesh Prabhu, Kevin Zhu, Sean O'Brien

Abstract

CLIP models are typically constrained by fixed image resolutions and limited context, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, where a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and corresponding textual spans. Despite being trained on only ~67,500 samples, DCLIP achieves over 20% Recall@1 gains in text-to-image retrieval while retaining 94% of CLIP's zero-shot accuracy.

Citation

Daniel Csizmadia, Andrei Codreanu, Victor Alexander Sim, Vighnesh Prabhu, Kevin Zhu, Sean O'Brien. "Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation". Accepted to CLVision @ ICCV 2025.

Resources

View on arXiv

Details

Conference: Accepted to CLVision @ ICCV 2025
Authors: 6 authors

Related Publications

Explore more research from Algoverse

NeurIPS 2025 (Spotlight)

Publish Your Research

Join Algoverse and work with world-class mentors to publish at top AI conferences.

Start Your Application