Accepted to CLVision @ ICCV 2025
Authors: Daniel Csizmadia, Andrei Codreanu, Victor Alexander Sim, Vighnesh Prabhu, Kevin Zhu, Sean O'Brien
CLIP models are typically constrained by fixed image resolutions and limited context length, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, in which a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and the corresponding textual spans. Despite being trained on only ~67,500 samples, DCLIP achieves over 20% Recall@1 gains in text-to-image retrieval while retaining 94% of CLIP's zero-shot classification accuracy.
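The bidirectional cross-attention described above can be illustrated with a minimal sketch: each modality attends to the other, so region embeddings are enriched with textual context and span embeddings with visual context. This is a single-head, NumPy-only illustration under assumed shapes (the actual teacher uses a multi-head cross-modal transformer; the function and variable names here are hypothetical, not from the paper's code).

```python
import numpy as np

def cross_attention(queries, keys_values):
    # Scaled dot-product attention, single head (illustrative sketch).
    # queries:     (Q, d) embeddings doing the attending
    # keys_values: (K, d) embeddings being attended to
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)          # (Q, K) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
    return weights @ keys_values                           # (Q, d) enriched output

rng = np.random.default_rng(0)
regions = rng.standard_normal((5, 64))  # e.g. 5 YOLO-extracted region embeddings
spans = rng.standard_normal((7, 64))    # e.g. 7 textual-span embeddings

# Bidirectional: image attends to text, and text attends to image.
enriched_regions = cross_attention(regions, spans)  # shape (5, 64)
enriched_spans = cross_attention(spans, regions)    # shape (7, 64)
```

In the full teacher these two attention passes would be multi-headed, stacked, and followed by residual connections and feed-forward layers; the sketch only shows the core bidirectional information flow.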

