Abstract
CLIP models are typically constrained by fixed image resolutions and limited context, which can hinder their effectiveness in retrieval tasks that require fine-grained cross-modal understanding. DCLIP addresses these challenges through a meta teacher-student distillation framework, where a cross-modal transformer teacher is fine-tuned to produce enriched embeddings via bidirectional cross-attention between YOLO-extracted image regions and corresponding textual spans. Despite being trained on only ~67,500 samples, DCLIP achieves over 20% Recall@1 gains in text-to-image retrieval while retaining 94% of CLIP's zero-shot accuracy.