Abstract
Fine-grained image-caption alignment is a crucial component of robust visuo-linguistic compositional reasoning, enabling models to perform effectively in socially critical contexts such as visual risk assessment and cultural context reasoning. We introduce MiSCHiEF (Minimal-pairs in Safety & Culture for Holistic Evaluation of Fine-grained alignment), a benchmark comprising two datasets: MiS (Minimal-pairs in Safety) and a culture-focused counterpart. Evaluation on our benchmark reveals that models are generally better at confirming correct image-caption pairs than at rejecting incorrect ones, and that they achieve higher accuracy when selecting the correct caption from two highly similar candidates for a given image.