MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

Accepted to EACL Main Conference 2026

Authors: Sagarika Banerjee, Tangatar Madi, Advait Swaminathan, Nguyen Dao Manh Anh

Fine-grained image-caption alignment is a crucial component of robust visuo-linguistic compositional reasoning, enabling models to perform effectively in socially critical contexts such as visual risk assessment and cultural context reasoning. MiSCHiEF (Minimal-pairs in Safety & Culture for Holistic Evaluation of Fine-grained alignment) consists of two datasets: MiS (Minimal-pairs in Safety) and a culture-focused component. Our benchmark reveals that models generally perform better at confirming correct image-caption pairs than rejecting incorrect ones, and achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image.
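To make the three quantities in that finding concrete, here is a minimal Python sketch of how they could be computed for minimal pairs. It is not the authors' evaluation code. The `MinimalPair` container, the 2x2 score layout, the `threshold` of 0.5, and the metric names are all assumptions made for illustration; the benchmark's actual protocol may differ.

```python
from dataclasses import dataclass


@dataclass
class MinimalPair:
    # Hypothetical layout: scores[i][j] is a model's alignment score for
    # image i paired with caption j; caption i is the correct one for image i.
    scores: tuple[tuple[float, float], tuple[float, float]]


def minimal_pair_metrics(pairs, threshold=0.5):
    """Return accept, reject, and caption-selection accuracy over all images."""
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    accept_hits = reject_hits = select_hits = 0
    for pair in pairs:
        s = pair.scores
        for i in (0, 1):
            match, mismatch = s[i][i], s[i][1 - i]
            accept_hits += match >= threshold    # confirms the correct pair
            reject_hits += mismatch < threshold  # rejects the incorrect pair
            select_hits += match > mismatch      # picks the right caption of two
    n = 2 * len(pairs)  # two images per minimal pair
    return {
        "accept_acc": accept_hits / n,
        "reject_acc": reject_hits / n,
        "caption_selection_acc": select_hits / n,
    }


# Made-up scores, for illustration only.
example = [
    MinimalPair(((0.9, 0.7), (0.2, 0.8))),
    MinimalPair(((0.6, 0.4), (0.7, 0.3))),
]
metrics = minimal_pair_metrics(example)
```

In this toy input, the first image of the first pair scores its wrong caption at 0.7, so that caption is not rejected under the assumed threshold even though the right caption still wins the two-way comparison. That is one way a model can look stronger at confirming and selecting than at rejecting.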
