MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

January 1, 2026

Accepted to EACL Main Conference 2026

Authors: Sagarika Banerjee, Tangatar Madi, Advait Swaminathan, Nguyen Dao Manh Anh

Fine-grained image-caption alignment is a crucial component of robust visuo-linguistic compositional reasoning, enabling models to perform effectively in socially critical contexts such as visual risk assessment and cultural context reasoning. MiSCHiEF (Minimal-pairs in Safety & Culture for Holistic Evaluation of Fine-grained alignment) consists of two datasets: MiS (Minimal-pairs in Safety) and a culture-focused component. Our benchmark reveals that models generally perform better at confirming correct image-caption pairs than rejecting incorrect ones, and achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image.
