Accepted to DIG-BUG @ ICML 2025
Authors: Naaisha Agarwal, Ishant Yunay Chintapatla, Kazuma Choji, Andrew Lwin, Hannah You
We introduce COREVQA, a novel Visual Question Answering (VQA) benchmark designed to rigorously evaluate Vision-Language Models (VLMs). COREVQA consists of 5,608 pairs of images and synthetically generated true/false statements, with images drawn from the CrowdHuman dataset, to probe visual entailment reasoning on challenging crowded scenes. Unlike existing crowd-based datasets, which focus only on detection, recognition, and counting, COREVQA requires models to integrate fine-grained visual analysis with textual logic, where visual ambiguity and seemingly trivial, easy-to-miss details are key. Results show that even top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%).
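Since each example reduces to a binary visual-entailment judgment, scoring collapses to plain accuracy over image/statement pairs. Below is a minimal sketch of such an evaluation loop; the `model.predict` interface, the JSONL field names, and the file path are hypothetical placeholders for illustration, not the benchmark's released tooling.

```python
import json

def evaluate(model, annotations_path: str) -> float:
    """Score a model on true/false visual-entailment pairs.

    Each JSONL record is assumed (hypothetically) to hold an image
    path, a statement, and a ground-truth boolean label.
    """
    correct = total = 0
    with open(annotations_path) as f:
        for line in f:
            ex = json.loads(line)
            # model.predict is a placeholder: any callable mapping
            # (image_path, statement) -> bool fits this protocol.
            pred = model.predict(ex["image_path"], ex["statement"])
            correct += int(pred == ex["label"])
            total += 1
    return correct / total  # accuracy, the metric reported above
```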

