Accepted to DIG-BUG @ ICML 2025

COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark

Naaisha Agarwal, Ishant Yunay Chintapatla, Kazuma Choji, Andrew Lwin, Hannah You

Abstract

We introduce COREVQA, a novel Visual Question Answering (VQA) benchmark designed to rigorously evaluate Vision-Language Models (VLMs). COREVQA pairs 5,608 images drawn from the CrowdHuman dataset with synthetically generated true/false statements, probing visual entailment reasoning over challenging crowded scenes. Unlike existing crowd-based datasets, which focus on detection, recognition, and counting, COREVQA requires models to integrate fine-grained visual analysis with textual logic, in settings where visual ambiguity and seemingly trivial, easy-to-miss details are key. Results show that even the top-performing VLMs achieve accuracy below 80%, with the remaining models performing substantially worse (39.98%–69.95%).
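The evaluation protocol described above (scoring a model's true/false verdicts against gold labels) can be sketched as follows. This is a minimal illustration, not the paper's actual harness: the record fields (`image`, `statement`, `label`) and the placeholder model are hypothetical.

```python
# Hedged sketch of accuracy scoring on a true/false visual-entailment
# benchmark in the style of COREVQA. Field names and the stub model are
# assumptions for illustration, not taken from the paper.

def accuracy(records, predict):
    """Fraction of records where the model's true/false verdict matches the label."""
    correct = sum(1 for r in records if predict(r) == r["label"])
    return correct / len(records)

# Toy stand-in data: image/statement pairs with gold true/false labels.
records = [
    {"image": "crowd_001.jpg", "statement": "A person near the front wears a red hat.", "label": True},
    {"image": "crowd_002.jpg", "statement": "Everyone in the crowd is seated.", "label": False},
    {"image": "crowd_003.jpg", "statement": "More than ten people are visible.", "label": True},
]

def always_true(record):
    # Degenerate baseline model that answers True to every statement.
    return True

score = accuracy(records, always_true)  # 2 of 3 labels are True, so ~0.667
```

A real harness would replace `always_true` with a call to the VLM under test, passing it the image and statement and parsing its answer into a boolean.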

Citation

Naaisha Agarwal, Ishant Yunay Chintapatla, Kazuma Choji, Andrew Lwin, Hannah You. "COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark". Accepted to DIG-BUG @ ICML 2025.

