COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark

December 1, 2025

Accepted to DIG-BUG @ ICML 2025

Authors: Naaisha Agarwal, Ishant Yunay Chintapatla, Kazuma Choji, Andrew Lwin, Hannah You

We introduce COREVQA, a novel Visual Question Answering (VQA) benchmark designed to rigorously evaluate Vision-Language Models. COREVQA consists of 5,608 pairs of images and synthetically generated true/false statements, with images derived from the CrowdHuman dataset, designed to elicit visual entailment reasoning over challenging crowded scenes. Unlike existing crowd-based datasets that focus only on detection, recognition, and counting, COREVQA requires models to integrate fine-grained visual analysis with textual logic, where visual ambiguity and seemingly trivial, easy-to-miss details are key. Results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%).
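To make the evaluation setup concrete, below is a minimal sketch of how a benchmark of this shape might be scored. The file name (`corevqa_pairs.jsonl`), the field names (`image_path`, `statement`, `label`), and the `query_vlm` helper are hypothetical placeholders rather than the released COREVQA tooling; the sketch only illustrates scoring true/false visual-entailment predictions by accuracy.

```python
import json


def load_pairs(path):
    """Load image/statement/label records from a JSONL file.

    Each line is assumed to look like (hypothetical schema):
    {"image_path": "imgs/000001.jpg", "statement": "...", "label": true}
    """
    with open(path) as f:
        return [json.loads(line) for line in f]


def query_vlm(image_path, statement):
    """Stand-in for a call to a vision-language model.

    A real harness would send the image together with a prompt such as
    "True or false: <statement>" to the model and map its reply to a
    boolean. Here we return a constant so the sketch runs end to end;
    this trivial baseline scores roughly the label base rate.
    """
    return True


def evaluate(pairs):
    """Accuracy over true/false visual-entailment pairs."""
    correct = sum(
        query_vlm(rec["image_path"], rec["statement"]) == rec["label"]
        for rec in pairs
    )
    return correct / len(pairs)


if __name__ == "__main__":
    pairs = load_pairs("corevqa_pairs.jsonl")  # hypothetical file name
    print(f"Accuracy over {len(pairs)} pairs: {evaluate(pairs):.2%}")
```

Because the statements are binary, accuracy against the true/false labels is the natural headline metric, which is how the reported 39.98%-69.95% range for weaker models should be read.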
