COREVQA

Donated on 8/6/2025

Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test a model's ability to perform visual entailment accurately, that is, to accept or refute a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5608 images paired with synthetically generated true/false statements, with images derived from the CrowdHuman dataset, designed to provoke visual entailment reasoning on challenging crowded scenes. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in VLMs' ability to reason over certain types of image-statement pairs in crowded scenes.
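
As a rough illustration of the evaluation setup described above, the sketch below shows how a true/false prompt could be formed for each image-statement pair and how a model's free-text reply could be scored against the gold label. The prompt wording, the answer-normalization rule, and the query_vlm placeholder are illustrative assumptions, not the protocol used in the paper.

# Hedged sketch of a true/false visual-entailment evaluation loop.
# The prompt template, answer parsing, and query_vlm placeholder are
# illustrative assumptions, not the COREVQA paper's exact protocol.
from typing import Callable, Iterable, Tuple

def build_prompt(statement: str) -> str:
    """Form a binary entailment question for one statement."""
    return (
        "Based only on the image, is the following statement true or false?\n"
        f"Statement: {statement}\n"
        "Answer with exactly one word: True or False."
    )

def parse_answer(text: str) -> bool:
    """Map a model's free-text reply onto a boolean (anything else counts as False)."""
    return text.strip().lower().startswith("true")

def accuracy(
    pairs: Iterable[Tuple[str, str, bool]],   # (image_path, statement, gold_label)
    query_vlm: Callable[[str, str], str],     # hypothetical: (image_path, prompt) -> reply text
) -> float:
    """Score a VLM on image/statement pairs by exact-match accuracy."""
    results = [
        parse_answer(query_vlm(image, build_prompt(statement))) == gold
        for image, statement, gold in pairs
    ]
    return sum(results) / len(results) if results else 0.0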

Dataset Characteristics

Tabular, Text, Image

Subject Area

Computer Science

Associated Tasks

Other

Feature Type

Real, Categorical

# Instances

5608

# Features

-

Dataset Information

Has Missing Values?

No

Introductory Paper

COREVQA: A Crowd Observation and Reasoning Entailment Visual Question Answering Benchmark

By Ishant Chintapatla, Kazuma Choji, Naaisha Agarwal, Andrew Lin, Hannah You, Charles Duong, Kevin Zhu, Sean O'Brien, Vasu Sharma. 2025

Published in ICML

Variable Information

Images are hosted in the Hugging Face repo: https://huggingface.co/datasets/COREVQA2025/COREVQA
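
The CSV distributed here contains the statement annotations, while the images themselves live in the Hugging Face repo linked above. A minimal Python sketch of loading and inspecting both sources follows; the column names, split names, and join key in the commented merge are assumptions, since they are not documented on this page.

# Minimal sketch for combining the CSV annotations with the images hosted
# on Hugging Face. The exact column names, split names, and join key are
# not documented here, so inspect both sources first.
import pandas as pd
from datasets import load_dataset

# True/false statement annotations distributed with this repository.
annotations = pd.read_csv("COREVQA_data.csv")
print(annotations.columns.tolist())           # check the actual column names

# Image data from https://huggingface.co/datasets/COREVQA2025/COREVQA.
images = load_dataset("COREVQA2025/COREVQA")  # DatasetDict keyed by split name
print(images)                                 # check split names and features

# Assuming (hypothetically) that both sources share an "image_id" field,
# the rows could then be joined like so:
# image_table = images["train"].to_pandas()
# merged = annotations.merge(image_table, on="image_id")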

Dataset Files

File                Size
COREVQA_data.csv    1.1 MB


Creators

Ishant Chintapatla

ishantyunay@gmail.com

Westmont High School
