VidNum-1.4K

A Hierarchical Video Counting and Numerical Reasoning Benchmark

VidNum-1.4K is a strictly human-annotated benchmark with 1,379 video-question pairs across diverse real-world, documentary, and virtual-world videos.

Shaoyang Cui* Lingbei Meng*

Tsinghua University | Shenzhen Loop Area Institute


About VidNum-1.4K

Figure 2. Distribution of video topics, durations, and question composition in VidNum-1.4K.

Our Dataset

01

1,379 MCQs

Each question is grounded in an independent video clip (5s to 120s), with strict human annotation and verification.

02

Three Counting Targets

The benchmark covers Object (647), Action (343), and Event (389) to test broad numerical video reasoning.

03

Diverse Sources

Videos span real-world, documentary/educational, and virtual-world content with high heterogeneity.

Level 1

Homogeneous Counting

Count a single type of object/action/event under minimal constraints. This tests stable temporal tracking and object permanence.

Level 2

Constrained and Heterogeneous Counting

Count multiple entity types or constrained attributes, often across multi-shot videos requiring robust re-identification.

Level 3

Compositional Numerical Reasoning

Perform temporally grounded arithmetic/comparison operations, moving from perception to multi-step logical deduction.

Three-Level Benchmark Questions

Level 1 Demo | QID = 1306

How many gray-green doors appeared in total in this video?

Options: (A) 1   (B) 4   (C) 2   (D) 3

Answer: D

Reasoning Sketch

QID 1306 reasoning image

Level 2 Demo | QID = 166

How many different dogs appeared in the video?

Options: (A) 9   (B) 8   (C) 7   (D) 10

Answer: C

Reasoning Sketch

QID 166 reasoning image

Level 3 Demo | QID = 505

A short-haired man in a red jersey shoots twice near the goal. Were there more players in white jerseys near him during the first shot, or more players in yellow jerseys during the second shot?

Options: (A) The first time   (B) Hard to tell   (C) The second time   (D) Same

Answer: A

Reasoning Sketch

QID 505 reasoning image
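The three demos above share a common multiple-choice shape: a question, four lettered options, and a single gold answer letter. A minimal sketch of exact-match scoring over such records, using only the demo questions shown here (the field names `qid`, `level`, and `answer` are illustrative assumptions, not the benchmark's official schema):

```python
# Minimal MCQ scoring sketch for VidNum-1.4K-style records.
# Field names ("qid", "level", "answer") are hypothetical, chosen for
# illustration; only the QIDs and gold letters come from the demos above.

demos = [
    {"qid": 1306, "level": 1, "answer": "D"},
    {"qid": 166,  "level": 2, "answer": "C"},
    {"qid": 505,  "level": 3, "answer": "A"},
]

def accuracy(records, predictions):
    """Exact-match accuracy of predicted option letters vs. gold answers."""
    correct = sum(predictions[r["qid"]] == r["answer"] for r in records)
    return correct / len(records)

# Hypothetical model outputs: right on QIDs 1306 and 505, wrong on 166.
preds = {1306: "D", 166: "B", 505: "A"}
print(f"accuracy = {accuracy(demos, preds):.3f}")
```

Levels could be scored separately by first filtering `records` on the `level` field, which is how a per-level breakdown would typically be reported.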

Manual Human Annotation Pipeline

Top half of manual annotation pipeline
Construction half only. We show the benchmark-construction part of the pipeline here; the evaluation protocol is omitted on this page.

Source Video Collection

Videos are curated under a two-tier topic hierarchy with five macro-categories: Knowledge, Life Record, Sports Competition, Artistic Performance, and Film & Television.

Creation and Primary Verification

Group A creates timestamp-grounded numerical questions with a strict visual-description-only policy. Group B independently solves and filters ambiguous or prior-based questions.

Quality Review and Final Audit

Group C refines timestamps and validates reasoning relevance; Group D performs independent final audit. This process finalizes 1,379 verified pairs.

Download Benchmark JSONL

Use the official benchmark JSONL file for reproducible VidNum-1.4K evaluation. Code is available on GitHub.
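Since the benchmark ships as JSONL (one JSON record per line), loading it for evaluation is straightforward. A hedged sketch, assuming one MCQ record per line; the example field names in the comment are assumptions about the layout, not the official schema:

```python
import json

def load_benchmark(path):
    """Read one JSON record per non-empty line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# A line in such a file might look like (illustrative fields only):
# {"qid": 1306, "level": 1, "question": "...",
#  "options": {"A": 1, "B": 4, "C": 2, "D": 3}, "answer": "D"}
```

Reading line by line rather than with a single `json.load` call is what makes the JSONL format convenient: records can be streamed, filtered by level, or sharded across workers without parsing the whole file.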