Benchmark Project Page A Hierarchical Video Counting and Numerical Reasoning Benchmark


VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning

VidNum-1.4K is a strictly human-annotated benchmark with 1,379 video-question pairs across diverse real-world, documentary, and virtual-world videos. It evaluates capability progressively, from direct counting to compositional numerical reasoning, through three task levels: Level 1, Level 2, and Level 3.

Shaoyang Cui* Lingbei Meng*

Tsinghua University | Shenzhen Loop Area Institute

* Both authors contributed equally to this research.


About

About VidNum-1.4K

**Figure 2.** Distribution of video topics, durations, and question composition in VidNum-1.4K.

Our Dataset

1,379 MCQs

Each question is grounded in its own video clip, 5 to 120 seconds long, and is strictly human-annotated and verified.

Three Counting Targets

The benchmark covers Object (647), Action (343), and Event (389) to test broad numerical video reasoning.
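As a quick sanity check, the three target counts above add up to the 1,379 questions reported for the benchmark. The short Python sketch below recomputes the total and each target's share; the variable names are illustrative and not part of any released tooling.

```python
# Counts per counting target, as stated on this page.
target_counts = {"Object": 647, "Action": 343, "Event": 389}

# The total should match the 1,379 video-question pairs reported above.
total = sum(target_counts.values())

# Share of each target as a percentage of all questions, rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in target_counts.items()}

print(total)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

Object questions make up a little under half of the benchmark, with Action and Event questions splitting the remainder.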

Diverse Sources

Videos span real-world, documentary/educational, and virtual-world content with high heterogeneity.

Level 1

Homogeneous Counting

Count a single type of object/action/event under minimal constraints. This tests stable temporal tracking and object permanence.

Level 2

Constrained and Heterogeneous Counting

Count multiple entity types or constrained attributes, often across multi-shot videos requiring robust re-identification.

Level 3

Compositional Numerical Reasoning

Perform temporally grounded arithmetic/comparison operations, moving from perception to multi-step logical deduction.

Demo Questions

Three-Level Benchmark Questions

Level 1 Demo | QID = 1306

How many gray-green doors appeared in total in this video?

Options: (A) 1 (B) 4 (C) 2 (D) 3

Answer: D


Level 2 Demo | QID = 166

How many different dogs appeared in the video?

Options: (A) 9 (B) 8 (C) 7 (D) 10

Answer: C


Level 3 Demo | QID = 505

A short-haired man in a red jersey shoots twice near the goal. Were there more players in white jerseys near him during the first shot, or more players in yellow jerseys during the second shot?

Options: (A) The first time. (B) Hard to tell (C) The second time (D) Same.

Answer: A
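Because every item is a four-option multiple-choice question with a single answer letter, per-level accuracy is a natural way to score a model. The sketch below is a minimal illustration in Python. The record fields (`qid`, `level`, `answer`), the helper `accuracy_by_level`, and the example predictions are all assumptions made for this example; the released JSONL may use different field names, and the official evaluation protocol is not described on this page.

```python
from collections import defaultdict

# Hypothetical record layout built from the three demo questions above.
# The real JSONL schema may differ.
demo_items = [
    {"qid": 1306, "level": 1, "answer": "D"},
    {"qid": 166, "level": 2, "answer": "C"},
    {"qid": 505, "level": 3, "answer": "A"},
]


def accuracy_by_level(items, predictions):
    """Return {level: fraction correct}. A missing prediction counts as wrong."""
    correct = defaultdict(int)
    seen = defaultdict(int)
    for item in items:
        seen[item["level"]] += 1
        # Normalize the predicted letter so "b " and "B" compare equal.
        pred = predictions.get(item["qid"], "").strip().upper()
        if pred == item["answer"]:
            correct[item["level"]] += 1
    return {level: correct[level] / seen[level] for level in sorted(seen)}


# Made-up model outputs, for illustration only (not real results).
example_predictions = {1306: "D", 166: "b", 505: "A"}
print(accuracy_by_level(demo_items, example_predictions))
```

Reporting accuracy separately for each level, rather than as one pooled number, keeps the progression from direct counting to compositional reasoning visible.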


Data Construction

Manual Human Annotation Pipeline

**Top Half Only.** We show only the construction part of the pipeline here; the evaluation protocol is omitted on this page.

Source Video Collection

Videos are curated under a two-tier topic hierarchy with five macro-categories: Knowledge, Life Record, Sports Competition, Artistic Performance, and Film & Television.

Creation and Primary Verification

Group A creates timestamp-grounded numerical questions under a strict visual-description-only policy. Group B independently solves each question and filters out those that are ambiguous or prior-based.
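The two-group check described above can be pictured as a simple filter over candidate questions. The Python sketch below is only one possible reading. The `Review` record, its fields, and the rule that Group B's answer must agree with Group A's are assumptions made for illustration; the page states only that Group B solves questions independently and filters ambiguous or prior-based ones.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """Hypothetical record of one Group B review (not the authors' schema)."""
    qid: int
    author_answer: str   # answer letter given by the question writer (Group A)
    solver_answer: str   # answer letter from the independent solver (Group B)
    ambiguous: bool      # solver judged the question ambiguous
    prior_based: bool    # solver judged the question prior-based


def passes_primary_verification(review: Review) -> bool:
    # Assumed rule: keep a question only if it is neither ambiguous nor
    # prior-based and the two groups arrive at the same answer.
    return (
        not review.ambiguous
        and not review.prior_based
        and review.solver_answer == review.author_answer
    )


# Made-up reviews, for illustration only.
reviews = [
    Review(qid=1, author_answer="B", solver_answer="B", ambiguous=False, prior_based=False),
    Review(qid=2, author_answer="A", solver_answer="C", ambiguous=False, prior_based=False),
    Review(qid=3, author_answer="D", solver_answer="D", ambiguous=False, prior_based=True),
]
kept = [r.qid for r in reviews if passes_primary_verification(r)]
print(kept)
```

Having a second group solve each question blind to the intended answer exposes disagreements that point to ambiguous wording or counts.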