We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with each model showing distinct strengths and weaknesses across the four reasoning types. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
Model-level overall and per-dimension performance on V-ReasonBench. Left: pass@5 scores for each model, computed within each dimension. Right: summary of V-ReasonBench performance across video models. The figure illustrates how six video generation models perform on 13 reasoning tasks, with scores rescaled within each dimension to enable direct comparison.
Overview of reasoning dimensions, tasks, and number of videos in V-ReasonBench. Each instance is an initial–final image pair, with each model generating five videos for pass@5 evaluation.
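As a rough illustration of the pass@5 protocol and the within-dimension rescaling mentioned above, the sketch below computes a per-dimension pass@5 score from per-video binary verdicts and then rescales scores across models. The data layout, function names, and the choice of min-max rescaling are assumptions made for illustration, not the benchmark's released code.

from collections import defaultdict

def pass_at_5(results):
    """results: list of dicts such as
    {"dimension": "spatial", "verdicts": [True, False, False, False, False]},
    where "verdicts" holds the binary outcome of each of the five generated videos.
    Returns, per dimension, the fraction of instances with at least one passing video."""
    per_dim = defaultdict(list)
    for inst in results:
        per_dim[inst["dimension"]].append(any(inst["verdicts"]))
    return {dim: sum(hits) / len(hits) for dim, hits in per_dim.items()}

def rescale_within_dimension(model_scores):
    """model_scores: {model_name: score} for a single dimension.
    Min-max rescaling to [0, 1] is an assumption here, used only to make
    models directly comparable within one dimension."""
    lo, hi = min(model_scores.values()), max(model_scores.values())
    span = (hi - lo) or 1.0
    return {m: (s - lo) / span for m, s in model_scores.items()}

# Toy usage: two instances in one dimension, then rescaling across three models.
print(pass_at_5([
    {"dimension": "spatial", "verdicts": [False, False, True, False, False]},
    {"dimension": "spatial", "verdicts": [False] * 5},
]))
print(rescale_within_dimension({"model_a": 0.4, "model_b": 0.9, "model_c": 0.6}))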
Example failure cases from several reasoning tasks, illustrating the limitations of VLM-based automatic evaluation. Although the underlying rules are simple, the VLM incorrectly assesses the models' outputs due to difficulties in recognizing small grid cells and fine structural differences.
Human-alignment validation of our benchmark's scoring pipeline. Each point compares binary Pass/Fail decisions from the automatic evaluation with human judgments across four reasoning categories.
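For concreteness, a hypothetical sketch of the human-alignment computation is given below: per-category agreement between the automatic verdicts and human labels. The record format and function name are illustrative assumptions rather than the paper's evaluation code.

from collections import defaultdict

def agreement_by_category(records):
    """records: list of (category, auto_pass, human_pass) tuples with boolean verdicts.
    Returns the fraction of instances per category where the automatic verdict
    matches the human judgment."""
    totals, matches = defaultdict(int), defaultdict(int)
    for category, auto_pass, human_pass in records:
        totals[category] += 1
        matches[category] += int(auto_pass == human_pass)
    return {c: matches[c] / totals[c] for c in totals}

# Toy usage: two structured-reasoning instances (one disagreement) and one physical instance.
print(agreement_by_category([("structured", True, True),
                             ("structured", False, True),
                             ("physical", True, True)]))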
Qualitative video examples are shown for six models: Sora-2, Veo-3.1, Hailuo-02, Kling-2.5-Turbo-Pro, Vidu-Q2, and Seedance-1.0-Lite.
Examples from the Seedance-1.0-Lite and Vidu-Q2 models on the Tic-Tac-Toe task. The models introduce additional decorative patterns across the mirrored axis, illustrating their tendency to enrich visual appearance rather than preserve the original geometric form.
Effect of video duration on reasoning outcomes of Sora-2 in the Chain-of-Frames setting. Each row compares model generations with different "thinking" durations for tasks such as Sudoku and Rule Following. Although longer durations correspond to longer reasoning processes (4s vs. 8s, 5s vs. 10s), the resulting outputs do not consistently improve.
Comparison between Veo-3.1 (video model) and NanoBanana (image model) on Block Sliding (left) and Code Execution (right). Video models leverage the Chain-of-Frames process to simulate intermediate states, enabling stronger performance on tasks that require causal or temporal reasoning, although intermediate transitions may still appear physically inconsistent. Image models provide clean and stable outputs and excel at text-based tasks such as code execution, but their single-frame reasoning limits their ability to capture the underlying physical dynamics.
Examples of hallucinations in the Chain-of-Frames reasoning process. Each row shows a trajectory from input to output, where the final frame is correct but intermediate frames display unrealistic or physically inconsistent transitions.
@misc{luo2025vreasonbench,
    title={V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models},
    author={Yang Luo and Xuanlei Zhao and Baijiong Lin and Lingting Zhu and Liyao Tang and Yuqi Liu and Ying-Cong Chen and Shengju Qian and Xin Wang and Yang You},
    year={2025},
    eprint={2511.16668},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2511.16668},
}