We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with each model showing distinct strengths and weaknesses across the four reasoning types. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
Model-level overall and per-dimension performance on V-ReasonBench. Left: pass@5 scores for each model, computed within each dimension. Right: summary of V-ReasonBench performance across video models. The figure illustrates how six video generation models perform on 13 reasoning tasks, with scores rescaled within each dimension to enable direct comparison.
Overview of reasoning dimensions, tasks, and number of videos in V-ReasonBench. Each instance is an initial–final image pair, with each model generating five videos for pass@5 evaluation.
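As a rough illustration of the pass@5 protocol and the within-dimension rescaling mentioned above, the sketch below computes a per-dimension pass@5 score from per-video binary verdicts and then rescales scores across models. The data layout, function names, and the choice of min-max rescaling are assumptions made for illustration, not the benchmark's released code.

from collections import defaultdict

def pass_at_5(results):
    """results: list of dicts such as
    {"dimension": "spatial", "verdicts": [True, False, False, False, False]},
    where "verdicts" holds the binary outcome of each of the five generated videos.
    Returns, per dimension, the fraction of instances with at least one passing video."""
    per_dim = defaultdict(list)
    for inst in results:
        per_dim[inst["dimension"]].append(any(inst["verdicts"]))
    return {dim: sum(hits) / len(hits) for dim, hits in per_dim.items()}

def rescale_within_dimension(model_scores):
    """model_scores: {model_name: score} for a single dimension.
    Min-max rescaling to [0, 1] is an assumption here, used only to make
    models directly comparable within one dimension."""
    lo, hi = min(model_scores.values()), max(model_scores.values())
    span = (hi - lo) or 1.0
    return {m: (s - lo) / span for m, s in model_scores.items()}

# Toy usage: two instances in one dimension, then rescaling across three models.
print(pass_at_5([
    {"dimension": "spatial", "verdicts": [False, False, True, False, False]},
    {"dimension": "spatial", "verdicts": [False] * 5},
]))
print(rescale_within_dimension({"model_a": 0.4, "model_b": 0.9, "model_c": 0.6}))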
Example failure cases from several reasoning tasks, illustrating the limitations of VLM-based automatic evaluation. Although the underlying rules are simple, the VLM incorrectly assesses the models' outputs due to difficulties in recognizing small grid cells and fine structural differences.
Human-alignment validation of our benchmark's scoring pipeline. Each point compares binary Pass/Fail decisions from the automatic evaluation with human judgments across four reasoning categories.
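For concreteness, a hypothetical sketch of the human-alignment computation is given below: per-category agreement between the automatic verdicts and human labels. The record format and function name are illustrative assumptions rather than the paper's evaluation code.

from collections import defaultdict

def agreement_by_category(records):
    """records: list of (category, auto_pass, human_pass) tuples with boolean verdicts.
    Returns the fraction of instances per category where the automatic verdict
    matches the human judgment."""
    totals, matches = defaultdict(int), defaultdict(int)
    for category, auto_pass, human_pass in records:
        totals[category] += 1
        matches[category] += int(auto_pass == human_pass)
    return {c: matches[c] / totals[c] for c in totals}

# Toy usage: two structured-reasoning instances (one disagreement) and one physical instance.
print(agreement_by_category([("structured", True, True),
                             ("structured", False, True),
                             ("physical", True, True)]))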
Qualitative video examples are shown for six models: Sora-2, Veo-3.1, Hailuo-02, Kling-2.5-Turbo-Pro, Vidu-Q2, and Seedance-1.0-Lite.
Examples from the Seedance-1.0-Lite and Vidu-Q2 models on the Tic-Tac-Toe task. The models introduce additional decorative patterns across the mirrored axis, illustrating their tendency to enrich visual appearance rather than preserve the original geometric form.
Effect of video duration on reasoning outcomes of Sora-2 in the Chain-of-Frames setting. Each row compares model generations with different "thinking" durations for tasks such as Sudoku and Rule Following. Although longer durations correspond to longer reasoning processes (4s vs. 8s, 5s vs. 10s), the resulting outputs do not consistently improve.
Comparison between Veo-3.1 (video model) and NanoBanana (image model) on Block Sliding (left) and Code Execution (right). Video models leverage the Chain-of-Frames process to simulate intermediate states, enabling stronger performance on tasks that require causal or temporal reasoning, although intermediate transitions may still appear physically inconsistent. Image models provide clean and stable outputs and excel at text-based tasks such as code execution, but their single-frame reasoning limits their ability to capture the underlying physical dynamics.
Examples of hallucinations in the Chain-of-Frames reasoning process. Each row shows a trajectory from input to output, where the final frame is correct but intermediate frames display unrealistic or physically inconsistent transitions.
@misc{luo2025vreasonbench,
    title={V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models},
    author={Yang Luo and Xuanlei Zhao and Baijiong Lin and Lingting Zhu and Liyao Tang and Yuqi Liu and Ying-Cong Chen and Shengju Qian and Xin Wang and Yang You},
    year={2025},
    eprint={2511.16668},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2511.16668},
}