A Very Big Video Reasoning Suite
Key Summary
- This paper builds a gigantic library of video puzzles (VBVR) so AI can practice not just making pretty videos, but actually thinking through what happens over time.
- It also creates a fair, rule-based test kit (VBVR-Bench) that scores answers like a math teacher would, not like another AI guessing.
- VBVR covers 200 carefully designed tasks and over a million training videos—about 1,000 times bigger than past video reasoning datasets.
- The tasks are grouped by five brain-inspired skills: perception, spatiality, transformation, abstraction, and knowledge.
- Rule-based graders strongly agree with humans (Spearman’s rho > 0.9), so the scores are reproducible and trustworthy.
- Training an open-source model (Wan 2.2) on VBVR boosts its overall score from 0.371 to 0.685, a big step up over other models.
- Even the best trained model is still far from humans, showing that today’s video models struggle with long-term planning and keeping objects stable across frames.
- Scaling up data helps both familiar (in-domain) and brand-new (out-of-domain) tasks, but progress levels off, leaving a performance gap to close with better architectures.
- VBVR highlights an important idea: controllability before reasoning—if a model keeps the scene stable, its reasoning actions can be checked and improved.
- All data, tools, and models are publicly available, laying a foundation for future research in generalizable video reasoning.
Why This Research Matters
Video is how the real world unfolds—through time—so teaching AI to reason in videos makes it more useful in everyday life. With VBVR, models can learn reliable, checkable skills like rotating, sorting, navigating, and following multi-step instructions. This helps build safer home robots, better educational tools that show and verify steps, and smarter assistants for labs, sports, and design. Because the scoring is rule-based and human-aligned, results are trustworthy and reproducible, speeding up research. Perhaps most importantly, VBVR moves AI beyond pretty visuals to provable thinking in motion, which is essential for dependable, real-world systems.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you’re watching a Rube Goldberg machine video. A marble rolls, hits a lever, pops a balloon, and lights a candle. You understand not just each scene, but how one action causes the next across time.
🥬 Filling (The Actual Concept):
- What it is: Video reasoning means understanding and following cause-and-effect stories in moving pictures, not just recognizing what’s in a single frame.
- How it works:
- Notice what’s in the scene (objects, colors, shapes).
- Track who moves where and when.
- Connect events (“this made that happen”).
- Use that understanding to plan or predict the next steps.
- Why it matters: Without video reasoning, an AI can make a beautiful clip but won’t know if the ball actually reached the goal or if the key ever opened the door.
🍞 Bottom Bread (Anchor): In a maze video, the AI must move the green dot to the exit without walking through walls—this is reasoning, not just drawing.
- The World Before: For years, AI video models mainly focused on how nice videos look—textures, lighting, camera motion—like making a blockbuster trailer. But they weren’t trained to solve verifiable tasks, like “rotate the triangle 90 degrees around this point” or “grab the blue key, then open the blue door.” Benchmarks that existed were small, often had no training splits, and leaned on AI judges, which could be inconsistent.
- The Problem: Researchers wanted to know: If we train video models specifically on reasoning, do they really learn it? And do those skills transfer to new tasks they’ve never seen? Two big blockers stood in the way: (a) there wasn’t a large, diverse training set of reasoning videos, and (b) most evaluations weren’t strictly checkable—answers were judged by another model or by fuzzy rubrics.
🍞 Top Bread (Hook): You know how a good science fair needs lots of different experiments and clear rules about how to judge them?
🥬 Filling (The Actual Concept):
- What it is: The community lacked a big, varied set of video reasoning tasks and a fair, rule-based scoring system.
- How it works (the need):
- Many tasks across skills (like perception or planning) to avoid overfitting.
- Scalable generation so we can study how performance grows with more data.
- Verifiable scoring so two people (or two days) get the same answer.
- Why it matters: Without scale and clear scoring, we can’t run clean experiments to see what truly improves reasoning.
🍞 Bottom Bread (Anchor): If two baking contests use different recipes and one judge changes their mind daily, you can’t tell which cake recipe is actually better.
- Failed Attempts: Earlier benchmarks mostly tested models after the fact; they provided tiny test sets, few to no training samples, and often let another AI be the judge. That made it tough to (a) train for reasoning at scale, (b) reproduce results, and (c) compare models fairly.
- The Gap: The field needed a “very big” training source built from carefully designed tasks, plus a testing kit with strict, rule-based graders that match human preferences. It also needed a first look at what happens when you scale data for reasoning.
🍞 Top Bread (Hook): Think of a huge puzzle book with answer keys and a teacher’s guide that explains exactly how to grade each puzzle.
🥬 Filling (The Actual Concept):
- What it is: VBVR (Very Big Video Reasoning) fills this gap with 200 tasks and over a million videos, plus VBVR-Bench, a rule-based, human-aligned scoring kit.
- How it works:
- Tasks come from five brain-inspired buckets: perception, spatiality, transformation, abstraction, knowledge.
- Generators produce many variations per task (10,000+) with ground-truth solutions.
- Rule-based scorers check spatial accuracy, path validity, temporal consistency, and logic.
- Why it matters: Now researchers can train, test, and meaningfully measure progress on video reasoning.
🍞 Bottom Bread (Anchor): In “Key–Door Matching,” the grader checks: did the agent pick the right-colored key (30%), avoid walls with a valid path (30%), keep the path efficient (20%), and animate smoothly (20%).
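The weighted grading above can be sketched in a few lines. This is a minimal illustration, not the paper’s actual evaluator: the criterion names and the sub-scores fed in are hypothetical; only the 30/30/20/20 weights come from the task description.

```python
# Hypothetical weighted rule-based grader for Key–Door Matching.
# Weights follow the task description; sub-scores are assumed to be in [0, 1].
KEY_DOOR_WEIGHTS = {
    "target_identification": 0.30,  # picked the right-colored key
    "path_validity": 0.30,          # no wall collisions
    "path_efficiency": 0.20,        # path length vs. the optimal path
    "animation_quality": 0.20,      # smooth, stable animation
}

def grade(sub_scores: dict) -> float:
    """Combine per-criterion sub-scores into one deterministic score."""
    assert set(sub_scores) == set(KEY_DOOR_WEIGHTS), "missing criteria"
    return sum(KEY_DOOR_WEIGHTS[k] * sub_scores[k] for k in KEY_DOOR_WEIGHTS)

# A run that nails the key and path but animates roughly:
score = grade({
    "target_identification": 1.0,
    "path_validity": 1.0,
    "path_efficiency": 0.8,
    "animation_quality": 0.5,
})
print(round(score, 2))  # 0.86
```

Because the grader is a fixed weighted sum of checkable sub-scores, rerunning it on the same video always yields the same number.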
- Real Stakes: Better video reasoning helps with tutoring (showing and checking steps), robotics (following procedures safely), education (visual math and logic), sports analysis, and science demos (verifying causal stories in experiments). It pushes AI from “pretty pictures” to “provable thinking in motion.”
🍞 Top Bread (Hook): You know how the brain has different jobs—seeing, remembering, planning—yet works together?
🥬 Filling (The Actual Concept) — Cognitive Architecture:
- What it is: A brain-inspired map for organizing tasks into five skills: perception, spatiality, transformation, abstraction, and knowledge.
- How it works:
- Perception: detect what’s there.
- Spatiality: know where things are and how to navigate.
- Transformation: mentally change or move things.
- Abstraction: find rules and patterns.
- Knowledge: use learned facts and concepts.
- Why it matters: Without this structure, tasks would be a random pile; with it, we can diagnose strengths and weaknesses.
🍞 Bottom Bread (Anchor): Sorting shapes by size (perception) is different from planning a shortest path in a maze (spatiality) or rotating a shape around a point (transformation). VBVR has dedicated tasks for each.
02 Core Idea
🍞 Top Bread (Hook): Imagine training for a triathlon: you need swimming, biking, and running, and a stopwatch with clear rules to know if you’re improving.
🥬 Filling (The Actual Concept) — VBVR Dataset:
- What it is: A gigantic, carefully designed set of video puzzles (2,015,000 images; 1,007,500 videos) across 200 tasks that teach and test reasoning.
- How it works:
- Experts design tasks under five skills with clear, unique solutions.
- Parameterized generators create 10,000+ variations per task.
- Each sample includes the first frame + prompt (inputs) and the ground-truth trajectory (how to solve it), enabling learning the steps, not just the final answer.
- Why it matters: Scale plus structure lets models practice reasoning in many ways and reveals how performance grows with more data.
🍞 Bottom Bread (Anchor): A task might ask: “Rotate the red triangle 90° around the marked dot.” The dataset provides thousands of such variations and the exact correct motion path.
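A ground-truth motion path for such a rotation task is just geometry. The sketch below is an assumption about how one could compute it (frame count and point coordinates are made up for illustration):

```python
import math

def rotate_point(p, pivot, angle_deg):
    """Rotate point p around pivot by angle_deg (counter-clockwise)."""
    a = math.radians(angle_deg)
    dx, dy = p[0] - pivot[0], p[1] - pivot[1]
    return (pivot[0] + dx * math.cos(a) - dy * math.sin(a),
            pivot[1] + dx * math.sin(a) + dy * math.cos(a))

def rotation_path(p, pivot, angle_deg, frames=4):
    """Ground-truth trajectory: the point's position at each frame."""
    return [rotate_point(p, pivot, angle_deg * t / frames)
            for t in range(frames + 1)]

path = rotation_path((1.0, 0.0), (0.0, 0.0), 90, frames=3)
# The point (1, 0) rotated 90° about the origin ends at (0, 1).
print(tuple(round(c, 6) for c in path[-1]))  # (0.0, 1.0)
```

Every frame in between is determined too, which is what lets the dataset supervise the process, not just the final frame.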
🍞 Top Bread (Hook): Think of a science test with answer keys that anyone can re-check at home.
🥬 Filling (The Actual Concept) — VBVR-Bench:
- What it is: A fair, rule-based evaluation kit that scores videos the same way every time and aligns with human judgments (rho > 0.9).
- How it works:
- Dual split: 50 in-domain (familiar task families, new settings) + 50 out-of-domain (entirely new tasks).
- Each task has a dedicated, rule-based scorer (e.g., positions, paths, timing, logic).
- Scores combine sub-criteria like spatial accuracy, path efficiency, temporal consistency, and logical validity.
- Why it matters: No more “AI-as-judge” fuzziness; results are reproducible and explainable.
🍞 Bottom Bread (Anchor): In the Key–Door task, the grader checks the exact path against the optimal BFS path and verifies the correct key–door color match—like a math proof checker.
- The “Aha!” Moment (one sentence): If we give video models a massive, well-structured practice set and grade them with strict, human-like rules, they begin to show real, transferable video reasoning skills.
- Multiple Analogies:
- Sports: Training drills (dataset) + referees with rulebooks (bench) → better gameplay (reasoning).
- Cooking: Recipe variations (dataset) + taste-test rubric (bench) → reliably good meals (consistent reasoning).
- Music: Scales and etudes (dataset) + metronome and tuner (bench) → cleaner performances (precise control over time).
- Before vs After:
- Before: Models impressed with visuals but broke logic or changed the scene mid-way, making steps unverifiable.
- After: Models trained on VBVR hold scenes steady and perform targeted edits—“controllability before reasoning”—so their actions can be checked and improved.
- Why It Works (intuition, no equations): Reasoning in video needs two things at once: (a) stable scenes so actions are meaningful, and (b) clear objectives with verifiable steps. Massive, varied practice builds reusable “primitives” (like rotate, move, align), while rule-based scoring rewards precise execution and penalizes drift. Over time, the model learns to keep identity consistent and follow constraints.
- Building Blocks:
- Task families mapped to five skills (coverage).
- Deterministic generators (scalable variety).
- First-frame + prompt inputs with full ground-truth trajectories (learn the process).
- Dual ID/OOD splits (measure transfer, not memorization).
- Rule-based scorers aligned with humans (trustworthy measurement).
🍞 Bottom Bread (Anchor): After VBVR training, Wan 2.2 cleanly deletes just the marked symbol (no extras), rotates around the right pivot, and moves a book into the correct shelf slot—precise, checkable steps that were shaky before.
03 Methodology
At a high level: Input (first frame + prompt) → Model generates a solution video → VBVR-Bench’s rule-based scorer checks spatial/temporal/logic details → Final score, with category breakdowns.
Step-by-step recipe:
- Designing the Tasks (grounded in five skills)
- What happens: Experts propose tasks that probe perception, spatiality, transformation, abstraction, or knowledge, each with clear goals and unique, checkable answers.
- Why it exists: If tasks are vague or solvable from one static picture, you can’t test real video reasoning.
- Example data: A maze task shows start, walls, keys, and doors in the first frame; the prompt states the rules (e.g., get the matching key before the door).
- Implementing Generators
- What happens: Each approved task becomes a parameterized generator that can create 10,000+ diverse instances. It outputs: firstframe.png, prompt.txt, finalframe.png, and groundtruth.mp4.
- Why it exists: Diversity prevents memorization and enables scaling studies; ground-truth videos teach the model how to solve, not just what the answer is.
- Example data: A “rotate-around-point” generator varies shape type, size, angle, pivot, and background, and always produces the exact correct rotation path.
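A parameterized generator of this kind can be sketched as a seeded function. This is a toy version under assumed parameter ranges (the shape list, size bounds, and angle choices here are illustrative, not the paper’s):

```python
import random

SHAPES = ["triangle", "square", "star"]  # hypothetical shape vocabulary

def make_instance(seed: int) -> dict:
    """One deterministic task instance: the same seed always yields
    the same puzzle, so its ground-truth solution can be re-derived."""
    rng = random.Random(seed)  # a private RNG keyed on the seed
    return {
        "shape": rng.choice(SHAPES),
        "size": rng.randint(20, 80),            # pixels (assumed range)
        "angle": rng.choice([45, 90, 135, 180]),
        "pivot": (rng.randint(0, 255), rng.randint(0, 255)),
        "seed": seed,
    }

# Re-running with the same seed reproduces the instance exactly:
assert make_instance(7) == make_instance(7)
print(len([make_instance(s) for s in range(3)]))  # 3
```

Sweeping the seed then yields as many distinct-but-reproducible instances as needed, which is what makes 10,000+ variations per task cheap to produce.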
- Distributed Production
- What happens: Cloud workers (like many bakers in a kitchen) generate millions of samples in parallel, with automatic quality checks (e.g., solvable, clear visuals, no boundary glitches).
- Why it exists: Only at this scale can we study how performance changes as data grows.
- Example data: 1,000,000 training samples across 100 training tasks; 7,500 test samples across 150 test tasks; disjoint random seeds avoid leakage.
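The “disjoint random seeds” point is simple but important; one plausible way to enforce it (the exact scheme is an assumption, not from the paper) is to partition the seed space up front:

```python
def split_seeds(n_total: int, n_test: int):
    """Carve the seed space into disjoint train/test ranges so no
    generated instance can appear on both sides (no leakage)."""
    test_seeds = set(range(n_test))
    train_seeds = set(range(n_test, n_total))
    return train_seeds, test_seeds

train, test = split_seeds(1000, 100)
print(len(train & test))  # 0 -> no seed is shared, hence no leakage
```

Since each seed deterministically fixes one instance, disjoint seed sets guarantee disjoint datasets without comparing millions of samples pairwise.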
🍞 Top Bread (Hook): You know how you practice both familiar problems and brand-new ones to see if you really understand the idea, not just the example?
🥬 Filling (The Actual Concept) — In-Domain and Out-of-Domain Generalization:
- What it is: Testing on familiar task types with new settings (ID) and on entirely new task types (OOD) to check transfer.
- How it works:
- ID: same family, unseen parameters (e.g., new maze layouts).
- OOD: new families entirely (e.g., from mazes to domino logic).
- Why it matters: If a model only aces ID but flops on OOD, it memorized patterns instead of learning general reasoning.
🍞 Bottom Bread (Anchor): A student who can solve only the exact worksheet problems (ID) hasn’t mastered the skill; handling new but related problems (OOD) shows real understanding.
- Scoring with Rule-Based Evaluators
- What happens: Each task has a specific scorer that measures multiple aspects (spatial accuracy, path validity, temporal smoothness, logical correctness) and combines them with weights.
- Why it exists: This makes results deterministic and explainable; anyone can re-run the scorer and get the same score.
- Example data: Key–Door Matching: 30% target identification, 30% path validity (no wall collisions), 20% path efficiency (compare to optimal BFS), 20% animation quality.
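The “compare to optimal BFS” sub-score can be made concrete. The sketch below assumes a tiny grid maze (0 = free, 1 = wall) and a hypothetical efficiency formula (optimal length over agent length, capped at 1.0); the real scorer’s details may differ:

```python
from collections import deque

def bfs_shortest(grid, start, goal):
    """Optimal path length on a grid maze, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    q = deque([start])
    while q:
        r, c = q.popleft()
        if (r, c) == goal:
            return dist[(r, c)]
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                q.append((nr, nc))
    return None

def path_efficiency(agent_path, grid, start, goal):
    """Hypothetical sub-score: optimal steps / agent steps, capped at 1."""
    opt = bfs_shortest(grid, start, goal)
    return min(1.0, opt / max(1, len(agent_path) - 1))

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
# The agent routes around the wall exactly along the 5-step optimum:
agent = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1)]
print(path_efficiency(agent, grid, (0, 0), (2, 1)))  # 1.0
```

Any detour lengthens the agent’s path and pushes the sub-score below 1.0, so wasted motion is penalized deterministically.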
🍞 Top Bread (Hook): It’s like having a stopwatch, a tape measure, and a judge’s checklist for a science fair.
🥬 Filling (The Actual Concept) — Task-Specific Scorers:
- What it is: Customized graders that check exactly what matters for each puzzle.
- How it works:
- Encode the puzzle’s rules (e.g., which colors must match).
- Compare generated trajectories with ground truth.
- Aggregate sub-scores into a final result.
- Why it matters: A one-size-fits-all grader would miss important details; task-specific checks catch the real mistakes.
🍞 Bottom Bread (Anchor): In a rotation task, the scorer verifies the pivot point stayed fixed and the shape reached the correct angle—no sliding or stretching allowed.
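The rotation check above (“pivot stayed fixed, correct angle, no sliding or stretching”) amounts to two geometric tests on the trajectory. This is a hedged sketch of one way to write such a scorer, tracking a single reference point; tolerances are arbitrary:

```python
import math

def check_rotation(path, pivot, target_deg, tol=1e-6):
    """Verify a trajectory is a rigid rotation about `pivot` that ends
    at `target_deg`: constant radius (no sliding) and correct angle."""
    r0 = math.dist(path[0], pivot)
    if any(abs(math.dist(p, pivot) - r0) > tol for p in path):
        return False  # the point slid toward/away from the pivot
    a0 = math.atan2(path[0][1] - pivot[1], path[0][0] - pivot[0])
    a1 = math.atan2(path[-1][1] - pivot[1], path[-1][0] - pivot[0])
    turned = math.degrees(a1 - a0) % 360
    return abs(turned - target_deg % 360) < 1e-3

quarter = math.radians(45)
good = [(1, 0), (math.cos(quarter), math.sin(quarter)), (0, 1)]
print(check_rotation(good, (0, 0), 90))              # True
print(check_rotation([(1, 0), (0, 2)], (0, 0), 90))  # False: radius grew
```

A full scorer would run this for several points on the shape (so stretching is caught too), but the principle is the same: the rules of the motion are checked, not its looks.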
🍞 Top Bread (Hook): When humans and the scoreboard agree on who won, you trust the game.
🥬 Filling (The Actual Concept) — Human-Aligned Scorers:
- What it is: Scorers whose rankings match human preferences very closely (Spearman’s rho > 0.9).
- How it works:
- Collect pairwise human preferences between model outputs.
- Compare to automatic rankings from rule-based scores.
- High correlation = trustworthy evaluation.
- Why it matters: If the automatic score disagreed with people, we couldn’t rely on it to guide research.
🍞 Bottom Bread (Anchor): If humans pick Model A over Model B in most matchups, and the scorer says the same, we’ve got a fair referee.
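Spearman’s rho itself is easy to compute: rank both score lists, then take the Pearson correlation of the ranks. The sketch below uses made-up human and scorer numbers purely to illustrate the calculation:

```python
def rank(xs):
    """1-based ranks, averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over a run of ties
        avg = (i + j) / 2 + 1           # average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(a, b):
    """Pearson correlation of the two rank vectors."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

human  = [0.9, 0.7, 0.4, 0.8, 0.2]    # hypothetical human scores
scorer = [0.85, 0.5, 0.6, 0.75, 0.1]  # hypothetical automatic scores
print(round(spearman_rho(human, scorer), 2))  # 0.9
```

Here the automatic scorer flips one adjacent pair relative to the humans and still lands at rho = 0.9; a rho above 0.9 across many matchups is what justifies trusting the automatic referee.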
- Training Protocol (VBVR–Wan 2.2)
- What happens: Start with Wan 2.2 (I2V), apply LoRA on key transformer modules, train one epoch per experiment at different data scales (0K to 500K) to chart scaling curves.
- Why it exists: To isolate the effect of data scale on reasoning without changing the architecture.
- Example data: Overall score rises from 0.371 (0K) to ~0.685 (500K), with ID and OOD improving but OOD consistently lower.
The Secret Sauce:
- Standardized, parameterized generators ensure massive, clean variety.
- Rule-based, human-aligned scorers make progress measurable and reproducible.
- Dual ID/OOD splits reveal whether models truly learned reasoning, not just task styles.
- Emphasis on controllability makes step-by-step edits verifiable, which is essential for reasoning in motion.
04 Experiments & Results
- The Test: Models had to solve 100 diverse tasks under two conditions—In-Domain (ID: familiar task families with new settings) and Out-of-Domain (OOD: brand-new task families). Scores measured spatial accuracy, path validity/efficiency, temporal consistency, and logical correctness, then averaged within five skill areas.
- The Competition: The study compared eight leading systems: four proprietary (Sora 2, Veo 3.1, Runway Gen-4 Turbo, Kling 2.6) and four open-source (Wan 2.2, CogVideoX-1.5, HunyuanVideo, LTX-2). Humans served as an upper bound.
- The Scoreboard (with context):
- Human: ~0.974 (ID) and ~0.763 (OOD) — the gold standard.
- Open-source base: Wan 2.2 at 0.371 overall — stronger than other open-source models (0.27–0.31 range), but with lots of room to grow.
- Proprietary leaders: Sora 2 at 0.546 and Veo 3.1 at 0.480 — clearly above open-source baselines, especially in abstraction and transformation.
- VBVR-trained model: VBVR–Wan 2.2 at 0.685 — a big jump over its base (+84.6% relative improvement), setting a new state of the art across all five skills, especially perception and spatiality. Think of this like moving from a B- to a strong A- when peers hover around C+ to B-.
Category insights:
- Perception & Spatiality: Big gains after VBVR training, reflecting better scene stability and navigation.
- Abstraction & Transformation: Improved but still challenging—combining precise edits with long-horizon plans remains tough.
- Surprising Findings:
- Controllability before reasoning: VBVR training nudged the model to keep scenes stable and perform minimal, correct edits (e.g., delete exactly the marked symbol), avoiding side effects seen in stronger proprietary models.
- Emergent behaviors: On unseen tasks, the trained model often used a self-chosen, consistent policy (like smooth “fade-in” completions) and showed signs of multi-step planning (understand → act → adjust), even if not perfectly faithful to the gold process.
- Persistent generalization gap: As data scaled from 0K to 500K, both ID and OOD improved (ID ~0.412 → ~0.760; OOD ~0.329 → ~0.610), but OOD stayed roughly 0.15 points lower, suggesting data alone won’t close the gap.
- Plateauing: Performance rises then flattens around 300K–500K samples, pointing to architectural limits (e.g., identity stability over long horizons, cumulative noise, and process faithfulness).
VBench++ sanity check:
- After VBVR training, core video quality stayed strong; camera-motion consistency improved, and dynamics got more disciplined (less unnecessary motion). This matches the theme: better control supports better reasoning.
05 Discussion & Limitations
Limitations:
- Long-horizon consistency: The model can still duplicate or flicker agents and drift over long sequences.
- Process faithfulness: It can reach the right final answer by an incorrect or unfaithful method (e.g., skipping explicit try-and-check steps), which current scorers only partly capture.
- Generalization gap: OOD remains notably lower than ID even with lots of data.
- Architecture constraints: Diffusion + transformers without explicit state tracking struggle to preserve identity and logic across many frames.
Required Resources:
- Data: Access to the VBVR Dataset (1M+ samples) and generators.
- Compute: Training with LoRA is lighter than full finetuning, but scaling studies still need significant GPU time.
- Tooling: VBVR-Bench evaluators, task-specific scorers, and the data factory for adding new tasks.
When NOT to Use:
- Pure cinematography tests where artistry matters more than rule-following.
- Non-verifiable tasks (open-ended storytelling) where there isn’t a single checkable outcome.
- Extremely precise physics simulations beyond the rendering assumptions of the generators.
Open Questions:
- How to encode explicit state/memory so objects keep identity perfectly across hundreds of frames?
- Can we train models to match not just end results but the correct intermediate reasoning steps (process supervision)?
- What architectural changes (structured slots, symbolic planners, self-correction loops) best reduce the OOD gap?
- How do the five skills co-develop—can we boost one without hurting another, and why do some pairs trade off?
- What curriculum or compositional task design best encourages reusable reasoning primitives?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces VBVR, a massive suite of video reasoning tasks and a strict, human-aligned evaluation kit that makes progress measurable and reproducible. Training on VBVR greatly improves controllable, verifiable reasoning in video models and reveals how performance scales with more data. Yet, even with big gains, a sizable gap to human ability remains, especially for long-horizon identity and process faithfulness.
Main achievement: Turning video generation into video reasoning practice at scale—by pairing 200 brain-inspired tasks and over a million examples with rule-based scorers that strongly match human judgment.
Future directions: Add richer compositional tasks, tighten process-level supervision, and explore architectures with explicit state tracking and self-correction to close the ID–OOD gap. Study capability trade-offs among the five skills to design better curricula and modules.
Why remember this: VBVR shifts the field from “pretty videos” to “provable thinking in motion,” giving the community the data, tools, and first scaling study needed to build generalizable video reasoners.
Practical Applications
- Train video models to follow precise editing commands (move, rotate, delete) without disturbing the rest of the scene.
- Evaluate new video architectures with VBVR-Bench to get reproducible, human-aligned scores.
- Run scaling studies (50K → 500K samples) to find data-efficiency sweet spots and where gains plateau.
- Design curricula by mixing tasks across the five skills to build balanced reasoning abilities.
- Diagnose model weaknesses (e.g., long-horizon identity drift) using category-wise scores and targeted tasks.
- Add new task generators to the VBVR data factory to expand coverage and study compositional generalization.
- Use process-focused training (compare to ground-truth trajectories) to improve step-by-step faithfulness.
- Benchmark commercial vs. open-source models on the same, rule-based suite for fair comparisons.
- Tune controllability (scene stability) before advanced reasoning to make actions verifiable.
- Adopt VBVR tasks in education (e.g., visual math/logical puzzles) to build and test student-friendly AI tutors.