
From Perception to Action: An Interactive Benchmark for Vision Reasoning

Beginner
Yuhao Wu, Maojia Song, Yihuai Lan et al. · 2/24/2026
arXiv

Key Summary

  • The paper introduces CHAIN, a hands-on 3D playground that tests if AI can not only see objects but also plan and act under real physics.
  • Unlike old tests that ask one question about a single picture, CHAIN makes models try multi-step tasks like unlocking wooden puzzles and packing shapes tightly into a box.
  • CHAIN runs in a physics engine, so moves must obey gravity, collisions, and support—no magic teleports or ghosting through parts.
  • The benchmark tracks not just whether a model finishes, but also how many extra steps it wastes and how many tokens (and dollars) it spends to get there.
  • State-of-the-art models do much better on stacking blocks than on taking apart interlocking puzzles, showing big gaps in structure-aware reasoning.
  • Even strong models often fail to turn what they see into a reliable long plan, especially when early choices shrink future options.
  • One-shot (single-image, no feedback) solving performs far worse than interactive attempts, showing the value of closed-loop trial, observation, and revision.
  • Video ‘world models’ also fail badly at physically valid disassembly, often hallucinating parts or breaking constraints.
  • CHAIN offers 109 interactive levels with clear difficulty tiers and unified interfaces to push research from passive perception to active problem solving.
  • Overall, CHAIN reveals a persistent gap between seeing and acting, motivating better physically grounded, plan-first AI.

Why This Research Matters

Home robots, factory arms, and AR assistants must follow the laws of physics while completing multi-step tasks, not just label pictures. CHAIN reveals whether today’s AI can plan ahead, keep options open, and act safely when objects interlock or need support. This benchmark helps engineers identify where models fail—like hallucinating motions or creating dead-end placements—before deploying them in real settings. It also encourages designs that reason about constraints, making assistants more reliable and less costly to run. Over time, better CHAIN scores should translate into safer, more capable embodied systems that can help with assembly, repair, and organization in the real world.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re building a LEGO model without instructions. You don’t just stare at the pile—you try something, see what changes, and then try the next step. Real problem solving is a back-and-forth dance between looking and doing.

🄬 The Concept (Vision-Language Models, or VLMs): What they are: Computer programs that read pictures and words together to answer questions or follow directions. How they often work today: 1) Look at a single image, 2) Read a question, 3) Output a short answer. Why this is limited: Real tasks need many steps and must follow physics; one snapshot doesn’t tell you what will happen after you act.

🍞 Anchor: A VLM can tell you “the red block is on top,” but that doesn’t mean it knows how to safely remove it without making the tower fall.

🍞 Hook: You know how playing Jenga isn’t just about seeing the tower—you also test which piece wiggles easily and which one is holding everything up.

🄬 The Concept (Physics-driven environments): What they are: Virtual worlds where gravity, collisions, and support behave like real life. How they work: 1) You choose an action (pull, rotate, place), 2) The engine simulates the motion and contacts, 3) You observe the new state, 4) Repeat. Why they matter: Without real physics, an AI could “cheat” by sliding parts through each other or floating blocks in midair.

🍞 Anchor: If an AI moves a puzzle beam straight through another beam, we know the test is broken—physics engines prevent that.
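The support rule mentioned here can be made concrete with a tiny check. This is a minimal voxel-grid sketch, not the benchmark's actual engine: a block counts as supported if any of its cells sits on the floor or directly on an occupied cell of another block.

```python
def is_supported(occupied, block_cells):
    """A block is supported if any of its cells rests on the floor (z == 0)
    or directly on top of an occupied cell belonging to another block."""
    cells = set(block_cells)
    for (x, y, z) in cells:
        if z == 0:
            return True
        below = (x, y, z - 1)
        # A block cannot support itself, so skip its own cells.
        if below in occupied and below not in cells:
            return True
    return False

floor_block = [(0, 0, 0), (1, 0, 0)]
world = set(floor_block)
print(is_supported(world, floor_block))  # True: rests on the floor
print(is_supported(world, [(0, 0, 2)]))  # False: nothing underneath, it falls
```

A real engine resolves continuous contacts and torques, but even this caricature rules out the "floating blocks in midair" cheat described above.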

🍞 Hook: Think of a treasure hunt with clues. Each clue you pick changes which clues are still reachable next.

🄬 The Concept (Multi-step interaction): What it is: Solving by taking several actions in order, where each step changes what’s possible next. How it works: 1) Observe, 2) Pick an action, 3) See result, 4) Update plan, 5) Continue. Why it matters: One wrong early move can block the only path to the goal later.

🍞 Anchor: In a burr puzzle, removing the wrong stick first may jam the core so nothing else can come out.

🍞 Hook: Picture a keyring full of keys connected in a tricky pattern. To free one, you must understand how the loops lock each other.

🄬 The Concept (Structural reasoning): What it is: Figuring out how shapes, contacts, and supports fit and limit motion. How it works: 1) Read the geometry (sizes, orientations), 2) Infer constraints (what blocks what), 3) Plan a feasible order of moves that keeps options open, 4) Adjust using feedback. Why it matters: Without structural reasoning, the AI guesses randomly, wastes steps, and gets stuck.

🍞 Anchor: When packing a suitcase, placing shoes first along the edge can save space; dumping a sweater first may waste a corner and block everything else.
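The "infer constraints, then plan a feasible order" idea can be sketched as dependency resolution. This toy model (not from the paper; real puzzles have richer, direction-dependent constraints) treats "A blocks B" as "B cannot move until A is out":

```python
def removal_order(blocks):
    """blocks[a] = set of pieces that a blocks (they can't move until a is out).
    Returns a feasible removal order, or None if the pieces deadlock."""
    pieces = set(blocks)
    # Invert the relation: which pieces is p blocked by?
    blocked_by = {p: {q for q in pieces if p in blocks[q]} for p in pieces}
    order = []
    while pieces:
        free = [p for p in pieces if not (blocked_by[p] & pieces)]
        if not free:
            return None  # every remaining piece is blocked: an interlocked cycle
        p = sorted(free)[0]  # deterministic choice among currently-free pieces
        order.append(p)
        pieces.remove(p)
    return order

# 'key' blocks both beams; beam1 additionally blocks beam2.
puzzle = {"key": {"beam1", "beam2"}, "beam1": {"beam2"}, "beam2": set()}
print(removal_order(puzzle))  # ['key', 'beam1', 'beam2']
```

A mutual block (`{"a": {"b"}, "b": {"a"}}`) returns `None`: that is exactly the "jam the core" situation the burr-puzzle anchor above describes, except that in real puzzles the blocking relation changes as pieces slide, which is what makes them hard.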

The world before this paper: Most benchmarks asked static, single-turn questions like “What color is the ball?” They were great for checking if models could recognize objects and describe scenes. But real robots or assistants must act: pick up, rotate, insert, remove—following laws of physics. Many attempts to extend evaluation used simplified 2D puzzles or single images, which don’t test the hardest part: whether early choices protect later feasibility in 3D.

The problem: We lacked a rigorous, interactive, physics-true test to see if models can convert perception into action sequences that obey contact, support, and geometric constraints over many steps.

Failed attempts: 1) Static Q&A (good at naming, bad at doing). 2) 2D toy grids (avoid 3D collisions and supports). 3) Pure video generation “world models” (often hallucinate shapes or violate constraints, looking plausible but physically impossible).

The gap: A benchmark that forces models to plan, act, and adapt, with real physics and long horizons, across tasks that truly need order-sensitive moves.

Real stakes: This matters for home robots (don’t topple shelves), factories (assemble parts in a safe sequence), design tools (pack components without clashes), and education (teaching how structure and cause-effect work). If AIs can’t reason about structure, they can’t be trusted to handle everyday physical chores safely.

02Core Idea

🍞 Hook: You know how a line of dominoes only falls if you tip the first one, and the shape of the layout decides what happens next? The order—and the connections—make all the difference.

🄬 The Concept (CHAIN: Causal Hierarchy of Actions and Interactions): What it is: An interactive 3D benchmark that tests whether models can understand, plan, and execute action sequences that obey physical constraints. How it works: 1) Present a 3D task (interlocking puzzle or packing), 2) The model observes from multiple views, 3) Picks a feasible action from a standard API, 4) The physics engine updates the world, 5) The model revises its plan and continues until success or the step budget ends. Why it matters: Without CHAIN, we can’t tell if a model’s “understanding” survives contact with reality—where geometry, contact, and support rule what’s possible.

🍞 Anchor: In CHAIN, a model that simply names parts won’t pass; it must figure out the unlock move in a burr puzzle or pack shapes to fully fill a container with no gaps or overlaps.
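The five-step loop above can be sketched in a few lines. The class and method names here (`ToyEnv`, `choose_action`, etc.) are illustrative stand-ins, not CHAIN's real API; the point is the closed-loop shape: observe, act, let physics respond, revise.

```python
class ToyEnv:
    """Trivial stand-in environment: solved once the 'key' action is taken."""
    goal = "remove the key piece"
    def __init__(self): self._solved = False
    def observe(self): return ["view-front", "view-top"]  # multi-view renders
    def step(self, action):
        if action == "slide key +x":   # the physics engine would simulate this
            self._solved = True
        return {"moved": self._solved}
    def solved(self): return self._solved

class ToyAgent:
    """Tries a wrong move first, then revises -- closed-loop behavior."""
    def choose_action(self, views, history, goal):
        return "pull beam" if not history else "slide key +x"

def run_episode(env, agent, step_budget=30):
    history = []
    for _ in range(step_budget):
        views = env.observe()
        action = agent.choose_action(views, history, env.goal)
        history.append((action, env.step(action)))
        if env.solved():
            return True, len(history)   # success and steps used
    return False, len(history)          # out of budget

print(run_episode(ToyEnv(), ToyAgent()))  # (True, 2)
```

The one-shot setting discussed later amounts to forcing `step_budget` worth of actions from the first observation alone, with no chance to revise after feedback.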

The “Aha!” moment in one sentence: To truly test physical reasoning, stop grading answers to pictures and start grading action sequences that obey physics and preserve future options.

Multiple analogies:

  • Toolbox analogy: Before, we asked, “What tool is this?” Now we ask, “Use the right tool in the right order without breaking anything.”
  • Cooking analogy: It’s not naming ingredients; it’s doing the recipe step by step so the cake actually rises.
  • Maze analogy: Not pointing at the exit on a map, but navigating turn by turn without hitting dead-ends.

Before vs. After:

  • Before: Single photo, single answer. Little sense of how actions change the world.
  • After: Many observations and actions. Plans must respect geometry and contact rules, and each move reshapes what’s still doable.

Why it works (intuition, not equations): Real feasibility is shaped by constraints: 3D pieces can collide, gravity pulls down, and supports are needed to avoid collapse. CHAIN enforces these rules via a physics engine and carefully designed tasks, so a model can’t bluff. Success demands recognizing structure and planning an order that keeps a path open to the goal.

Building blocks:

  • Task families: (1) Interlocking mechanical puzzles (Kongming/Lu Ban locks, burr puzzles) for constraint-aware, contact-rich reasoning. (2) 3D stacking/packing (polycubes into boxes) for long-horizon space management and stability under gravity.
  • Standardized interaction: A simple action API with color-coded objects and multi-view observations, removing controller confounds.
  • Metrics beyond success: Steps vs. optimal plan length, tokens spent per solve, and cost per solve, so we compare not just “if” but “how well and how efficiently.”
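These metrics are simple ratios and differences; a minimal sketch (helper names are mine, not the benchmark's):

```python
def pass_at_1(solved, total):
    """Fraction of levels solved in a single run each."""
    return solved / total

def dist_to_optimal(steps_taken, optimal_steps):
    """Extra actions beyond the shortest known plan (solved runs only)."""
    return steps_taken - optimal_steps

def solved_per_usd(solved, total_cost_usd):
    """Successful solves per dollar of API spend."""
    return solved / total_cost_usd

# 25 of 109 levels solved -> Pass@1 of roughly 22.9%.
print(round(pass_at_1(25, 109) * 100, 1))  # 22.9
# A 6-step solve of a 3-step-optimal level wastes 3 steps.
print(dist_to_optimal(6, 3))               # 3
```

Tracking efficiency only on solved runs matters: a model that never solves a level contributes no step count, so success and efficiency have to be read together.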

🍞 Hook: Think of a careful gardener trimming a hedge: each snip changes the shape and the next good snip. If you cut the wrong branch first, future cuts get harder.

🄬 The Concept (Interactive 3D benchmark): What it is: A test bed where models must act repeatedly in a 3D physics world. How it works: 1) Input images + history + goal, 2) Choose an action, 3) Physics updates, 4) Repeat. Why it matters: It reveals whether a model can adapt, not just recite a pre-planned script.

🍞 Anchor: CHAIN’s stacking tasks punish greedy early placements that create unreachable cavities, forcing real lookahead.

🍞 Hook: When you pack a lunchbox, you learn to place the big, stiff items first and keep room for the small ones—otherwise you can’t close the lid.

🄬 The Concept (Structural reasoning, revisited): What it is: Reading the 3D layout to keep future moves feasible. How it works: 1) Detect blockers and supports, 2) Choose an order that frees constraints bit by bit, 3) Revise if feedback shows a dead-end. Why it matters: It’s the difference between a lucky guess and a reliable plan.

🍞 Anchor: Models that place “easy” pieces first often end up with leftover shapes that can’t fit—CHAIN catches that.

03Methodology

At a high level: Images + Goal → Observe multi-view scene + Read action history → Choose next action (pick/rotate/move/place) → Physics engine updates the world → Repeat until solved or out of steps.

Step-by-step (like a recipe):

  1. Environment setup
  • What happens: The benchmark loads either an interlocking puzzle or a stacking/packing level with a predefined start state and fixed action set. Objects are color-coded, and multi-view renders are produced to reduce occlusion problems.
  • Why it exists: Ensures fairness (same tools and views for every model) and avoids controller quirks that could hide reasoning weaknesses.
  • Example: “Select the blue beam; slide +x by 1 unit.” Or “Rotate the green L-block 90° around z, then place.”
  2. Perception–action loop
  • What happens: At time t, the agent gets (a) the task goal, (b) a short summary of recent steps, and (c) current multi-view images. It chooses an action; the simulator applies physics (contacts, collisions, gravity); it returns new observations for t+1.
  • Why it exists: Real problem solving needs closed-loop feedback to discover constraints and adapt plans.
  • Example: After trying to pull a beam and seeing it won’t budge, the agent infers a hidden interlock and tries freeing a different piece.
  3. Task families
  • Interlocking mechanical puzzles 🍞 Hook: You know how some wooden brainteasers only open if you slide the “key” piece first? 🄬 The Concept: What it is: Multi-piece 3D locks where pieces block each other through tight contacts. How it works: 1) Identify the key piece, 2) Slide along allowed rails, 3) Avoid collisions, 4) Follow the precise order. Why it matters: Random moves jam the structure; only the right sequence unlocks it. 🍞 Anchor: A six-piece burr requires removing a single unlocking beam before any other piece can exit.
  • 3D stacking/packing 🍞 Hook: Packing a suitcase so there are no gaps and the zipper closes. 🄬 The Concept: What it is: Filling a box with shaped blocks to exactly cover volume with no overlap or holes. How it works: 1) Choose orientation, 2) Place stably with support, 3) Keep future space fillable. Why it matters: Early sloppy placements create unreachable cavities that ruin the endgame. 🍞 Anchor: In a 3×3×4 box, greedy placements can leave a single-cell void you can’t legally fill later.
  4. Physics-driven execution
  • What happens: Unity (for contact-rich puzzles) or a lightweight 3D Python engine (for stacking) enforces collisions, gravity, supports, and kinematic constraints.
  • Why it exists: Prevents unrealistic shortcuts and guarantees repeatability across models.
  • Example: A beam cannot pass through another; a block must be supported or it falls.
  5. Metrics (how we grade)
  • Task success: 🍞 Hook: Think of a spelling test: did you spell the word right or not? 🄬 The Concept (Pass@1): What it is: The fraction of levels solved in a single run. How it works: 1) Try each level once, 2) Count solved vs. total, 3) Compute the percentage. Why it matters: Shows baseline reliability without retries. 🍞 Anchor: If a model solves 25 out of 109 levels, Pass@1 ≈ 22.9%.
  • Plan efficiency (only on solved runs): 🍞 Hook: If two kids clean a room, the one who uses fewer trips did the smarter plan. 🄬 The Concept (Average Steps and Distance-to-Optimal): What it is: Average Steps counts how many actions were used; Distance-to-Optimal counts the extra actions beyond the shortest known plan. How it works: 1) Measure steps taken, 2) Compare to the task’s minimal plan length, 3) Sum or average across solved tasks. Why it matters: Finishing is good; finishing without wandering is better. 🍞 Anchor: If the best solution is 3 steps and your plan took 6, Dist2Opt adds 3 for that level.
  • Token & cost efficiency: 🍞 Hook: Imagine paying per word you say to a helper. Talking more costs more. 🄬 The Concept (Solved/Tokens and Solved/USD): What it is: How many tasks are solved per million tokens, and per dollar spent. How it works: 1) Count all input/output tokens, 2) Convert tokens to dollars with provider prices, 3) Divide solved tasks by tokens or dollars. Why it matters: Two models with the same success can differ a lot in cost. 🍞 Anchor: A “flash” model might be cheap but solve few tasks; a stronger model may cost more per call yet be cheaper per successful solve.
  6. Difficulty and dataset
  • What happens: 109 interactive levels: 32 puzzles (easy/medium/hard) and 77 stacking tasks (easy/medium/hard). Stacking is programmatically generated and scalable.
  • Why it exists: Clear tiers expose where models break (often at contact-rich, order-sensitive puzzles; or at hard packings needing lookahead).
  • Example: Easy stacking (2×2×3) is almost trivial; hard burr puzzles can need non-intuitive unlock moves.
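The "unreachable cavity" failure in stacking can be made concrete: after each placement, flood-fill the empty cells from the box opening; any empty cell the fill cannot reach is dead space no later piece can legally occupy. A minimal voxel-grid sketch (my own illustration, assuming pieces enter through an open top):

```python
from collections import deque

def unreachable_voids(occupied, dims):
    """Empty cells not reachable from the open top of the box via empty
    cells -- dead space that no later piece can be slid in to fill."""
    X, Y, Z = dims
    empty = {(x, y, z) for x in range(X) for y in range(Y) for z in range(Z)} - occupied
    # Seed the search with every empty cell in the top layer (the opening).
    frontier = deque(c for c in empty if c[2] == Z - 1)
    reachable = set(frontier)
    while frontier:
        x, y, z = frontier.popleft()
        for dx, dy, dz in ((1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)):
            n = (x + dx, y + dy, z + dz)
            if n in empty and n not in reachable:
                reachable.add(n)
                frontier.append(n)
    return empty - reachable

# A 1x1x2 column with a block at the top cell seals off the bottom cell.
print(unreachable_voids({(0, 0, 1)}, (1, 1, 2)))  # {(0, 0, 0)}
```

A planner that runs a check like this after every candidate placement would never make the greedy move that creates the single-cell void described in the 3×3×4 anchor above.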

The secret sauce (what’s clever):

  • Closed-loop, physics-true interaction forces genuine feasibility reasoning.
  • Two complementary task families catch different failure modes: contact-constrained unlocking vs. global space planning.
  • Efficiency and cost metrics reveal trade-offs beyond raw success.
  • A unified, simple action API and multi-view inputs isolate reasoning quality from low-level control.

04Experiments & Results

The test: Evaluate state-of-the-art VLMs and video world models under the same interactive protocol. We measure success (Pass@1), plan efficiency (Average Steps, Distance-to-Optimal), and resource efficiency (Solved/Tokens, Solved/USD).

The competition: Both closed-source and open-source VLMs were tested, along with diffusion/video world models for a disassembly subtask. All used identical sampling settings and action APIs, with generous step budgets (30–60) and a short trajectory memory window (5 turns).

The scoreboard with context:

  • Overall difficulty: CHAIN is hard. Even the best VLM (GPT‑5.2) solves about 22.9% of all levels (≈25/109). That’s like getting a solid C on a very tough exam where most others score much lower.
  • Puzzle vs. Stacking: Puzzle success is tiny (≈0.0–3.1%) across models, while Stacking can reach up to 31.2%. Translation: Models can manage space-filling a bit, but interlocking constraints defeat them.
  • Efficiency and cost: Stronger models sometimes backtrack, increasing extra steps and spending more tokens. Yet, when counting “cost per successful solve,” mid-strength models can beat ultra-cheap ones whose low success drives up total spending per win.

Surprising findings:

  • One-shot collapse: Without interaction (single fixed image, no feedback), accuracy plunges. For Puzzle, it’s 0% one-shot across top models; for Stacking, drops like 31.2% → 9.1% show the big benefit of closed-loop probing.
  • World model failures: State-of-the-art video generators tasked with physically valid disassembly hallucinate parts, break contact rules, or produce impossible motions, especially as complexity grows. Looks plausible, but physics fails.
  • Difficulty stratification matters: Easy stacking is close to solved by top models, but medium and especially hard levels expose poor lookahead—early greedy placements lead to fragmented leftover spaces.

Concrete examples:

  • Cost trade-offs: A pricier model that solves more on the first try can be cheaper per success than a very cheap model that needs many failed runs.
  • Dead-end behavior: On puzzles, models often try random beams, fail to infer the key piece, and wander aimlessly. On stacking, placing “easy” blocks first commonly creates holes that no remaining piece can fill.
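The cost trade-off in the first bullet is just arithmetic over total spend and success rate. The price points below are made up for illustration; only the shape of the comparison comes from the text.

```python
def cost_per_success(levels, success_rate, cost_per_run_usd):
    """Total spend divided by successful solves; failed runs still cost money."""
    successes = levels * success_rate
    return (levels * cost_per_run_usd) / successes

# Hypothetical: a cheap model that rarely succeeds vs. a pricier one that
# succeeds far more often.
cheap  = cost_per_success(100, 0.05, 0.02)  # $0.02/run, 5% success
strong = cost_per_success(100, 0.25, 0.08)  # $0.08/run, 25% success
print(round(cheap, 2), round(strong, 2))    # 0.4 0.32
```

With these numbers the model that is 4x more expensive per call is still cheaper per successful solve, which is exactly the Solved/USD effect the benchmark's cost metrics are designed to surface.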

Bottom line: CHAIN shows a persistent gap between seeing and acting. Models can often describe a scene but struggle to turn that perception into a long, constraint-aware plan that leaves room for future moves.

05Discussion & Limitations

Limitations (what this can’t do yet):

  • Scale: Interlocking puzzles require meticulous Unity engineering to capture tight contacts and kinematics; thus, puzzle variety grows slowly. Stacking scales well, but the hardest contact-rich puzzles are finite for now.
  • Evaluation breadth: Because interactive runs are costly, Pass@1 is the main metric; best-of-K statistics are limited, though initial tests show similar trends.
  • Controller-free scope: The benchmark isolates reasoning by using a simple action API. It doesn’t test low-level continuous control (e.g., robot arm dynamics) directly.

Required resources:

  • Compute and API budget for many interactive steps and image I/O.
  • Physics-capable environment (Unity or 3D Python engine).
  • Logging and token accounting to analyze cost efficiency.

When not to use:

  • If you only need static recognition (e.g., labeling objects in photos), CHAIN is overkill.
  • If your agent requires end-to-end motor control benchmarking (torques, grasps), you’ll need a control-focused suite in addition to CHAIN’s operation-level reasoning tests.
  • If you cannot afford iterative interactions (strict latency or cost constraints), CHAIN’s multi-step format may be impractical.

Open questions:

  • How to teach models to preserve future feasibility? Can explicit constraint graphs, object-centric memory, or search with verifier feedback help?
  • Can process reward models or environment verifiers reliably guide long-horizon selection better than current rerankers?
  • What curricula best transfer from stacking to interlocking unlocks (and vice versa)?
  • How to integrate high-fidelity 3D perception (multi-view, point clouds) with symbolic planners without losing speed?
  • Can world models internalize contact-consistent dynamics to stop hallucinating structure under tight constraints?

06Conclusion & Future Work

Three-sentence summary: CHAIN replaces single-picture Q&A with an interactive, physics-true exam of whether models can plan and act through multi-step constraints. Across interlocking puzzles and 3D stacking, state-of-the-art models frequently fail to convert perception into robust, long-horizon plans, especially when early moves shrink future options. The results expose a clear gap between seeing and acting and provide a grounded path to improve structure-aware reasoning.

Main achievement: A unified, open interactive 3D benchmark—complete with task families, physics enforcement, standardized APIs, and efficiency/cost metrics—that reveals where current models break under real constraints.

Future directions: Add more contact-rich puzzles, expand programmatic stacking difficulty, integrate stronger verifier signals, explore object-centric memory and search, and develop physics-faithful world models. Report broader best-of-K results as evaluation budgets grow.

Why remember this: If we want trustworthy embodied assistants, we must test not just what they see but how they act over time under physics. CHAIN is a practical, reproducible step toward that goal, forcing models to respect geometry, contact, and support—or fail loudly where it matters.

Practical Applications

  • ‱Robot assembly assistants that choose safe, feasible part orders without jamming or breaking components.
  • ‱Warehouse packing and bin-picking planners that minimize wasted space and avoid creating unreachable cavities.
  • ‱AR-guided furniture assembly that adapts instructions based on real-time progress and detected constraints.
  • ‱Educational puzzle apps that teach causal, spatial, and structural reasoning with physics-true feedback.
  • ‱Industrial maintenance planners that sequence disassembly steps for tight assemblies without collisions.
  • ‱Household tidying robots that stack and store items stably, preserving room for later additions.
  • ‱Design tools for product packaging that verify exact-fit packs with realistic insertion paths.
  • ‱Simulation curricula for training embodied agents with progressive difficulty in contact-rich tasks.
  • ‱Quality-control checkers that replay and verify action sequences against constraints before production.
  • ‱Cost-aware AI agents that optimize reasoning verbosity (tokens) to reduce API spend per successful task.
#interactive benchmark#vision-language models#physical reasoning#3D puzzles#stacking and packing#physics engine#multi-step planning#causal constraints#contact and support relations#embodied AI#action efficiency#token efficiency#world models#closed-loop control#structure-aware reasoning