Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning

Intermediate
Yalcin Tur, Jalal Naghiyev, Haoquan Fang et al. · 2/8/2026
arXiv

Key Summary

  • Robots often use the same amount of thinking for easy and hard moves, which wastes time on easy steps and isn’t enough for tricky ones.
  • This paper introduces RD-VLA, a robot brain that can think in small hidden steps (in its head) and choose to think longer only when needed.
  • Instead of writing out long reasoning as words or tokens, RD-VLA reasons silently in a continuous hidden space, which uses constant memory and runs much faster.
  • A single small, shared reasoning block is reused over and over (recurrent depth), so the robot can take 2, 4, or 12 refinement steps at test time without retraining.
  • The model knows when to stop thinking by checking if two consecutive action guesses are almost the same (a convergence test).
  • On the LIBERO benchmark, RD-VLA hits 93.0% success with fixed steps and 92.5% with adaptive steps, beating or matching bigger token-reasoning models.
  • On CALVIN, it achieves the longest average chain of tasks (3.39), showing strong long-horizon planning.
  • In real robots (like towel folding and bread toasting), RD-VLA is robust and benefits from thinking longer when actions are complex.
  • Compared to token-based Chain-of-Thought methods, RD-VLA keeps memory use constant and can be up to 80× faster at inference.
  • The big idea: move reasoning from slow, word-like outputs to fast, continuous hidden states and let the robot decide how long to think based on difficulty.

Why This Research Matters

Robots that can flex their thinking—speeding through easy steps and slowing down for hard ones—are safer and more useful in the real world. RD-VLA shows how to do this without writing long token explanations, keeping memory constant and making decisions much faster. That means home assistants that don’t hesitate on simple chores but can carefully handle delicate tasks like pouring water or folding laundry. In hospitals and warehouses, it can boost both reliability and throughput by allocating compute only when needed. This approach also opens the door to clearer safety controls: when the model needs many iterations, it can automatically replan sooner or ask for help. Overall, it makes capable, resource-efficient robot helpers more practical and affordable.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how you tie your shoes super fast when you’ve done it a million times, but you slow down and think carefully when learning a new knot? We naturally spend more brainpower on harder tasks and less on easy ones.

🥬 The Concept: Fixed Computational Depth in Robot Brains

  • What it is: Many robot models use a fixed amount of thinking for every control step, no matter how easy or hard the situation is.
  • How it works:
    1. The camera and language instruction go into a model.
    2. The model passes data through the same number of layers every time.
    3. It outputs an action for the robot to do.
  • Why it matters: If every step gets the same compute, simple moves waste time and complex moves don’t get enough thought, causing failures.

🍞 Anchor: Imagine a robot adjusting its grip (easy) and threading a needle (hard), both with the same effort; that makes no sense and causes mistakes.

🍞 Hook: Imagine explaining how to solve a maze out loud, step by step—it’s slow and you have to remember every word you already said.

🥬 The Concept: Chain-of-Thought (CoT) for Robots

  • What it is: CoT makes the model “think aloud” in words or tokens before acting, creating a step-by-step plan.
  • How it works:
    1. The model generates reasoning tokens (text or visual traces).
    2. It uses that reasoning to decide actions.
    3. Longer chains mean more memory and time.
  • Why it matters: In continuous control (like moving a gripper), word-like steps are clunky and slow, and memory grows with every extra token.

🍞 Anchor: It’s like narrating every tiny hand motion while trying to catch a ball—by the time you finish the sentence, the ball is gone.

🍞 Hook: Think of taking a photo and then shrinking it to a tiny icon and back again—you lose detail each time.

🥬 The Concept: Output-Space vs Latent-Space Reasoning

  • What it is: Output-space reasoning forces the model to convert rich internal thoughts into low-detail outputs (like text or coarse coordinates), then re-encode them; latent-space reasoning keeps thinking inside a rich, continuous hidden space.
  • How it works:
    1. Output-space: decode to text/coordinates → re-encode → repeat.
    2. Latent-space: keep all steps inside the model’s hidden vectors.
    3. Only at the end, decode once into actions.
  • Why it matters: Output-space creates bottlenecks (information loss); latent-space keeps full detail and runs faster with constant memory.

🍞 Anchor: It’s better to plan a route using a high-resolution map in your head than to redraw it on a tiny sticky note every step.

🍞 Hook: Picture a chef who tastes the soup, tweaks the salt, tastes again, and repeats until it’s perfect.

🥬 The Concept: Recurrent Depth

  • What it is: Reusing the same small reasoning block several times at test time to refine a decision.
  • How it works:
    1. Start with a rough guess in hidden space.
    2. Apply the same block again and again to polish the guess.
    3. Stop when changes become tiny.
  • Why it matters: Lets the model “think longer” for hard cases without changing its size or retraining.

🍞 Anchor: Like doing another quick lap of proofreading until all typos are gone.

The Problem: Before this paper, many robot models either used fixed compute or relied on CoT-style token reasoning. That meant slow inference, growing memory, and mismatches with smooth, continuous motions. Diffusion-style action refinement helps generate actions but still operates in output space and doesn’t grow the model’s understanding as it refines.

Failed Attempts:

  • Fixed-depth heads: fast but brittle on complex tasks.
  • Token CoT: flexible but slow and memory-hungry, plus hard to curate reasoning data for robots.
  • Diffusion policies: good for sampling diverse actions, but they refine outputs, not the internal understanding.

The Gap: We needed a way for robots to think more when needed, entirely inside their hidden representations, with constant memory, and without writing out long token chains.

Real Stakes: In homes, hospitals, and factories, robots must be safe and reliable. If they can spend extra compute when it matters—like grasping a slippery cup—but breeze through easy moves, they get both speed and safety right.

02Core Idea

🍞 Hook: Imagine doing math in your head, quietly checking your work with a few extra passes, instead of writing a whole paragraph explaining each step.

🥬 The Concept: Latent Iterative Reasoning with Recurrent Depth (the Aha!)

  • What it is: A robot controller that refines its hidden thoughts step by step using the same small block, and only decodes into actions at the end.
  • How it works:
    1. Build a grounded hidden workspace from vision and language.
    2. Initialize a noisy “scratchpad.”
    3. Run a shared reasoning block multiple times to clean and sharpen the plan.
    4. Stop when two consecutive action guesses barely differ.
    5. Decode once into smooth actions.
  • Why it matters: You get adaptive test-time compute, constant memory, and big speedups without tokenizing thoughts.

🍞 Anchor: Like sketching, erasing, and refining a drawing in your head, then making one clean final line on paper.
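As a rough sketch, the whole loop might look like this in code. Everything below is a toy stand-in: random linear maps replace the paper's transformer blocks, and the hidden width, action dimension, and threshold are made up. But the control flow (ground, initialize, refine with one shared block, stop on convergence, decode once) mirrors the five steps above.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                                     # hidden width (illustrative)
W_core = rng.normal(0, 0.15, (2 * D, D))  # the one shared (weight-tied) block
W_coda = rng.normal(0, 0.15, (D, 4))      # decodes the latent state to a 4-D action

def prelude(obs):
    # Grounded foundation S_pre; the real model cross-attends to VLM features.
    return np.tanh(obs)

def core(s, s_pre):
    # One refinement step: the same weights are reused every iteration.
    return np.tanh(np.concatenate([s, s_pre]) @ W_core)

def coda(s):
    # Decode the polished latent state into an action, once per control step.
    return s @ W_coda

def rd_vla_step(obs, max_iters=12, delta=1e-8):
    s_pre = prelude(obs)                  # 1. grounded hidden workspace
    s = rng.standard_normal(D) * 0.5      # 2. noisy latent scratchpad
    prev = coda(s)
    for k in range(1, max_iters + 1):
        s = core(s, s_pre)                # 3. refine with the shared block
        action = coda(s)
        if np.mean((action - prev) ** 2) < delta:
            break                         # 4. consecutive guesses converged
        prev = action
    return action, k                      # 5. act; k = iterations actually used

action, k = rd_vla_step(rng.standard_normal(D))
```

Note that `k` varies per call: an "easy" observation may converge after a couple of iterations, a hard one may use all twelve, yet the parameters and memory footprint never change.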

Multiple Analogies (3 ways):

  1. Chef analogy: Taste → adjust → taste again; stop when it’s right; serve once.
  2. Detective analogy: Re-review clues with the same toolkit; stop when the story fits; report once.
  3. Photo editing analogy: Apply the same filter multiple times until the picture is crisp; save once.

Before vs After:

  • Before: Fixed compute or token-based reasoning with growing memory and latency; refinement happened in output space with information loss.
  • After: Adaptive compute inside a rich hidden space, constant memory, and faster inference; refinement strengthens understanding before action.

Why It Works (intuition):

  • Reusing the same block (weight-tying) teaches a stable “refine” operation.
  • Keeping reasoning in latent space avoids bottlenecks from decoding/encoding.
  • A stop rule based on convergence links “how long to think” to “how confident I am.”
  • Training with truncated backprop builds stability across many iteration counts.

Building Blocks (each with the sandwich pattern):

🍞 Hook: Think of warming up before a race to get your muscles ready. 🥬 The Concept: Prelude

  • What it is: A small front-end that grounds learned queries by attending to mid-layer visual-language features.
  • How it works: It self-attends the queries, cross-attends to vision/language features, and builds a stable foundation (S_pre) for reasoning.
  • Why it matters: Without a good foundation, later refinements drift or collapse. 🍞 Anchor: Like lining up tools neatly on your workbench before you start fixing a bike.

🍞 Hook: Imagine using the same trusted eraser and pencil for each correction pass. 🥬 The Concept: Weight-Tied Recurrent Core

  • What it is: One transformer block reused across iterations to iteratively improve the latent scratchpad.
  • How it works: At each step, it sees the current scratchpad plus the foundation and attends to final vision/language features and robot state.
  • Why it matters: Reuse keeps parameters small, enables deep thinking at test time, and stabilizes refinement. 🍞 Anchor: Like looping the same checklist until everything’s green.
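A back-of-the-envelope comparison makes the parameter savings of weight-tying concrete; the widths and counts below are toy numbers, not the paper's:

```python
D, K = 8, 12                          # toy hidden width and iteration count
params_per_block = 2 * D * D          # weights in one (toy) linear refinement block
untied_stack = K * params_per_block   # a 12-layer stack stores 12 copies
tied_recurrent = params_per_block     # recurrent depth stores the block once

print(untied_stack // tied_recurrent) # 12
```

Same effective depth at test time, one twelfth of the weights: that is why the model can "think deeper" without growing.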

🍞 Hook: You don’t read every draft out loud; you publish only the final version. 🥬 The Concept: Coda (Decoder to Actions)

  • What it is: A final pass that turns the polished hidden state into real robot actions.
  • How it works: Attend to self and high-level features; apply a linear head to output actions.
  • Why it matters: Decoding only once avoids repeated information loss and saves time. 🍞 Anchor: Snap one final photo instead of taking dozens of blurry ones.

🍞 Hook: When practicing piano, you stop repeating a bar once two takes sound the same. 🥬 The Concept: Adaptive Stopping Criterion

  • What it is: A rule to stop iterating when successive action predictions are nearly identical.
  • How it works: Compare a_k and a_{k-1}; if the difference is below a small threshold, stop and act.
  • Why it matters: Saves compute on easy steps and spends more on hard ones. 🍞 Anchor: Quit studying when your practice tests all score the same A.
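The stopping rule fits in a few lines. This is a minimal sketch: MSE between successive action vectors stands in for the paper's convergence test, and the threshold value is illustrative.

```python
import numpy as np

def should_stop(a_k, a_prev, delta=1e-4):
    """Stop iterating once consecutive action predictions barely differ.

    Mean squared error between successive predictions is the proxy here;
    delta is a tunable threshold (an illustrative value, not the paper's).
    """
    return float(np.mean((a_k - a_prev) ** 2)) < delta

a1 = np.array([0.50, 0.20, 0.10])
a2 = np.array([0.51, 0.19, 0.10])   # nearly identical -> converged
a3 = np.array([0.90, 0.60, 0.40])   # still changing -> keep thinking

print(should_stop(a2, a1, delta=1e-3))  # True
print(should_stop(a3, a2, delta=1e-3))  # False
```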

🍞 Hook: If the fog is thick, you drive shorter stretches before checking the map again. 🥬 The Concept: Adaptive Execution

  • What it is: Shorten or lengthen how many actions you execute before replanning, based on how long you had to think.
  • How it works: If many iterations were needed (uncertain), execute fewer steps and replan sooner; if few were needed, execute more.
  • Why it matters: Prevents small mistakes from snowballing in tricky situations. 🍞 Anchor: In a maze, take shorter moves when unsure; take longer strides when the path is clear.
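One way to realize this coupling is a linear decay from iteration count to horizon length; the specific bounds below (h_min, h_max, k_max) are illustrative assumptions, not the paper's settings.

```python
def execution_horizon(k_star, k_max=12, h_min=2, h_max=16):
    """Map 'how long I thought' to 'how many actions I execute'.

    Many iterations (uncertain) -> short horizon, replan soon;
    few iterations (confident) -> long horizon, cruise ahead.
    """
    frac = min(max(k_star, 1), k_max) / k_max     # 0..1 uncertainty proxy
    return round(h_max - frac * (h_max - h_min))  # linear decay schedule

print(execution_horizon(2))    # 14 -- confident: execute a long chunk
print(execution_horizon(12))   # 2  -- uncertain: replan after two actions
```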

03Methodology

At a high level: Cameras + instruction + robot state → Prelude (grounded foundation) → Recurrent Core (iterate K times) → Coda (decode once) → Actions.

Step-by-step with the sandwich pattern for new ideas:

  1. 🍞 Hook: Starting a painting with a primed canvas makes colors stick better. 🥬 The Concept: Build a Grounded Foundation (Prelude)

    • What it is: The Prelude turns learned queries into a foundation (S_pre) by cross-attending to mid-layer visual-language features.
    • How it works:
      1. Create K learned queries and let them self-attend.
      2. Cross-attend them to mid-layer VLM features (vision + latent tokens).
      3. Output S_pre, a stable base for reasoning.
    • Why it matters: Without S_pre, the model forgets what it saw when iterating many times. 🍞 Anchor: Like sketching guidelines before coloring.
  2. 🍞 Hook: Before writing an essay, you might dump rough ideas onto a scratchpad and then clean them up. 🥬 The Concept: Latent Scratchpad Initialization

    • What it is: Start the scratchpad S as structured noise so the model learns to “clean” it by refining.
    • How it works:
      1. Sample S from a truncated normal distribution.
      2. Keep S_pre fixed as the anchor.
      3. At each step, combine S and S_pre and adapt them back to the manifold.
    • Why it matters: This teaches a general refine-operator that works from any reasonable start. 🍞 Anchor: Like turning a messy brainstorming page into a neat outline.
  3. 🍞 Hook: Using the same favorite tool over and over helps you get consistent results. 🥬 The Concept: Weight-Tied Recurrent Core (Iterative Refinement)

    • What it is: A single transformer block reused each iteration to update the scratchpad S.
    • How it works:
      1. Concatenate current S with S_pre; adapt and normalize.
      2. Self-attend across K queries to share information.
      3. Cross-attend to final VLM features and robot proprioception.
      4. Output a refined S for the next step.
    • Why it matters: Enables arbitrarily deep compute at test time with constant memory and minimal parameters. 🍞 Anchor: Like applying the same sharpening filter until the image stops changing.
  4. 🍞 Hook: You don’t keep revising forever; you stop once two drafts read the same. 🥬 The Concept: Adaptive Stopping Criterion

    • What it is: Stop iterating when actions stabilize.
    • How it works:
      1. After each iteration, decode a provisional action.
      2. Compare it to the previous iteration’s action (MSE as a proxy for KL).
      3. If below a tiny threshold δ, stop and output the final action.
    • Why it matters: Saves time on easy moves, invests on hard ones. 🍞 Anchor: Stop practicing a song section once back-to-back attempts sound identical.
  5. 🍞 Hook: If a trail gets rocky, you take smaller steps. 🥬 The Concept: Adaptive Execution Horizon

    • What it is: Decide how many actions to execute before replanning based on how long you had to think.
    • How it works:
      1. If iterations k* are high (uncertain), use a short horizon.
      2. If k* is low (confident), use a long horizon.
      3. Optionally use a linear decay schedule tied to k*.
    • Why it matters: Reduces compounding errors in tricky states. 🍞 Anchor: Drive slower and check directions more often in fog.
  6. 🍞 Hook: Studying the last chapters before a test often gives most of the benefit. 🥬 The Concept: Truncated Backpropagation Through Time (TBPTT)

    • What it is: During training, backpropagate gradients only through the last d iterations (e.g., last 8), while earlier steps are detached.
    • How it works:
      1. Randomly sample a total iteration count from a heavy-tailed distribution.
      2. Unroll that many steps.
      3. Backprop through only the last chunk.
    • Why it matters: Efficient training that teaches stability across many iteration counts and prevents memory blow-up. 🍞 Anchor: Review the latest lessons most intensely instead of re-reading the whole textbook.
  7. 🍞 Hook: Packing one multi-tool instead of a whole toolbox saves space. 🥬 The Concept: Latent vs Token Reasoning (Secret Sauce)

    • What it is: Keep all the thinking inside the hidden space and decode once, instead of generating long token chains.
    • How it works:
      1. No autoregressive token loops.
      2. Recurrent latent refinement only.
      3. One final projection to actions.
    • Why it matters: Constant memory, much faster inference (up to 80× vs token reasoning), and better alignment with continuous control. 🍞 Anchor: Quietly figure out the plan in your head, then make one smooth move.
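The TBPTT recipe from step 6 can be sketched as follows. The log-normal draw is one plausible heavy-tailed choice and d=8 matches the example in the text, but treat both as assumptions rather than the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_iterations(mean=6, max_iters=32):
    """Draw a total iteration count from a heavy-tailed distribution.

    A log-normal is one simple heavy-tailed choice (an assumption here);
    the count is clamped to a sane range.
    """
    k = int(np.ceil(rng.lognormal(mean=np.log(mean), sigma=0.6)))
    return min(max(k, 1), max_iters)

def tbptt_plan(total_iters, d=8):
    """Split the unrolled iterations: only the last d carry gradients."""
    cut = max(total_iters - d, 0)
    detached = list(range(1, cut + 1))               # forward pass only
    attached = list(range(cut + 1, total_iters + 1)) # backprop flows here
    return detached, attached

detached, attached = tbptt_plan(total_iters=12, d=8)
print(detached)   # [1, 2, 3, 4] -- run forward, gradients cut
print(attached)   # iterations 5..12 are trained through
```

Because gradients only flow through a fixed-size window, training memory stays bounded no matter how many iterations are unrolled, which is what lets the model later run 2 or 12 refinement steps at test time.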

Concrete mini-example:

  • Input: “Put the red block in the blue bowl,” with camera images and joint angles.
  • Prelude builds S_pre grounded in the scene (spots red block, blue bowl).
  • Scratchpad starts noisy; the Core refines: first find the block; then align gripper; then plan lift-and-place; changes shrink.
  • Stop when action predictions converge; Coda decodes a precise reach-grasp-lift-place motion chunk.

04Experiments & Results

🍞 Hook: When you test a new bike, you don’t just ask, “Does it roll?” You try hills, turns, and bumps to see how it really performs.

🥬 The Tests: What and Why

  • What they measured:
    1. Success rates on many manipulation tasks (LIBERO).
    2. How long a robot can chain tasks in a row (CALVIN ABC→D).
    3. Real-world task progress (cube to bowl, dish wiping, towel folding, bread toasting).
    4. How performance scales as you let the model think for more iterations.
    5. Whether adaptive stopping saves compute without losing accuracy.
  • Why: To prove the model can think longer for harder tasks, stay fast, and generalize to real robots.

🍞 Anchor: It’s like timing runners on sprints and marathons and checking how well they handle curves.

The Competition:

  • End-to-end VLAs (predict actions directly) like OpenVLA, SmolVLA, π-FAST.
  • Token reasoning VLAs (generate thoughts as tokens) like CoT-VLA, ThinkAct, MolmoAct.
  • RD-VLA (latent reasoning) with fixed and adaptive iterations.

Scoreboard with context:

  • LIBERO benchmark:
    • RD-VLA (fixed iterations) hits 93.0% success, topping prior methods including strong token-reasoning baselines.
    • RD-VLA (adaptive) gets 92.5%—almost the same accuracy—while using fewer iterations on average (compute trimmed ~34% at a good threshold).
    • Interpretation: That’s like scoring an A+ while doing less homework on easy problems.
  • Scaling with iterations:
    • From 1 to 8 iterations, performance rises log-linearly, then saturates around 8–12.
    • Some tasks jump from 0% to >90% success by 4 iterations—clear evidence that extra thinking helps hard actions.
  • CALVIN ABC→D (long-horizon chaining):
    • RD-VLA achieves the longest average chain length (3.39) and a task-5 success of 45.3%, indicating robust sequential planning.
  • Real robots:
    • On bimanual tasks (cube→bowl, dish wiping, towel folding, bread toasting), RD-VLA outperforms Diffusion Policy and π0.5 baselines, with near-perfect dish wiping using fixed 8 iterations.
    • Adaptive mode stays competitive and even leads on cube placement while saving compute.

Surprising findings:

  • Emergent adaptivity: The model naturally takes more iterations on complex steps (like grasping) and fewer on simple moves (like navigate/place), even without explicit labels for “hardness.”
  • Constant memory plus deep thinking: You can unroll many refinement steps at test time without growing memory, thanks to weight-tying and latent reasoning.
  • Safe replanning: Coupling “how long I thought” to “how long I execute” reduces compounding errors in tricky scenes.

05Discussion & Limitations

🍞 Hook: Even the best hikers know their limits—bring water, check the weather, and don’t overdo it.

🥬 Honest Assessment

  • Limitations:
    1. Over-iterating can lead to saturation or slight drops—there’s a sweet spot (often ~8–12 iterations on LIBERO).
    2. Adaptive thresholds need tuning per setup; too strict wastes time, too loose may stop early.
    3. While small (0.5B) and efficient, bigger backbones and more diverse data could further improve stability and generalization.
    4. Latent convergence is a proxy for confidence; it’s helpful but not perfect—edge cases may need extra safety checks.
  • Required Resources:
    • A VLM backbone (here ~0.5B with LoRA), GPU for training TBPTT, robot sensors (cameras, proprioception).
    • Inference is light: constant memory, adjustable iterations.
  • When NOT to Use:
    1. Ultra time-critical reflexes with hard real-time deadlines that can’t tolerate even a few extra milliseconds.
    2. Tasks dominated by precise physics simulation where domain-specific controllers excel.
    3. Environments with extreme distribution shift without enough visual grounding—may require larger pretraining.
  • Open Questions:
    1. How far can latent recurrence scale with larger models and datasets before diminishing returns?
    2. Can we design even better confidence signals directly in latent space?
    3. What hybrid systems emerge if we combine token CoT and latent recurrence smartly?
    4. How to guarantee safety with formal bounds tied to convergence behavior?

🍞 Anchor: Like packing a trek bag—this model travels light and far, but you still plan your route and check conditions.

06Conclusion & Future Work

Three-sentence summary:

  • RD-VLA lets a robot think quietly inside its hidden space, reusing one small reasoning block multiple times and stopping when its plan stabilizes.
  • This gives adaptive test-time compute with constant memory and big speedups, avoiding slow token-based reasoning.
  • Across simulators and real robots, it succeeds more often on hard tasks by simply thinking longer when needed.

Main achievement: Showing that implicit, latent-space iterative reasoning with a weight-tied recurrent core can replace explicit token reasoning for Vision–Language–Action models, delivering state-of-the-art success with far better efficiency.

Future directions:

  • Scale to larger backbones and richer datasets; refine convergence signals; explore hybrids that combine CoT and latent recurrence; and formalize safety via uncertainty-aware execution.

Why remember this:

  • It flips the script from “say your thoughts out loud” to “think deeply in your head,” giving robots a simple, powerful way to spend more brainpower only when the world demands it.

Practical Applications

  • Home service robots that swiftly handle simple tidying but think longer for delicate tasks like arranging glassware.
  • Warehouse picking systems that adaptively spend more compute on tricky grasps (e.g., deformable items) and less on standard boxes.
  • Hospital assistants that carefully manipulate medical tools or supplies when uncertainty is high and speed up on routine steps.
  • Factory assembly robots that refine alignment steps more when tolerances are tight and breeze through bulk motions.
  • Kitchen robots that stabilize grasp-and-cut actions by iterating more on slippery or irregular foods.
  • Mobile manipulators that replan more frequently in cluttered aisles while cruising confidently in open spaces.
  • Teleoperation assist that flags high-uncertainty states (many iterations) and shortens execution horizon to keep a human-in-the-loop safer.
  • Education and research platforms to study confidence-aware planning and test-time compute scaling without collecting CoT traces.
  • Retrofits for existing VLM-based robot stacks by swapping in a recurrent latent head to gain adaptive compute with minimal memory cost.
  • Energy-aware deployments where compute time and battery are budgeted dynamically based on the task’s difficulty.
#Recurrent depth#Latent iterative reasoning#Vision-Language-Action#Adaptive computation#Weight-tied transformer#TBPTT#Convergence criterion#Adaptive execution#Latent space#Action decoding#Continuous control#LIBERO benchmark#CALVIN ABC→D#Test-time compute scaling#Chain-of-Thought alternatives