
VLANeXt: Recipes for Building Strong VLA Models

Intermediate
Xiao-Ming Wu, Bin Fan, Kang Liao et al. · 2/20/2026
arXiv

Key Summary

  • This paper studies Vision–Language–Action (VLA) robots under one fair setup to find which design choices truly matter.
  • Starting from a simple baseline like RT-2/OpenVLA, the authors test changes in three areas: core building blocks, perception inputs, and how actions are modeled.
  • They discover a practical recipe: use a stronger vision-language backbone, a soft connection to a larger policy module, and predict actions in chunks.
  • Continuous action training works best; flow matching loss is strong and gets an extra boost from a tiny frequency-domain loss.
  • Feeding proprioception (the robot’s own body state) into the VLM, plus multi-view cameras, beats putting it only in the policy or skipping it.
  • Adding many past frames didn’t help and sometimes hurt, so keep inputs focused and clean.
  • World modeling (predicting future images) helps but costs about 3× more training time, so the paper leaves it out for efficiency.
  • Their final model, VLANeXt (about 2.5B params), beats larger prior SOTA on LIBERO and is about 10 points better than OpenVLA-OFT on LIBERO-plus.
  • The team releases a simple, unified codebase so others can reproduce results and explore new VLA designs.
  • Bottom line: careful design beats just making models bigger.

Why This Research Matters

Reliable household, hospital, and warehouse robots must be good at seeing, understanding language, and moving smoothly despite changing lights, camera angles, or object layouts. This paper shows a clear, reproducible recipe to make such robots stronger without just making them bigger. The findings improve safety and precision by promoting smooth, body-aware control. They also lower adoption barriers by identifying which ingredients (like multi-view and proprioception-in-VLM) deliver the most benefit per cost. Finally, the shared codebase helps teams build on the same foundation, accelerating progress toward capable, helpful robot assistants.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine teaching a helper robot at home. You show it pictures (what you see), tell it what you want (your words), and expect it to act (pick and place). If it understands all three well, life gets easier.

🥬 Filling (The Actual Concept)

  • What it is: Before this paper, people built Vision–Language–Action (VLA) models in many different ways, but compared them with different rules, so it was hard to know which ideas really worked.
  • How it works (then): Many teams glued a vision-language model (VLM) to a small action maker (policy), trained them on robot demos, and hoped the combo would generalize.
  • Why it matters: Without one fair “field” and shared rules, it’s like comparing soccer teams who play on different fields with different ball sizes—you can’t tell which strategy is best.

🍞 Bottom Bread (Anchor): Two teams both claim their robots are great at “open the drawer and put the can inside,” but one used brighter lights and more camera angles. Who’s actually better? You can’t tell without the same setup.

Now, let’s gently unpack all the core ideas using the Sandwich pattern for each key concept, in the order that builds understanding.

🍞 VISION–LANGUAGE–ACTION MODEL (VLA)

  • Hook: You know how you look at a picture, read instructions, and then do the task? Like, “Put the red cup in the box.”
  • What it is: A VLA is a model that sees images, reads words, and outputs actions for a robot.
  • How it works:
    1. Take in camera frames and a text instruction.
    2. Understand them together.
    3. Predict the next moves the robot arm should make.
  • Why it matters: Without a VLA, the robot can’t connect “what it sees” and “what you say” to “what it must do.”
  • Anchor: When you say “place the tomato sauce in the basket,” a VLA finds the bottle in the image and moves the arm to pick and place it.

🍞 VISION–LANGUAGE MODEL (VLM)

  • Hook: Imagine a super reader who can also understand pictures.
  • What it is: A VLM is the brain that jointly understands images and text.
  • How it works: It turns images and words into tokens, mixes them with attention, and produces a shared understanding.
  • Why it matters: If the brain can’t understand the scene and the instruction together, the robot will guess.
  • Anchor: The VLM notices the “black bowl between the plate and the ramekin” and points the policy to the right spot.

🍞 POLICY MODULE

  • Hook: Think of a coach who takes the game plan and tells players exactly where to move.
  • What it is: The policy module turns understanding into specific robot actions.
  • How it works: It reads features from the VLM and outputs a sequence of arm commands.
  • Why it matters: Without a good policy, even a smart brain won’t move the robot correctly.
  • Anchor: After the VLM finds the bowl, the policy tells the gripper when to close and where to place the bowl.

🍞 ACTION PREDICTION

  • Hook: Like following a recipe step by step—stir, pour, bake.
  • What it is: Action prediction means forecasting the robot’s future moves.
  • How it works: Given images and words, predict a series of arm poses, gripper commands, and timing.
  • Why it matters: If the future steps are wrong or jerky, the task fails.
  • Anchor: Predicting “reach → grasp → lift → move → release” to place an object accurately.

The world before this paper:

  • Lots of inventive VLA ideas appeared (RT-2, OpenVLA, π-series, and more). But each used different evaluation and training tricks, so comparing was messy.
  • People also disagreed on many knobs: Should actions be predicted one step at a time or in chunks? Should we discretize or regress actions? Is predicting future video frames (world modeling) worth the cost? Where should proprioception (the robot’s body state) go? Should we use one camera or two? Should the VLM and policy be tightly wired or loosely connected?

The problem the researchers faced:

  • The design space was a "primordial soup"—too many ideas, not enough structure. Without unified tests, the community couldn’t tell which choices were truly impactful.

Failed attempts and confusion:

  • Some models reused text tokens for actions—simple but sometimes limiting.
  • Others discretized continuous actions into bins—easy, but often less precise.
  • Some added lots of past frames—extra info that sometimes just added noise.
  • Some tried world modeling—helpful, but very expensive.

The gap:

  • A clean, fair, and repeatable study across a wide set of choices to distill a practical recipe.

The real stakes:

  • In daily life, we need robots that obey natural language and work reliably in changing rooms, lighting, and layouts. A solid recipe helps everyone build robots that actually help—whether in homes, hospitals, or warehouses.

02Core Idea

The Aha! moment in one sentence: You can build a stronger, smaller VLA by choosing the right ingredients and connections—especially a soft VLM-to-policy link, action chunking with continuous losses, and feeding proprioception into the VLM—rather than just scaling up parameters.

We now explain the main ingredients using Sandwich blocks, with multiple analogies and before-vs-after intuition.

🍞 FOUNDATIONAL COMPONENTS

  • Hook: Think of assembling a bike: frame, wheels, and chain must fit well.
  • What it is: The essential architecture choices (VLM backbone, policy design, loss) that everything else depends on.
  • How it works:
    1. Pick a capable VLM backbone.
    2. Connect it to a sufficiently expressive policy head.
    3. Train with a loss that matches how actions really look (continuous!).
  • Why it matters: A weak backbone, tiny policy, or wrong loss caps performance no matter what data you feed.
  • Anchor: Upgrading from a basic VLM to Qwen3-VL and enlarging the policy head is like swapping training wheels for proper tires—suddenly the ride is smooth.

🍞 PERCEPTION ESSENTIALS

  • Hook: When you clean your room, two eyes (views) and a sense of your own body help a lot.
  • What it is: The key inputs—multi-view images and proprioception—given to the model.
  • How it works:
    1. Use both a third-person and a wrist camera.
    2. Provide proprioception (joint angles, gripper state) to the VLM.
    3. Avoid stuffing in redundant past frames if they don’t help.
  • Why it matters: Clear, complementary views and body awareness make actions more reliable.
  • Anchor: Seeing a drawer from the side and from the wrist view reduces confusion and improves grasping.

🍞 ACTION MODELING PERSPECTIVE

  • Hook: A dance teacher thinks in 8-counts, not single moves.
  • What it is: Treat action sequences as structured time series, not isolated steps.
  • How it works:
    1. Predict actions in chunks (e.g., 8 steps at a time).
    2. Train with continuous losses like regression or flow matching.
    3. Add a tiny frequency-domain loss to keep motions smooth and efficient.
  • Why it matters: Robots move through time; modeling that structure makes actions stable and accurate.
  • Anchor: Pouring water needs a smooth arc, not jittery, one-step guesses; chunked prediction gets that arc right.

🍞 SOFT CONNECTION STRATEGY (VLM ↔ Policy)

  • Hook: Like a flexible joint between your arm and hand—strong but not rigid.
  • What it is: A layer-by-layer connection with learnable queries as a buffer between VLM and policy.
  • How it works:
    1. At each layer, pass information through a small set of trainable query tokens.
    2. Let the policy read from these queries instead of hooking directly into VLM layers.
    3. Tune end-to-end so the buffer learns what to keep and pass along.
  • Why it matters: Too loose can starve the policy; too tight can overwhelm it. Soft buffering transfers the “right amount” of knowledge.
  • Anchor: It’s like a translator who filters and clarifies, so instructions arrive just right.
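A minimal sketch of the idea in plain NumPy: a small set of query tokens cross-attends to one VLM layer's hidden states and hands the policy a fixed-size summary. All names, sizes, and the single-layer simplification are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def meta_query_bridge(vlm_features, queries):
    """Learnable queries attend over one VLM layer's features.

    vlm_features: (n_tokens, d)   hidden states from one VLM layer
    queries:      (n_queries, d)  trainable buffer tokens (n_queries << n_tokens)
    Returns:      (n_queries, d)  distilled summary the policy reads,
                  instead of hooking directly into the raw VLM layer.
    """
    d = queries.shape[-1]
    attn = softmax(queries @ vlm_features.T / np.sqrt(d))  # (n_queries, n_tokens)
    return attn @ vlm_features                              # (n_queries, d)

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 64))   # e.g. 200 image+text+state tokens
queries = rng.normal(size=(16, 64))  # 16 meta queries acting as the soft buffer
summary = meta_query_bridge(feats, queries)
print(summary.shape)  # (16, 64)
```

Because the policy only ever sees the 16 query outputs, the bridge's capacity (how many queries, at how many layers) is the knob that tunes how "soft" the coupling is.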

🍞 MULTI-VIEW OBSERVATIONS

  • Hook: Watching a soccer game from two cameras helps you see where the ball really is.
  • What it is: Using both third-person and wrist cameras.
  • How it works:
    1. Encode each view.
    2. Fuse them in the VLM.
    3. Give the policy a fused 3D-aware understanding.
  • Why it matters: Different angles fix blind spots and reduce spatial confusion.
  • Anchor: The wrist view confirms the gripper’s alignment before closing.

🍞 PROPRIOCEPTIVE CONDITIONING

  • Hook: A dancer knows where their feet are without looking.
  • What it is: Feed the robot’s body state into the VLM so vision and body sense join early.
  • How it works:
    1. Project proprioception into tokens.
    2. Mix them with image and text tokens in the VLM.
    3. Let the policy act on this fused state.
  • Why it matters: Early fusion helps the model plan moves that fit the robot’s current pose.
  • Anchor: If the elbow is already bent, the plan should adjust reach distance.
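The projection step can be sketched as a single learned linear map from the raw state vector to a couple of extra tokens. The 9-D state, token width, and token count here are illustrative assumptions, not the paper's actual dimensions.

```python
import numpy as np

def proprio_to_tokens(state, W, b, n_tokens=2):
    """Project a raw body-state vector into token embeddings for the VLM.

    state: (state_dim,)               e.g. joint angles + gripper opening
    W:     (state_dim, n_tokens * d)  learned projection (random stand-in here)
    Returns (n_tokens, d) tokens, ready to join the image/text sequence.
    """
    d = W.shape[1] // n_tokens
    return (state @ W + b).reshape(n_tokens, d)

rng = np.random.default_rng(0)
d, state_dim = 64, 9                    # 7 joints + 2-D gripper state (assumed)
state = rng.normal(size=state_dim)
W = rng.normal(size=(state_dim, 2 * d))
b = np.zeros(2 * d)

proprio_tokens = proprio_to_tokens(state, W, b)
image_tokens = rng.normal(size=(196, d))
text_tokens = rng.normal(size=(12, d))
# Early fusion: the VLM attends over all modalities in one sequence.
fused = np.concatenate([image_tokens, text_tokens, proprio_tokens], axis=0)
print(fused.shape)  # (210, 64)
```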

🍞 ACTION CHUNKING

  • Hook: Baking cookies in batches is faster and more consistent than one at a time.
  • What it is: Predict several future actions at once (e.g., 8 steps).
  • How it works:
    1. Define a chunk length.
    2. Train the policy to output the whole chunk.
    3. At run-time, execute and roll forward.
  • Why it matters: Gives temporal context, improves smoothness, and speeds inference.
  • Anchor: “Reach, close, lift, move, open” flow feels natural when learned together.
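The run-time pattern above can be sketched in a few lines; the policy here is a stand-in stub (one flat list of scalar actions per call), not the paper's model.

```python
def rollout(policy, horizon, chunk=8):
    """Execute a task by querying the policy once per chunk, not once per step."""
    executed, calls = [], 0
    while len(executed) < horizon:
        actions = policy(calls)        # one forward pass -> `chunk` future actions
        calls += 1
        executed.extend(actions[: horizon - len(executed)])
    return executed, calls

# Dummy policy: a chunk of 8 scalar actions per call (stand-in for the real head).
dummy_policy = lambda step: [float(step * 8 + i) for i in range(8)]
actions, calls = rollout(dummy_policy, horizon=20)
print(len(actions), calls)  # 20 actions from only 3 policy calls
```

The inference speedup falls directly out of the loop: 20 control steps need only ceil(20/8) = 3 forward passes.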

🍞 FLOW MATCHING LOSS

  • Hook: Like guiding a toy car along the right track smoothly from start to finish.
  • What it is: A continuous training loss that learns a velocity field to transform noise into target actions.
  • How it works:
    1. Start from noisy actions.
    2. Learn a field that “flows” them toward true actions.
    3. Train by matching this flow across time.
  • Why it matters: Real robot actions are continuous; this captures smooth trajectories better than coarse bins.
  • Anchor: The robot’s wrist rotates fluidly to align with a handle.
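A toy version of one flow-matching training step, on an illustrative 8-step, 7-DoF action chunk (the shapes are assumptions). A real model is a trained network; here an algebraic "oracle" that already knows the true velocity field just sanity-checks that the target is right.

```python
import numpy as np

def flow_matching_loss(model, actions, rng):
    """One training step of conditional flow matching on an action chunk.

    actions: (chunk, dof) ground-truth continuous actions.
    The model predicts the velocity that carries noise toward the target:
      x_t = (1 - t) * noise + t * actions,   target velocity = actions - noise.
    """
    noise = rng.normal(size=actions.shape)
    t = rng.uniform()                        # random interpolation time in [0, 1)
    x_t = (1 - t) * noise + t * actions
    v_target = actions - noise
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_target) ** 2)

rng = np.random.default_rng(0)
actions = rng.normal(size=(8, 7))            # 8-step chunk, 7-DoF arm (assumed)

# Oracle: recover noise from x_t, so its predicted velocity is exact -> ~0 loss.
oracle = lambda x_t, t: actions - (x_t - t * actions) / (1 - t)
loss = flow_matching_loss(oracle, actions, np.random.default_rng(1))
print(loss)  # ~0.0 for the oracle; a real network is trained to drive this down
```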

🍞 FREQUENCY-DOMAIN MODELING

  • Hook: When you hear music, you notice bass (slow) and treble (fast) patterns.
  • What it is: An extra tiny loss that checks predicted actions in frequency space (via DCT) to control smoothness.
  • How it works:
    1. Convert action sequences to frequencies.
    2. Compute MSE between predicted and true spectra.
    3. Weight it lightly (e.g., 0.1–0.2) alongside flow matching.
  • Why it matters: Robotic motions are often low-rank; matching frequencies reduces jitter.
  • Anchor: The pouring motion keeps a steady rhythm instead of wobbling.
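A sketch of such a loss, with a hand-rolled orthonormal DCT-II and an illustrative 0.1 weight; the paper's exact transform and weighting may differ.

```python
import numpy as np

def dct2(x):
    """Orthonormal DCT-II along the last axis (time)."""
    n = x.shape[-1]
    k = np.arange(n)[:, None]                 # frequency index
    t = np.arange(n)[None, :]                 # time index
    basis = np.cos(np.pi * (t + 0.5) * k / n)
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return (x @ basis.T) * scale

def frequency_loss(pred, target, weight=0.1):
    """Lightly weighted MSE between the action spectra of two trajectories."""
    return weight * np.mean((dct2(pred) - dct2(target)) ** 2)

t = np.linspace(0, 1, 32)
smooth = np.sin(2 * np.pi * t)[None, :]                         # smooth 32-step arc
jittery = smooth + 0.05 * np.sin(2 * np.pi * 14 * t)[None, :]   # high-freq wobble

print(frequency_loss(smooth, smooth))       # 0.0 — identical spectra
print(frequency_loss(jittery, smooth) > 0)  # True — jitter shows up as high-frequency energy
```

The point of working in frequency space is visible in the toy data: the wobble barely moves individual time steps but lights up a distinct high-frequency bin, which the loss then penalizes.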

🍞 WORLD MODELING

  • Hook: Before you move, you imagine what the scene will look like next.
  • What it is: Predict future images as an auxiliary task.
  • How it works:
    1. Tokenize images.
    2. Predict future frame tokens at a fixed horizon.
    3. Train jointly with action prediction.
  • Why it matters: Helps the model “think ahead,” but it’s costly (about 3× training time here).
  • Anchor: Anticipating the drawer’s new position after opening can aid planning.

🍞 TEMPORAL OBSERVATION HISTORY

  • Hook: Sometimes too many reminders just clutter your mind.
  • What it is: Feeding many past frames into the model.
  • How it works:
    1. Stack or stream multiple past images.
    2. Let the VLM attend over them.
    3. Train to use the history if helpful.
  • Why it matters: Extra frames can add noise; in this study, single current frames worked better.
  • Anchor: The robot focuses on the clear present scene rather than blurry past snapshots.

Before vs. after:

  • Before: Reused text tokens for actions, small policy heads, tight/loose VLM-policy links, discretized actions, and single views.
  • After: A stronger VLM, soft connection with meta-queries, larger policy, chunked actions, continuous losses (flow matching + tiny frequency loss), multi-view images, and proprioception in the VLM.

Why it works (intuition):

  • Matching the problem’s physics (smooth, continuous motion over time, seen from multiple angles, grounded in body state) aligns the learning signals with reality, so training becomes easier and generalization improves.

Building blocks (recap): stronger VLM, bigger policy head (multi-token), soft coupling, action chunking, flow matching + frequency loss, multi-view vision, proprioception to VLM, skip redundant history, optional world modeling if you can afford it.

03Methodology

At a high level: Input (multi-view images + instruction + proprioception) → VLM encoding and fusion (with proprioception) → soft meta-query bridge → large policy head → action chunk prediction (flow matching) + tiny frequency-domain loss → Output (next action chunk).
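As a rough shape-check, this dataflow can be sketched end-to-end with stand-in tensors. Every dimension and every stub here (mean pooling for the bridge, a random linear map for the policy head) is an illustrative assumption, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                    # hidden size (illustrative)

# 1. Inputs -> tokens
third_person = rng.normal(size=(196, d))  # patch tokens, third-person camera
wrist = rng.normal(size=(196, d))         # patch tokens, wrist camera
text = rng.normal(size=(12, d))           # instruction tokens
proprio = rng.normal(size=(2, d))         # projected body-state tokens

# 2. VLM fusion: one sequence over all modalities (attention stub omitted)
fused = np.concatenate([third_person, wrist, text, proprio], axis=0)

# 3. Soft bridge: 16 meta queries summarize the fused sequence (mean stub)
meta_queries = fused.mean(axis=0, keepdims=True).repeat(16, axis=0)

# 4. Policy head -> action chunk: 8 steps x 7 DoF (random linear stub)
W_out = rng.normal(size=(16 * d, 8 * 7)) * 0.01
action_chunk = (meta_queries.reshape(-1) @ W_out).reshape(8, 7)

print(fused.shape, meta_queries.shape, action_chunk.shape)
```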

Step-by-step, like a recipe:

  1. Inputs and tokenization
  • What happens: The system takes two camera views (third-person and wrist), a language instruction, and proprioception (e.g., joint angles, gripper state). Images are encoded; text is embedded; proprioception is projected to tokens.
  • Why this step exists: Multi-view fixes blind spots; proprioception grounds the plan to the robot’s body; instructions set the goal.
  • Example: Instruction: “Pick up the tomato sauce and place it in the basket.” Images show the table and the gripper’s perspective. Proprioception says the elbow is bent 30°, gripper open.
  2. VLM fusion (Qwen3-VL-2B)
  • What happens: The VLM ingests image, text, and proprioception tokens together, producing a shared multimodal representation.
  • Why this step exists: Early fusion helps the model connect “what I see,” “what I want,” and “how my body is” all at once.
  • Example: It highlights the tomato bottle location, recognizes “place in basket,” and notes the current wrist orientation.
  3. Soft connection with meta queries
  • What happens: Layer by layer, a small set of learnable query tokens sits between the VLM and the policy head. The policy reads from these queries, not directly from raw VLM layers.
  • Why this step exists: A buffered link transfers the right amount of knowledge—neither starving nor overwhelming the policy.
  • Example: The queries carry “where the bottle is,” “where the basket is,” and “how the wrist is oriented,” distilled for action planning.
  4. Large policy head (multi-token, deeper)
  • What happens: Instead of reusing text tokens, a dedicated policy module with multiple tokens and more layers predicts actions.
  • Why this step exists: Decoupling from language space and increasing capacity lets the policy represent complex, time-extended maneuvers.
  • Example: The policy plans “reach → close → lift → move → open,” with target waypoints and gripper timing.
  5. Action chunking (e.g., 8 steps)
  • What happens: The policy outputs a short horizon of actions at once (positions/velocities/efforts, etc.).
  • Why this step exists: Batching time steps improves temporal coherence and speeds inference.
  • Example: The next 8 control steps include a smooth lift and a lateral move toward the basket.
  6. Training with continuous objectives
  • What happens: Actions are trained with a flow matching loss. Additionally, a small frequency-domain MSE (via discrete cosine transform) regularizes motion smoothness.
  • Why this step exists: Real robot controls are continuous and structured over time; these losses fit that reality better than coarse classification.
  • Example: The flow loss makes the wrist rotate fluently; the frequency loss prevents tiny jitters during the lift.
  7. Optional world modeling (excluded from the final recipe for efficiency)
  • What happens (optional): Predict future image tokens at the chunk horizon to encourage look-ahead.
  • Why this step exists: Thinking a step ahead can guide better actions but significantly increases training time (~3× here).
  • Example: Envisioning the drawer more open after a pull. Helpful, but expensive.
  8. Output and control loop
  • What happens: The predicted chunk is executed; then the next observation arrives and the process repeats.
  • Why this step exists: Closed-loop control adapts to small changes and disturbances.
  • Example: If the bottle slips slightly, the next chunk adjusts the gripper path.
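The observe → predict → execute cycle of the final step can be sketched as a receding-horizon loop. The policy and observation functions here are stand-in stubs, not the paper's code.

```python
def closed_loop(policy, get_observation, n_cycles, chunk=8):
    """Receding-horizon control: observe, predict a chunk, execute it, repeat."""
    log = []
    for cycle in range(n_cycles):
        obs = get_observation(cycle)       # fresh multi-view + proprio reading
        chunk_actions = policy(obs)        # next `chunk` low-level commands
        log.extend(chunk_actions)          # execute the whole chunk
    return log

# Stubs: the observation is just the cycle index; the policy offsets its actions
# by it, mimicking how a slipped object shifts the next predicted chunk.
trace = closed_loop(policy=lambda obs: [obs + i / 10 for i in range(8)],
                    get_observation=lambda c: float(c), n_cycles=3)
print(len(trace))  # 24 commands from 3 observe -> predict -> execute cycles
```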

What breaks without each piece:

  • Remove multi-view: Depth or occlusion errors rise; grasps miss more often.
  • Skip proprioception: Plans ignore current joint states; collisions or awkward paths increase.
  • Tight or loose connection instead of soft: Either too little or too much signal flows; performance drops slightly.
  • Small policy: Complex sequences get underfit; long-horizon moves degrade.
  • No chunking: Motions get jittery; inference can be slower per effective horizon.
  • Classification loss: Coarse bins fail to capture smooth control; accuracy drops.
  • No frequency loss: Fine jitters sneak in; less robust under perturbations.

Concrete data example:

  • Task: “Open the middle drawer.”
  • Inputs: Two views show the drawer front; proprioception shows wrist yaw misaligned.
  • VLM+soft link: Extracts “middle drawer handle,” keeps wrist alignment salient.
  • Policy+chunking: Predicts “move to handle → align yaw → close gripper → pull 5 cm → pull 5 cm → stop.”
  • Losses: Flow matching trains the smooth pull; frequency loss suppresses micro-oscillation.
  • Output: The drawer opens stably even with small lighting shifts.

The secret sauce:

  • Balance and fit: Soft coupling passes just-right information; chunking captures time structure; continuous losses reflect the physics of motion; proprioception enters early where it fuses best. Combined with a capable VLM and a sufficiently big policy, these choices deliver big gains without just scaling parameter count.

04Experiments & Results

The tests and why they matter:

  • The team evaluated on LIBERO (standard tasks across Spatial, Object, Goal, and Long suites) and LIBERO-plus (same tasks but with unseen perturbations like lighting, camera shifts, layout changes, language rewrites). This checks both skill and sturdiness.

The competition:

  • They compared VLANeXt to strong direct policy baselines (e.g., Diffusion Policy, Octo, MDT) and leading VLA models (OpenVLA, π-series, NORA, UniVLA, FLOWER, OpenVLA-OFT, etc.). These are the best-known reference points.

The scoreboard (with context):

  • On LIBERO, VLANeXt reaches about 97.4% average success, which is like getting an A+ when other top students are getting solid As.
  • On LIBERO-plus, VLANeXt averages about 80.1%, roughly 10 percentage points higher than OpenVLA-OFT (about 69.6%). That’s like winning by a comfortable margin in a championship game with tricky weather.
  • Despite being smaller (~2.5B) than some rivals (e.g., 7B), VLANeXt wins—evidence that the recipe matters more than sheer size.

Surprising and notable findings:

  • Temporal observation history (many past frames) didn’t help and sometimes hurt. Less is more when history adds noise.
  • Multi-view (third-person + wrist) gave a big boost. Two angles cleared up 3D ambiguities.
  • Proprioception worked best when fed into the VLM, not just the policy. Early fusion wins.
  • Continuous objectives (regression/flow matching) beat classification; regression was top in simple cases, while flow matching offered strong performance and flexibility for more complex/multimodal action spaces.
  • A stronger VLM backbone helped more when paired with a larger policy module; the capacity must be usable.
  • World modeling improved accuracy but cost ~3× training time—great if you have compute, but not in the final efficient recipe.

What changed because of the idea:

  • Instead of blindly scaling models, the recipe identifies key levers that reliably increase success and robustness. This directs community effort to the highest-return design choices and provides a shared, reproducible platform to test new ideas.

Why the numbers make sense:

  • Real robot control is smooth, continuous, and temporal. Multi-view plus proprioception feeds cleaner signals; chunking and continuous losses fit the physics; soft coupling passes useful knowledge without overload. Each piece nudges performance upward; together, they add up to state-of-the-art results.

05Discussion & Limitations

Limitations:

  • Benchmarks: Results center on LIBERO and LIBERO-plus. While broad and perturbed, they’re still curated; performance on drastically different setups (e.g., outdoor scenes, mobile manipulation, deformable objects) isn’t guaranteed.
  • Compute and hardware: Multi-view cameras and proprioception are assumed. If you only have a single camera or limited sensors, gains may shrink.
  • Training cost trade-offs: World modeling helps but is expensive (~3×). Even without it, training large VLMs and policies requires solid compute.
  • Supervised demos: The recipe focuses on imitation learning. It doesn’t address interactive fine-tuning or RL for exploration and recovery.
  • Sensitivity to design scale: Benefits from stronger VLMs showed up more with a larger policy. Very tiny policies may underuse backbone strength.

Required resources:

  • Data: Demonstrations per task (hundreds for LIBERO suites; more helps).
  • Sensing: Third-person + wrist cameras; reliable proprioceptive signals.
  • Compute: GPUs sufficient for multi-billion-parameter finetuning; moderate memory for action chunking and soft-layer links.
  • Software: The provided unified codebase standardizes the pipeline and evaluations.

When NOT to use this approach:

  • Ultra-low compute or single-sensor settings where multi-view cameras and proprioception are unavailable and cannot be added.
  • Extremely high-frequency control loops with very tight latency budgets that can’t accommodate a VLM-policy pipeline.
  • Non-manipulation domains where perception and action structure differ (e.g., very long-horizon navigation without local manipulation cues) unless adapted.

Open questions:

  • How to get most of the world-modeling benefits at a fraction of the cost?
  • Can interactive post-training (RL, preference alignment, self-correction) pair with this recipe to further improve robustness?
  • What’s the best way to scale down (tiny models) without losing the key gains from soft coupling and chunking?
  • How does the recipe transfer to new embodiments (humanoids, mobile manipulators) and sensing modalities (tactile, depth, audio)?
  • Can frequency-domain ideas be extended (learned spectral weights, multi-scale temporal pyramids) for even smoother control?

06Conclusion & Future Work

Three-sentence summary:

  • VLANeXt shows that careful architectural and training choices—soft VLM-policy coupling, larger policy heads, multi-view vision, proprioception in the VLM, action chunking, and continuous objectives with a tiny frequency loss—consistently beat just making models bigger.
  • Under a unified, fair evaluation, these choices deliver state-of-the-art success on LIBERO and a roughly 10-point average gain over OpenVLA-OFT on LIBERO-plus.
  • The released lightweight framework makes it easy for the community to reproduce, extend, and test new VLA ideas on a shared foundation.

Main achievement:

  • Distilling a practical, evidence-backed recipe that reliably strengthens VLA models and a simple model (VLANeXt) that validates the recipe.

Future directions:

  • Cheaper ways to “think ahead” (efficient world modeling), interactive post-training to handle surprises, scaling to varied robots and sensors, and richer temporal modeling (learned spectral priors, hierarchical chunking).

Why remember this:

  • It reframes progress in VLA from ad-hoc tinkering to principled choices that fit the physics of action and the structure of perception. With this recipe, better, smaller, and more robust robot helpers move from possibility to practice.

Practical Applications

  • Home assistance: Robustly pick, place, and organize items when rooms are messy or lighting changes.
  • Hospitals: Follow spoken instructions to fetch tools or deliver supplies with smooth, careful motions.
  • Warehouses: Handle varied packages from multiple views, maintaining accuracy under layout shifts.
  • Manufacturing: Insert, fasten, or assemble parts with wrist-view precision and stable trajectories.
  • Retail restocking: Place products neatly on shelves despite occlusions and busy backgrounds.
  • Elder care: Assist with daily tasks using natural language, accounting for small environment changes.
  • Lab automation: Manipulate containers and instruments with repeatable, low-jitter movements.
  • Education and research: Use the unified codebase to test new VLA ideas quickly and fairly.
  • Teleoperation assist: Smooth operator commands into coherent action chunks for safer control.
  • Field service robots: Maintain stable control when cameras shake or lighting flickers.
#Vision-Language-Action#robot manipulation#flow matching#action chunking#proprioception#multi-view perception#soft VLM-policy connection#frequency-domain loss#Qwen-VL backbone#imitation learning#world modeling#LIBERO benchmark#LIBERO-plus robustness#policy head design#meta queries