
RealWonder: Real-Time Physical Action-Conditioned Video Generation

Intermediate
Wei Liu, Ziyu Chen, Zizhang Li et al. Ā· 3/5/2026
arXiv

Key Summary

  • RealWonder is a system that turns a single picture and 3D physical actions (like pushes, wind, and robot gripper moves) into a realistic video in real time.
  • The key trick is using a physics simulator as a bridge: it converts hard-to-understand 3D actions into easy-to-read visual signals (optical flow and a rough color preview) for a video model.
  • RealWonder reconstructs a 3D scene from one image, runs fast physics to predict motion, and then guides a distilled 4-step video generator to make the final video frames.
  • It achieves up to 13.2 frames per second at 480Ɨ832 resolution, making it interactive (you act, it shows the result right away).
  • Unlike other methods, it doesn’t need rare action-video training pairs and doesn’t try to turn continuous forces into discrete tokens.
  • It works across different materials (rigid objects, cloth, liquids, smoke, sand) and with robot actions and camera motion.
  • In comparisons, it is preferred by humans for action following and physical plausibility, and it scores best or second best on automated metrics.
  • Ablations show both optical flow and the RGB preview are needed: without either, motion or structure goes wrong.
  • This opens doors for AR/VR, robot learning, and motion planning with live, physics-aware visual feedback.

Why This Research Matters

RealWonder turns physical actions into immediate, believable videos, bringing interactivity to places where visuals used to be passive. This makes AR/VR worlds feel responsive, as if you’re actually touching and changing them. Robots can practice and plan with fast, physics-aware visual feedback, which can improve safety and efficiency. Educators can let students experiment with ā€œwhat-ifā€ physics in a way that looks and feels real. Creators can iterate ideas rapidly—try a different wind, a new push, or a robot grip—and see the outcome instantly. Because it avoids hard-to-find action-video datasets, it’s practical to build and extend. Overall, it’s a new, live bridge between how we act in 3D and what we see on screen.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you have a toy boat in a bathtub. When you blow on it, it moves. If you pinch a towel, it crumples. You don’t just want a pretty video—you want the video to react to your pushes and pinches the way real things do.

🄬 The World Before: Before this work, video generators were great at making beautiful clips from text or a single picture. But they were mostly passive storytellers: you’d say ā€œa boat on a lake,ā€ and they’d show a boat moving the way their training taught them. If you asked, ā€œWhat happens if I blow from the right?ā€ the models didn’t truly understand forces, friction, or how a push travels through a scene. Many tools only let you control simple 2D things, like dragging a point or giving a 2D motion line on the screen. That helps a bit, but it’s not the same as a real 3D push with strength and direction in space.

The Problem: Real 3D actions—forces, torques, and robot gripper commands—live in the physical world. They’re continuous (any strength, any direction) and unbounded (no fixed menu of options), so turning them into tokens (like words) doesn’t work well. Also, training data that matches exact actions to resulting videos is rare: it’s very hard to look at a video and perfectly measure the hidden forces that caused it.

Failed Attempts: People tried to control videos with 2D drags or trajectories—fun for drawing a path, but not physically grounded. Others tried to stuff continuous actions into token-like inputs, but forces don’t fit neatly. Even physics-based pipelines that rebuild 3D and run solvers could be slow, taking minutes for short clips. Real-time interaction stayed out of reach.

The Gap: We needed a translator—something that could take a real 3D action (like ā€œpush here with this strengthā€) and turn it into visual guidance that video models naturally understand, without asking the video model to learn physics from scratch or to read weird action tokens.

Real Stakes: Why this matters to daily life:

  • AR/VR: You flick a virtual curtain, and it sways instantly and realistically, making experiences feel alive.
  • Robotics: A robot tries a new grip on a real object and gets quick, physics-aware visual feedback to plan its moves.
  • Education: Students explore ā€œwhat-ifā€ physics by pushing a block or blowing wind at sandcastles and seeing believable outcomes.
  • Creativity: Artists test stylized winds, splashes, or crumples interactively instead of waiting minutes for every change.

šŸž Anchor: Think of RealWonder like a playground sandbox that responds as you poke, blow, or pinch. You do an action in 3D; it quickly predicts how everything moves; then it paints a high-quality video of that motion right away, so you can try the next idea without waiting.

02Core Idea

šŸž Hook: You know how a weather map shows arrows and colors to turn complex wind math into a picture we can understand? That picture becomes a bridge between the hard math and what your eyes can read.

🄬 The "Aha!" Moment (one sentence): Use a fast physics simulator to convert 3D physical actions into visual motion cues (optical flow and a rough RGB preview) that a video generator can follow in real time.

Multiple Analogies (3 ways):

  1. Translator analogy: Forces speak ā€œmath,ā€ video models speak ā€œpictures.ā€ Physics simulation is the translator that turns math (pushes, pulls, torques) into pictures (motion arrows and rough frames).
  2. Choreographer analogy: The simulator is the choreographer giving dance steps (flow) to the video model, which is the dancer that performs it beautifully on stage (final frames).
  3. Map-and-artist analogy: The simulator draws a sketchy map (optical flow + preview), and the video model, like a skilled artist, paints a photorealistic scene following that map.

Before vs After:

  • Before: Video models guessed motion patterns from text and pixels but had no true grasp of 3D forces; 2D drags helped only a little; real-time action-conditioning was missing.
  • After: Physics handles the action logic; the video model handles the look. Together they react instantly as you apply forces, move a robot gripper, or pan a camera.

Why It Works (intuition, not equations):

  • Video models are amazing at making images look real, but they don’t inherently know physics. Physics simulators are amazing at predicting motion from forces, but their renderings often look simple. Combining them gives you both: plausible motion and pretty pictures.
  • The simulator outputs optical flow (per-pixel motion) and a coarse RGB preview. These are visual signals that video diffusion models already understand well—so no weird action tokens are needed.
  • A special distillation makes the video model run in just a few steps, frame by frame, so it feels live.

Building Blocks:

  • 3D Scene Reconstruction: Build a usable 3D scene (geometry + materials) from one image.
  • Physics Simulation: Apply forces, solve motion with the right solver (rigid, cloth, liquid, sand), and produce motion fields.
  • Visual Bridge: Turn motion into optical flow and a rough RGB preview.
  • Distilled Video Generator: A teacher-student setup adds flow conditioning, then compresses it into a 4-step, causal model for real-time streaming.

šŸž Anchor: Picture a sandcastle photo. You add a leftward wind action. The simulator converts that into a leftward flow field and a rough preview of collapsing sand. The video model—guided by that flow and preview—renders the final, realistic video of the sandcastle falling left, all at interactive speed.

03Methodology

At a high level: Single Image + 3D Actions → (A) 3D Reconstruction → (B) Physics Simulation → (C) Optical Flow + RGB Preview → (D) 4-step Causal Video Generator → Output Frames (streaming)

A) 3D Scene Reconstruction šŸž Hook: Imagine building a little 3D diorama from a single photo so you can poke it and see what happens.

🄬 What it is: The system turns one image into a simulatable 3D scene: background points, object points, and material guesses.

  • How it works (recipe):
    1. Segment objects from background (so we know what can move).
    2. Estimate depth for each pixel and unproject into 3D points.
    3. Complete hidden surfaces by generating a 3D mesh and aligning it to the scene.
    4. Classify materials (rigid, cloth, liquid, smoke, granular, elastic), with parameters you can override.
  • Why it matters: Without a 3D scene, you can’t apply real forces in space—motion would be guesswork.

šŸž Anchor: From a photo of stacked persimmons, the system builds 3D persimmon shapes and a background wall so a sideways hit can be simulated.

B) Physics Simulation as the Bridge šŸž Hook: You know how a game engine makes balls bounce and cloth flap? That’s what the simulator does, but fast and tailored to guide video generation.

🄬 What it is: Given the current 3D state plus actions (forces, robot moves, camera pose), compute the next positions and velocities.

  • How it works (recipe):
    1. Represent actions:
      • External forces at 3D points.
      • Robot end-effector commands turned into joint torques via inverse kinematics.
      • Camera pose updated for rendering.
    2. Pick solvers per material and step dynamics.
    3. Output 3D motion that we later convert into visual cues.
  • Why it matters: Without correct physical motion underneath, the video could look pretty but act wrong.

Math (state update): (p_{t+1}, v_{t+1}) = PhysicsStep(S_t, a_t). Example: If a 2 kg block at rest gets a 10 N push for t = 0.1 s, the acceleration is 10/2 = 5 m/s², so v_{t+1} = 0.5 m/s and p_{t+1} = 0.05 m (starting from 0).
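The worked example can be reproduced with a toy semi-implicit Euler step. The paper uses material-specific solvers; this scalar `physics_step` is only a sketch of the state update:

```python
def physics_step(p, v, force, mass, dt):
    """Semi-implicit Euler: a = F/m, update velocity first, then position."""
    a = force / mass
    v_next = v + a * dt
    p_next = p + v_next * dt
    return p_next, v_next

# A 2 kg block at rest, pushed with 10 N for 0.1 s
p1, v1 = physics_step(p=0.0, v=0.0, force=10.0, mass=2.0, dt=0.1)
# v1 = 0.5 m/s, p1 = 0.05 m
```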

šŸž Anchor: Push the middle persimmon to the left; the solver updates its 3D motion so later we can show it toppling realistically.

C) Visual Bridge: Optical Flow + Coarse RGB Preview

  1. Optical Flow šŸž Hook: Think of tiny arrows on each pixel showing where it will move next.

🄬 What it is: Per-pixel motion from the 3D velocities, projected into the camera view.

  • How it works (recipe):
    1. Take each 3D point and its velocity.
    2. Project where it is now and where it will be shortly.
    3. The difference in pixels is the flow arrow for that point.
  • Why it matters: The video model understands visual motion (arrows) much better than raw force vectors.

Math (optical flow): F_t(u, v) = Ī (p_t + Ī”tĀ·v_t) āˆ’ Ī (p_t). Example: If p_t = (1, 0, 5) m, v_t = (0.1, 0, 0) m/s, Ī”t = 0.1 s, and the pinhole projection is Ī (x, y, z) = (x/z, y/z), then the current projection is (u, v) = (0.2, 0) and the next is (u′, v′) = (0.202, 0), so the flow is (0.002, 0) in normalized image units.
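The flow formula can be checked numerically with the same toy pinhole projection; with Ī”t = 0.1 s the projected point moves from 0.2 to 0.202, a flow of 0.002 normalized units:

```python
def project(p):
    """Pinhole projection: (x, y, z) -> (x/z, y/z)."""
    x, y, z = p
    return (x / z, y / z)

def optical_flow(p, v, dt):
    """Flow = projection of the future position minus projection of the current one."""
    u0, v0 = project(p)
    p_next = tuple(pi + dt * vi for pi, vi in zip(p, v))
    u1, v1 = project(p_next)
    return (u1 - u0, v1 - v0)

flow = optical_flow(p=(1.0, 0.0, 5.0), v=(0.1, 0.0, 0.0), dt=0.1)
```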

šŸž Anchor: For a rightward wind on a cloth, flow arrows point right where the fabric is headed next.

  2. Coarse RGB Preview šŸž Hook: Like a quick sketch before painting the final picture.

🄬 What it is: A fast, rough rendering of the moving 3D points.

  • How it works (recipe):
    1. Rasterize point clouds to get a low-cost color preview.
    2. Include occlusion hints (what’s in front of what).
  • Why it matters: Flow alone can miss shape or occlusion cues; the preview guides structure.

šŸž Anchor: When sand collapses, the preview shows where parts start to hide other parts, so the final video handles occlusions correctly.

D) Real-Time Video Generation (Distilled, Causal, 4-Step)

  1. Flow-Conditioned Teacher via Warped Noise šŸž Hook: Imagine starting a painting not on blank canvas, but on a canvas already brushed in the direction the objects should move.

🄬 What it is: A pretrained image-to-video model adapted so the initial noise already ā€œcontainsā€ the motion from optical flow.

  • How it works (recipe):
    1. Take base model and training videos; compute their optical flow.
    2. Sample Gaussian noise z and warp it with flow: z^F = Warp(z, F).
    3. Train so the model learns to turn flow-shaped noise into proper motion.
  • Why it matters: Encoding motion directly in noise is simple and efficient—no new heavy modules needed.

Math (warped noise): z^F = Warp(z, F). Example: If the flow at pixel (10, 10) is (+2, 0), the noise value from (8, 10) moves to (10, 10), so motion is baked into the starting noise.
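A backward warp over integer pixels is enough to illustrate the idea; real implementations typically use sub-pixel (e.g. bilinear) sampling, so `warp_noise` here is a simplified sketch:

```python
import numpy as np

def warp_noise(z, flow):
    """Backward warp: output pixel (x, y) takes the noise value from
    (x - fx, y - fy), rounded to the nearest integer pixel."""
    h, w = z.shape
    out = np.zeros_like(z)
    for y in range(h):
        for x in range(w):
            fx, fy = flow[y, x]
            sx = int(round(x - fx))
            sy = int(round(y - fy))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = z[sy, sx]
            else:
                out[y, x] = z[y, x]  # fall back to unwarped noise at borders
    return out

# Uniform flow of (+2, 0): pixel (10, 10) receives the noise value from (8, 10)
z = np.random.randn(16, 16)
flow = np.full((16, 16, 2), (2.0, 0.0))
zf = warp_noise(z, flow)
```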

šŸž Anchor: For a leftward wind, noise patterns are nudged left, making leftward motion natural to generate.

  2. Distillation to a 4-Step Causal Student šŸž Hook: Like squeezing a long recipe into a quick, tasty 4-step version you can cook every night.

🄬 What it is: Compress the teacher into a fast, streaming model that generates one frame at a time.

  • How it works (recipe):
    1. Use Distribution Matching Distillation so the student’s distribution matches the teacher’s.
    2. Train with autoregressive rollout tricks so it stays stable for long videos.
  • Why it matters: Without distillation, generation needs many steps and is too slow for real-time.

Math (distillation objective): āˆ‡L_DMD = E_t[āˆ‡_Īø KL(p_fake,t ∄ p_real,t)]. Example: For a binary toy case with p_fake = (0.6, 0.4) and p_real = (0.8, 0.2), KL = 0.6 ln(0.6/0.8) + 0.4 ln(0.4/0.2) ā‰ˆ 0.105, and training nudges p_fake toward p_real.
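With natural logarithms, the toy KL term evaluates to about 0.105; minimizing it pulls p_fake toward p_real:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

val = kl([0.6, 0.4], [0.8, 0.2])
# ā‰ˆ 0.6Ā·ln(0.75) + 0.4Ā·ln(2) ā‰ˆ 0.105
```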

šŸž Anchor: After training, the student paints smooth, consistent frames at 13.2 FPS.

  3. RGB Conditioning with SDEdit Mixing šŸž Hook: Blend the rough preview with guided noise so the model gets both structure and motion clues.

🄬 What it is: Start denoising from a mix of the encoded preview and the flow-shaped noise.

  • How it works (recipe):
    1. Encode the coarse RGB preview via the VAE encoder: E(Ṽ_t).
    2. Mix it with z^F using a coefficient α^(3) at step 3: V_t,(3) = α^(3) E(Ṽ_t) + āˆš(1 āˆ’ α^(3)) z^F_t.
    3. Finish the remaining denoising steps.
  • Why it matters: Keeps motion accurate (flow) and structure/occlusion correct (preview).

Math (mixing): V_t,(3) = α^(3) E(Ṽ_t) + āˆš(1 āˆ’ α^(3)) z^F_t. Example: If α^(3) = 0.7, E(Ṽ_t) = 2.0, and z^F_t = 1.0 (toy scalars), then V_t,(3) = 0.7Ā·2.0 + āˆš0.3Ā·1.0 ā‰ˆ 1.4 + 0.548 = 1.948.

šŸž Anchor: For cloth flapping right, the mixture keeps the rightward motion (flow) and the cloth’s folds/overlaps (preview).

E) Streaming Inference šŸž Hook: Like two conveyor belts: one computes motion, the other paints frames, both in sync.

🄬 What it is: A loop where physics produces fresh flow/preview, and the causal generator produces the next frame using recent history.

  • How it works (recipe):
    1. Physics stream runs fast to update F_t and Ṽ_t.
    2. Video stream consumes them plus past frames to output V_t.
  • Why it matters: Without streaming, you’d wait for whole clips; with streaming, you interact live.

šŸž Anchor: You drag a robot gripper to squeeze a sponge; each tiny move updates the flow/preview and the next frame shows the sponge deforming right away.

04Experiments & Results

šŸž Hook: Imagine a science fair where different video makers compete: Who follows the push correctly? Whose motion looks most natural? Who looks best—and who can do it live?

🄬 The Test (what they measured and why):

  • Visual quality and aesthetics: Do frames look good to humans?
  • Consistency: Do frames stay coherent over time (no sudden glitches)?
  • Physical realism: Does the motion make sense for the given action?
  • Human preferences: Which video do people pick as better for action following, motion fidelity, and plausibility?
  • Speed: Can it run in real time with low latency?

The Competition (baselines):

  • PhysGaussian: Physics-integrated 3D representation that optimizes dynamics but is slower and less photorealistic.
  • CogVideoX-I2V: A strong open video generator conditioned on text and images (no true 3D actions).
  • Tora: Allows drag-based 2D trajectory control (still screen-space, not full 3D forces).

Scoreboard (with context):

  • RealWonder scores around 0.708 (visual quality), 0.593 (aesthetics), 0.265 (consistency), and 0.705 (physical realism). Think of these like getting an A- to A when others are getting mostly B’s.
  • Human 2AFC studies (400 participants) prefer RealWonder strongly over baselines for action following and physical plausibility—like picking the right answer 4 out of 5 times when others get picked much less.
  • Speed: RealWonder streams at up to 13.2 FPS at 480Ɨ832, with sub-100 ms per-frame latency; the others are far slower and not truly streamable.

Surprising Findings:

  • Without the physics simulator (text-only action hints), motion can be pretty but wrong (e.g., smoke doesn’t change direction as commanded).
  • Without optical flow, the model may ignore motion; without the RGB preview, motion follows but structure/occlusion suffers. Both are necessary and complementary.
  • The video model can add visually plausible details that the simulator doesn’t explicitly model (like gentle water ripples around a boat), making results look more lifelike while still following the physical cue.

šŸž Anchor: For sandcastle-in-the-wind, RealWonder shows the castle collapsing toward the wind side quickly and clearly, while some baselines either don’t move the sand properly or can’t stream the result in real time.

05Discussion & Limitations

šŸž Hook: Imagine a supercar that needs good fuel and a smooth road; it’s fast and amazing, but you should know where it shines and where it stumbles.

🄬 Limitations (honest assessment):

  • 3D reconstruction from a single image can be imperfect (e.g., wrong depth on an object’s back), which can mislead simulation and thus the final video.
  • Material estimates from vision-language cues are sometimes off (e.g., calling snow ā€œsandā€), although the system is fairly robust and users can override.
  • Physics aims for plausibility, not perfect scientific accuracy; small artifacts can appear, especially in complex multi-material interactions.
  • Trained flow control relies on 2D flow-video pairs; extreme edge cases (very unusual motions) may generalize less.

Required Resources:

  • A GPU capable of running the simulator and the distilled 4-step causal video generator in parallel (the paper measured on high-end GPUs).
  • Pretrained models for segmentation, depth, and mesh completion.

When NOT to Use:

  • If you require precise, engineering-grade physics (e.g., exact torque curves or safety-critical simulation), use a full physics engine and validation tools.
  • If the input image lacks needed scene information (e.g., heavy occlusion of the main object) and no user correction is possible.
  • If you need ultra-high resolution and ultra-long sequences without any compromise in real-time speed.

Open Questions:

  • Can reconstruction be made both faster and more accurate with emerging large reconstruction models, reducing failure cases?
  • How far can we push physical accuracy while keeping real-time speeds?
  • Can we learn better cross-material interactions (e.g., cloth on water) directly from data without needing special tuning?
  • How to design user interfaces that let people specify 3D actions easily (like painting wind fields) while staying intuitive for non-experts?

šŸž Anchor: Think of RealWonder as the best real-time ā€œphysics painterā€ today—wonderful for interactive visuals and learning—but for critical measurements, you still bring in the lab-grade instruments.

06Conclusion & Future Work

šŸž Hook: Picture a magic window: you touch the world (push, blow, grab), and the scene responds instantly and looks real.

🄬 3-Sentence Summary:

  • RealWonder uses a physics simulator as a translator to turn 3D actions into visual motion cues (optical flow + a rough preview).
  • A distilled, 4-step, causal video generator follows these cues to render photorealistic frames at up to 13.2 FPS from a single image.
  • This avoids tokenizing continuous actions and needing action-video datasets, enabling real-time, physically plausible, action-conditioned video.

Main Achievement:

  • Bridging physical action and visual generation through a carefully engineered intermediate representation—so real-time video can both follow physics and look great.

Future Directions:

  • Stronger, faster single-image 3D reconstruction to reduce error and boost stability.
  • Richer physics (multi-material fidelity, better fluid/cloth coupling) while keeping real-time speeds.
  • Friendlier action authoring tools (e.g., paint-on force fields) and tighter robot learning integration.

Why Remember This:

  • RealWonder shows that the missing link between ā€œhow we actā€ (forces) and ā€œwhat we seeā€ (video) can be a physics-informed visual bridge. That simple idea unlocks responsive, believable, and creative experiences in AR/VR, robotics, and education—right now, in real time.

šŸž Anchor: Next time you flick a virtual towel or nudge a virtual boat, imagine the physics drawing arrows behind the scenes—and the video model painting those arrows into the final, living picture.

Practical Applications

  • AR/VR prototyping where users push, pull, or blow on virtual objects and see instant, realistic reactions.
  • Robot skill rehearsal with quick, physics-guided visual feedback for grasping, pushing, or tool use.
  • Interactive physics education, letting students explore forces on cloth, fluids, or sand and observe outcomes.
  • Creative direction for films and games, rapidly testing wind, impacts, or camera motion on a single reference image.
  • Industrial design previews, simulating how materials deform or shift under loads before full CAD simulations.
  • Human-robot interaction demos where live gripper motions produce immediate, plausible visuals for stakeholders.
  • Virtual tryouts of camera moves (dollies, pans) over dynamic scenes to previsualize shots.
  • Rapid iteration of special effects (splashes, crumples) to select looks before heavy offline rendering.
  • Training data generation for downstream models that benefit from action-conditioned visual sequences.
  • Science communication, showing how forces and materials interact in visually engaging, real-time demos.
Tags: action-conditioned video generation, physics simulation, optical flow, real-time streaming, image-to-video diffusion, distillation, causal video model, SDEdit, inverse kinematics, Position-Based Dynamics, Material Point Method, 3D reconstruction, AR/VR, robot learning, motion planning