
Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Intermediate
Linxi Xie, Lisong C. Sun, Ashley Neall et al. · 2/20/2026
arXiv

Key Summary

  • This paper builds a "generated reality" system that lets AI-made videos react to your real head and hand movements in VR.
  • It adds fine-grained hand control by mixing two signals: a 2D stick-figure hand image and precise 3D hand joint angles.
  • These signals are blended into a video diffusion transformer using a simple but powerful trick called token addition.
  • The camera (your head position and direction) is also controlled precisely using 6-DoF data turned into Plücker ray features.
  • A slower, bidirectional "teacher" model is distilled into a faster, causal "student" that runs interactively (about 11 FPS with 1.4 s latency on an H100).
  • On benchmarks, the hybrid 2D–3D hand method beats alternatives on 3D hand accuracy and keeps video quality high.
  • In user studies, people completed fine-motor tasks far more often with hand control (71.2%) than with text-only prompts (3.0%).
  • Users felt much more in control with tracked hands (4.21/7) than with text-only prompts (1.74/7).
  • This approach enables zero-shot, immersive XR experiences without building detailed 3D assets.
  • Limitations include latency, long-horizon visual drift, and lower quality than polished VR engines, but the path to rapid improvement is clear.

Why This Research Matters

This work turns AI-generated videos into interactive worlds where your real hands and head are the controllers. That shift enables practice, training, and therapy without building expensive 3D assets or coding complex physics. Designers and educators can spin up new environments on demand, tailored to a learner’s exact motions. Rehabilitation can adapt exercises to a patient’s current range of motion with precise finger tracking. Entertainment and creativity become more participatory—less watching, more doing. As latency drops and quality rises, this approach could power everyday XR experiences that feel personal, responsive, and useful.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how building a theme park in a video game takes forever because you have to place every tree, fence, and ride by hand? Making virtual reality (VR) worlds can feel like that—tons of work just to get started.

🥬 The Concept (Extended Reality, XR): XR is the family of technologies that blend real and virtual worlds so you can look around, move, and interact.

  • How it works:
    1. A headset tracks how your head moves.
    2. Hand sensors track your fingers and wrists.
    3. A computer renders a world that responds to your movement.
  • Why it matters: Without XR, you’re just watching a flat screen. XR lets you step inside and do things, not just see them. 🍞 Anchor: When you turn your head wearing a headset and the scene turns with you, that’s XR in action.

🍞 Hook: Imagine wanting a brand-new adventure world, but instead of building every rock and door, you just describe it and it appears.

🥬 The Concept (Video world model): A video world model is a smart generator that creates realistic video of a world that changes over time.

  • How it works:
    1. It learns patterns of how scenes and objects behave.
    2. You give it instructions (text or controls).
    3. It predicts the next frames so the world looks and feels consistent.
  • Why it matters: Without a world model, you must manually build 3D assets and game logic. That’s slow and expensive. 🍞 Anchor: Type “a snowy forest with a glowing lamppost,” and it creates a moving scene of snowflakes, trees, and light.

🍞 Hook: But here’s the snag—try pushing a tiny button or twisting a jar lid in that generated world with only a keyboard or a short text. It’s like trying to play the piano with oven mitts on.

🥬 The Problem: Most current video world models only listen to coarse controls (like text or camera movement), not your detailed finger motions.

  • How it shows up:
    1. You can look around or change scenes.
    2. But you can’t precisely grab, turn, or press objects with lifelike hands.
    3. The AI guesses your intent and often gets it wrong.
  • Why it matters: Without fine hand control, many XR tasks (surgery practice, instrument training, assembly, therapy) don’t work well. 🍞 Anchor: If you ask the model to “open the jar,” it might wiggle the wrong lid or not line up your grip correctly.

🍞 Hook: People tried a few shortcuts—like moving the camera accurately or using a stick-figure person—but fingers still acted like clumsy cartoon paws.

🥬 Failed Attempts:

  • What they tried:
    1. Camera-only control: Great for where you look, not for how you grasp.
    2. Full-body pose: Good for arms and legs, too rough for fingers.
    3. Binary hand masks: Show where a hand is, not its depth or exact angles.
  • Why these fail: They miss detailed, 3D finger articulation and precise wrist movement. 🍞 Anchor: It’s like having a map that shows the city but not the street numbers—you can’t find the exact doorbell to press.

🍞 Hook: So what’s missing? The model needs to understand both where your hand is in the picture (2D) and how it’s shaped in space (3D).

🥬 The Gap: We need a hand signal that is both spatially aligned with the image and disambiguates depth and finger angles.

  • How to fill it:
    1. Give the model a 2D hand skeleton image (great spatial alignment).
    2. Also give it 3D hand pose parameters (finger joints + wrist in 3D).
    3. Fuse them so they agree.
  • Why it matters: Without both, the AI mistakes near/far, overlaps, and finger bends. 🍞 Anchor: A stick figure shows where the hand is on the screen; the 3D angles tell if the index finger is curled behind the thumb or pointing forward.

🍞 Hook: If we get this right, your hands become the controller, and the world reacts smoothly, like magic gloves for any virtual place.

🥬 Real Stakes: Faster creation of training and learning experiences without building custom 3D assets.

  • How it helps:
    1. Practice skills (open jars, push buttons, grip tools) safely.
    2. Rehab and therapy plan sessions that match a patient’s actual hand motions.
    3. Designers test ideas quickly—no need to model every screw.
  • What breaks without it: Generations feel impressive but not usable—you watch a video instead of truly doing things in it. 🍞 Anchor: A student practicing “turn the steering wheel” can actually turn it with their tracked hands and see the car respond.

02Core Idea

🍞 Hook: Imagine building a puppet show where a flat drawing shows where the puppet is on stage, while hidden strings control every finger bend in 3D—now the show looks real and reacts to your exact moves.

🥬 The Concept (Key insight): Combine a 2D hand skeleton image with precise 3D hand pose parameters, add explicit camera control, and fuse them simply (token addition) inside a video diffusion transformer—then distill it to run interactively.

  • How it works:
    1. Track your head (camera) and both hands.
    2. Render a 2D skeleton (aligned to the image) and compute 3D hand pose parameters (HPP: wrist + 20 finger joints per hand).
    3. Turn camera poses into Plücker ray features.
    4. Encode everything and fuse them via element-wise token addition in the diffusion transformer.
    5. Distill a fast, causal model from a stronger bidirectional teacher for real-time play.
  • Why it matters: 2D tells the model “where on the screen,” 3D tells it “how in space,” and camera tells it “from which viewpoint.” Together, hands feel accurate and responsive. 🍞 Anchor: When you reach left to push a green button, the generated hand lines up with the button and presses it, instead of awkwardly poking the table.

🍞 Hook: Think of three analogies for the same trick:

🥬 Multiple Analogies:

  • Map + Altimeter: 2D is like a street map (position on the page), 3D is the altimeter (height/depth), and camera is your moving viewpoint—together they guide you exactly.
  • Recipe + Photo + Timer: The 3D angles are the recipe steps (precise), the 2D skeleton is the photo (looks like it), and the camera is the oven timer (keeps timing and order right) so the dish turns out.
  • Glasses + Gloves: Camera is your glasses (what you see), 2D skeleton is a tracing overlay (where to draw), and 3D HPP are smart gloves (how to bend fingers) so your drawing matches your real hand. 🍞 Anchor: With all three, grabbing a sword in a fantasy scene looks natural: the hand’s on the sword (2D), the grip curls correctly (3D), and the view follows your head.

🍞 Hook: Before this, models listened mostly to text or rough motion; after this, they follow your head and fingers with finesse.

🥬 Before vs After:

  • Before: Good at scenery and camera sweeps; clumsy at pinching, twisting, pressing.
  • After: Still good at scenery, now also precise in grasp, rotation, and interaction timing.
  • Why it matters: It shifts from “watching” to “doing.” 🍞 Anchor: Opening a jar goes from a vague animation to a successful twist with aligned fingers.

🍞 Hook: Why does this simple fusion work so well?

🥬 Why It Works (intuition):

  • 2D skeleton anchors hands to exact image pixels—reduces guessing.
  • 3D HPP resolves depth and self-occlusion—no more flat, wrong bends.
  • Token addition blends signals directly into the visual tokens—clean and stable.
  • Camera PlĂĽcker features align scene motion with your head turns.
  • Distillation preserves quality but speeds up inference for interactivity. 🍞 Anchor: Near the frame edge, where skeleton lines can be cut off, the 3D angles keep the grip complete, so the hand doesn’t vanish.

🍞 Hook: Let’s break the big idea into bite-sized parts.

🥬 Building Blocks:

  1. 2D ControlNet-style skeleton: a per-frame image of the hand bones.
  2. 3D HPP (UmeTrack): wrist pose + 20 joint angles per hand.
  3. Token addition: element-wise add the encoded hand and camera features to patch tokens.
  4. Camera embeddings: 6-DoF head pose converted to Plücker rays, then encoded.
  5. Iterative encoder training: stabilize learning by training hand and camera encoders in stages.
  6. Distillation to causal: a fast student model learns from the bidirectional teacher for real-time rollouts. 🍞 Anchor: Together, these blocks let you swing a golf club in a generated course and see your grip and viewpoint update instantly.

03Methodology

🍞 Hook: Imagine a relay race—your headset hands the baton (your motion) to the model, the model turns it into video frames, and your headset shows you the result, over and over.

🥬 Overview (the pipeline): At a high level: Head/hand tracking → [Encode 2D skeleton, 3D hand pose, camera] → [Fuse via token addition → Diffusion Transformer] → [Decode to frames] → Output video.

  • Why it matters: Each step keeps the video grounded in your real movements; skip one and alignment or realism breaks. 🍞 Anchor: You lift your right hand to a jar; the next frames show your right hand rising and fingers curling on the lid from your current viewpoint.

Step 1. Track the user

  • What happens: A Meta Quest 3 tracks head pose (6-DoF) and both hands (wrist + finger joints per frame).
  • Why it exists: Without accurate tracking, the model is guessing.
  • Example: If the headset reports your head rotated 15° right and your index finger flexed 20°, the model uses both signals next frame.

🍞 Hook: You know how a stick-figure drawing shows where a hand is on the page?

🥬 Step 2. Make a 2D hand skeleton (ControlNet-style)

  • What happens:
    1. From the tracked hand, render a per-frame 2D skeleton image (bones/joints) aligned to the camera view.
    2. Encode it with the same VAE used for the video to get z_c.
  • Why it exists: This gives pixel-accurate spatial grounding; without it, the hand might drift away from where it should appear.
  • Example: If your thumb tip should be 120 px right of center, the skeleton marks that exact spot. 🍞 Anchor: The model sees the stick-hand exactly where your real hand should be on the screen.
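The rasterization idea behind Step 2 can be sketched in a few lines. This is a minimal, hypothetical illustration (the joint pixels, image size, and bone connectivity here are made up, and the real system renders full two-hand skeletons aligned to the tracked camera view):

```python
import numpy as np

def draw_bone(img, p0, p1, n=32):
    """Rasterize one bone by sampling n points along the 2D segment."""
    for t in np.linspace(0.0, 1.0, n):
        x, y = (1 - t) * p0 + t * p1
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < img.shape[0] and 0 <= xi < img.shape[1]:
            img[yi, xi] = 1.0
    return img

# Toy projected 2D joints for a three-joint thumb chain (hypothetical pixels).
H = W = 64
skeleton = np.zeros((H, W), dtype=np.float32)
joints_2d = np.array([[20.0, 40.0], [28.0, 32.0], [34.0, 26.0]])  # (x, y) per joint
for a, b in [(0, 1), (1, 2)]:            # bone connectivity: base->mid->tip
    draw_bone(skeleton, joints_2d[a], joints_2d[b])
# The resulting image would then be encoded by the video VAE to produce z_c.
```

In the paper's pipeline this skeleton image is encoded with the same VAE as the video frames, so its latent z_c lives in the same spatial grid as the video latent.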

🍞 Hook: A flat drawing can’t tell if a finger is in front of or behind another.

🥬 Step 3. Compute 3D hand pose parameters (HPP)

  • What happens:
    1. Use a parametric hand model (UmeTrack) to get wrist pose + 20 joint angles per hand.
    2. A small motion encoder (1D conv) turns these numbers into hand feature tokens.
  • Why it exists: 3D resolves depth and occlusion; without it, fingers can fold the wrong way or pass through objects.
  • Example: A 25° curl at the proximal joint and 10° at the distal joint produces a natural grip shape. 🍞 Anchor: The AI knows your index finger is curled behind the jar lid, not floating in front.
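A minimal sketch of the Step 3 motion encoder, assuming a plain temporal 1D convolution over per-frame hand pose parameters (the channel sizes, kernel width, and random weights here are illustrative, not the paper's actual configuration):

```python
import numpy as np

def conv1d(seq, kernels, bias):
    """Temporal 1D convolution: seq (T, C_in) -> (T, C_out), 'same' padding."""
    T, c_in = seq.shape
    c_out, _, k = kernels.shape              # kernels: (C_out, C_in, k)
    pad = k // 2
    padded = np.pad(seq, ((pad, pad), (0, 0)))
    out = np.zeros((T, c_out))
    for t in range(T):
        window = padded[t:t + k]             # (k, C_in) slice around frame t
        out[t] = np.einsum('oik,ki->o', kernels, window) + bias
    return out

# One hand's pose parameters over a 12-frame chunk:
# 20 joint angles + a 6-DoF wrist pose = 26 values per frame (toy numbers).
rng = np.random.default_rng(0)
hpp = rng.normal(size=(12, 26))
kernels = rng.normal(size=(64, 26, 3)) * 0.1  # 64 feature channels, kernel size 3
tokens = conv1d(hpp, kernels, np.zeros(64))   # one 64-dim feature token per frame
```

Each frame's token can then be added to that frame's visual tokens, keeping the hand signal synchronized in time.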

🍞 Hook: Turning your head changes everything you see—even if your hands stay put.

🥬 Step 4. Encode camera (head) pose with Plücker rays

  • What happens:
    1. Convert the 6-DoF pose per frame into Plücker ray embeddings across the image grid.
    2. A camera encoder maps them to the size of the patch tokens.
  • Why it exists: This ties the generated scene motion to your actual head movement; without it, the world won’t pivot correctly.
  • Example: A small left yaw makes objects sweep right in the next frame, consistent with perspective. 🍞 Anchor: When you nod, the horizon dips; when you shake your head, objects slide correctly.
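Plücker rays represent each pixel's viewing ray as a normalized direction d plus a moment o × d (where o is the camera origin), giving six channels per pixel. A small sketch under assumed toy intrinsics (the real system uses the headset's calibrated camera):

```python
import numpy as np

def plucker_embedding(c2w, K_inv, h, w):
    """Per-pixel Plücker rays (direction d, moment o x d) -> (6, h, w)."""
    ys, xs = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing='ij')
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # homogeneous pixel coords
    dirs = (pix @ K_inv.T) @ c2w[:3, :3].T                # unproject, rotate to world
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)  # unit ray directions
    origin = c2w[:3, 3]                                   # camera center in world
    moment = np.cross(np.broadcast_to(origin, dirs.shape), dirs)
    return np.concatenate([dirs, moment], axis=-1).transpose(2, 0, 1)

# Toy pinhole intrinsics and an identity head pose (illustrative values only).
h = w = 8
K = np.array([[8.0, 0, 4.0], [0, 8.0, 4.0], [0, 0, 1.0]])
emb = plucker_embedding(np.eye(4), np.linalg.inv(K), h, w)  # (6, 8, 8)
```

Because the embedding is laid out on the image grid, a camera encoder can map it straight onto the patch tokens of the same frame.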

🍞 Hook: Now we have three helpful notes—let’s stick them onto the model’s workspace.

🥬 Step 5. Fuse signals by token addition

  • What happens:
    1. Encode the raw video latent z_r and the skeleton latent z_c with a shared VAE; concatenate channels.
    2. Add (element-wise) the HPP hand tokens and the camera tokens to the patchified latents.
    3. Feed into the diffusion transformer (DiT) blocks with self- and cross-attention.
  • Why it exists: Token addition is simple, stable, and keeps signals synchronized with visual tokens; more complex methods here underperformed on limited data.
  • Example: x = patchify([z_r, z_c]) + E_conv(H) + E_cam(P) 🍞 Anchor: It’s like putting colored overlays directly on the drawing—blue for hand shape, green for camera—so the artist (the DiT) paints with all hints in place.
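The fusion formula above can be sketched with toy shapes (all sizes here are illustrative; the encoders are replaced by random features, since the point is only the patchify-then-add data flow):

```python
import numpy as np

def patchify(latent, p=2):
    """Split a (C, H, W) latent into non-overlapping p x p patch tokens."""
    c, h, w = latent.shape
    tokens = latent.reshape(c, h // p, p, w // p, p)
    return tokens.transpose(1, 3, 0, 2, 4).reshape((h // p) * (w // p), c * p * p)

rng = np.random.default_rng(0)
z_r = rng.normal(size=(16, 8, 8))     # raw video latent (toy size)
z_c = rng.normal(size=(16, 8, 8))     # skeleton latent from the same VAE

# Channel-concatenate the two latents, then patchify into tokens.
x = patchify(np.concatenate([z_r, z_c], axis=0))   # (16 tokens, 128 dims)

hand_tokens = rng.normal(size=(1, x.shape[1]))     # stands in for E_conv(H)
cam_tokens = rng.normal(size=x.shape)              # stands in for E_cam(P)
fused = x + hand_tokens + cam_tokens               # element-wise token addition
```

Addition keeps every control signal on the same token grid as the visuals, which is why it stays synchronized without any extra attention machinery.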

🍞 Hook: Great artists learn from rough sketches to finished pieces.

🥬 Step 6. Train the bidirectional teacher (diffusion with flow matching)

  • What happens:
    1. Use rectified flow and conditional flow matching to train a strong, bidirectional video diffusion transformer on HOT3D (and later GigaHands).
    2. Iteratively train motion encoders: start with camera or hand alone to stabilize, then jointly fine-tune.
  • Why it exists: The teacher learns high-quality, temporally coherent generations; skipping it weakens the student later.
  • Example: The teacher learns that clockwise wrist rotation + a steady grip turns a jar lid correctly. 🍞 Anchor: The teacher can watch and refine the whole clip at once, learning smooth motion.
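The rectified-flow objective in Step 6 reduces to a simple regression: interpolate clean data toward noise along a straight line and train the network to predict the constant velocity of that line. A minimal sketch with a stand-in model (the real DiT is a large trained network, not this toy function):

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_model(x_t, t):
    """Stand-in for the diffusion transformer; not a trained network."""
    return -x_t * (1.0 - t)

x0 = rng.normal(size=(4, 8))          # clean video latent (toy)
x1 = rng.normal(size=(4, 8))          # Gaussian noise sample
t = rng.uniform()                     # random time in [0, 1]

x_t = (1.0 - t) * x0 + t * x1         # straight-line interpolant
target = x1 - x0                      # constant velocity along that line
loss = np.mean((velocity_model(x_t, t) - target) ** 2)  # flow-matching MSE
```

Minimizing this loss over many (x0, x1, t) samples teaches the model a velocity field whose integration carries noise back to data, conditioned here on the hand and camera tokens.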

🍞 Hook: Now we need the fast version—like turning a chef’s gourmet method into a quick, reliable recipe.

🥬 Step 7. Distill to a causal, autoregressive student (self-forcing)

  • What happens:
    1. Teach a smaller causal model to imitate the teacher’s outputs chunk by chunk (e.g., 12 frames).
    2. The student uses only past frames + current conditioning to predict the next frames at runtime.
  • Why it exists: Real-time interactivity demands low latency and streaming; without causality, you must wait for the whole clip.
  • Example: At 11 FPS with ~1.4 s latency, you can reach and see your hand respond in a beat. 🍞 Anchor: Like reading a story one sentence at a time as it’s being written, not waiting for the whole chapter.
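The causal rollout in Step 7 can be sketched as a loop that generates one chunk at a time from past frames plus fresh conditioning (the chunk size of 12 matches the paper; the stand-in model and toy frame sizes are ours):

```python
import numpy as np

def student_step(past_frames, conditioning, chunk=12):
    """Stand-in for the causal student: predicts the next chunk of frames."""
    last = past_frames[-1:]                        # only past context is visible
    drift = 0.01 * conditioning                    # controls nudge the prediction
    return np.repeat(last, chunk, axis=0) + drift

# Autoregressive rollout: each chunk sees only past frames + current controls.
frames = np.zeros((1, 3, 8, 8))                    # seed frame
for step in range(3):                              # three 12-frame chunks
    conditioning = np.full((1, 3, 8, 8), float(step))  # toy control signal
    chunk = student_step(frames, conditioning)
    frames = np.concatenate([frames, chunk], axis=0)
# frames now holds 1 seed frame + 3 x 12 generated frames
```

Because each chunk depends only on the past, the headset can display frames as they arrive instead of waiting for a full clip, which is what makes the ~1.4 s latency and ~11 FPS figures possible.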

🍞 Hook: Finally, wire it into a live VR loop.

🥬 Step 8. Integrate with the headset and stream

  • What happens:
    1. Unity on Quest 3 sends fresh head/hand data to a remote H100 server.
    2. The student model generates 12-frame chunks conditioned on that data.
    3. Frames stream back to the headset for viewing.
  • Why it exists: This closes the loop so your motions change the world immediately; remove streaming and you lose interactivity.
  • Example: When you approach a door handle and twist, the door opens in sync. 🍞 Anchor: You move, it responds—like a conversation between you and the generated world.

The Secret Sauce

  • Hybrid 2D–3D hand conditioning: 2D skeleton for pixel alignment + 3D HPP for articulation and depth.
  • Token addition fusion: a simple, robust way to blend multiple control streams with visual tokens.
  • Joint head–hand conditioning: camera and hands move together realistically.
  • Distillation: preserves teacher quality in a fast student for interactivity.

04Experiments & Results

🍞 Hook: If you want to know if a robot chef is good, you don’t just taste the soup—you also watch how steadily it stirs and whether it follows the recipe.

🥬 The Test: The team measured three things—video quality, hand accuracy, and camera accuracy.

  • Video quality: PSNR (pixel correctness), SSIM (structure), LPIPS (perceptual similarity), and FVD (how realistic the sequence feels overall).
  • Hand accuracy: 3D joint error (MPJPE), 3D surface error (MPVPE), and 2D landmark distance per frame.
  • Camera accuracy: Rotation and translation errors via recovered trajectories (GLOMAP).
  • Why it matters: A pretty video is not enough—hands must match the intended grip, and the camera must follow your head. 🍞 Anchor: It’s like scoring a dance: how nice it looks, how well the feet hit the marks, and whether the camera captures the moves.
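The hand-accuracy metrics above are simple averages of joint distances. A small sketch of MPJPE and the 2D landmark error, using a 21-joint toy hand where every joint is offset by a known amount:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error: average 3D Euclidean distance (mm)."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def landmark_error_2d(pred_px, gt_px):
    """Mean 2D landmark distance in pixels."""
    return np.mean(np.linalg.norm(pred_px - gt_px, axis=-1))

gt = np.zeros((21, 3))                     # 21 hand joints (toy ground truth)
pred = gt + np.array([3.0, 0.0, 4.0])      # every joint off by a 3-4-5 offset
print(mpjpe(pred, gt))                     # 5.0: each joint is exactly 5 mm off

gt2d = np.zeros((21, 2))
print(landmark_error_2d(gt2d + np.array([6.0, 8.0]), gt2d))  # 10.0 pixels
```

MPVPE is the same idea applied to mesh surface vertices instead of skeleton joints, so lower is better for all three.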

🍞 Hook: Who did they compete against?

🥬 The Competition:

  • Camera-only control (CameraCtrl): great head motion, weak hands.
  • Hand-only control (best baseline from ablation): accurate hands, no camera.
  • Variants for hand conditioning: token concatenation, token addition, AdaLN, cross-attention; also 2D-only (ControlNet-style), binary masks, and the proposed hybrid 2D–3D.
  • Datasets: HOT3D (precise, smaller), GigaHands (bigger, more varied). 🍞 Anchor: It’s like races where one runner is fast at sprints (camera), another at hurdles (hands), and the new method does both well.

🍞 Hook: So, how did it score?

🥬 The Scoreboard (with context):

  • Hand conditioning ablation (HOT3D): Token addition beat other injection methods for 3D hand accuracy. 2D-only skeleton control was strong, but the hybrid 2D–3D achieved the best hand accuracy (MPJPE/MPVPE/2D error) while keeping video quality competitive. It approached the measurement lower bounds set by the hand estimator (MPJPE ~9.4 mm, MPVPE ~7.7 mm, 2D ~9 px), which is like getting close to the best-possible score given the ruler you’re using.
  • Joint head–hand control: The joint model achieved the best overall balance—top video quality (e.g., PSNR/SSIM), with camera errors nearly as low as the camera-only model and hand errors near the hand-only model. Think: getting an A in both math and science instead of an A+ in one and a C in the other.
  • Generalization: On the larger GigaHands dataset, the hybrid reduced errors notably over 2D-only (e.g., ~10–34% reductions across metrics), showing scalability.
  • Runtime: The distilled student runs interactively at ~11 FPS with ~1.4 s latency on a remote H100—good enough for controlled tasks, though not instant. 🍞 Anchor: Near image borders, where 2D lines get cut, the hybrid still keeps a full, correct hand—like finishing a puzzle even when a corner piece is smudged.

🍞 Hook: What about real people using it?

🥬 User Study Results:

  • Tasks: “Push the green button,” “Open the jar,” “Turn the steering wheel.”
  • Baseline: Text-only prompting (no tracked hand control) averaged ~3.0% success—like guessing.
  • Ours: With tracked hands, success jumped to ~71.2%—a big leap from “the AI will try” to “you can actually do it.”
  • Perceived control (7-point scale): 1.74 (baseline) vs 4.21 (ours). Users felt much more in charge.
  • Takeaway: Fine hand control is not a luxury—it’s the difference between watching and doing. 🍞 Anchor: When told to push a button, users with hand control pushed it; without, the model often missed or pressed the wrong spot.

🍞 Hook: Any surprises?

🥬 Surprising Findings:

  • 2D-only skeletons were already strong (they anchor pixels well), but hybrid 2D–3D handled occlusions and edges more robustly.
  • Complex conditioning methods like cross-attention and AdaLN underperformed with limited hand data, while simple token addition was stable and accurate.
  • Camera-only models got viewpoint perfect but fumbled object choice without hand intent. 🍞 Anchor: Simple glue (token addition) bound the signals better than fancy glue when the parts were small and precise.

05Discussion & Limitations

🍞 Hook: Think of this as building a prototype race car—it’s fast and exciting, but it still needs tuning before the big championship.

🥬 Limitations:

  • Latency (~1.4 s) and 11 FPS are noticeable; not yet seamless XR.
  • Long-horizon drift: visuals degrade after a few seconds without resets.
  • No stereo, lower resolution/quality than polished VR engines.
  • Physical realism is learned from data, not guaranteed by a physics engine. 🍞 Anchor: Great for short, guided tasks; less ideal for multi-minute free roaming.

🥬 Required Resources:

  • Hardware: a tracked headset (e.g., Quest 3), a strong GPU (H100 for current numbers), reliable networking for streaming.
  • Data: hand–object datasets with 3D annotations (e.g., HOT3D, GigaHands) to train/improve fidelity.
  • Software: Unity integration, diffusion transformer stack, encoders for hand and camera. 🍞 Anchor: It’s like needing a good kitchen, quality ingredients, and a recipe app to make the dish reliably.

🥬 When NOT to Use:

  • Ultra-low-latency needs (<20 ms), e.g., fast reflex training or competitive esports.
  • High-precision physics or safety-critical simulations (surgery on real patients) where errors are unacceptable.
  • Very long scenes without resets, where drift accumulates. 🍞 Anchor: Use a flight simulator with verified physics for pilot exams; use this for creative prototyping and practice.

🥬 Open Questions:

  • Can we cut latency to imperceptible levels and run on-device?
  • How to keep quality over long horizons (better memory, hybrid AR video anchors)?
  • Add full-body, gaze, and tactile feedback for richer interactivity?
  • Smarter distillation/architectures to boost FPS and stereo rendering? 🍞 Anchor: The next version could feel like magic glasses that never lag and never forget.

06Conclusion & Future Work

🍞 Hook: Picture pulling on smart gloves that make any imagined world react to your real hands and head.

🥬 3-Sentence Summary:

  • This paper presents a human-centric video world model that jointly controls camera (head) and detailed hands using a hybrid 2D–3D conditioning strategy.
  • A simple fusion (token addition) inside a diffusion transformer, plus distillation to a causal model, enables interactive, egocentric video generation.
  • The approach improves both objective metrics and human task success, turning passive generations into active, controllable experiences.

🥬 Main Achievement:

  • Demonstrating that fusing a 2D skeleton image with 3D hand pose parameters, alongside explicit camera control, yields accurate, responsive hand–object interactions in generated video—validated by large gains in user task completion and perceived control.

🥬 Future Directions:

  • Reduce latency and increase FPS; add stereo and higher resolution.
  • Strengthen long-horizon stability; expand to full-body control and richer sensors (gaze, foot placement).
  • Move toward on-headset deployment and broader, real-world training data.

🥬 Why Remember This:

  • It shifts video generation from “tell me a story” to “let me do the story.”
  • Hands become first-class controls for generated worlds, unlocking training, therapy, and creation without hand-built 3D assets.
  • The simple, effective fusion strategy charts a practical path to truly interactive, human-in-the-loop world simulation. 🍞 Anchor: From pressing a virtual button to steering a virtual wheel, your real fingers now write the script of the generated world.

Practical Applications

  • VR training modules where trainees actually grip tools, press buttons, and turn knobs with tracked fingers.
  • Physical therapy sessions that mirror a patient's exact hand motions for graded exercises and feedback.
  • Rapid prototyping of interactive products (e.g., jar lids, steering wheels) without modeling full 3D assets.
  • STEM education labs where students manipulate virtual objects (microscopes, circuits) with fine hand control.
  • Creative media and filmmaking: actors pantomime actions, and the system renders scenes that match their hands.
  • Sports skill practice (e.g., golf grip, racket holds) with immediate visual feedback in generated environments.
  • Assistive training for daily living tasks (opening containers, turning doorknobs) in safe, customizable scenes.
  • Game modding and UGC: creators specify worlds by text and control interactions directly with their hands.
  • Remote guidance: an expert demonstrates precise hand maneuvers while the learner mirrors and sees the result.
  • Industrial safety drills (press the right emergency stop, operate a valve) without building costly simulators.
#generated reality#hand pose conditioning#video diffusion transformer#ControlNet#UmeTrack#Plücker rays#token addition#autoregressive distillation#egocentric video generation#camera control#XR#HOT3D#GigaHands#user study#interactive video