DreamWorld: Unified World Modeling in Video Generation
Key Summary
- DreamWorld is a new way to make videos that not only look real but also follow common-sense rules about motion, space, and meaning.
- Instead of learning from just one helper model, it learns from several experts at once: motion (optical flow), 3D space (VGGT), and semantics (DINOv2).
- It trains the video model to predict both pixels and these expert features together, so the model builds a shared 'world sense.'
- To prevent flicker and glitches when mixing many goals, it slowly softens the extra rules during training using Consistent Constraint Annealing (CCA).
- At generation time, it uses Multi-Source Inner-Guidance to nudge the video toward good motion, shape, and meaning based on the model’s own predicted features.
- On VBench, DreamWorld improves the overall score to 80.97, beating a strong baseline (Wan2.1) by 2.26 points, and also leads on VBench 2.0 and WorldScore.
- It reduces weird physics, keeps objects consistent across frames, and improves spatial relationships compared to past methods like VideoJAM.
- This matters for long, coherent, and reliable videos—useful in education, storytelling, robotics, and simulation.
- The approach is efficient to fine-tune (LoRA, 32k videos, 2k steps) and uses common expert features prepared offline.
- Limitations include compute needs and data diversity; future work will make it faster and broaden world knowledge.
Why This Research Matters
DreamWorld moves video generation from pretty frames to believable worlds by making motion, space, and meaning agree over time. This is crucial for education (showing physics right), for robotics (learning from videos that reflect real rules), and for storytelling (consistent characters and scenes). It can help simulators train safer autonomous systems by reducing glitches that mislead decisions. The method stays practical by re-using strong expert models and fine-tuning efficiently. As world models get better, we get closer to AI that can imagine, plan, and explain—not just draw.
Detailed Explanation
01 Background & Problem Definition
You know how when you watch a movie, your brain expects cups not to melt through tables and shadows to move the right way as the sun shifts? Those are your built-in world rules. AI video models have gotten very good at making pretty pictures in motion, but they often forget these simple rules, especially over time. That’s why you sometimes see hands passing through objects or water flowing in odd directions in AI-generated clips.
Before: Video generators focused mainly on making frames look realistic, frame by frame. Think of a very talented painter who paints one picture after another but doesn’t always remember what was in the last one. These models learned to match pixel patterns really well but didn’t deeply grasp how the world actually behaves—how things move, how shapes stay solid, or how stories remain consistent over seconds.
The Problem: When a video needs to keep track of motion (temporal dynamics), 3D shape and layout (spatial geometry), and what things are (semantic meaning), juggling all these at once is hard. Past methods often added just one kind of extra knowledge—like motion alone—or tried to tightly match a big teacher model in one feature space. This helped a little but didn’t create a true, unified world understanding.
Failed Attempts: A popular idea was to align the video model’s inner features with one expert model (Representation Alignment). It’s like telling a student, “Make your notes look like the teacher’s.” VideoREPA softened this matching to focus on relationships between token features (Token Relation Distillation), which worked better for a single expert. But when people tried to add multiple experts at once—one for motion, one for semantics, one for geometry—the student got mixed instructions. The gradients from different teachers pushed in different directions, causing unstable training, flicker, and strange deformations.
The Gap: What was missing was a way to teach a video model several world rules together without the rules fighting each other. Instead of only aligning with an external expert, the model needed to jointly predict not just pixels, but also the expert features themselves, so it could build a single, shared internal sense of the world that ties appearance to motion, space, and meaning.
Real Stakes: Why care? Because consistency makes videos useful. For learning science, kids should see water behave like water. For training robots, the world rules must be reliable over many frames. For film and advertising, characters should stay on-model and move believably. For simulation and planning, temporal and spatial logic matters more than flashy single frames. Without these, videos look cool at first glance but break down the moment your brain checks for common sense.
Now let’s gently introduce the key ideas in the order they connect:
🍞 Hook: You know how a puzzle only shows the full picture when all pieces fit together, not just one piece? 🥬 The Concept: Joint World Modeling Paradigm is a way to train a video model to learn pixels and multiple kinds of world knowledge (motion, space, meaning) at the same time so they fit together.
- How it works: (1) Gather expert features for motion (optical flow), 3D/2D geometry (VGGT), and semantics (DINOv2). (2) Align and compress them so they can sit next to the video’s latent representation. (3) Train the model to predict both the video latents and these features jointly. (4) Use a careful training schedule so none of the experts overpower the others.
- Why it matters: If you only learn one piece, the whole video can still break; learning them together makes the model understand the world better across time. 🍞 Anchor: In a “tilted teacup” scene, the model keeps the cup solid (geometry), the tea flows realistically (motion), and the cup remains a cup (semantics) across the whole video.
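The four steps above can be sketched with toy arrays. Every shape here (channel counts, token grid size) is an illustrative assumption, not the paper's actual configuration:

```python
import numpy as np

# Illustrative shapes only: the real channel counts depend on the VAE
# and on how aggressively each expert feature is compressed.
T, H, W = 8, 4, 4                              # tokens: time x height x width
video_latent  = np.random.randn(T, H, W, 16)   # video latent channels
motion_feat   = np.random.randn(T, H, W, 8)    # optical-flow (motion) prior
spatial_feat  = np.random.randn(T, H, W, 8)    # VGGT (geometry) prior
semantic_feat = np.random.randn(T, H, W, 8)    # DINOv2 (semantics) prior

# Steps (2)-(3): align the experts to the same grid, then concatenate
# along channels so the transformer sees one joint "world" latent and
# can be trained to predict every part of it together.
joint_latent = np.concatenate(
    [video_latent, motion_feat, spatial_feat, semantic_feat], axis=-1
)
print(joint_latent.shape)  # (8, 4, 4, 40)
```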
🍞 Hook: Imagine a coach who starts strict but relaxes as the team gets it right, so players keep their natural style. 🥬 The Concept: Consistent Constraint Annealing (CCA) slowly reduces the strength of extra world-rule losses during training.
- How it works: (1) Start training with strong weights on motion/space/semantic rules. (2) Gradually lower those weights, keeping image quality crisp. (3) End training focused on clean, stable visuals while the world rules are already learned.
- Why it matters: Without this, different rules fight, causing flicker and artifacts; with CCA, learning stays stable. 🍞 Anchor: A balcony sunrise video no longer gets weird highlights or flicker halfway through because the rules were dialed back at the right time.
🍞 Hook: Think of a GPS that uses several satellites; the more signals, the better the path. 🥬 The Concept: Multi-Source Inner-Guidance is a way to steer video generation at test time using the model’s own predicted motion, space, and semantic signals along with the text.
- How it works: (1) The model predicts video latents and world features. (2) It compares “with” vs. “without” each feature to measure how much that feature helps. (3) It adds a weighted nudge from text, motion, semantics, and space to keep the generation on-track.
- Why it matters: Without this, the model might drift; with it, motion stays smooth, shapes stay solid, and prompts are followed. 🍞 Anchor: If the prompt says “girl reading while camera tilts up,” guidance helps the camera move smoothly and keeps her face consistent as the shot continues.
02 Core Idea
🍞 Hook: You know how a conductor keeps the strings, brass, and drums in harmony so the orchestra sounds like one music, not three separate bands? 🥬 The Concept: The “aha!” is to train a video model to predict both the video and multiple expert features together, then gradually relax the extra rule pressure and finally steer the generation with those learned features, producing videos that look great and follow world rules.
- How it works: (1) Build a joint feature stack from motion (optical flow), space (VGGT), and semantics (DINOv2). (2) Concatenate these with the video latents. (3) Expand the model’s input and output projections so it jointly predicts all parts. (4) Use CCA to keep training stable and prevent flicker. (5) At inference, use Multi-Source Inner-Guidance to stay faithful to text, motion, space, and semantics.
- Why it matters: This turns a good painter of frames into a careful world storyteller through time. 🍞 Anchor: In a muddy-puddle scene, boots leave believable footprints (space), splashes move right (motion), and the person stays the same person (semantics) frame after frame.
Three analogies for the same idea:
- Orchestra: Different sections (motion, space, semantics) play in tune, guided by the conductor (CCA and inner guidance).
- Recipe: Flour, eggs, sugar (three expert priors) are mixed into one batter (joint features), baked carefully (CCA), then frosted for taste (inner guidance) so the cake holds shape and tastes right.
- Sports team: Defense (space), midfield (semantics), and offense (motion) train together, then the coach (inner guidance) makes smart calls during the game to keep plays clean.
Before vs. After:
- Before: Pixel-perfect single frames that sometimes break physics, identities, or geometry over time.
- After: Videos that keep identities stable, obey spatial relationships, and move naturally, while still looking sharp and stylish.
Why it works (intuition, not equations):
- Joint prediction ties appearance to rules. When the model must output both pixels and world features, it learns how they relate—like learning both a map and the terrain together.
- CCA prevents rule-fights. By easing rule strength over time, the model keeps what it learned but stops overfitting to any one expert.
- Inner guidance is course-correction. During generation, the model measures how each feature helps and gently adjusts its steps.
Building Blocks (each with a mini sandwich):
- 🍞 Hook: You know how you can blend two colors by mixing a little of one with a little of the other? 🥬 The Concept: Video Diffusion Transformers learn to turn noise into video by taking many small, guided steps.
- How it works: The model starts from noise and repeatedly denoises toward a clean video latent, guided by text.
- Why it matters: This gives flexible, high-quality generation over many frames. 🍞 Anchor: Starting from static snow on a TV, the model gradually reveals a dancing puppy video.
- 🍞 Hook: Imagine a student who learns what things are by spotting patterns again and again. 🥬 The Concept: DINOv2 gives semantic features that tell the model what objects are and how they relate.
- How it works: It turns images into features that cluster similar objects together.
- Why it matters: Without semantics, the model can drift—like a cat turning into a dog across frames. 🍞 Anchor: A yellow frisbee stays a yellow frisbee while a teddy bear remains a teddy bear.
- 🍞 Hook: When building a LEGO city, you must keep streets and buildings in the right places. 🥬 The Concept: 3D Geometry Modeling (VGGT) gives spatial features that keep shapes solid and layouts consistent.
- How it works: It extracts features related to geometry and camera/layout cues.
- Why it matters: Without spatial grounding, hands can pass through cups or heads change size oddly. 🍞 Anchor: A dog’s ears don’t sink into its sweater as it turns—occlusions make sense.
- 🍞 Hook: If you flip through a flipbook, you can see how everything moves from page to page. 🥬 The Concept: Optical Flow tracks pixel motion between frames.
- How it works: It estimates where each pixel moves over time, forming dense motion maps.
- Why it matters: Without motion cues, the video may jitter or move in jerky, unrealistic ways. 🍞 Anchor: When a person tilts a teacup, the liquid flows along expected paths, not randomly.
Simple math snapshots with kid-friendly numbers:
- Mixing two states over time: x_t = (1 − t) · x_0 + t · x_1. For example, if t = 0.5, x_0 = 2, and x_1 = 10, then x_t = 6.
- Total channels after concatenation: C_total = C_video + C_motion + C_sem + C_spatial. For example, 16 + 8 + 8 + 8 = 40.
- Annealing the rule weight: λ(s) = λ_0 · (1 − s / S). For example, if λ_0 = 0.2, s = 1000, and S = 2000, then λ = 0.1.
- Guidance sum factor: w_total = w_text + w_motion + w_sem + w_spatial. For example, with w_text = 5 and the others 1, w_total = 8.
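The snapshots above can be checked with a few lines of Python; all values are the same kid-friendly illustrations, not the paper's settings:

```python
# Quick arithmetic check of the simple math snapshots (illustrative values).

def mix(t, x0, x1):
    """Linear interpolation between two states, as in flow matching."""
    return (1 - t) * x0 + t * x1

def anneal(lmbda0, s, S):
    """Linearly decayed rule weight at training step s out of S."""
    return lmbda0 * (1 - s / S)

print(mix(0.5, 2, 10))          # 6.0  (mixing two states)
print(16 + 8 + 8 + 8)           # 40   (total channels after concatenation)
print(anneal(0.2, 1000, 2000))  # 0.1  (annealed rule weight at half-way)
print(5 + 1 + 1 + 1)            # 8    (guidance sum: text 5, others 1)
```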
03 Methodology
At a high level: Text prompt → Preprocess expert features → Build joint world latent → Joint prediction (video + features) with CCA during training → Multi-Source Inner-Guidance at inference → Final decoded video.
Step 1: Preprocess expert features so they fit together
- What happens: We collect three priors—optical flow for motion, VGGT for spatial geometry, and DINOv2 for semantics. We align them in size and time, standardize their values, and compress channels (e.g., with PCA) so they can be concatenated with the video’s latent.
- Why this exists: Each expert speaks a different “language” (different shapes and scales). Without alignment and compression, the model would be overwhelmed and unstable.
- Example: Imagine three rulers with different units (inches, centimeters, and hand-spans). We convert them to the same unit before measuring and adding them together.
- Simple math: Total channel count: C_total = C_video + C_motion + C_sem + C_spatial. For example, 16 + 8 + 8 + 8 = 40.
Step 2: Build the joint world latent
- What happens: We concatenate the video latent (from a 3D causal VAE) with the world knowledge tensor to form a bigger input for the transformer. The model’s input and output linear layers are expanded to handle these extra channels, with new weights initialized to zero so nothing breaks at the start.
- Why this exists: Zero-init means the pre-trained video model behaves the same at the beginning; the world features start influencing only after learning begins, avoiding sudden harm to image quality.
- Example: Adding new instruments to a band but keeping their volume at zero at first; then slowly bringing the faders up as everyone syncs.
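The zero-init trick can be sketched with numpy. The sizes below are made up for illustration; the point is only that zeroed new columns leave the pre-trained computation untouched:

```python
import numpy as np

# Expanding a pre-trained input projection to accept extra world-feature
# channels. d_model, c_video, c_world are illustrative, not the paper's.
d_model, c_video, c_world = 64, 16, 24   # 24 = three 8-channel priors

rng = np.random.default_rng(0)
W_pretrained = rng.standard_normal((d_model, c_video))  # existing weights

# New columns for the world channels start at exactly zero, so the
# expanded layer initially computes the same output as before.
W_expanded = np.concatenate(
    [W_pretrained, np.zeros((d_model, c_world))], axis=1
)

x_video = rng.standard_normal(c_video)
x_world = rng.standard_normal(c_world)
x_joint = np.concatenate([x_video, x_world])

# Identical output on day one: the pre-trained model is not disturbed,
# and the world features only start to matter once training updates
# the zeroed weights.
print(np.allclose(W_expanded @ x_joint, W_pretrained @ x_video))  # True
```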
Step 3: Train by jointly predicting video and features (the Dream Loss)
- What happens: The model learns to predict both video latents and each expert feature’s “direction of improvement” (its velocity in latent space). The total loss is a weighted sum across video, motion, semantics, and space.
- Why this exists: Joint prediction bonds appearance to rules. If we removed expert heads, the model could ignore motion/space/semantics and just make pretty frames.
- Example with data: For a clip of 81 frames, optical flow is converted to an RGB-like visualization, encoded by the VAE, then aligned in time with DINOv2 and VGGT features. The model takes noisy latents plus these features and predicts cleaned-up versions for all parts.
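A minimal sketch of the weighted-sum structure of the joint objective. The loss values and the shared weight are illustrative placeholders, not the paper's numbers:

```python
# Sketch of the joint ("Dream") training objective: a weighted sum of
# the video loss and the three expert-feature losses. CCA (next step)
# anneals lam toward zero over the course of training.
def dream_loss(l_video, l_motion, l_sem, l_spatial, lam=0.2):
    # lam scales every world-rule term relative to the pixel objective.
    return l_video + lam * (l_motion + l_sem + l_spatial)

# Toy loss values, just to show the bookkeeping.
print(round(dream_loss(1.0, 0.5, 0.3, 0.2), 6))  # 1.2
```

With lam = 0 the objective collapses to plain video prediction, which is exactly the failure mode described above: the model can ignore motion, space, and semantics and just make pretty frames.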
Step 4: Keep training stable with CCA
- What happens: The weights on motion/space/semantic losses start moderately strong, then shrink smoothly to zero as training proceeds. That lets the model absorb the rules early without crushing its natural image quality later.
- Why this exists: Static high weights cause flicker and harsh artifacts; static low weights fail to teach the rules.
- Example: Learning to ride a bike with training wheels that gradually lift off the ground as you get better.
- Simple math: Annealing weight: λ(s) = λ_0 · (1 − s / S). For example, if λ_0 = 0.2, s = 1000, and S = 2000, then λ = 0.1.
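A linear annealing schedule is a minimal sketch of CCA. The starting weight (0.2) and step budget (2,000) echo numbers mentioned elsewhere in this article; the paper's exact schedule shape may differ:

```python
# Linear annealing of the world-rule loss weight: firm at the start,
# zero at the end, so the rules are absorbed early without crushing
# image quality late in training.
def cca_weight(step, total_steps=2000, start_weight=0.2):
    frac = min(step / total_steps, 1.0)
    return start_weight * (1.0 - frac)

for s in (0, 500, 1000, 1500, 2000):
    print(s, cca_weight(s))
```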
Step 5: Generate with Multi-Source Inner-Guidance
- What happens: During inference, the model compares a fully conditioned prediction (text + all features) against versions with one feature masked. The differences tell how each feature helps, and we add weighted nudges (text 5, others 1) to keep the video on-track.
- Why this exists: Even good models can drift while denoising. Small, smart nudges reduce drift without over-constraining creativity.
- Example: In a “girl reading, tilt up” prompt, text guidance ensures the story is followed; motion guidance keeps the tilt smooth; spatial guidance avoids face distortions; semantic guidance maintains her identity.
- Simple math: Sum of guidance scales: w_total = w_text + w_motion + w_sem + w_spatial. For example, with w_text = 5 and the others 1, w_total = 8.
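The compare-and-nudge step can be sketched on toy tensors. Everything random here stands in for real model predictions, and the combination rule is a simplified reading of the description above, not the paper's exact formula:

```python
import numpy as np

# Sketch of Multi-Source Inner-Guidance at one denoising step: compare
# the fully conditioned prediction against predictions with one source
# masked at a time, then add weighted nudges (text 5, others 1).
rng = np.random.default_rng(0)
shape = (4, 4)

pred_full = rng.standard_normal(shape)   # conditioned on text + all features
pred_drop = {                            # one source masked in each
    "text": rng.standard_normal(shape),
    "motion": rng.standard_normal(shape),
    "semantic": rng.standard_normal(shape),
    "spatial": rng.standard_normal(shape),
}
weights = {"text": 5.0, "motion": 1.0, "semantic": 1.0, "spatial": 1.0}

guided = pred_full.copy()
for name, w in weights.items():
    # The gap between "with" and "without" measures how much this
    # source helps; scale that gap and push the prediction along it.
    guided += w * (pred_full - pred_drop[name])

print(guided.shape)  # (4, 4)
```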
The Secret Sauce (why this is clever):
- Zero-init expanded projections let you add powerful world features without wrecking a good pre-trained model on day one.
- Joint prediction creates a strong link between how things look and how they should behave and be identified.
- CCA makes multi-rule learning practical—firm at first, gentle later—preventing rule conflicts from causing flicker.
- Inner guidance uses the model’s own learned signals to steer itself, which is efficient and precise.
Mini Sandwiches for the components used:
- 🍞 Hook: You know how you organize a messy desk by putting things into labeled boxes? 🥬 The Concept: PCA is a way to compress feature channels so they’re smaller but still useful.
- How it works: It finds directions that keep most of the variation in fewer numbers.
- Why it matters: Smaller features mean faster training and less confusion. 🍞 Anchor: Shrinking from 256 channels to 8 while keeping the key patterns.
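That 256-to-8 shrink can be sketched with an SVD-based PCA. The token count is made up, and the offline pipeline could equally use a library PCA; this is just the idea:

```python
import numpy as np

# PCA channel compression: project 256-channel expert features down to
# 8 channels while keeping the directions of largest variation.
rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 256))   # 1000 tokens, 256 channels

centered = features - features.mean(axis=0)
# Right singular vectors of the centered data are the principal
# directions of the channel space.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
compressed = centered @ vt[:8].T              # keep the top 8 directions

print(compressed.shape)  # (1000, 8)
```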
- 🍞 Hook: If you listen carefully, you can hear a melody form from scattered notes. 🥬 The Concept: Flow Matching is a training method where the model learns a velocity field that moves noise toward data smoothly.
- How it works: It teaches the model how to step from “noisy” to “clean” at many time points.
- Why it matters: It’s stable and efficient for high-quality video synthesis. 🍞 Anchor: From visual snow to a clear beach scene, one learned push at a time.
Simple mixing formula to visualize time interpolation:
- x_t = (1 − t) · x_0 + t · x_1. For example, with t = 0.5, x_0 = 2, x_1 = 10, x_t = 6.
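The interpolation above is the straight path that flow matching trains along; a tiny sketch with the same toy numbers (x_0 and x_1 stand in for scalar "noise" and "clean" states):

```python
# Flow-matching view in one dimension: x_t moves on a straight line
# from noise x0 to data x1, and the model learns to predict the
# constant velocity x1 - x0 that carries it there.
def interpolate(t, x0, x1):
    return (1 - t) * x0 + t * x1

x0, x1 = 2.0, 10.0                    # toy "noise" and "clean" states
velocity = x1 - x0                    # the target the model regresses
for t in (0.0, 0.5, 1.0):
    print(t, interpolate(t, x0, x1))  # 2.0 -> 6.0 -> 10.0
```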
04 Experiments & Results
The Test: The authors measured whether videos look good and follow world rules across time and space. They used:
- VBench: Checks quality and semantic consistency across 16 sub-dimensions (like motion smoothness, subject consistency).
- VBench 2.0: A tougher version matching human preferences on complex motion and composition tasks.
- VideoPhy: Focuses on physical commonsense—does the video obey physics?
- WorldScore: Designed to evaluate world simulation, separating static quality from dynamic motion.
The Competition: DreamWorld is compared with Wan2.1 (a strong open model) and VideoJAM (a joint appearance-motion baseline), and also a fine-tuned Wan2.1.
The Scoreboard (with context):
- VBench Overall Score: DreamWorld 80.97 vs. Wan2.1 78.71 (FT) and VideoJAM 78.76. That’s like getting a solid A when others score in the high B range, especially improving spatial relationships and motion.
- VBench 2.0 Total: DreamWorld 52.97 vs. 51.18 (Wan2.1 FT) and 52.33 (VideoJAM). On this harder test, DreamWorld keeps the top spot, showing it balances realism and control.
- VideoPhy: DreamWorld SA 52.9% and PC 26.2%, beating baselines. Think of this as not just looking right but behaving right under physics spot-checks.
- WorldScore: DreamWorld Overall 51.48 vs. 50.95 (Wan2.1 FT) and 49.38 (VideoJAM). It scores well on both static (appearance) and dynamic (motion), proving long-horizon consistency.
Surprising Findings:
- Adding physics-aware priors did not hurt aesthetics; it actually improved quality scores by cleaning up artifacts.
- Temporal guidance was especially important: removing it led to the biggest quality drops, showing motion priors are key for smooth, believable videos.
- The best trade-off for rule strength came at a moderate loss weight (around 0.2). Heavier weights harmed visual fidelity; lighter weights didn’t teach the rules well enough.
Ablations (what parts matter):
- World features: Geometry alone helped a bit; geometry + semantics helped more; adding motion gave the best results. All three together form the strongest world sense.
- CCA: Without it, scenes showed flicker and abnormal highlights; with it, videos became smoother and cleaner.
- Inner guidance: Removing any guidance (text, motion, semantics, space) hurt performance; motion and text were the most critical.
Takeaway: The numbers match what your eyes see: steadier motion, solid shapes, and consistent objects over time—not just pretty single frames.
05 Discussion & Limitations
Limitations:
- Compute and data diversity: The approach still needs decent GPUs and benefits from varied, curated data (e.g., WISA), so generalizing to all edge cases may require more sources and time.
- Expert dependence: Quality depends on the chosen experts (DINOv2, VGGT, optical flow). If these are weak in certain scenes (e.g., rare materials), guidance weakens.
- Longest horizons: While improved, extremely long or highly interactive sequences may still drift or require memory modules.
Required Resources:
- A pre-trained video diffusion transformer (e.g., Wan2.1-T2V-1.3B), 3D causal VAE for latents, and offline extraction with RAFT/DINOv2/VGGT.
- Training with LoRA on roughly tens of thousands of videos (they used 32k) and a handful of GPUs, for a few thousand steps (they used 2k).
When NOT to Use:
- If you only need short, stylized clips where physics doesn’t matter (e.g., abstract art), the extra complexity may be unnecessary.
- If you cannot preprocess expert features due to latency or storage limits, a lighter method could be preferable.
- If your domain needs different priors (e.g., medical imaging), these particular experts may not transfer without adaptation.
Open Questions:
- Can we replace offline experts with smaller, learned heads trained end-to-end to reduce preprocessing cost?
- How do we scale to much longer videos without drift—do we need explicit memory or planning modules?
- Can we auto-learn the best annealing schedule per scene or per prompt, not one-size-fits-all?
- How do we integrate audio, depth sensors, or language feedback loops for richer world models?
- Can this approach help interactive agents plan actions, not just generate videos?
06 Conclusion & Future Work
Three-Sentence Summary: DreamWorld teaches a video model to predict both pixels and multi-source world features (motion, space, semantics) at the same time, building a unified world sense. It keeps training stable with Consistent Constraint Annealing and keeps generation on-track with Multi-Source Inner-Guidance. The result is cleaner, steadier, and more believable videos that score higher on world-consistency benchmarks.
Main Achievement: Turning single-frame visual talent into time-consistent world understanding by jointly learning from multiple expert priors without letting them fight each other.
Future Directions: Make expert integration more efficient, broaden to more kinds of world knowledge (e.g., audio-visual, physics simulators), and extend to longer, interactive sequences with memory and planning. Explore adaptive schedules and end-to-end learned priors to cut preprocessing costs.
Why Remember This: It’s a blueprint for going beyond pretty clips toward true world simulators—where motion, space, and meaning agree over time—opening doors for education, robotics, storytelling, and safe simulation at scale.
Practical Applications
- Educational science videos that consistently show correct physics (e.g., gravity, fluids).
- Pre-visualization for films and ads where characters must stay on-model across shots.
- Robotics simulation training with motion and geometry that reflect real-world constraints.
- Virtual prototyping of products where object interactions and materials behave believably.
- Game cutscenes that maintain identity and spatial logic over long sequences.
- Instructional safety videos (e.g., industrial procedures) with clear, consistent motion cues.
- Content creation tools that reduce flicker and spatial artifacts in user-generated videos.
- Data augmentation for vision models with physically consistent synthetic sequences.
- AR/VR scene previews where camera motion, occlusion, and object permanence are reliable.
- Scientific visualization of processes (e.g., fluid flows) with improved temporal coherence.