
Geometry-Aware Rotary Position Embedding for Consistent Video World Model

Intermediate
Chendong Xiang, Jiajun Liu, Jintao Zhang et al. · 2/8/2026
arXiv

Key Summary

  • The paper fixes a common problem in video world models: scenes slowly change or “drift” when the camera moves and comes back.
  • The authors show that screen-position clues (pixel x, y, time) clash with true 3D geometry, causing drift and hallucinated details.
  • They introduce ViewRope, which puts each patch’s camera-ray direction directly into attention so the model can match the same 3D content across far-apart frames.
  • They add Geometry-Aware Frame-Sparse Attention to pick only the most geometrically relevant past frames, speeding up long videos while keeping memory consistent.
  • A new benchmark, ViewBench, measures loop-closure fidelity—how well the model matches the starting view after rotating away and back—plus geometric drift.
  • On ViewBench, ViewRope lowers Loop Closure Error versus strong baselines (about 4% better than GTA) while keeping visual quality competitive.
  • With sparse attention, ViewRope reduces Loop Closure Error further (about 16% better than a sliding window) and trains faster (around 25% speedup on long sequences).
  • Counterfactual tests show the selected frames are truly needed: removing them hurts performance the most.
  • Ablations suggest where to place ViewRope channels (lower-frequency temporal bands work best) and how many past frames to retrieve (k=5 balanced geometry and detail).
  • The method still struggles with drastic scene changes and very large rotations due to frame-rate mismatch and error accumulation, but it opens a promising path to consistent, controllable video generation.

Why This Research Matters

Consistent video world models power smoother VR/AR experiences: when you look back, the scene should look the same, not wobble or morph. Games, simulators, and training tools become more believable because the world stays stable as the camera moves. Robots and drones that rely on camera input can better remember what they’ve seen, improving navigation and inspection. Creative tools benefit too: filmmakers and designers can generate long, controlled camera moves without visual drift. Efficiency gains from geometry-aware sparsity make long, interactive sequences more practical on real hardware. The approach preserves open-domain flexibility instead of locking you into a rigid 3D pipeline. Overall, it’s a step toward AI that understands and respects 3D space over time.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re playing a 3D video game and you look to the left, then to the right, and finally look back to where you started. You expect to see the same doorway or tree in exactly the same place. If it looks different the second time, it feels wrong.

🥬 The Concept (Attention Mechanism): What it is: Attention is the part of an AI model that decides which pieces of information to focus on when making the next prediction. How it works:

  1. The model looks at many tokens (like video patches) at once.
  2. It scores how helpful each token might be.
  3. It focuses more on the high-scoring ones.

Why it matters: Without attention, the model treats everything as equally important and quickly gets lost in long videos.

🍞 Anchor: When the model guesses the next frame in a video, attention helps it focus on patches that matter for predicting what moves or stays the same.
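The three steps above can be sketched as plain scaled dot-product attention. This is a minimal NumPy toy for intuition, not any particular model's implementation:

```python
import numpy as np

def attention(Q, K, V):
    """Minimal scaled dot-product attention over a set of tokens.

    Q, K, V: arrays of shape (num_tokens, dim). Each row is one token
    (e.g., a video patch). Scores say how helpful each token is; the
    softmax turns them into focus weights.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # step 2: score every pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: focus on high scores
    return weights @ V                               # step 3: weighted mix of values

# toy example: 4 tokens with 8 features each
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
out = attention(Q, Q, Q)
print(out.shape)  # (4, 8)
```

Each output row is a weighted mix of the value rows, with more weight on the tokens that scored highest for that query.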

🍞 Hook: You know how real objects don’t hop around when you blink—they stay put in 3D space, even if they look different from a new angle?

🥬 The Concept (3D Geometry): What it is: 3D geometry is the rulebook for how objects stay in place in the real world and how cameras see them from different angles. How it works:

  1. Objects live in 3D (width, height, depth).
  2. A camera looks at them from a certain position and angle.
  3. The 3D scene is projected into a 2D image depending on the camera.

Why it matters: If a video model ignores 3D geometry, it may redraw the same building differently when you come back to it.

🍞 Anchor: A lamppost seen from the left or the right is still the same lamppost; the model must keep that identity consistent across frames.
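The projection in step 3 can be illustrated with a toy pinhole camera. The intrinsics (`focal`, `cx`, `cy`) and the helper name are made up for this example, not taken from the paper:

```python
import numpy as np

def project(point_3d, focal, cx, cy, R, t):
    """Project a 3D world point into pixel coordinates with a pinhole camera.

    R (3x3) and t (3,) move the point into the camera frame; focal, cx, cy
    are simplified intrinsics (square pixels, no distortion).
    """
    p_cam = R @ point_3d + t      # world -> camera frame
    x, y, z = p_cam
    u = focal * x / z + cx        # perspective divide, then shift to pixel grid
    v = focal * y / z + cy
    return u, v

# the same lamppost lands on different pixels as the camera moves
lamppost = np.array([0.0, 0.0, 5.0])
u1, v1 = project(lamppost, focal=500, cx=320, cy=240,
                 R=np.eye(3), t=np.zeros(3))
print(u1, v1)  # 320.0 240.0 (straight ahead -> image center)
```

Rotating `R` or shifting `t` moves the same 3D point to different pixels, which is exactly the mismatch the paper worries about.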

🍞 Hook: Think about how a shadow stretches and shrinks depending on where you stand and which way the light points.

🥬 The Concept (Projective Geometry): What it is: Projective geometry explains how 3D points turn into 2D pixels depending on camera position and angle. How it works:

  1. Each pixel corresponds to a ray shooting out from the camera into the scene.
  2. The same 3D point can land on very different pixels if the camera moves.
  3. Nearby pixels aren’t guaranteed to be the same thing in 3D.

Why it matters: If a model leans on pixel positions alone, it mixes up “close on the screen” with “the same object,” causing confusion.

🍞 Anchor: A statue that starts at the left edge of the image might end up near the center after you rotate back—but it’s still the same statue.

🍞 Hook: Picture using sticky notes on your screen to label where objects are. If you rotate the camera, those sticky notes won’t follow the real objects.

🥬 The Concept (Positional Encoding in Screen Space): What it is: Many video models attach an (x, y, time) tag to each patch to remember where it is on the screen and when it appears. How it works:

  1. Each token gets a position embedding.
  2. Attention uses these to guess which tokens relate.
  3. The model learns patterns like “nearby pixels over time tend to match.”

Why it matters: When the camera moves, nearby pixels may not be the same 3D content, so this bias can be wrong.

🍞 Anchor: Rotating your head changes which pixels show the sofa; a sofa-pixel today might be a window-pixel tomorrow.
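A toy version of such an (x, y, time) tag, using simple sin/cos waves. The frequency base and sizes here are arbitrary illustration choices, not the paper's encoding:

```python
import numpy as np

def screen_space_embedding(x, y, t, dim=12):
    """Toy (x, y, time) positional tag for one patch.

    Splits the embedding into thirds and fills each third with sin/cos
    waves of different frequencies, one third per coordinate. A simplified
    stand-in for the sinusoidal / rotary encodings real models use.
    """
    def waves(pos, d):
        freqs = 1.0 / (100.0 ** (np.arange(d // 2) / (d // 2)))
        return np.concatenate([np.sin(pos * freqs), np.cos(pos * freqs)])
    d = dim // 3
    return np.concatenate([waves(x, d), waves(y, d), waves(t, d)])

emb = screen_space_embedding(x=3, y=7, t=0)
print(emb.shape)  # (12,)
```

Note the tag only knows where the patch sits on the screen and when; nothing in it reflects where the camera is pointing, which is the gap ViewRope fills.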

🍞 Hook: Have you ever returned to a place and sworn the store sign was a different color? Your memory drifted a bit.

🥬 The Concept (Geometric Drift): What it is: Geometric drift is when a model’s idea of the scene slowly warps so that objects don’t match their earlier appearance or layout. How it works:

  1. The model updates frames one after another.
  2. If it relies on screen positions, small misalignments creep in.
  3. Over time, objects “slide” or mutate.

Why it matters: On returning to a previous viewpoint, the model may hallucinate new details or miss old ones.

🍞 Anchor: A red mailbox might come back as a slightly different shape, or even turn into a planter, after a long camera path.

🍞 Hook: Think of walking a loop around your block and ending where you started. You expect the scene to match what you saw first.

🥬 The Concept (Loop Closure): What it is: Loop closure means the video model should reproduce the same view when the camera returns to a previous pose. How it works:

  1. The camera moves away from the start.
  2. After traveling, it returns to the initial pose.
  3. The frame at the end should match the first frame in co-visible areas.

Why it matters: This is the clearest test of a stable, persistent world model.

🍞 Anchor: If a bench is next to a tree in the first frame, it should still be there and look the same when you rotate back.

Before this work, video models were great at making pretty frames and short-term smoothness, but not at keeping rock-solid 3D structure over long camera moves. Researchers tried making the context bigger with external memory or enforcing 3D with special pipelines, but those were heavy or limited. The missing piece was a native, built-in way for the model’s attention to prefer true 3D matches—no matter where pixels land. That’s the gap this paper fills.

02Core Idea

🍞 Hook: Imagine you have a magical compass that doesn’t care where an object sits on your screen; it only cares which direction you’re looking to see it.

🥬 The Concept (Key Insight): What it is: The aha! idea is to encode each patch’s camera-ray direction into attention, so attention matches tokens by 3D viewing angles, not by 2D pixel distance. How it works:

  1. Compute the camera-ray direction for each image patch.
  2. Rotate parts of the query/key features according to that direction.
  3. Attention scores become functions of relative ray directions.
  4. The model naturally retrieves the same 3D content across time.

Why it matters: Without 3D-aware attention, the model confuses different objects that just happen to be near each other on the screen.

🍞 Anchor: When the camera returns to a storefront, its patches align by viewing rays, helping the model redraw the same sign, windows, and bricks.

Three analogies for the same idea:

  1. Flashlight beams: You and your friend each shine a flashlight. If both beams point to the same statue, you’re talking about the same thing—even if you stand in different places. ViewRope makes attention notice matching beam directions.
  2. Radio stations: Tuning to a frequency is like choosing a direction in space. If two radios are tuned to the same station (ray direction), they hear the same song (3D content) even from different rooms (pixels).
  3. Treasure map grid vs. compass: A flat grid (screen pixels) can mislead when the terrain bends. A compass (ray directions) keeps you oriented to the same landmarks, no matter how you walk.

🍞 Hook: You know how before-and-after pictures only match if you look from the same angle?

🥬 The Concept (Before vs. After): What it is: Before, models used pixel positions; after, they use view directions. How it works:

  • Before: Attention favors nearby pixels in screen space, which fall apart when the camera rotates.
  • After: Attention favors similar ray directions, so the same 3D point is easier to find across time.

Why it matters: This shift reduces drift and boosts loop-closure fidelity without building an explicit 3D model.

🍞 Anchor: Panning left and right and back again now reliably brings back the same storefront details rather than invented ones.

🍞 Hook: Think about matching puzzle pieces by shape, not by where they were piled on the table.

🥬 The Concept (Why It Works, Intuition not Math): What it is: Aligning features with ray directions turns “dot products” in attention into measurements of angular similarity between views. How it works:

  1. Each patch feature is partially rotated to face where the camera looks.
  2. When two patches face similar directions, their features line up better.
  3. Attention gives them higher scores.

Why it matters: The model naturally learns to reach back to the correct historical patches that saw the same 3D stuff.

🍞 Anchor: A balcony corner seen on frame 5 and frame 50 gets high attention because both patches “face” it similarly.

🍞 Hook: Imagine a chef’s recipe broken into easy steps.

🥬 The Concept (Building Blocks of the Idea): What it is: The method has two main blocks: ViewRope and Geometry-Aware Frame-Sparse Attention. How it works:

  1. ViewRope: compute per-patch camera rays; rotate parts of Q/K accordingly; attention becomes ray-aware.
  2. Geometry-Aware Frame-Sparse Attention: estimate which past frames are co-visible; keep only top-k; do attention on those.

Why it matters: ViewRope provides the right bias; sparse attention makes it fast and stable over long sequences.

🍞 Anchor: In a long Minecraft-like camera sweep, the model selectively looks back at frames that actually “saw” the same hallway, not just recent ones.

Put together, the core idea is simple but powerful: treat “view direction” as the true position for attention. This small architectural nudge gives the model an internal compass, helping it remember and redraw the same 3D structures even after long, twisty camera paths.

03Methodology

At a high level: First frame + camera poses → encode each patch’s viewing ray (ViewRope) → rotate parts of attention features by those rays → estimate which past frames are geometrically relevant → attend sparsely to them → generate the next frame.

🍞 Hook: Picture each image patch carrying a tiny arrow showing which way the camera is looking through it.

🥬 The Concept (Per-Patch Camera Rays): What it is: For every patch (a small grid cell), compute the unit ray shooting from the camera through that patch. How it works:

  1. Use camera intrinsics (like focal length) and the patch’s pixel center.
  2. Turn that into a 3D direction vector (normalize it).
  3. Combine with camera rotation to get a world-aligned direction.

Why it matters: These rays tell us which 3D content each patch might see.

🍞 Anchor: The patch over a doorway points its ray toward the doorway’s 3D location; the patch over the sky points upward.
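The three steps can be sketched by inverting a toy pinhole model. `patch_ray` and its arguments are hypothetical names for illustration, not the paper's code:

```python
import numpy as np

def patch_ray(u, v, focal, cx, cy, R_cam_to_world):
    """World-space unit ray through a patch center at pixel (u, v).

    Step 1-2: undo the simplified intrinsics to get a camera-frame
    direction. Step 3: rotate into world coordinates and normalize.
    """
    d_cam = np.array([(u - cx) / focal, (v - cy) / focal, 1.0])
    d_world = R_cam_to_world @ d_cam
    return d_world / np.linalg.norm(d_world)

ray = patch_ray(u=320, v=240, focal=500, cx=320, cy=240,
                R_cam_to_world=np.eye(3))
print(ray)  # [0. 0. 1.]  (center pixel looks straight down the optical axis)
```

As the camera rotates, `R_cam_to_world` changes, so the same pixel carries a different world ray, while the same 3D point keeps producing similar rays from nearby poses.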

🍞 Hook: Think of turning a dial so some feature channels now “face” the same direction as the camera ray.

🥬 The Concept (Ray-Rotated Queries/Keys): What it is: Rotate groups of three feature channels in the query and key vectors by the patch’s ray rotation. How it works:

  1. Split Q and K channels into sets of 3 (x, y, z like tiny vectors).
  2. Apply the patch’s 3×3 rotation to those sets.
  3. Keep the rest of the channels unchanged.

Why it matters: After rotation, dot products between Q and K measure relative viewing directions.

🍞 Anchor: Two patches that both face the same statue produce better-aligned features and score higher in attention.
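A minimal sketch of rotating channel triplets, assuming a flat feature vector and a single 3×3 rotation per patch; the paper's exact channel layout may differ:

```python
import numpy as np

def viewrope_rotate(features, R, num_rotated_triplets):
    """Rotate the first few (x, y, z) channel triplets of a Q or K vector.

    features: (dim,) with dim >= 3 * num_rotated_triplets.
    R: the patch's 3x3 ray rotation. Remaining channels pass through
    untouched, preserving the model's learned visual features.
    """
    out = features.copy()
    for i in range(num_rotated_triplets):
        s = 3 * i
        out[s:s + 3] = R @ features[s:s + 3]   # rotate one 3-channel group
    return out

# identity rotation leaves the vector unchanged; a real rotation only
# re-orients the chosen triplets without changing their lengths
R = np.eye(3)
q = np.arange(12, dtype=float)
print(np.allclose(viewrope_rotate(q, R, 2), q))  # True
```

Because rotations preserve dot products, two patches whose triplets are rotated by similar ray rotations keep a high Q·K score, which is the geometric bias the method relies on.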

🍞 Hook: When studying for a test, you don’t reread the whole textbook—you revisit only the most relevant pages.

🥬 The Concept (Geometry-Aware Frame-Sparse Attention): What it is: Only a few past frames really matter—those with co-visible geometry. How it works:

  1. Group tokens by frames (blocks).
  2. Sample a few tokens to cheaply estimate block-to-block affinity using the rotated Q/K features.
  3. Pick the top-k past frames and always include the current frame.
  4. Do attention only over these selected frames.

Why it matters: This cuts compute from “everything with everything” to “only what’s relevant,” keeping long videos fast and stable.

🍞 Anchor: During a rotate-away-rotate-back motion, the method keeps the frames that actually saw the same storefront, skipping those that didn’t.
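The frame-selection steps can be sketched as follows. The sample size, the mean-affinity scoring rule, and the one-hot toy data are illustrative assumptions, not the paper's exact block-affinity estimator:

```python
import numpy as np

def select_topk_frames(q_tokens, cached_k_by_frame, k=5, sample=4):
    """Cheaply score past frames and keep the top-k for attention.

    q_tokens: (n, d) ray-rotated queries for the current frame.
    cached_k_by_frame: list of (n, d) ray-rotated keys, one per past frame.
    A few sampled tokens estimate frame-level affinity (step 2), then the
    highest-scoring frames are kept (step 3).
    """
    rng = np.random.default_rng(0)
    idx = rng.choice(len(q_tokens), size=min(sample, len(q_tokens)),
                     replace=False)
    q_sub = q_tokens[idx]
    scores = [float((q_sub @ K[idx].T).mean()) for K in cached_k_by_frame]
    order = np.argsort(scores)[::-1]            # highest affinity first
    return sorted(order[:min(k, len(scores))].tolist())

# ten cached frames, each "looking" in a distinct direction (one-hot keys)
past = [np.zeros((16, 8)) for _ in range(10)]
for i, K in enumerate(past):
    K[:, i % 8] = 1.0
q = past[3].copy()   # current frame sees the same content as frame 3
picked = select_topk_frames(q, past, k=5)
print(3 in picked)   # True: the co-visible frame is retrieved
```

The key property is that the selector scores blocks with the same ray-rotated features that attention itself uses, so selection and attention agree on what is relevant.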

🍞 Hook: Imagine streaming a cartoon one frame at a time while keeping a smart scrapbook of earlier scenes.

🥬 The Concept (Streaming with KV Cache): What it is: Store keys/values from past frames so the current frame can look back without recomputing everything. How it works:

  1. During training (teacher forcing), the cache uses clean ground-truth frames.
  2. At inference, the cache holds previously generated frames.
  3. Each new step rotates Q/K by ViewRope, estimates top-k frames, and attends to them.

Why it matters: This makes online, causal generation efficient while preserving geometric memory.

🍞 Anchor: The model doesn’t reread all old frames; it pulls just the relevant ones from its organized shelf.
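A toy per-frame key/value cache, sketching the "organized shelf" idea; the class and method names are made up for illustration:

```python
import numpy as np

class FrameKVCache:
    """Toy per-frame key/value cache for streaming generation.

    Each generated frame appends its keys/values once; later steps read
    back only the frames a selector asks for, instead of recomputing
    attention inputs for the whole history.
    """
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def gather(self, frame_ids):
        # stack only the selected frames' tokens for attention
        K = np.concatenate([self.keys[i] for i in frame_ids])
        V = np.concatenate([self.values[i] for i in frame_ids])
        return K, V

cache = FrameKVCache()
for f in range(6):  # six frames, 4 tokens x 8 channels each
    cache.append(np.full((4, 8), f, float), np.full((4, 8), f, float))
K, V = cache.gather([0, 3, 5])   # e.g., the top-k selected frames
print(K.shape)  # (12, 8)
```

In training the cache would hold clean ground-truth frames (teacher forcing); at inference it holds the model's own generations.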

🍞 Hook: Think of learning to ride a bike: you start short and safe, then go longer and faster.

🥬 The Concept (Progressive Training Schedule): What it is: Train in stages to stabilize learning. How it works:

  1. Short-clip teacher forcing: learn the AR interface and caching.
  2. Enable ViewRope on short clips: learn view-conditioned matches.
  3. Turn on sparse attention: learn to retrieve geometry-relevant frames.
  4. Scale to long sequences: practice long-horizon generation efficiently.

Why it matters: Jumping straight to long, sparse contexts can be unstable. Staging makes it reliable.

🍞 Anchor: The model first practices on 17 frames, then graduates to 61 and beyond with strong geometric recall.

Concrete mini-example:

  • Input: First frame shows a brick arch; camera rotates 75° and later rotates back.
  • With ViewRope: Patches near the arch carry rays pointing to the arch’s 3D spot. When the camera returns, those patches find high-affinity matches in earlier frames that saw the same arch, guiding accurate redraw.
  • With Frame-Sparse: The block selector picks a few key past frames that clearly saw the arch, speeding up and stabilizing generation.

Secret sauce:

  • The small act of rotating only parts of Q/K (not all channels) injects just enough geometry to guide attention without overwriting the model’s learned visual features.
  • The frame-level sparsity uses the very same geometry-aware scores, making selection and attention agree on what “matters.”

04Experiments & Results

🍞 Hook: If you claim your compass works, you should test it by walking a loop and seeing if you end where you started.

🥬 The Concept (ViewBench): What it is: A benchmark that stresses camera-controlled, long-horizon consistency with loop closures and full 3-axis rotations. How it works:

  1. Synthetic but photoreal scenes in UE5 (indoors/outdoors).
  2. Rotations along yaw, pitch, and roll, including rotate-away-rotate-back.
  3. Measures visual quality and loop-closure fidelity.

Why it matters: Usual metrics don’t tell you if the model faithfully returns to the same view; ViewBench does.

🍞 Anchor: After a 75° pan away and back, ViewBench checks if the final frame matches the first in co-visible regions.

🍞 Hook: You don’t just ask “is it pretty?”—you also ask “is it the same place?”

🥬 The Concept (Metrics including LCE): What it is: PSNR/SSIM/LPIPS for visual quality and Loop Closure Error (LCE) for memory consistency at the return pose. How it works:

  1. Visual metrics compare generated frames to ground truth.
  2. LCE compares the final return frame to the first frame using a perceptual distance.
  3. Lower LCE means better loop-closure fidelity.

Why it matters: A model can look sharp but still forget what it saw before; LCE catches that.

🍞 Anchor: Two models might both look nice frame-by-frame, but the one with lower LCE truly “remembers” the scene layout.
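A toy loop-closure error along these lines, using plain mean-squared error as a simple stand-in for the perceptual distance the paper actually uses:

```python
import numpy as np

def loop_closure_error(first_frame, return_frame):
    """Toy LCE: distance between the start frame and the return frame.

    Frames: (H, W, 3) float arrays in [0, 1]. Lower is better; 0 means
    the model reproduced the starting view exactly. (The paper compares
    with a perceptual metric; MSE is used here only for illustration.)
    """
    return float(np.mean((first_frame - return_frame) ** 2))

first = np.zeros((8, 8, 3))
perfect_return = first.copy()
drifted_return = first + 0.1      # the scene came back slightly changed
print(loop_closure_error(first, perfect_return))   # 0.0
print(loop_closure_error(first, drifted_return))   # ~0.01
```

The same comparison would be restricted to co-visible regions in practice, since areas never seen before the loop cannot be held to the first frame.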

Baselines and setup:

  • Methods: 3D RoPE (no camera geometry), GTA (relative extrinsics in attention), and the paper’s ViewRope. Also comparisons to the interactive world models Matrix-Game-2 and HY-WorldPlay.
  • Data: Mix of CaM, GF-Minecraft, and ViewBench (balanced 1:1:1).
  • Fairness: Same backbones and training budgets; same channel budgets for positional encodings.

Scoreboard with context:

  • Position encodings (30° and 75° rotations): ViewRope achieves the best loop-closure performance, cutting LCE by roughly 4% vs. GTA, while matching or slightly improving PSNR/SSIM. That’s like getting the top score in “remember the scene” while keeping picture quality at least as good as others.
  • Against state-of-the-art interactive models: ViewRope reduces LCE compared to HY-WorldPlay by about 6.5% at 30°, 7.9% at 45°, and 11.4% at 75°. The advantage grows with bigger camera moves, where geometry matters more.

Sparse attention results:

  • Long sequences (90° and 180° cases): ViewRope with geometry-aware sparse attention outperforms sliding-window attention, reducing LCE by about 16% and keeping PSNR/SSIM competitive. Training on 201-frame sequences runs about 25% faster per iteration with top-k=5.

Surprising and supporting findings:

  • Stability: Naïve sparsity or GTA + sparsity sometimes diverged in training, but ViewRope + sparsity converged reliably. Intuition: ray-rotated Q/K make the block-relevance scores meaningful and steady.
  • Counterfactual tests: Randomly picking past frames worsened LCE by ~25%. Intentionally excluding ViewRope’s selected frames hurt even more (~38%). This shows the selected frames are causally important, not accidental.
  • Attention head specialization: Visualizations show some heads focus on nearby time steps (temporal heads), while geometry-aware heads light up on far-away but co-visible frames (e.g., a strong band during loop closure), guiding the top-k selection.

Ablations:

  • Channel placement: Embedding ViewRope in low-frequency temporal bands performed best, beating alternatives that replaced spatial RoPE or spread ViewRope everywhere.
  • Number of retrieved frames (k): Increasing k can boost visual detail (PSNR/SSIM) but hurts loop-closure beyond k=5 for the trained model, suggesting a balance between rich context and focused geometric recall.

Where it struggles and why:

  • For large angles (90°–180°), ViewRope lagged behind HY-WorldPlay in some settings. Two system-level reasons were identified: (1) Frame-rate resampling in evaluation caused under-rotation relative to training dynamics; (2) Teacher forcing during training led to error accumulation at long horizons during inference. These are orthogonal to ViewRope and could be mitigated with self-forcing or RL-based post-training.

Takeaway: On a diagnostic suited to geometry and memory (ViewBench), ViewRope consistently improves loop-closure consistency with small architectural changes and pairs well with geometry-aware sparsity to stay efficient on long videos.

05Discussion & Limitations

🍞 Hook: Even the best compass can struggle in a maze with walls that keep changing.

🥬 The Concept (Limitations): What it is: Where the method can stumble. How it works:

  1. Drastic scene changes (e.g., moving to a new room) break geometric correspondence; ray-based matching has little to latch onto.
  2. Very large rotations with different frame-rate dynamics than training can cause under/over-rotation and higher LCE.
  3. Autoregressive error accumulation can blur details or shift geometry over very long sequences.

Why it matters: Knowing when not to expect perfect loop closure helps design better systems and datasets.

🍞 Anchor: If you teleport from a street to a forest, the model can’t reuse old views—it must truly redraw from scratch.

Required resources:

  • Calibrated camera poses (intrinsics/extrinsics) to compute rays.
  • A video diffusion transformer backbone with enough capacity to host rotated Q/K channels.
  • GPU memory/time for long-context training, though geometry-aware sparsity helps a lot.

When not to use it:

  • Purely static 2D animations with no camera motion: the geometry-aware benefit is minimal.
  • Scenes with frequent hard cuts or abrupt, unrelated locations: geometric correspondence is rare.
  • Tasks demanding exact metric 3D reconstructions: ViewRope improves consistency but is not an explicit 3D model.

Open questions:

  • Can self-forcing, curriculum schedules, or RL post-training reduce long-horizon error accumulation?
  • How best to blend implicit (ViewRope) and explicit (e.g., point clouds or Gaussians) memories without losing flexibility?
  • Can we learn or adapt camera intrinsics/extrinsics on the fly when calibration is imperfect?
  • What’s the optimal allocation of rotated channels per layer/head for different datasets and motions?
  • Can we generalize geometry-aware sparsity patterns to even cheaper, near-lossless retrieval at scale?

Overall, ViewRope provides a clean, native bias toward 3D consistency inside attention and pairs naturally with geometry-driven sparsity. It’s not a full 3D engine, but it makes open-domain video generators much steadier on long, looping camera paths.

06Conclusion & Future Work

Three-sentence summary: This paper shows that using camera-ray directions as the “true positions” for attention helps video world models keep 3D scenes consistent over long, camera-controlled trajectories. The proposed ViewRope rotates parts of attention features by per-patch viewing directions, making attention scores sensitive to relative rays, and a geometry-aware sparse attention scheme selects only co-visible past frames, boosting both fidelity and speed. A new benchmark, ViewBench, confirms better loop-closure behavior and reduced geometric drift compared to strong baselines.

Main achievement: A simple, native change to attention—injecting patch-level ray geometry—substantially improves loop-closure fidelity without sacrificing open-domain generative flexibility, and it synergizes with geometry-aware sparsity for efficient long-sequence generation.

Future directions: Combine ViewRope with self-forcing or RL-based action post-training to curb error accumulation; explore hybrid implicit–explicit memories; learn adaptive channel allocations for ray rotations; and handle uncertain camera calibration robustly. Extending ViewBench with more diverse motions, lighting, and dynamic objects would sharpen evaluation.

Why remember this: Treating “view as position” gives the model an internal compass. With just this compass and a smart way to look back at only the right frames, we get video generation that remembers places, survives long camera trips, and returns home looking like it should.

Practical Applications

  • VR/AR content generation where users can look around and return to a view without visual drift.
  • Game replay synthesis with camera control that preserves consistent maps and landmarks.
  • Training simulators (e.g., emergency response) needing stable, long camera paths for scenario practice.
  • Cinematic previsualization with complex camera moves that loop back to key shots consistently.
  • Robotics vision simulation where consistent scene memory aids planning and navigation.
  • Virtual tours of real estate or museums with smooth, revisitable points of interest.
  • Sports analysis videos with controllable viewpoints that reliably rediscover the same play moments.
  • Education demos for geometry and physics with accurate, long-horizon camera motion.
  • Creative video tools that maintain object identity across edits and revisits.
  • Resource-efficient long video generation on limited GPUs using geometry-aware sparse attention.
#ViewRope · #geometry-aware attention · #rotary position embedding · #camera rays · #projective geometry · #video world model · #loop closure · #sparse attention · #autoregressive diffusion · #view consistency · #3D consistency · #positional encoding · #KV cache · #ViewBench