GeoWorld: Geometric World Models
Key Summary
- GeoWorld is a new way for AI to plan several steps into the future by thinking in shapes (geometry) instead of only numbers.
- It moves the AI's thoughts from flat space (Euclidean) into a curved space called hyperbolic space, where hierarchies fit naturally.
- Inside this curved space, the shortest, most meaningful paths (geodesics) guide the AI to make steadier, longer plans.
- A special predictor (H-JEPA) maps video observations into the hyperbolic world so distances match real task structure.
- A learning step called Geometric Reinforcement Learning (GRL) tunes the predictor so lower energy equals better plans, and rollouts follow geodesics.
- During planning, the AI searches for the action sequence that travels the lowest-energy hyperbolic path to the goal (using CEM).
- On two big benchmarks (CrossTask and COIN), GeoWorld beats the strong V-JEPA 2 baseline, especially for longer plans (T=3 and T=4).
- Results show about 3% higher success rate for 3-step plans and 2% higher for 4-step plans on average.
- The main win: adding geometry makes the AI's inner map match the real world's layered structure, so errors don't snowball as fast.
- This could help robots, video assistants, and how-to tools plan reliable multi-step procedures from visual input.
Why This Research Matters
Many real tasks, such as cooking, assembling gadgets, or lab procedures, are multi-step and hierarchical. GeoWorld helps AI plan these steps more reliably by using a space that naturally fits branching futures. That reduces the "snowballing error" that often ruins long plans. It also avoids heavy pixel generation, making planning faster and cleaner. As robots and assistants learn from videos, shaping their inner map with geometry can turn clumsy guidance into confident, step-by-step help. This is a step toward visual planners that are stable, accurate, and useful in everyday life.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're building a LEGO castle. You don't plan just the next brick; you think in steps: first the base, then the towers, then the flags. If your plan doesn't respect the structure (base before towers), the castle wobbles.
The Concept (Latent space representation):
- What it is: A hidden, compact map where AI stores the "important parts" of images and videos without all the pixel details.
- How it works:
- The AI sees a picture or video and turns it into a vector (a list of numbers) called a latent.
- Latents cluster related scenes close together (e.g., all "make a sandwich" steps live near each other).
- The AI uses these latents to predict future steps.
- Why it matters: If the hidden map is messy, the AI gets lost when planning multiple steps ahead. Anchor: It's like a treasure map that shows big landmarks so you can plan a route without getting distracted by every pebble.
Hook: You know how a ball naturally rolls to the lowest spot in a bowl? That's the path of least effort.
The Concept (Energy-based predictive models):
- What it is: A way for AI to score how "compatible" a current situation is with a possible future; lower energy means more likely and better.
- How it works:
- Turn images into latents.
- For a guessed next latent, compute an energy score (lower is better).
- Search for actions that lead to low-energy futures.
- Why it matters: Instead of generating pixels (hard and noisy), the AI just finds a valley in the energy landscape of latents. Anchor: It's like choosing the smoothest sled path downhill instead of trying to draw every snowflake.
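The score-then-search idea above can be sketched in a few lines. This is a minimal illustration, not the paper's model: the quadratic energy and the one-step "add the action to the latent" dynamics are toy assumptions.

```python
import numpy as np

def energy(z_pred, z_goal):
    # Energy = squared distance between predicted and goal latents; lower is better.
    return float(np.sum((z_pred - z_goal) ** 2))

def predict_next(z, action):
    # Toy dynamics: each action shifts the latent by a small displacement.
    return z + action

# Search for the action whose predicted future has the lowest energy.
z_now = np.array([0.0, 0.0])
z_goal = np.array([1.0, 0.0])
actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
best_action = min(actions, key=lambda a: energy(predict_next(z_now, a), z_goal))
```

Here the action pointing toward the goal wins because its predicted latent sits at the bottom of the energy valley.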
Hook: Planning a school play needs a big plan (acts) and smaller plans (lines). Mixing them up causes chaos.
The Concept (Hierarchical planning):
- What it is: Plan at a high level first (the storyline), then fill in low-level details (the lines and stage moves).
- How it works:
- Find a rough path of main steps.
- For each main step, fill in details.
- Keep both levels consistent.
- Why it matters: Without hierarchy, long plans break because tiny mistakes stack up without guidance. Anchor: Like making a sandwich: decide the order (bread → fillings → bread), then place each ingredient carefully.
Hook: Imagine city streets that branch out like a tree: downtown → neighborhoods → blocks → houses. This isn't flat; it grows fast!
The Concept (Geodesics):
- What it is: The shortest, most meaningful paths in curved spaces.
- How it works:
- Pick two points in a curved world.
- Trace the curve that keeps you closest to the surface while staying shortest.
- Use that curve as your "best route."
- Why it matters: In planning, geodesics encode efficient, stable progress. Anchor: It's the true shortcut over a hill, not the long walk around.
Hook: Picture squeezing a huge family tree into a circle where the center is the "root" and the edges hold many leaves.
The Concept (Poincaré ball model):
- What it is: A way to draw hyperbolic (negatively curved) space inside a ball so hierarchies fit naturally.
- How it works:
- Map the root near the center.
- Push more detailed nodes outward.
- Distances stretch near the boundary, so levels separate clearly.
- Why it matters: Hierarchies (like procedures) pack efficiently with less confusion. Anchor: It's a stretchy map where each new level of a plan has room without bumping into others.
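The "distances stretch near the boundary" property can be checked numerically with the standard closed-form distance of the Poincaré ball. Unit curvature is assumed here for simplicity; the paper learns the curvature.

```python
import numpy as np

def poincare_distance(x, y):
    # Closed-form geodesic distance in the Poincare ball (curvature -1).
    x, y = np.asarray(x, float), np.asarray(y, float)
    sq_diff = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x * x)) * (1.0 - np.sum(y * y))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))

# The same Euclidean gap (0.1) costs far more near the boundary than near the
# center, which is what lets hierarchy levels spread out without crowding.
near_center = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_edge = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

Points deep in the ball behave almost like flat space, while points near the rim are pushed far apart, giving each level of a hierarchy its own room.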
The world before GeoWorld: Many visual planners either generated pixels frame by frame (slow, noisy, and short-sighted) or used energy-based models in flat (Euclidean) latent spaces. Flat spaces don't represent tree-like futures well, so tiny errors grow fast as you plan further. Also, most models were trained mostly on one-step data, making them wobble on long rollouts.
The problem: Two big hurdles kept showing up. First, geometric neglect: the hidden map didn't preserve hierarchy and distances in a meaningful way. Second, multi-step shortcoming: plans fell apart over longer horizons because the model wasn't trained or shaped to stay stable.
What people tried: Generative models made pixels or tokens for the next frame and used inverse dynamics to guess actions. But they saw only one step at a time and paid big compute costs to render pixels. Predictive models avoided pixels and learned an energy landscape, which was better, but they learned in flat space and focused on one- or two-step training.
The gap: We needed a representation that respects hierarchy (like a family tree), plus a trainer that encourages multi-step paths to be short, straight, and consistent (geodesics) over many steps.
Real stakes: In daily life, cooking, fixing things, playing sports, or doing science projects all need multi-step plans. If your assistant can't hold a stable long plan, it gives you steps out of order or gets stuck. Making the AI's inner map match real-world structure means instructions that work, robots that don't fumble, and videos that teach reliably.
02 Core Idea
Hook: You know how hanging coats on a straight rod gets crowded fast, but a tree with branches holds many more coats neatly? Space shape matters.
The Concept (GeoWorld):
- What it is: A world model that plans in a curved, hierarchy-friendly space so multi-step visual plans follow the best (geodesic) routes.
- How it works:
- Encode video observations into latents.
- Map those latents into a hyperbolic (curved) space.
- Predict next states along geodesics (the shortest curves in that space).
- Train with a geometric RL trick so low energy means better, longer plans.
- Why it matters: Without geometry, long plans drift; with it, plans align with real task structure and stay stable. Anchor: Picture planning the Replace Memory Chip task by gliding down a curved valley that naturally keeps sub-steps in order.
Aha! Moment in one sentence: If the future branches like a tree, then plan in a space that loves trees (hyperbolic geometry) so the shortest paths are the most stable multi-step plans.
Three analogies:
- Library analogy: Flat shelves (Euclidean) cram books together; a branching bookshelf (hyperbolic) keeps topics neatly separated so you can find the next book in a series easily.
- Road trip analogy: In a mountain region, the straight line on a flat map is silly; the safe, efficient route follows the terrain's curves (geodesic).
- Family tree analogy: Putting a giant family tree on lined paper (flat) overlaps branches; using a circle that expands near the edge (Poincaré ball) keeps generations clear.
Before vs After:
- Before: Energy-based planners in flat space; decent short steps, weak long paths; errors compound and hierarchy is fuzzy.
- After: Energy-based planners in hyperbolic space; geodesics protect long paths; hierarchy is built-in, and GRL gently straightens multi-step rollouts.
Hook: Imagine roads painted right on the hills so even if you're blindfolded, the slope guides you.
The Concept (H-JEPA):
- What it is: A predictor that maps Euclidean latents into a Poincaré ball and learns to step along hyperbolic geodesics.
- How it works:
- Encode frames to Euclidean latents.
- The exponential map projects them into the hyperbolic ball.
- Predict next states so the hyperbolic distance to the ground truth is minimized (teacher-forced and with rollout consistency).
- Why it matters: The predictor respects curvature, so the energy landscape mirrors real hierarchical structure. Anchor: It's like drawing your route on a globe instead of on a flat map, so "shortest" matches reality.
Hook: You know how a rubber band snaps into a straight line when stretched between two pins?
The Concept (Geometric Reinforcement Learning, GRL):
- What it is: A training step that treats lower hyperbolic energy as higher reward and prefers geodesic-consistent rollouts.
- How it works:
- Define reward = negative hyperbolic distance error.
- Maximize cumulative reward over several steps.
- Add triangle-inequality regularization so two small steps don't beat one direct step unless it's truly shorter.
- Why it matters: It sculpts the predictor's value landscape so long plans stay smooth and don't zigzag. Anchor: It's like coaching a relay team to pass the baton in the straightest lane, not weave across the track.
Why it works (intuition, no math):
- Future branches explode like trees; hyperbolic space compresses trees naturally, so nearby steps and far-away goals get the right spacing.
- Geodesics in that space are meaningful "straight" paths; training to follow them makes each added step less likely to drift.
- The triangle inequality is a built-in safety rule that discourages detours.
Building blocks:
- Latent space → Poincaré ball projection → hyperbolic distance.
- Predictor trained on one-step and two-step rollouts.
- GRL tunes the predictor so "low energy = good long plan."
- CEM planner searches for action sequences that ride the low-energy valley to the goal.
03 Methodology
At a high level: Video/goal input → encode to latents → project into hyperbolic space (H-JEPA) → predict next latents along geodesics → GRL refines the multi-step value/energy → plan actions with CEM to reach the goal along lowest-energy paths.
Hook: Imagine you take notes (latents) from a video, then pin them inside a stretchy circle where related notes form neat branches.
The Concept (Hyperbolic projection with the Poincaré ball):
- What it is: A layer that maps Euclidean latents into the hyperbolic ball so distances reflect hierarchy.
- How it works:
- Encode frame xt into a Euclidean latent.
- Use the exponential map to place it in the hyperbolic ball.
- Learn the curvature parameter so the space fits the data.
- Why it matters: If you keep latents flat, branches overlap; curving the space separates levels. Anchor: Like seating students by grade levels in a fan-shaped hall so everyone fits without crowding.
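The exponential map at the origin has a simple closed form. The sketch below fixes c = 0.3 because the paper reports the learned curvature settling near that value; treating it as a constant (rather than a learned parameter) is a simplifying assumption.

```python
import numpy as np

def exp_map_zero(v, c=0.3):
    # Exponential map at the origin of a Poincare ball with curvature -c:
    # sends a Euclidean latent v into the open ball of radius 1/sqrt(c).
    v = np.asarray(v, float)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v
    sqrt_c = np.sqrt(c)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

# Even a large Euclidean latent lands strictly inside the ball.
z = exp_map_zero(np.array([3.0, 4.0]))
```

Because tanh saturates, far-away Euclidean latents get squeezed toward the boundary, which is exactly where hyperbolic distances stretch and hierarchy levels separate.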
Hook: To build a card tower, you place the next card where it best supports the layer above.
The Concept (Action-conditioned prediction in H-JEPA):
- What it is: Given a current hyperbolic latent and an action, predict the next latent by following the geodesic direction.
- How it works:
- Input: a sequence of hyperbolic latents and actions.
- The predictor outputs next-step latents across T steps.
- The loss uses the hyperbolic distance between predicted and true latents.
- Why it matters: Aligning predictions with hyperbolic distances keeps steps consistent with the hierarchy. Anchor: It's like stepping stones placed so each new stone is the shortest safe hop toward the goal.
Training recipe (Supervised Fine-Tuning, SFT):
- Teacher Forcing Loss: one-step accuracy; compare predicted next latent to the true next latent using hyperbolic geodesic distance.
- Rollout Loss: two-step consistency; feed predictions back in and still match the true future.
- Total: a balance λ between the one-step and rollout terms; roughly half/half works well for long-horizon stability.
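The recipe above combines roughly as sketched below. The toy predictor (nudge the latent, then shrink it back into the unit ball) is a placeholder for the real H-JEPA network, and the concrete latents are invented for illustration.

```python
import numpy as np

def poincare_distance(x, y):
    # Geodesic distance in the Poincare ball (curvature -1).
    x, y = np.asarray(x, float), np.asarray(y, float)
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x * x)) * (1.0 - np.sum(y * y))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

def predictor(z, a):
    # Placeholder for H-JEPA: shift the latent by the action, then shrink so
    # the result stays strictly inside the unit ball.
    out = z + 0.1 * a
    return out / (1.0 + np.linalg.norm(out))

def sft_loss(z0, actions, targets, lam=0.5):
    # Teacher forcing: each step is predicted from the *true* previous latent.
    tf = poincare_distance(predictor(z0, actions[0]), targets[0])
    tf += poincare_distance(predictor(targets[0], actions[1]), targets[1])
    # Rollout: the model's own first prediction is fed back in for step two.
    ro = poincare_distance(predictor(predictor(z0, actions[0]), actions[1]),
                           targets[1])
    return lam * tf + (1.0 - lam) * ro

z0 = np.zeros(2)
actions = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
targets = [np.array([0.09, 0.0]), np.array([0.17, 0.0])]
loss = sft_loss(z0, actions, targets)
```

The rollout term is what punishes compounding error: it only scores well if the model stays accurate when consuming its own predictions.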
Hook: When hiking, you prefer a single smooth trail over two awkward zigzags, even if the distances look similar.
The Concept (GRL value shaping + triangle inequality):
- What it is: A planner-friendly tuning where "reward" = negative hyperbolic distance error over multiple steps, plus a rule favoring directness.
- How it works:
- Define the per-step cost as the hyperbolic distance error; reward is its negative.
- Sum discounted rewards over T steps.
- Add triangle-inequality regularization: the direct two-step gap should not exceed the sum of the single-step gaps; encourage geodesic-like rollouts.
- Why it matters: It bends the model's internal roads into smooth highways, reducing detours as T grows. Anchor: Like using a ruler to keep your drawing straight instead of freehand wobbles.
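The discounted reward plus the directness rule can be sketched as follows. The exact penalty form is an assumption: here it is the slack of the triangle inequality over consecutive predictions, which is zero exactly when each intermediate point lies on the geodesic between its neighbors.

```python
import numpy as np

def poincare_distance(x, y):
    # Geodesic distance in the Poincare ball (curvature -1).
    x, y = np.asarray(x, float), np.asarray(y, float)
    sq = np.sum((x - y) ** 2)
    denom = (1.0 - np.sum(x * x)) * (1.0 - np.sum(y * y))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

def grl_objective(pred, true, gamma=0.9, beta=0.1):
    # Reward: negative hyperbolic distance error, discounted over the horizon.
    reward = sum((gamma ** t) * -poincare_distance(p, q)
                 for t, (p, q) in enumerate(zip(pred, true)))
    # Directness penalty: triangle-inequality slack over consecutive triples.
    penalty = 0.0
    for a, b, c in zip(pred, pred[1:], pred[2:]):
        penalty += (poincare_distance(a, b) + poincare_distance(b, c)
                    - poincare_distance(a, c))
    return reward - beta * penalty

# A rollout along a diameter (a geodesic) incurs no penalty; a zigzag does.
straight = [[-0.2, 0.0], [0.0, 0.0], [0.2, 0.0]]
zigzag = [[-0.2, 0.0], [0.0, 0.2], [0.2, 0.0]]
```

Maximizing this objective pulls rollouts onto geodesics: a detour through an off-path midpoint costs both reward and penalty.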
Hook: Picking the best route on a map often means sampling a few options, choosing the most promising, then refining.
The Concept (Energy-based planning with CEM):
- What it is: A search method that samples action sequences, keeps the best (lowest energy), and refines around them.
- How it works:
- Encode the current and goal frames into hyperbolic latents.
- Sample many candidate action sequences.
- Predict where each sequence lands; measure the hyperbolic energy to the goal.
- Keep the best few, update the sampling distribution, and repeat.
- Why it matters: It finds action paths that follow the curved valley of lowest energy to the goal without generating pixels. Anchor: It's like trying several paper airplane folds, keeping the ones that fly farthest, and tweaking them.
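A minimal CEM loop looks like this. The linear dynamics and quadratic energy are toy stand-ins; the real planner would roll out H-JEPA and score with hyperbolic energy to the goal latent.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(z, actions):
    # Toy stand-in for the H-JEPA predictor: each action shifts the latent.
    for a in actions:
        z = z + 0.5 * a
    return z

def energy(z, z_goal):
    return float(np.sum((z - z_goal) ** 2))

def cem_plan(z0, z_goal, horizon=3, pop=64, n_elite=8, iters=10):
    mu = np.zeros((horizon, 2))
    sigma = np.ones((horizon, 2))
    for _ in range(iters):
        # Sample candidate action sequences, score them, keep the elites.
        cand = rng.normal(mu, sigma, size=(pop, horizon, 2))
        scores = np.array([energy(rollout(z0, c), z_goal) for c in cand])
        elites = cand[np.argsort(scores)[:n_elite]]
        # Refit the sampling distribution around the best sequences.
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mu

plan = cem_plan(np.zeros(2), np.array([1.5, 0.0]))
```

Each iteration narrows the sampling distribution around the lowest-energy sequences, so the plan converges toward actions that carry the latent to the goal.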
Concrete example (Replace Memory Chip task):
- Input: current video clip where the device is closed; goal clip where the new chip is installed.
- Encode both into hyperbolic latents.
- The predictor imagines how actions like "take off shell → remove old chip → install new one → close shell" move along the geodesic path to the goal.
- CEM searches for action sequences that keep predicted latents on the lowest-energy curve; the best sequence matches the correct procedure order.
Secret sauce:
- Curvature-aware distances separate levels of a task (preparing vs installing vs finishing).
- GRL's triangle rule keeps rollouts straight and stable.
- No pixel generation: just clean, structured energy minimization in a space shaped like the tasks themselves.
04 Experiments & Results
Hook: Think of a cooking contest where teams follow a recipe from videos. We judge not just whether they finish, but whether each step is correct and in order.
The Concept (The test setup):
- What it is: Multi-step visual planning on CrossTask and COIN, two big collections of how-to videos with labeled steps and timings.
- How it works:
- Give the model a starting observation and a goal (image or video).
- Ask it to predict T actions that reach the goal.
- Score with:
- Success Rate (SR): all steps exactly match (a perfect recipe).
- Mean Accuracy (mAcc): percent of correct steps on average.
- Mean IoU (mIoU): overlap between predicted and true procedure sequences.
- Why it matters: Real procedures are long and precise; scoring must check order and coverage. Anchor: It's like grading a LEGO build: did you make the right model (SR), place most pieces right (mAcc), and follow the plan closely (mIoU)?
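The three scores can be made concrete in plain Python. These are common simplified definitions (the benchmark's exact protocol may differ), and the step names and plans below are invented examples.

```python
def success_rate(preds, trues):
    # SR: fraction of plans where every step matches exactly, in order.
    return sum(p == t for p, t in zip(preds, trues)) / len(trues)

def mean_accuracy(preds, trues):
    # mAcc: per-step accuracy averaged over plans (plans share length T).
    per_plan = [sum(a == b for a, b in zip(p, t)) / len(t)
                for p, t in zip(preds, trues)]
    return sum(per_plan) / len(per_plan)

def mean_iou(preds, trues):
    # mIoU: intersection-over-union of predicted vs. true step sets, averaged.
    per_plan = [len(set(p) & set(t)) / len(set(p) | set(t))
                for p, t in zip(preds, trues)]
    return sum(per_plan) / len(per_plan)

preds = [["open", "remove", "install"], ["open", "install", "close"]]
trues = [["open", "remove", "install"], ["open", "remove", "close"]]
```

Note how the second plan fails SR (one wrong step) yet still earns partial credit under mAcc and mIoU, which is why all three metrics are reported together.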
The competition:
- LLM/VLM planners (e.g., Gemini 2.5 Pro, GPT-5).
- Generative world models (e.g., VideoWorld, PDPP, ActionDiffusion).
- Predictive world models (e.g., V-JEPA 2, E3P, PlaTe).
- Simple baselines (Random, Retrieval-based) for context.
Scoreboard with context:
- Procedural planning (images): GeoWorld consistently beats V-JEPA 2 across model sizes for T=3 and T=4. Think of SR improvements of about +3% at T=3 and +2% at T=4, like nudging from a solid B+ to an A- when everyone else hovers around B.
- Visual planning with videos: Again, GeoWorld rises above V-JEPA 2 and competes well with strong VLMs, especially as T increases. The ViT-g version reaches the best overall results, reflecting how geometry helps as tasks stretch longer.
- Long-horizon stability: As T goes from 3 to 6 (and even to 8 in extended tests), many methods lose accuracy quickly (the error snowballs). GeoWorld's hyperbolic mapping and GRL reduce this drift, keeping SR notably higher at large T.
Surprising findings:
- The learnable curvature tends to settle around a moderate value (e.g., c ≈ 0.3), not an extreme one. This flattens the space just enough for stability but keeps the hierarchy benefits.
- Enforcing the triangle inequality in GRL gives steady gains; it's a simple geometric rule with a big practical payoff.
- Even with a frozen encoder (plus a small hyperbolic layer), GeoWorld earns strong improvements; fully fine-tuning adds only modest extra boosts, suggesting the core geometric change is doing the heavy lifting.
Why the numbers matter:
- A few percentage points in SR at T=4 can be the difference between a device being fixed correctly vs. steps out of order.
- Higher mIoU indicates the whole plan lines up closely with the ground truth, not just isolated steps.
- Consistent wins across CrossTask and COIN show broad usefulness across many kinds of everyday procedures.
05 Discussion & Limitations
Hook: Even great hiking maps don't replace good shoes and daylight; tools help, but they have limits.
The Concept (Limitations):
- What it is: Honest places where GeoWorld is not magic.
- How it works:
- It needs good visual data; confusing or very noisy videos can still trip it up.
- Geometry helps hierarchy, but it doesn't add missing facts; if steps are ambiguous, the plan may still wander.
- It is currently focused on visual procedures; full embodied control (e.g., robot torques) needs extra modules.
- Why it matters: Knowing the edges prevents overpromising and guides future improvements. Anchor: Even with a great map, a foggy day can slow you down.
The Concept (Required resources):
- What it is: What you need to train and run it well.
- How it works:
- Large video encoders (e.g., V-JEPA 2 backbones) and a ~300M-parameter predictor.
- Multi-GPU training (the paper used H100s) for practical training times.
- Inference is light enough for a single high-end GPU.
- Why it matters: Planning budgets and hardware shape real deployments. Anchor: It's like needing a good kitchen to prep a feast, while serving plates only need a small counter.
The Concept (When not to use):
- What it is: Situations where simpler or different tools may win.
- How it works:
- One-step, reactive control without hierarchy: a simpler Euclidean predictor might suffice.
- Text-only planning: language-first tools may be enough.
- Pixel-perfect video generation needs: use a generative model.
- Why it matters: Pick the right wrench for the bolt. Anchor: Don't bring a bulldozer to plant a flower.
The Concept (Open questions):
- What it is: The next puzzles to solve.
- How it works:
- Multi-level sub-task hierarchies (explicit high- and low-level controllers) inside the same hyperbolic space.
- Richer action spaces for robots (continuous control) and real-time feedback.
- Mixing language goals with hyperbolic planning (vision-language-action in curved spaces).
- Automatic curvature schedules that adapt per task.
- Why it matters: Answering these unlocks broader, more reliable planning. Anchor: Today's sturdy treehouse needs stairs and handrails to welcome more kids safely.
06 Conclusion & Future Work
Three-sentence summary: GeoWorld is a geometric world model that moves planning from a flat latent space into a hyperbolic one, where geodesics naturally capture hierarchical procedures. A hyperbolic predictor (H-JEPA) and Geometric Reinforcement Learning (GRL) shape the energy landscape so long, multi-step rollouts stay smooth and stable. On CrossTask and COIN, this delivers consistent gains over strong baselines, especially as plans get longer.
Main achievement: Showing that respecting geometry, by mapping latents to a Poincaré ball and optimizing geodesic-consistent rollouts, meaningfully improves long-horizon visual planning.
Future directions: Add explicit multi-level controllers, extend to embodied robotics and continuous control, blend language with curved-space planning, and auto-tune curvature per task and scale.
Why remember this: When your problem branches like a tree, using a space that loves trees makes planning simpler, steadier, and smarter, turning long procedures from wobbly guesses into guided journeys along the right curves.
Practical Applications
- Home repair assistants that watch and suggest the next correct step (e.g., replacing a part) with fewer mistakes.
- Cooking tutors that plan and track multi-step recipes from video, keeping order and timing aligned.
- Assembly-line guidance that predicts stable action sequences for complex product assembly.
- Robot manipulation planning that follows geodesic-consistent paths for long-horizon tasks.
- Sports training tools that analyze a goal clip and plan the sequence of drills to reach similar performance.
- Educational how-to platforms that generate reliable procedure plans from instructional videos.
- AR/VR step-by-step overlays that keep procedures in correct order even across long sequences.
- Video retrieval-by-plan: find videos that match a desired multi-step procedure by comparing latent geodesic paths.
- Quality control in manufacturing: detect step-order deviations by measuring hyperbolic distance to the ideal plan.
- Medical skills training (simulation): plan multi-step procedural practice from curated videos with stable ordering (non-clinical support).