
MIND: Benchmarking Memory Consistency and Action Control in World Models

Intermediate
Yixuan Ye, Xuanyu Lu, Yuxin Jiang et al. (2/8/2026)
arXiv

Key Summary

  • MIND is a new benchmark that fairly tests two core skills of world models: remembering the world over time (memory consistency) and following controls exactly (action control).
  • It uses 250 high-quality 1080p videos at 24 FPS from Unreal Engine 5, with both first-person and third-person views across eight diverse scene types.
  • Actions are simple (W, A, S, D and camera up/down/left/right), but the speeds and rotation angles vary to test generalization to new action spaces.
  • The benchmark checks long-context memory, scene consistency when retracing paths, action accuracy via 3D pose recovery, plus visual aesthetics and image quality.
  • MIND introduces MIND-World, a simple and strong Video-to-World baseline that injects actions into timestep embeddings and distills a fast, causal student for streaming generation.
  • In first-person, results are mixed across metrics, but in third-person MIND-World clearly outperforms Matrix-Game 2.0 on memory and action accuracy.
  • Having context memory usually improves long-run stability, but can hurt when the action space at test time differs from training.
  • Key open challenges include long-horizon memory, precise action control, cross-action space generalization, and robust third-person character handling.
  • MIND offers a unified, closed-loop, revisit-based setup that the field was missing, making comparisons across models and views much more meaningful.

Why This Research Matters

World models are moving from ā€˜pretty videos’ to ā€˜useful simulators’ that must remember the world and obey controls. MIND gives a fair, unified way to test these skills across many scenes and in both first- and third-person views. This helps builders of self-driving cars, home robots, and AR/VR systems see what actually works over long horizons. It also reveals hidden problems, such as memory tied to old motion settings confusing new action scales. With MIND, progress becomes measurable and comparable, speeding up real-world readiness. In short, it pushes models to act like trustworthy worlds, not just good-looking movies.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you’re filming a treasure hunt with a GoPro on your head (first-person) while a friend films you from the side (third-person). If you walk back to the same tree, both cameras should show the same tree in the same spot. And if you press the joystick to go left, you should really go left.

🄬 Filling: Before this paper, video AIs got really good at making pretty, realistic videos, but they weren’t always good at remembering what they saw earlier or at following controls exactly over long horizons. A reliable world model needs both: strong long-term memory (so the world stays the same when revisited) and precise action control (so the model moves where and how you tell it). Without these, the video may look nice, but the world won’t behave consistently, making it a poor fit for driving sims, robots, or games.

šŸž Anchor: If a model sees a red door, walks around the block, and comes back, it should show the same red door in the same place—and if you command ā€œturn right,ā€ it shouldn’t veer left.

The world before: Video generation models (like modern diffusion models) made big strides in visual realism and short-term smoothness. Benchmarks like VBench checked things like human fidelity and physical plausibility. But many tests were open-loop (no feedback from actions), focused on a single perspective (often first-person), and mostly graded looks, not memory or control.

The problem: Researchers lacked a unified, closed-loop benchmark that measured two bedrock abilities across different views: remembering the world over long stretches and obeying action commands precisely—even when those actions use different step sizes or rotation speeds than the model saw during training. In everyday terms: can the model keep the map in its head straight, and can it drive the car correctly regardless of slight differences in pedal sensitivity?

Failed attempts: Several benchmarks rated trajectory quality or physics adherence, but often ignored long-context revisits (coming back to the same place) and action generalization (same buttons, different speed/angle). Some memory tests existed but were locked to narrow worlds (like only Minecraft) and agent loops that didn’t match human behavior. Many stayed first-person only, hiding problems with character motion and camera–actor relationships visible in third-person.

The gap: We needed a benchmark that is open-domain (many scene types), closed-loop (actions drive what you see next), supports both first- and third-person, and directly measures long-context memory and action control—including generalization across different action scales.

Real stakes: This matters for real life because cars, drones, home robots, and AR/VR assistants must remember where things are and follow controls exactly. If a robot forgets a table it just saw, it’ll bump into it. If an autonomous car drifts when you tell it to turn slightly, that’s unsafe. For games and movies, stable worlds that react to the player’s inputs make experiences believable.

Now, let’s introduce the key ideas in the right order.

šŸž Hook: You know how you remember where your backpack is in your room, even if you look away for a while?

🄬 The concept (Memory Consistency): Memory consistency is a model’s ability to keep the same objects, layouts, and textures stable over time, even after moving around and coming back. How it works: (1) Watch a chunk of video to build a mental map, (2) get action commands, (3) predict new frames, (4) when revisiting a place, match what was seen before. Why it matters: Without it, doors change color, buildings shift, and the world feels broken when you loop back.

šŸž Anchor: If you walk past a blue mailbox, take a lap around the block, and return, the mailbox should still be blue and in the same spot.

šŸž Hook: Imagine pressing a game controller button and expecting your player to move exactly the way you pressed.

🄬 The concept (Action Control): Action control is how well a model follows the commands you give—like moving forward or turning. How it works: (1) Receive an action (e.g., ā€˜W’ for forward, ā€˜ā†ā€™ to turn left), (2) translate it into movement/rotation, (3) update position/orientation, (4) generate the matching next frame. Why it matters: If actions aren’t followed, you can’t navigate or interact reliably.

šŸž Anchor: Press ā€˜W’ to walk forward toward a door; the door should get closer, not drift sideways.

With these two basics, researchers can finally test world models in meaningful, real-world-like settings.

02 Core Idea

šŸž Hook: Think of a driving test that checks not just if your car looks shiny, but also if you can follow directions and remember the route.

🄬 The concept (MIND Benchmark): MIND is a closed-loop, open-domain benchmark that scores two core skills—memory consistency and action control—from both first-person and third-person views. How it works: (1) Provide a memory clip (context), (2) feed a sequence of actions, (3) generate future frames, (4) check: does the model keep the world stable when revisited (memory) and does it move/turn as commanded (control), (5) also test generalization by changing movement/turn sizes. Why it matters: Without a fair, unified test like this, we can’t see what’s truly working in world models.

šŸž Anchor: It’s like taking the same walk twice—once forward and once backward—and checking if the same buildings line up each time, and whether each button press caused the expected step.

The ā€œAha!ā€ moment: One sentence—If you don’t test both remembering the world and following actions together, you can’t trust a world model to act like a real, stable world.

Three analogies:

  1. School map test: First learn the classroom map, then walk around; when you come back to your desk, it should still be there (memory), and you must take the right turns to get there (control).
  2. GPS + driver: The GPS holds the map (memory) and the driver follows the steering cues (control). Both must work for a safe trip.
  3. Stage play: The set stays the same each night (memory), and actors hit their marks on cue (control). If either fails, the show falls apart.

Before vs after:

  • Before: Benchmarks focused on looks or physics snippets, often first-person only, rarely checked revisits or varied action scales.
  • After: MIND standardizes first- and third-person testing, closed-loop action-following, long-context revisits, and cross-action-space checks, giving a clearer, fairer picture.

Why it works (intuition, not equations):

  • Closed-loop design forces the model to connect actions to next frames, not just guess a pretty video.
  • Revisit paths reveal whether the model kept a consistent internal map.
  • Symmetric motion paths (forward then reverse) expose geometric drift: if A then A-in-reverse doesn’t land you back at start, something’s off.
  • Testing multiple action scales checks whether the model learned the rules of motion, not just memorized one joystick sensitivity.

Building blocks:

  • Diverse videos: 250 clips at 1080p/24 FPS across eight scene families (landscape, urban, interior, sci-fi, stylized, ancient, industrial, aquatic), both first- and third-person.
  • Simple, universal actions: W, A, S, D for movement; arrows for camera pitch/yaw.
  • Action space generalization: vary movement increments (āˆ†p) and rotation steps (āˆ†r) to simulate different speeds/turn rates.
  • Memory revisit protocol: watch context, act, then loop back to earlier views to check consistency.
  • Metrics: long-context memory MSE, generated scene consistency via symmetric paths, action accuracy via 3D pose recovery and Sim(3) alignment, plus aesthetics and image quality.

Extra key concepts, in sandwich style:

šŸž Hook: You know how a flipbook looks smooth when each picture changes only a little?

🄬 The concept (Temporal Consistency): Temporal consistency means neighboring frames change smoothly without flicker or sudden jumps. How it works: (1) Use past frames as context, (2) generate small, consistent changes, (3) avoid abrupt structure or color shifts. Why it matters: Flicker breaks realism and makes navigation cues unreliable.

šŸž Anchor: A smooth pan of a city street shouldn’t stutter or make buildings wobble from frame to frame.

šŸž Hook: Think of a backpack where you keep the most important clues from earlier in the day.

🄬 The concept (Context Memory): Context memory is a small window of past frames kept to guide future generation. How it works: (1) Cache recent frames, (2) condition new frames on this cache and actions, (3) update cache as you go. Why it matters: Without it, the model forgets details and drifts.

šŸž Anchor: Remembering the last 25 frames helps the model keep the same wall poster the same size and color as you pass by again.

Finally, MIND adds a baseline model:

šŸž Hook: Picture a simple, reliable starter car you can use to road-test the new highway.

🄬 The concept (MIND-World Baseline): MIND-World is a streamlined Video-to-World model that injects actions into timestep embeddings and distills a fast, causal student for streaming. How it works: (1) Train a teacher with action conditioning, (2) initialize a student from the teacher’s ODE trajectories, (3) distill into a few-step autoregressive model via self-forcing DMD, (4) keep a context cache for long runs. Why it matters: A clean, open baseline makes comparisons and progress faster.

šŸž Anchor: It’s like a test car with cruise control (action following) and a good dashboard memory (context cache) so you can measure the road (the benchmark) accurately.

03 Methodology

At a high level: Input (context video + action sequence) → Memory setup (cache context) → Action-conditioned frame generation (autoregressive) → Revisit checks and path symmetry tests → Metric computation (memory, consistency, action accuracy, quality) → Output (scores per dimension).

Step-by-step details:

  1. Data construction in Unreal Engine 5
  • What happens: The team built eight scene categories (e.g., landscape, urban, interior, sci-fi, stylized, ancient, industrial, aquatic), each with multiple environments. Volunteers performed scripted and free-form actions. Each video is 1080p/24 FPS with frame-level action logs, and both first-person and third-person views are captured.
  • Why it exists: Open-domain variety prevents overfitting to a single game style and exposes real generalization needs (e.g., characters, camera, obstacles).
  • Example: In an urban scene, the actor walks forward (W) toward a crosswalk while the camera slightly yaws left (←) for 24 frames, then reverses.
  2. Action space definition
  • What happens: Actions are a universal set: W, A, S, D for movement; ↑, ↓, ←, → for camera pitch/yaw. Movement changes position by āˆ†p in a given direction; rotations change orientation by āˆ†r around a specified axis.
  • Why it exists: A shared, simple action language makes cross-model comparisons fair.
  • Example: Pressing W once might move forward āˆ†p units; pressing ← rotates yaw by āˆ†r degrees left over 24 frames.
  3. Action space generalization
  • What happens: The same buttons can have different step sizes: small (āˆ†p=100, āˆ†r=0.4°), medium (e.g., āˆ†p=150, āˆ†r=0.7°), large (āˆ†p=280, āˆ†r=1.4°). Both first- and third-person clips include varied settings.
  • Why it exists: Real systems face slightly different joystick sensitivities or camera speeds; models should adapt without retraining.
  • Example: In one clip, ā€˜W’ advances 150 cm/s; in another, 250 cm/s. A generalizing model keeps the world coherent and matches the motion scale.
  4. Memory and revisit protocol
  • What happens: A memory segment M is observed first. Then, given an action sequence A, the model predicts frames VĢ‚. Some action sequences loop back to a previously seen view (a revisiting trajectory), letting the benchmark compare the revisited prediction to the original ground truth.
  • Why it exists: Revisits directly test whether the model kept a stable internal map of objects, layout, and texture.
  • Example: See a red car near a blue shopfront; walk away, then return by the mirrored path. The red car and shopfront should match the earlier frames.
  5. Generated scene consistency via symmetric paths
  • What happens: Ten symmetric motion paths are provided (e.g., forward vs backward, left vs right, turn left vs turn right), each lasting 24 frames. The model is run forward and then with the mirrored sequence; the two predictions should align.
  • Why it exists: If geometric consistency holds, reversing actions should retrace the scene.
  • Example: Move left for 24 frames, then right for 24. If you don’t end up aligned with the start view, the model’s internal geometry drifted.
  6. Action accuracy via 3D pose recovery and alignment
  • What happens: From generated videos, camera poses are recovered using ViPE, then aligned to ground truth with Sim(3) Umeyama alignment to fix scale/coordinate mismatches. Translational and rotational relative pose errors are measured.
  • Why it exists: This decouples action accuracy from the model’s hidden speed settings; it checks whether the final motion matches the command sequence in 3D.
  • Example: If the script says yaw left 0.7° per step, but the recovered pose shows only 0.3°, rotational error is high.
  7. Visual quality checks (aesthetics and imaging)
  • What happens: LAION’s aesthetic predictor provides an attractiveness score; MUSIQ measures perceptual fidelity (sharpness, artifact levels).
  • Why it exists: A model should be both consistent and pleasant/clear to watch.
  • Example: A sci-fi corridor with moody lighting should still be crisp, without odd banding or blur.
  8. Baseline model: MIND-World
  • What happens: Start from SkyReels-V2-I2V-1.3B; inject actions directly into timestep embeddings (simpler than heavy action modules). Train a bidirectional, action-conditioned teacher. Initialize a few-step autoregressive student from teacher ODE trajectories. Distill with DMD-based Self-Forcing so the student learns to condition on its own prior outputs. Use a small local attention window (e.g., 25 frames) and a context cache for long sequences.
  • Why it exists: Provide a transparent, efficient Video-to-World baseline that others can beat or improve.
  • Example: During inference, with a memory window of recent frames and incoming actions, the model streams new frames at low latency.
  9. Putting it together: scoring
  • What happens: For each clip, the model observes context, receives actions, predicts frames, runs revisit and symmetry checks, recovers poses, and gets MSE-based memory/consistency scores plus pose errors and quality metrics.
  • Why it exists: A unified suite makes apples-to-apples model comparisons possible.
  • Example: On a third-person industrial scene, measure long-context memory MSE, generated scene consistency MSE, translational/rotational RPE, and aesthetics/MUSIQ.
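The pose-alignment step above uses the closed-form Umeyama solution for a Sim(3) fit (scale + rotation + translation). A minimal numpy sketch, assuming recovered and ground-truth camera positions are given as matched 3D point lists; the function name is ours, not the benchmark's:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form Sim(3) fit (Umeyama, 1991): find scale s, rotation R,
    and translation t minimizing ||dst - (s * R @ src + t)||^2.
    Removes scale/coordinate mismatches before pose errors are measured."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    s_c, d_c = src - mu_s, dst - mu_d
    cov = d_c.T @ s_c / len(src)          # cross-covariance (dst x src)
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[-1, -1] = -1                    # guard against a reflection solution
    R = U @ D @ Vt
    var_src = (s_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t

# Sanity check: recover a known pure scaling by 2.
pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], float)
s, R, t = umeyama_sim3(pts, 2 * pts)
```

After alignment, translational and rotational relative pose errors can be computed on the aligned trajectory without being confounded by the model's hidden speed or coordinate conventions.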

The secret sauce:

  • Closed-loop revisits and symmetric paths expose failures you can’t see in open-loop, pretty-only videos.
  • Cross-view (first- and third-person) coverage reveals issues with character control and camera–actor interactions.
  • Action space generalization forces the model to learn motion rules, not memorize one control sensitivity.
  • Simple, shared actions and standardized metrics lower friction for broad adoption.

Sandwich spotlight on two more core ideas:

šŸž Hook: Imagine returning a borrowed book to the exact same shelf spot.

🄬 The concept (Generated Scene Consistency): It’s the idea that running an action and then its mirror (e.g., left then right) should bring you back to matching frames. How it works: (1) Run forward path, (2) run mirrored path, (3) compare frame-by-frame. Why it matters: If mirrored paths don’t align, your internal map is drifting.

šŸž Anchor: Walk three steps north, then three steps south—if you aren’t back where you started, your compass is off.

šŸž Hook: If two kids press the same ā€˜jump’ button on different controllers, the jumps should still look right even if one controller is a bit more sensitive.

🄬 The concept (Cross-Action Space Generalization): It means the model still follows the rules when step sizes and turn angles change. How it works: (1) Train on one āˆ†p, āˆ†r, (2) test on different āˆ†p, āˆ†r, (3) check if motions scale correctly. Why it matters: Real systems meet varied devices and speeds.

šŸž Anchor: If ā€˜turn left’ changes from 0.7° to 1.4° per tap, the curve should be wider, but still correct and stable.

04 Experiments & Results

The test: Evaluate on MIND across first- and third-person clips. Measure long-context memory MSE (lower is better), generated scene consistency MSE via symmetric paths (lower is better), action space generalization MSE across new āˆ†p/āˆ†r (lower is better), visual attractiveness (LAION aesthetics, higher is better), perceptual image quality (MUSIQ, higher is better), and action accuracy as translational/rotational relative pose errors (lower is better) after ViPE + Sim(3) alignment.
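The translational part of the relative pose error can be sketched as a comparison of consecutive displacements. This is an illustrative sketch under stated assumptions: it takes camera positions that have already been recovered (ViPE) and aligned (Sim(3)); the benchmark also measures a rotational counterpart not shown here.

```python
import numpy as np

def translational_rpe(gt_positions, est_positions):
    """Mean distance between consecutive displacement vectors of the
    ground-truth and estimated camera paths (lower is better).
    Comparing steps rather than absolute positions keeps early drift
    from dominating the score."""
    gt = np.diff(np.asarray(gt_positions, float), axis=0)
    est = np.diff(np.asarray(est_positions, float), axis=0)
    return float(np.mean(np.linalg.norm(gt - est, axis=1)))

# Identical trajectories give zero relative error.
path = [[0, 0, 0], [1, 0, 0], [2, 0, 0]]
assert translational_rpe(path, path) == 0.0
```

In the tables below, these pose errors are the ā€œaction accuracy (RPE)ā€ numbers reported per view and per mode.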

Competition: Compare MIND-World to Matrix-Game 2.0. Test in two modes: without context memory (image-to-world cold start) and with context memory (video-to-world), to see how memory affects performance.

Scoreboard with context:

  • First-person, image-to-world:
    • Long-context memory: MIND-World 0.1091 vs Matrix-Game 2.0 0.1188 (lower is better; MIND-World earns a slightly better memory score).
    • Generated scene consistency: 0.0359 (MIND-World) vs 0.0306 (Matrix-Game 2.0); here Matrix-Game 2.0 edges out on symmetry consistency.
    • Action space generalization: 0.1200 (MIND-World) vs 0.1084 (Matrix-Game 2.0); Matrix-Game 2.0 generalizes a bit better in this first-person cold start.
    • Aesthetics: 0.4583 vs 0.4302; MIND-World looks nicer on average.
    • MUSIQ: 0.5655 vs 0.5180; MIND-World is perceptually cleaner.
    • Action accuracy (RPE): translation 0.0356 vs 0.0265 (Matrix-Game 2.0 slightly better), rotation 0.4395 vs 0.6914 (MIND-World clearly better, a big win in turning fidelity).

  • First-person, with context memory (MIND-World only reported): long-context memory improves to 0.1035 and generated scene consistency to 0.0309; aesthetics hold at ~0.4590 and MUSIQ at ~0.5702, but rotational RPE rises to 0.5534, suggesting mixed effects of memory when action spaces mismatch.

  • Third-person, image-to-world:
    • Long-context memory: 0.1066 (MIND-World) vs 0.1404 (Matrix-Game 2.0); MIND-World is significantly better, like moving from a C to a solid B+.
    • Generated scene consistency: 0.0327 vs 0.0372; MIND-World better, showing less geometric drift.
    • Action space generalization: 0.0677 vs 0.0777; MIND-World better, handling new āˆ†p/āˆ†r more robustly.
    • Aesthetics: 0.5204 vs 0.4236; MIND-World looks notably nicer.
    • MUSIQ: 0.5672 vs 0.4857; cleaner, sharper frames from MIND-World.
    • Action accuracy (RPE): translation 0.0271 vs 0.0622; rotation 0.2587 vs 0.9031; MIND-World is far more accurate, especially in turning (a huge margin).

  • Third-person, with context memory (MIND-World only reported): minor long-context memory and scene-consistency improvements and still-strong aesthetics/MUSIQ; rotational RPE increases compared to the cold start but remains much better than Matrix-Game 2.0.

Surprising findings:

  • Memory helps long-horizon stability but can hurt when the test-time action space differs from training; the model may ā€˜trust’ cached context tied to old motion scales and get confused.
  • Third-person exposes weaknesses hidden in first-person: some baselines fail to control the character, drifting through or past the actor. MIND-World controls third-person better but can still mishandle foreground–background interactions (e.g., character passing through buildings).
  • Visual prompts (initial images/videos) strongly affect action following; separating visual style from control dynamics could improve reliability.

Takeaways in simple terms:

  • With memory on, MIND-World generally remembers the map better over time (lower LCM), especially in third-person.
  • MIND-World produces cleaner, more attractive frames overall.
  • In third-person, MIND-World follows turns much more precisely than Matrix-Game 2.0 (big rotational accuracy gains), which is critical for real control.
  • However, when the ā€˜controller sensitivity’ (āˆ†p, āˆ†r) changes, memory-conditioned models can stumble—action generalization needs work.

05 Discussion & Limitations

Limitations:

  • Long-horizon memory: Models still struggle to keep a perfect map over very long rollouts; revisits can drift, and symmetric paths don’t always meet.
  • Action-space generalization: Changing āˆ†p/āˆ†r (movement/rotation scales) can confuse models, especially those relying on cached context tied to the training scales.
  • Third-person dynamics: Handling actor–camera–scene relationships remains hard; some models lose the character or let it pass through obstacles.
  • Coupling of visuals and control: Visual prompts can sway action following; better disentanglement of style from dynamics is needed.

Required resources:

  • Data: Access to MIND’s 250 1080p/24 FPS clips with action logs across diverse scenes.
  • Compute: Reasonable GPU resources (the baseline used 4Ɨ H100) for teacher–student training and distillation.
  • Tooling: ViPE for pose recovery; Sim(3) alignment; aesthetic and MUSIQ evaluators.

When NOT to use:

  • If you only care about short, open-loop, text-to-video prettiness without interaction, MIND may be overkill.
  • If your model doesn’t accept action inputs at all, many MIND scores won’t apply.
  • If you only operate in a single, fixed action sensitivity (no variation), the cross-action tests may not reflect your use case.

Open questions:

  • How to design action conditioning that adapts to new āˆ†p/āˆ†r on the fly—can models infer controller sensitivity from context?
  • Can we build longer, more robust memory (e.g., hierarchical or 3D) that survives minutes, not seconds, without drift?
  • What’s the best way to decouple visual appearance from control dynamics so style changes don’t derail action following?
  • How can third-person understanding of character physics improve (e.g., explicit collision constraints, scene graphs, or 3D spatial memory)?

06 Conclusion & Future Work

Three-sentence summary: MIND is a first-of-its-kind, closed-loop, open-domain benchmark that tests whether world models can both remember the world (memory consistency) and follow controls (action control) from first- and third-person views. It introduces varied action spaces, revisit paths, and symmetry checks, plus a simple, strong MIND-World baseline that streams action-conditioned video with context memory. Results show clear gains with memory and strong third-person control for MIND-World, while revealing tough challenges in long-horizon memory and cross-action generalization.

Main achievement: Turning ā€˜pretty videos’ into ā€˜trustworthy worlds’ by standardizing how we measure long-context memory, action following, and generalization across action scales and viewpoints.

Future directions:

  • Action adapters that learn āˆ†p/āˆ†r on the fly and robustly map actions to motion.
  • Scalable, geometry-aware memory (e.g., 3D anchored or hierarchical) to cut drift on long revisits.
  • Better third-person physics and collision grounding to keep characters anchored to the world.
  • Stronger disentanglement between visual style and control dynamics.

Why remember this: MIND sets the rules of the road for world models—if a model can remember where things are and move exactly as told, you can start trusting it in cars, robots, games, and AR/VR. It shifts evaluation from ā€˜Does it look nice?’ to ā€˜Does it behave like a consistent, controllable world?’

Practical Applications

  • Benchmark new world models for robots to ensure they remember object positions and follow navigation commands.
  • Stress-test self-driving simulators for long-horizon memory and precise steering under varied controller sensitivities.
  • Evaluate game AI that must keep levels consistent while reacting to player inputs in first- and third-person.
  • Pre-visualize film and VR scenes, checking that camera moves retrace accurately and visuals stay stable.
  • Train control policies that are robust to different joystick or camera-turn sensitivities (āˆ†p/āˆ†r).
  • Diagnose model drift using symmetric path tests to catch hidden geometric inconsistencies.
  • Compare memory modules (e.g., 3D memory vs. compressed caches) with standardized revisit metrics.
  • Validate action-injection designs by measuring translational/rotational pose errors on the same clips.
  • Curate data: identify which scene types or views your model struggles with and collect targeted training data.
  • Tune inference settings (context window size, cache strategy) for the best trade-off between stability and flexibility.
#world models#memory consistency#action control#closed-loop evaluation#first-person video#third-person video#action space generalization#video-to-world#autoregressive diffusion#pose recovery#Sim(3) alignment#temporal consistency#Unreal Engine 5#video benchmarks#symmetric path testing