Solaris: Building a Multiplayer Video World Model in Minecraft
Key Summary
- Solaris is a new AI that can imagine the future videos of two Minecraft players at the same time, keeping both cameras consistent with each other.
- The team built SolarisEngine, a multiplayer data system that coordinates two bots, records synchronized actions and videos, and scales to millions of frames.
- They collected 12.64 million multiplayer frames and created a benchmark to test movement, memory, grounding, building, and cross-view consistency.
- The model adapts a powerful single-player video diffusion transformer to multiplayer by interleaving tokens and sharing attention across players.
- Training happens in stages: single-player → multiplayer → causal (autoregressive) → Self Forcing for long, stable generations.
- They introduce Checkpointed Self Forcing, which makes long-horizon training memory-efficient while letting gradients flow through attention caches for better visuals.
- Solaris beats strong baselines on most tasks, especially building and view consistency, and keeps visuals sharper (lower FID).
- The code, datasets, and models are open-sourced to help others build multi-agent world models.
- This work lays the groundwork for multi-robot planning, multi-camera simulation, and collaborative agents that share a consistent world.
- Key limitation: no persistent world state yet; when agents disappear from view, long-term memory can drift.
Why This Research Matters
Many real-world problems involve multiple people, robots, or cameras sharing the same space. If their "mental movies" don't agree, plans fall apart and safety suffers. Solaris shows how to keep multi-camera stories consistent, which is essential for collaborative robots, multi-camera driving systems, and shared AR/VR experiences. Its memory-efficient training unlocks longer simulations that look sharper and behave more realistically. The open-source engine and datasets lower the barrier for others to build multi-agent models in new domains. Over time, this can lead to better teamwork between humans and AI in factories, homes, cities, and games.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you and a friend playing Minecraft on the same server. If you place a block, your friend should see that block appear too. If they turn left, you should see their character turn from your point of view. The two videos have to agree because they're showing the same world.
The Concept (Video World Model):
- What it is: A video world model is an AI that learns to predict the next frames of a video based on what happened before and which actions were taken.
- How it works: Step 1) Watch past frames; Step 2) Read the actions (like move, turn, place block); Step 3) Predict the next frames that make sense; Step 4) Repeat to make a full future video.
- Why it matters: Without it, an agent can't "imagine" outcomes before acting. It's like trying to play chess without mentally visualizing the next moves.
Anchor: When a robot plans to lift a cup, a video world model helps it "see" what will happen before it actually moves its arm.
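The four-step loop above can be sketched in a few lines of Python. This is a toy illustration only: the "model" here is a hand-written rule on numbers, whereas a real video world model is a learned network predicting image frames. All names are illustrative.

```python
# Toy sketch of the video-world-model loop (Steps 1-4 above).
# predict_next_frame stands in for a learned neural network.

def predict_next_frame(past_frames, action):
    """Stand-in 'model': the next frame follows the last one plus the action."""
    return past_frames[-1] + action  # a learned model replaces this rule

def rollout(initial_frames, actions):
    """Step 1: watch past frames; Step 2: read actions;
    Step 3: predict the next frame; Step 4: repeat."""
    frames = list(initial_frames)
    for a in actions:
        frames.append(predict_next_frame(frames, a))
    return frames

video = rollout([0.0], [1.0, 1.0, -0.5])
print(video)  # [0.0, 1.0, 2.0, 1.5]
```

The key point is structural: each prediction is fed back in as context for the next one, which is exactly what makes long rollouts hard (errors compound).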
The World Before: Before Solaris, most video world models were like single-player cameras. They could show what one agent would see next if that agent moved or acted, but only from that one viewpoint. Real life, and most interesting games, are multi-agent: people coordinate, collide, and cooperate. In multi-agent worlds, consistency matters: one agent's action should instantly show up in everyone else's view. Single-camera models can't check or enforce that agreement between cameras.
The Problem: How do we build an AI that can imagine the future for multiple players at once, keeping their videos consistent with each other? This is hard because:
- Two views must agree on shared facts (like the same block appearing at the same place and time).
- One playerās movement changes what the other player sees (occlusions, angles, distances).
- The world is dynamic (weather, mobs), so the model must separate random events from player-caused changes.
- Failed Attempts
- Low-level RL platforms: Some Minecraft toolkits expose tiny actions (like raw key presses) that need RL training to do anything meaningful. This is too slow and creates narrow, reward-chasing data that isn't diverse enough for world models.
- Single-player data only: Great for one view, but it can't teach cross-view consistency.
- Naive multi-view hacks (like just gluing frames together): They struggle to make both players' videos agree over time and often "hallucinate" actions or lose detail.
- The Gap: What was missing was a full pipeline: a reliable multiplayer data system that captures synchronized actions and videos, a model architecture that lets player views talk to each other, and a training recipe that stabilizes long, autoregressive generations without running out of memory.
Hook: You know how a movie set needs cameras that are all synced, with a director making sure scenes match from every angle? That's what multi-agent AI needs too.
The Concept (Multiplayer Data System):
- What it is: A system that coordinates multiple bots, records their actions, and captures aligned videos so every camera shows the same world events at the same times.
- How it works: Step 1) Two controller bots act in Minecraft; Step 2) Two "camera bots" mirror them to render real graphics; Step 3) A plugin keeps them in perfect sync; Step 4) Timestamps align actions and frames; Step 5) Docker orchestration restarts anything that gets stuck and scales to many episodes in parallel.
- Why it matters: Without clean, synchronized data, a multi-view model can't learn what "consistency" even looks like.
Anchor: If Player 1 places a torch at time t, the system guarantees both Player 1's and Player 2's videos show that torch starting at the same time.
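Step 4 (timestamp alignment) can be sketched as a small merge of two time-stamped streams. The data layout and field names below are hypothetical, not SolarisEngine's actual format; this just shows the pairing logic.

```python
# Sketch of timestamp alignment: pair each recorded action with the first
# frame whose timestamp is at or after the action's timestamp.
from bisect import bisect_left

def align(actions, frame_times):
    """actions: list of (timestamp, action_name); frame_times: sorted floats.
    Returns (frame_index, action_name) pairs; drops actions after the last frame."""
    aligned = []
    for t, name in actions:
        i = bisect_left(frame_times, t)  # first frame not earlier than t
        if i < len(frame_times):
            aligned.append((i, name))
    return aligned

# Player 1 places a torch at t=0.10s; both cameras record at 20 fps
# (0.05 s/frame), so both views should show the torch from frame index 2.
frames = [k * 0.05 for k in range(6)]
print(align([(0.10, "place_torch")], frames))  # [(2, 'place_torch')]
```

Running the same alignment against both players' frame streams is what lets the dataset guarantee the torch appears at the same moment in both videos.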
- Real Stakes
- Collaborative robots: Multiple robots need a shared āmental movieā of the same warehouse.
- Multi-camera driving: Dashcams and street cameras should agree on carsā positions and motions.
- AR/VR: Friends in the same virtual room need consistent views to avoid motion sickness and confusion.
- Safety and planning: Agents that can predict each other reduce surprises and collisions.
- Games and training: Simulators that understand multiple players make smarter, more human-like AI.
Hook: Think of it like a team sport. If each player sees a different ball, nobody can pass or score. Solaris aims to make sure everyone sees the same ball, at the same time, from their angle.
02 Core Idea
Hook: Imagine two reporters covering the same soccer match from opposite sides of the field. If a goal is scored, both reports must agree on that exact moment, just from different angles.
The Concept (Aha!):
- What it is: Solaris treats both players' videos as one interleaved sequence so the model can "pay attention" across players and keep their stories aligned.
- How it works: Step 1) Take tokens (tiny video chunks) from Player 1 and Player 2; Step 2) Interleave them in a shared transformer; Step 3) Use a special shared attention layer so information can flow across views; Step 4) Condition on actions per player so the model knows who did what; Step 5) Train in stages from easy (single-player) to hard (multiplayer, long rollouts).
- Why it matters: If player views don't talk to each other, the model invents conflicting stories, like one view raining and the other sunny at the same moment.
Anchor: After Player 2 turns right, Solaris updates Player 1's view so Player 2's character appears where it should (on Player 1's left or right), depending on the scene.
Multiple Analogies (3 ways):
- Newspaper Editors: Two editors combine their reportersā notes into one timeline, catching contradictions.
- Orchestra Conductor: Different instruments (players) play different notes (actions), but the conductor (shared attention) keeps them in sync.
- Split-Screen Video: Two screens show the same event from two angles, but a smart editor aligns frames so events match perfectly.
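The interleave-then-attend idea can be sketched with scalar "tokens". This is a minimal illustration under heavy simplification: real tokens are high-dimensional video patches with learned query/key/value projections, while here attention weights are just a softmax of pairwise products.

```python
# Toy sketch: interleave two players' tokens (Steps 1-2), then run a shared
# attention pass in which every token can see BOTH players' tokens (Step 3).
import math

def interleave(p1_tokens, p2_tokens):
    """Merge per-player token lists into one sequence, tagging each with a player ID."""
    merged = []
    for a, b in zip(p1_tokens, p2_tokens):
        merged.append((1, a))  # (player_id, token)
        merged.append((2, b))
    return merged

def shared_attention(seq):
    """Each token attends over ALL tokens -- both players' -- so information
    flows across views. Weights are a softmax over pairwise products."""
    out = []
    for _, q in seq:
        scores = [q * k for _, k in seq]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append(sum(w / z * v for w, (_, v) in zip(exps, seq)))
    return out

seq = interleave([1.0, 2.0], [1.0, 2.0])
mixed = shared_attention(seq)
# When both players feed in identical tokens, their outputs agree per step:
print(mixed[0] == mixed[1], mixed[2] == mixed[3])  # True True
```

The design point: because the two views live in one attention sequence, "agreement" is not an extra loss bolted on afterward; it falls out of tokens reading each other every layer.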
Hook: You know how you first learn to ride a bike in a quiet park before heading into traffic? Training models is similar: start simple, then add complexity.
The Concept (Staged Training: Bidirectional → Causal → Self Forcing):
- What it is: A step-by-step training recipe that gradually makes the model better at long, realistic rollouts.
- How it works: Step 1) Learn rich visuals and actions from single-player humans (VPT); Step 2) Learn two-player consistency with full access to past and future (bidirectional); Step 3) Switch to causal so the model predicts like at test time (future unknown); Step 4) Use Self Forcing so the model learns from its own generated videos; Step 5) Use Checkpointed Self Forcing to do Step 4 efficiently over long sequences.
- Why it matters: Jumping to "hard mode" too early makes training unstable; the staged path builds skills layer by layer.
Anchor: First, Solaris learns Minecraft basics (move, turn, place). Next, it practices seeing how two players' views match. Finally, it practices "rolling forward" without peeking at the future, while a teacher corrects it.
Hook: Saving your game often keeps you from redoing a whole level when you mess up.
The Concept (Checkpointed Self Forcing):
- What it is: A memory-efficient twist on Self Forcing that recomputes just what's needed, so you can train on long videos without running out of memory.
- How it works: Step 1) Roll out a video with a sliding window, but stop tracking unnecessary gradients; Step 2) Cache just the clean/noisy key frames; Step 3) Re-run a single, parallel pass to simulate the last denoising step for every frame; Step 4) Let gradients flow through attention caches to improve visuals.
- Why it matters: Without this, memory balloons when windows slide over long sequences, making training impractical.
Anchor: It's like recording a whole piano performance once, noting which bars you flubbed, and then re-practicing just those bars in one focused session.
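The memory pattern behind Steps 1-2 can be illustrated with a toy rollout. Note the hedge: this sketch tracks only memory bookkeeping (no autodiff, no diffusion), and the window and cache sizes are made up; it is not the paper's actual implementation.

```python
# Toy illustration of the checkpointing memory pattern: roll out with a
# sliding window, keeping only a small cache of boundary "key frames" for
# the later parallel re-pass, instead of every intermediate activation.

WINDOW = 4  # frames of context the model sees at once (illustrative)

def rollout_with_checkpointing(num_frames):
    cache = []       # only the states needed to restart each window later
    peak_live = 0    # most frames held "live" in memory at any moment
    frames = [0]
    for t in range(1, num_frames):
        context = frames[-WINDOW:]       # sliding window, bounded size
        peak_live = max(peak_live, len(context))
        frames.append(context[-1] + 1)   # stand-in for one "denoising" step
        if t % WINDOW == 0:
            cache.append(frames[-1])     # checkpoint a window-boundary state
    return frames, cache, peak_live

frames, cache, peak = rollout_with_checkpointing(100)
# Live memory is bounded by the window, and the cache stays far smaller
# than the full rollout -- that is what makes long-horizon training fit:
print(peak <= WINDOW, len(cache) < len(frames))  # True True
```

In the real method, the cached states then feed one parallel pass that simulates the final denoising step for every frame, which is where gradients (including through attention caches) are actually computed.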
Before vs After
- Before: Multi-agent models were rare, and cross-view consistency often broke over time; training long rollouts was memory-heavy.
- After: Solaris synchronizes views with a simple architecture change (shared attention), scales via a clean data system, and stabilizes long generations using Checkpointed Self Forcing.
Why It Works (intuition, not math)
- Interleaving and shared attention let the model "compare notes" across players each step.
- Action conditioning tells the model "who did what" so causality lines up across views.
- Staged training aligns how the model learns with how it's used: first rich perception, then multiplayer understanding, then test-time-like generation.
- Checkpointing reduces training memory while still sending useful gradient signals through attention, sharpening details.
Building Blocks
- Multiplayer attention with player ID embeddings to distinguish views while sharing knowledge.
- Action modules per player to feed their own controls.
- First-frame conditioning so the model knows where it's starting from.
- Sliding-window causal attention for efficient autoregressive rollout.
- Teacher-student Self Forcing with a longer-context teacher and a memory-efficient student.
03 Methodology
At a high level: Past videos and actions from two players → [Encode into tokens + add player IDs + add actions] → [Shared transformer with multiplayer attention] → [Diffusion denoising to predict next frames] → Output two future videos that agree.
Step-by-step recipe
- Build the multiplayer data kitchen (SolarisEngine)
- What happens: Two kinds of bots run per player: a controller bot (decides actions) and a camera bot (renders the real Minecraft video). A server plugin mirrors the controller's state to the camera in real time. Docker Compose spins up many of these pairs in parallel for scalable data collection. If anything gets stuck, a safety reset restarts the episode.
- Why this step exists: The model needs synchronized action-video pairs from both players. Without alignment, it can't learn what "consistent" looks like.
- Example: Player 1 places 3 blocks while Player 2 watches. Both videos show the new blocks appearing at the same moments.
Hook: Think of a stunt show filmed by two cameras; both need to start and end at the same times to edit the scene properly.
The Concept (SolarisEngine):
- What it is: A multiplayer data system that captures synchronized actions and videos with scalable automation.
- How it works: Step 1) Controller bots execute high-level skills (pathfinding, building, combat); Step 2) Camera bots render graphics in headless mode with GPU; Step 3) A plugin mirrors poses and animations; Step 4) Timestamps align frames and actions; Step 5) Orchestration restarts failed episodes and randomizes spawns for diversity.
- Why it matters: Without clean, reliable data, the model can't learn cross-view consistency.
Anchor: If the weather turns to rain, both cameras start showing raindrops at the same time.
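The "restart anything that gets stuck" idea (Step 5) boils down to a supervisor loop. The function names and retry budget below are hypothetical; a real orchestrator would restart Docker containers rather than call Python functions.

```python
# Hedged sketch of a safety-reset supervisor: run an episode, and on failure
# reset and retry up to a fixed budget before giving up.

def collect_with_reset(run_episode, max_retries=3):
    """Returns (episode_result, attempt_number) on success."""
    for attempt in range(1, max_retries + 1):
        try:
            return run_episode(), attempt
        except RuntimeError:
            continue  # a real supervisor would also restart the stuck bots here
    raise RuntimeError("episode failed after all retries")

calls = {"n": 0}
def flaky_episode():
    calls["n"] += 1
    if calls["n"] < 3:           # the first two attempts "hang"
        raise RuntimeError("stuck")
    return "synced frames"

print(collect_with_reset(flaky_episode))  # ('synced frames', 3)
```

At scale, running many such supervised pairs in parallel is what lets the pipeline reach millions of frames without manual babysitting.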
- Prepare the training menu (Dataset and tasks)
- What happens: They collect 12.64 million frames across building, combat, movement, and mining, balancing simple and complex worlds (Superflat and Normal). Episodes range from short clips to ~25 seconds at 20 fps, covering turning, walking, fighting, placing, mining, and more.
- Why this step exists: The model needs variety to learn general rules that transfer across biomes, times of day, and tasks.
- Example: In a "build a square" episode, one player places blocks while the other watches. Both videos record from different angles, aligned by time.
- Adapt the chef (Model architecture)
- What happens: Start with a strong single-player video diffusion transformer. Expand the action module to full Minecraft controls. Interleave players' video tokens and share a self-attention layer so information flows across views. Keep first-frame conditioning per player.
- Why this step exists: Cross-view self-attention is the bridge that keeps two stories synced.
- Example: If Player 2 turns to face a torch, Player 1's view updates so Player 2's avatar rotates correctly in their camera.
Hook: Learning to talk first, then learning to coordinate with a friend.
The Concept (Bidirectional vs. Causal):
- What it is: Bidirectional looks at past and future during training to learn rich representations; causal looks only at the past, like at test time.
- How it works: Step 1) Train bidirectionally for sharp visual understanding; Step 2) Switch to causal with a sliding window so the model predicts frame by frame without peeking.
- Why it matters: If you train only causally from scratch, the model may be blurry or unstable; if you never switch to causal, it won't match test-time behavior.
Anchor: Studying the whole chapter (bidirectional) before taking a quiz where you can only see previous questions (causal).
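"Causal with a sliding window" is just a constraint on which frames may attend to which. The sketch below builds such a mask; the sizes are illustrative, not the model's real configuration.

```python
# Sliding-window causal attention mask: mask[i][j] is True when frame i may
# attend to frame j -- only the past (j <= i) and only the last `window` frames.

def sliding_causal_mask(num_frames, window):
    return [[(j <= i) and (i - j < window) for j in range(num_frames)]
            for i in range(num_frames)]

mask = sliding_causal_mask(5, window=3)
print(mask[4])  # [False, False, True, True, True] -> frame 4 sees frames 2-4 only
print(mask[1])  # [True, True, False, False, False] -> no peeking at the future
```

Bidirectional training simply drops the `j <= i` condition; switching it back on is what makes training match test-time, frame-by-frame generation.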
- Make it roll (Autoregressive generation)
- What happens: The causal model predicts the next frames repeatedly, using a sliding window of recent context so memory and compute stay bounded.
- Why this step exists: Long videos need to be efficient; otherwise, memory and time explode.
- Example: To generate 10 seconds, the model repeatedly takes the last slice of frames and predicts a bit more.
Hook: Practice tests help you learn the exact format of the real test.
The Concept (Self Forcing):
- What it is: A training trick where the model learns from its own generated videos, reducing the mismatch between training (with ground truth) and testing (with its own predictions).
- How it works: Step 1) Roll out your own video; Step 2) Compare it to what a strong teacher would do; Step 3) Fix your mistakes; Step 4) Repeat.
- Why it matters: Without this, small errors snowball during long rollouts.
Anchor: It's like doing a practice essay, grading it with a rubric, and then rewriting it better.
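The four steps above can be shown on a one-parameter toy model. Everything here is made up for illustration (a scalar "frame", a finite-difference gradient instead of backprop), but the loop is the Self Forcing loop: the student rolls out from its own outputs, a teacher trajectory scores the rollout, and the student updates to shrink the gap.

```python
# Toy Self Forcing: the student's next "frame" is a * previous frame, so
# small errors in `a` compound over the rollout -- exactly the snowballing
# problem that training on one's own rollouts is meant to fix.

def student_rollout(a, x0, steps):
    xs = [x0]
    for _ in range(steps):
        xs.append(a * xs[-1])          # Step 1: roll out from OWN predictions
    return xs

def loss(a, teacher, x0):
    xs = student_rollout(a, x0, len(teacher) - 1)
    return sum((x - t) ** 2 for x, t in zip(xs, teacher)) / len(teacher)

teacher = student_rollout(1.0, 1.0, 5)   # Step 2: teacher keeps the signal steady
a, lr, eps = 0.8, 0.05, 1e-4
for _ in range(200):                     # Step 4: repeat
    # Step 3: fix mistakes (finite-difference gradient stands in for backprop)
    grad = (loss(a + eps, teacher, 1.0) - loss(a - eps, teacher, 1.0)) / (2 * eps)
    a -= lr * grad
print(round(a, 2))  # 1.0 -- converged to the teacher's value
```

Because the loss is computed on the student's own rollout, the gradient "feels" how an early error grows downstream, which is the signal ordinary teacher-forced training never sees.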
- Keep memory in check (Checkpointed Self Forcing)
- What happens: First, generate a long rollout while freezing unnecessary gradients; cache just the key states. Then, re-run one parallel pass that simulates the final denoising step for all frames, letting gradients flow even through attention caches.
- Why this step exists: Sliding windows overlap a lot; keeping all gradients makes memory blow up. Checkpointing cuts memory while preserving the learning signals that sharpen details.
- Example: The model can train on sequences hundreds of frames long without running out of memory, and visuals (textures, edges) look better.
Hook: You don't rewatch the whole movie to study one scene; you bookmark the scene, then replay just that part.
The Secret Sauce
- Interleaving tokens + shared attention = cross-view consistency.
- Full action conditioning per player = clear causality of who did what.
- Staged bidirectional → causal → Self Forcing = stable long-horizon behavior.
- Checkpointed Self Forcing = long sequences without memory blowups, plus better visuals via attention-cache gradients.
04 Experiments & Results
Hook: When you test a team, you don't just see who runs fastest; you test passing, teamwork, memory of plays, and how well they coordinate.
The Concept (VLM-as-a-judge):
- What it is: A vision-language model (VLM) watches generated frames and answers simple, checkable questions (e.g., "Do both players now see the same scenery?").
- How it works: Step 1) For each task, pick key frames; Step 2) Ask a specific question with known correct answers; Step 3) Score whether the VLM's answer matches the ground truth.
- Why it matters: It turns fuzzy "looks right" into measurable "answers right" across tasks.
Anchor: Like a referee answering, "Was the ball over the line? Yes or no."
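The scoring loop in Steps 1-3 reduces to comparing judge answers against ground truth. In this sketch the "judge" is a stand-in function and the questions are invented examples; in the benchmark a real VLM answers questions about generated key frames.

```python
# Sketch of VLM-as-a-judge scoring: fraction of questions where the judge's
# answer matches the known ground truth.

def score(episodes, judge):
    """episodes: list of (key_frame, question, ground_truth_answer)."""
    correct = sum(judge(f, q) == gt for f, q, gt in episodes)
    return correct / len(episodes)

episodes = [
    ("frame_a", "Do both players see the same new scenery?", "yes"),
    ("frame_b", "Is the torch visible in Player 2's view?", "yes"),
    ("frame_c", "Did Player 1 turn left?", "no"),
]
fake_judge = lambda frame, question: "yes"   # hypothetical judge: always says "yes"
print(round(score(episodes, fake_judge), 3))  # 0.667 -- 2 of 3 match
```

Because each question has a single checkable answer, the metric is an accuracy, not a subjective quality rating, which is what makes it comparable across tasks and models.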
The Tests (What and Why)
- Movement: Did the model follow actions like walk/turn correctly from both views? This checks action following and viewpoint geometry.
- Grounding: If one player turns away and back, does the model know where the other player should appear? This checks spatial grounding across time.
- Memory: If both turn away and back, do they refind each other? This checks short-term memory of scene layout.
- Building: When one builds a structure, does the other observe it accurately? This checks environment updates driven by actions.
- Consistency: If both turn to the same side, do their views show the same new region? If they turn opposite, do views differ appropriately? This checks cross-view agreement.
The Competition
- Baseline: A frame-concatenation model inspired by prior multiplayer work that merges views along channels.
- Solaris variants: With and without single-player pretraining, and with different Self Forcing setups.
The Scoreboard (Context-rich)
- Visual quality (FID): Lower is better. Solaris consistently shows lower (better) FID than baselines across tasks, like getting an A- for crispness where others get C+ to B-.
- VLM accuracy: On Movement, Solaris is competitive, though the concatenation baseline edges it out on that single task. Solaris excels on the harder, more "worldly" tasks where multi-view reasoning really matters. On Building, it scores meaningfully above baselines (which are often near zero), showing it truly reflects environment changes across views. On Consistency, it shows notably higher accuracy, meaning both cameras agree about new scenery after joint turns.
- Qualitative surprises: Solaris keeps PvP coherent on rough terrain, synchronizes weather onset across views, updates inventories after placements, and gets animations (like torch placement) right. Baselines tend to blur, flatten textures, or hallucinate actions when no-ops occur.
Hook: Studying with a coach makes you sharper than just reading a book.
The Concept (KV-cache backprop ablation):
- What it is: An ablation where gradients are allowed to flow through attention caches during Checkpointed Self Forcing.
- How it works: Compare training with vs. without those gradients.
- Why it matters: With gradients, visuals are sharper (lower FID), though some action-following scores can dip slightly. Overall, Solaris remains competitive across tasks and strongest on Building and Consistency.
Anchor: Letting the coach comment on your footwork (not just your final shot) improves technique, even if you momentarily lose a point on speed.
Bottom Line
- Solaris offers the best balance: strong visuals and superior multi-view reasoning where it counts (building changes, cross-view agreement), enabled by its architecture and Checkpointed Self Forcing.
05 Discussion & Limitations
Limitations
- No persistent world state: If players go out of view for a while, the model can drift about where things are, since it doesn't maintain a separate, explicit 3D memory of the world.
- Synthetic-only training: Bots act realistically, but it's not human gameplay; visuals and actions may still differ from human data distributions.
- Two-player focus: The architecture can extend to more, but experiments and data focus on two.
- GUI gaps: Some GUI interactions (like inventory screens) aren't captured, limiting certain behaviors.
Required Resources
- Substantial compute (TPUs/GPUs) for training diffusion transformers on millions of frames.
- Storage and bandwidth for datasets and model checkpoints.
- A modern containerized setup (Docker, orchestration) for scalable data collection.
When NOT to Use
- If you need exact, engine-level physics or perfect world persistence (e.g., for precise simulators), a learned video model may not guarantee frame-perfect correctness.
- If your scenario involves complex GUI manipulation or text UIs not present in the training set.
- If you only need a single, fixed camera and have no multi-agent interactions, simpler models may suffice.
Open Questions
- How to add persistent memory so agents remember off-screen objects and re-encounter them accurately later?
- How to scale beyond two players while keeping training efficient and views consistent?
- How to mix human multiplayer data with synthetic data to boost realism and generalization?
- How to couple planning/policy learning directly on top of the multiplayer world model for coordinated strategies?
- How to evaluate long-horizon story consistency beyond short clips, with richer, automated metrics?
06 Conclusion & Future Work
Three-sentence summary: Solaris is a multiplayer video world model that predicts future videos for two Minecraft players at once, keeping their views consistent as they act. It's powered by SolarisEngine for synchronized data, an architecture that shares attention across players, and a training recipe that ends with Checkpointed Self Forcing for long, stable rollouts. The system outperforms baselines on key multi-view tasks like building and consistency while producing sharper visuals.
Main Achievement: The #1 contribution is Checkpointed Self Forcing: a memory-efficient way to train long-horizon autoregressive video models while allowing gradients through attention caches, significantly improving multi-view visual fidelity and stability.
Future Directions
- Add persistent world memory to track off-screen objects and maintain a unified world state.
- Scale to more than two players and more diverse, human-like multiplayer behaviors.
- Couple the world model with planning and policy learning for coordinated multi-agent decision-making.
- Expand to other domains (multi-camera driving, robotics, AR) with hybrid human/synthetic datasets.
Why Remember This: Solaris shows how to make multi-agent video "stories" agree across cameras, a core requirement for collaborative AI. Its open-source engine, datasets, and models set a foundation others can build on, and its memory-friendly training technique unlocks longer, more realistic simulations. As AI agents increasingly share spaces with us, and with each other, consistent, multi-view world models like Solaris will be the backbone of safe and smart coordination.
Practical Applications
- Train collaborative warehouse robots that coordinate paths and tasks using a shared, consistent video world.
- Simulate multi-camera driving scenes for testing autonomous vehicles under varied weather and traffic.
- Power multi-user AR/VR so friends see the same virtual objects from different angles without mismatch.
- Prototype game AI that understands teammates and opponents from multiple viewpoints for better tactics.
- Create digital twins of construction sites where multiple agents (cranes, drones, workers) stay in sync.
- Design crowd-safety drills where many agents move through shared spaces with consistent predictions.
- Assist multi-camera sports analytics that reconcile different angles into a consistent play-by-play.
- Support multi-robot exploration (e.g., drones) that keep a coherent map as they split and regroup.
- Generate synthetic training data for vision-language-action models that require consistent multi-view scenes.
- Benchmark and debug perception systems by stress-testing multi-view consistency across edge cases.