World Models for Policy Refinement in StarCraft II
Key Summary
- The paper builds StarWM, a "world model" that lets a StarCraft II agent imagine what will happen a few seconds after it takes an action.
- They turn messy game observations into neat text with five labeled parts (Info, Queue, My Units, My Structures, Visible Hostiles) so an LLM can learn different kinds of game rules more easily.
- They create SC2-Dynamics-50k, a 50k-example dataset that teaches the model to predict future observations conditioned on actions under fog-of-war.
- They design a careful offline test suite that checks economy numbers, building progress, unit health and positions, and whole-map consistency (using a transport-style distance metric).
- StarWM beats strong zero-shot LLM baselines by large margins offline, including about 60% better resource prediction and much better self-side spatial consistency.
- They plug StarWM into a Generate→Simulate→Refine loop so the policy proposes an action, simulates the outcome, then revises the plan using the prediction.
- Online games against the built-in AI show higher win rates (+30% on Hard and VeryHard with the 32B policy), fewer supply blocks, better spending, and smarter fights.
- Zero-shot LLMs alone aren't good forward simulators for SC2; training on action-conditioned data is crucial.
- Enemy prediction under fog-of-war is still hard; conservative predictions can "hallucinate" defenders, which hurts offline scores but sometimes helps online caution.
- This work shows how giving LLM agents a small "crystal ball" can make their choices steadier, safer, and more efficient.
Why This Research Matters
This work shows how to give AI agents a safe "peek" into the near future so they can fix mistakes before they happen. In games, that means fewer supply blocks, smoother spending, and smarter fights; in the real world, it points toward better planning for robots, vehicles, and logistics. The key is simple but powerful: learn short-horizon, action-conditioned rules and feed the forecast back into decision-making. Because the method is lightweight and interpretable (text-structured), it is practical to adopt and debug. The evaluation suite also teaches us how to judge predictions in ways that truly matter, like numbers and positions, not just word overlap. Overall, it is a reusable blueprint for turning reactive systems into foresightful ones.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're playing hide-and-seek on a huge playground with lots of toys to build, chores to do, and teammates to direct. You can't see behind walls, and everyone moves at once. If you could peek 5 seconds into the future before choosing your move, you'd make smarter choices.
The Concept (Partial Observability): In StarCraft II (SC2), players only see what their own units can see; this is called partial observability. How it works:
- The game hides enemy locations and plans behind "fog-of-war".
- You must guess what's happening off-screen using clues (what you've seen before, timings, typical strategies).
- You choose actions (build, move, attack) without full information.
Why it matters: Without handling partial observability well, an agent can overspend, walk into ambushes, or stall production because it guesses wrong.
Anchor: You send workers to mine, but you don't see the enemy army marching. If you could predict a few seconds ahead, you might build defense first instead of a fancy new building.
Hook: You know how good chess players imagine a few moves ahead to avoid blunders? They play "what if?" in their heads.
The Concept (World Model): A world model is an internal "imagination engine" that predicts what the game will look like in the near future if you take certain actions. How it works:
- Read the current observation (what you can see now).
- Consider a candidate action (e.g., Train Marine, Build Depot, Attack here).
- Predict how resources, queues, units, and visible enemies will look a few seconds later.
- Use that forecast to improve your decision before you actually commit.
Why it matters: Without a world model, the agent acts myopically: it reacts to the now and often makes wasteful or risky moves.
Anchor: If the world model predicts your minerals will drop to 50 and you'll still have plenty of unused supply, it advises "Train SCV" instead of "Build Supply Depot", saving you from a shortage.
The world before this paper: LLM-based SC2 agents made strides by summarizing long histories, adding external knowledge, and splitting strategy from tactics. But most efforts only polished the policy (the "what to do" generator). They didn't plug in a learnable, action-conditioned transition model, a piece that can say, "If you do X, the world will likely look like Y."
The problem: SC2's dynamics are tricky. There are different "mini-worlds" all entangled: economy (minerals/gas flow), development (build/research progress), micro-control (unit motion and health), and foggy enemy interactions. Predicting all this from only partial views is hard. Even if you could learn it, you still need a simple way to use those predictions to actually make better choices.
Failed attempts: Zero-shot LLMs (used as simulators without training on SC2 dynamics) don't learn game physics like "units in training already consume supply" or "buildings finish at a certain rate." Another classic approach, Defogger, extrapolated states without conditioning on your specific actions, so it couldn't answer "what if I do this exact command?"
The gap: A learnable, action-conditioned world model for SC2 under fog-of-war was missing, along with a representation that separates the different types of dynamics and a fair way to score predictions beyond generic text metrics.
Real stakes: Better foresight helps agents avoid supply blocks, spend resources efficiently, and dodge bad fights, just like a good team captain. For real-world parallels, such "world models" can help robots plan moves, cars predict traffic, or managers schedule resources more wisely.
Hook: Tidying your room is faster when you label bins: "books," "toys," "clothes," "trash."
The Concept (Structured Textual Observation Representation): The paper turns the messy SC2 observation into five labeled text sections: Info, Queue, My Units, My Structures, Visible Hostiles. How it works:
- Info tracks economy and status numbers.
- Queue tracks what's being built/researched and how far along it is.
- My Units handles moving pieces with health and positions.
- My Structures lists buildings and their states/add-ons.
- Visible Hostiles lists only what you can currently see.
Why it matters: Each section follows different rules (mathy flows, ticking progress, movement, combat). Splitting the view helps an LLM learn each rule type better and faster.
Anchor: It's like giving the model five mini-puzzles instead of one gigantic jumble, so it can solve each mini-puzzle with the right tricks.
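As a rough illustration, the five-part layout might be rendered like this. This is a hypothetical sketch: the section labels match the paper, but the field names, separators, and exact formatting are assumptions, not the paper's real template.

```python
def format_observation(obs: dict) -> str:
    """Render a raw observation dict as the five labeled text sections.
    Field names (minerals, supply_cap, ...) are illustrative assumptions."""
    info = obs["info"]
    sections = [
        f"Info: Minerals: {info['minerals']} | Gas: {info['gas']} | "
        f"Supply: {info['supply_used']}/{info['supply_cap']}",
        "Queue: " + ("; ".join(
            f"{q['building']}: {q['task']} ({q['progress']:.0%})"
            for q in obs["queue"]) or "empty"),
        "My Units: " + ("; ".join(
            f"{u['type']} at {u['pos']} HP {u['hp']}"
            for u in obs["units"]) or "none"),
        "My Structures: " + ("; ".join(
            f"{s['type']} ({s['state']})" for s in obs["structures"]) or "none"),
        "Visible Hostiles: " + ("; ".join(obs["hostiles"]) or "none"),
    ]
    return "\n".join(sections)
```

Each section stays on its own line, so the model (and a human debugging it) can see at a glance which "mini-puzzle" any given token belongs to.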
To teach the world model, the authors built SC2-Dynamics-50k, a large set of "before action → after 5 seconds" examples. They also crafted an offline evaluation suite that measures numbers, progress, unit attributes, and whole-map spatial consistency (using a transport-style distance). Finally, they plug the model into a simple loop: Generate (an action), Simulate (predict the future), Refine (fix the action). That is how this work gives SC2 agents a safe, small "peek" into the future and turns hunches into steadier choices.
02 Core Idea
Hook: Picture planning a bike ride: you check the weather, imagine the route, and maybe change your start time if rain is coming. That tiny mental simulation saves the day.
The Concept (Key Insight): Let the SC2 policy imagine short-term futures with an action-conditioned world model, then use those predictions to refine its actions. How it works:
- Propose: The policy suggests an action from the current observation.
- Predict: The world model simulates what the observation will look like after a few seconds if that action is taken.
- Polish: The policy revisits its action with that forecast in mind and improves it.
Why it matters: This turns reactive guesses into foresightful plans, reducing supply blocks, wasteful spending, and unlucky fights.
Anchor: If your "peek" says you'll be mineral-starved, you hold off on a depot and train an SCV instead.
Three analogies for the same idea:
- Cooking: You mix a batter (propose), peek through the oven window (predict), then lower the temperature if it's browning too fast (refine).
- Navigation: You pick a route (propose), check traffic predictions (predict), then switch roads if jams are ahead (refine).
- Sports: You call a play (propose), visualize the defenseâs likely shift (predict), then audible to a safer play (refine).
Hook: You know how different school subjects need different study methods? Math needs step-by-step logic; art needs creativity.
The Concept (Action-Conditioned Dynamics Model): This is the heart of the world model: it predicts how the observation changes because of your specific action. How it works:
- Read the current observation and the exact action sequence.
- Apply the right "mini-rules": add/subtract resources, tick build progress, move units, adjust HP from combat, update visibility.
- Output the predicted observation after a fixed short horizon (like 5 seconds).
Why it matters: Without conditioning on the chosen action, you can't do true "what-if" planning.
Anchor: If you "Train Marine" now, supply used should rise immediately (queued units consume supply), minerals drop by the cost, and the Barracks queue fills; those specifics depend on your exact action.
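That bookkeeping rule can be sketched in a few lines. The Marine's 50-mineral / 1-supply cost is the standard game value; treating the whole update as instantaneous, and the state layout itself, are simplifying assumptions.

```python
MARINE = {"minerals": 50, "gas": 0, "supply": 1}

def apply_train(state: dict, unit: dict = MARINE) -> dict:
    """Immediate effects of issuing a train command: the queued unit
    consumes resources and supply right away, and the producer's queue
    gains a task at 0% progress."""
    s = dict(state)
    s["minerals"] -= unit["minerals"]
    s["gas"] -= unit["gas"]
    s["supply_used"] += unit["supply"]
    s["queue"] = s.get("queue", []) + [("Marine", 0.0)]  # (task, progress)
    return s
```

A zero-shot LLM routinely gets exactly this wrong, e.g., it waits until the unit pops before charging supply, which is why action-conditioned training data matters.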
Hook: Sorting your backpack into folders (homework, notes, art) makes it easy to find what you need.
The Concept (Structured Textual Observation Representation): A neat five-part text layout (Info, Queue, My Units, My Structures, Visible Hostiles) that matches different game rules. How it works:
- Economy math goes in Info.
- Timers go in Queue.
- Motion and HP go in My Units.
- Building states and add-ons go in My Structures.
- What you can currently see of the enemy goes in Visible Hostiles.
Why it matters: The model can "switch mental gears" depending on the section, learning hybrid dynamics faster and better.
Anchor: The Queue section lets the model learn "progress = progress + (delta / build_time)" without getting distracted by unit coordinates.
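That Queue rule is literally a one-liner; the 25-second build time in the test below is only an illustrative value, not a claim about any specific unit.

```python
def tick_progress(progress: float, delta_s: float, build_time_s: float) -> float:
    """Queue mini-rule from the text: progress += delta / build_time,
    clamped to 100% once the task finishes."""
    return min(1.0, progress + delta_s / build_time_s)
```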
Hook: Practicing with good examples beats guessing from scratch.
The Concept (SC2-Dynamics-50k Dataset): A 50k-sample training set of "current observation + actions → observation in 5 seconds" pairs, all under fog-of-war. How it works:
- Collect many real trajectories.
- Slice them into input-output pairs using the five-part text format.
- Fine-tune an LLM (with LoRA) to predict the future observation.
Why it matters: Zero-shot LLMs don't know SC2's physics; task-specific data is the teacher that fixes this.
Anchor: Showing thousands of "Train Marine" examples with costs, supply, and progress teaches the model exactly how training affects future states.
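The slicing step might look like the following sketch. It assumes per-second observation and action logs; the paper's actual extraction pipeline is not specified here.

```python
def make_pairs(obs_seq, act_seq, horizon_s=5, stride_s=1):
    """Slice a per-second trajectory into supervised examples of
    (current observation, actions over the horizon, future observation),
    mirroring the 5s-horizon / 1s-stride setup described in the paper."""
    pairs = []
    for t in range(0, len(obs_seq) - horizon_s, stride_s):
        pairs.append({
            "input": obs_seq[t],
            "actions": act_seq[t:t + horizon_s],
            "target": obs_seq[t + horizon_s],
        })
    return pairs
```

Each pair then becomes one fine-tuning example: the input and actions go in the prompt, and the target observation is the completion the model learns to produce.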
Hook: You don't grade a drawing with a spelling test. You need the right scoring tools.
The Concept (Multi-Dimensional Offline Evaluation Framework): A testing kit that fairly scores different parts of the prediction: economy, development, micro-entities, and macro-situation. How it works:
- Economy & Status: number errors (like SMAPE for minerals/gas).
- Development: queue correctness (F1) and progress accuracy (MAE%).
- Micro-Entities: match predicted vs. true units (by ID or nearby position) and check HP/energy errors.
- Macro-Situation: compare whole-map spatial distributions using a transport-like distance (AWD) that also penalizes missing or extra units.
Why it matters: Generic text metrics miss what really matters in games (numbers, positions, logic).
Anchor: If you predict 600 minerals instead of 200, or place marines all over the map, these metrics will catch it.
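Two of these scores are easy to sketch: SMAPE for a single scalar, and F1 over the set of predicted queue tasks. These follow the usual textbook definitions, which may differ in small details (averaging, clipping) from the paper's exact formulas.

```python
def smape(pred: float, true: float) -> float:
    """Symmetric mean absolute percentage error for one value, in [0, 2]."""
    denom = (abs(pred) + abs(true)) / 2
    return 0.0 if denom == 0 else abs(pred - true) / denom

def queue_f1(pred_tasks, true_tasks) -> float:
    """F1 between the sets of predicted and true queued tasks."""
    pred, true = set(pred_tasks), set(true_tasks)
    if not pred and not true:
        return 1.0  # both empty: perfect agreement
    tp = len(pred & true)
    return 2 * tp / (len(pred) + len(true))
```

The Anchor example above, predicting 600 minerals when the truth is 200, yields SMAPE of 1.0, i.e., a maximal-looking error that generic text-overlap metrics would barely notice.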
Hook: When writing an essay, you draft, read it aloud, then fix what sounds off.
The Concept (Generate→Simulate→Refine Decision Loop): A simple plan-imagine-fix pipeline that uses world-model foresight at inference time. How it works:
- Generate: The policy proposes actions.
- Simulate: The world model predicts the next observation.
- Refine: The policy revises the plan using the prediction.
Why it matters: It's lightweight, works with an LLM policy, and avoids heavy search while still adding foresight.
Anchor: If the predicted state shows "Supply unused: high," the refined plan may skip a depot and add a worker or a unit instead.
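The loop itself is tiny. Here it is with toy stand-ins for the LLM policy and StarWM; in the real system both are prompted language models, and the `foresight=` hint-passing convention is an assumption made for this sketch.

```python
def decide(policy, world_model, observation):
    """One Generate→Simulate→Refine step."""
    proposed = policy(observation)                    # Generate
    predicted = world_model(observation, proposed)    # Simulate (~t + 5s)
    return policy(observation, foresight=predicted)   # Refine with foresight

# Toy stand-ins: the policy defaults to a depot, but switches to a worker
# when the forecast shows minerals running dry while supply is plentiful.
def toy_policy(obs, foresight=None):
    if foresight and foresight["minerals"] < 100 and obs["supply_free"] > 8:
        return "Train SCV"
    return "Build Supply Depot"

def toy_world_model(obs, action):
    cost = {"Build Supply Depot": 100, "Train SCV": 50}[action]
    return {"minerals": obs["minerals"] - cost}
```

Note the policy is called twice per decision: once blind, once with the forecast. That second call is where foresight actually changes behavior.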
Before vs. After:
- Before: Policies tried to reason from the current snapshot and generic knowledge.
- After: Policies reason with a grounded "peek ahead," becoming more preemptive (fewer supply blocks), efficient (higher resource conversion), and careful (better kill-loss ratio).
Why it works (intuition):
- Factorization reduces cognitive load: the model applies the right sub-rules to the right parts.
- Short-horizon simulation captures near-deterministic engine rules (costs, progress, supply) and approximate combat attrition.
- Closing the loop turns predictions into action changes, so foresight directly improves behavior.
Building blocks:
- Five-part textual observation to separate dynamics.
- SC2-Dynamics-50k to teach action-conditioned futures.
- StarWM (LLM fine-tuned with LoRA) as the simulator.
- Multi-metric offline evaluation to verify learning of each sub-dynamic.
- Generate→Simulate→Refine loop to convert foresight into better online choices.
03 Methodology
At a high level: Current Observation (text) → Policy proposes actions → World model simulates 5s future → Policy refines actions with the prediction → Final actions to the game.
Step 0. Turn game state into five-part text
- What happens: The raw SC2 observation (numbers, unit lists, positions) is converted into five labeled text sections: Info, Queue, My Units, My Structures, Visible Hostiles.
- Why it exists: Different parts obey different rules (math, timers, movement, visibility). Keeping them separate makes learning each "mini-rule" easier.
- Example: Info shows "Minerals: 280 (+2575/min) | Gas: 124 (+223/min), Supply: 64/93". Queue shows "Barracks: Train Marine (80%)." My Units lists marines with positions and HP.
Step 1. Train the world model (offline)
- What happens:
- Build SC2-Dynamics-50k: for many moments t, record (observation at t, actions between t and t+5s, observation at t+5s).
- Fine-tune an LLM (Qwen3-8B with LoRA) to map (current observation + actions) → future observation.
- Use the same five-part text format for inputs and targets, so the model learns sub-dynamics by section.
- Why it exists: Zero-shot LLMs don't know SC2 engine rules; supervised examples teach the exact consequences of actions.
- Example: If the input says "Factory starts Siege Tank at +0.6s," the target 5s later should show "Factory: Train Siege Tank (14%)" with minerals/gas reduced and supply adjusted.
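That target can be sanity-checked with arithmetic. The ~32 game-second Siege Tank build time used below is an assumption about the game version, made only to show how the 14% figure arises:

```python
def progress_at_horizon(horizon_s: float, start_offset_s: float,
                        build_time_s: float) -> float:
    """Fraction complete at the end of the horizon for a task that starts
    start_offset_s seconds into the prediction window."""
    return max(0.0, (horizon_s - start_offset_s) / build_time_s)

# A tank started at +0.6s has built for 4.4s of a 5s window:
pct = round(progress_at_horizon(5.0, 0.6, 32.0) * 100)
```

Supervision targets like this are near-deterministic, which is why the short-horizon setting is learnable at all.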
Step 2. Evaluate the world model (offline)
- What happens: Use a multi-dimensional framework to measure how good predictions are where it matters.
- Economy & Status: Measure number accuracy with SMAPE (stable percent error) and F1 for sparse events (alerts, upgrades).
- Development: Queue F1 (did you predict the right tasks?) and Progress MAE% (how close is the percent complete?).
- Micro-Entities: Match predicted vs. true units (by same ID or close position if ID is unknown) and compute HP/energy errors.
- Macro-Situation: Use Augmented Wasserstein Distance (AWD) to compare entire spatial layouts and penalize missing/extra units.
- Why it exists: Generic text metrics can look good while being wrong on numbers or positions; these metrics reward true game understanding.
- Example: Predicting "Minerals: 230" vs. true "Minerals: 500" gets a big SMAPE error; scattering marines too far raises AWD.
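A toy version of the macro metric: greedily transport predicted unit positions onto true ones and charge a flat penalty for every unmatched unit (the "augmented" part). The paper's AWD is more principled than this greedy sketch, and the penalty value here is arbitrary; this only conveys the shape of the idea.

```python
import math

def toy_awd(pred_pts, true_pts, penalty=10.0):
    """Greedy transport-style distance between two 2D point sets,
    augmented with a per-unit penalty for missing or extra units."""
    pred, true = list(pred_pts), list(true_pts)
    total = 0.0
    while pred and true:
        # Repeatedly match the globally closest remaining pair.
        i, j = min(((a, b) for a in range(len(pred)) for b in range(len(true))),
                   key=lambda ab: math.dist(pred[ab[0]], true[ab[1]]))
        total += math.dist(pred[i], true[j])
        pred.pop(i)
        true.pop(j)
    return total + penalty * (len(pred) + len(true))
```

Hallucinating a defender that isn't there, or dropping a real unit, pays the penalty even when every matched unit is placed perfectly, which is exactly how conservative hallucinations hurt offline enemy-side scores.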
Step 3. Use the world model in a decision loop (online)
- What happens: The StarWM-Agent runs Generate→Simulate→Refine during live games.
- Generate: The policy (Qwen3-8B or 32B) reads the current state and proposes actions.
- Simulate: StarWM predicts the observation 5 seconds later if those actions execute.
- Refine: The policy re-reads its plan plus the predicted future and edits the actions (e.g., add a depot, cancel a risky attack, fill an idle queue).
- Execute: Send the refined actions to the game.
- Why it exists: To turn foresight into behavior changes right now, without heavy search like MCTS.
- Example: If the prediction shows "Supply unused stays high" and "minerals drop low," the agent will skip building a depot and train an SCV instead.
Step 4. Guardrails and prompts
- What happens: Carefully designed prompts teach the model how to apply SC2 rules during prediction and how the policy should use the simulated future.
- Why it exists: The model needs consistent instructions for timing, supply, build progress, visibility, and combat attrition. The policy needs checklists (supply, gas, idle structures, damage) when refining.
- Example: The world-model prompt explicitly says "Queued units consume supply immediately" and "Progress += (delta / build_time)."
Step 5. Secret sauce
- Structured factorization: Splitting the observation allows the LLM to "think differently" per section, which directly mirrors SC2's hybrid dynamics.
- Short-horizon, action-conditioned rollouts: Many near-term changes are deterministic (costs, supply, progress), so the model can learn them reliably.
- Lightweight planning: Generate→Simulate→Refine provides a big boost without the cost of search trees.
What breaks without each step?
- No factorized text: The model mixes unrelated rules (like HP with mineral math), learns slower, and predicts worse.
- No action conditioning: You can't answer "what if I do X?"; predictions become vague and less useful.
- No multi-dimensional metrics: You might "look" accurate in text but fail on critical numbers or map positions.
- No refinement loop: Even good predictions won't improve play if the policy never revises actions.
Concrete walk-through (with data):
- Input (t): "Minerals: 145, Supply unused: 18; Barracks building at 29%; no enemies seen."
- Policy proposes: "Build Supply Depot."
- Simulate (t+5s): "Minerals: 50; Supply unused still 18; Depot 23% complete."
- Refine: "Train SCV" instead of "Build Depot" because supply isn't the bottleneck; minerals are tight.
- Execute: Send "Train SCV."
Implementation notes:
- Model: Qwen3-8B with LoRA rank 8, lr 5e-5, 10 epochs on 8×H100.
- Dataset: TvT on Flat64; 50k train samples, ~6.7k val/test; 5s horizon with 1s stride.
- Matching thresholds: AWD penalty tuned to map size; micro matching allows ID or position-within-threshold.
04 Experiments & Results
The test: Two phases, offline prediction quality and online gameplay.
- Offline asks: "Given current observation + actions, can StarWM predict the future observation accurately?"
- Online asks: "Does using those predictions to refine actions actually win more games and play more cleanly?"
The competition (baselines):
- Static Bias: Just copy today's observation as tomorrow's guess (a tough baseline under short horizons and fog-of-war).
- Qwen3-8B (zero-shot): An LLM asked to predict futures without SC2-specific training.
- Qwen3-32B (zero-shot): A larger LLM trying the same zero-shot prediction.
Offline scoreboard (what the numbers mean):
- Economy (minerals/gas SMAPE): StarWM cuts error to 0.19/0.09 vs. 0.48/0.26 for zero-shot 32B, about a 60-65% reduction, like raising your grade from a shaky C to a solid A.
- Development (Queue F1, Progress MAE%): Queue F1 of 0.92 and progress error of 0.43% vs. >24% for baselines; this is like guessing build timers almost perfectly while others are minutes off.
- Micro-Entities (HP MAE%): Self/enemy HP errors drop (4.15%/7.90% vs. 5.11%/8.47%), showing learned combat attrition.
- Macro-Situation (AWD): Self-side AWD of 3.46 vs. 9.79 (32B) and 8.37 (Static), nearly 60% better; formations and moves are much more coherent.
Surprises and lessons:
- Zero-shot LLMs often fail to beat Static Bias. Moral: SC2 physics aren't in generic pretraining; you must fine-tune on action-conditioned trajectories.
- Enemy AWD under fog-of-war can look worse for StarWM than Static Bias. Why? Standing-still enemies (Static) can be "closer" over short spans; StarWM sometimes predicts plausible enemy motion or defenders ("conservative hallucinations"). Offline, that is a penalty; online, it can encourage caution that sometimes helps.
Online gameplay (vs. SC2 built-in AI on Hard/Harder/VeryHard):
- Win rate: StarWM-Agent (32B policy) boosts win rates by +30% (LV5), +15% (LV6), +30% (LV7) compared to the same policy without WM. Even StarWM-Agent (8B) gains consistently.
- Supply Block Rate: Big drops (e.g., from ~16-25% down to ~5-6% with 32B) show the agent plans supply preemptively.
- Resource Conversion Rate: Jumps to ~78-83%, meaning minerals/gas are turned into units/tech instead of sitting idle.
- Kill-Loss Ratio: Increases (~+21%), reflecting fewer bad engagements.
- Valid Action Rate: Climbs to ~82-86%, indicating fewer impossible or ill-timed commands after simulation-based checks.
Ablations (what really matters):
- Generate only: Baseline performance with many supply blocks and poor spending.
- Refine (self-reflection, no simulation): Big macro improvements but modest win gains; thinking longer helps, but it's still blind.
- Zero-shot WM simulate: Some gains; external "what-if" helps a bit, even if noisy.
- StarWM simulate (trained): Best across the board; accurate action-conditioned forecasts drive strong policy refinements.
Qualitative cases:
- Formations: StarWM preserves army structure in predictions; zero-shot models scatter units.
- Scouting: StarWM may "see" likely defenders near enemy bases; offline this is a false positive, but online it can trigger safer play.
- Macro choice: The classic exampleâskip early depot when supply isnât the bottleneck; train worker to keep economy flowing.
Bottom line: Accurate short-horizon, action-conditioned predictions are the difference between a cautious guess and a confident, foresightful edit. That is what moves the needle from reactive to reliably proactive play.
05 Discussion & Limitations
Limitations:
- Opponent modeling: Predicting enemy movement/production under fog-of-war with no explicit memory is underdetermined. StarWM sometimes "hallucinates" plausible defenders, hurting offline AWD even if it encourages safer play.
- Short horizon: A 5-second lookahead is great for supply, queues, and small skirmishes but won't capture long tech swings or multi-minute strategies.
- Domain scope: Training is on Terran vs. Terran on Flat64; extending to all races/maps is engineering-heavy and could require more data.
- Text latency and parsing: Text I/O is interpretable but can be slower than compact tensors; strict formatting helps but still adds overhead.
- Zero-think constraint: Online evaluations use /no_think; allowing more reasoning steps or memory might further boost results but also cost time.
Required resources:
- Data: SC2-Dynamics-50k or similar action-conditioned trajectory data for your setting.
- Models: An LLM backbone (e.g., Qwen3-8B) and LoRA fine-tuning.
- Compute: Multi-GPU training (the authors used 8×H100), plus inference budget for simulate-and-refine.
- Environment: SC2Arena (or equivalent) extended with a world-model module and prompts.
When NOT to use it:
- Fully observable, deterministic tasks where hand-coded rules or classical planners excel.
- Ultra-fast micro where frame-by-frame control matters more than 5-second foresight.
- Very long-horizon planning where tree search or hierarchical planners with memory are more appropriate.
- Severe latency budgets where simulate-and-refine would exceed real-time constraints.
Open questions:
- Temporal memory: Can adding history or an opponent-intent module improve enemy prediction without overfitting?
- Longer horizons: How to safely extend from 5s to multi-step chains without compounding errors?
- Multimodal inputs: Can raw spatial features (mini-map, screen) plus text factorization further improve kinematics and visibility?
- Uncertainty handling: How to represent and use multiple plausible futures (ensembles or stochastic decoders) during refinement?
- Generalization: How well does the approach transfer across races, maps, and metasâand what data is minimally needed?
06 Conclusion & Future Work
Three-sentence summary: This paper builds StarWM, an action-conditioned world model that predicts short-horizon future observations in StarCraft II under fog-of-war. By factorizing observations into five textual modules, training on SC2-Dynamics-50k, and evaluating with multi-dimensional metrics, the model learns economy, development, movement, and combat attrition well enough to guide decisions. Plugging StarWM into a simple Generate→Simulate→Refine loop yields consistent online gains: higher win rates, fewer supply blocks, better spending, and smarter fights.
Main achievement: Showing that a lightweight, learnable, action-conditioned simulator can reliably sharpen LLM policies in a complex, partially observable RTS by turning "what-if" imagination into concrete, improved actions.
Future directions: Add temporal memory and opponent-intent modeling for better enemy forecasts, explore multi-step simulations and multimodal inputs, and scale to more races, maps, and longer horizons with uncertainty-aware predictions.
Why remember this: It is a clean recipe (tidy the state, teach a short-horizon "peek," and feed that peek back into the policy) that turns reactive agents into foresightful planners. The pattern is simple, practical, and portable to many decision-making domains where a small, trustworthy crystal ball makes all the difference.
Practical Applications
- Train RTS game agents that spend efficiently and avoid supply or production bottlenecks.
- Add short-horizon simulators to LLM agents for safer online planning in partially observable settings.
- Use structured text factorization to teach models hybrid rules (timers, costs, movement) in other games.
- Adapt the Generate→Simulate→Refine loop for robotics tasks (e.g., predict object states after a grasp).
- Improve autonomous driving modules by simulating short-horizon, action-conditioned traffic scenes.
- Optimize warehouse operations by forecasting resource queues and adjusting schedules proactively.
- Enhance web agents' tool use by simulating the effects of actions (e.g., form submissions) before execution.
- Develop better evaluation metrics that check numbers, progress, and spatial layouts for simulators.
- Create curriculum datasets (like SC2-Dynamics-50k) for other complex domains with partial observability.
- Build coaching tools that show players "what if" outcomes for training and strategy refinement.