World Models for Policy Refinement in StarCraft II
Key Summary
- The paper builds StarWM, a "world model" that lets a StarCraft II agent imagine what will happen a few seconds after it takes an action.
- They turn messy game observations into neat text with five labeled parts (Info, Queue, My Units, My Structures, Visible Hostiles) so an LLM can learn different kinds of game rules more easily.
- They create SC2-Dynamics-50k, a 50k-example dataset that teaches the model to predict future observations conditioned on actions under fog-of-war.
- They design a careful offline test suite that checks economy numbers, building progress, unit health and positions, and whole-map consistency (using a transport-style distance metric).
- StarWM beats strong zero-shot LLM baselines by large margins offline, including about 60% better resource prediction and much better self-side spatial consistency.
- They plug StarWM into a Generate→Simulate→Refine loop so the policy proposes an action, simulates the outcome, then revises the plan using the prediction.
- Online games against the built-in AI show higher win rates (+30% on Hard and VeryHard with the 32B policy), fewer supply blocks, better spending, and smarter fights.
- Zero-shot LLMs alone aren't good forward simulators for SC2; training on action-conditioned data is crucial.
- Enemy prediction under fog-of-war is still hard; conservative predictions can "hallucinate" defenders, which hurts offline scores but sometimes helps online caution.
- This work shows how giving LLM agents a small "crystal ball" can make their choices steadier, safer, and more efficient.
Why This Research Matters
This work shows how to give AI agents a safe "peek" into the near future so they can fix mistakes before they happen. In games, that means fewer supply blocks, smoother spending, and smarter fights; in the real world, it points toward better planning for robots, vehicles, and logistics. The key is simple but powerful: learn short-horizon, action-conditioned rules and feed the forecast back into decision-making. Because the method is lightweight and interpretable (text-structured), it is practical to adopt and debug. The evaluation suite also teaches us how to judge predictions in ways that truly matter, like numbers and positions, not just word overlap. Overall, it is a reusable blueprint for turning reactive systems into foresightful ones.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're playing hide-and-seek on a huge playground with lots of toys to build, chores to do, and teammates to direct. You can't see behind walls, and everyone moves at once. If you could peek 5 seconds into the future before choosing your move, you'd make smarter choices.
The Concept (Partial Observability): In StarCraft II (SC2), players only see what their own units can see; this is called partial observability. How it works:
- The game hides enemy locations and plans behind "fog-of-war".
- You must guess what's happening off-screen using clues (what you've seen before, timings, typical strategies).
- You choose actions (build, move, attack) without full information.
Why it matters: Without handling partial observability well, an agent can overspend, walk into ambushes, or stall production because it guesses wrong.
Anchor: You send workers to mine, but you don't see the enemy army marching. If you could predict a few seconds ahead, you might build defense first instead of a fancy new building.
Hook: You know how good chess players imagine a few moves ahead to avoid blunders? They play "what if?" in their heads.
The Concept (World Model): A world model is an internal "imagination engine" that predicts what the game will look like in the near future if you take certain actions. How it works:
- Read the current observation (what you can see now).
- Consider a candidate action (e.g., Train Marine, Build Depot, Attack here).
- Predict how resources, queues, units, and visible enemies will look a few seconds later.
- Use that forecast to improve your decision before you actually commit.
Why it matters: Without a world model, the agent acts myopically: it reacts to the now and often makes wasteful or risky moves.
Anchor: If the world model predicts your minerals will drop to 50 and you'll still have plenty of unused supply, it advises "Train SCV" instead of "Build Supply Depot", saving you from a shortage.
The world before this paper: LLM-based SC2 agents made strides by summarizing long histories, adding external knowledge, and splitting strategy from tactics. But most efforts only polished the policy (the "what to do" generator). They didn't plug in a learnable, action-conditioned transition model, a piece that can say, "If you do X, the world will likely look like Y."
The problem: SC2's dynamics are tricky. There are different "mini-worlds" all entangled: economy (minerals/gas flow), development (build/research progress), micro-control (unit motion and health), and foggy enemy interactions. Predicting all this from only partial views is hard. Even if you could learn it, you still need a simple way to use those predictions to actually make better choices.
Failed attempts: Zero-shot LLMs (used as simulators without training on SC2 dynamics) don't learn game physics like "units in training already consume supply" or "buildings finish at a certain rate." Another classic approach, Defogger, extrapolated states without conditioning on your specific actions, so it couldn't answer "what if I do this exact command?"
The gap: A learnable, action-conditioned world model for SC2 under fog-of-war was missing, along with a representation that separates the different types of dynamics and a fair way to score predictions beyond generic text metrics.
Real stakes: Better foresight helps agents avoid supply blocks, spend resources efficiently, and dodge bad fights, just like a good team captain. For real-world parallels, such "world models" can help robots plan moves, cars predict traffic, or managers schedule resources more wisely.
Hook: Tidying your room is faster when you label bins: "books," "toys," "clothes," "trash."
The Concept (Structured Textual Observation Representation): The paper turns the messy SC2 observation into five labeled text sections: Info, Queue, My Units, My Structures, Visible Hostiles. How it works:
- Info tracks economy and status numbers.
- Queue tracks what's being built/researched and how far along it is.
- My Units handles moving pieces with health and positions.
- My Structures lists buildings and their states/add-ons.
- Visible Hostiles lists only what you can currently see.
Why it matters: Each section follows different rules (mathy flows, ticking progress, movement, combat). Splitting the view helps an LLM learn each rule type better and faster.
Anchor: It's like giving the model five mini-puzzles instead of one gigantic jumble, so it can solve each mini-puzzle with the right tricks.
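As a rough illustration, the five-part layout might be rendered like this. This is a hypothetical sketch: the section labels match the paper, but the field names, separators, and exact formatting are assumptions, not the paper's real template.

```python
def format_observation(obs: dict) -> str:
    """Render a raw observation dict as the five labeled text sections.
    Field names (minerals, supply_cap, ...) are illustrative assumptions."""
    info = obs["info"]
    sections = [
        f"Info: Minerals: {info['minerals']} | Gas: {info['gas']} | "
        f"Supply: {info['supply_used']}/{info['supply_cap']}",
        "Queue: " + ("; ".join(
            f"{q['building']}: {q['task']} ({q['progress']:.0%})"
            for q in obs["queue"]) or "empty"),
        "My Units: " + ("; ".join(
            f"{u['type']} at {u['pos']} HP {u['hp']}"
            for u in obs["units"]) or "none"),
        "My Structures: " + ("; ".join(
            f"{s['type']} ({s['state']})" for s in obs["structures"]) or "none"),
        "Visible Hostiles: " + ("; ".join(obs["hostiles"]) or "none"),
    ]
    return "\n".join(sections)
```

Each section stays on its own line, so the model (and a human debugging it) can see at a glance which "mini-puzzle" any given token belongs to.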
To teach the world model, the authors built SC2-Dynamics-50k, a large set of "before action → after 5 seconds" examples. They also crafted an offline evaluation suite that measures numbers, progress, unit attributes, and whole-map spatial consistency (using a transport-style distance). Finally, they plug the model into a simple loop: Generate (an action), Simulate (predict the future), Refine (fix the action). That is how this work gives SC2 agents a safe, small "peek" into the future and turns hunches into steadier choices.
02 Core Idea
Hook: Picture planning a bike ride: you check the weather, imagine the route, and maybe change your start time if rain is coming. That tiny mental simulation saves the day.
The Concept (Key Insight): Let the SC2 policy imagine short-term futures with an action-conditioned world model, then use those predictions to refine its actions. How it works:
- Propose: The policy suggests an action from the current observation.
- Predict: The world model simulates what the observation will look like after a few seconds if that action is taken.
- Polish: The policy revisits its action with that forecast in mind and improves it.
Why it matters: This turns reactive guesses into foresightful plans, reducing supply blocks, wasteful spending, and unlucky fights.
Anchor: If your "peek" says you'll be mineral-starved, you hold off on a depot and train an SCV instead.
Three analogies for the same idea:
- Cooking: You mix a batter (propose), peek through the oven window (predict), then lower the temperature if it's browning too fast (refine).
- Navigation: You pick a route (propose), check traffic predictions (predict), then switch roads if jams are ahead (refine).
- Sports: You call a play (propose), visualize the defenseâs likely shift (predict), then audible to a safer play (refine).
Hook: You know how different school subjects need different study methods? Math needs step-by-step logic; art needs creativity.
The Concept (Action-Conditioned Dynamics Model): This is the heart of the world model: it predicts how the observation changes because of your specific action. How it works:
- Read the current observation and the exact action sequence.
- Apply the right "mini-rules": add/subtract resources, tick build progress, move units, adjust HP from combat, update visibility.
- Output the predicted observation after a fixed short horizon (like 5 seconds).
Why it matters: Without conditioning on the chosen action, you can't do true "what-if" planning.
Anchor: If you "Train Marine" now, supply used should rise immediately (queued units consume supply), minerals drop by the cost, and the Barracks queue fills; those specifics depend on your exact action.
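That bookkeeping rule can be sketched in a few lines. The Marine's 50-mineral / 1-supply cost is the standard game value; treating the whole update as instantaneous, and the state layout itself, are simplifying assumptions.

```python
MARINE = {"minerals": 50, "gas": 0, "supply": 1}

def apply_train(state: dict, unit: dict = MARINE) -> dict:
    """Immediate effects of issuing a train command: the queued unit
    consumes resources and supply right away, and the producer's queue
    gains a task at 0% progress."""
    s = dict(state)
    s["minerals"] -= unit["minerals"]
    s["gas"] -= unit["gas"]
    s["supply_used"] += unit["supply"]
    s["queue"] = s.get("queue", []) + [("Marine", 0.0)]  # (task, progress)
    return s
```

A zero-shot LLM routinely gets exactly this wrong, e.g., it waits until the unit pops before charging supply, which is why action-conditioned training data matters.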
Hook: Sorting your backpack into folders (homework, notes, art) makes it easy to find what you need.
The Concept (Structured Textual Observation Representation): A neat five-part text layout (Info, Queue, My Units, My Structures, Visible Hostiles) that matches different game rules. How it works:
- Economy math goes in Info.
- Timers go in Queue.
- Motion and HP go in My Units.
- Building states and add-ons go in My Structures.
- What you can currently see of the enemy goes in Visible Hostiles.
Why it matters: The model can "switch mental gears" depending on the section, learning hybrid dynamics faster and better.
Anchor: The Queue section lets the model learn "progress = progress + (delta / build_time)" without getting distracted by unit coordinates.
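That Queue rule is literally a one-liner; the 25-second build time in the test below is only an illustrative value, not a claim about any specific unit.

```python
def tick_progress(progress: float, delta_s: float, build_time_s: float) -> float:
    """Queue mini-rule from the text: progress += delta / build_time,
    clamped to 100% once the task finishes."""
    return min(1.0, progress + delta_s / build_time_s)
```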
Hook: Practicing with good examples beats guessing from scratch.
The Concept (SC2-Dynamics-50k Dataset): A 50k-sample training set of "current observation + actions → observation in 5 seconds" pairs, all under fog-of-war. How it works:
- Collect many real trajectories.
- Slice them into input-output pairs using the five-part text format.
- Fine-tune an LLM (with LoRA) to predict the future observation.
Why it matters: Zero-shot LLMs don't know SC2's physics; task-specific data is the teacher that fixes this.
Anchor: Showing thousands of "Train Marine" examples with costs, supply, and progress teaches the model exactly how training affects future states.
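The slicing step might look like the following sketch. It assumes per-second observation and action logs; the paper's actual extraction pipeline is not specified here.

```python
def make_pairs(obs_seq, act_seq, horizon_s=5, stride_s=1):
    """Slice a per-second trajectory into supervised examples of
    (current observation, actions over the horizon, future observation),
    mirroring the 5s-horizon / 1s-stride setup described in the paper."""
    pairs = []
    for t in range(0, len(obs_seq) - horizon_s, stride_s):
        pairs.append({
            "input": obs_seq[t],
            "actions": act_seq[t:t + horizon_s],
            "target": obs_seq[t + horizon_s],
        })
    return pairs
```

Each pair then becomes one fine-tuning example: the input and actions go in the prompt, and the target observation is the completion the model learns to produce.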
Hook: You don't grade a drawing with a spelling test. You need the right scoring tools.
The Concept (Multi-Dimensional Offline Evaluation Framework): A testing kit that fairly scores different parts of the prediction: economy, development, micro-entities, and macro-situation. How it works:
- Economy & Status: number errors (like SMAPE for minerals/gas).
- Development: queue correctness (F1) and progress accuracy (MAE%).
- Micro-Entities: match predicted vs. true units (by ID or nearby position) and check HP/energy errors.
- Macro-Situation: compare whole-map spatial distributions using a transport-like distance (AWD) that also penalizes missing or extra units.
Why it matters: Generic text metrics miss what really matters in games (numbers, positions, logic).
Anchor: If you predict 600 minerals instead of 200, or place marines all over the map, these metrics will catch it.
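Two of these scores are easy to sketch: SMAPE for a single scalar, and F1 over the set of predicted queue tasks. These follow the usual textbook definitions, which may differ in small details (averaging, clipping) from the paper's exact formulas.

```python
def smape(pred: float, true: float) -> float:
    """Symmetric mean absolute percentage error for one value, in [0, 2]."""
    denom = (abs(pred) + abs(true)) / 2
    return 0.0 if denom == 0 else abs(pred - true) / denom

def queue_f1(pred_tasks, true_tasks) -> float:
    """F1 between the sets of predicted and true queued tasks."""
    pred, true = set(pred_tasks), set(true_tasks)
    if not pred and not true:
        return 1.0  # both empty: perfect agreement
    tp = len(pred & true)
    return 2 * tp / (len(pred) + len(true))
```

The Anchor example above, predicting 600 minerals when the truth is 200, yields SMAPE of 1.0, i.e., a maximal-looking error that generic text-overlap metrics would barely notice.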
Hook: When writing an essay, you draft, read it aloud, then fix what sounds off.
The Concept (Generate→Simulate→Refine Decision Loop): A simple plan-imagine-fix pipeline that uses world-model foresight at inference time. How it works:
- Generate: The policy proposes actions.
- Simulate: The world model predicts the next observation.
- Refine: The policy revises the plan using the prediction.
Why it matters: It's lightweight, works with an LLM policy, and avoids heavy search while still adding foresight.
Anchor: If the predicted state shows "Supply unused: high," the refined plan may skip a depot and add a worker or a unit instead.
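The loop itself is tiny. Here it is with toy stand-ins for the LLM policy and StarWM; in the real system both are prompted language models, and the `foresight=` hint-passing convention is an assumption made for this sketch.

```python
def decide(policy, world_model, observation):
    """One Generate→Simulate→Refine step."""
    proposed = policy(observation)                    # Generate
    predicted = world_model(observation, proposed)    # Simulate (~t + 5s)
    return policy(observation, foresight=predicted)   # Refine with foresight

# Toy stand-ins: the policy defaults to a depot, but switches to a worker
# when the forecast shows minerals running dry while supply is plentiful.
def toy_policy(obs, foresight=None):
    if foresight and foresight["minerals"] < 100 and obs["supply_free"] > 8:
        return "Train SCV"
    return "Build Supply Depot"

def toy_world_model(obs, action):
    cost = {"Build Supply Depot": 100, "Train SCV": 50}[action]
    return {"minerals": obs["minerals"] - cost}
```

Note the policy is called twice per decision: once blind, once with the forecast. That second call is where foresight actually changes behavior.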
Before vs. After:
- Before: Policies tried to reason from the current snapshot and generic knowledge.
- After: Policies reason with a grounded "peek ahead," becoming more preemptive (fewer supply blocks), efficient (higher resource conversion), and careful (better kill-loss ratio).
Why it works (intuition):
- Factorization reduces cognitive load: the model applies the right sub-rules to the right parts.
- Short-horizon simulation captures near-deterministic engine rules (costs, progress, supply) and approximate combat attrition.
- Closing the loop turns predictions into action changes, so foresight directly improves behavior.
Building blocks:
- Five-part textual observation to separate dynamics.
- SC2-Dynamics-50k to teach action-conditioned futures.
- StarWM (LLM fine-tuned with LoRA) as the simulator.
- Multi-metric offline evaluation to verify learning of each sub-dynamic.
- Generate→Simulate→Refine loop to convert foresight into better online choices.
03 Methodology
At a high level: Current Observation (text) → Policy proposes actions → World model simulates 5s future → Policy refines actions with the prediction → Final actions to the game.
Step 0. Turn game state into five-part text
- What happens: The raw SC2 observation (numbers, unit lists, positions) is converted into five labeled text sections: Info, Queue, My Units, My Structures, Visible Hostiles.
- Why it exists: Different parts obey different rules (math, timers, movement, visibility). Keeping them separate makes learning each "mini-rule" easier.
- Example: Info shows "Minerals: 280 (+2575/min) | Gas: 124 (+223/min), Supply: 64/93". Queue shows "Barracks: Train Marine (80%)." My Units lists marines with positions and HP.
Step 1. Train the world model (offline)
- What happens:
- Build SC2-Dynamics-50k: for many moments t, record (observation at t, actions between t and t+5s, observation at t+5s).
- Fine-tune an LLM (Qwen3-8B with LoRA) to map (current observation + actions) → future observation.
- Use the same five-part text format for inputs and targets, so the model learns sub-dynamics by section.
- Why it exists: Zero-shot LLMs don't know SC2 engine rules; supervised examples teach the exact consequences of actions.
- Example: If the input says "Factory starts Siege Tank at +0.6s," the target 5s later should show "Factory: Train Siege Tank (14%)" with minerals/gas reduced and supply adjusted.
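That target can be sanity-checked with arithmetic. The ~32 game-second Siege Tank build time used below is an assumption about the game version, made only to show how the 14% figure arises:

```python
def progress_at_horizon(horizon_s: float, start_offset_s: float,
                        build_time_s: float) -> float:
    """Fraction complete at the end of the horizon for a task that starts
    start_offset_s seconds into the prediction window."""
    return max(0.0, (horizon_s - start_offset_s) / build_time_s)

# A tank started at +0.6s has built for 4.4s of a 5s window:
pct = round(progress_at_horizon(5.0, 0.6, 32.0) * 100)
```

Supervision targets like this are near-deterministic, which is why the short-horizon setting is learnable at all.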
Step 2. Evaluate the world model (offline)
- What happens: Use a multi-dimensional framework to measure how good predictions are where it matters.
- Economy & Status: Measure number accuracy with SMAPE (stable percent error) and F1 for sparse events (alerts, upgrades).
- Development: Queue F1 (did you predict the right tasks?) and Progress MAE% (how close is the percent complete?).
- Micro-Entities: Match predicted vs. true units (by same ID or close position if ID is unknown) and compute HP/energy errors.
- Macro-Situation: Use Augmented Wasserstein Distance (AWD) to compare entire spatial layouts and penalize missing/extra units.
- Why it exists: Generic text metrics can look good while being wrong on numbers or positions; these metrics reward true game understanding.
- Example: Predicting "Minerals: 230" vs. true "Minerals: 500" gets a big SMAPE error; scattering marines too far raises AWD.
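A toy version of the macro metric: greedily transport predicted unit positions onto true ones and charge a flat penalty for every unmatched unit (the "augmented" part). The paper's AWD is more principled than this greedy sketch, and the penalty value here is arbitrary; this only conveys the shape of the idea.

```python
import math

def toy_awd(pred_pts, true_pts, penalty=10.0):
    """Greedy transport-style distance between two 2D point sets,
    augmented with a per-unit penalty for missing or extra units."""
    pred, true = list(pred_pts), list(true_pts)
    total = 0.0
    while pred and true:
        # Repeatedly match the globally closest remaining pair.
        i, j = min(((a, b) for a in range(len(pred)) for b in range(len(true))),
                   key=lambda ab: math.dist(pred[ab[0]], true[ab[1]]))
        total += math.dist(pred[i], true[j])
        pred.pop(i)
        true.pop(j)
    return total + penalty * (len(pred) + len(true))
```

Hallucinating a defender that isn't there, or dropping a real unit, pays the penalty even when every matched unit is placed perfectly, which is exactly how conservative hallucinations hurt offline enemy-side scores.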
Step 3. Use the world model in a decision loop (online)
- What happens: The StarWM-Agent runs Generate→Simulate→Refine during live games.
- Generate: The policy (Qwen3-8B or 32B) reads the current state and proposes actions.
- Simulate: StarWM predicts the observation 5 seconds later if those actions execute.
- Refine: The policy re-reads its plan plus the predicted future and edits the actions (e.g., add a depot, cancel a risky attack, fill an idle queue).
- Execute: Send the refined actions to the game.
- Why it exists: To turn foresight into behavior changes right now, without heavy search like MCTS.
- Example: If the prediction shows "Supply unused stays high" and "minerals drop low," the agent will skip building a depot and train an SCV instead.
Step 4. Guardrails and prompts
- What happens: Carefully designed prompts teach the model how to apply SC2 rules during prediction and how the policy should use the simulated future.
- Why it exists: The model needs consistent instructions for timing, supply, build progress, visibility, and combat attrition. The policy needs checklists (supply, gas, idle structures, damage) when refining.
- Example: The world-model prompt explicitly says "Queued units consume supply immediately" and "Progress += (delta / build_time)."
Step 5. Secret sauce
- Structured factorization: Splitting the observation allows the LLM to "think differently" per section, which directly mirrors SC2's hybrid dynamics.
- Short-horizon, action-conditioned rollouts: Many near-term changes are deterministic (costs, supply, progress), so the model can learn them reliably.
- Lightweight planning: Generate→Simulate→Refine provides a big boost without the cost of search trees.
What breaks without each step?
- No factorized text: The model mixes unrelated rules (like HP with mineral math), learns slower, and predicts worse.
- No action conditioning: You can't answer "what if I do X?"; predictions become vague and less useful.
- No multi-dimensional metrics: You might "look" accurate in text but fail on critical numbers or map positions.
- No refinement loop: Even good predictions won't improve play if the policy never revises actions.
Concrete walk-through (with data):
- Input (t): "Minerals: 145, Supply unused: 18; Barracks building at 29%; no enemies seen."
- Policy proposes: "Build Supply Depot."
- Simulate (t+5s): "Minerals: 50; Supply unused still 18; Depot 23% complete."
- Refine: "Train SCV" instead of "Build Depot" because supply isn't the bottleneck; minerals are tight.
- Execute: Send "Train SCV."
Implementation notes:
- Model: Qwen3-8B with LoRA rank 8, lr 5e-5, 10 epochs on 8×H100.
- Dataset: TvT on Flat64; 50k train samples, ~6.7k val/test; 5s horizon with 1s stride.
- Matching thresholds: AWD penalty tuned to map size; micro matching allows ID or position-within-threshold.
04 Experiments & Results
The test: Two phases, offline prediction quality and online gameplay.
- Offline asks: "Given current observation + actions, can StarWM predict the future observation accurately?"
- Online asks: "Does using those predictions to refine actions actually win more games and play more cleanly?"
The competition (baselines):
- Static Bias: Just copy today's observation as tomorrow's guess (a tough baseline under short horizons and fog-of-war).
- Qwen3-8B (zero-shot): An LLM asked to predict futures without SC2-specific training.
- Qwen3-32B (zero-shot): A larger LLM trying the same zero-shot prediction.
Offline scoreboard (what the numbers mean):
- Economy (minerals/gas SMAPE): StarWM cuts error to 0.19/0.09 vs. 0.48/0.26 for zero-shot 32B, about a 60-65% reduction, like raising your grade from a shaky C to a solid A.
- Development (Queue F1, Progress MAE%): Queue F1 of 0.92 and progress error of 0.43% vs. >24% for baselines; this is like guessing build timers almost perfectly while others are minutes off.
- Micro-Entities (HP MAE%): Self/enemy HP errors drop (4.15%/7.90% vs. 5.11%/8.47%), showing learned combat attrition.
- Macro-Situation (AWD): Self-side AWD of 3.46 vs. 9.79 (32B) and 8.37 (Static), nearly 60% better; formations and moves are much more coherent.
Surprises and lessons:
- Zero-shot LLMs often fail to beat Static Bias. Moral: SC2 physics aren't in generic pretraining; you must fine-tune on action-conditioned trajectories.
- Enemy AWD under fog-of-war can look worse for StarWM than Static Bias. Why? Standing-still enemies (Static) can be "closer" over short spans; StarWM sometimes predicts plausible enemy motion or defenders ("conservative hallucinations"). Offline, that is a penalty; online, it can encourage caution that sometimes helps.
Online gameplay (vs. SC2 built-in AI on Hard/Harder/VeryHard):
- Win rate: StarWM-Agent (32B policy) boosts win rates by +30% (LV5), +15% (LV6), +30% (LV7) compared to the same policy without WM. Even StarWM-Agent (8B) gains consistently.
- Supply Block Rate: Big drops (e.g., from ~16-25% down to ~5-6% with 32B) show the agent plans supply preemptively.
- Resource Conversion Rate: Jumps to ~78-83%, meaning minerals/gas are turned into units/tech instead of sitting idle.
- Kill-Loss Ratio: Increases (~+21%), reflecting fewer bad engagements.
- Valid Action Rate: Climbs to ~82-86%, indicating fewer impossible or ill-timed commands after simulation-based checks.
Ablations (what really matters):
- Generate only: Baseline performance with many supply blocks and poor spending.
- Refine (self-reflection, no simulation): Big macro improvements but modest win gains; thinking longer helps, but it's still blind.
- Zero-shot WM simulate: Some gains; external "what-if" helps a bit, even if noisy.
- StarWM simulate (trained): Best across the board; accurate action-conditioned forecasts drive strong policy refinements.
Qualitative cases:
- Formations: StarWM preserves army structure in predictions; zero-shot models scatter units.
- Scouting: StarWM may "see" likely defenders near enemy bases; offline this is a false positive, but online it can trigger safer play.
- Macro choice: The classic exampleâskip early depot when supply isnât the bottleneck; train worker to keep economy flowing.
Bottom line: Accurate short-horizon, action-conditioned predictions are the difference between a cautious guess and a confident, foresightful edit. That is what moves the needle from reactive to reliably proactive play.
05 Discussion & Limitations
Limitations:
- Opponent modeling: Predicting enemy movement/production under fog-of-war with no explicit memory is underdetermined. StarWM sometimes "hallucinates" plausible defenders, hurting offline AWD even if it encourages safer play.
- Short horizon: A 5-second lookahead is great for supply, queues, and small skirmishes but won't capture long tech swings or multi-minute strategies.
- Domain scope: Training is on Terran vs. Terran on Flat64; extending to all races/maps is engineering-heavy and could require more data.
- Text latency and parsing: Text I/O is interpretable but can be slower than compact tensors; strict formatting helps but still adds overhead.
- Zero-think constraint: Online evaluations use /no_think; allowing more reasoning steps or memory might further boost results but also cost time.
Required resources:
- Data: SC2-Dynamics-50k or similar action-conditioned trajectory data for your setting.
- Models: An LLM backbone (e.g., Qwen3-8B) and LoRA fine-tuning.
- Compute: Multi-GPU training (the authors used 8×H100), plus inference budget for simulate-and-refine.
- Environment: SC2Arena (or equivalent) extended with a world-model module and prompts.
When NOT to use it:
- Fully observable, deterministic tasks where hand-coded rules or classical planners excel.
- Ultra-fast micro where frame-by-frame control matters more than 5-second foresight.
- Very long-horizon planning where tree search or hierarchical planners with memory are more appropriate.
- Severe latency budgets where simulate-and-refine would exceed real-time constraints.
Open questions:
- Temporal memory: Can adding history or an opponent-intent module improve enemy prediction without overfitting?
- Longer horizons: How to safely extend from 5s to multi-step chains without compounding errors?
- Multimodal inputs: Can raw spatial features (mini-map, screen) plus text factorization further improve kinematics and visibility?
- Uncertainty handling: How to represent and use multiple plausible futures (ensembles or stochastic decoders) during refinement?
- Generalization: How well does the approach transfer across races, maps, and metasâand what data is minimally needed?
06 Conclusion & Future Work
Three-sentence summary: This paper builds StarWM, an action-conditioned world model that predicts short-horizon future observations in StarCraft II under fog-of-war. By factorizing observations into five textual modules, training on SC2-Dynamics-50k, and evaluating with multi-dimensional metrics, the model learns economy, development, movement, and combat attrition well enough to guide decisions. Plugging StarWM into a simple Generate→Simulate→Refine loop yields consistent online gains: higher win rates, fewer supply blocks, better spending, and smarter fights.
Main achievement: Showing that a lightweight, learnable, action-conditioned simulator can reliably sharpen LLM policies in a complex, partially observable RTS by turning "what-if" imagination into concrete, improved actions.
Future directions: Add temporal memory and opponent-intent modeling for better enemy forecasts, explore multi-step simulations and multimodal inputs, and scale to more races, maps, and longer horizons with uncertainty-aware predictions.
Why remember this: It is a clean recipe (tidy the state, teach a short-horizon "peek," and feed that peek back into the policy) that turns reactive agents into foresightful planners. The pattern is simple, practical, and portable to many decision-making domains where a small, trustworthy crystal ball makes all the difference.
Practical Applications
- Train RTS game agents that spend efficiently and avoid supply or production bottlenecks.
- Add short-horizon simulators to LLM agents for safer online planning in partially observable settings.
- Use structured text factorization to teach models hybrid rules (timers, costs, movement) in other games.
- Adapt the Generate→Simulate→Refine loop for robotics tasks (e.g., predict object states after a grasp).
- Improve autonomous driving modules by simulating short-horizon, action-conditioned traffic scenes.
- Optimize warehouse operations by forecasting resource queues and adjusting schedules proactively.
- Enhance web agents' tool use by simulating the effects of actions (e.g., form submissions) before execution.
- Develop better evaluation metrics that check numbers, progress, and spatial layouts for simulators.
- Create curriculum datasets (like SC2-Dynamics-50k) for other complex domains with partial observability.
- Build coaching tools that show players "what if" outcomes for training and strategy refinement.