
Dreaming in Code for Curriculum Learning in Open-Ended Worlds

Intermediate
Konstantinos Mitsides, Maxence Faldor, Antoine Cully · 2/9/2026
arXiv

Key Summary

  • Agents in vast, open-ended games often learn a little and then get stuck because the next good practice steps are missing.
  • This paper’s idea, Dreaming in Code (DiCode), asks a big AI model to write tiny pieces of game code that create just-right practice worlds for the agent.
  • Instead of random levels, DiCode builds a step-by-step curriculum: each new level is a small, learnable nudge beyond what the agent can already do.
  • DiCode uses a closed loop: it watches what the agent can and can’t do, then “dreams” code for the next level that targets those gaps.
  • All generated levels run on the real game engine, so physics and rules stay correct and useful for learning.
  • In the Craftax benchmark, DiCode improves mean score by about 16% over the best baseline and unlocks late-game combat where others get 0%.
  • The system learns instrumental skills (like crafting iron armor) earlier and better, which unlocks deeper progress (like reaching harder floors).
  • Ablations show the feedback loop matters: removing parent/performance context drops performance to near the baseline.
  • DiCode hints at a general recipe: use code-writing models to shape experience, not just to make decisions.
  • This approach could help robots, game AIs, and simulators learn complex skills by generating the right practice worlds on demand.

Why This Research Matters

In the real world, learning rarely happens from one giant challenge; it happens through many small, well-chosen steps. DiCode shows how to automatically build those steps by having models write code that the simulator can run right away. This makes training safer and more efficient for robots and other agents that must practice in complex, changing worlds. Instead of handcrafting lessons, teams can let DiCode discover and order them based on the agent’s progress. The approach reduces wasted training time on tasks that are too easy or impossible. Ultimately, it could speed up how we teach AI to handle long missions, layered goals, and rare but important events.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re learning to play a huge adventure video game with a massive map, tons of tools, and tricky monsters. At first you learn fast—walk, pick up items, swing a sword—but soon you’re stuck because you can’t find the right practice spots to get from “okay” to “awesome.”

🥬 Filling:

  • What it is (Open-ended learning): Open-ended learning is when an agent keeps exploring endlessly growing worlds and tries to get better forever.
  • How it works (before this paper): People trained agents in fixed or randomly varied environments. Early on, that teaches basics. But then progress slows because the next, specific, learnable challenges aren’t guaranteed to appear.
  • Why it matters: Without the right next steps, agents plateau—like practicing only easy drills and never learning to face tougher bosses.

🍞 Anchor: In big games like Craftax (a complex, Minecraft-like world), agents often master early survival but fail to reach tough floors or enemies because they never get the right sequence of practice tasks.

🍞 Hook: You know how teachers don’t just throw the final exam at you? They build a curriculum: worksheets today, a small quiz tomorrow, then a bigger project.

🥬 Filling (Curriculum learning):

  • What it is: Curriculum learning is ordering practice from easier to harder so each step is learnable.
  • How it works: Start with skills you almost can do, give focused practice, then gently raise difficulty.
  • Why it matters: Without a curriculum, practice can be too hard (you give up) or too easy (you get bored) and you stop improving.

🍞 Anchor: A math teacher gives you single-step equations before multi-step word problems. AI needs that too.

🍞 Hook: Think of a playground that reshapes itself to keep you challenged: new monkey bars appear just when you’re ready.

🥬 Filling (Unsupervised Environment Design, UED):

  • What it is: UED is an automatic way to create training environments that keep difficulty “just right.”
  • How it works: A “teacher” generator picks or builds levels; a “student” agent learns on them; the teacher adjusts based on how the student is doing.
  • Why it matters: It prevents long stalls when fixed levels stop teaching anything new.

🍞 Anchor: If you ace a level, UED makes it a bit harder; if you fail completely, it eases up.

🍞 Hook: Imagine writing small rules that change the game—like “put a ladder here” or “add one more zombie.”

🥬 Filling (Environments as code):

  • What it is: Instead of just changing random seeds, the environment is defined by executable code that sets the world layout, rules, and goals.
  • How it works: The code tells the engine where to place resources, how enemies attack, and what counts as success.
  • Why it matters: Code lets you build structured challenges (e.g., “craft iron armor, then descend a floor”) that simple parameter tweaks can’t express.

🍞 Anchor: It’s like a level designer’s script that can place items, set monster health, and define win conditions.

🍞 Hook: Picture a super-smart librarian who can write programs, not just paragraphs.

🥬 Filling (Foundation models, FMs):

  • What it is: FMs are very large models trained on lots of data that can write text and code.
  • How it works: You give them context (what the agent can do, what’s needed next), and they generate code for a new practice level.
  • Why it matters: They can invent diverse, precise training tasks quickly.

🍞 Anchor: You say, “The agent crafts iron armor but struggles with floor 2.” The FM writes code that sets up a fair practice to reach floor 2.

🍞 Hook: Think of a fitness coach who keeps you in your “challenge zone,” not too comfy, not too crushing.

🥬 Filling (Learnability score and zone of proximal development):

  • What it is: The learnability score p(1−p) prefers levels the agent sometimes succeeds at (success rate p) but hasn’t mastered; the score peaks at p = 0.5.
  • How it works: Pick and replay levels where success is around 50%—tough enough to learn from.
  • Why it matters: Staying near this zone makes steady progress.

🍞 Anchor: A piano teacher picks pieces you can almost play, so you improve fastest.
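The p(1−p) rule is small enough to write out. A minimal sketch in Python:

```python
def learnability(p: float) -> float:
    """Learnability score p * (1 - p): zero for impossible (p = 0)
    or mastered (p = 1) levels, maximal at p = 0.5."""
    return p * (1.0 - p)

# A level the agent solves about half the time scores highest:
scores = {p: round(learnability(p), 2) for p in (0.0, 0.1, 0.5, 0.9, 1.0)}
# {0.0: 0.0, 0.1: 0.09, 0.5: 0.25, 0.9: 0.09, 1.0: 0.0}
```

Note the symmetry: a 90% level and a 10% level score the same, because both offer weaker learning signal than the 50% sweet spot.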

The world before this paper: Agents in open-ended games used randomization, regret heuristics, or replay of past seeds. These often assumed small, smooth parameter spaces and produced isolated challenges. They didn’t build reliable, long sequences that unlock deep progress.

The problem: In combinatorially huge worlds, it’s hard to find and order learnable steps that depend on each other (like crafting armor before diving into tougher floors).

Failed attempts: Domain randomization spreads attention too thin; regret-based search can miss “almost-win” levels; prior code-generating systems found cool one-off tasks but didn’t chain them into a curriculum.

The gap: A closed-loop way to generate environment code that steadily raises the bar based on what the agent currently knows.

Real stakes: Better curricula matter for robots practicing safely in sim, game AIs that learn complex strategies, and any system that needs to build long-horizon skills without human-made lesson plans.

02Core Idea

🍞 Hook: You know how a coach watches your game, then designs tomorrow’s drill to fix exactly what tripped you up today?

🥬 Filling (Core insight):

  • What it is: Dreaming in Code (DiCode) uses a big code-writing AI to generate small, executable “practice worlds” that are one step beyond the agent’s current abilities.
  • How it works: It watches performance, selects a “parent” level that’s near-learnable, asks the model to mutate it into a slightly harder child level, compiles the code, and mixes that level into training with the real game.
  • Why it matters: Instead of hoping the right challenges appear, DiCode constructs them—keeping the agent in the sweet spot where learning accelerates.

🍞 Anchor: If the agent can craft iron armor but can’t reach floor 2, DiCode makes a level that starts with gear and gently tweaks enemy pressure and ladder unlocking so “reach floor 2” is learnable next.

Multiple analogies:

  1. Sports: Yesterday you dribbled fine but lost under pressure, so today’s drill adds one defender—just enough stress to improve.
  2. Video game designer: You beat Level 3 with a lot of healing nearby; the new Level 4 removes some healing and adds one smarter enemy.
  3. Music teacher: You mastered a song at 80 BPM, so the metronome moves to 88 BPM—not 140.

Before vs After:

  • Before: Systems changed knobs or found single tricky levels; agents often got stuck on deep goals (like late-game combat).
  • After: The system builds a staircase of code-defined practice tasks that target exactly what’s missing, leading agents to unlock long-horizon achievements.

Why it works (intuition behind the math):

  • Closed-loop targeting: The generator conditions on success rates, picking levels with mid-range success (high learnability), which statistically yield strong learning signals.
  • Structured mutations: Because levels are code, you can add/remove scaffolding (resources, gear), alter enemy pressure, or shift goals in precise, minimal ways.
  • Engine-grounded: All code runs in the real engine, so feedback is truthful—no physics glitches or fake wins.

Building blocks (each with a mini-sandwich):

  • 🍞 Hook: Like choosing which worksheet to build on next. 🥬 Filling (Archive and parent selection): Keep a graph of levels with their success rates; pick a “parent” with good learnability and no strong children yet, so you diversify progress paths. Without this, you might over-focus on one branch and miss other skills. 🍞 Anchor: If armor-crafting is at 60% and floor 2 is at 10%, pick a parent around 50–60% SR and push gently toward deeper exploration.

  • 🍞 Hook: Think of writing instructions for a level-builder. 🥬 Filling (Description → Code): First, the model writes a plain-language spec describing the new level (objective, map changes, mechanics tweaks). Then it writes Python that uses the simulator’s API to realize that spec. Without two-step prompting, code can drift away from the intended learning goal. 🍞 Anchor: “Start on floor 1 with iron gear provided; increase melee pressure; goal: descend to floor 2 and defeat a gnome melee mob.”

  • 🍞 Hook: Like checking your homework compiles before turning it in. 🥬 Filling (Compilation check): Run a quick rollout to catch syntax/runtime errors; keep only valid levels. Without this, training would break or waste time. 🍞 Anchor: If one candidate crashes, it’s dropped; others go into the training mix.

  • 🍞 Hook: A balanced diet beats only candy or only broccoli. 🥬 Filling (Training mix): Always train partly on the real target game, plus new and replayed generated levels. Without anchoring to the target, the agent might get great at drills but bad at the real exam. 🍞 Anchor: 20% of steps are in vanilla Craftax; the rest are split between fresh levels and the best old ones.

  • 🍞 Hook: Like a scoreboard that rewards finishing the drill. 🥬 Filling (Adaptive bonus for task success): When a generated level’s goal is met, add a bonus that scales with recent target performance so finishing stays worthwhile as you improve. Without scaling, bonuses could become too small to motivate or too large to distort. 🍞 Anchor: If your recent target score is 20, goal completion might give at least 40, keeping the focus on real progress.
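The adaptive bonus above can be sketched in a few lines. The 2× scaling factor and the small floor are assumptions chosen to match the example in the text (recent score 20 → bonus of at least 40), not the paper’s exact rule:

```python
def goal_bonus(recent_target_return: float, floor: float = 1.0) -> float:
    """Bonus for completing a generated level's goal, scaled with
    recent target-game performance so finishing stays worthwhile.
    The 2x factor and floor are illustrative assumptions."""
    return max(floor, 2.0 * recent_target_return)

goal_bonus(20.0)   # 40.0 -- grows as the agent's target score grows
goal_bonus(0.0)    # 1.0  -- a small floor keeps early goals rewarding
```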

03Methodology

At a high level: Input → [Pick a parent level from archive] → [FM writes description] → [FM writes executable code] → [Compile and validate] → [Mix into training with target and replayed levels] → Output: A steadily advancing curriculum that matches agent ability.
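The “pick a parent” step of this pipeline can be sketched as learnability-weighted sampling. The archive shape (a dict of level id → success rate) and the weighting rule are illustrative simplifications; DiCode’s actual selection also accounts for category tags and whether a parent already has strong children:

```python
import random

def learnability(p):
    """Score p * (1 - p): peaks at a 50% success rate."""
    return p * (1.0 - p)

def select_parent(archive, rng=random):
    """Sample a parent level id, weighted by the learnability of its
    success rate. `archive` maps level id -> success rate; this is a
    simplified stand-in for DiCode's full selection logic."""
    ids = list(archive)
    weights = [learnability(archive[i]) for i in ids]
    if sum(weights) == 0:          # all mastered or all impossible
        return rng.choice(ids)
    return rng.choices(ids, weights=weights, k=1)[0]

archive = {"craft_iron_armor": 0.55, "basics": 0.90, "floor2": 0.05}
# "craft_iron_armor" (p = 0.55) gets sampled far more often than the
# mastered "basics" or the near-impossible "floor2".
```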

Step-by-step with sandwiches:

  1. Archive and Parent Selection
  • 🍞 Hook: Like choosing the next math problem based on which ones you sometimes miss.
  • 🥬 Filling:
    • What happens: The system keeps a graph of all past generated levels, each with success rates and category tags. It computes a learnability score (highest near 50% success). It samples a parent level that’s learnable and whose children aren’t already strong.
    • Why this step exists: It prevents tunnel vision on one branch and keeps exploring diverse, promising directions. Without it, you might overfit to one path and ignore other critical skills.
    • Example: Parent Level 112 (craft iron armor at 55% SR) is picked over Level 40 (at 90%) and Level 9 (at 5%).
  • 🍞 Anchor: Picking mid-SR parents is like practicing the piano piece you can almost play, not the lullaby (too easy) or the concerto (too hard).
  2. Description Generation (Natural Language Spec)
  • 🍞 Hook: Before building, architects write a blueprint.
  • 🥬 Filling:
    • What happens: The FM reads: game rules, API docs, the parent’s description, and performance stats (for parent and target game). It writes a precise, unambiguous description: where to place blocks, which mobs to spawn, how to tweak mechanics, and what counts as success.
    • Why this step exists: Clear specs reduce coding mistakes and keep the design focused on the intended bottleneck. Without it, the code step might wander.
    • Example: “Objective: craft iron armor and descend to floor 1; World: start with wood/stone, fewer freebies; Mechanics: raise melee pressure slightly; Relevant Achievements: …; Completed Achievements: …”
  • 🍞 Anchor: The spec might say “monsters_killed_to_clear_level = 6” so the ladder unlock rule trains controlled combat before descending.
  3. Code Synthesis (Executable Python)
  • 🍞 Hook: Now turn the blueprint into the building.
  • 🥬 Filling:
    • What happens: The FM writes a Python class that inherits from BaseTask, sets TaskParams (enemy spawn, health, needs), generates the world layout with WorldBuilder, and defines the goal via relevant achievements.
    • Why this step exists: Code-level control enables structural changes (e.g., providing iron nearby, adjusting ladder unlocks). Without code, you’re stuck with shallow parameter tweaks.
    • Example: Place 3 coal blocks within 4–8 tiles, add 1 zombie at distance 6, start with a wood sword, and set ranged_spawn_multiplier = 0.2.
  • 🍞 Anchor: The code builds a fresh procedural map every episode but follows the same rules, so practice stays varied yet targeted.
  4. Compilation Check and Quick Rollout
  • 🍞 Hook: Test the toy before giving it to the class.
  • 🥬 Filling:
    • What happens: Generate extra candidates in parallel, compile them, and run a short simulated trajectory to catch crashes.
    • Why this step exists: It saves training time and keeps the curriculum stable. Without it, broken levels would waste compute or crash runs.
    • Example: If a level forgets to define relevant achievements, it’s rejected.
  • 🍞 Anchor: A handful of valid children move on; the rest are recycled.
  5. Training Batch Construction (Target + New + Replay)
  • 🍞 Hook: A good workout mixes familiar moves with new drills.
  • 🥬 Filling:
    • What happens: Every update uses 20% of steps on the unmodified target game to anchor reality. The rest splits between brand-new levels and high-priority old ones (selected by PLR with learnability and staleness).
    • Why this step exists: Anchoring prevents the agent from getting great only at drills. Replay maintains valuable past lessons and avoids forgetting. Without it, progress could drift or collapse.
    • Example: After adding 10 new levels, the system replays 155 archived ones, with higher chance for those around 50% SR or long-unseen.
  • 🍞 Anchor: The agent keeps seeing the real game, ensuring transfer.
  6. Adaptive Goal Bonus and Goal Conditioning
  • 🍞 Hook: Stickers for finishing the assignment, sized to stay motivating.
  • 🥬 Filling:
    • What happens: Completing a generated level’s goal gives a bonus that scales with the agent’s recent target-game return. The agent’s policy also gets a multi-hot vector saying which achievements define the current goal.
    • Why this step exists: It keeps finishing level goals worthwhile as the agent improves, and clarifies what “success” means now. Without scaling and conditioning, signals could be too weak or confusing.
    • Example: If last cycle’s target score was 20, the bonus might be at least 40.
  • 🍞 Anchor: This helps the agent prioritize finishing the practice task instead of farming easy rewards.
  7. Asynchronous Generation
  • 🍞 Hook: While you practice, your coach designs the next drill in the background.
  • 🥬 Filling:
    • What happens: RL training continues while the FM generates and validates new levels. Training only pauses if several cycles pass with no new valid levels.
    • Why this step exists: It keeps GPUs busy and training smooth. Without it, generation latency would slow everything.
    • Example: New batches arrive every ~2 iterations; otherwise replay + target continue.
  • 🍞 Anchor: The conveyor belt of fresh, learnable tasks rarely runs dry.
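Putting steps 2–3 together, a generated level might look like the sketch below. BaseTask, TaskParams, and WorldBuilder are the API names the paper mentions, but their real signatures aren’t reproduced here, so this uses minimal stand-in classes and omits world generation; the field names follow the examples in the text:

```python
from dataclasses import dataclass, field

# Stand-in for the simulator's TaskParams; these field names mirror
# the knobs described in the text but are otherwise assumptions.
@dataclass
class TaskParams:
    melee_spawn_multiplier: float = 1.0
    ranged_spawn_multiplier: float = 1.0
    monsters_killed_to_clear_level: int = 8

@dataclass
class DescendWithIronGear:
    """Generated level: start geared on floor 1 under raised melee
    pressure; success means descending to floor 2 (Gnomish Mines)."""
    params: TaskParams = field(default_factory=TaskParams)
    starting_items: tuple = ("iron_sword", "iron_armour")
    relevant_achievements: tuple = ("ENTER_GNOMISH_MINES",)

    def __post_init__(self):
        self.params.melee_spawn_multiplier = 1.5       # more melee pressure
        self.params.ranged_spawn_multiplier = 0.2      # fewer ranged enemies
        self.params.monsters_killed_to_clear_level = 6  # ladder unlock rule

level = DescendWithIronGear()
```

Because the goal is declared via `relevant_achievements`, the same multi-hot goal vector described in step 6 can be derived directly from the level definition.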

The secret sauce:

  • Closed-loop, code-level curriculum: The model doesn’t just make random tasks; it mutates working parents, guided by learnability and target-game gaps. Code lets small, surgical changes (add/remove scaffolding, tweak pressure, adjust goals) that create clean stepping stones. Engine grounding guarantees realism, so practice transfers to the real game.

04Experiments & Results

🍞 Hook: Imagine a tournament where every player gets tested on 1,024 surprise levels. Who trained smarter?

🥬 Filling:

  • The test: Evaluate mean episode return on a fixed set of 1,024 unseen Craftax worlds over training. Also break down success on key achievements (crafting, exploration, late-game combat).
  • The competition: Compare DiCode with strong baselines: PPO-GTrXL on default worlds (no curriculum), Domain Randomization (DR), Prioritized Level Replay (PLR), and Sampling for Learnability (SFL). All use the same PPO-GTrXL agent to keep it fair.
  • The scoreboard (with context):
    • Mean return: DiCode reaches 48.33 vs the best baseline at 41.54—about a 16% relative boost. That’s like turning a solid B into a clean A- across a giant pop quiz.
    • Instrumental milestones: On “Make Iron Armour,” DiCode hits 45% success vs 14% for the best baseline—equipping the agent to survive longer and push deeper.
    • Deeper exploration: DiCode enters the Gnomish Mines (floor 2) in 30% of episodes vs 9% for the best baseline—three times as often, opening the door to late-game.
    • Late-game combat: Baselines collapse to 0% on tough enemies (Gnome Warrior/Archer). DiCode reaches 11% and 9% respectively—non-zero footholds where others have none.
  • Surprising findings:
    • Teacher-like behavior: The FM often removes scaffolding (fewer freebies) once the agent is ready, and layers pressure (more melee, fewer ranged) to target exact weaknesses.
    • Zone holding: The average success rate across active training levels stabilizes around 0.5, suggesting the system self-balances difficulty near peak learnability.
  • Ablation (does the loop matter?): Removing parent/performance context (DiCode-OL) drops final score to 40.91—close to base PPO-GTrXL (41.54). So the gain isn’t “just generation,” it’s generation steered by feedback.

🍞 Anchor: In plain terms, DiCode built the right staircase. The agent climbed to places (late-game fights) that others never reached, because the steps were small, steady, and always there.
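The headline comparison is easy to verify with quick arithmetic on the reported mean returns:

```python
dicode, best_baseline = 48.33, 41.54   # mean returns from the paper
improvement = (dicode - best_baseline) / best_baseline
print(round(improvement * 100, 1))     # 16.3 -- the ~16% relative boost
```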

05Discussion & Limitations

🍞 Hook: Even great coaches have limits—time, tools, and what the rules allow.

🥬 Filling:

  • Limitations (be specific):
    • Engine-bounded creativity: Because all levels run on the fixed Craftax engine, the system can’t invent brand-new physics or totally new mechanics—only reconfigure what exists.
    • FM latency: Generating and validating code can be slow; DiCode’s wall-clock training took ~4–5× longer mainly due to model inference time.
    • No self-correction loop for broken code: The system drops invalid code instead of iteratively fixing it, potentially wasting some generation attempts.
    • Goal design trade-offs: The adaptive bonus helps, but could over- or under-incentivize if scaled poorly.
  • Required resources:
    • Compute: 1,024 parallel workers for simulation; GPUs for PPO-GTrXL; additional resources for large FM inference.
    • Software: The environment’s code API, replay buffer, and evaluation suite.
  • When NOT to use it:
    • Very small or simple tasks where fixed curricula already work well.
    • Domains without a stable, trusted simulator (engine grounding is key for transfer).
    • Ultra–real-time training loops that can’t tolerate FM latency, even asynchronously.
  • Open questions:
    • Automatic “which stepping-stone is useful?” detection beyond learnability—can we predict transfer to the target even better?
    • Stronger parent/child diversity control—how to ensure exploration of multiple curriculum branches without spreading too thin?
    • Multi-agent or multi-skill coordination—how to co-train interacting skills without negative interference?
    • Faster, smaller generators—can compact code models achieve similar curriculum quality at lower cost?

🍞 Anchor: Think of DiCode as a great lesson planner that still uses the existing classroom, textbooks, and bell schedule. It can rearrange lessons brilliantly, but it can’t rewrite the laws of physics—or stop the bell from ringing.

06Conclusion & Future Work

🍞 Hook: Picture a staircase that builds itself just one step ahead of your feet—always climbable, always leading upward.

🥬 Filling:

  • 3-sentence summary: DiCode asks a foundation model to write executable environment code that creates just-right practice worlds, guided by what the agent can currently do. These worlds run inside the real engine, so learning signals are accurate and transfer to the target game. In Craftax, this closed-loop curriculum yields a ~16% mean return boost and unlocks late-game skills where baselines stall.
  • Main achievement: Turning environment code generation into a feedback-driven curriculum tool that reliably bridges long-horizon gaps.
  • Future directions: Faster, tighter generation loops; richer transfer predictors; extending to robotics and multi-agent settings; smarter repair of broken code; and spanning multiple curriculum branches in parallel.
  • Why remember this: It flips the script—use large models not just to act, but to shape experience. When the world is too big to explore blindly, “dreaming in code” builds the path, one learnable step at a time.

Practical Applications

  • Robot training in simulation: Generate staged practice terrains and tasks that build manipulation, navigation, and safety skills.
  • Game AI development: Create targeted practice levels that help bots learn advanced strategies and late-game tactics.
  • Autonomous driving simulation: Write scenario code that gradually increases traffic density, weather complexity, and rare-event handling.
  • Disaster response drills: Produce code-defined emergencies (blocked routes, aftershocks) to train coordinated planning.
  • Warehouse automation: Scaffold picking, packing, and routing tasks from easy to complex layouts.
  • Education technology: Auto-generate coding or math practice sets that adapt to each student’s current skill frontier.
  • Healthcare simulators: Stage procedures from basic steps to complex, multi-step operations for trainee practice.
  • Cybersecurity training: Script escalating attack-defense simulations that focus on one new dependency per level.
  • A/B testing curricula: Rapidly prototype and compare training sequences to see which unlocks deeper competencies.
  • Research benchmarking: Use code-level tasks to probe specific subskills and measure transfer to target environments.
#open-ended learning · #unsupervised environment design · #curriculum learning · #programmatic environment generation · #foundation models · #code synthesis for environments · #learnability score · #prioritized level replay · #closed-loop curriculum · #Craftax benchmark · #long-horizon reinforcement learning · #teacher-like scaffolding · #zone of proximal development · #parent-offspring level mutation · #PPO-GTrXL