Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration? | How I Study AI

Intermediate
Pingyue Zhang, Zihan Huang, Yue Wang et al. Ā· 2/4/2026
arXiv

Key Summary

  • This paper asks a simple question with big consequences: can today’s AI models actively explore a new space and build a trustworthy internal map of it?
  • The authors introduce Theory of Space, a framework that measures how well an AI constructs, revises, and uses a ā€˜spatial belief’ through its own exploration.
  • They build paired text and vision worlds so we can separate ā€˜seeing’ problems from ā€˜thinking’ problems, and they probe the model’s internal map at every step.
  • Across many tasks, models do much better when passively given perfect exploration logs than when they must explore on their own (the Active–Passive Gap).
  • Programmed proxy explorers reach good coverage in about 9 steps; foundation models often take 14+ steps without better maps (inefficiency).
  • Vision is harder than text: models struggle especially with object orientations and keeping a stable global map over time (belief drift).
  • When the world changes, models—especially in vision—show belief inertia: they cling to outdated maps instead of rewriting them with new evidence.
  • A new ā€˜uncertainty map’ probe shows models that plan around what they still don’t know keep improving longer and make better decisions.
  • Human participants outperform models by a large margin, and with simple tools humans approach perfection, showing the headroom for progress.
  • The benchmark, code, and data are released to help the community build agents that explore efficiently, remember reliably, and revise correctly.

Why This Research Matters

Real robots, AR assistants, and autonomous systems must explore unfamiliar spaces safely and efficiently. This work shows that current foundation models struggle when they must choose their own views, keep a stable map, and revise it when the world changes—exactly what real deployment requires. By making internal maps explicit and grading uncertainty, we can diagnose whether failures come from seeing poorly, planning poorly, or forgetting. The benchmark gives the community a shared way to measure progress in building agents that are not only smart answerers but capable explorers. Better exploration, memory, and revision will directly improve home robots, search-and-rescue drones, and indoor navigation tools.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you move to a new school and have to find the library, gym, and cafeteria. If someone hands you a perfect map, that’s easy. But what if you must walk around, peek into rooms, and figure it out yourself? That’s harder—but it also teaches you more. 🄬 The Concept: Before this paper, most AI tests were like handing the AI a perfect map or a full photo album and asking questions. These are ā€˜passive’ tests: the AI doesn’t decide what to look at next; it just answers about what it’s given. How it works: 1) The AI is shown a fixed set of observations (text or images). 2) It answers questions about where things are or how to navigate. 3) It never has to choose its next move. Why it matters: Real robots and apps don’t get perfect maps; they must explore new places, decide what to look at, and remember correctly. šŸž Anchor: Think of a robot vacuum: it has to explore your house to clean it well. No human gives it a perfect plan—it has to learn the space.

šŸž Hook: You know how you build a mental picture of your house—where the bathroom is, which door sticks? 🄬 The Concept (Spatial Belief): A spatial belief is the AI’s internal idea of where everything is and how it all fits together, even for parts it hasn’t seen yet. How it works: 1) Start with guesses about the layout. 2) Move and look around for clues. 3) Update the belief to stay consistent with what’s been seen. 4) Use it to answer questions and plan paths. Why it matters: Without a good spatial belief, the AI gets lost, contradicts itself, or can’t plan smart next steps. šŸž Anchor: Your ā€˜mental map’ that lets you find the kitchen at night without turning on the lights is your spatial belief at work.

šŸž Hook: Imagine wearing a cardboard box with a small window: you can’t see the whole room at once. 🄬 The Concept (Partial Observability): Partial observability means the agent only sees a slice of the world at any time and must act to see more. How it works: 1) The agent looks. 2) It updates its map. 3) It chooses a next view to reduce uncertainty. Why it matters: If the agent can’t choose informative views, it stays confused and wastes time. šŸž Anchor: When you peek around corners in hide-and-seek, you’re actively fixing what you don’t know.

šŸž Hook: You know how you can picture your neighborhood like a bird’s-eye map? 🄬 The Concept (Cognitive Map): A cognitive map is a structured mental map of places, directions, and distances. How it works: 1) Record objects with coordinates and headings. 2) Keep pairs of relations (e.g., couch is northeast of table). 3) Keep it consistent as you add new facts. Why it matters: Without a coherent cognitive map, local observations don’t join into a stable big picture. šŸž Anchor: Drawing a floor plan after walking through a house is turning your memory into a cognitive map.

šŸž Hook: Sometimes you remember, ā€œGo past the blue mailbox, then turn right at the bakery.ā€ Other times you imagine a top-down map and shortcut. 🄬 The Concept (Route vs Survey Knowledge): Route belief is step-by-step, egocentric know-how; survey belief is a global, bird’s-eye understanding. How it works: 1) Route: focus on what you see as you move. 2) Survey: place everything on a shared map. 3) Convert back and forth to answer questions. Why it matters: If you only have route knowledge, you struggle with shortcuts; if you only have survey, you might miss stepwise cues. šŸž Anchor: Using landmarks to walk to class (route) vs. drawing the school map (survey).

šŸž Hook: Do you explore a new house just for fun, or only when told to find the TV? 🄬 The Concept (Task-agnostic Exploration): Task-agnostic exploration means the agent explores to build a general map, not to solve a single task right now. How it works: 1) Identify unknowns. 2) Choose views that shrink those unknowns fastest. 3) Stop when the map is accurate enough. Why it matters: If exploration is only tied to one task, the map may be brittle and not useful for other tasks. šŸž Anchor: Scouting a campsite thoroughly today helps you find anything quickly tomorrow.

šŸž Hook: If you think the gym is near the cafeteria but then you see it elsewhere, you must change your mind. 🄬 The Concept (Belief Revision): Belief revision is updating your internal map when the world changes or when you learn you were wrong. How it works: 1) Notice a conflict between your map and new evidence. 2) Revisit the area to confirm. 3) Overwrite old facts with new ones. Why it matters: Without revision, your map becomes stale and misleading. šŸž Anchor: When a store moves to a new mall wing, you stop going to the old spot.

šŸž Hook: If you always insist your keys are on the desk even after looking elsewhere, you’re stuck. 🄬 The Concept (Belief Inertia): Belief inertia is when the agent clings to old beliefs even as new observations say otherwise. How it works: 1) Old belief is strong. 2) New evidence disagrees. 3) Update fails or only shifts slightly toward truth. Why it matters: The agent keeps making wrong choices because it trusts the past too much. šŸž Anchor: Insisting a class is in the old room after the schedule changed.

Why this paper: The authors argue that to judge real spatial intelligence, we must test how an AI actively explores to build, probe, and revise its own cognitive map—not just how it answers from perfect inputs. They create a benchmark in both text and 3D vision worlds, force agents to pick actions (move, rotate, observe), and at every step ask them to externalize their internal map so we can measure what they truly believe, what’s uncertain, and how quickly their actions reduce that uncertainty.

02Core Idea

šŸž Hook: You know how a good detective doesn’t just wait for clues—they go find them, write down theories, cross them out when wrong, and keep a tidy case board? 🄬 The Concept (Theory of Space): The ā€˜Theory of Space’ framework measures an agent’s ability to actively construct, revise, and use an internal spatial belief by exploring. How it works: 1) Construct: pick smart actions to gather the most useful new views. 2) Probe: show your current map and what’s still unknown. 3) Revise: if the world changes, overwrite old beliefs. 4) Exploit: answer tasks using your belief. Why it matters: Without this, we only know if an agent can parrot answers from given views, not if it can be a capable explorer in the real world. šŸž Anchor: Like a scout mapping a forest: explore efficiently, keep a clean map, fix it when trails move, and then guide others.

Aha! Moment in one sentence: Don’t just grade the answers—grade the whole belief-building process by making the agent explore, reveal its map, track uncertainty, and prove it can fix mistakes when the world changes.

Three analogies for the same idea:

  1. Detective board: pin photos (observations), draw lines (relations), circle unknowns (uncertainty), and rewrite the board when the story changes (revision).
  2. Puzzle builder: pick pieces that reduce confusion fastest (information gain), place them on the big picture (cognitive map), and if a piece doesn’t fit, re-place it (revision) instead of forcing it (inertia).
  3. Science lab: form a hypothesis (map), design informative experiments (actions), report what you know and don’t (uncertainty), and update your theory when data disagree (revision).

šŸž Hook: You know how you can explain your plan better when you show your sketch? 🄬 The Concept (Belief Probing): Belief probing asks the agent to externalize its internal cognitive map as a structured ā€˜map JSON’ at each step. How it works: 1) After every observe, the agent outputs a global map (who is where, facing what). 2) It also outputs a local snapshot of the current view. 3) It marks which spots are still unobserved via an uncertainty probe. Why it matters: If we only look at final answers, we miss whether the agent actually built a strong map or guessed. Probing opens the black box. šŸž Anchor: Sharing a neat notebook during math shows not just the final answer but the reasoning steps.

šŸž Hook: When studying for a test, you don’t read random pages—you target gaps. 🄬 The Concept (Information Gain): Information gain is choosing the next view that most shrinks what you don’t know about the map. How it works: 1) Track how many positions each object could still be in. 2) Pick an action that rules out the most possibilities. 3) Prefer steps that improve knowledge quickly, not just add more photos. Why it matters: Without this, exploration is slow and redundant. šŸž Anchor: If you forgot only chapter 5, you read chapter 5, not the whole book again.

šŸž Hook: Sometimes a careful tour guide beats a curious but chaotic tourist. 🄬 The Concept (Proxy Agents): Proxy agents (SCOUT and STRATEGIST) are scripted explorers that produce strong, efficient exploration logs. How it works: 1) SCOUT rotates and scans at each spot to see everything quickly (good for vision). 2) STRATEGIST plans views to cut ambiguity fastest (good for text). 3) Models get these logs to test pure reasoning, separate from exploration skill. Why it matters: This separation shows whether low scores come from poor exploration choices or poor belief construction. šŸž Anchor: Following a museum docent vs. wandering randomly.

Before vs. After: Before, we mainly tested passive reasoning; high scores could hide poor exploration skills. After, we can see the Active–Passive Gap: when models must explore on their own, accuracy drops and steps bloat, exposing real-world weaknesses.

šŸž Hook: You know how houses get renovated? 🄬 The Concept (False Belief Paradigm): After initial mapping, the environment secretly changes; can the agent notice and fix its map? How it works: 1) Move or rotate some objects. 2) Ask the agent to re-explore, find what changed, and update the map. 3) Measure how often it keeps clinging to old locations (belief inertia). Why it matters: Real spaces change; robust agents must adapt. šŸž Anchor: Your favorite shop moves; good navigators stop going to the old spot.

Why it works (intuition):

  • Forcing explicit maps discourages hand-wavy answers and rewards consistent world models.
  • Tracking uncertainty makes exploration purposeful, not random.
  • Comparing active vs. passive decouples ā€˜can think’ from ā€˜can explore’.
  • False-belief tests surface memory rigidity (inertia) you’d never see from one-shot Q&A.

Building blocks, briefly introduced with sandwiches:

  • šŸž Hook: You sometimes remember where you are by noticing the window’s side. 🄬 The Concept (Self-tracking): The agent keeps track of its own position and facing over time. How it works: update your pose after every move/turn. Why it matters: If you forget yourself, the map collapses. šŸž Anchor: Counting steps in the dark to find the light switch.
  • šŸž Hook: New facts shouldn’t break old truths. 🄬 The Concept (Local↔Global Consistency): New local views must fit into the global map without contradictions. How it works: check that both maps agree on relations. Why it matters: Inconsistency causes map drift. šŸž Anchor: A new puzzle piece must match the already-finished picture.
  • šŸž Hook: Don’t backslide on finished homework. 🄬 The Concept (Stability): Once a fact is solid, don’t degrade it later. How it works: guard previously correct entries from needless overwriting. Why it matters: Belief drift tanks final accuracy. šŸž Anchor: Erasing correct answers makes your test worse.
  • šŸž Hook: Glasses help you see detail. 🄬 The Concept (Perception Quality): In vision, you must recognize objects and their facing to feed the map. How it works: map pixels to identities, positions, and orientations. Why it matters: Bad seeing breaks good thinking. šŸž Anchor: Misreading a sign takes you to the wrong store.

03Methodology

At a high level: Input (a new multi-room scene) → Active exploration loop (move, rotate, observe) while probing the internal belief → Stop when confident → Exploit the belief on spatial tasks → Score exploration efficiency, belief quality, and task success.

Step-by-step recipe with sandwiches and examples:

  1. Start in a partially observable world (text or 3D vision).
  • šŸž Hook: Walking into an unfamiliar house shows only what’s in front of you. 🄬 The Concept (Partial Observability): You only see a 90° slice of the room each time. How it works: the agent can Rotate(90/180/270), JumpTo(visible object), Observe(), and optionally Query(one object’s absolute coordinate). Why it matters: Planning the next best look is the heart of exploration. šŸž Anchor: You pivot to check corners you can’t see yet.
  • Example (text): Observe returns symbolic bins like ā€œchair: front-left, nearā€ and facing if visible. Example (vision): Observe returns an RGB image; a calibration image shows what near/mid/far and orientations look like.
  2. Maintain and probe the cognitive map every step.
  • šŸž Hook: Good note-taking helps you not forget. 🄬 The Concept (Cognitive Map Probing): After each Observe, the agent outputs a global map (allocentric) and a local map (current view) in a simple JSON. How it works: record positions (x, y) and facing (N/E/S/W). Keep pairwise relations consistent. Why it matters: We can measure not only answers, but how well the map is built. šŸž Anchor: Writing a tidy floor plan while touring a house.
  • What happens: The evaluator checks (a) positional accuracy, (b) directional relations, (c) facing correctness; also per-turn (d) perception, (e) self-tracking, (f) local↔global consistency, (g) stability.
  • What breaks without it: We couldn’t tell if a wrong answer came from seeing errors, map forgetting, or inconsistent integration.
  • Tiny example: If the agent first sees ā€˜sofa: front, mid’ then rotates and sees ā€˜lamp: front-right, near’, the global map should place sofa roughly north and lamp northeast, and keep that consistent later.
  3. Plan actions to maximize information gain, not just coverage.
  • šŸž Hook: If you only have time for three peeks, pick the most revealing ones. 🄬 The Concept (Information Gain-Guided Exploration): Prefer actions that remove the most uncertainty about object positions. How it works: track how many possible cells each object could still occupy; choose a view that reduces these possibilities the most. Why it matters: Random spinning wastes steps without improving the map. šŸž Anchor: Studying the chapters you don’t know yet boosts your score fastest.
  • Example (text): If two objects still could be many places, STRATEGIST picks angles that best disambiguate their relative directions. Without it: the agent might hop between doors and forget to finish scanning a room.
  4. Use proxy explorers to separate ā€˜explore’ from ā€˜reason’ skill.
  • šŸž Hook: To see if you can do math, a teacher lets you use a calculator only for arithmetic. 🄬 The Concept (Proxy Agents: SCOUT/STRATEGIST): Scripted paths provide strong observation logs so models only need to build the map, not choose actions. How it works: SCOUT rotates and scans everywhere fast in vision; STRATEGIST targets uncertainty in text using constraint propagation. Why it matters: This tells us whether failures come from exploration or from mapping/reasoning. šŸž Anchor: A guided museum tour eliminates getting lost, so you focus on understanding.
  • Example: On SCOUT logs (~9 steps), top models outperform their own active runs, proving exploration is a major bottleneck.
  5. Exploit the belief on downstream tasks (route and survey).
  • šŸž Hook: After mapping, can you navigate and answer spatial questions? 🄬 The Concept (Route vs Survey Tasks): Route tasks test egocentric reasoning (e.g., what you’d see after actions); survey tasks test global mapping (e.g., give object coordinates). How it works: open-ended questions like pairwise direction, action-to-view, perspective-taking, allocentric mapping, and mental rotation. Why it matters: A good map should power many tasks, not just one. šŸž Anchor: Using your house map to tell a friend directions and also to draw the floor plan.
  • Tiny example: ā€œFrom table to chair?ā€ → ā€œnortheast, mid.ā€ Or ā€œAfter Rotate(90) and JumpTo(lamp), where’s the sofa?ā€
  6. Test belief revision with the false-belief paradigm.
  • šŸž Hook: Stores move; can your map keep up? 🄬 The Concept (Belief Revision & Belief Inertia): After initial mapping, some objects move or rotate. The agent must re-explore, identify changes, and update. How it works: measure changed-object detection, steps to finish, correctness on changed items, and inertia (clinging to the old spot or orientation). Why it matters: Agents must be plastic—able to rewrite outdated facts. šŸž Anchor: Moving your dot on the school map when your homeroom changes.
  7. Measure what matters.
  • Exploration efficiency: steps taken vs. information gained over time. Why it matters: Faster certainty is better than slow coverage.
  • Belief quality: final map correctness and per-turn diagnostics (perception, self-tracking, stability, consistency, uncertainty modeling). Why it matters: Pinpoints if the problem is seeing, remembering, or integrating.
  • Belief exploitation success: accuracy on all tasks. Why it matters: A great map should answer questions well.
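The whole recipe (observe, update the belief, track what is still unknown, act to shrink it) can be condensed onto a toy one-dimensional world. Everything below is a stand-in illustration, not the benchmark's code:

```python
# Toy active-exploration loop: observe, update the cognitive map,
# maintain an explicit uncertainty set, and act to shrink it.
# The environment and agent here are stand-ins, not the paper's API.

class ToyEnv:
    """A corridor of cells, some holding an object; one cell visible at a time."""
    def __init__(self, layout, n_cells):
        self.layout, self.n_cells, self.pos = layout, n_cells, 0

    def observe(self):
        return self.pos, self.layout.get(self.pos)   # partial observation

    def act(self, move):
        self.pos = max(0, min(self.n_cells - 1, self.pos + move))

def explore(env, max_steps=10):
    belief = {}                            # cell -> object (the cognitive map)
    unknown = set(range(env.n_cells))      # explicit uncertainty map
    for _ in range(max_steps):
        cell, obj = env.observe()
        belief[cell] = obj                 # integrate the new observation
        unknown.discard(cell)
        if not unknown:                    # confident enough: stop exploring
            break
        target = min(unknown, key=lambda c: abs(c - cell))
        env.act(1 if target > cell else -1)  # step toward the nearest unknown
    return belief, unknown

belief, unknown = explore(ToyEnv({0: "door", 2: "lamp", 3: "sofa"}, n_cells=4))
print(belief, unknown)  # → {0: 'door', 1: None, 2: 'lamp', 3: 'sofa'} set()
```

Even this toy shows the paper's three gradable pieces in miniature: steps taken (exploration efficiency), the final `belief` dict (belief quality), and the shrinking `unknown` set (uncertainty modeling).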

The secret sauce:

  • Explicit cognitive-map probing turns a black-box answerer into a transparent explorer whose internal belief can be inspected, graded, and improved.
  • Parallel text vs. vision worlds isolate perception bottlenecks from reasoning bottlenecks.
  • The false-belief test catches belief inertia—stubbornness that standard benchmarks miss.

04Experiments & Results

The test: The authors built paired worlds (text and 3D vision) with procedurally generated multi-room layouts and named objects. Agents either explore actively (choose their own actions) or passively receive strong proxy logs. After exploration, they answer open-ended route and survey tasks. Throughout, agents are forced to output their internal map so we can evaluate belief correctness, stability, and uncertainty modeling.

The competition: State-of-the-art proprietary models (GPT-5.2, GEMINI-3 PRO, CLAUDE-4.5-SONNET) and open-source VLMs (GLM-4.6V, QWEN3-VL, INTERNVL-3.5) were tested. Humans were also evaluated. Two scripted proxy explorers (SCOUT for vision, STRATEGIST for text) set strong reference trajectories.

Scoreboard with context:

  • Active–Passive Gap: In the vision world, top models drop when moving from passive to active. For example, GPT-5.2’s overall passive accuracy is about 57% but falls to ~46% when it must explore itself—like going from a solid C+ to a shaky D. GEMINI-3 PRO falls from roughly 60% to 57%—a smaller drop but still real.
  • Inefficiency: SCOUT reaches target coverage in about 9 steps; foundation models often take 14+ steps and don’t produce better maps. It’s like running more laps but not getting fitter.
  • Modality gap: Text-world scores are much higher than vision-world scores across the board. Seeing well (recognizing objects and orientation) is the bigger challenge than reasoning symbolically.
  • Human results: Humans beat models strongly; with simple tools (like a protractor/compass for angles), humans approach perfect mapping and task accuracy, showing large headroom for AI.

Surprising findings and clear diagnostics:

  • Orientation pain: In vision, models especially struggle with object facing, which drags down perspective-taking tasks. Telling which way a chair points turns out to be oddly hard for today’s models.
  • Belief drift and instability: Models often perceive something correctly early on but later overwrite those facts, reducing final map correctness. Stability matters: erasing past correct answers hurts.
  • Externalization gap: Forcing models to explicitly output their cognitive map can slightly reduce downstream task accuracy compared to answering directly, suggesting the JSON map is a lossy snapshot of a richer internal state. Still, map correctness significantly correlates with task success, making it a strong diagnostic proxy.
  • Uncertainty that pays off: Models that keep track of unknown regions improve longer. GPT-5.2 ramps up information gain fast but plateaus early; GEMINI-3 PRO improves more steadily and continues to refine the map.
  • Scale hurts active more than passive: As rooms increase (e.g., to four rooms), active exploration costs jump and the active–passive gap widens, especially in vision, underscoring that planning informative views is the bottleneck.

Concrete comparisons:

  • Passive vs. active (vision): GPT-5.2 around 57% → 46%; GEMINI-3 PRO around 60% → 57%. That’s like dropping from a C+ to a D for one model, and from a B- to a C+ for the other.
  • Steps: Proxy SCOUT ā‰ˆ 9 steps vs. models ≄ 14 steps on average, with no accuracy gain—like taking the long way around town and still ending up confused.
  • Text vs. vision: In text, top models often achieve strong scores; in vision, the same models lag on orientation, perspective taking, and final map correctness.

Bottom line: When models explore by themselves, they’re slower, less thorough, and less accurate than when a program explores for them. The main culprits: perception in vision, lack of uncertainty-driven planning, and unstable belief updates that decay earlier correct facts.

05Discussion & Limitations

Limitations (be specific):

  • Vision perception is a major bottleneck: identifying objects and especially their facing direction is error-prone, which poisons the map early.
  • Exploration planning is weak: models don’t reliably pick the next most-informative view, leading to redundancy and missed areas (incomplete coverage).
  • Belief stability is fragile: previously correct entries are overwritten later (belief drift), lowering final correctness.
  • Belief inertia is common after environment changes: agents cling to obsolete coordinates or orientations, especially in vision.

Required resources:

  • A simulator (ThreeDWorld) with 3D assets (Objaverse) for vision scenes; or the symbolic text world.
  • Sufficient context length to hold exploration histories and maps.
  • Prompting budget for iterative probe-and-update cycles; optional tools (e.g., geometric helpers) improve human-like precision.

When not to use:

  • If your application has full observability (e.g., you always have a perfect floor plan), this benchmark’s partial-observability stresses may be overkill.
  • If your agent never needs to revise beliefs (static, unchanging layouts), the false-belief paradigm may not reflect your deployment.
  • If perception is out-of-scope (e.g., purely symbolic worlds), focus on the text setting; the vision setting may conflate your core goal with recognition.

Open questions:

  • How to teach models to choose next views by true information gain, not just coverage? Can lightweight planners or learned uncertainty estimators help?
  • How to stabilize belief updates so correct facts are protected while still allowing fast revisions when contradicted?
  • How to robustly perceive orientations from limited ego views (better prompts, fine-tuned heads, or explicit 3D priors)?
  • Can multi-agent exploration coordinate to share beliefs efficiently and reduce redundancy?
  • How to compress internal maps so externalization (JSON) matches latent quality without losing information?

06Conclusion & Future Work

Three-sentence summary: This paper introduces Theory of Space, a benchmark and framework that evaluates whether foundation models can actively explore under partial observability to construct, probe, and revise their internal spatial beliefs. By forcing explicit cognitive-map outputs, tracking uncertainty, and testing false-belief revision, it reveals a large Active–Passive Gap, major inefficiencies, vision perception bottlenecks (especially orientation), and belief instability/inertia. The result is a clear roadmap for building agents that explore efficiently, remember reliably, and update correctly as the world changes.

Main achievement: Turning spatial mapping from a hidden byproduct into a directly measurable competency—via task-agnostic exploration, explicit cognitive-map probing, and belief-revision tests—so we can truly diagnose and improve embodied spatial intelligence.

Future directions:

  • Add uncertainty-aware planners that choose actions for maximal information gain.
  • Improve vision orientation perception with better priors, multi-view fusion, or specialized heads.
  • Design memory mechanisms that lock in verified facts yet allow fast, principled overwrites on conflict.
  • Extend to multi-agent settings for shared and aligned spatial beliefs.

Why remember this: Because real-world agents must be curious cartographers, not just armchair answerers. Theory of Space brings exploration, belief, and revision to the forefront—so tomorrow’s robots, AR assistants, and autonomous systems can navigate new places with confidence, efficiency, and adaptability.

Practical Applications

  • Home robots that plan informative views to quickly map a new apartment and clean efficiently.
  • AR indoor navigation that explores and updates maps on the fly as stores move or furniture changes.
  • Warehouse robots that reduce scanning redundancy and keep stable, revisable shelf maps.
  • Search-and-rescue drones that explicitly track uncertainty and prioritize unexplored rooms.
  • Education tools that teach geometry and map skills by turning exploration into ā€˜information gain’ games.
  • Retail store re-mapping where agents detect layout changes and avoid belief inertia.
  • Elder-care assistants that maintain robust, up-to-date home maps despite frequent small changes.
  • Game AI that learns new levels efficiently and adapts when layouts shift between rounds.
  • Facility inspection bots that separate perception errors from mapping errors using cognitive-map probes.
#active exploration#cognitive map#spatial belief#belief probing#uncertainty modeling#belief revision#belief inertia#multimodal foundation models#vision-language models#partial observability#information gain#embodied AI#route vs survey knowledge#proxy agents#spatial reasoning