
AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

Intermediate
Lance Ying, Ryan Truong, Prafull Sharma et al. Ā· 2/19/2026
arXiv

Key Summary

  • The paper argues that the fairest way to check how generally smart an AI is is to see how quickly and well it learns many different human-made games, given the same time and practice as a person.
  • They define the Multiverse of Human Games as the huge, open-ended space of games people create and enjoy, because games compress many real-life skills.
  • They build AI GAMESTORE, a platform that automatically sources popular games, adapts them into simple web versions, refines them with human feedback, and standardizes them for testing.
  • A capability rubric (with skills like ā€œmemory,ā€ ā€œplanning,ā€ and ā€œworld-model learningā€) labels each game so we can diagnose which skills a model is missing.
  • They evaluated seven top vision-language models and 106 human players on 100 adapted games, each for two minutes of play.
  • Top AI models scored under 10% of the human median on most games and were 12–18 times slower to act because they needed time to ā€œthink.ā€
  • Models especially struggled when a game required memory, planning several moves ahead, or learning hidden rules by experimenting.
  • Performance dropped further as games demanded more distinct skills at once, showing that integrating abilities is a big challenge for today’s AIs.
  • AI GAMESTORE is designed to keep growing with new, human-like game variants so it stays hard to game and hard to overfit.
  • This provides a practical, open-ended way to measure (and motivate) progress toward human-like general intelligence.

Why This Research Matters

This work gives us a practical, honest way to see whether AI is truly becoming more human-like, not just good at a few memorized tricks. Because games compress real-world skills into fun, bite-sized challenges, doing well across many different human-made games is a strong signal of flexible intelligence. A living benchmark that keeps adding new, human-grounded tasks makes it harder for models to game the test and easier for us to spot real progress. Clear skill labels (like memory and planning) help researchers fix the right weaknesses instead of just chasing leaderboard points. Policymakers and companies get a clearer picture of what today’s AI can and cannot do under fair time limits. Over time, this can steer AI toward safer, more trustworthy systems that learn as quickly and robustly as people. It’s a north star for building AI that plays—and works—well in our human world.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine trying out for a school talent show where you only have to juggle one ball. If you do great, does that prove you’re the most talented student overall? Probably not—you’d also want to see singing, dancing, drawing, storytelling, and more.

🄬 The Concept (Benchmarks—why old ones fell short):

  • What it is: Benchmarks are tests that measure what an AI can do.
  • How it works: People build a list of tasks (like math problems, coding puzzles, or single games) and score models on them.
  • Why it matters: If the list is small or narrow, AIs can practice those exact tasks and look ā€œgreat,ā€ while still struggling in the wide world outside the test.

šŸž Anchor: An AI that aces a multiplication worksheet might still fail to split a restaurant bill—both use numbers, but one is a narrow drill and the other is real-life flexible thinking.

  1. The World Before: You know how different school subjects train different parts of your brain? For years, AI got tested on separate ā€œsubjectsā€: reading tests (language benchmarks), math quizzes, coding challenges, or single famous games like Chess or Go. These were impressive and useful, but each was like judging a whole student by one unit test. Many newer mega-benchmarks tried to combine lots of little tests into one place, but the tasks were still mostly short, fixed, and predictable.

  2. The Problem: AI is now good enough that we care about general intelligence—being flexible and adaptive like a person. But building one test that covers ā€œeverything a human might doā€ is impossible. And when we freeze a benchmark, AI teams can quietly train on it or on very similar tasks. Then scores shoot up, but we don’t know if the AI actually got more general or just studied the answer key.

šŸž Hook: You know how a video game gets boring if you memorize the level order and never see new levels? You stop learning, you just replay.

🄬 The Concept (Benchmark saturation):

  • What it is: Saturation is when a fixed test becomes too easy because players (or AIs) practice the exact items.
  • How it works: Over time, models tune themselves to the same questions or formats.
  • Why it matters: Scores go up even if real-world ability hasn’t.

šŸž Anchor: If the spelling bee always uses the same word list, the winner just memorizes it—no guarantee they can spell new words.

  3. Failed Attempts: People tried building larger collections of tasks and also synthetic games created by code. These helped a lot, but some synthetic tasks drifted away from the things humans actually do for fun or challenge. And real commercial games were hard to use directly because of licensing, different engines, privacy, reaction-speed needs, and training-data contamination (we can’t confirm whether a model saw those exact games during training).

šŸž Hook: Imagine testing a chef by making them cook foods that no one would ever eat. They might pass your test, but would they be a good real-world chef?

🄬 The Concept (Data contamination):

  • What it is: Contamination is when test items appear in a model’s training set.
  • How it works: If the AI saw the answers during training, test scores don’t reflect fresh reasoning.
  • Why it matters: You can’t trust the test to measure real generalization.

šŸž Anchor: If you studied with a leaked exam, your A+ doesn’t prove you understood the material.

  4. The Gap: We need an evaluation that is broad (covers many skills), human-grounded (comes from activities humans enjoy), renewable (keeps growing), and standardized (so models and humans can be compared fairly). That’s where the idea of testing across the ā€œMultiverse of Human Gamesā€ comes in: games designed by people, for people, because games compress real-life skills—planning, memory, learning rules, social thinking, and more—into small, engaging packages.

šŸž Hook: Think of games like tiny worlds that practice big-life skills—like Lego sets that teach building and creativity.

🄬 The Concept (Multiverse of Human Games):

  • What it is: The open-ended set of all games people invent and enjoy.
  • How it works: Sample many different kinds of human-made games and compare AI to humans with the same time or practice.
  • Why it matters: It measures flexible, human-like intelligence without getting stuck on one narrow trick.

šŸž Anchor: From Sudoku to platformers to team strategy games, if an AI keeps up with humans across the variety, that’s a strong sign of general smarts.

  1. Real Stakes Why should anyone care? Because we need to know whether AI can learn new tasks quickly, follow hidden rules safely, and integrate multiple skills—just like people must in daily life. Better evaluation helps builders aim for the right goals (like memory and planning), helps buyers and policymakers understand real capabilities and risks, and helps everyone track honest progress toward human-like intelligence. With a living, game-based benchmark, we can keep the test fresh, fair, and fun—pushing AI to be more reliably helpful in the real world.

02Core Idea

šŸž Hook: You know how a school field day has many stations—sprints, long jump, relay race—and the ā€œbest all-aroundā€ student is the one who does well across many, not just one?

🄬 The Concept (Aha!):

  • What it is: Measure AI’s general intelligence by how fast and how well it learns to play many human-made games under the same time limits as humans.
  • How it works: Build a never-ending stream of human-like games, label what skills each game demands, and compare AI vs. human progress fairly.
  • Why it matters: Without this, we can’t tell if an AI is just good at one trick or truly flexible like a person.

šŸž Anchor: If an AI keeps up with typical humans across puzzles, platformers, physics challenges, and hidden-rule games, it’s probably getting genuinely smart.

Multiple Analogies (3 ways to picture it)

  1. School decathlon: Instead of only doing spelling, a student has to do writing, math, science experiments, and public speaking. Doing well across the set shows general academic strength. AI GAMESTORE is a decathlon for thinking.
  2. Driver’s ed obstacle course: You don’t just drive straight; you park, merge, handle rain, and avoid cones. Passing all stations under time pressure shows you can drive in the wild. Games are the cones and corners of thinking.
  3. Cooking show mystery box: The chef must use surprise ingredients and techniques. If they can still produce a great dish quickly, they’re adaptable. Games are those surprise ingredients for AI.

Before vs. After

  • Before: We mostly judged AIs on narrow tests or static collections. Models sometimes looked great but didn’t generalize well. Over time, benchmarks got saturated.
  • After: With the AI GAMESTORE, the test keeps growing with new, human-inspired games and variants. It’s harder to ā€œstudy the answer key,ā€ easier to diagnose missing skills, and fairer to compare AI vs. humans under the same budget.

šŸž Hook: You know how learning a board game teaches more than the rules—you also learn to plan, remember, and spot patterns.

🄬 The Concept (Why it works—intuition, not equations):

  • What it is: Games are compact training grounds for real cognitive skills.
  • How it works: Each game spotlights different abilities—like memory, planning, learning hidden rules (ā€œworld modelsā€), precise timing, and social reasoning.
  • Why it matters: If an AI consistently improves on such varied mini-worlds as fast as people do, it shows the kind of flexible smarts we care about in life.

šŸž Anchor: Angry Birds trains physics intuition; Sudoku trains logical planning; a stealth game nudges you to track what others see and believe. Together, they cover a broad slice of human thinking.

Building Blocks (introduced with Sandwiches)

  1. šŸž Hook: Imagine a giant library of games—old, new, simple, wild—all created by people. 🄬 The Concept (Multiverse of Human Games):

    • What it is: The open-ended set of all human-designed, human-enjoyed games.
    • How it works: Sample across this space to get variety in skills and styles.
    • Why it matters: Variety prevents overfitting and reveals true flexibility. šŸž Anchor: From Go to mobile puzzlers to party deception games—if you can learn them fast, you’re probably adaptable.
  2. šŸž Hook: Think of a game store that never runs out of new titles. 🄬 The Concept (AI GAMESTORE):

    • What it is: A platform that sources, adapts, and continually grows a set of standardized human-like games for testing AI.
    • How it works: It gathers popular games, auto-builds web versions, and refines them with people.
    • Why it matters: It’s scalable, renewable, and hard to overfit. šŸž Anchor: Today 100 games; tomorrow 1,000—always with clear scoring and fair rules.
  3. šŸž Hook: You know how a coach gives tips that make your practice much better? 🄬 The Concept (Human-in-the-Loop):

    • What it is: People play early versions, give natural-language feedback, and help the AI fix and polish the game.
    • How it works: LLMs generate code; humans test; LLMs refine; repeat.
    • Why it matters: Keeps games fun, playable, and human-like, not weird or broken. šŸž Anchor: ā€œThe jump feels too floaty—tighten gravity.ā€ LLM updates code; game improves.
  4. šŸž Hook: Imagine putting skill stickers on each game box: memory, planning, physics. 🄬 The Concept (Cognitive Capability Demand Levels):

    • What it is: A rubric (0–5) labeling how much of each skill a game requires.
    • How it works: Experts annotate games for memory, planning, world-model learning, visual processing, timing, physics, and social reasoning.
    • Why it matters: Lets us diagnose exactly where models struggle. šŸž Anchor: If a model flops on games with Memory ≄ 3, we know to improve its memory systems.
  5. šŸž Hook: Picture a gamepad where every second the game pauses so you can pick the next moves. 🄬 The Concept (Evaluation Harness):

    • What it is: A standardized way for slow-responding models to play fast games fairly.
    • How it works: The game pauses each second; the model proposes five 0.2-second action chunks; then the game runs them.
    • Why it matters: Levels the field so we test decision quality, not typing speed. šŸž Anchor: It’s like turn-based mode for a real-time platformer so the AI can think.
  6. šŸž Hook: You know how grades mean more if we compare fairly across classes? 🄬 The Concept (Human-normalized scoring):

    • What it is: Scores are normalized to the median human score per game.
    • How it works: Each game’s human median = 100; models score relative to that.
    • Why it matters: Different games have different raw points; normalization makes them comparable. šŸž Anchor: Scoring 20 on a hard puzzle and 2,000 on an arcade shooter become apples-to-apples once scaled to humans.

03Methodology

At a high level: Popular game descriptions → [Suitability filtering] → [LLM code generation + auto-tests] → [Human-in-the-loop refinement + variants] → [Cognitive annotation] → [Standardized AI+human evaluation] → Scores normalized to humans and analyzed by skills.

Stage 1: Sourcing and Suitability Filtering šŸž Hook: Imagine picking field-day events that work well in your school gym—fun, short, and easy to judge. 🄬 The Concept (Suitability filtering):

  • What it is: Picking games that are practical to rebuild and score in minutes.
  • How it works: Start with popular Apple App Store and Steam titles; filter by ratings/reviews; use an LLM to check: (a) can it be played in a few minutes, (b) can it be built in p5.js/JS, (c) has a clear score, (d) doesn’t need obscure knowledge (like poker tells).
  • Why it matters: Ensures we can actually build and grade lots of games quickly. šŸž Anchor: From 7,500 candidates, keep 100 that are short, scorable, and JS-friendly.

Stage 2: Game Generation and Refinement šŸž Hook: Think of a robot chef making a dish from a recipe, then a human taste-tester giving notes. 🄬 The Concept (LLM-powered game building with human feedback):

  • What it is: LLMs write the game code; both bots and people test and fix it.
  • How it works (step by step):
    1. Prompt the LLM (e.g., Claude Sonnet) with the original game’s description and a design spec (must be keyboard-only, pausable, scorable, multi-level).
    2. Auto-generate basic tests (simulate key presses) to catch crashes and broken mechanics; let the LLM fix issues.
    3. Human plays the game, writes natural-language feedback (ā€œJump is too slow; make level 2 harder.ā€); LLM refines code.
    4. Repeat until it’s fun, stable, and faithful to the core idea.
  • Why it matters: Keeps games human-relevant and smooth, not just technically running. šŸž Anchor: ā€œMatch-3 cloneā€ becomes a clean, keyboard-only puzzle with 3 escalating levels and clear scoring.
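The four steps above amount to a generate-test-refine loop. Here is a minimal sketch in Python, with callables standing in for the LLM and the human tester; the function names and the toy demo are my own framing, not the paper’s API:

```python
def refine_game(spec, generate, auto_test, playtest, refine, max_rounds=5):
    """Sketch of Stage 2's generate-test-refine loop."""
    code = generate(spec)                 # LLM drafts the game from the spec
    for _ in range(max_rounds):
        issues = auto_test(code)          # e.g. simulated key presses
        if issues:
            code = refine(code, issues)   # LLM fixes crashes/broken mechanics
            continue
        feedback = playtest(code)         # human natural-language notes
        if not feedback:
            break                         # stable, fun, faithful: done
        code = refine(code, feedback)     # LLM polishes per feedback
    return code

# Toy demo: the "game" is a string and each refinement appends a fix.
result = refine_game(
    "runner",
    generate=lambda s: s + ":v1",
    auto_test=lambda c: [] if ":fixed" in c else ["crash"],
    playtest=lambda c: "",
    refine=lambda c, notes: c + ":fixed",
)
print(result)  # runner:v1:fixed
```

The key design point the sketch preserves is that automated tests gate crashes first, and human feedback only enters once the game at least runs.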

Variants Generation šŸž Hook: You know how teachers create harder versions of a worksheet to challenge you? 🄬 The Concept (Novel variants):

  • What it is: Humans suggest new twists; LLM implements them.
  • How it works: A player proposes mechanics (ā€œCat now chases you but only sees in a coneā€); LLM updates code; we get a fresh variant with different skill demands.
  • Why it matters: Explodes variety, reduces overfitting, and allows targeted skill stress-tests. šŸž Anchor: The same maze game spawns a stealth variant requiring more planning and social reasoning.

Stage 3: Game Annotation and Profiling šŸž Hook: Picture a nutrition label for thinking skills. 🄬 The Concept (Cognitive Capability Demand Levels):

  • What it is: A 0–5 label for each skill the game uses: Visual Processing (VP), Spatial-Temporal Coordination (ST), Memory (ME), Planning (PL), World Model Learning (WM), Physical Reasoning (PH), Social Reasoning (SO).
  • How it works: Three experts label each game, then discuss to agree.
  • Why it matters: Lets us map model failures to specific skills instead of just ā€œlow score.ā€ šŸž Anchor: A physics puzzler might be VP=2, ST=2, ME=1, PL=3, WM=1, PH=4, SO=0.
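One simple way to represent these labels in code (the class and helper below are my sketch; only the seven skill abbreviations come from the rubric):

```python
from dataclasses import dataclass

@dataclass
class CapabilityProfile:
    """0-5 demand level per skill, matching the rubric's seven dimensions."""
    VP: int  # Visual Processing
    ST: int  # Spatial-Temporal Coordination
    ME: int  # Memory
    PL: int  # Planning
    WM: int  # World Model Learning
    PH: int  # Physical Reasoning
    SO: int  # Social Reasoning

# The physics-puzzler example from the anchor above:
puzzler = CapabilityProfile(VP=2, ST=2, ME=1, PL=3, WM=1, PH=4, SO=0)

def demands(profile, skill, level):
    """Filter helper, e.g. select games with Memory >= 3."""
    return getattr(profile, skill) >= level

print(demands(puzzler, "PL", 3))  # True
print(demands(puzzler, "ME", 3))  # False
```

With profiles like this attached to every game, a low score on the subset where `demands(g, "ME", 3)` holds points directly at a memory weakness rather than a generic ā€œlow score.ā€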

Stage 4: Model and Human Evaluation šŸž Hook: Think of a fair race where everyone runs the same distance with the same stopwatch. 🄬 The Concept (Standardized play + harness):

  • What it is: Humans and AIs both play 120 seconds. For AIs, a harness pauses the game every second so the model can pick 5 short action chunks (each 0.2s).
  • How it works:
    1. Prompt includes game instructions, recent screenshots, and a ā€œscratchpadā€ (the model’s own notes from past steps).
    2. Model returns: next 5 action segments, updated scratchpad, and brief reasoning.
    3. Apply actions, advance the game, repeat.
  • Why it matters: Today’s models have latency; pausing avoids penalizing them for typing speed and tests decision quality. šŸž Anchor: A platformer runs in slow-mo for the model; a human just plays normally for the same 120 seconds.
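Putting the harness together, the control loop might look like the sketch below. The `game` and `model` interfaces are assumptions for illustration; only the 120-second budget, the per-second pause, and the five 0.2-second action chunks come from the text:

```python
def play_with_harness(game, model, total_seconds=120,
                      chunk=0.2, chunks_per_call=5):
    """Pause-per-second harness: each second of game time, the game pauses,
    the model proposes five 0.2 s action chunks, and the harness replays
    them. `game`/`model` interfaces are illustrative assumptions."""
    scratchpad = ""                                 # model's persistent notes
    for _ in range(int(total_seconds)):             # ~120 model calls per game
        obs = game.screenshot()
        actions, scratchpad = model.act(obs, scratchpad)
        for action in actions[:chunks_per_call]:
            game.step(action, duration=chunk)       # advance 0.2 s of game time
        if game.over():
            break
    return game.score()

class ToyGame:
    """Trivial stand-in: score increments on 'right' actions."""
    def __init__(self):
        self.steps, self.points = 0, 0
    def screenshot(self):
        return f"step={self.steps}"
    def step(self, action, duration):
        self.steps += 1
        if action == "right":
            self.points += 1
    def over(self):
        return self.steps >= 10   # toy: end after 10 chunks (2 s of game time)
    def score(self):
        return self.points

class ToyModel:
    def act(self, obs, scratchpad):
        return ["right"] * 5, scratchpad + obs + ";"

print(play_with_harness(ToyGame(), ToyModel()))  # 10
```

Note that game time only advances inside `game.step`, so however long the model deliberates between calls, it never loses in-game seconds—which is exactly how the harness tests decision quality rather than latency.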

Scoring and Normalization šŸž Hook: If one game’s max score is 2,000 and another’s is 20, how do we compare fairly? 🄬 The Concept (Human-normalized scoring and geometric mean):

  • What it is: Normalize each score by the median human score per game, then aggregate with a geometric mean across games.
  • How it works:
    • Normalized score formula: Normalized = clip(100 Ɨ Raw / Human Median, 1, 10000). Example: if Raw = 50 and Human Median = 200, then 100 Ɨ 50 / 200 = 25, which is inside the clip range, so the normalized score stays 25.
    • Geometric mean across games: G = (x₁ Ɨ xā‚‚ Ɨ ā‹Æ Ɨ xā‚™)^(1/n). Example: for scores x₁ = 10, xā‚‚ = 40, xā‚ƒ = 100, G = (10 Ɨ 40 Ɨ 100)^(1/3) = 40000^(1/3) ā‰ˆ 34.2.
  • Why it matters: Normalization makes different games comparable; geometric mean reduces the impact of a few giant wins or zeros and reflects steady across-the-board ability. šŸž Anchor: A model that’s 20 on a hard puzzle and 200 on an easy shooter might average near a typical ā€œacross-gamesā€ skill of about 63 (geometric), not overstate one lucky blowout.
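The normalization and aggregation formulas above can be checked with a few lines of Python (the function names here are mine):

```python
import math

def normalize(raw, human_median, floor=1, ceiling=10000):
    """Scale a raw game score so the human median maps to 100, then clip."""
    return min(max(100 * raw / human_median, floor), ceiling)

def geometric_mean(scores):
    """Aggregate normalized scores; a log-space mean avoids overflow."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

# Worked examples from the text:
print(normalize(50, 200))                        # 25.0
print(round(geometric_mean([10, 40, 100]), 1))   # 34.2
```

The floor of 1 matters for the geometric mean: a literal zero would drag the product (and thus the whole aggregate) to zero, so clipping keeps one catastrophic game from erasing everything else.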

The Secret Sauce

  • Human-grounded: Start from popular human games so tasks stay meaningful and enjoyable.
  • Renewable: Keep generating, refining, and spinning off variants so it resists overfitting.
  • Diagnosable: Skill labels turn raw scores into insight about memory, planning, world models, etc.
  • Fair to today’s models: The harness tests the thinking, not the API latency.

Concrete Mini Example

  • Source: A top-chart mobile runner.
  • Adapt: Keyboard-only (LEFT/RIGHT/JUMP), 3 levels, increasing obstacle density, score = coins collected.
  • Auto-test: Simulated key presses verify jumping works and coins increment score.
  • Human refine: ā€œLevel 2 too easy; add moving obstacles and coin clusters.ā€
  • Annotate: VP=2, ST=3, ME=1, PL=1, WM=1, PH=0, SO=0.
  • Evaluate: Humans hit a median of 150 coins; model raw score = 18. Normalized = 100 Ɨ 18 / 150 = 12. Aggregate across other games by geometric mean.

04Experiments & Results

The Test: What and Why

  • Who: 106 human players (2 minutes per game) and seven frontier vision-language models (VLMs): GPT‑5.2, GPT‑5‑MINI, Gemini‑2.5‑PRO, Gemini‑2.5‑FLASH, Claude‑Opus‑4.5, Qwen‑3‑VL‑32B, Llama‑4‑Maverick.
  • What: 100 AI GAMESTORE games adapted from popular titles; each has clear scoring and capability labels.
  • Why: Compare how fast and how well models learn and act under the same 120-second budget as humans, then diagnose skill gaps (memory, planning, world-model learning, etc.).

The Competition: Baselines and Setup

  • Humans: Play directly in real time.
  • Models: Use the pause-per-second harness; each call sees images, history, and a ā€œscratchpadā€ to maintain memory.
  • Budget: 120 seconds per game for both; models effectively make 120 calls (each proposes 5 short action chunks).
  • Scoring: Normalize to human median = 100 per game; aggregate using geometric mean to avoid domination by outliers.

The Scoreboard: Contextualized Results

  • Overall: Top models scored under 10% of the human median on average. Think of it as getting a 9 out of 100 where typical humans average 100—that’s like a D– when the class median is a solid B.
  • Speed: Humans finished in 120 seconds (by definition). Models, due to latency and ā€œthinking,ā€ often took 12–18 times longer wall-clock time—like taking a 2-minute quiz in 24–36 minutes.
  • Distribution: Model scores were bimodal. On about two-thirds of games, models made some progress (often 10–30% of human). On 30–40% of games, models barely moved the needle at all (near-zero relative to human median).

Skill-by-Skill Findings šŸž Hook: You know how some games feel easy until you hit a level that requires remembering a code or planning three moves ahead? 🄬 The Concept (World model learning, memory, planning):

  • What it is: Three skills where models struggled the most.
  • How it works:
    • Memory: Holding and using info from earlier steps.
    • Planning: Simulating several actions and outcomes in your head before you move.
    • World-model learning: Figuring out hidden rules by experimenting.
  • Why it matters: These are building blocks of human-like reasoning in open, messy worlds. šŸž Anchor: A model might handle simple matching, but freeze when it must remember a key location (memory), plan a sequence of moves (planning), or discover that pushing a weird switch flips gravity (world-model learning).

Integration Hurts (and That’s Important)

  • As the number of distinct skills a game demands goes up, model performance drops further. In other words, many models can do one thing okay in isolation, but mixing skills (like timing + memory + planning) is much harder—exactly the kind of integration humans routinely do.

Not Just About Reflexes

  • You might guess models are slow because games need lightning-fast reactions. But when the authors filtered to games with low timing demand (slow, puzzle/turn-based feel), the models still didn’t catch up much. This points to reasoning, memory, and learning gaps—not just reaction speed.

Trajectories Over Time

  • On public sample games, humans steadily improve their cumulative score across the 120 seconds. Models often show a quick bump (they find an easy pattern), then stall, or make no progress on some titles. Closing that ā€œkeep-improvingā€ gap is key for human-like learning.

Surprising/Notable Observations

  • Models sometimes outperformed median humans on a few very simple, fast-grind games where a repetitive, quick strategy racks up points—suggesting they can latch onto obvious short-term tactics when the environment is simple.
  • But when rules are ambiguous, levels require multi-step lookahead, or remembering earlier screens matters, models hit a wall—even with an explicit scratchpad.

Bottom Line in Plain Terms

  • Today’s best VLMs act like smart beginners who can pick up easy patterns but struggle to combine skills, remember over time, and infer hidden rules—especially under a tight learning budget where typical humans still shine.

05Discussion & Limitations

Limitations

  • Scope: The first release focuses on 100 relatively short, casual-style games. That’s a tiny sample of the Multiverse of Human Games, and it under-represents long-horizon narratives, deep social play, or complex multi-agent strategy.
  • NPC and Social Depth: Many games lack sophisticated opponents or teammates. Without agents that adapt or mentalize, social reasoning demands stay low.
  • Timing and Real-Time Play: The harness pauses games to accommodate model latency; that’s fair for thinking tests but not full ā€œhuman-speedā€ interaction. True real-time skill isn’t fully measured yet.
  • Generation Challenges: LLMs can write playable games, but level design is tricky. Human feedback helps, yet procedural level generation that’s reliably fun and fairly graded remains an R&D task.
  • Contamination Risk (mitigated, not erased): By adapting games and generating variants, contamination risk is reduced—but verifying model training data remains hard industry-wide.

Required Resources

  • People: Human annotators and crowd players for refinement and for building human baselines.
  • Compute: LLM calls for generation/refinement and for model evaluations (lots of API calls per game).
  • Infrastructure: A web portal for standardized play, logging scores and videos, and secure submission to private test sets.

When NOT to Use

  • Ultra-twitch esports or pro racing sims where sub-100 ms reactions are the main skill; the harness design favors thought-over-speed.
  • Tasks that hinge on deep, specialized world knowledge (e.g., poker metagame), which the filtering intentionally avoids for fairness and accessibility.
  • Safety certification for high-stakes deployment on its own; GAMESTORE is a capability probe and should be complemented with domain-specific safety tests.

Open Questions

  • Better Memory and Planning: What architectures or training curricula help models keep and use long, structured memories and plan deeply under time pressure?
  • Richer Social Worlds: How do we add adaptive NPCs and multi-agent play that require true theory-of-mind, yet keep tests standardized and fair?
  • Procedural Level Design: How can we automatically generate interesting, solvable levels with controllable difficulty and skill targeting?
  • Measurement Science: Can we adapt measurement layouts or ADeLe-style methods to estimate latent skills precisely from interactive behavior across many noisy tasks?
  • Real-Time, Human-Like Agents: What’s needed to integrate perception, memory, planning, and action so models can play smoothly without pauses—like people do?

06Conclusion & Future Work

Three-Sentence Summary

  • This paper proposes measuring AI’s general intelligence by how quickly and well it learns to play many human-designed games, compared fairly to humans under the same time budget.
  • It introduces AI GAMESTORE, a living platform that sources popular games, adapts them into standardized web versions, refines them with human feedback, labels their skill demands, and evaluates both humans and AI with human-normalized scores.
  • Early results on 100 games show top models scoring under 10% of the human median and struggling especially with memory, planning, and learning hidden rules, pointing the way to what must improve.

Main Achievement

  • Turning the ā€œtest AI on human gamesā€ vision into a practical, scalable, open-ended meta-benchmark that both measures progress and diagnoses missing cognitive abilities.

Future Directions

  • Expand into richer, longer, and more social games with adaptive NPCs and multi-agent play.
  • Build better procedural level generation with controllable difficulty and skill targeting.
  • Advance skill measurement methods to estimate latent abilities from interactive behavior, and push toward real-time, human-speed integrated agents.

Why Remember This

  • Games are tiny worlds that train big skills. A benchmark that never stops growing—grounded in the games people actually enjoy—can keep AI honest, motivate the right research, and bring us closer to safe, flexible, human-like intelligence.

Practical Applications

  • Track AI progress: Use the human-normalized leaderboard to see if new models actually learn faster and generalize better.
  • Diagnose weaknesses: Break down results by memory, planning, and world-model learning to guide research priorities.
  • Curriculum design: Build targeted game variants that train and test specific skills (e.g., long-horizon planning).
  • Model selection: Choose the right AI for an application by matching its strengths to the game-skill profile.
  • Safety pre-checks: Stress-test agents on hidden-rule and ambiguity-heavy games before deploying them to messy real-world tasks.
  • Skill transfer studies: Measure whether improvements on puzzle variants carry over to navigation or physics tasks.
  • Agent architectures: Prototype scratchpad, memory modules, or planners using the harness and compare versions fairly.
  • Education and training: Create game bundles that teach and assess human students’ reasoning with instant feedback.
  • Accessible benchmarking: Standardized, browser-based games lower barriers for labs, classrooms, and hobbyists to evaluate models.
  • Continual eval pipelines: Automatically add new human-inspired games to keep internal QA tests fresh and robust against overfitting.
Tags: general intelligence Ā· evaluation benchmark Ā· game-based testing Ā· vision-language models Ā· human-in-the-loop Ā· world model learning Ā· planning Ā· memory Ā· procedural generation Ā· standardization Ā· normalized scoring Ā· geometric mean Ā· agent evaluation Ā· interactive environments Ā· capability diagnostics