EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
Key Summary
- EcoGym is a new open test playground where AI agents run small businesses over many days to see if they can plan well for the long term.
- It has three worlds: Vending (store manager), Freelance (gig worker), and Operation (social app operator).
- Agents get only a few simple actions each day but must survive and grow over hundreds to thousands of steps.
- Scores are real-world style: Net Worth (Vending), Income (Freelance), and DAU (daily users in Operation).
- No single AI model won in all worlds: Gemini-3-Pro led Vending, GPT-5-Mini led Freelance, and Claude-Sonnet-4.5 led Operation.
- EcoGym hides some rules (like seasons or burnout) so agents must explore, learn, and adjust, not just follow instructions.
- Long context windows and extra memory sometimes help, but not always, and different models prefer different memory styles.
- Letting models "think with actions" (keeping a running reasoning log) boosted stability and performance.
- Vending showed high randomness across runs, while Freelance and Operation were steadier.
- Top models matched or beat human experts in Operation DAU, showing super-human long-horizon potential in that setting.
Why This Research Matters
EcoGym pushes AI beyond quick tricks and into the messy, day-to-day reality of running things over time. It measures success in everyday terms (money, well-being, and active users), so results relate to real decisions. By hiding some rules and adding randomness, it rewards agents that explore, learn, and adapt like smart teammates. It also shows where today's models stumble (strategy vs. execution) and which tools (memory, thinking logs) actually help. This can guide better products, safer automation, and fairer comparisons across models. If we want AI to help manage stores, gigs, or platforms tomorrow, we need tests like EcoGym today.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine running a lemonade stand all summer. You can't just think about today's cups; you must plan for heat waves, lemons running out, and saving money to buy more supplies tomorrow.
The Concept: Long-horizon planning means making smart choices today that still look smart many days (or months) later. How it works:
- You set a big goal (like growing savings).
- You pick small daily actions (buy, price, rest, promote).
- You watch what happens.
- You adjust the plan and keep going, again and again.
Why it matters: Without it, an agent might do great on one day and then crash later, like selling out today but having no money to restock tomorrow.
Anchor: A smart lemonade seller might sell a little cheaper on hot days to bring in many customers, then use the extra money to buy more lemons for the week.
Before this paper, many AI tests were short and simple, like a single homework problem, a one-time webpage click, or a quick puzzle. These are good for checking basic reasoning, but they don't show whether an AI can stay steady and smart over long stretches, where yesterday's choice shapes tomorrow's options. Some newer tests moved closer to real work (like coding tasks or research sprints), but they were still time-limited or focused on one narrow game-like world.
Hook: You know how in a busy marketplace, what one shop does can change what others do? Prices, customers, and supplies all affect each other.
The Concept: Interactive economies are worlds where many parts (buyers, sellers, creators, users) bump into each other, and each action changes the next day's starting point. How it works:
- The agent acts (sets prices, picks jobs, runs promos).
- The market reacts (demand goes up/down, tasks appear/expire, users join/leave).
- The agent sees results and acts again.
Why it matters: Without interaction, tests feel like video replays: safe, predictable, and too easy.
Anchor: If a snack shop raises prices too high, fewer people buy tomorrow; if it runs out of stock, tomorrow's customers might shop elsewhere.
The problem researchers faced: existing benchmarks often (1) end quickly, (2) stick to one tiny domain, or (3) score success in points that don't look like real-world outcomes. That makes it hard to tell if an AI can run a business over time, where cash, energy, and customers all must be managed together.
Hook: Think of a report card that works for every school subject, not just math.
The Concept: A generalizable benchmark is a fair, reusable test that works across different worlds, so we can compare agents apples-to-apples. How it works:
- Use the same rules for actions and feedback.
- Keep the action set small but meaningful.
- Measure real outcomes (money, users) over many days.
Why it matters: Without this, you can't tell whether a model is truly good or just overfitted to a single game.
Anchor: If two students both take the same style of exam in science, history, and math, you can trust their report cards more.
What was missing: a single, open, long-running testbed where agents must plan, execute, and adapt in economic-style worlds, using just a few tools per day, while chasing business outcomes. EcoGym fills that gap with three linked but different environments (Vending, Freelance, Operation), shared interfaces, and business-grounded goals.
Hook: Playing a mystery game is harder if some clues are hidden behind a curtain.
The Concept: Partial observability means the agent can't see all the rules or the full state of the world. How it works:
- The agent sees reports and hints (sales, stress, DAU).
- Some mechanics stay hidden (seasonality, burnout limits).
- The agent must infer patterns from feedback.
Why it matters: Without hidden parts, the test becomes a memorization quiz, not real decision-making.
Anchor: A store owner never knows the exact number of customers tomorrow, but learns patterns like "weekends sell more snacks."
Hook: Rolling dice means you can't predict every outcome.
The Concept: Stochasticity is built-in randomness that makes the world a little noisy. How it works:
- Two good choices can still lead to slightly different sales.
- Models must aim for strategies that stay good on average.
- We repeat runs or average results to judge fairly.
Why it matters: Without randomness, agents might "overfit" to a scripted world and fall apart in reality.
Anchor: Even with the same lemonade price, a sudden rain shower can change today's sales.
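To make the averaging idea concrete, here is a minimal sketch in Python. The demand curve, noise level, and all coefficients are invented for illustration; EcoGym's actual parameters are hidden from the agent by design.

```python
import random

def daily_sales(price, rng):
    """Toy demand: a higher price lowers expected sales, and Gaussian noise
    adds day-to-day swing. Coefficients are illustrative, not EcoGym's."""
    expected = max(0.0, 100 - 30 * price)  # simple linear demand
    noise = rng.gauss(0, 10)               # weather, mood, luck
    return max(0.0, expected + noise)

rng = random.Random(0)
# The same price on two different days can yield different sales...
day1 = daily_sales(2.0, rng)
day2 = daily_sales(2.0, rng)
# ...so a strategy is judged by its average over many repetitions.
average = sum(daily_sales(2.0, rng) for _ in range(1000)) / 1000
```

Averaging over repeated runs is exactly why the paper repeats Vending trials: any single run can be lucky or unlucky.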
The real stakes: In daily life, long-term planning is everywhere: saving allowance, balancing homework and play, or keeping a club active. For businesses, it's even louder: run out of cash, burn out your workers, or annoy users, and you're done. EcoGym's outcomes (Net Worth, Income, DAU) map to things people and companies truly care about. That's why this work matters: it pushes AI toward being reliable teammates over weeks and months, not just clever sprinters for a minute.
02 Core Idea
Hook: Imagine three theme parks that share the same ticket gate and rules, but inside each park, the rides behave differently and keep changing.
The Concept (Aha! in one sentence): EcoGym is an open, long-running, three-world benchmark with simple daily actions but hidden, shifting economics, built to test if AI agents can plan, execute, and adapt for the long haul. How it works (like a recipe):
- Keep the action menu tiny (4–5 tools) to focus on planning, not button-mashing.
- Let time run long (hundreds to 1000+ steps) so yesterdayâs choices shape tomorrow.
- Hide key mechanics (seasons, burnout, quality/engagement trade-offs) so agents must explore and learn.
- Score outcomes in business terms (Net Worth, Income, DAU).
- Use the same interface across three different economic worlds (Vending, Freelance, Operation) to test generality.
Why it matters: Without a shared, open, long-horizon ground, we can't honestly compare agents or teach them to be steady, not just speedy.
Anchor: It's like giving all racers the same simple bicycle and a very long track through city, forest, and desert to see who can pace themselves and finish strong.
Multiple analogies:
- Garden analogy: Everyone gets the same simple tools (shovel, water, seeds), but must grow plants across spring, summer, and fall; weather is partly hidden.
- Chess season analogy: Same rules, many matches in a season; success is about consistent strategy, not a single flashy move.
- Fitness plan analogy: Few exercises repeated daily; progress depends on balancing effort and recovery over months, not one gym session.
Before vs After:
- Before: Agents aced short puzzles or one-off games but stumbled when asked to run a business over weeks.
- After: With EcoGym, we can see who adjusts prices with seasons, rests before burnout, and trades off quality vs engagement to keep users.
Hook: Think of a coach keeping the whole team on the same playbook for the entire season.
The Concept: Strategic coherence is when many small moves line up over time with the big goal. How it works:
- Set a long-term aim (profit, health, active users).
- Choose daily actions that donât contradict each other.
- Update plans when the world pushes back.
Why it matters: Without coherence, agents ping-pong: overprice then understock; overwork then crash.
Anchor: A good student spreads study time across the week instead of cramming, so the plan actually works.
Hook: A good test should fit many shoes, not just one pair.
The Concept: Generalizable benchmark means one test style that fits multiple worlds and models fairly. How it works:
- Same interface (tools, reports) everywhere.
- Different environments to prevent overfitting to one trick.
- Open-source for transparency and community growth.
Why it matters: Without generality, we keep guessing whether success "transfers" to new tasks.
Anchor: It's like one driver's license test that covers city, highway, and rain, so you know a driver is truly road-ready.
Why it works (intuition):
- Small action sets force real planning; no hiding behind complicated moves.
- Long timelines surface slow mistakes (like quiet cash leaks or fatigue buildup).
- Hidden mechanics require exploration and learning, not memorization.
- Business-grounded scores make success meaningful and comparable.
Building blocks:
- Three principles: (1) Simple actions, infinite horizon; (2) Economic outcomes; (3) Latent (hidden) mechanics to encourage discovery.
- Three worlds: Vending (stock and price), Freelance (work and wellness), Operation (user growth and content).
- One protocol: standardized tool calls, daily budgets, and survival/goal rules.
03 Methodology
At a high level: Observation → Pick one action (tool) → Environment updates hidden state → Get feedback report → Repeat across many days → Final business score.
Shared decision loop (per day):
- See todayâs report (cash/inventory; jobs/energy/stress; DAU/quality/activity).
- Choose ONE best action (strict rule) from 4–5 tools.
- Environment runs hidden mechanics (demand, burnout, decay).
- Log feedback, possibly update memory, and continue until you call task_done; then the next day begins.
Why this structure? Limiting to one action per turn spotlights prioritization. Hidden transitions reward exploration. Long runs test stability.
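The daily loop above can be sketched as code. This is a hedged reconstruction, not EcoGym's actual API: `ToyEnv`, `GreedyAgent`, and `Action` are invented stand-ins, and only the `task_done` call comes from the paper's interface.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str

class GreedyAgent:
    """Trivial agent: takes one productive action, then ends the day."""
    def __init__(self):
        self.acted = False
    def act(self, report):
        if self.acted:
            self.acted = False
            return Action("task_done")
        self.acted = True
        return Action("price_set")
    def remember(self, report):
        pass  # an optional memory module would update here

class ToyEnv:
    """Stand-in environment: the score just counts productive actions."""
    def __init__(self):
        self.score = 0
    def reset(self):
        return {"day": 0}
    def step(self, action):
        if action.name != "task_done":
            self.score += 1  # hidden mechanics would update state here
        return {"score": self.score}
    def final_score(self):
        return self.score

def run_episode(env, agent, horizon=365):
    """One tool call per turn until task_done, repeated for each day."""
    report = env.reset()
    for _ in range(horizon):
        while True:
            action = agent.act(report)  # choose ONE action
            report = env.step(action)   # environment reacts
            agent.remember(report)      # optional memory update
            if action.name == "task_done":
                break
    return env.final_score()

score = run_episode(ToyEnv(), GreedyAgent(), horizon=10)
```

The inner `while` loop is the key design choice: the agent must explicitly spend its limited daily budget and decide when to stop.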
Examples of actual data:
- You set chips price to 2.20 → Net Worth rises slower.
- You pick a medium coding task while tired → stress spikes → future failures become likelier.
- You boost engagement today → DAU rises tomorrow, but content quality erodes unless moderated.
Environment I: Vending (maximize Net Worth)
- State you care about: cash, inventory by SKU, wholesale costs, your prices, pending orders, and hidden market knobs (seasonality, price sensitivity).
- Actions: • products_research: discover items and costs. • order_place: buy stock that arrives later. • price_set: change retail price to steer demand. • price_query: check current price. • task_done: end the day.
- Hidden transitions (why they matter): • Demand uses a season curve and price elasticity: too expensive, fewer sales; too cheap, you might sell out but profit less per item. • Logistics delay: orders arrive on delivery day; if you forget to reorder, shelves go empty.
- What breaks without each step: • No products_research → you miss better-margin items. • No order_place → you run out of stock and stall. • No price_set → you can't adapt to seasons and lose money.
- Mini example: Raise the soda price above 2.50. If the season peaks, sales might still be okay; if not, sales drop, and Net Worth lags.
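A sketch of how seasonality and price elasticity might interact. The functional form and every coefficient here are assumptions; EcoGym deliberately hides the real mechanics from the agent:

```python
import math

def vending_demand(price, cost, day, base=50.0, elasticity=1.5, season_amp=0.3):
    """Toy demand curve: a seasonal sine wave times a price-elasticity
    penalty. These numbers are made up for illustration."""
    season = 1.0 + season_amp * math.sin(2 * math.pi * day / 365)
    return max(0.0, base * season * (cost / price) ** elasticity)

# A higher price sells fewer units but earns more margin per unit:
units_low = vending_demand(price=2.00, cost=1.00, day=180)
units_high = vending_demand(price=2.50, cost=1.00, day=180)
profit_low = units_low * (2.00 - 1.00)
profit_high = units_high * (2.50 - 1.00)
```

The point of the hidden curve: a higher price can sell fewer units yet earn more total profit, and the balance shifts with the season, so the agent must keep probing rather than memorize one answer.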
Environment II: Freelance (maximize Income while surviving)
- State you care about: money, energy, stress, skills, job pool, and a hidden burnout threshold.
- Actions: • tasks_browse / task_inspect: see jobs, pay, and difficulty. • tasks_discover (free/paid): refresh the job list. • solution_submit: do work and get judged/paid by an auditor model. • energy_restore (low/medium/high): recover at a cost, reduce stress. • task_done: end the day.
- Hidden transitions: • Success improves skills; failure raises stress. • Energy drains faster on hard tasks if your skill is low. • Unclaimed jobs can expire.
- What breaks without each step: • No inspect → you pick jobs blind and overstrain. • No restore → you earn fast, then burn out and fail. • No discover → you miss high-value jobs as the pool shifts.
- Mini example: You grab a "medium" writing task while your energy is low; you fail, lose time, and stress jumps; later, even easy tasks feel costly.
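The energy/stress/burnout dynamics might look something like this toy model. All thresholds and coefficients are invented; in EcoGym the real ones are latent and must be discovered:

```python
import random

def attempt_task(energy, stress, skill, difficulty, rng):
    """Toy freelance dynamics: harder-than-skill tasks drain more energy,
    failure raises stress, success builds skill. All numbers are invented."""
    drain = 10 + 5 * max(0.0, difficulty - skill)
    energy = max(0, energy - drain)
    p_success = min(0.95, 0.5 + 0.1 * (skill - difficulty) + 0.003 * energy)
    if rng.random() < p_success:
        skill += 0.2                           # success improves skill
    else:
        stress = min(100, stress + 15)         # failure raises stress
    burned_out = energy == 0 or stress >= 100  # hidden survival threshold
    return energy, stress, skill, burned_out

# Grinding hard tasks with no energy_restore ends in burnout:
rng = random.Random(0)
energy, stress, skill = 100, 0, 1.0
for _ in range(12):
    energy, stress, skill, burned_out = attempt_task(
        energy, stress, skill, difficulty=3.0, rng=rng)
```

This is why the healthy rhythm of inspecting, submitting, and restoring matters: recovery actions cost money now but prevent a terminal burnout state later.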
Environment III: Operation (maximize average DAU)
- State you care about: DAU (users), content volume, content quality, creator activity, engagement, plus hidden system coefficients.
- Actions (one per turn): • acquisition_boost: add users now (spend budget). • engagement_tune: raise stickiness, but can erode quality. • creator_incentive: motivate creators to make more content. • moderation_tighten: clean up and raise quality, but slow creators. • task_done: end the day.
- Hidden transitions: • Zero-attractor: without interventions, DAU drifts down. • Quality-quantity tension: more engagement may hurt quality; more moderation may hurt supply.
- What breaks without each step: • No acquisition → no fresh users; DAU decays. • No moderation → quality tanks; retention drops. • No incentives → creators churn; content dries up.
- Mini example: You push engagement for 3 days; DAU rises. But quality slips, so you tighten moderation the next day to reset quality before it's too late.
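A toy version of the zero-attractor and the quality-quantity tension. The retention model and every coefficient are invented for illustration, not taken from EcoGym:

```python
def dau_step(dau, quality, intervention=None):
    """Toy Operation dynamics: DAU decays toward zero unless the operator
    intervenes; engagement trades quality for stickiness. Numbers invented."""
    if intervention == "acquisition_boost":
        dau += 200                          # spend budget, add users now
    elif intervention == "engagement_tune":
        dau *= 1.10                         # stickier today...
        quality = max(0.0, quality - 0.05)  # ...at quality's expense
    elif intervention == "moderation_tighten":
        quality = min(1.0, quality + 0.08)  # cleaner feed, slower creators
    retention = 0.90 + 0.08 * quality       # higher quality retains users
    return dau * retention, quality

# Zero-attractor: doing nothing lets DAU drift down.
dau, quality = 1000.0, 0.8
for _ in range(30):
    dau, quality = dau_step(dau, quality)
idle_dau = dau

# Regular acquisition holds the line (at a budget cost not modeled here).
dau, quality = 1000.0, 0.8
for _ in range(30):
    dau, quality = dau_step(dau, quality, intervention="acquisition_boost")
boosted_dau = dau
```

Because retention depends on quality, the agent cannot pull the engagement lever forever: the same intervention that lifts DAU today also weakens the retention it relies on tomorrow.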
Secret sauce (what's clever):
- Simple tools, infinite horizon: makes planning, not tool juggling, the hard part.
- Latent mechanics: agents must form and test hypotheses ("Maybe chips sell better on weekends").
- Unified interface: same way to act and observe in all three worlds makes comparisons fair.
- Budgeted actions: forces prioritization each day.
- Diagnostic add-ons: memory modules and a "thinking mode" to probe weaknesses and boosts.
Hook: Choosing how much control to use is like steering a bike: you can wobble if you overcorrect or fall if you don't steer enough.
The Concept: Controllability–utility trade-offs balance how tightly we steer (constraints, tools, memory, context) versus how much useful performance we get. How it works:
- Add memory or long context → more control of past info.
- But too much context can confuse some models.
- Structured tools help, yet strict one-action rules can slow recovery.
Why it matters: Finding the sweet spot gives stable, high utility without overload.
Anchor: A note card helps you study, but carrying 100 note cards might slow you down.
04 Experiments & Results
The test: Measure if agents can grow business-style scores over long runs while surviving. We look at:
- Vending: Net Worth (cash + inventory value + pending orders).
- Freelance: Income (money earned minus costs) with survival (no burnout).
- Operation: Average DAU, with collapse if DAU falls too low.
Why these? They mirror real goals: wealth, sustainable work, and active users.
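The Vending score, as described above, can be written as a small function. The exact valuation rule (for example, whether inventory is valued at current retail prices and pending orders at their purchase cost) is an assumption here:

```python
def vending_net_worth(cash, inventory, prices, pending_orders):
    """Net Worth = cash + inventory value + pending orders, per the text.
    Valuing inventory at retail price and pending orders at cost is an
    illustrative assumption, not a rule stated by the benchmark."""
    inventory_value = sum(qty * prices[sku] for sku, qty in inventory.items())
    pending_value = sum(order["cost"] for order in pending_orders)
    return cash + inventory_value + pending_value

net_worth = vending_net_worth(
    cash=500.0,
    inventory={"chips": 40, "soda": 25},
    prices={"chips": 2.20, "soda": 2.50},
    pending_orders=[{"sku": "chips", "cost": 60.0}],
)
```

Counting inventory and in-flight orders, not just cash, is what makes the score reward restocking discipline instead of hoarding money.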
The competition: 11 leading models (both proprietary and open) ran the same rules, tools, and budgets for up to 365 days (1 year). Context windows were standardized (128 recent steps) unless varied in a specific test.
Scoreboard highlights (with context):
- Vending (Net Worth): Gemini-3-Pro led with about 11,275, far ahead of many others, like earning an A+ when several got C to B. Gemini-3-Flash also did well; Claude and Grok were much lower.
- Freelance (Income): GPT-5-Mini topped with about 2,991, beating larger models like GPT-5.2, an "inverse scaling" surprise like the younger sibling winning the marathon.
- Operation (DAU): Claude-Sonnet-4.5 led with about 1,572 average DAU, ahead of Gemini and others: solid, steady leadership.
Key takeaway: No single model dominated all three. Each had strengths and blind spots.
Stochastic stability: We ran multiple trials. Vending showed high variance (some runs rocket, others sputter), while Freelance and Operation were steadier. So Vending results were averaged over five runs; the others used single runs.
Context window length: Extending the context window from 32 up to 1024 steps did not guarantee better results. For example, Gemini-3-Pro peaked at 128 but got worse with longer windows, while Gemini-3-Flash dipped then bounced back at 1024. Lesson: more memory isn't always more mastery.
Behavior over time (action rhythm):
- Vending: A "cold start" of research and pricing settled into a steady reorder cycle, like finding a weekly groove.
- Freelance: A healthy pattern emerged: inspect → submit → restore energy → repeat, maintaining income without burning out.
- Operation: Focus shifted from user acquisition to moderation to creator incentives: state-aware strategies rather than blind repetition.
Failure modes (two main ones):
- Strategy priority: Some models chased the wrong lever. In Operation, top models learned that scale (more content/users) could beat perfect quality for DAU, especially early on.
- Execution efficiency: Others wasted actions (repeating queries, forgetting to place orders), which hurt momentum and outcomes.
Memory modules: We tested working, symbolic, episodic, and a commercial Mem0. Results:
- Memory usually helped, but not always; the best type depended on the model and task.
- One model liked working memory across worlds; another did better with symbolic or episodic depending on the environment.
- Message: "one-size-fits-all memory" is not here yet.
Thinking with actions (persistent reasoning): Turning "thinking mode" ON (keeping a running plan/notes in context) boosted DAU for both Gemini-3-Flash and Gemini-3-Pro, closing the gap between them and improving stability.
Environment complexity: In Vending, increasing the SKU count raised potential profits but also cognitive load. Gemini-3-Flash scaled up profit with complexity; Gemini-3-Pro stagnated, showing that some models struggle to exploit larger state spaces.
Human comparison: Human experts averaged about 1,404 DAU in Operation (around 45 minutes per episode). Top models (Claude-Sonnet-4.5, DeepSeek-V3.2, Gemini-3-Flash, Gemini-3-Pro) met or exceeded this, showing super-human performance in this particular long-horizon scenario.
05 Discussion & Limitations
Limitations (specific):
- Reality gap: Even with hidden rules and noise, EcoGym is still a simulation; real markets have richer competitors, regulations, and unexpected shocks.
- Single-action-per-turn constraint: Great for testing prioritization, but real operators often parallelize tasks.
- Scoring scope: Net Worth, Income, and DAU are meaningful, yet real businesses juggle more KPIs (churn, NPS, unit economics).
- Memory dependence: Gains from memory vary by model and task, so results can hinge on careful agent engineering.
Required resources:
- Long runs: Hundreds to thousands of steps per episode can be computationally and financially expensive.
- Logging and evaluation: To study behavior, you need detailed traces and replay tools.
- Optional modules: External memory stores or longer context windows raise system complexity and cost.
When NOT to use EcoGym:
- If you only need to test short, atomic skills (e.g., single math problems or isolated coding snippets).
- If your application has fixed, fully known rules and no randomnessâsimpler benchmarks will do.
- If you cannot afford long-run compute or careful analysis of noisy outcomes.
Open questions:
- How to design memory that generalizes across tasks without hand-tuning per model?
- Can we learn robust "meta-strategies" that adapt automatically to hidden mechanics?
- What is the best way to combine short-term precision (execution) with long-term wisdom (strategy)?
- How big should context be before it starts to hurt more than help, and how can we mitigate that?
- How to add multi-agent interactions (competitors, teammates) without losing clarity in evaluation?
06 Conclusion & Future Work
Three-sentence summary: EcoGym is an open, long-horizon benchmark with simple daily actions and hidden economic rules, built to test whether AI agents can plan, execute, and adapt over time. Across three worlds (Vending, Freelance, Operation), no single model dominated, revealing a gap between short-term smarts and long-term stability. Diagnostics showed that memory and "thinking with actions" can help, but benefits vary by model and task.
Main achievement: Turning long-horizon, economically grounded performance (Net Worth, Income, DAU) into a transparent, multi-world, unified test so the community can honestly measure (and improve) durable agent competence.
Future directions:
- Smarter, model-agnostic memory and planning scaffolds.
- Richer multi-agent economies and competitor dynamics.
- Better measures beyond single metrics (profit quality, retention cohorts, health of creator ecosystems).
- Automated curriculum that ramps complexity as agents improve.
Why remember this: EcoGym moves evaluation from quick puzzles to living systems, making agents prove they can keep their heads, budgets, and users over the long haul: exactly what real-world AI teammates must do.
Practical Applications
- Prototype AI store managers that set prices, restock, and learn seasonal patterns safely in simulation.
- Build freelancer copilots that choose tasks wisely, schedule rest, and avoid burnout while maximizing pay.
- Design growth agents for apps that balance acquisition, engagement, moderation, and creator incentives.
- Evaluate which memory systems (working, symbolic, episodic) best stabilize your agent before deployment.
- Use "thinking with actions" to improve long-run consistency in planning-heavy automations.
- Stress-test agents under randomness to ensure strategies hold up across different runs.
- Benchmark multiple models on the same open, long-horizon tasks to pick the right one for your use case.
- Tune action budgets and daily routines to teach agents prioritization and reduce wasted steps.
- Study failure modes (strategy vs. execution) to choose the right scaffold: planner focus vs. executor focus.
- Iterate on open-source environments to add competitors, budgets, or new KPIs for your domain.