AutoWebWorld: Synthesizing Infinite Verifiable Web Environments via Finite State Machines
Key Summary
- AutoWebWorld builds pretend websites with clear rules so AI can practice safely and be checked automatically.
- It uses Finite State Machines (FSMs) to list every state, action, and rule, so nothing is hidden or guessy.
- A coding agent turns the FSM into a working website, and a search algorithm (BFS) finds the shortest correct steps to finish tasks.
- Each found plan is replayed on the website; any plan that doesn't fully work is thrown away, so the final data is clean and reliable.
- The system made 11,663 verified practice journeys across 29 websites for only $0.04 per journey.
- A 7B AI model trained on just ~16K steps of this synthetic data reached 27.42% success on WebVoyager within 15 steps, beating strong baselines.
- As they added more synthetic data, the AI kept getting better on real benchmarks (a clear scaling law).
- Because the rules are inside the environment, success is verifiable without paying humans or LLMs to judge each step.
- Trajectories are longer on average (about 22 steps), which helps AIs learn planning and memory.
- The same worlds double as stable, reusable benchmarks with built-in success checks.
Why This Research Matters
Better web agents can help with everyday tasks like booking trips, filing support tickets, or enrolling in courses, without constant human supervision. AutoWebWorld makes this training cheaper and more reliable by baking the rules and the grader into the environment itself. With clean labels and longer practice sequences, agents learn to plan and to follow real instructions better. Because the worlds are controllable, teams can tune difficulty, add new skills, and reuse them as fair, stable tests over time. The strong scaling results suggest we can keep improving real-world performance simply by generating more verified practice. This is a practical path to trustworthy, helpful AI assistants for the web.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how board games come with rule books that tell you exactly what moves are allowed and when you win? Imagine trying to play a game without seeing the rules, just the pictures on the box. You'd keep guessing what's right.
🥬 Filling (The Actual Concept)
- What it is: Web GUI agents are AIs that click, type, and navigate websites to do tasks (like booking flights or finding courses).
- How it works (before this paper): The AI acts on a real website, gets a screenshot back, and guesses if the click was correct based only on what it can see.
- Why it matters: Real websites hide their true internal state (like what's really in the cart), so people or other AIs must judge whether each step was correct. That is slow, inconsistent, and expensive.
🍞 Bottom Bread (Anchor) Think of clicking "Add to cart." The page might look updated, but did the item, size, and discount really get recorded? With only a screenshot, you can't be sure, so you need a judge.
— The World Before —
- Online exploration: Agents roamed live websites, produced step sequences, and then external judges (humans or LLMs) tried to tell if those steps were correct.
- Human demonstrations: People performed tasks while recording actions for the AI to learn from.
- Video-to-actions: Systems converted screen recordings into action sequences.
All three depended on outside verifiers because the true website state was hidden. This created a verifier bottleneck: pricey, slow, and sometimes contradictory judgments.
🍞 Top Bread (Hook) Imagine a science fair where every experiment has a checklist and a built-in sensor that turns green only when everything is done right.
🥬 Filling (The Actual Concept: Verification Mechanism)
- What it is: A verification mechanism checks if steps and goals are correct based on the rules, not guesses.
- How it works: The environment itself knows its states and rules. If an action is allowed, it changes the state in a specific way; if not, nothing changes. Reaching a goal state means success, no debate.
- Why it matters: This removes the need for external judges and makes data collection cheap, fast, and reliable.
🍞 Bottom Bread (Anchor) Like a quiz that grades itself immediately because the answer key is built in.
— The Problem —
- Hidden state = guessing from screenshots.
- Judges disagree and cost money.
- Data at scale is hard: more tasks mean more judging.
— Failed Attempts —
- More exploration on real sites? Still needs judges.
- More human demos? Too expensive to scale.
- Smarter LLM judges? Still inconsistent and slow.
🍞 Top Bread (Hook) Imagine if every website worked like a Lego set with clear pieces and instructions you can see and count.
🥬 Filling (The Actual Concept: Finite State Machines, or FSMs)
- What it is: An FSM is a map of all allowed states (like pages + settings), all actions, and exactly what each action does.
- How it works: Each action has preconditions (when it's legal) and effects (how it changes the state). Navigation between pages resets or carries the right fields deterministically.
- Why it matters: When rules are explicit, you can search for valid plans and verify success programmatically.
🍞 Bottom Bread (Anchor) Like a traffic light system: from red, you can go to green only through yellow, with fixed timing and no surprises.
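The precondition/effect idea can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual schema: the state fields, action names, and the shopping example are all assumptions for demonstration.

```python
# Minimal FSM sketch: a state is a dict of fields; each action pairs a
# precondition (when it is legal) with an effect (how it changes the state).
# All names here are illustrative assumptions, not the paper's real schema.

def precond_add_to_cart(state):
    # "Add to cart" is only legal once size and color are chosen.
    return state["page"] == "product" and state["size"] and state["color"]

def effect_add_to_cart(state):
    new = dict(state)
    new["cart"] = state["cart"] + [(state["item"], state["size"], state["color"])]
    new["page"] = "cart"  # navigation effect: move to the cart page
    return new

ACTIONS = {
    "add_to_cart": (precond_add_to_cart, effect_add_to_cart),
    "choose_size_m": (lambda s: s["page"] == "product",
                      lambda s: {**s, "size": "M"}),
    "choose_color_red": (lambda s: s["page"] == "product",
                         lambda s: {**s, "color": "red"}),
}

def step(state, action):
    """Deterministic transition: legal actions change state; illegal ones don't."""
    pre, eff = ACTIONS[action]
    return eff(state) if pre(state) else state

s0 = {"page": "product", "item": 42, "size": None, "color": None, "cart": []}
s1 = step(s0, "add_to_cart")  # illegal: no size/color yet, state unchanged
s2 = step(step(s0, "choose_size_m"), "choose_color_red")
s3 = step(s2, "add_to_cart")  # now legal: cart updated, page changes
```

Because every transition is a pure function of (state, action), the same rules can later drive both search and verification.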
— The Gap — We needed a way to train web agents with built-in verification, not external judging. The missing piece: generate controllable websites whose internal state and rules are fully known.
— Real Stakes —
- Cheaper training: $0.04 per verified trajectory versus $0.15–$1.00 for prior approaches.
- More reliable training: No guessing, fewer errors.
- Better agents for everyday tools: shopping, education, travel, productivity, done by AIs that actually follow rules.
🍞 Top Bread (Hook) You know how you explore a maze by checking every nearby path before going deeper, so you don't miss a shorter way out?
🥬 Filling (The Actual Concept: Breadth-First Search, or BFS)
- What it is: A search that explores step by step in layers to find the shortest valid path to a goal.
- How it works: Start at the initial state, expand all allowed actions, then the next layer, and so on. Stop when you reach a goal state.
- Why it matters: It guarantees the shortest plan and avoids wasting time on wrong or longer paths.
🍞 Bottom Bread (Anchor) Like finding your classroom by checking the nearest hallway first, then the next, so you don't wander forever.
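The layer-by-layer idea can be written down directly. Below is a generic BFS sketch on a toy map; the maze and room names are made up for illustration:

```python
from collections import deque

def bfs_shortest_path(start, goal, neighbors):
    """Layer-by-layer search: the first time we reach `goal`, the path is shortest."""
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None  # goal unreachable

# Toy maze as an adjacency map (illustrative, not from the paper).
maze = {"entry": ["hall"], "hall": ["lab", "stairs"], "stairs": ["lab"], "lab": []}
path = bfs_shortest_path("entry", "lab", lambda n: maze[n])
# Finds the 2-step route through the hall rather than the detour via the stairs.
```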
🍞 Top Bread (Hook) Imagine a kitchen where one cook writes the menu, another checks it, and a third improves the recipes, working together faster than one person could.
🥬 Filling (The Actual Concept: Multi-Agent Framework)
- What it is: Multiple AI agents collaborate: one proposes an FSM, one validates it, one improves it, and coding agents turn it into a real site.
- How it works: Propose → Validate → Revise until correct; then generate the website; then search and verify trajectories.
- Why it matters: It scales creation of many websites with consistent quality.
🍞 Bottom Bread (Anchor) Like a relay team: each runner handles their leg so the baton (the FSM) finishes strong.
02 Core Idea
— The "Aha!" Moment in One Sentence — If we turn a website into an explicit rule map (an FSM) and then auto-build that website, we can search for correct action paths and verify success inside the environment, with no human judges needed.
🍞 Top Bread (Hook) You know how math puzzles are easy to grade when all steps are shown? No guessing, just check the rules.
🥬 Filling (The Actual Concept)
- What it is: AutoWebWorld is a pipeline that (1) generates an FSM for a themed website, (2) turns it into a working site, (3) searches the FSM for shortest valid action sequences, and (4) replays them to keep only those that execute perfectly.
- How it works: Every action has preconditions and effects; BFS explores only legal actions; selectors link FSM actions to real UI clicks/typing; Playwright replays; failed replays get filtered out.
- Why it matters: The result is a large set of cheap, clean, verified training trajectories.
🍞 Bottom Bread (Anchor) Like building a practice driving course with traffic lights that are wired to rules you control, then recording perfect lessons to teach new drivers.
— Multiple Analogies —
- LEGO Instruction Book: The FSM is the manual; BFS follows the shortest build steps; verification checks you built the model exactly right.
- Video Game Level Editor: You design the map (FSM), auto-generate the level (website), run the pathfinder (BFS), and keep only runs that beat the level with no glitches.
- Recipe Factory: Define ingredients and steps (FSM), cook the dish (site), taste-test by rule (verification), and throw away any batch that doesn't match the flavor profile.
— Before vs After —
- Before: Agents guessed from screenshots; external judges argued about correctness; data was costly and sometimes wrong.
- After: Agents train on verifiably correct paths; success is reaching goal states; cost is low; scaling up is straightforward.
— Why It Works (Intuition, not equations) —
- Determinism: Given a state and a valid action, the next state is uniquely defined. That lets search and verification be exact.
- Separation of concerns: Semantic correctness (preconditions/effects) is separate from UI execution (selectors/clicks). The FSM is the source of truth; the website is the stage.
- Shortest paths: BFS prevents bloated, confusing demonstrations.
- Programmatic checks: Success equals reaching a goal state, not a human opinion.
— Building Blocks (Small Pieces) —
- FSM generator: Multi-agent propose-validate-revise to define pages, signature variables, actions, and goals.
- Website builder: Coding agents render a Vue front-end that enforces the same selectors and interactions as the FSM.
- Searcher (BFS): Enumerates valid, shortest action sequences from the initial state to goal states.
- Executor and filter: Playwright clicks/types using selectors; any mismatch discards the trajectory.
- Query and grounding: Turn paths into training prompts, and also build UI-grounding examples for where-to-click learning.
- Trainer: Fine-tune a GUI agent (e.g., a 7B model) with ~16K steps of this verified data to improve real-site skills.
🍞 Top Bread (Hook) Imagine a lock where each click of a dial moves you to a new, known position. If you know the rules, you can dial the fastest combo every time.
🥬 Filling (The Actual Concept: Why FSM + BFS beats guessing)
- What it is: A rule-first design that prevents illegal moves and confirms goals.
- How it works: Preconditions bar nonsense actions; effects update the state cleanly; BFS uncovers the shortest legal path; UI replay double-checks that the front-end really behaves.
- Why it matters: Clean data means better learning with fewer examples.
🍞 Bottom Bread (Anchor) Like practicing piano with a metronome and sheet music: the rules and timing keep you honest, so you learn faster.
03 Methodology
— High-Level Recipe — Input (Theme + Reference Site Name) → Step A: Generate FSM → Step B: Build Website → Step C: BFS Search → Step D: Replay + Filter → Output: Verified Trajectories + Queries
Step A: Generate an FSM (the rule map)
- What happens: A proposer agent drafts pages, signature variables (like search text, filters, selected item), actions with preconditions/effects, and goal states. A validator checks reachability and rule correctness. An improver fixes issues. Repeat until valid.
- Why this step exists: Without a precise FSM, you can't search or verify. Missing preconditions or fuzzy effects would create wrong paths and bad labels.
- Example: On a shopping site, "Add to cart" requires size and color chosen (preconditions). Effects set cart contents; navigation may move to the cart page and reset pagination.
Step B: Build the Website (the stage)
- What happens: Coding agents generate a runnable Vue site from the FSM with a consistent selector namespace. Pages, components, and data mocking align with FSM semantics. Build, run, and self-repair until it compiles and launches.
- Why this step exists: You need a real front-end to take screenshots, read coordinates, and confirm the FSM can be enacted in the UI.
- Example: The product grid has items with selectors like #item-card-42 that match the FSM's action parameter item_id=42.
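One way to picture the shared selector namespace is as a mapping from action templates to selector patterns. The templates below are hypothetical, chosen only to match the #item-card-42 example; the paper's real naming scheme may differ.

```python
# Hypothetical selector namespace: each FSM action template knows how to derive
# the CSS selector for its target element from its parameters.
SELECTOR_TEMPLATES = {
    "click_item":     "#item-card-{item_id}",
    "type_repo_name": "#repo-name",
    "click_create":   "#create-btn",
}

def selector_for(action, **params):
    """Derive the DOM selector an FSM action grounds to."""
    return SELECTOR_TEMPLATES[action].format(**params)

# The FSM action click_item(item_id=42) grounds to the element #item-card-42.
sel = selector_for("click_item", item_id=42)
```

Because the website builder and the replayer share this one namespace, an abstract FSM action always resolves to the same concrete UI element.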
Step C: BFS Search (find shortest valid plans)
- What happens: Each FSM state is a node (page + signature). BFS expands only actions whose preconditions are true. Effects deterministically compute the next state. Reaching a goal records the shortest action sequence.
- Why this step exists: Random exploration wastes time and can miss shortest solutions; BFS guarantees minimal, valid plans.
- Example: To check out, BFS ensures it first fills shipping info, then payment, then confirms, with no skipped steps.
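Step C can be sketched as BFS over FSM states where only precondition-satisfying actions are expanded. The tiny checkout FSM below is an invented example that mirrors the shipping-before-payment ordering described above:

```python
from collections import deque

# BFS over FSM states, expanding only actions whose preconditions hold.
# A state is a frozenset of completed stages so it stays hashable.
CHECKOUT_ACTIONS = {
    # action: (precondition over completed stages, stage it completes)
    "fill_shipping": (lambda done: "shipping" not in done, "shipping"),
    "fill_payment":  (lambda done: "shipping" in done and "payment" not in done, "payment"),
    "confirm":       (lambda done: "payment" in done and "confirmed" not in done, "confirmed"),
}

def shortest_plan(goal_stage):
    queue = deque([(frozenset(), [])])
    seen = {frozenset()}
    while queue:
        done, plan = queue.popleft()
        if goal_stage in done:
            return plan  # first hit is the shortest, by BFS
        for name, (pre, stage) in CHECKOUT_ACTIONS.items():
            if pre(done):                  # only legal actions are expanded
                nxt = done | {stage}       # deterministic effect
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [name]))
    return None

plan = shortest_plan("confirmed")
# The preconditions force the order shipping -> payment -> confirm.
```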
Step D: Replay + Filter (trust but verify)
- What happens: Convert each high-level action into its GUI procedure (click/type/scroll) using the agreed selectors. Replay with Playwright. Keep only trajectories where all steps execute and the goal state is reached; discard any that fail due to front-end mismatch.
- Why this step exists: Even perfect FSM plans can fail if the generated website differs slightly. Execution filtering guarantees that only reproducibly correct trajectories survive.
- Example: If "#checkout-button" doesn't render due to a layout bug, that trajectory is dropped.
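The keep-or-drop logic of Step D can be simulated without a browser. In the real pipeline Playwright drives an actual page; in this sketch a mock set of rendered selectors stands in for the DOM, and the trajectories and selectors are illustrative:

```python
# Sketch of execution filtering: a trajectory survives only if every step's
# target selector actually rendered. In reality, Playwright clicks/types on a
# live page; here membership in a selector set stands in for "the click worked."

def replay_ok(trajectory, rendered_selectors):
    return all(step["selector"] in rendered_selectors for step in trajectory)

good = [{"action": "click", "selector": "#new-repo"},
        {"action": "type",  "selector": "#repo-name"},
        {"action": "click", "selector": "#create-btn"}]
buggy = good + [{"action": "click", "selector": "#checkout-button"}]  # never rendered

dom = {"#new-repo", "#repo-name", "#create-btn"}  # what the front-end produced
kept = [t for t in (good, buggy) if replay_ok(t, dom)]
# The fully executable trajectory is kept; the one hitting the layout bug is dropped.
```

This is why filtering reduces yield but never lets a front-end bug leak into the dataset: failure at any single step discards the whole trajectory.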
The Secret Sauce
- Explicit states and rules (FSM) + shortest-path search (BFS) + real UI replay = intrinsic verification. No human judges, no ambiguous screenshots.
- Shared selector namespace: Actions map cleanly to DOM elements, turning semantics into clicks.
- Deterministic transitions: Given (state, action), you always know the next state, which is ideal for programmatic grading.
Concrete Data Path Example
- Task: "Create a new repository named 'my-notes'."
- FSM path: Home → click New Repo (preconditions ok) → type name → set visibility → click Create → Goal: repo page loaded.
- GUI replay: click(#new-repo), type_text(#repo-name, "my-notes"), click(#visibility-private), click(#create-btn).
- Verification: After each action, effects update signature; terminal repo page reached = success.
What breaks without each step
- No FSM: You can't decide validity; search explodes; labels become noisy.
- No BFS: You get long, meandering demos that confuse training.
- No shared selectors: You can't reliably click the right UI element.
- No replay filtering: Hidden front-end bugs pollute the dataset.
Outputs
- 11,663 verified trajectories across 29 sites at $0.04 each.
- ~16K total steps used for training after sampling and augmentation (trajectory + grounding data).
- Reusable worlds with built-in success checks for benchmarking.
04 Experiments & Results
The Test: What they measured and why
- Success on real websites: Do agents trained on synthetic, verified data perform better on benchmarks like WebVoyager and Online-Mind2Web?
- Cost and scale: How cheap and scalable is trajectory creation compared to prior datasets?
- Grounding accuracy: Does where-to-click accuracy improve on benchmarks like ScreenSpot-V2 and ScreenSpot-Pro?
The Competition
- Real-world datasets (Explorer, AgentTrek, FARA, Mind2Web) that rely on external judges and cost $0.15–$1.00 per trajectory.
- Open-source and closed-source baseline models across sizes (≤7B, ≥7B, and frontier models).
The Scoreboard (with context)
- Cost: AutoWebWorld costs $0.04 per verified trajectory, like buying 1 cookie instead of 4–25 cookies elsewhere.
- Length: Average 21.9 steps versus 6.9–12.1 elsewhere, meaning richer, longer tasks that teach planning.
- WebVoyager (≤15 steps): The 7B model trained on ~16K steps of AutoWebWorld data achieves 27.42% overall success, outperforming strong baselines (e.g., UI-TARS-1.5-7B at 26.51%) and far above Qwen2.5-VL-7B (5.62%). Domains like Cambridge Dictionary (60.47%), Coursera (30.00%), and Hugging Face (32.43%) show notable gains; even tougher Google sites see non-zero wins.
- Small but mighty: The 3B model trained on this data reaches 15.09%, beating some 7B baselines, which is evidence of data quality and efficiency.
- Grounding: On ScreenSpot-V2 and ScreenSpot-Pro, both the 3B and 7B versions improve substantially (e.g., 3B overall +4.01 on ScreenSpot-V2; +4.7 average on ScreenSpot-Pro). Better where-to-click understanding transfers.
Surprising (Good) Findings
- Clear scaling law: As synthetic training samples grow (8 → 256 → 1,024 → ~16K), WebVoyager success rises 3.92% → 17.59% → 19.09% → 27.42%, and Online-Mind2Web rises 1.22% → 7.32% → 7.93% → 14.02%.
- Grounding matters: Removing grounding data gives a tiny early reward bump but hurts long-term learning; with grounding, coordinate rewards grow faster and higher.
- Not just toys: On two synthesized sites (Quora, GitHub), frontier agents do worse than on WebVoyager, showing these synthetic worlds are not trivial: they are controlled yet challenging.
Why these numbers matter
- Getting an "A" (27.42%) on a tough test where many score a "D" or "C" proves verified synthetic data can beat bigger but noisier real-world corpora.
- Longer trajectories teach planning skills that generalize.
- Built-in verification crushes costs and inconsistencies, unlocking scalable training.
05 Discussion & Limitations
Limitations
- Fidelity gap: Synthetic sites, while realistic, may miss quirks of the live web (CAPTCHAs, rate limits, flaky pop-ups), so some real-world skills still need fine-tuning.
- FSM correctness: If an FSM's preconditions/effects are wrong, you get beautifully verified but mis-specified data. The multi-agent validator helps, but it's not magic.
- Front-end drift: Even with shared selectors, generation glitches can break actions; execution filtering fixes this but reduces yield.
- Coverage: 29 sites and 11,663 trajectories are strong but not internet-scale; more themes, pages, and actions will broaden skills.
Required Resources
- Model APIs or local LLMs to propose/validate/improve FSMs and code sites.
- Playwright (or similar) for deterministic replay.
- Training compute (the paper used 8× A800 GPUs) or smaller-scale alternatives.
- Modest budget for generation (the dominant cost was step-level "thinking," not environment building).
When NOT to Use
- If you must exactly mirror a specific live site's shifting behavior or dynamic anti-bot defenses.
- If you can't enforce selectors or need pixel-perfect layout matching of a target brand.
- If you require data from private or auth-protected flows without synthetic analogs.
Open Questions
- How to auto-learn FSMs from real sites to reduce hand-offs and close the fidelity gap?
- Can we mix synthetic and small amounts of verified real data to get the best of both?
- How to scale themes to hundreds of sites while keeping action semantics consistent?
- Can richer state observability (e.g., partial real DOM/state access) reduce reliance on screenshots in the wild?
06 Conclusion & Future Work
3-Sentence Summary AutoWebWorld converts website behavior into Finite State Machines, auto-builds matching sites, and uses BFS plus replay to generate large sets of cheap, verified training trajectories. This removes the verifier bottleneck (humans/LLMs) by baking success checks into the environment itself. Training on only ~16K verified steps yields state-of-the-art results within 15 steps on WebVoyager and shows clear gains with more synthetic data.
Main Achievement A practical, scalable recipe for infinite, verifiable web practice: FSMs → websites → shortest plans → execution-verified data.
Future Directions
- Expand environment diversity and complexity (forms, payments, auth flows) while retaining intrinsic verification.
- Semi-automatic FSM extraction from real sites to narrow the reality gap.
- Smarter data selection and curriculum shaping to improve data efficiency further.
Why Remember This By moving the rules and the grader inside the environment, AutoWebWorld shows that synthetic, verifiable practice can train better real-world web agents at a fraction of the cost, turning a messy guessing game into a clean science experiment.
Practical Applications
- Train workplace assistants to navigate internal dashboards with verified steps before deploying them on live systems.
- Create a curriculum of tasks (easy → hard) by adjusting FSM complexity to steadily grow an agent's skills.
- Benchmark GUI agents fairly with stable, reusable websites and built-in success checks (no changing online pages).
- Generate where-to-click grounding data to improve an agent's accuracy in selecting the right UI elements.
- Prototype new product flows (e.g., checkout variants) and auto-verify that assistance policies still succeed.
- Stress-test agents with long, multi-page tasks to improve planning and memory.
- Rapidly explore ablations (remove certain actions or add constraints) and see how training responds.
- Fine-tune enterprise agents on synthetic twins of sensitive apps to avoid leaking real data.
- Use execution filtering to maintain a high-precision dataset even as websites or components evolve.
- Scale data cheaply for small models to reach or beat larger baselines via better supervision.