MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
Key Summary
- MobilityBench is a big, carefully built test that checks how well AI helpers can plan real-world routes using natural language and map tools.
- It uses real, anonymized travel questions from people across 350+ cities and 22 countries, so it reflects everyday needs like 'avoid tolls' or 'arrive by 7:30.'
- A special replay sandbox freezes map and traffic responses, so every model is tested in the exact same conditions for fair, repeatable results.
- The evaluation looks at multiple skills: understanding instructions, planning steps, choosing and using tools, finishing the task, and doing it efficiently.
- Current models do well on basic info and simple routes but struggle when users add personal preferences (like avoiding highways or minimizing transfers).
- Two popular agent styles—ReAct (think-act-observe) and Plan-and-Execute (plan first, then act)—trade off between robustness and efficiency.
- Bigger or reasoning-focused models generally perform better, but at higher cost and latency.
- MobilityBench releases data, tools, and documentation, so researchers and companies can compare methods fairly and improve faster.
- Results show the best Final Pass Rate around 69% in the toughest ReAct setting, leaving plenty of room for better personalization.
- This benchmark helps turn route-planning AIs from good GPS helpers into reliable travel companions that handle real-life constraints.
Why This Research Matters
MobilityBench helps transform smart GPS-like assistants into trustworthy travel partners that truly understand and follow your preferences. It gives researchers and companies a fair way to compare models and find exactly where they need to improve. Because it uses real, anonymized user requests across many cities, progress on this benchmark is more likely to help everyday people. The replay sandbox removes randomness, so better scores really mean better agents, not lucky timing. By highlighting struggles with preferences, it pushes the field toward more personalized, reliable travel assistance. In short, it accelerates safer, smarter, and more user-friendly navigation for everyone.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how you ask your phone, “How do I get to the museum, but please avoid highways?” and you expect a quick, clear answer?
🥬 Filling (The Actual Concept)
- What it is: This paper is about MobilityBench, a fair and repeatable test for AI route-planning helpers that understand your words and use map tools to find the best way to go.
- How it works (story of why we need it):
- Before: Phones and GPS apps were great at basic point-to-point directions, but understanding messy, spoken requests with preferences (like “arrive by 7:30” or “fewest transfers”) was hard for AIs.
- The rise of Large Language Models (LLMs) made it easier to understand natural language, but connecting that understanding to correct tool use (like routing APIs) and evaluating it fairly was still tricky.
- Real map services change over time (traffic updates, API backend tweaks), so the same request could return different answers on different days, making fair comparison difficult.
- Earlier benchmarks often focused on high-level trip ideas (like multi-day itineraries) instead of detailed, street-by-street routing with strict constraints.
- Researchers needed a test built from real user requests that could freeze the environment and score many skills, not just the final answer.
- Why it matters: Without a fair and stable test, we can’t tell which route-planning AI is truly better or how to improve them for everyday use.
🍞 Bottom Bread (Anchor) Imagine a science fair where every student bakes the same cookie recipe, but the ovens keep changing temperature. You can’t judge fairly! MobilityBench gives every student the exact same oven settings, so the best cookie recipe (route-planning AI) truly stands out.
🍞 Top Bread (Hook) Imagine chatting with a super-smart helper who both understands what you say and can press the right buttons in an app for you.
🥬 Filling (The Actual Concept): Large Language Models (LLMs)
- What it is: LLMs are AIs that read and write human-like text.
- How it works: 1) They learn patterns from tons of text; 2) They turn your question into a meaning-rich representation; 3) They generate answers or next steps.
- Why it matters: LLMs can understand “avoid tolls,” “stop at the bank first,” or “arrive by 7:30,” but they still need tools to actually compute routes.
🍞 Bottom Bread (Anchor) When you ask “What’s the fastest bus route with at most one transfer?”, an LLM understands that rule (“≤ 1 transfer”) and prepares the right tool calls to check options.
🍞 Top Bread (Hook) Think of a GPS that also listens to your special wishes.
🥬 Filling (The Actual Concept): Tool-Augmented Agents
- What it is: Tool-augmented agents are LLMs that can call external tools (like map APIs) to act in the real world.
- How it works: 1) Read your request; 2) Plan which tools to use (e.g., geocoding, routing); 3) Call them with correct parameters; 4) Combine results into a route and explanation.
- Why it matters: Understanding words isn’t enough; the agent must press the right “buttons” (APIs) correctly to produce a usable route.
🍞 Bottom Bread (Anchor) It’s like a helper who not only knows the address of a bakery but also orders the cake online for pickup at the perfect time.
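The read-plan-call-combine loop above can be sketched in a few lines of Python. The tool stubs (geocode, route) and the request fields are invented stand-ins for illustration, not MobilityBench's actual interfaces:

```python
# Minimal sketch of a tool-augmented agent loop. The geocode/route
# stubs and their outputs are invented, not the benchmark's real tools.

def geocode(name):
    # Hypothetical geocoder: place name -> (lat, lon), or None if unknown.
    return {"Central Museum": (31.23, 121.47)}.get(name)

def route(origin, destination, avoid_highways=False):
    # Hypothetical router: returns a toy route summary.
    return {"from": origin, "to": destination,
            "avoid_highways": avoid_highways, "eta_min": 25}

def plan_trip(request):
    """1) Read the request; 2) resolve the place; 3) compute the route."""
    dest = geocode(request["destination"])
    if dest is None:
        return None  # cannot route without coordinates
    return route(request["origin"], dest,
                 avoid_highways=request.get("avoid_highways", False))

plan = plan_trip({"origin": (31.20, 121.40),
                  "destination": "Central Museum",
                  "avoid_highways": True})
print(plan["avoid_highways"])  # → True
```

The key point the sketch makes: the language model decides *which* buttons to press, but the actual answer comes from the tools.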
🍞 Top Bread (Hook) You know how your parents sometimes say, “Let’s avoid toll roads this time,” or “Please, no more than one transfer with the kids”?
🥬 Filling (The Actual Concept): Route-Planning Agents
- What it is: A route-planning agent is an AI that turns your words into a detailed, executable travel plan using map tools.
- How it works: 1) Understand the goal and preferences; 2) Look up locations; 3) Compute routes for the chosen mode (drive, transit, bike, walk); 4) Check constraints like “avoid highways” or “arrive by 7:30”; 5) Present the final plan clearly.
- Why it matters: Without careful tool use and constraint checking, the plan might be wrong, late, or ignore your preferences.
🍞 Bottom Bread (Anchor) “Drive from school to grandma’s house, avoid highways, and stop at the library.” A good agent finds the library on the way and sets a route that skips highways.
🍞 Top Bread (Hook) Imagine saying, “I need to get to the airport by 7:30, but I don’t want to use toll roads,” and the system gets confused.
🥬 Filling (The Actual Concept): Preference-Constrained Route Planning
- What it is: Planning routes that obey user rules like avoid tolls, limit transfers, or visit waypoints in order.
- How it works: 1) Extract preferences from text; 2) Pass them as options to routing tools; 3) Validate that the chosen route follows all rules; 4) If not, try another plan.
- Why it matters: Ignoring preferences wastes time, money, or comfort, and breaks user trust.
🍞 Bottom Bread (Anchor) “Bus route from home to the stadium with at most one transfer.” The agent must pick a transit plan that never exceeds one transfer, even if a faster two-transfer trip exists.
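The stadium example above can be sketched as a filter-then-rank step. The candidate-route dicts and preference names here are invented for illustration, assuming the agent already fetched candidates from a routing tool:

```python
# Sketch of preference-constrained selection: keep only routes that obey
# every user rule, then take the fastest survivor. Route fields are toys.

def satisfies(route, prefs):
    if prefs.get("max_transfers") is not None and \
            route["transfers"] > prefs["max_transfers"]:
        return False
    if prefs.get("avoid_tolls") and route["uses_tolls"]:
        return False
    return True

def pick_route(candidates, prefs):
    valid = [r for r in candidates if satisfies(r, prefs)]
    return min(valid, key=lambda r: r["eta_min"]) if valid else None

candidates = [
    {"id": "fast", "transfers": 2, "uses_tolls": False, "eta_min": 31},
    {"id": "ok",   "transfers": 1, "uses_tolls": False, "eta_min": 38},
]
best = pick_route(candidates, {"max_transfers": 1})
print(best["id"])  # → "ok": the faster two-transfer trip is rejected
```

Note the order: constraints filter first, speed ranks second — exactly why the faster two-transfer trip loses.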
02 Core Idea
🍞 Top Bread (Hook) Imagine a fair obstacle course where every runner faces the exact same course, weather, and rules. That’s how you find the real champion.
🥬 Filling (The Actual Concept): MobilityBench
- What it is: MobilityBench is a fair, repeatable benchmark that tests route-planning agents on real, diverse mobility requests with a frozen, replayable map environment and a multi-skill scorecard.
- How it works:
- Gather many real, anonymized travel questions from users across many cities.
- Build a deterministic API-replay sandbox that returns the same map and traffic answers every time.
- Define a clear scoring system that checks understanding, planning, tool use, correctness, and efficiency.
- Release data and tools so everyone can test in the same way.
- Why it matters: Without fixed conditions and a rich scorecard, we can’t tell if an agent is truly good or just lucky.
🍞 Bottom Bread (Anchor) It’s like recording the exact weather and track layout for a race, then letting every runner compete on that identical replay, so results are truly comparable.
Multiple Analogies for the Aha! Idea
- Classroom analogy: Every student takes the same test, graded by a detailed rubric (understanding, planning steps, tool use, final answer), so grades are fair and explain where to improve.
- Cooking show analogy: Everyone cooks using the exact same pantry (replay sandbox) and recipe steps (ground-truth programs), so judges can compare dishes fairly.
- Video game analogy: Each player replays the same level with the same enemies and items, so the leaderboard is meaningful and repeatable.
🍞 Top Bread (Hook) Ever tried to learn from a YouTube tutorial that keeps changing halfway through? Frustrating!
🥬 Filling (The Actual Concept): Deterministic API-Replay Sandbox
- What it is: A controlled environment that always returns the same map and traffic responses for the same inputs.
- How it works: 1) Cache real API responses during data building; 2) Canonicalize inputs (like coordinates and times) so equal calls hit the same cache; 3) Use strict schema checks; 4) Apply safe fuzzy matching when needed; 5) Return identical outputs each run.
- Why it matters: It removes randomness from live services, so differences in scores come from the agent’s skill, not from changing traffic.
🍞 Bottom Bread (Anchor) Like practicing a piano piece on the same tuned piano every time, not one that keeps changing keys.
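The cache-plus-canonicalization idea can be sketched like this. How MobilityBench actually canonicalizes calls is not specified here; rounding coordinates and sorting keys are assumptions for illustration:

```python
# Toy replay cache: canonicalize the call, then serve the recorded
# response, so two near-identical calls hit the same cache entry.
import json

def canonical_key(tool, params):
    # Assumption: round float params so tiny coordinate noise collapses
    # to one key; sort keys so argument order doesn't matter.
    norm = {k: round(v, 4) if isinstance(v, float) else v
            for k, v in params.items()}
    return tool + "|" + json.dumps(norm, sort_keys=True)

cache = {}

def record(tool, params, response):
    cache[canonical_key(tool, params)] = response

def replay(tool, params):
    key = canonical_key(tool, params)
    if key not in cache:
        raise KeyError("unresolved call -> scored as a tool-use failure")
    return cache[key]

record("routing", {"lat": 31.230001, "lon": 121.470002}, {"eta_min": 25})
# A call with tiny float noise still hits the recorded entry:
print(replay("routing", {"lon": 121.47, "lat": 31.23}))  # → {'eta_min': 25}
```

Because every run reads from the same frozen cache, score differences between agents can only come from the agents themselves.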
🍞 Top Bread (Hook) Imagine scoring a soccer player only by goals, ignoring passing, defense, and teamwork.
🥬 Filling (The Actual Concept): Multi-Dimensional Evaluation Protocol
- What it is: A scorecard that checks several abilities: instruction understanding, planning, tool use, decision making, and efficiency.
- How it works: 1) Intent Detection and Information Extraction test comprehension; 2) Task Decomposition tests planning steps; 3) Tool Selection and Schema Compliance test tool use; 4) Delivery Rate and Final Pass Rate test outcomes; 5) Input/Output Tokens measure efficiency.
- Why it matters: If we only check the final answer, we miss where the agent goes wrong and how to fix it.
🍞 Bottom Bread (Anchor) Like a report card with reading, math, science, and PE—so you know strengths and where to practice more.
🍞 Top Bread (Hook) Think of each travel request as a mini adventure with a beginning, middle, and end.
🥬 Filling (The Actual Concept): Episode-Centric Design
- What it is: Each “episode” is one complete, solvable mobility request with its context, tools replay, and ground truth for evaluation.
- How it works: 1) Query text and context (like city) define the task; 2) A replay snapshot guarantees consistent tool answers; 3) A structured ground-truth log shows the minimal correct tool sequence and evidence.
- Why it matters: It makes each test self-contained and checkable.
🍞 Bottom Bread (Anchor) “Drive from A to B via C, avoiding highways.” That’s one episode with all needed info to test an agent end-to-end.
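One way to picture an episode is as a single record bundling everything the test needs. The field names below are assumptions for illustration; the benchmark's actual schema may differ:

```python
# Hypothetical shape of one self-contained episode record.
from dataclasses import dataclass, field

@dataclass
class Episode:
    query: str                # the natural-language request
    context: dict             # e.g. city, departure point
    replay_snapshot: dict     # frozen tool responses for this episode
    gold_tool_calls: list     # minimal correct tool-call sequence
    constraints: list = field(default_factory=list)

ep = Episode(
    query="Drive from A to B via C, avoiding highways.",
    context={"city": "Shanghai"},
    replay_snapshot={},
    gold_tool_calls=["geocode(C)", "geocode(B)", "route(A, C, B)"],
    constraints=["avoid_highways", "via: C"],
)
print(len(ep.gold_tool_calls))  # → 3
```

Everything needed to run and grade the episode travels in one object, which is what makes each test self-contained and checkable.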
Before vs After
- Before: Tests relied on live, changing APIs or judged only final answers; comparisons weren’t replicable and missed failure points.
- After: MobilityBench freezes the environment and grades multiple skills, enabling fair, deep diagnosis and faster improvement.
Why It Works (Intuition)
- Freeze the world to remove luck. Use real user questions for realism. Check many skills to see both what works and what breaks. That combination makes results meaningful and actionable.
Building Blocks
- Task taxonomy from real voice-to-text user queries across 11 scenarios (info lookups, route-dependent info, basic routing, and preference-constrained routing).
- Ground-truth “standard tool programs” that encode the minimal correct tool calls and validations.
- Deterministic replay sandbox with canonicalization, validation, and safe fallbacks.
- Multi-dimensional metrics that tie behavior to outcomes.
🍞 Bottom Bread (Anchor) It’s like a detailed LEGO kit: clear instructions (ground truth), identical bricks (replay), and a checklist (metrics) so builders can be compared fairly.
03 Methodology
At a high level: Input (user query + context) → Parse and classify intent → Extract slots (locations, time, preferences) → Resolve places (geocoding/POIs) → Plan and call routing tools → Validate constraints → Produce final itinerary and scores.
🍞 Top Bread (Hook) Imagine turning a messy shopping list into a perfect grocery trip with exact aisles and timing.
🥬 Filling (The Actual Concept): Data Curation from Real Queries
- What it is: MobilityBench builds episodes from anonymized, real voice-to-text mobility queries, cleaned to be self-contained and solvable without follow-ups.
- How it works: 1) Filter malformed/ambiguous items; 2) Deduplicate near-copies; 3) Classify intents (open-set) with a small LLM (Qwen-4B) and expert review; 4) Organize 100k episodes into 11 scenarios and 4 families across 350+ cities.
- Why it matters: Realistic, diverse requests make the benchmark meaningful and broad.
🍞 Bottom Bread (Anchor) “Find a pharmacy near Nanshan District” or “Arrive at the airport by 7:30” become well-defined, testable episodes.
🍞 Top Bread (Hook) Think of a recipe that lists only the exact steps you must do—no more, no less.
🥬 Filling (The Actual Concept): Ground-Truth Standard Tool Program
- What it is: For each scenario, experts define the minimal correct sequence of tool calls (SOP) needed to answer the query.
- How it works: 1) Extract and normalize slots (origin, destination, mode, time, preferences); 2) Resolve names into coordinates via POI search or geocoding; 3) Validate parameters; 4) Call routing/traffic/weather tools; 5) Verify constraints; 6) Archive the execution trace and key artifacts as the ground-truth reference.
- Why it matters: A clear, minimal “gold path” lets us check if agents did the right things in the right order.
🍞 Bottom Bread (Anchor) For “Drive to Shanghai Disneyland avoiding tolls,” the gold path looks like: (a) geocode “Shanghai Disneyland”; (b) call driving_planning with avoid_tolls=true; (c) output an ETA and distance that match the tool.
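Checking an agent's trace against such a gold path can be sketched as an ordered-subsequence test. The trace string format below is invented for illustration; the paper's actual matching rules may be stricter:

```python
# Sketch: does the gold tool sequence appear, in order, inside the
# agent's trace? Extra calls are tolerated here (they would instead
# lower precision-style metrics).

def matches_gold(agent_trace, gold_trace):
    it = iter(agent_trace)
    # "step in it" consumes the iterator up to the match, so each
    # gold step must appear AFTER the previous one.
    return all(step in it for step in gold_trace)

gold = ["geocode:Shanghai Disneyland",
        "driving_planning:avoid_tolls=true"]
agent = ["geocode:Shanghai Disneyland",
         "weather:Shanghai",                 # redundant but in-order
         "driving_planning:avoid_tolls=true"]
print(matches_gold(agent, gold))  # → True
print(matches_gold(["driving_planning:avoid_tolls=true"], gold))  # → False
```

The second call fails because routing before geocoding skips a required step — the "right things in the right order" idea from above.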
🍞 Top Bread (Hook) Imagine a practice arena where every target stays exactly where it was last time.
🥬 Filling (The Actual Concept): Deterministic Replay Sandbox
- What it is: A tool gateway that serves pre-recorded, validated API responses.
- How it works: 1) Canonicalize inputs (coordinates, time formats); 2) Cache lookup for exact response; 3) If no exact hit, use safe fuzzy matching (e.g., closest POI within a small radius); 4) Enforce strict schema and type checks; 5) Mark unresolved calls as tool-use failures.
- Why it matters: Guarantees reproducible, fair testing unaffected by live traffic changes or rate limits.
🍞 Bottom Bread (Anchor) Calling driving_planning with the same origin/destination returns the same route and ETA every run.
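The "safe fuzzy matching" fallback (step 3 above) can be sketched as a nearest-recorded-POI lookup within a small radius. The radius value and flat-earth distance are simplifying assumptions for illustration:

```python
# Sketch of the fuzzy-match fallback: if there is no exact cache hit,
# serve the recorded response for the closest POI within a tight
# radius; otherwise return None (a tool-use failure).
import math

recorded_pois = {
    (31.1434, 121.6574): {"name": "Shanghai Disneyland", "eta_min": 42},
}

def fuzzy_lookup(lat, lon, radius_deg=0.001):
    best, best_d = None, radius_deg
    for (plat, plon), resp in recorded_pois.items():
        d = math.hypot(lat - plat, lon - plon)  # crude planar distance
        if d <= best_d:
            best, best_d = resp, d
    return best

print(fuzzy_lookup(31.1435, 121.6574)["name"])  # → Shanghai Disneyland
print(fuzzy_lookup(31.20, 121.40))              # → None
```

The tight radius is what makes the matching "safe": a slightly perturbed coordinate resolves, but a genuinely different location does not silently reuse the wrong recording.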
🍞 Top Bread (Hook) You know how coaches grade players on different drills, not just the final game score?
🥬 Filling (The Actual Concept): Multi-Dimensional Metrics
- What it is: A set of indicators for understanding, planning, tool use, outcomes, and efficiency.
- How it works:
- Instruction Understanding:
- Intent Detection (ID): pick the right scenario label.
- Information Extraction (IE): get all required fields (origins, times, preferences) exactly right.
- Planning:
- Task Decomposition (DEC-P/R): list the right steps without missing or junk steps.
- Tool Use:
- Tool Selection (TS-P/R): choose all needed tools, avoid extras.
- Schema Compliance (SC): fill parameters correctly (formats, ranges, required fields).
- Decision Making:
- Delivery Rate (DR): produce a complete, executable answer.
- Final Pass Rate (FPR): satisfy all explicit/implicit constraints.
- Efficiency:
- Input Tokens (IT) and Output Tokens (OT): track context length and verbosity.
- Why it matters: Pinpoints exactly where agents succeed or fail.
🍞 Bottom Bread (Anchor) If an agent picks the right tools but forgets to set “avoid_highways,” FPR will drop even if DR is high.
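Set-based metrics like Tool Selection precision/recall (TS-P/R) can be sketched as a straightforward overlap computation. The exact scoring in the paper may differ; this is the textbook definition:

```python
# Precision: of the tools the agent called, how many were needed?
# Recall: of the needed tools, how many did the agent call?

def precision_recall(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives: correct tool choices
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r

gold_tools = {"geocode", "driving_planning"}
agent_tools = {"geocode", "driving_planning", "weather"}  # one extra call
p, r = precision_recall(agent_tools, gold_tools)
print(round(p, 2), r)  # → 0.67 1.0
```

The extra weather call costs precision but not recall — exactly the "avoid extras" failure mode TS-P is designed to expose.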
Mini-Sandwiches for Key Sub-Concepts
- 🍞 Hook: Ever label your schoolwork so it goes to the right folder? 🥬 ID is choosing the correct task type (like “POI query” vs “route planning”). Without it, the agent starts the wrong process. 🍞 Anchor: Mistaking “What’s the weather?” for “Plan a bus route” breaks everything.
- 🍞 Hook: Packing your backpack with exactly what you need. 🥬 IE is extracting every required detail (origin, destination, time, preferences). Miss one, and the route may be impossible. 🍞 Anchor: Forgetting “arrive by 7:30” can make you late.
- 🍞 Hook: Writing a to-do list in the right order. 🥬 DEC ensures the step sequence covers all actions without fluff. Without it, tools are called in the wrong order. 🍞 Anchor: Trying to route before geocoding the destination fails.
- 🍞 Hook: Choosing the right tool for the job. 🥬 TS is selecting needed tools (geocoding, routing). Picking wrong or extra tools wastes time or fails. 🍞 Anchor: Calling weather when you need bus_planning adds noise.
- 🍞 Hook: Filling out a form correctly. 🥬 SC checks parameters are complete and well-formed. Without it, tools reject calls. 🍞 Anchor: Forgetting destination coordinates causes an error.
- 🍞 Hook: Finishing your homework. 🥬 DR is whether a full, runnable answer was produced. Without it, the user gets nothing. 🍞 Anchor: Route missing steps or links is undeliverable.
- 🍞 Hook: Meeting all the rules on a science fair rubric. 🥬 FPR checks every constraint is met. Without it, answers look good but break promises. 🍞 Anchor: A route that uses a toll road when you said “avoid tolls” fails.
- 🍞 Hook: Using too many words in an essay. 🥬 IT/OT track how much the model reads/writes. Too big means slower, costlier runs. 🍞 Anchor: A concise plan with the same quality is better.
Concrete Example Walkthrough
- Query: “Drive from my location to Shanghai Disneyland, avoid tolls, arrive by 7:30 PM.”
- ID: Classify as Preference-Constrained Route Planning.
- IE: Extract origin (current location), destination (Shanghai Disneyland), mode (drive), preferences (avoid_tolls), time (arrive_by=19:30).
- Resolve: reverse_geocoding for the current location; poi_query to get exact Disneyland coordinates.
- Plan: If arrive-by, search the departure window; call driving_planning with avoid_tolls=true.
- Validate: Check that the computed ETA arrives by 19:30; if not, propose an earlier departure or an alternative route.
- Deliver: Present the route, steps, ETA, distance, and toll-free confirmation.
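The walkthrough above compresses into a small pipeline. All tool stubs, names, and outputs below are invented for illustration; the real tools are served by the replay sandbox:

```python
# Compact sketch of the walkthrough: Resolve -> Plan -> Validate -> Deliver.

def geocode(name):
    # Stubbed POI lookup for the destination.
    return {"Shanghai Disneyland": (31.1434, 121.6574)}[name]

def driving_planning(origin, dest, avoid_tolls=False):
    # Stubbed router: honors the toll preference, arrives at 19:05.
    return {"dest": dest, "toll_free": avoid_tolls, "eta": "19:05"}

def run_episode(slots):
    dest = geocode(slots["destination"])                       # Resolve
    route = driving_planning(slots["origin"], dest,            # Plan
                             avoid_tolls=slots["avoid_tolls"])
    # Validate: zero-padded HH:MM strings compare correctly as strings.
    ok = route["toll_free"] and route["eta"] <= slots["arrive_by"]
    return {"route": route, "pass": ok}                        # Deliver

result = run_episode({"origin": "current location",
                      "destination": "Shanghai Disneyland",
                      "avoid_tolls": True, "arrive_by": "19:30"})
print(result["pass"])  # → True
```

Tightening the deadline to 19:00 would flip "pass" to False even though the route itself is deliverable — the DR-high, FPR-low case from the metrics section.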
Secret Sauce
- Realistic data + frozen environment + gold execution traces + multi-skill grading = a benchmark that’s fair, deep, and actually helpful for improving agents.
04 Experiments & Results
🍞 Top Bread (Hook) Imagine a big sports day where athletes are tested on speed, strength, skill, and teamwork—not just who crosses the finish line first.
🥬 Filling (The Actual Concept): What They Tested
- What it is: The team measured five ability areas—understanding, planning, tool use, decision quality, and efficiency—over 7,098 sampled episodes from the 100,000-episode pool.
- How it works: They compared many top LLM backbones (OpenAI GPT, Anthropic Claude, Google Gemini, Qwen, DeepSeek) and two agent frameworks (ReAct vs. Plan-and-Execute) under the same replayed conditions with fixed prompts and step caps.
- Why it matters: We get a clean scoreboard that’s fair and repeatable across diverse, real-world scenarios.
🍞 Bottom Bread (Anchor) It’s like a multi-event tournament where every runner uses the same track and timer.
The Competition
- Backbones: Closed-source (GPT-4.1/5.2, Claude-Sonnet/Opus 4.5, Gemini-3 Pro/Flash) and open-source (Qwen-3 series from 4B to 235B-A22B MoE, DeepSeek-V3.2-Exp, DeepSeek-R1 for reasoning).
- Frameworks: ReAct (think-act-observe loops) vs. Plan-and-Execute (plan first, then carry out).
The Scoreboard (with context)
- Decision Quality:
- Under Plan-and-Execute, Claude-Opus-4.5 achieved the highest Delivery Rate (about 83.53%) and a strong Final Pass Rate (about 65.77%). Think: solid A in reliability.
- Under ReAct, Gemini-3-Pro-Preview hit the top Final Pass Rate (~69.09%), showing strong adaptability in iterative tool use. Think: A- to A in finding correct end solutions.
- Understanding and Tool Use:
- Top closed-source models led instruction understanding, but the gap is narrowing. Open-source Qwen3-235B-A22B and DeepSeek-V3.2-Exp were very competitive, with strong DR/FPR and lower cost profiles.
- Efficiency Trade-offs:
- ReAct generally used ~35% more input tokens than Plan-and-Execute, implying higher costs and latency.
Scenario-Level Findings
- Easier: Basic Information Retrieval and Basic Route Planning—models perform well here.
- Harder: Preference-Constrained Route Planning—where models stumble, especially when preferences must be strictly obeyed.
- Framework twist: Plan-and-Execute sometimes shines on structured, constraint-heavy problems because preplanning reduces drift and hallucination. ReAct excels in dynamic adaptation but pays a token cost.
Surprises and Insights
- Bigger or reasoning-tuned models generally score higher, following classic scaling behavior. Under the same architecture, moving from 4B to 32B helps; MoE (Qwen-235B-A22B, 22B active) gives further gains.
- Longer plans: Large models often produce more steps (some redundant) but achieve higher final pass rates due to broader search-and-verify behavior.
- Thinking mode (chain-of-thought style): Consistently boosts success across models (e.g., Qwen-30B-A3B gains ~+5.98% FPR), but increases token usage and latency, making deployment heavier. DeepSeek-R1, a strong reasoning model, reached ~70.46% FPR in a sample—impressive but costly.
🍞 Bottom Bread (Anchor) Think of ReAct as a careful hiker who checks the map at every fork (more time, fewer surprises), and Plan-and-Execute as a planner who studies the whole trail first (faster, but can be thrown off by unexpected turns).
05 Discussion & Limitations
🍞 Top Bread (Hook) No tool is perfect the first time you build it—even the best bike needs tuning.
🥬 Filling (The Actual Concept): Honest Assessment
- Limitations
- Map Ecosystem Scope: The sandbox is built on AMap tool responses; while it spans 22 countries and 350+ cities, results may not directly transfer to other map providers without adaptation.
- Frozen World: Replay ensures fairness but removes real-time surprises (sudden traffic, service outages). Agents that thrive on live adaptation don’t get to show that here.
- No Clarifications: Episodes are designed to be solvable without follow-ups; this limits evaluation of dialog-based preference elicitation.
- Speech Nuances: Voice-to-text queries are realistic, but ASR errors aren’t modeled separately; misheard inputs are filtered out in curation.
- Preference Breadth: Many preferences are covered (avoid tolls/highways, fewer transfers), but the infinity of bespoke constraints in the real world can go beyond benchmark scope.
- Required Resources
- Access to the dataset and sandbox toolkit; an LLM backbone; an agent framework (e.g., ReAct or Plan-and-Execute); compute budget mindful of token usage caps.
- When Not to Use
- If you need to test live, time-sensitive adaptation (e.g., changing traffic in real time), MobilityBench’s replay may be too static.
- If your agent depends on multi-turn clarifications, this benchmark’s single-turn episodes won’t showcase that strength.
- Open Questions
- How to best blend planning and reactive loops for both high FPR and low cost?
- Can agents reliably handle more creative or layered constraints (e.g., scenic detours plus electric charging availability plus weather risk)?
- What training or fine-tuning helps models extract preferences more precisely from messy language?
- How to generalize across different map providers and regions while keeping replay fairness?
- Can we simulate realistic, bounded non-determinism (e.g., small ETA swings) without losing reproducibility?
🍞 Bottom Bread (Anchor) It’s like testing a car in a wind tunnel: you learn a lot under controlled air, but you’ll still want a road test later.
06 Conclusion & Future Work
🍞 Top Bread (Hook) Imagine giving every route-planning AI the same fair obstacle course, the same weather, and the same stopwatch.
🥬 Filling (The Actual Concept): Takeaway
- 3-Sentence Summary: MobilityBench is a large, realistic, and reproducible benchmark for testing AI route-planning agents built from real user queries. It freezes map tool responses in a deterministic replay sandbox and scores multiple skills—from understanding to final correctness and efficiency. Results across leading models and frameworks reveal strong basic abilities but hard challenges in honoring user preferences.
- Main Achievement: Turning messy, real-world mobility requests into clean, comparable, multi-skill evaluations by combining a replayable tool environment with structured ground-truth programs and a detailed scoring protocol.
- Future Directions: Add more providers and cities, model limited non-determinism, support clarification turns, broaden preference types (e.g., EV charging, accessibility), and explore hybrid planning + reactive strategies that balance robustness and cost.
- Why Remember This: MobilityBench shows how to fairly measure what truly matters for everyday navigation—understanding people’s words, using the right tools, and delivering routes that keep promises—so tomorrow’s travel AIs can be trustworthy co-pilots.
🍞 Bottom Bread (Anchor) Like a well-designed driving test makes safer drivers, MobilityBench’s fair and thorough testing can make route-planning AIs safer, smarter, and more dependable.
Practical Applications
- Compare different route-planning agents fairly before deploying one in a navigation app.
- Diagnose where your agent fails (e.g., misunderstanding preferences) and target fine-tuning there.
- Evaluate new prompting or reasoning strategies (e.g., Plan-and-Execute vs. ReAct) under identical conditions.
- Test cost-saving strategies by tracking input/output tokens while maintaining accuracy.
- Benchmark open-source vs. closed-source backbones to choose a cost-effective solution for private deployment.
- Validate preference-handling features (avoid tolls/highways, limit transfers) before releasing to users.
- Train or distill smaller models using ground-truth tool programs as supervision signals.
- Stress-test schema compliance to reduce production errors in tool calls.
- Prototype hybrid planning + reactive loops and measure gains without live API noise.
- Extend the benchmark to new cities/tools and re-run tests for consistent cross-region evaluation.