GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Key Summary
- GUI-Libra is a training recipe that helps computer-using AI agents both think carefully and click precisely on screens.
- It fixes a common problem: long, rambling reasoning often hurts where the model should actually tap or type.
- The authors build an 81K-step, action-aligned reasoning dataset and clean it so thoughts match the actual on-screen actions.
- They introduce Action-aware Supervised Fine-Tuning (ASFT), which teaches the model with both "reason-then-act" and "act-directly" styles while giving extra weight to action and coordinate tokens.
- They design Reinforcement Learning (RL) that works even when only one of many valid actions is labeled as "correct" (partial verifiability).
- A small KL "stay-close" penalty keeps the policy from drifting too far, making offline scores better predict real task success.
- They add Success-adaptive Negative Gradient Scaling (SNGS) to downweight uncertain "wrong" feedback so the model doesn't over-punish valid alternatives.
- Across AndroidWorld, WebArena-Lite-v2, and Online-Mind2Web, GUI-Libra consistently boosts both step accuracy and end-to-end success, often rivaling larger or closed systems.
- Careful data curation plus tailored post-training, not massive online data collection, unlocks big gains in long-horizon GUI navigation.
- All data, code, and models are released to help others train strong, reasoning-capable GUI agents efficiently.
Why This Research Matters
Stronger GUI agents can reliably help with everyday digital chores, from booking trips to managing schoolwork, because they not only plan but also click exactly right. By making offline training results actually predict real-world performance, teams can build useful assistants without expensive, fragile online data collection. This efficiency levels the playing field so smaller, open models can compete with giant, closed systems. Accurate, explainable actions reduce frustration and prevent misclicks that waste time or cause errors at checkout or in forms. The released dataset and code let the community reproduce and extend these gains. Overall, this work makes trustworthy, capable computer-use assistants more accessible to everyone.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how using a phone or a computer to finish a chore, like booking tickets or saving a recipe, takes many small steps in the right order? An AI helper must both understand the screen and choose the right taps and types, step after step.
🥬 Filling (The Actual Concept)
- What it is: This paper tackles how to train "native GUI agents" (single models that see the screen, read your goal, and directly output actions) to reason well over many steps and still click precisely.
- How it works (step by step):
- Start with existing open GUI interaction data.
- Clean and enrich it so the agent's thoughts match the actions it should take.
- Fine-tune the model so it learns both to explain and to act, but gives extra attention to the action parts.
- Use a careful kind of RL that handles uncertain rewards and keeps the model stable.
- Test on realistic benchmarks to check both single steps and full tasks.
- Why it matters: Without this, agents often talk too much and click badly, or they learn from fuzzy signals and get unstable, failing long tasks.
- 🍞 Bottom Bread (Anchor) Imagine you ask an AI to "Find a sports shoe on a store app and save it." It must swipe to find search, type the shoe name, scroll, open the item, and tap save. GUI-Libra helps it think through these steps without messing up where to tap.
🍞 Top Bread (Hook) You know how highlighting important words helps you answer a question better? Focusing on the right bits matters.
🥬 Filling (Visual Grounding)
- What it is: Visual grounding is the skill of pointing to the exact spot or element on the screen to interact with.
- How it works:
- Read the instruction (e.g., "Tap the star icon").
- Look over the screenshot to find the matching element.
- Output precise coordinates (x, y) to tap.
- Check that the point lies inside the target element's box.
- Why it matters: Without good grounding, even perfect reasoning won't help; the model will tap the wrong place.
- 🍞 Bottom Bread (Anchor) If the instruction says "Tap the star to save," grounding finds the star icon's spot, not a random dot nearby.
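The grounding check described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code; the (left, top, right, bottom) box format and the pixel values are assumptions.

```python
# Minimal sketch of a grounding check: a predicted tap "passes" grounding
# if its (x, y) coordinate lands inside the target element's bounding box.
# Box format assumed here: (left, top, right, bottom) in screen pixels.

def point_in_box(point, box):
    """Return True if the tap coordinate lies inside the element's box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

# Example: suppose the star icon occupies a 60x60 px region around (540, 1429).
star_box = (510, 1399, 570, 1459)
print(point_in_box((540, 1429), star_box))  # tap on the star -> True
print(point_in_box((540, 1200), star_box))  # tap above it    -> False
```

The same containment test reappears later in the pipeline, both as a data filter and as part of the RL reward.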
🍞 Top Bread (Hook) Think of following a recipe: you need to plan multiple steps, not just crack one egg.
🥬 Filling (Long-horizon Navigation)
- What it is: Tasks that require many correct steps in a row across changing screens.
- How it works:
- Observe the current screen.
- Recall previous actions.
- Decide the next action that moves toward the goal.
- Repeat until success or time runs out.
- Why it matters: One small error changes the next screen and can derail the whole plan.
- 🍞 Bottom Bread (Anchor) Buying a textbook online involves search, filters, cart, checkout; each step depends on the last.
🍞 Top Bread (Hook) You know how explaining your reasoning can help you choose wisely but can also slow you down?
🥬 Filling (Chain-of-Thought, CoT)
- What it is: The model writes out its thoughts before giving an action.
- How it works:
- Describe what it sees.
- Reflect on the goal and options.
- Plan the next move.
- Output an executable action.
- Why it matters: Too-long thoughts can crowd out attention to precise tapping, hurting grounding.
- 🍞 Bottom Bread (Anchor) If the model writes 300 tokens of thoughts before tapping, it's more likely to miss the tiny star icon.
🍞 Top Bread (Hook) Imagine practicing by copying a coach's moves.
🥬 Filling (Supervised Fine-Tuning, SFT)
- What it is: Training by showing the model input-output pairs and asking it to imitate them.
- How it works:
- Show the instruction, history, and screenshot.
- Provide the correct reasoning trace and action.
- Ask the model to match those tokens.
- Repeat across many steps and tasks.
- Why it matters: SFT teaches the basic skills but can overemphasize long thoughts and underemphasize precise clicks.
- 🍞 Bottom Bread (Anchor) If the dataset says "Click star at (540, 1429)," SFT makes the model learn to output that action.
🍞 Top Bread (Hook) Think of learning by trial and error, like hot and cold hints when finding a hidden toy.
🥬 Filling (Reinforcement Learning, RL)
- What it is: Training by letting the model try actions and rewarding good outcomes.
- How it works:
- Sample several candidate actions for a state.
- Score them based on correctness and format.
- Push the policy toward better-scoring actions.
- Repeat, keeping the policy stable.
- Why it matters: RL can refine decisions beyond copying, but needs careful design to avoid instability.
- 🍞 Bottom Bread (Anchor) If "tap star" gets a positive score and "tap heart" gets low, RL nudges the model toward "tap star."
🍞 Top Bread (Hook) Suppose there are many right ways to solve a puzzle, but your teacher only marks one of them as correct.
🥬 Filling (Partial Verifiability)
- What it is: In GUIs, several actions might be valid, but the dataset often marks just one as "correct."
- How it works:
- Label gives one demonstrated action per step.
- Other valid actions look like mistakes to the verifier.
- Negative feedback becomes noisy and misleading.
- Training can overfit to one path and penalize good alternatives.
- Why it matters: This makes offline step-matching a weak predictor of real task success.
- 🍞 Bottom Bread (Anchor) Both "tap search bar" and "tap magnifying glass" could open search, but only one gets credit.
🍞 Top Bread (Hook) Think about practicing on worksheets (offline) versus taking the real exam (online).
🥬 Filling (Offline vs Online Metrics)
- What it is: Offline checks if your action matches the demo on fixed data; online checks if you finish tasks in a real loop.
- How it works:
- Offline: match the demo's action step-by-step.
- Online: run in the environment; screens change with your actions.
- Compare how well offline predicts online.
- Reduce the gap by stabilizing training.
- Why it matters: High offline scores can still fail online if the policy drifts to unseen states or valid alternatives aren't credited.
- 🍞 Bottom Bread (Anchor) A student who memorizes answers might ace practice, but stumble on the live test when questions are reordered.
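The offline step metrics can be sketched as exact-match checks. This is a deliberate simplification (real evaluation also credits taps that land anywhere inside the target element's box), and the (type, coordinate) action tuples are illustrative, not the benchmark's format.

```python
# Rough sketch of the offline step metrics, not the paper's harness:
# Pass@1 scores the single greedy action against the demo label;
# Pass@k credits the step if any of k sampled candidates matches.
# Actions are simplified to (type, coordinate) tuples for illustration.

def pass_at_1(predicted, demo):
    return 1.0 if predicted == demo else 0.0

def pass_at_k(candidates, demo):
    return 1.0 if demo in candidates else 0.0

demo = ("Click", (540, 1429))
print(pass_at_1(("Click", (540, 1429)), demo))                     # exact match -> 1.0
print(pass_at_k([("Scroll", None), ("Click", (540, 1429))], demo)) # one of k matches -> 1.0
```

Online success, by contrast, can only be measured by rolling the agent out in a live environment where each action changes the next screen, which is exactly why the two metrics can disagree.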
🍞 Top Bread (Hook) Before this paper, many teams tried simple recipes: either only copy demos or do RL that chases easy-to-check clicks.
🥬 Filling (The Problem and Gap)
- What it is: Long CoT in SFT hurts grounding, and step-wise RL rewards are ambiguous, making offline metrics poorly predict online success.
- How it works:
- Datasets have noisy labels and short thoughts.
- SFT loss is dominated by reasoning tokens.
- RL treats uncredited-but-valid actions as wrong.
- Policies drift away from demo data and get unstable.
- Why it matters: Models think a lot but click badly, or click well in practice data but fail live tasks.
- 🍞 Bottom Bread (Anchor) Agents that type long essays but miss the tiny search bar won't help you finish tasks faster.
02 Core Idea
🍞 Top Bread (Hook) Imagine a tightrope walker who must balance thinking (planning steps) and doing (placing feet exactly). Too much thinking, and they wobble; too little, and they step wrong.
🥬 Filling (The Aha! Moment)
- What it is: Train GUI agents with action-aware supervision that emphasizes actions and coordinates, then apply conservative RL with a small KL "stay-close" leash and success-adaptive scaling so the model learns from uncertain feedback without drifting.
- How it works (three analogies):
- Music lesson: First practice the melody (actions/coordinates) louder than the background (long thoughts). Then perform with a metronome (KL) so you don't speed up or slow down wildly. If a judge can't hear some notes, don't punish yourself too hard (SNGS).
- Navigation coach: Learn both a route with explanations and a direct route, but give extra attention to the turns themselves. When exploring variants, tie a safety rope (KL) so you don't get lost, and ignore suspiciously harsh booing from the crowd when your alternative route is also fine (SNGS).
- Homework vs. tests: Study using clean answer keys that line up with the steps. On practice tests, don't wander far from what worked (KL), and treat unclear red X's gently (SNGS) because there may be more than one right way.
- Before vs After:
- Before: CoT-heavy SFT made clicks worse; RL without constraints chased proxy rewards and collapsed policy diversity; offline scores didn't predict online success.
- After: Mixed supervision + token reweighting keep grounding strong even with thoughts; KL-regularized RL stabilizes learning and improves offline-to-online alignment; SNGS avoids over-penalizing valid alternatives.
- Why it works (intuition, no equations):
- Weight the tokens that truly execute the task (action and coordinates) so thoughts don't drown them out.
- Keep the new policy close to the demo-initialized one (KL) so it visits familiar states and maintains credit on the demonstrated action, making offline scores more meaningful.
- Downweight negative gradients when group success is low (SNGS), because many "misses" might actually be good-but-uncredited actions.
- Building blocks (each with a mini sandwich):
• 🍞 Action-aware Supervised Fine-Tuning (ASFT)
- What it is: A training objective that mixes reason-then-act and act-only data while giving extra weight to action and grounding tokens.
- How it works:
- Train on both reasoning+action samples and action-only samples.
- Reweight tokens so action and coordinates count more in the loss.
- Keep the ability to output either concise actions or full thoughts.
- Reduce the harm of long thoughts on precision.
- Why it matters: It preserves grounding while keeping useful reasoning.
- 🍞 Anchor: The model can either say "Tap the star at (540,1429)" directly or think briefly first; both stay accurate.
• 🍞 Token Reweighting
- What it is: Turning up the volume on action and coordinate tokens during SFT.
- How it works:
- Detect tokens inside the answer block.
- Separate semantic action tokens from coordinate tokens.
- Apply bigger weights to coordinates (and action fields) in the loss.
- Train so the model learns to be extra-precise when pointing.
- Why it matters: It stops long thoughts from overshadowing where to click.
- 🍞 Anchor: The numbers (x, y) for the star icon matter more than a long paragraph.
• 🍞 Partially Verifiable RL
- What it is: RL where only one of several valid actions gets credit.
- How it works:
- Sample groups of candidate actions.
- Score them with format and correctness checks.
- Use group-relative advantages to update the policy.
- Recognize that many "negatives" are uncertain.
- Why it matters: It fits real GUIs where multiple moves can make progress.
- 🍞 Anchor: Both "tap search bar" and "tap magnifying glass" move you toward search.
• 🍞 KL Regularization (trust region)
- What it is: A small penalty that keeps the updated policy close to the reference model.
- How it works:
- Measure how different the new action distribution is from the reference.
- Add a small penalty for drifting too far.
- Maintain healthy policy entropy (diversity).
- Improve the link between offline and online scores.
- Why it matters: Prevents collapse and keeps learning predictable.
- 🍞 Anchor: Like a metronome, it keeps your tempo steady during practice.
• 🍞 Success-adaptive Negative Gradient Scaling (SNGS)
- What it is: Downweight negative updates when many samples disagree with the demo, because negatives are ambiguous.
- How it works:
- Compute group success rate for the demo action.
- Scale negative advantages by a factor that grows with that success.
- Keep positive updates untouched.
- Avoid over-punishing good alternatives.
- Why it matters: It stabilizes RL under partial credit.
- 🍞 Anchor: If few picked the demo action, being "wrong" might not mean bad; penalize gently.
03 Methodology
At a high level: Input (instruction + history + screenshot) → Stage 1: Action-aware Supervised Fine-Tuning (mixed supervision + token reweighting) → Stage 2: Conservative RL (KL-regularized GRPO + success-adaptive scaling) → Output (reasoned, executable action that succeeds online).
Stage 0: Data curation and filtering (the prep kitchen) 🍞 Hook: You can't bake a great cake with stale flour. 🥬 The Concept (Data Curation Pipeline)
- What it is: A process to augment, align, and filter GUI trajectories so thoughts and actions match.
- How it works:
- Aggregate open web and mobile trajectories (clean extremes and compound steps).
- Convert to a unified format: <think> reasoning </think> + <answer> JSON action </answer>.
- Augment reasoning with structured prompts that describe observation, reflection, and plan.
- Filter by re-prediction agreement (drop uncertain samples) and bounding-box verification (keep coordinates inside predicted boxes).
- Why it matters: Clean, action-aligned thoughts teach the model to both think and act reliably.
- 🍞 Anchor: If the action says "Tap 'Search' button," the coordinate must land inside the 'Search' button box.
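The bounding-box verification filter above can be sketched as a simple containment test over labeled samples. The field names ("point", "box", "target") are illustrative assumptions, not the released dataset's schema.

```python
# Hedged sketch of the bounding-box verification filter: keep a training
# sample only if its labeled tap coordinate lands inside the bounding box
# of the element the action claims to target. Field names are illustrative.

def passes_bbox_filter(sample):
    x, y = sample["point"]
    left, top, right, bottom = sample["box"]
    return left <= x <= right and top <= y <= bottom

samples = [
    {"target": "Search button", "point": (120, 80), "box": (100, 60, 200, 100)},
    {"target": "Search button", "point": (400, 80), "box": (100, 60, 200, 100)},
]
clean = [s for s in samples if passes_bbox_filter(s)]
print(len(clean))  # 1: the second sample's tap misses the button box and is dropped
```

The paper's pipeline pairs this geometric check with a re-prediction agreement filter, so a sample survives only if it is both self-consistent and spatially valid.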
Stage 1: Action-aware Supervised Fine-Tuning (ASFT) 🍞 Hook: Practice both with explanations and with quick moves, but focus most on the moves. 🥬 The Concept (ASFT)
- What it is: Train on a mixture of reason-then-act and act-only data, giving extra weight to action and coordinate tokens.
- How it works:
- Build two views of each step: (a) full reasoning+action, (b) action-only.
- Train so the model learns to output either style at inference.
- Reweight tokens: action tokens > reasoning tokens; coordinates often highest.
- Keep response lengths healthy so grounding doesnāt degrade.
- Why it matters: It preserves accurate clicking while keeping useful reasoning available.
- 🍞 Anchor: The model can reply "Tap the star at (540,1429)" with or without a long explanation, and still hit the right spot.
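The token reweighting at the heart of ASFT can be sketched as a weighted mean over per-token losses. The weights below are made up for illustration, and the three-way "reason/action/coord" role split is a simplification of however the paper actually partitions the target tokens.

```python
# Hedged sketch of ASFT-style token reweighting: a weighted mean of
# per-token losses where coordinate tokens count most, other action
# tokens next, and reasoning tokens least. Weights are illustrative.

ROLE_WEIGHTS = {"reason": 1.0, "action": 2.0, "coord": 4.0}

def asft_loss(per_token_nll, token_roles):
    """per_token_nll: negative log-likelihood of each target token.
    token_roles: 'reason' | 'action' | 'coord' for each token."""
    weights = [ROLE_WEIGHTS[r] for r in token_roles]
    weighted = sum(w * nll for w, nll in zip(weights, per_token_nll))
    return weighted / sum(weights)

# A badly-predicted coordinate dominates the loss even though reasoning
# tokens outnumber it, which is exactly the intended emphasis.
nll   = [0.1, 0.1, 0.1, 0.5, 2.0]
roles = ["reason", "reason", "reason", "action", "coord"]
print(round(asft_loss(nll, roles), 3))  # 1.033
```

With uniform weights the same losses would average to 0.56, so the reweighting roughly doubles the pressure on the mispredicted coordinate.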
Example with actual data
- Input: "Tap on the star icon to save the recipe," screenshot with a visible star, prior step history.
- Output: <think> I see the star near the recipe title. To save it, tap the star. </think> <answer> {"action_type": "Click", "action_target": "Star save icon", "point_2d": [540, 1429]} </answer>
- What breaks without ASFT: If all weight goes to long thoughts, the model's taps drift and miss the star.
Stage 2: Conservative RL with partial verifiability 🍞 Hook: When feedback is uncertain, explore gently and stay near what you already do well. 🥬 The Concept (Partially Verifiable RL with KL)
- What it is: An RL setup that samples candidate actions, scores them with partially verifiable rewards, and uses a KL trust region to avoid policy drift.
- How it works:
- For each state, sample a group of candidate actions (GRPO).
- Score each action: 10% for valid format; 90% for action-type, value-match, and whether the point falls in the target box.
- Compute group-relative advantages and update policy.
- Add a small KL penalty to keep the policy near the SFT reference, preserving predictability.
- Why it matters: It improves alignment between offline step scores and real task success.
- 🍞 Anchor: The model keeps a "metronome" on (KL) while it practices picking the best of its candidate taps.
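The reward and advantage computation can be sketched as follows. The 10% format / 90% correctness split comes from the text above, but the equal 0.3 sub-weights for action type, value match, and point-in-box, as well as the action dict layout, are assumptions; the KL penalty enters the policy update itself and is not shown here.

```python
# Illustrative sketch of the partially verifiable step reward and the
# group-relative advantage used in GRPO-style updates. Sub-weights and
# field names are assumptions, not the paper's exact scheme.

def step_reward(action, demo, target_box):
    reward = 0.0
    if action.get("valid_format"):                 # 10%: parseable action JSON
        reward += 0.1
    if action.get("type") == demo["type"]:         # 90%: correctness checks
        reward += 0.3                              #   matching action type
        if action.get("value") == demo.get("value"):
            reward += 0.3                          #   matching value/target
        x, y = action.get("point", (-1, -1))
        left, top, right, bottom = target_box
        if left <= x <= right and top <= y <= bottom:
            reward += 0.3                          #   tap inside target box
    return reward

def group_relative_advantages(rewards):
    """GRPO-style advantage: each sample's reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

demo = {"type": "Click", "value": "Star save icon"}
box = (510, 1399, 570, 1459)
good = {"valid_format": True, "type": "Click", "value": "Star save icon", "point": (540, 1429)}
bad = {"valid_format": True, "type": "Scroll"}
rewards = [step_reward(good, demo, box), step_reward(bad, demo, box)]
print(group_relative_advantages(rewards))  # good action gets the positive advantage
```

Because advantages are relative within the sampled group, the update pushes probability mass toward the better candidates without needing an absolute notion of "correct."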
Success-adaptive Negative Gradient Scaling (SNGS) 🍞 Hook: Don't scold yourself harshly when the grading might be unfair. 🥬 The Concept (SNGS)
- What it is: Downweight negative gradients based on how often the demo action wins within the sampled group.
- How it works:
- Measure group success ratio for the demo action.
- If it's low, many "negatives" might be valid alternatives; shrink the negative update.
- If itās high, stronger negatives are safe.
- Keep positive updates intact.
- Why it matters: It avoids overfitting to a single demonstrated path when multiple are valid.
- 🍞 Anchor: Both tapping the search bar or magnifying glass can open search; don't punish one too hard if it's not the labeled choice.
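The scaling rule can be sketched as below. A simple linear scale on the group success rate is used here for illustration; the paper's actual scaling function may differ.

```python
# Hedged sketch of SNGS: shrink negative advantages when the demonstrated
# action rarely "wins" within the sampled group, since those failures may
# be valid-but-uncredited alternatives. Linear scaling is an assumption.

def sngs_scale_advantages(advantages, successes):
    """advantages: group-relative advantages for each sampled action.
    successes: 1 if the sample matched the demo action, else 0."""
    success_rate = sum(successes) / len(successes)
    scaled = []
    for adv in advantages:
        if adv < 0:
            scaled.append(adv * success_rate)  # gentle when success is rare
        else:
            scaled.append(adv)                 # positive updates untouched
    return scaled

# Only 1 of 4 samples matched the demo, so negatives shrink to 25% strength.
print(sngs_scale_advantages([0.75, -0.25, -0.25, -0.25], [1, 0, 0, 0]))
# [0.75, -0.0625, -0.0625, -0.0625]
```

When the demo action wins often (high success rate), the negatives pass through almost unchanged, so confident mistakes are still punished at full strength.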
Secret sauce (what's clever)
- Mixed-mode learning prevents long thoughts from hijacking precision.
- Token reweighting spotlights the exact tokens that move the cursor.
- KL regularization keeps exploration on-track so offline scores predict online success.
- SNGS respects that GUIs often have multiple correct next steps.
What breaks without each step
- Without curation: Noisy labels and mismatched thoughts hurt both reasoning and clicks.
- Without ASFT: CoT-heavy SFT degrades grounding; act-only SFT weakens planning.
- Without KL: Training reward rises but online success drops (policy collapse/reward hacking).
- Without SNGS: The policy over-penalizes valid alternatives and becomes brittle.
End-to-end flow
- Input → Curated sample → ASFT update (mixed + reweight) → KL-regularized GRPO + SNGS → Output: <think> … </think> <answer> {action JSON} </answer> → Better offline scores that now predict higher online task success.
04 Experiments & Results
🍞 Hook: Testing an agent is like trying a recipe in different kitchens: a phone, a local web sandbox, and real live websites.
🥬 The Test
- What they measured and why:
- Step accuracy (Pass@1, Pass@4): Does the next action match the demo exactly? Tests precision.
- Online task success: Does the agent actually finish the task in a live, step-by-step run? Tests real-world usefulness.
- Grounding vs. response length: Do longer thoughts hurt where the model clicks? Tests the reasoning-grounding trade-off.
- Offline-to-online correlation: Do good offline scores predict online wins? Tests training reliability.
The Competition
- Open-source native models (e.g., Qwen2.5-VL-3/7/32/72B, Qwen3-VL-4/8/32B, GUI-R1, UI-TARS) and proprietary LLMs paired with a grounding module.
- Also compared against agent frameworks that add extra modules (e.g., step-wise summaries), to separate pure native-model strength from toolkits.
Scoreboard with context
- AndroidControl-v2 (offline step accuracy):
  • GUI-Libra-3B beats its base model by +20.9 Pass@1 on high-level and +14.8 on low-level tasks, with big gains in Pass@4. That's like moving from a C to a strong B+/A- with a much smaller model.
  • GUI-Libra-4B/8B also post strong margins over their baselines, rivaling much larger models.
- MM-Mind2Web-v2 (offline step accuracy across cross-task/site/domain):
  • GUI-Libra-7B improves Pass@1 by +14.0 over its 7B baseline; GUI-Libra-3B jumps +19.3 on average, huge for a small model.
- AndroidWorld (online, 20 steps):
  • GUI-Libra-4B/8B hit 42.6% success, beating many larger native models and matching or exceeding some agent frameworks with extra modules. Compared to baselines, GUI-Libra-8B gains +12.2 points; GUI-Libra-3B leaps from 3.5% to 25.2% (+21.7), like jumping from barely passing to a solid pass.
- WebArena-Lite-v2 (online web sandbox, 15 steps):
  • Even with relatively little web training data, GUI-Libra-7B rises from 4.9% to 22.6%; GUI-Libra-8B from 15.3% to 26.6%. These are strong, parameter-efficient results, often matching or beating much larger systems.
- Online-Mind2Web (live websites):
  • GUI-Libra-8B lifts overall scores from 19.3 to 28.0, topping all evaluated native models; GUI-Libra-7B moves from 15.8 to 25.5; GUI-Libra-4B from 21.7 to 25.7; and GUI-Libra-3B from 4.8 to 21.3.
Surprising/Important Findings
- Longer thoughts hurt grounding unless you train with action-aware weighting and mixed supervision; then the harm is largely removed.
- A small KL penalty doesnāt change training reward much, but it dramatically improves online performance and prevents policy collapse.
- Offline-to-online correlation becomes strong with KL (0.89), showing that the method makes offline scores meaningful.
- Mixing direct grounding data into RL boosts pure grounding benchmarks but can reduce performance on reasoning-heavy navigation tasks: a trade-off knob depending on your goal.
- SNGS helps online generalization notably, with modest trade-offs on some low-level step metrics.
05 Discussion & Limitations
🍞 Hook: No tool is perfect; it's best to know when it shines and when to try something else.
🥬 Limitations
- Data scope: GUI-Libra relies on curated open-source trajectories. Great for efficiency, but it doesn't yet explore large-scale fully online RL where the agent collects new data live.
- Web data volume: The training corpus includes fewer web trajectories than some rivals; even so, results are strong, but more web-rich data could help further.
- Hyperparameters: SNGS brings gains but is sensitive to settings; careful tuning may be required for new domains.
- Partial verifiability remains: Even with SNGS and KL, some ambiguity persists; better multi-action verifiers would help.
Required Resources
- Vision-language backbones (3B-8B) and GPUs for SFT + RL.
- Clean, action-aligned reasoning data (their released 81K set helps).
- A reproducible evaluation harness for both offline (step) and online (task) testing.
When NOT to Use
- If you only need single-step grounding without multi-step planning, a simpler grounding-only pipeline may be faster.
- If your environment provides fully verifiable step rewards (only one true action), standard RLVR without KL may suffice.
- If you cannot afford any RL runs, SFT with mixed supervision and token reweighting still helps, but you'll miss RL's online robustness.
Open Questions
- Can we learn verifiers that credit multiple valid actions automatically, reducing ambiguity?
- How to scale to fully online RL efficiently without heavy infrastructure?
- Can we adaptively decide when to reason vs. act-directly at inference for best speed-accuracy trade-offs?
- How to personalize agents that learn user-specific UI patterns while staying robust to layout shifts?
- Can we fuse world models or simulators to predict future screens and plan even better?
06 Conclusion & Future Work
Three-sentence summary
- GUI-Libra shows that carefully curated, action-aligned reasoning data plus action-aware SFT and conservative, KL-regularized RL can make GUI agents both think clearly and click precisely.
- It stabilizes learning under partially verifiable rewards and makes offline scores reliably predict real task success, without costly online data collection.
- Across phone and web benchmarks, small-to-mid models trained with GUI-Libra rival or beat much larger systems.
Main Achievement
- A unified, data-efficient post-training recipe (ASFT + KL-regularized GRPO + SNGS), backed by an 81K curated dataset, that jointly improves reasoning and grounding and tightens offline-to-online alignment.
Future Directions
- Add multi-action credit verifiers to shrink ambiguity; scale web data; explore efficient fully online RL; learn when to show thoughts vs. act directly; and combine with world models to foresee screen changes.
Why Remember This
- GUI-Libra turns a tricky trade-off, thinking vs. clicking, into a win-win by focusing training on what actually executes the task while keeping exploration safe and sensible. If you want an AI that can really get things done on screens, this is a practical path that works today.
Practical Applications
- Automate app workflows on phones (e.g., open, search, save items) with fewer misclicks.
- Assistive technology that reliably navigates websites and apps for users with motor or visual challenges.
- Enterprise IT automation for repetitive dashboards: log in, filter reports, export, and send.
- Customer support agents that perform multi-step account actions directly in web portals.
- Personal digital assistants that handle shopping lists, bookings, and reminders end-to-end.
- QA testing bots that click through app flows while explaining their reasoning for debugging.
- Robotic process automation (RPA) upgrades that need robust visual grounding and planning.
- Education tools that demonstrate step-by-step computer tasks with clear rationale and precise actions.
- Data labeling assistants that validate UI element locations by cross-checking targets and coordinates.
- Accessible interfaces that translate natural language goals into accurate, multi-step UI actions.