GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Key Summary
- GUI-Libra is a training recipe that helps computer-using AI agents both think carefully and click precisely on screens.
- It fixes a common problem: long, rambling reasoning often hurts where the model should actually tap or type.
- The authors build an 81K-step, action-aligned reasoning dataset and clean it so thoughts match the actual on-screen actions.
- They introduce Action-aware Supervised Fine-Tuning (ASFT), which teaches the model with both "reason-then-act" and "act-directly" styles while giving extra weight to action and coordinate tokens.
- They design Reinforcement Learning (RL) that works even when only one of many valid actions is labeled as "correct" (partial verifiability).
- A small KL "stay-close" penalty keeps the policy from drifting too far, making offline scores better predict real task success.
- They add Success-adaptive Negative Gradient Scaling (SNGS) to downweight uncertain "wrong" feedback so the model doesn't over-punish valid alternatives.
- Across AndroidWorld, WebArena-Lite-v2, and Online-Mind2Web, GUI-Libra consistently boosts both step accuracy and end-to-end success, often rivaling larger or closed systems.
- Careful data curation plus tailored post-training, not massive online data collection, unlocks big gains in long-horizon GUI navigation.
- All data, code, and models are released to help others train strong, reasoning-capable GUI agents efficiently.
Why This Research Matters
Stronger GUI agents can reliably help with everyday digital chores, from booking trips to managing schoolwork, because they not only plan but also click exactly right. By making offline training results actually predict real-world performance, teams can build useful assistants without expensive, fragile online data collection. This efficiency levels the playing field so smaller, open models can compete with giant, closed systems. Accurate, explainable actions reduce frustration and prevent misclicks that waste time or cause errors at checkout or in forms. The released dataset and code let the community reproduce and extend these gains. Overall, this work makes trustworthy, capable computer-use assistants more accessible to everyone.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how using a phone or a computer to finish a chore, like booking tickets or saving a recipe, takes many small steps in the right order? An AI helper must both understand the screen and choose the right taps and types, step after step.
🥬 Filling (The Actual Concept)
- What it is: This paper tackles how to train "native GUI agents" (single models that see the screen, read your goal, and directly output actions) to reason well over many steps and still click precisely.
- How it works (step by step):
- Start with existing open GUI interaction data.
- Clean and enrich it so the agent's thoughts match the actions it should take.
- Fine-tune the model so it learns both to explain and to act, but gives extra attention to the action parts.
- Use a careful kind of RL that handles uncertain rewards and keeps the model stable.
- Test on realistic benchmarks to check both single steps and full tasks.
- Why it matters: Without this, agents often talk too much and click badly, or they learn from fuzzy signals and get unstable, failing long tasks.
- 🍞 Bottom Bread (Anchor) Imagine you ask an AI to "Find a sports shoe on a store app and save it." It must swipe to find search, type the shoe name, scroll, open the item, and tap save. GUI-Libra helps it think through these steps without messing up where to tap.
🍞 Top Bread (Hook) You know how highlighting important words helps you answer a question better? Focusing on the right bits matters.
🥬 Filling (Visual Grounding)
- What it is: Visual grounding is the skill of pointing to the exact spot or element on the screen to interact with.
- How it works:
- Read the instruction (e.g., "Tap the star icon").
- Look over the screenshot to find the matching element.
- Output precise coordinates (x, y) to tap.
- Check that the point lies inside the target element's box.
- Why it matters: Without good grounding, even perfect reasoning won't help; the model will tap the wrong place.
- 🍞 Bottom Bread (Anchor) If the instruction says "Tap the star to save," grounding finds the star icon's spot, not a random dot nearby.
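The grounding check described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code; the (left, top, right, bottom) box format and the pixel values are assumptions.

```python
# Minimal sketch of a grounding check: a predicted tap "passes" grounding
# if its (x, y) coordinate lands inside the target element's bounding box.
# Box format assumed here: (left, top, right, bottom) in screen pixels.

def point_in_box(point, box):
    """Return True if the tap coordinate lies inside the element's box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

# Example: suppose the star icon occupies a 60x60 px region around (540, 1429).
star_box = (510, 1399, 570, 1459)
print(point_in_box((540, 1429), star_box))  # tap on the star -> True
print(point_in_box((540, 1200), star_box))  # tap above it    -> False
```

The same containment test reappears later in the pipeline, both as a data filter and as part of the RL reward.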
🍞 Top Bread (Hook) Think of following a recipe: you need to plan multiple steps, not just crack one egg.
🥬 Filling (Long-horizon Navigation)
- What it is: Tasks that require many correct steps in a row across changing screens.
- How it works:
- Observe the current screen.
- Recall previous actions.
- Decide the next action that moves toward the goal.
- Repeat until success or time runs out.
- Why it matters: One small error changes the next screen and can derail the whole plan.
- 🍞 Bottom Bread (Anchor) Buying a textbook online involves search, filters, cart, checkout; each step depends on the last.
🍞 Top Bread (Hook) You know how explaining your reasoning can help you choose wisely but can also slow you down?
🥬 Filling (Chain-of-Thought, CoT)
- What it is: The model writes out its thoughts before giving an action.
- How it works:
- Describe what it sees.
- Reflect on the goal and options.
- Plan the next move.
- Output an executable action.
- Why it matters: Too-long thoughts can crowd out attention to precise tapping, hurting grounding.
- 🍞 Bottom Bread (Anchor) If the model writes 300 tokens of thoughts before tapping, it's more likely to miss the tiny star icon.
🍞 Top Bread (Hook) Imagine practicing by copying a coach's moves.
🥬 Filling (Supervised Fine-Tuning, SFT)
- What it is: Training by showing the model input-output pairs and asking it to imitate them.
- How it works:
- Show the instruction, history, and screenshot.
- Provide the correct reasoning trace and action.
- Ask the model to match those tokens.
- Repeat across many steps and tasks.
- Why it matters: SFT teaches the basic skills but can overemphasize long thoughts and underemphasize precise clicks.
- 🍞 Bottom Bread (Anchor) If the dataset says "Click star at (540, 1429)," SFT makes the model learn to output that action.
🍞 Top Bread (Hook) Think of learning by trial and error, like hot and cold hints when finding a hidden toy.
🥬 Filling (Reinforcement Learning, RL)
- What it is: Training by letting the model try actions and rewarding good outcomes.
- How it works:
- Sample several candidate actions for a state.
- Score them based on correctness and format.
- Push the policy toward better-scoring actions.
- Repeat, keeping the policy stable.
- Why it matters: RL can refine decisions beyond copying, but needs careful design to avoid instability.
- 🍞 Bottom Bread (Anchor) If "tap star" gets a positive score and "tap heart" gets low, RL nudges the model toward "tap star."
🍞 Top Bread (Hook) Suppose there are many right ways to solve a puzzle, but your teacher only marks one of them as correct.
🥬 Filling (Partial Verifiability)
- What it is: In GUIs, several actions might be valid, but the dataset often marks just one as "correct."
- How it works:
- Label gives one demonstrated action per step.
- Other valid actions look like mistakes to the verifier.
- Negative feedback becomes noisy and misleading.
- Training can overfit to one path and penalize good alternatives.
- Why it matters: This makes offline step-matching a weak predictor of real task success.
- 🍞 Bottom Bread (Anchor) Both "tap search bar" and "tap magnifying glass" could open search, but only one gets credit.
🍞 Top Bread (Hook) Think about practicing on worksheets (offline) versus taking the real exam (online).
🥬 Filling (Offline vs Online Metrics)
- What it is: Offline checks if your action matches the demo on fixed data; online checks if you finish tasks in a real loop.
- How it works:
- Offline: match the demo's action step-by-step.
- Online: run in the environment; screens change with your actions.
- Compare how well offline predicts online.
- Reduce the gap by stabilizing training.
- Why it matters: High offline scores can still fail online if the policy drifts to unseen states or valid alternatives aren't credited.
- 🍞 Bottom Bread (Anchor) A student who memorizes answers might ace practice, but stumble on the live test when questions are reordered.
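The offline step metrics can be sketched as exact-match checks. This is a deliberate simplification (real evaluation also credits taps that land anywhere inside the target element's box), and the (type, coordinate) action tuples are illustrative, not the benchmark's format.

```python
# Rough sketch of the offline step metrics, not the paper's harness:
# Pass@1 scores the single greedy action against the demo label;
# Pass@k credits the step if any of k sampled candidates matches.
# Actions are simplified to (type, coordinate) tuples for illustration.

def pass_at_1(predicted, demo):
    return 1.0 if predicted == demo else 0.0

def pass_at_k(candidates, demo):
    return 1.0 if demo in candidates else 0.0

demo = ("Click", (540, 1429))
print(pass_at_1(("Click", (540, 1429)), demo))                     # exact match -> 1.0
print(pass_at_k([("Scroll", None), ("Click", (540, 1429))], demo)) # one of k matches -> 1.0
```

Online success, by contrast, can only be measured by rolling the agent out in a live environment where each action changes the next screen, which is exactly why the two metrics can disagree.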
🍞 Top Bread (Hook) Before this paper, many teams tried simple recipes: either only copy demos or do RL that chases easy-to-check clicks.
🥬 Filling (The Problem and Gap)
- What it is: Long CoT in SFT hurts grounding, and step-wise RL rewards are ambiguous, making offline metrics poorly predict online success.
- How it works:
- Datasets have noisy labels and short thoughts.
- SFT loss is dominated by reasoning tokens.
- RL treats uncredited-but-valid actions as wrong.
- Policies drift away from demo data and get unstable.
- Why it matters: Models think a lot but click badly, or click well in practice data but fail live tasks.
- 🍞 Bottom Bread (Anchor) Agents that type long essays but miss the tiny search bar won't help you finish tasks faster.
02 Core Idea
🍞 Top Bread (Hook) Imagine a tightrope walker who must balance thinking (planning steps) and doing (placing feet exactly). Too much thinking, and they wobble; too little, and they step wrong.
🥬 Filling (The Aha! Moment)
- What it is: Train GUI agents with action-aware supervision that emphasizes actions and coordinates, then apply conservative RL with a small KL "stay-close" leash and success-adaptive scaling so the model learns from uncertain feedback without drifting.
- How it works (three analogies):
- Music lesson: First practice the melody (actions/coordinates) louder than the background (long thoughts). Then perform with a metronome (KL) so you don't speed up or slow down wildly. If a judge can't hear some notes, don't punish yourself too hard (SNGS).
- Navigation coach: Learn both a route with explanations and a direct route, but give extra attention to the turns themselves. When exploring variants, tie a safety rope (KL) so you don't get lost, and ignore suspiciously harsh booing from the crowd when your alternative route is also fine (SNGS).
- Homework vs. tests: Study using clean answer keys that line up with the steps. On practice tests, don't wander far from what worked (KL), and treat unclear red X's gently (SNGS) because there may be more than one right way.
- Before vs After:
- Before: CoT-heavy SFT made clicks worse; RL without constraints chased proxy rewards and collapsed policy diversity; offline scores didn't predict online success.
- After: Mixed supervision + token reweighting keep grounding strong even with thoughts; KL-regularized RL stabilizes learning and improves offline-to-online alignment; SNGS avoids over-penalizing valid alternatives.
- Why it works (intuition, no equations):
- Weight the tokens that truly execute the task (action and coordinates) so thoughts don't drown them out.
- Keep the new policy close to the demo-initialized one (KL) so it visits familiar states and maintains credit on the demonstrated action, making offline scores more meaningful.
- Downweight negative gradients when group success is low (SNGS), because many "misses" might actually be good-but-uncredited actions.
- Building blocks (each with a mini sandwich):
• 🍞 Action-aware Supervised Fine-Tuning (ASFT)
- What it is: A training objective that mixes reason-then-act and act-only data while giving extra weight to action and grounding tokens.
- How it works:
- Train on both reasoning+action samples and action-only samples.
- Reweight tokens so action and coordinates count more in the loss.
- Keep the ability to output either concise actions or full thoughts.
- Reduce the harm of long thoughts on precision.
- Why it matters: It preserves grounding while keeping useful reasoning.
- 🍞 Anchor: The model can either say "Tap the star at (540,1429)" directly or think briefly first; both stay accurate.
• 🍞 Token Reweighting
- What it is: Turning up the volume on action and coordinate tokens during SFT.
- How it works:
- Detect tokens inside the answer block.
- Separate semantic action tokens from coordinate tokens.
- Apply bigger weights to coordinates (and action fields) in the loss.
- Train so the model learns to be extra-precise when pointing.
- Why it matters: It stops long thoughts from overshadowing where to click.
- 🍞 Anchor: The numbers (x, y) for the star icon matter more than a long paragraph.
• 🍞 Partially Verifiable RL
- What it is: RL where only one of several valid actions gets credit.
- How it works:
- Sample groups of candidate actions.
- Score them with format and correctness checks.
- Use group-relative advantages to update the policy.
- Recognize that many "negatives" are uncertain.
- Why it matters: It fits real GUIs where multiple moves can make progress.
- 🍞 Anchor: Both "tap search bar" and "tap magnifying glass" move you toward search.
• 🍞 KL Regularization (trust region)
- What it is: A small penalty that keeps the updated policy close to the reference model.
- How it works:
- Measure how different the new action distribution is from the reference.
- Add a small penalty for drifting too far.
- Maintain healthy policy entropy (diversity).
- Improve the link between offline and online scores.
- Why it matters: Prevents collapse and keeps learning predictable.
- 🍞 Anchor: Like a metronome, it keeps your tempo steady during practice.
• 🍞 Success-adaptive Negative Gradient Scaling (SNGS)
- What it is: Downweight negative updates when many samples disagree with the demo, because negatives are ambiguous.
- How it works:
- Compute group success rate for the demo action.
- Scale negative advantages by a factor that grows with that success.
- Keep positive updates untouched.
- Avoid over-punishing good alternatives.
- Why it matters: It stabilizes RL under partial credit.
- 🍞 Anchor: If few picked the demo action, being "wrong" might not mean bad; penalize gently.
03 Methodology
At a high level: Input (instruction + history + screenshot) → Stage 1: Action-aware Supervised Fine-Tuning (mixed supervision + token reweighting) → Stage 2: Conservative RL (KL-regularized GRPO + success-adaptive scaling) → Output (reasoned, executable action that succeeds online).
Stage 0: Data curation and filtering (the prep kitchen) 🍞 Hook: You can't bake a great cake with stale flour. 🥬 The Concept (Data Curation Pipeline)
- What it is: A process to augment, align, and filter GUI trajectories so thoughts and actions match.
- How it works:
- Aggregate open web and mobile trajectories (clean extremes and compound steps).
- Convert to a unified format: <think> reasoning </think> + <answer> JSON action </answer>.
- Augment reasoning with structured prompts that describe observation, reflection, and plan.
- Filter by re-prediction agreement (drop uncertain samples) and bounding-box verification (keep coordinates inside predicted boxes).
- Why it matters: Clean, action-aligned thoughts teach the model to both think and act reliably.
- 🍞 Anchor: If the action says "Tap 'Search' button," the coordinate must land inside the 'Search' button box.
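The bounding-box verification filter above can be sketched as a simple containment test over labeled samples. The field names ("point", "box", "target") are illustrative assumptions, not the released dataset's schema.

```python
# Hedged sketch of the bounding-box verification filter: keep a training
# sample only if its labeled tap coordinate lands inside the bounding box
# of the element the action claims to target. Field names are illustrative.

def passes_bbox_filter(sample):
    x, y = sample["point"]
    left, top, right, bottom = sample["box"]
    return left <= x <= right and top <= y <= bottom

samples = [
    {"target": "Search button", "point": (120, 80), "box": (100, 60, 200, 100)},
    {"target": "Search button", "point": (400, 80), "box": (100, 60, 200, 100)},
]
clean = [s for s in samples if passes_bbox_filter(s)]
print(len(clean))  # 1: the second sample's tap misses the button box and is dropped
```

The paper's pipeline pairs this geometric check with a re-prediction agreement filter, so a sample survives only if it is both self-consistent and spatially valid.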
Stage 1: Action-aware Supervised Fine-Tuning (ASFT) 🍞 Hook: Practice both with explanations and with quick moves, but focus most on the moves. 🥬 The Concept (ASFT)
- What it is: Train on a mixture of reason-then-act and act-only data, giving extra weight to action and coordinate tokens.
- How it works:
- Build two views of each step: (a) full reasoning+action, (b) action-only.
- Train so the model learns to output either style at inference.
- Reweight tokens: action tokens > reasoning tokens; coordinates often highest.
- Keep response lengths healthy so grounding doesnāt degrade.
- Why it matters: It preserves accurate clicking while keeping useful reasoning available.
- 🍞 Anchor: The model can reply "Tap the star at (540,1429)" with or without a long explanation, and still hit the right spot.
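The token reweighting at the heart of ASFT can be sketched as a weighted mean over per-token losses. The weights below are made up for illustration, and the three-way "reason/action/coord" role split is a simplification of however the paper actually partitions the target tokens.

```python
# Hedged sketch of ASFT-style token reweighting: a weighted mean of
# per-token losses where coordinate tokens count most, other action
# tokens next, and reasoning tokens least. Weights are illustrative.

ROLE_WEIGHTS = {"reason": 1.0, "action": 2.0, "coord": 4.0}

def asft_loss(per_token_nll, token_roles):
    """per_token_nll: negative log-likelihood of each target token.
    token_roles: 'reason' | 'action' | 'coord' for each token."""
    weights = [ROLE_WEIGHTS[r] for r in token_roles]
    weighted = sum(w * nll for w, nll in zip(weights, per_token_nll))
    return weighted / sum(weights)

# A badly-predicted coordinate dominates the loss even though reasoning
# tokens outnumber it, which is exactly the intended emphasis.
nll   = [0.1, 0.1, 0.1, 0.5, 2.0]
roles = ["reason", "reason", "reason", "action", "coord"]
print(round(asft_loss(nll, roles), 3))  # 1.033
```

With uniform weights the same losses would average to 0.56, so the reweighting roughly doubles the pressure on the mispredicted coordinate.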
Example with actual data
- Input: "Tap on the star icon to save the recipe," screenshot with a visible star, prior step history.
- Output: <think> I see the star near the recipe title. To save it, tap the star. </think> <answer> {"action_type": "Click", "action_target": "Star save icon", "point_2d": [540, 1429]} </answer>
- What breaks without ASFT: If all weight goes to long thoughts, the model's taps drift and miss the star.
Stage 2: Conservative RL with partial verifiability 🍞 Hook: When feedback is uncertain, explore gently and stay near what you already do well. 🥬 The Concept (Partially Verifiable RL with KL)
- What it is: An RL setup that samples candidate actions, scores them with partially verifiable rewards, and uses a KL trust region to avoid policy drift.
- How it works:
- For each state, sample a group of candidate actions (GRPO).
- Score each action: 10% for valid format; 90% for action-type, value-match, and whether the point falls in the target box.
- Compute group-relative advantages and update policy.
- Add a small KL penalty to keep the policy near the SFT reference, preserving predictability.
- Why it matters: It improves alignment between offline step scores and real task success.
- 🍞 Anchor: The model keeps a "metronome" on (KL) while it practices picking the best of its candidate taps.
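The reward and advantage computation can be sketched as follows. The 10% format / 90% correctness split comes from the text above, but the equal 0.3 sub-weights for action type, value match, and point-in-box, as well as the action dict layout, are assumptions; the KL penalty enters the policy update itself and is not shown here.

```python
# Illustrative sketch of the partially verifiable step reward and the
# group-relative advantage used in GRPO-style updates. Sub-weights and
# field names are assumptions, not the paper's exact scheme.

def step_reward(action, demo, target_box):
    reward = 0.0
    if action.get("valid_format"):                 # 10%: parseable action JSON
        reward += 0.1
    if action.get("type") == demo["type"]:         # 90%: correctness checks
        reward += 0.3                              #   matching action type
        if action.get("value") == demo.get("value"):
            reward += 0.3                          #   matching value/target
        x, y = action.get("point", (-1, -1))
        left, top, right, bottom = target_box
        if left <= x <= right and top <= y <= bottom:
            reward += 0.3                          #   tap inside target box
    return reward

def group_relative_advantages(rewards):
    """GRPO-style advantage: each sample's reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

demo = {"type": "Click", "value": "Star save icon"}
box = (510, 1399, 570, 1459)
good = {"valid_format": True, "type": "Click", "value": "Star save icon", "point": (540, 1429)}
bad = {"valid_format": True, "type": "Scroll"}
rewards = [step_reward(good, demo, box), step_reward(bad, demo, box)]
print(group_relative_advantages(rewards))  # good action gets the positive advantage
```

Because advantages are relative within the sampled group, the update pushes probability mass toward the better candidates without needing an absolute notion of "correct."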
Success-adaptive Negative Gradient Scaling (SNGS) 🍞 Hook: Don't scold yourself harshly when the grading might be unfair. 🥬 The Concept (SNGS)
- What it is: Downweight negative gradients based on how often the demo action wins within the sampled group.
- How it works:
- Measure group success ratio for the demo action.
- If it's low, many "negatives" might be valid alternatives; shrink the negative update.
- If itās high, stronger negatives are safe.
- Keep positive updates intact.
- Why it matters: It avoids overfitting to a single demonstrated path when multiple are valid.
- 🍞 Anchor: Both tapping the search bar or magnifying glass can open search; don't punish one too hard if it's not the labeled choice.
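The scaling rule can be sketched as below. A simple linear scale on the group success rate is used here for illustration; the paper's actual scaling function may differ.

```python
# Hedged sketch of SNGS: shrink negative advantages when the demonstrated
# action rarely "wins" within the sampled group, since those failures may
# be valid-but-uncredited alternatives. Linear scaling is an assumption.

def sngs_scale_advantages(advantages, successes):
    """advantages: group-relative advantages for each sampled action.
    successes: 1 if the sample matched the demo action, else 0."""
    success_rate = sum(successes) / len(successes)
    scaled = []
    for adv in advantages:
        if adv < 0:
            scaled.append(adv * success_rate)  # gentle when success is rare
        else:
            scaled.append(adv)                 # positive updates untouched
    return scaled

# Only 1 of 4 samples matched the demo, so negatives shrink to 25% strength.
print(sngs_scale_advantages([0.75, -0.25, -0.25, -0.25], [1, 0, 0, 0]))
# [0.75, -0.0625, -0.0625, -0.0625]
```

When the demo action wins often (high success rate), the negatives pass through almost unchanged, so confident mistakes are still punished at full strength.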
Secret sauce (what's clever)
- Mixed-mode learning prevents long thoughts from hijacking precision.
- Token reweighting spotlights the exact tokens that move the cursor.
- KL regularization keeps exploration on-track so offline scores predict online success.
- SNGS respects that GUIs often have multiple correct next steps.
What breaks without each step
- Without curation: Noisy labels and mismatched thoughts hurt both reasoning and clicks.
- Without ASFT: CoT-heavy SFT degrades grounding; act-only SFT weakens planning.
- Without KL: Training reward rises but online success drops (policy collapse/reward hacking).
- Without SNGS: The policy over-penalizes valid alternatives and becomes brittle.
End-to-end flow
- Input → Curated sample → ASFT update (mixed + reweight) → KL-regularized GRPO + SNGS → Output: <think> … </think> <answer> {action JSON} </answer> → Better offline scores that now predict higher online task success.
04 Experiments & Results
🍞 Hook: Testing an agent is like trying a recipe in different kitchens: a phone, a local web sandbox, and real live websites.
🥬 The Test
- What they measured and why:
- Step accuracy (Pass@1, Pass@4): Does the next action match the demo exactly? Tests precision.
- Online task success: Does the agent actually finish the task in a live, step-by-step run? Tests real-world usefulness.
- Grounding vs. response length: Do longer thoughts hurt where the model clicks? Tests the reasoning-grounding trade-off.
- Offline-to-online correlation: Do good offline scores predict online wins? Tests training reliability.
The Competition
- Open-source native models (e.g., Qwen2.5-VL-3/7/32/72B, Qwen3-VL-4/8/32B, GUI-R1, UI-TARS) and proprietary LLMs paired with a grounding module.
- Also compared against agent frameworks that add extra modules (e.g., step-wise summaries), to separate pure native-model strength from toolkits.
Scoreboard with context
- AndroidControl-v2 (offline step accuracy):
  • GUI-Libra-3B beats its base model by +20.9 Pass@1 on high-level and +14.8 on low-level tasks, with big gains in Pass@4. That's like moving from a C to a strong B+/A- with a much smaller model.
  • GUI-Libra-4B/8B also post strong margins over their baselines, rivaling much larger models.
- MM-Mind2Web-v2 (offline step accuracy across cross-task/site/domain):
  • GUI-Libra-7B improves Pass@1 by +14.0 over its 7B baseline; GUI-Libra-3B jumps +19.3 on average, huge for a small model.
- AndroidWorld (online, 20 steps):
  • GUI-Libra-4B/8B hit 42.6% success, beating many larger native models and matching or exceeding some agent frameworks with extra modules. Compared to baselines, GUI-Libra-8B gains +12.2 points; GUI-Libra-3B leaps from 3.5% to 25.2% (+21.7), like jumping from barely passing to a solid pass.
- WebArena-Lite-v2 (online web sandbox, 15 steps):
  • Even with relatively little web training data, GUI-Libra-7B rises from 4.9% to 22.6%; GUI-Libra-8B from 15.3% to 26.6%. These are strong, parameter-efficient results, often matching or beating much larger systems.
- Online-Mind2Web (live websites):
  • GUI-Libra-8B lifts overall scores from 19.3 to 28.0, topping all evaluated native models; GUI-Libra-7B moves from 15.8 to 25.5; GUI-Libra-4B from 21.7 to 25.7; and GUI-Libra-3B from 4.8 to 21.3.
Surprising/Important Findings
- Longer thoughts hurt grounding unless you train with action-aware weighting and mixed supervision; then the harm is largely removed.
- A small KL penalty doesnāt change training reward much, but it dramatically improves online performance and prevents policy collapse.
- Offline-to-online correlation becomes strong with KL (0.89), showing that the method makes offline scores meaningful.
- Mixing direct grounding data into RL boosts pure grounding benchmarks but can reduce performance on reasoning-heavy navigation tasks: a trade-off knob depending on your goal.
- SNGS helps online generalization notably, with modest trade-offs on some low-level step metrics.
05 Discussion & Limitations
🍞 Hook: No tool is perfect; it's best to know when it shines and when to try something else.
🥬 Limitations
- Data scope: GUI-Libra relies on curated open-source trajectories. Great for efficiency, but it doesn't yet explore large-scale fully online RL where the agent collects new data live.
- Web data volume: The training corpus includes fewer web trajectories than some rivals; even so, results are strong, but more web-rich data could help further.
- Hyperparameters: SNGS brings gains but is sensitive to settings; careful tuning may be required for new domains.
- Partial verifiability remains: Even with SNGS and KL, some ambiguity persists; better multi-action verifiers would help.
Required Resources
- Vision-language backbones (3B-8B) and GPUs for SFT + RL.
- Clean, action-aligned reasoning data (their released 81K set helps).
- A reproducible evaluation harness for both offline (step) and online (task) testing.
When NOT to Use
- If you only need single-step grounding without multi-step planning, a simpler grounding-only pipeline may be faster.
- If your environment provides fully verifiable step rewards (only one true action), standard RLVR without KL may suffice.
- If you cannot afford any RL runs, SFT with mixed supervision and token reweighting still helps, but you'll miss RL's online robustness.
Open Questions
- Can we learn verifiers that credit multiple valid actions automatically, reducing ambiguity?
- How to scale to fully online RL efficiently without heavy infrastructure?
- Can we adaptively decide when to reason vs. act-directly at inference for best speed-accuracy trade-offs?
- How to personalize agents that learn user-specific UI patterns while staying robust to layout shifts?
- Can we fuse world models or simulators to predict future screens and plan even better?
06 Conclusion & Future Work
Three-sentence summary
- GUI-Libra shows that carefully curated, action-aligned reasoning data plus action-aware SFT and conservative, KL-regularized RL can make GUI agents both think clearly and click precisely.
- It stabilizes learning under partially verifiable rewards and makes offline scores reliably predict real task success, without costly online data collection.
- Across phone and web benchmarks, small-to-mid models trained with GUI-Libra rival or beat much larger systems.
Main Achievement
- A unified, data-efficient post-training recipe (ASFT + KL-regularized GRPO + SNGS), backed by an 81K curated dataset, that jointly improves reasoning and grounding and tightens offline-to-online alignment.
Future Directions
- Add multi-action credit verifiers to shrink ambiguity; scale web data; explore efficient fully online RL; learn when to show thoughts vs. act directly; and combine with world models to foresee screen changes.
Why Remember This
- GUI-Libra turns a tricky trade-off, thinking vs. clicking, into a win-win by focusing training on what actually executes the task while keeping exploration safe and sensible. If you want an AI that can really get things done on screens, this is a practical path that works today.
Practical Applications
- Automate app workflows on phones (e.g., open, search, save items) with fewer misclicks.
- Assistive technology that reliably navigates websites and apps for users with motor or visual challenges.
- Enterprise IT automation for repetitive dashboards: log in, filter reports, export, and send.
- Customer support agents that perform multi-step account actions directly in web portals.
- Personal digital assistants that handle shopping lists, bookings, and reminders end-to-end.
- QA testing bots that click through app flows while explaining their reasoning for debugging.
- Robotic process automation (RPA) upgrades that need robust visual grounding and planning.
- Education tools that demonstrate step-by-step computer tasks with clear rationale and precise actions.
- Data labeling assistants that validate UI element locations by cross-checking targets and coordinates.
- Accessible interfaces that translate natural language goals into accurate, multi-step UI actions.