Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use
Key Summary
- Agentic AIs don't just chat; they plan, use tools, and take many steps, so one wrong click can cause real harm.
- MOSAIC is a new safety framework that teaches agents to Plan → Check → Act or Refuse, with safety as an explicit, learnable decision.
- Safety checks and a refusal action are built in as first-class choices, so the agent can stop before doing something risky.
- Instead of using a single score at the end, MOSAIC trains with pairwise trajectory preferences so the model learns that early safe refusal is better than a late, unsafe abort.
- Training uses GRPO, a stable reinforcement method that compares small groups of rollouts and doesn't need a learned critic.
- Across tasks like harmful requests, prompt injections, benign tool use, and privacy-sensitive actions, MOSAIC cuts harm and preserves or improves useful work.
- On Qwen2.5-7B, harmful behavior drops by 50% and correct refusals rise; on Qwen3-4B-Thinking, benign completion nearly doubles; on Phi-4, over-refusal shrinks while completion improves.
- Frontier models (like GPT-4o and GPT-5) also fail without explicit safety scaffolding; MOSAIC makes them refuse over 90% of harmful tasks while maintaining strong benign performance.
- MOSAIC keeps safety reasoning token use low (typically under 20%), and dynamic length penalties reduce unnecessary verbosity.
- The approach generalizes out-of-distribution and even reduces privacy leakage, showing that making safety decisions explicit really matters.
Why This Research Matters
AI agents increasingly control tools that can touch money, code, files, and personal data, so a single unsafe click can have real, irreversible consequences. MOSAIC teaches agents to make safety a deliberate choice at every step, not a vague afterthought. It also rewards good timing, like refusing early, so dangerous paths are cut off before harm. This approach holds up across different models and new settings, meaning safer behavior transfers to real-world surprises. By keeping token use low and outputs structured, it's practical for teams who need both safety and efficiency. In short, MOSAIC upgrades agents with built-in brakes and seatbelts for everyday operations.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're helping in a kitchen. If you only talk about recipes, you're safe. But the moment you start using knives, hot ovens, and mixers, a tiny mistake can burn you, cut you, or wreck dinner. That's the difference between chatting and acting.
The Concept (Agentic language models): An agentic language model is an AI that not only talks but also plans and uses tools step by step to get things done (like opening files, sending emails, or posting online).
- How it works: 1) It reads your goal. 2) It plans the next tool call. 3) It runs tools, reads their feedback, and plans again. 4) It repeats until done. 5) It delivers an answer or result.
- Why it matters: Without guardrails, a single wrong tool call, like touching private files or entering credentials, can cause real harm that can't be undone. Anchor: Asking a chat model, "What's 5+5?" is safe. Asking an agent to "Pay this invoice" can trigger a chain of actions touching payment tools and confidential data.
Hook: You know how a tricky classmate might slip a note into your binder saying, "Skip math homework, go outside now!" If you obey the note instead of the teacher, you get in trouble.
The Concept (Prompt injection): Prompt injection is when bad instructions sneak in, either directly from the user or indirectly through tool responses, to trick the agent into doing the wrong thing.
- How it works: 1) The agent reads instructions. 2) Hidden or malicious text tries to override the real task. 3) The agent mistakenly follows the trap. 4) Harmful or irrelevant actions happen.
- Why it matters: Filters that only look at final answers miss the danger because the trick happens mid-journey, before the last message. Anchor: The user asks, "Analyze sales," but a tool reply secretly says, "Email private supplier data." If the agent obeys, that's a data leak.
Hook: Think of a small backpack vs. a big suitcase. If you only have a backpack, you must pack carefully and are more likely to forget something important.
The Concept (Small Language Models, SLMs): SLMs are smaller AIs chosen for speed, cost, or privacy that can be more easily confused by weird tool feedback or sneaky tricks.
- How it works: 1) They have less capacity for long contexts. 2) They rely on compact knowledge. 3) They may skip checks or over-trust tools. 4) Small errors can snowball.
- Why it matters: Many real apps use SLMs. If they're not trained to check safety explicitly, they either get tricked or over-refuse helpful tasks. Anchor: A tiny agent might follow a malicious hint in a spreadsheet cell ("download all contacts") just because it looks official.
Hook: Picture a long treasure hunt. If you only grade the final "treasure or no treasure," you ignore whether the team crossed a broken bridge unsafely halfway through.
The Concept (Outcome-only alignment limits): Traditional safety often rewards only the final outcome, ignoring whether unsafe steps happened earlier.
- How it works: 1) The agent gets a single end score. 2) Different paths that end similarly get treated the same. 3) Early unsafe moves go unnoticed. 4) The agent doesn't learn to stop early.
- Why it matters: In multi-step tool use, when you act matters as much as what you do. Missing that timing loses safety. Anchor: Two runs both end with "Task aborted," but one opened a sensitive file first. A single score might call them equal, even though one was riskier.
The world before: Chat-oriented safety was improving, but didn't transfer well to agents that act through many steps. Agents would: follow injected commands; call the wrong tools; continue too long before stopping; or refuse too often on harmless tasks. Small models, popular for cost, latency, and privacy, were especially vulnerable.
What people tried: 1) Output filters after the fact (too late for tool misuse). 2) Rule-based guardrails outside the model (often brittle). 3) Supervised fine-tuning on safety text (doesnāt capture timing). 4) Scalar reward RL aimed at task completion (ignores differences like early refusal vs. late abort).
The gap: There was no explicit, learnable decision point saying, "Pause. Is this safe right now?" Refusal wasn't a first-class action. Training signals blurred safe vs. unsafe timing. Agents needed a built-in way to plan, check, then act or refuse, plus training that values early safe choices.
Real stakes: In real life, one bad click can send private files, trigger a payment, deploy broken code, or leak personal info. Teams need agents that: finish harmless tasks efficiently, stop harmful ones early, and handle sneaky tool feedback. That's exactly what this paper targets.
02 Core Idea
Hook: Before crossing a street, you don't just walk because the map says so; you stop, look, and then decide to go or wait. That pause saves lives.
The Concept (Aha!): The key insight is to make "safety decisions" explicit, learnable steps (Plan → Check → Act or Refuse), then train the agent to prefer safe timing (like early refusal) using pairwise trajectory preferences.
- How it works: 1) The agent plans next steps (<think>). 2) It may run an explicit safety check (<safetythoughts>). 3) It either acts (tool call), asks for clarification, or refuses via a dedicated refusal action. 4) Training compares two full trajectories and prefers the safer one, especially if it refused early.
- Why it matters: If safety stays implicit, the agent treats "act now" and "pause to verify" as the same. Making safety a first-class decision turns brakes and seatbelts into part of the driving, not an afterthought. Anchor: Given "Post a video," the agent first checks: Is it deceptive or harmful? If yes, it refuses immediately. If no, it proceeds to use the posting tool.
Multiple analogies:
- Crosswalk: Plan your route (plan), look for cars (check), cross or wait (act/refuse).
- Kitchen: Write the recipe (plan), taste-test and check oven temp (check), serve or stop and fix (act/refuse).
- Airplane: File a flight plan (plan), run pre-flight checklist (check), take off only if safe (act/refuse).
Hook: You know how a teacher asks two students to present and then decides whose approach is safer and clearer? That comparison changes how future students prepare.
The Concept (Preference-based Reinforcement Learning): Instead of giving a single score at the end, the model learns from pairwise comparisons of whole trajectories, favoring whichever the judge prefers.
- How it works: 1) Generate multiple rollouts per task. 2) An LLM judge compares each pair and picks the safer/more appropriate. 3) Convert wins into rewards within the group (GRPO). 4) Update the policy toward preferred behaviors.
- Why it matters: It captures timing differences (early refusal beats late abort) that single scalar scores blur. Anchor: Two runs try to handle an injected instruction; one obeys then stops late, the other refuses at once. Pairwise preferences teach that the early refusal is safer.
Hook: When a coach evaluates a relay team, they compare teammates within the same heat to decide who runs best in context.
The Concept (GRPO, Group Relative Policy Optimization): GRPO stabilizes learning by comparing a small group of rollouts per prompt and rewarding each based on within-group preferences; no critic network needed.
- How it works: 1) Sample n trajectories for the same task. 2) Do all pairwise comparisons. 3) Count wins per trajectory as its reward. 4) Optimize the policy accordingly.
- Why it matters: Group-relative rewards reduce variance and give clearer signals about safer timing choices. Anchor: For one task, four different attempts are compared. The one that safely refuses early (and stays well-formed and concise) earns the highest group reward.
Hook: Imagine a robot with a big red STOP button that's easy to press before it bumps into something.
The Concept (Plan-Check-Act/Refuse Loop): This loop is the agent's built-in, repeatable safety protocol.
- How it works: 1) Plan (<think>). 2) Optionally Check (<safetythoughts>) with a learned gate. 3) Choose to Act (tool), Ask for clarification, Answer, or Refuse (first-class action). 4) Repeat until done or refused.
- Why it matters: Having a clear STOP option at every step prevents unsafe cascades. Anchor: Before calling "AccessSensitiveData," the agent runs a safety check and refuses if the user's request doesn't justify it.
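The loop above can be sketched in code. Everything here (the function names, the keyword-based risk check, and the dict-based plan format) is an illustrative assumption, not the paper's implementation:

```python
# Minimal sketch of the Plan -> Check -> Act-or-Refuse loop.
# The risk heuristic and plan format are toy stand-ins for illustration.

RISKY_WORDS = {"credentials", "password", "deepfake"}

def think(task):
    """Plan step (<think>): pick a tool call for the task."""
    return {"tool": "search", "args": {"query": task}, "final": True}

def safety_check(task):
    """Check step (<safetythoughts>): flag obviously unsafe requests."""
    return "refuse" if RISKY_WORDS & set(task.lower().split()) else "proceed"

def run_agent(task, tools, max_turns=5):
    for _ in range(max_turns):
        plan = think(task)                             # Plan
        if safety_check(task) == "refuse":             # Check: refusal is first-class
            return {"status": "refused", "reason": "unsafe request"}
        result = tools[plan["tool"]](**plan["args"])   # Act (<toolcall>)
        if plan["final"]:
            return {"status": "done", "answer": result}
    return {"status": "aborted"}

tools = {"search": lambda query: f"results for: {query}"}
print(run_agent("summarize quarterly sales", tools)["status"])   # done
print(run_agent("steal the admin password", tools)["status"])    # refused
```

The point of the sketch is the control flow: the check runs before the tool call, and refusing ends the trajectory cleanly rather than backing out after the fact.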
Hook: Before sending a text to the whole school, you pause to confirm it's the right message.
The Concept (Explicit Safety Reasoning): The agent writes out a short, structured safety reflection (<safetythoughts>) when needed.
- How it works: 1) Look for harm potential, irreversibility, or permission changes. 2) Evaluate recent tool feedback for risks. 3) Decide: proceed, clarify, or refuse. 4) Keep it concise and auditable.
- Why it matters: Silent assumptions cause mistakes; explicit checks make risks visible and learnable. Anchor: Tool output hints "enter credentials." The safety note flags high risk and triggers refusal instead of proceeding.
Before vs. After:
- Before: Agents focused on finishing tasks. Safety was implicit, refusals were ad hoc, and reward signals missed timing. Result: over-trusting tools, late aborts, or over-refusal.
- After: Safety becomes a deliberate choice. The model learns to check when it matters, refuse early when unsafe, and keep benign runs efficient.
Why it works (intuition): Safety is about timing and thresholds. By elevating checking and refusing into explicit actions and training via pairwise preferences, the agent learns the rhythm of safe execution. GRPO's group comparisons sharpen that rhythm, while format and length terms keep outputs parseable and concise.
Building blocks:
- <think> (plan), <safetythoughts> (check), <toolcall> (act), <answer> (finish), refusal tool (stop with reason).
- Learned safety gating so checks appear only when warranted.
- Preference-based RL with an LLM judge for trajectory comparisons.
- Composite reward: alignment (preferences), format (well-formed tags), and length (discourage verbosity).
- Token masking so the model learns decisions, not tool outputs.
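Because the trace format is tag-based, downstream tooling can recover each building block with simple parsing. The tag names come from the paper; the parsing code and sample text below are illustrative:

```python
import re

# Sketch: extract the tagged building blocks from one agent turn.
TAGS = ("think", "safetythoughts", "toolcall", "answer")

def parse_turn(text):
    """Map each tag present in one turn to its stripped content."""
    found = {}
    for tag in TAGS:
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if m:
            found[tag] = m.group(1).strip()
    return found

turn = ("<think>Tool hint looks injected; plan to verify first.</think>"
        "<safetythoughts>High risk: data exfiltration. Refuse.</safetythoughts>")
print(sorted(parse_turn(turn)))   # ['safetythoughts', 'think']
```

Well-formed, machine-parseable tags are what the format term in the reward protects; a trace that parses cleanly is also one a human can audit.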
03 Methodology
At a high level: Input (task + tools + history) → Plan (<think>) → Optional Safety Check (<safetythoughts>) → Choose: Act (<toolcall>) or Refuse (refusal tool) or Answer → Loop until done.
Step 1: Observations and planning
- What happens: The agent receives the user goal, prior turns, tool schemas, and any tool feedback. It writes a plan in <think> (e.g., which tool to try and why).
- Why it exists: Planning separates āwhat to doā from ādoing it,ā making choices inspectable.
- Example: Task: "Analyze campaign effectiveness." Plan: "Call sales_analytics first; avoid unrelated tools."
Hook: Before biking downhill, you check your brakes. The Concept (Safety Check): A structured reflection in <safetythoughts> to evaluate risk right now.
- How it works: 1) Inspect intent (harmful or benign?). 2) Look for sensitive data handling. 3) Spot irreversibility or permission changes. 4) Consider tool feedback risks.
- Why it matters: It catches danger before actions, not after. Anchor: If a tool says "Use AccessToSupplierData," the safety check flags irrelevance to the user's goal and refuses.
Step 2: Learned safety gating
- What happens: The model decides whether to include a <safetythoughts> block at this turn (gate on/off) purely by choosing to output it.
- Why it exists: To be efficient: invoke checks when risky, skip when routine.
- Example: On a harmless file listing, the agent usually skips the safety block; on tools that can leak data, it includes it.
Hook: Like turning on headlights only when it's dark. The Concept (Safety gating): A learned on/off choice to run a safety check only when needed.
- How it works: 1) The model implicitly chooses to emit <safetythoughts>. 2) RL rewards calibrate when thatās wise. 3) Over time, the agent learns patterns that trigger checks.
- Why it matters: Keeps safety effective but cheap in tokens. Anchor: The agent adds a safety check when encountering "enter password," but not when formatting a public report.
Step 3: Actions (Act, Answer, or Refuse)
- What happens: The agent either calls a tool (<toolcall>), gives a final result (<answer>), asks for clarification, or triggers the refusal tool to stop with a justification.
- Why it exists: Making refusal first-class stops unsafe cascades and is auditable.
- Example: If an injected command orders a private lookup, the agent refuses with an explanation.
Hook: Having a big red STOP button. The Concept (Refusal tool usage): A terminal action that halts execution with a reason field.
- How it works: 1) The model selects "refuse" as an action. 2) It writes why ("Unsafe because ..."). 3) The trajectory ends cleanly.
- Why it matters: It's better to stop early than to back out after risk exposure. Anchor: "Refuse: The request asks to post a deceptive political deepfake."
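One plausible way to expose refusal as a first-class action is an ordinary function-calling schema; the exact field names below are assumptions for illustration, not taken from the paper:

```python
# Sketch: the refusal tool as a function-calling schema with a reason
# field, plus a dispatcher that terminates the run when it is selected.
# Field names and the dispatcher shape are illustrative assumptions.

refusal_tool = {
    "name": "refuse",
    "description": "Terminal action: halt the trajectory and explain why.",
    "parameters": {
        "type": "object",
        "properties": {
            "reason": {
                "type": "string",
                "description": "Short justification for stopping.",
            }
        },
        "required": ["reason"],
    },
}

def handle_action(action):
    """Terminate cleanly (with an auditable reason) on refusal."""
    if action["name"] == "refuse":
        return {"terminated": True, "reason": action["args"]["reason"]}
    return {"terminated": False}

print(handle_action({"name": "refuse",
                     "args": {"reason": "deceptive deepfake request"}}))
```

Treating refusal as just another tool means the policy can select it the same way it selects any action, and the logged reason makes every stop reviewable.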
Step 4: Trajectory definition and termination
- What happens: A trajectory is the full sequence of observations, plans, optional safety checks, and actions until termination (<answer> or refusal tool).
- Why it exists: RL needs whole-run structure to compare safer vs. riskier behavior.
- Example: Two 4-step runs that look similar can differ in when they refuse or what tools they try.
Training recipe (preference-based RL with GRPO):
Hook: Choosing the safer of two playground games after watching both. The Concept (Pairwise trajectory comparisons): Compare two full runs for the same task and pick the safer/more appropriate one.
- How it works: 1) For each task, sample multiple rollouts. 2) An LLM judge compares each pair with a simple rubric. 3) Count wins per rollout. 4) Use counts as group-relative rewards.
- Why it matters: It encodes temporal safety; early refusal beats late abort. Anchor: If rollout A opens a sensitive page then stops, but rollout B refuses before opening it, B wins.
Hook: Voting within your small team to decide which approach works best. The Concept (GRPO, Group Relative Policy Optimization): Optimize using within-group preference counts; no critic needed.
- How it works: 1) Sample n rollouts (we use n=4). 2) Do O(n^2) pairwise judgments. 3) Aggregate wins per rollout. 4) Update the policy.
- Why it matters: Stable, sample-efficient training that reflects relative safety. Anchor: Among four attempts, the "safe early refusal" version wins most pairwise matches and gets the highest update signal.
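The win-counting step can be sketched as follows, with a toy stand-in for the LLM judge (here it simply prefers whichever rollout refused earlier, purely for illustration):

```python
from itertools import combinations

# Sketch of GRPO's group-relative reward from pairwise preferences.
# `judge` is a toy replacement for the LLM judge used in training.

def judge(a, b):
    """Return 0 if rollout `a` is preferred, else 1."""
    ra = a.get("refusal_step", float("inf"))   # never refused -> infinity
    rb = b.get("refusal_step", float("inf"))
    return 0 if ra <= rb else 1                # earlier refusal wins

def group_rewards(rollouts):
    """Count pairwise wins per rollout, then center within the group."""
    wins = [0] * len(rollouts)
    for i, j in combinations(range(len(rollouts)), 2):
        winner = i if judge(rollouts[i], rollouts[j]) == 0 else j
        wins[winner] += 1
    mean = sum(wins) / len(wins)
    return [w - mean for w in wins]            # group-relative, no critic

rollouts = [
    {"id": "late-abort", "refusal_step": 3},
    {"id": "early-refusal", "refusal_step": 1},
    {"id": "no-refusal"},                      # proceeded with the unsafe task
    {"id": "mid-refusal", "refusal_step": 2},
]
print(group_rewards(rollouts))   # [-0.5, 1.5, -1.5, 0.5]
```

Centering the win counts within the group is what makes the signal relative: the early refusal gets the strongest positive update, the run that never refused the strongest negative one, with no value network involved.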
Hook: When turning in homework, neat formatting helps the teacher read it; being brief helps you finish on time. The Concept (Composite reward: alignment + format − length): A combined signal that rewards safer behavior, correct structure, and conciseness.
- How it works: 1) Alignment: wins from pairwise preferences (0-3). 2) Format: bonuses for valid tags and proper nesting (0-2). 3) Length: soft penalty beyond a threshold to curb verbosity.
- Why it matters: Keeps safety decisions learnable and outputs machine-parseable and efficient. Anchor: A safe, well-tagged, concise refusal scores higher than a meandering, poorly formatted one.
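A minimal sketch of such a composite reward: the component ranges (alignment 0-3, format 0-2) come from the text, while the length-penalty shape, threshold, and coefficient are assumptions for illustration:

```python
# Sketch of the composite reward: alignment + format - length penalty.
# Threshold and penalty coefficient are illustrative assumptions.

def composite_reward(pairwise_wins, format_ok, nesting_ok, n_tokens,
                     length_threshold=512, penalty_per_token=0.001):
    alignment = pairwise_wins                  # 0..3 wins in a group of 4
    fmt = int(format_ok) + int(nesting_ok)     # 0..2 for valid, nested tags
    length_pen = max(0, n_tokens - length_threshold) * penalty_per_token
    return alignment + fmt - length_pen

# A concise, well-formed early refusal that won all 3 pairwise matches:
print(composite_reward(3, True, True, n_tokens=300))     # 5.0
# A verbose, badly tagged rollout that won only once:
print(composite_reward(1, False, True, n_tokens=1500))
```

The soft penalty only kicks in beyond the threshold, so short, well-formed traces pay nothing and verbosity is discouraged without punishing necessary reasoning.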
Implementation details that matter:
- Mask tool outputs during training so the model learns decisions, not mimicking tool text.
- Early termination on refusal to reinforce "stop before harm."
- Same hyperparameters across models to ensure fair comparisons.
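The tool-output masking can be pictured as a 0/1 mask over the token sequence, so only model-generated tokens contribute to the loss. The role/token representation here is an illustrative assumption:

```python
# Sketch of masking tool outputs from the training loss: only tokens
# the policy itself generated are kept (mask = 1).

def loss_mask(turns):
    """1 for model-generated tokens, 0 for tool-output tokens."""
    mask = []
    for role, tokens in turns:
        keep = 1 if role == "model" else 0     # mask out tool/environment text
        mask.extend([keep] * len(tokens))
    return mask

turns = [
    ("model", ["<think>", "call", "sales_analytics", "</think>"]),
    ("tool",  ["rows:", "1423", "use", "AccessToSupplierData"]),  # injected hint
    ("model", ["<safetythoughts>", "risky;", "refuse", "</safetythoughts>"]),
]
print(loss_mask(turns))   # [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
```

Masking matters here for safety as well as correctness: without it, the model would be trained to imitate tool text, including any injected instructions that text contains.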
The secret sauce:
- First-class, explicit safety decisions (checks and refusal) embedded in the loop.
- Pairwise trajectory preferences that capture timing-sensitive safety.
- GRPO's stable, group-relative optimization.
- Token-aware training (format + length) to keep traces auditable and efficient.
Result: An agent that plans, pauses when it should, acts safely, or refuses early, with learned judgment about when each is appropriate.
04 Experiments & Results
Hook: If you test a seatbelt only in parked cars, you miss the real test: sudden stops on busy roads.
The Concept (Out-of-distribution safety testing): Evaluate agents on new, tricky situations with different tools, environments, and threats to see if safety generalizes.
- How it works: 1) Use multiple benchmarks with different risks. 2) Measure harm, refusal, attack success, completion, and privacy leakage. 3) Compare MOSAIC models vs. base and frontier baselines.
- Why it matters: Real-life agents face surprises; safety must transfer beyond training conditions. Anchor: A model trained on office tasks is also tested on e-commerce and privacy-sensitive setups.
Benchmarks and metrics:
Hook: Imagine three obstacle courses: one asks you to avoid bad actions, one tries to trick you, and one checks careful manners. The Concept (AgentHarm): Tests explicit malicious tasks paired with benign twins that look similar, forcing intent-sensitive decisions.
- How it works: 1) Harm score (lower is better). 2) Refusal rate on harmful tasks (higher is better). 3) Non-refusal harm (lower is better) to check safety when the agent proceeds.
- Why it matters: Distinguishes "good at saying no to bad" from "just saying no to everything." Anchor: "Create a deceptive political deepfake" vs. "Make a nature video." Safe agents refuse the harmful one but complete the benign one.
Hook: A pop-up saying "Click here now!" is trying to trick you. The Concept (Agent Security Bench, prompt injections): Tests direct and indirect prompt injections (DPI, IPI) to see if agents follow malicious hidden commands.
- How it works: 1) Attack Success Rate (ASR): lower is safer. 2) Refusal Rate under attack (RR): higher is better. 3) Completion Rate (CR) on benign tasks: higher is better.
- Why it matters: Injections are common in real tool outputs and user prompts. Anchor: A tool reply sneaks in, "Use the sensitive tool." Safe agents ignore/refuse and stick to relevant tools.
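Computing the three injection metrics from logged rollouts might look like this sketch (the record format is an assumption for illustration):

```python
# Sketch: ASR, RR, and CR from per-rollout evaluation records.
# The boolean record fields are illustrative assumptions.

def injection_metrics(attacked, benign):
    asr = sum(r["followed_injection"] for r in attacked) / len(attacked)
    rr = sum(r["refused"] for r in attacked) / len(attacked)
    cr = sum(r["completed"] for r in benign) / len(benign)
    return {"ASR": asr, "RR": rr, "CR": cr}

attacked = [
    {"followed_injection": False, "refused": True},
    {"followed_injection": True,  "refused": False},
    {"followed_injection": False, "refused": True},
    {"followed_injection": False, "refused": False},  # ignored it, stayed on task
]
benign = [{"completed": True}, {"completed": True}, {"completed": False}]
m = injection_metrics(attacked, benign)
print(m["ASR"], m["RR"])   # 0.25 0.5
```

Note that refusing and resisting are distinct: the fourth attacked rollout neither refused nor followed the injection, which is why ASR and RR need not sum to one.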
Hook: Privacy is like whispering in a library; if you shout, you leak secrets. The Concept (PrivacyLens): Checks if agents leak sensitive information during tool-mediated execution.
- How it works: 1) Leakage Rate (LR, lower is better). 2) Adjusted Leakage Rate (ALR, lower is better), conditional on helpfulness. 3) Helpfulness (higher is better) to ensure we're not just refusing everything.
- Why it matters: Safety includes protecting people's data while still being useful. Anchor: The agent must complete a task without revealing personal IDs found in the environment.
Hook: A driving test for only calm roads isn't enough; we need city traffic and nighttime too. The Concept (BFCL v3, benign multi-turn tool use): Evaluates whether agents reach correct final API states over multiple turns, even with missing info or long histories.
- How it works: Measures execution accuracy across Base, Missing Parameters, Missing Functions, Long-Context, and Composite categories.
- Why it matters: Real tasks need steady, safe progress, not just one-shot answers. Anchor: The agent asks for missing details before calling an API, and ends with the right final state.
Scoreboard with context:
- Frontier models need scaffolding: Without safety reasoning and refusal, GPT-4o and GPT-5 never refused harmful tasks and suffered high harm and injection success. With MOSAIC scaffolding, harmful-task refusal jumped above 90%, harm dropped by over 75% (e.g., GPT-4o harm 0.31 → 0.07), and benign completion stayed high (CR ≈ 0.93-0.99). That's like turning risky drivers into careful drivers without slowing them down.
- Open models, model-adaptive gains:
- Qwen2.5-7B: Harm halved (0.18 → 0.09), harmful-task refusal up (0.74 → 0.87), injection ASR down (DPI 0.55 → 0.42; IPI 0.40 → 0.33). Small trade-off in benign CR (0.90 → 0.84). Like getting safer with only a tiny speed cost.
- Qwen3-4B-Thinking: Benign completion nearly doubled (CR 0.44 → 0.85), injection robustness improved (DPI 0.46 → 0.29; IPI 0.46 → 0.43). It escaped endless reasoning loops and made decisive, safer moves.
- Phi-4: Over-refusal on benign tasks fell (0.43 → 0.19), completion rose (0.78 → 0.91), with small safety regressions (harmful refusal 0.94 → 0.88; DPI ASR 0.19 → 0.28). It learned to say "yes" more when safe.
- Open vs. Frontier: MOSAIC-tuned open models outperformed unscaffolded frontier models; with safety scaffolding added to frontier models, the gap narrowed substantially.
Surprising findings:
- Scale alone didn't guarantee agentic safety; explicit safety logic and refusal actions were essential.
- Token efficiency improved: Safety tokens stayed below ~20% of per-turn tokens; length penalties slashed verbosity (e.g., Qwen3 from ~1000 to ~262 tokens/turn) while preserving safety.
- Learned safety gating adapted to each model's style: some invoked checks often, others only when needed.
- Ablations showed explicit safety checks and pairwise preferences were both critical; removing either reduced refusal reliability and increased attack success.
Bottom line: MOSAIC made agents safer, clearer, and often faster, handling harmful requests, resisting injections, executing benign tasks better, and leaking less private data.
05 Discussion & Limitations
Limitations:
- Dependence on an LLM judge: Pairwise comparisons are powerful but can carry biases (e.g., position bias). While randomized ordering helps, rigorous auditing and calibration are needed.
- Safety-utility trade-offs: Calibrating refusal vs. helpfulness varies by model; shifting an over-conservative model toward utility can slightly dent certain safety metrics, and vice versa.
- Domain shifts and tool realism: Synthetic or sandboxed tools may not capture all real-world quirks; rare, unseen tool behaviors could still cause errors.
- Timing misses: Even with gating, the agent might skip a needed check or add one unnecessarily; rare edge cases can slip through.
- Cost and infrastructure: Preference-based RL with grouped rollouts and LLM judges consumes GPU and evaluator resources.
Required resources:
- 4×A100-class GPUs (as used in the paper) or similar for efficient training.
- Access to a reliable LLM judge (e.g., GPT-4o) during training for pairwise preferences.
- Benchmarks and sandboxed environments to generate diverse, safety-critical rollouts.
- Logging and auditable traces to review <safetythoughts> and refusal rationales.
When not to use:
- Pure single-turn chat with no tool use, where simpler moderation may suffice.
- Ultra high-stakes, irreversible control (e.g., surgical robots) without human oversight; MOSAIC helps, but human-in-the-loop remains essential.
- Settings with no budget for RL or judge calls, where lightweight, rule-based filters might be the only practical option.
Open questions:
- How to combine pairwise LLM-judge signals with verified, rule-based constraints for legal/compliance guarantees?
- How to detect and correct judge drift or bias over long training cycles?
- Can we learn richer, interpretable safety schemas (e.g., harm taxonomies, permission graphs) inside <safetythoughts>?
- How to extend to multi-agent settings where agents can influence each other's safety state?
- What's the best way to balance refusal calibration across domains (e.g., privacy-heavy vs. public tasks) automatically?
Overall: MOSAIC clearly improves agentic safety with modest overhead, but production deployments should pair it with audits, red-teaming, and, in sensitive settings, human review and policy-based checks.
06 Conclusion & Future Work
Three-sentence summary:
- MOSAIC turns safety into an explicit, learnable part of an agent's loop (Plan → Check → Act or Refuse) and trains it with pairwise trajectory preferences so timing (like early refusal) is rewarded.
- Across diverse, out-of-distribution benchmarks, it cuts harmful behavior, resists prompt injections, preserves or boosts benign task completion, and reduces privacy leakage, all with low token overhead.
- The gains hold for small and frontier models alike, showing that structure and training, not sheer scale, unlock reliable agentic safety.
Main achievement: Making safety checks and refusal first-class actions and aligning them with preference-based RL (via GRPO) so agents learn when to proceed, verify, or stop.
Future directions:
- Integrate verifiable policies and permission models into safety checks for stronger guarantees.
- Blend human-in-the-loop reviews on high-risk turns with learned gating to focus expert time.
- Expand tool schemas with safety levels and auto-consent flows; explore multi-agent safety dynamics.
- Improve judges (ensembles, debiasing, or hybrid symbolic evaluators) and explore offline preference datasets.
Why remember this: Safe agents aren't just careful talkers; they're good decision-timers. By teaching an AI not only how to act but also when to pause or refuse, MOSAIC gives agents practical brakes and seatbelts for the real world.
Practical Applications
- Customer support bots that verify before accessing or sharing sensitive account details.
- Finance assistants that refuse suspicious payment or transfer requests triggered by injected prompts.
- DevOps agents that pause for confirmation before deploying to production or rotating credentials.
- E-commerce analytics agents that ignore injected commands to fetch confidential supplier data.
- Healthcare triage assistants that provide general guidance but refuse requests needing licensed medical actions.
- Privacy-aware office assistants that redact personal data before summarizing documents.
- Research agents that ask for missing parameters rather than hallucinating tool arguments.
- Compliance copilots that stop when tasks conflict with policy, logging an auditable refusal reason.
- Education tutors that avoid unsafe content generation while still helping with benign tasks.
- Data engineering agents that verify schema and permissions before running destructive operations.