Key Summary
- AI agents are computer helpers that can use tools and act on their own, sometimes a little and sometimes a lot; this paper measures how that actually happens in real life.
- By studying millions of actions, the researchers found people let coding agents work longer without stopping them: extreme stretches grew from under 25 minutes to over 45 minutes in three months.
- As users gain experience, they click “go ahead” more often but also jump in to correct the agent more often, showing smarter oversight rather than less oversight.
- On hard tasks, the agent itself pauses to ask questions more than twice as often as humans interrupt it, like the agent raising its hand when unsure.
- Most agent actions seen through the public API were low-risk and could be undone, but some early uses touched higher-stakes areas like healthcare, finance, and cybersecurity.
- Software engineering made up nearly half of all agent tool use, but other fields are starting to experiment, so risk and autonomy could rise in the future.
- The main message: we need better privacy-preserving monitoring after deployment and better designs that help humans and AIs share control safely.
- Capability tests in labs and real-world behavior can look very different; together they show agents could handle more autonomy than they’re currently given.
- The study has limits: it covers only one model provider, splits its views between single actions (API) and full sessions (Claude Code), and relies on AI-assisted labeling.
- The authors recommend investing in post-deployment monitoring, teaching models to recognize uncertainty, and designing products for easy human intervention.
Why This Research Matters
Real-world measurement shows how people and agents truly share control, which is key for safety and productivity. By seeing when users approve, interrupt, and let agents run, we can design products that are fast when tasks are safe and careful when tasks are risky. Teaching agents to ask for clarification at the right moments makes them safer teammates, especially as they enter areas like finance or healthcare. Policymakers can focus on outcomes—effective oversight—rather than forcing one-size-fits-all rules like approving every single step. Companies can spot where autonomy is underused or overused and adjust settings, permissions, and training. As agents spread beyond coding, these lessons help prevent rare but serious mistakes while keeping the helpful speed gains. In short, measuring real behavior is the compass for safe, useful autonomy.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you have a super-smart helper robot that can use your laptop, send emails, and look things up online. Sometimes you watch it closely; other times you let it work on its own. How much should you trust it to do by itself?
🥬 The Concept: AI agents are computer programs that can use tools to take actions in the world, like running code or sending messages.
- How it works: (1) A person gives a goal, (2) the agent plans steps, (3) it uses tools to act (like code runners or web APIs), (4) it checks results and either continues, asks questions, or stops.
- Why it matters: Without a clear idea of how much freedom (autonomy) these agents use in real life, it’s hard to keep them helpful and safe.
🍞 Anchor: Think of a homework helper: sometimes you tell it exactly what to do; other times it figures things out. This paper measures how often it figures things out by itself and when you step in.
— 🍞 Hook: You know how your parents might first watch you closely when you cook, but as you get better, they hover less and just step in if the stove looks too hot?
🥬 The Concept: Autonomy means how independently an AI agent works without a human guiding each step.
- How it works: (1) Low autonomy = do exactly what the person says step-by-step, (2) medium autonomy = follow the goal but make small choices, (3) high autonomy = choose methods and steps with little checking.
- Why it matters: If autonomy is too low, agents waste human time. If it’s too high, they might make risky mistakes.
🍞 Anchor: Like a bike with training wheels (low autonomy), then a parent jogging beside you (medium), and finally you riding alone (high).
— 🍞 Hook: Imagine giving your helper a toolbox with a hammer, screwdriver, and tape. What it chooses tells you what it’s building.
🥬 The Concept: Tools are the special abilities agents use—like running code, calling services, or sending messages; a tool call is one use of one tool.
- How it works: (1) The agent decides a next action, (2) it calls a tool with inputs, (3) the tool returns a result, (4) the agent decides what to do next.
- Why it matters: By studying tool calls, we can see what agents actually do without reading private user data.
🍞 Anchor: If you see lots of “run code” tool calls, the agent is coding; if you see “send email,” it’s contacting people.
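The decide, call, observe, repeat cycle above can be sketched in a few lines of Python. The tool names (`run_code`, `web_search`) and the toy decision policy are illustrative assumptions, not anything from the paper:

```python
# Minimal sketch of the decide -> call tool -> observe -> repeat loop.
# Tool names and the toy "policy" are illustrative, not from the paper.

def run_code(source: str) -> str:
    """Stand-in tool: pretend to execute code and report what happened."""
    return f"ran {len(source)} chars of code"

def web_search(query: str) -> str:
    """Stand-in tool: pretend to search the web."""
    return f"top result for {query!r}"

TOOLS = {"run_code": run_code, "web_search": web_search}

def next_action(history: list) -> tuple:
    """Toy policy: search once, run code once, then stop."""
    if len(history) == 0:
        return ("web_search", "how to sort a list")
    if len(history) == 1:
        return ("run_code", "print(sorted([3, 1, 2]))")
    return ("stop", None)

def agent_loop() -> list:
    history = []
    while True:
        tool, args = next_action(history)
        if tool == "stop":
            return history
        result = TOOLS[tool](args)  # one tool call: input in, result out
        history.append((tool, args, result))

print(len(agent_loop()))  # the toy agent makes 2 tool calls, then stops
```

Each tuple appended to `history` is one tool call, which is exactly the unit the paper counts and classifies.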
— 🍞 Hook: Picture a sports coach watching a team. Sometimes the coach shouts every play; other times they just call time-outs when needed.
🥬 The Concept: Human-in-the-loop oversight means a person can approve, deny, or redirect the agent’s actions.
- How it works: (1) The agent proposes an action, (2) the human can approve it, stop it, or change the plan, (3) the agent continues.
- Why it matters: Oversight catches mistakes before they cause trouble, but too much slows everything down.
🍞 Anchor: A teacher who lets a group project run but steps in when a team member goes off-topic.
—
- The World Before: Not long ago, many “agents” were just chatbots answering questions. They didn’t reliably use tools or act through long sequences on their own. Researchers mostly tested them in labs, not in the messy real world.
- The Problem: We didn’t know how people actually used agents day-to-day. How much freedom did they give them? Did humans watch every step or check in only sometimes? Were the actions risky?
- Failed Attempts: Lab tests (capability benchmarks) showed what models could do in perfect conditions—but missed how humans interrupt, approve, or change tasks in real life. Looking at logs without context also failed because it’s hard to link many single actions into a full story.
- The Gap: We needed real-world measurements that show both tiny actions (tool calls) and full, linked workflows, while keeping privacy.
- Real Stakes: This matters for safety (don’t send wrong emails or deploy bad code), speed (don’t require useless approvals), trust (humans need to feel in control), and policy (rules should fit how people and agents actually work together).
02 Core Idea
🍞 Hook: You know how a coach judges a team not just by tryouts but by real games? Real games show who passes, who shoots, and when time-outs happen.
🥬 The Concept: The key insight is to measure autonomy and oversight where agents actually work—by analyzing tool calls across many users plus full sessions from a coding agent product.
- How it works: (1) Watch what tools agents use (public API) and how whole sessions flow (Claude Code), (2) label things like risk, autonomy, approvals, interruptions, and clarifying questions, (3) look for patterns across experience levels and task difficulty, (4) compare with lab benchmarks to see gaps.
- Why it matters: Real usage reveals how humans and agents co-manage autonomy—information you can’t get from lab tests alone.
🍞 Anchor: It’s like filming actual games to learn when players ask for help, when coaches step in, and which plays are too risky to try.
— Three analogies for the same idea:
- School project: Instead of only grading practice worksheets (lab tests), the teacher also watches how the group actually divides tasks, asks for help, and finishes the poster (real sessions + tool calls).
- Airplane autopilot: Pilots don’t fiddle with every control; they watch and step in at key moments. Measuring those handovers and alerts tells you about safe autonomy.
- Kitchen shift: A head chef observes which tasks helpers do solo, when they ask for a taste check, and when the chef grabs the pan—revealing a pattern of trust and control.
Before vs After:
- Before: People assumed tight supervision was always safer and that autonomy mainly came from how “smart” the model was.
- After: We see autonomy rising smoothly in practice, experienced users shifting from pre-approvals to monitoring-and-intervening, and agents themselves asking for help more on hard tasks.
- Net change: Autonomy isn’t just a model trait; it’s a dance between the model, the user, and the product design.
🍞 Hook: Think of a speedometer (how fast), a fuel gauge (how far you can go), and a brake light (when to stop). Together, they tell you how to drive safely.
🥬 The Concept: The paper’s building blocks are simple, trackable signals: turn duration, auto-approve, human interrupts, agent clarifications, and risk/autonomy scores for actions.
- How it works: (1) Turn duration hints at how long the agent runs per step, (2) auto-approve shows trust to continue without checks, (3) interrupts show targeted corrections, (4) clarifications show the agent’s uncertainty handling, (5) risk/autonomy scores describe the stakes and independence of each action.
- Why it matters: Each signal alone is limited, but together they paint a picture of practical autonomy.
🍞 Anchor: Like judging a basketball team by shots taken (actions), time holding the ball (turn duration), coach time-outs (interrupts), and players calling for a play (clarifications), plus how risky each play was.
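Tallying these signals from session logs is straightforward. A minimal sketch, assuming a made-up per-turn schema (field names like `auto_approved` are ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One agent turn in a session (illustrative schema, not the real logs)."""
    duration_s: float          # how long the agent ran this turn
    auto_approved: bool        # ran without a manual approval
    interrupted: bool          # human stopped it mid-action
    asked_clarification: bool  # agent paused to ask a question

def oversight_summary(turns: list) -> dict:
    """Rates of each trackable signal across a session's turns."""
    n = len(turns)
    return {
        "auto_approve_rate": sum(t.auto_approved for t in turns) / n,
        "interrupt_rate": sum(t.interrupted for t in turns) / n,
        "clarification_rate": sum(t.asked_clarification for t in turns) / n,
        "mean_duration_s": sum(t.duration_s for t in turns) / n,
    }

# Invented example session: four turns, one interrupt, one clarification.
turns = [
    Turn(30, True, False, False),
    Turn(300, True, True, False),
    Turn(45, False, False, True),
    Turn(60, True, False, False),
]
print(oversight_summary(turns))
```

No single number here is decisive on its own, but the combined record is what lets the paper compare oversight styles across users.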
— 🍞 Hook: You know how sometimes you can do more than you’re allowed to yet? Like you could ride the big roller coaster, but the rules say wait.
🥬 The Concept: Deployment overhang means models could handle more autonomy than they’re being granted in real life right now.
- How it works: (1) Lab tests suggest capability, (2) real-world use shows shorter runs and more pauses, (3) as trust and tools improve, actual autonomy should rise toward capability.
- Why it matters: If we only look at lab scores, we may push too fast; if we ignore them, we may underuse helpful autonomy.
🍞 Anchor: A student ready for algebra who’s still stuck doing arithmetic worksheets—the potential is there, but the classroom rules haven’t caught up yet.
Why it works (intuition):
- Autonomy is co-constructed: models offer choices, users grant freedom or step in, and product features (like approvals) shape both. Measuring all three together reveals the real pattern.
- The agent’s own uncertainty checks (clarification questions) act like built-in brakes—key for safety as autonomy rises.
- Risk isn’t uniform: most actions are harmless, but rare high-stakes ones need strong oversight—so we must measure and design for both ends.
03 Methodology
At a high level: Real-world inputs → (A) Public API tool-call analysis → (B) Claude Code full-session analysis → (C) Privacy-preserving labeling of risk, autonomy, and human involvement → (D) Trend analysis and clustering → Outputs: findings and recommendations.
— 🍞 Hook: Imagine counting each time your helper uses a tool—like a hammer strike—across many workshops. You can learn a lot just from those strikes.
🥬 The Concept: Public API tool-call analysis means studying individual agent actions (tool uses) across thousands of customers.
- How it works: (1) Sample nearly a million tool calls, (2) look at the context (system prompt, conversation), (3) have an AI classifier estimate risk, autonomy, and human involvement for each, (4) group similar actions.
- Why it matters: It gives broad coverage of real deployments without needing to see a customer’s whole setup.
🍞 Anchor: Even if you can’t watch an entire house being built, counting nail strikes tells you where most building happens and which jobs need more care.
— 🍞 Hook: Now picture watching a single workshop from start to finish. You see the whole project, not just the hammer hits.
🥬 The Concept: Claude Code session analysis links actions across time to study turns, approvals, interruptions, clarifications, and outcomes.
- How it works: (1) Track start/stop of each turn, (2) record when users approve or interrupt, (3) note when the agent pauses to ask, (4) summarize how session difficulty and success change.
- Why it matters: Full workflows reveal how autonomy grows and where humans step in—data you can’t get from single actions.
🍞 Anchor: It’s like filming the entire cooking class, so you know when students ask questions, when the teacher steps in, and which dishes finish well.
— 🍞 Hook: Think of a stopwatch on every play in a game.
🥬 The Concept: Turn duration is how long the agent works in one step before pausing, asking, finishing, or being stopped.
- How it works: (1) Start timing when the agent begins a turn, (2) stop when the turn ends, (3) chart the distribution to see usual vs extreme lengths.
- Why it matters: Long turns hint at more autonomous stretches; very short turns suggest frequent check-ins.
🍞 Anchor: If a coder-agent often works for 30–45 minutes before pausing, it’s likely trusted to carry longer parts of the job.
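Step (3), charting usual vs extreme lengths, comes down to computing percentiles over logged durations. A sketch using Python's standard library on simulated data; the distribution parameters are invented to loosely echo the short-median, long-tail shape the paper reports:

```python
import random
import statistics

random.seed(0)
# Simulated turn durations in seconds: mostly short turns with a long tail.
# The lognormal parameters are made up for illustration only.
durations = [random.lognormvariate(3.8, 1.2) for _ in range(100_000)]

median = statistics.median(durations)
# quantiles(n=1000) returns 999 cut points at 0.1% steps;
# the last one is the 99.9th percentile.
p999 = statistics.quantiles(durations, n=1000)[-1]

print(f"median: {median:.0f}s, 99.9th percentile: {p999 / 60:.1f} min")
```

Tracking the median and an extreme percentile separately is the point: the typical turn can stay short even while the longest unsupervised stretches grow.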
— 🍞 Hook: Like giving your helper a green light so they don’t need to ask you for each tiny step.
🥬 The Concept: Auto-approve lets the agent act without manual approval for every action.
- How it works: (1) The product can require approvals by default, (2) users can switch to auto-approve, (3) the agent proceeds unless the human interrupts.
- Why it matters: Shows trust and speeds work—but needs safety nets.
🍞 Anchor: Setting your robot vacuum to clean the whole house while you keep an eye on it, instead of telling it to clean each room one by one.
— 🍞 Hook: Think of a referee’s whistle to stop play when something looks off.
🥬 The Concept: Human interruptions are mid-action stops to redirect or correct the agent.
- How it works: (1) User notices drift or a better path, (2) they hit pause/interrupt, (3) they give guidance, (4) the agent resumes.
- Why it matters: Interrupts show active monitoring and targeted oversight.
🍞 Anchor: You stop the agent before it deploys code, ask it to add tests, then continue.
— 🍞 Hook: Remember raising your hand in class when you weren’t sure? Agents can do that too.
🥬 The Concept: Agent-initiated clarifications happen when the agent pauses to ask a question or present options.
- How it works: (1) Agent detects uncertainty, missing info, or a fork in the road, (2) it asks for guidance, (3) waits, (4) continues safely.
- Why it matters: This self-check reduces mistakes as tasks get harder.
🍞 Anchor: The agent asks “Do you want speed or accuracy?” before picking a solution.
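Step (1), detecting a fork in the road, can be pictured as a near-tie between candidate plans: when the top options score almost equally, asking beats guessing. This threshold heuristic is our illustration, not how the model is actually trained:

```python
def should_ask(candidates: dict, margin: float = 0.1) -> bool:
    """Ask for guidance when the top two options are too close to call.

    candidates maps option name -> the agent's confidence in it (0..1).
    The margin threshold is an invented illustration, not the real mechanism.
    """
    scores = sorted(candidates.values(), reverse=True)
    return len(scores) > 1 and (scores[0] - scores[1]) < margin

# Near-tie between two designs -> pause and ask the user.
print(should_ask({"optimize for speed": 0.48, "optimize for accuracy": 0.46}))  # True
# Clear winner -> proceed without interrupting the human.
print(should_ask({"fix the failing test": 0.9, "rewrite the module": 0.2}))     # False
```

Calibrating that margin is the open problem the paper flags: too tight and the agent never asks, too loose and it pesters the user.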
— 🍞 Hook: Traffic lights keep cars safe; some tools do that for agents.
🥬 The Concept: Safeguards are safety features like permission limits, human approvals, and restricted tools.
- How it works: (1) Define what the agent is allowed to do, (2) require checks for important actions, (3) log activity for review.
- Why it matters: Most real agents use safeguards so small slips don’t become big problems.
🍞 Anchor: An agent can read files but can’t email customers without a human click.
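A safeguard layer of this kind is often just an allowlist plus an approval gate for sensitive tools. A minimal sketch with invented tool names and a default-deny policy (purely illustrative):

```python
# Illustrative safeguard policy: what the agent may do freely,
# what needs a human click, and what is blocked outright.
ALLOWED = {"read_file", "run_tests"}
NEEDS_APPROVAL = {"send_email", "deploy"}

def check(tool: str, human_approved: bool = False) -> str:
    """Return the policy decision for one proposed tool call."""
    if tool in ALLOWED:
        return "allow"
    if tool in NEEDS_APPROVAL:
        return "allow" if human_approved else "ask_human"
    return "deny"  # default-deny anything not explicitly listed

print(check("read_file"))                        # allow
print(check("send_email"))                       # ask_human
print(check("send_email", human_approved=True))  # allow
print(check("wire_transfer"))                    # deny
```

The default-deny fallback is the design choice doing most of the safety work: new or unexpected tools require an explicit decision before the agent can touch them.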
— 🍞 Hook: Imagine rating rides at a fair from 1 (gentle) to 10 (extreme) and how much the ride runs itself.
🥬 The Concept: Risk and autonomy scores are 1–10 ratings that describe how risky an action is and how independently the agent seems to be operating.
- How it works: (1) An AI classifier looks at each tool call’s context, (2) assigns a risk score (harm if wrong) and autonomy score (independence), (3) results are used for comparisons across clusters.
- Why it matters: Not all actions are equal; these scores help focus oversight where it counts.
🍞 Anchor: Retrieving a public document (low risk) vs changing medical records (higher risk) demand very different safeguards.
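In the paper these scores come from an AI classifier reading each tool call's full context. The keyword rules below are a toy stand-in that only shows the shape of the output (a 1 to 10 risk and autonomy score per call), not the real method:

```python
def score_call(tool: str, context: str) -> dict:
    """Toy stand-in for the AI classifier: assign 1-10 risk/autonomy scores.

    Real scoring reads the system prompt and conversation with a model;
    these keyword rules are purely illustrative.
    """
    risk = 2  # baseline: most actions are low-stakes
    if any(word in tool for word in ("delete", "deploy", "email", "payment")):
        risk = 7  # harder to undo, higher harm if wrong
    autonomy = 8 if "auto-approve" in context else 4
    return {"tool": tool, "risk": risk, "autonomy": autonomy}

print(score_call("fetch_public_doc", "user watching each step"))
print(score_call("send_email", "auto-approve enabled"))
```

Whatever produces them, per-call records like these are what the clustering and trend analysis downstream consume.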
— 🍞 Hook: Sorting a big pile of Lego by shape makes patterns easy to see.
🥬 The Concept: Clustering groups similar tool calls so we can compare average risk and autonomy by type of action.
- How it works: (1) Extract features from context, (2) group similar actions, (3) compute mean scores per group, (4) visualize and track over time.
- Why it matters: It reveals which kinds of actions sit on the frontier of autonomy and risk.
🍞 Anchor: A cluster might be “deploy bug fixes,” another “send meeting reminders,” and their risk/autonomy look very different.
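Steps (2) and (3) reduce to grouping calls and averaging scores per group. A sketch that groups by a coarse action label, with invented data; real clustering would use richer context features:

```python
from collections import defaultdict

# (action label, risk score, autonomy score) -- invented example data.
calls = [
    ("deploy bug fixes", 6, 7),
    ("deploy bug fixes", 5, 8),
    ("send meeting reminders", 3, 6),
    ("send meeting reminders", 2, 5),
    ("retrieve public docs", 1, 4),
]

# Step (2): group similar actions together.
groups = defaultdict(list)
for label, risk, autonomy in calls:
    groups[label].append((risk, autonomy))

# Step (3): compute mean risk and autonomy per group.
for label, scores in groups.items():
    mean_risk = sum(r for r, _ in scores) / len(scores)
    mean_auto = sum(a for _, a in scores) / len(scores)
    print(f"{label}: risk={mean_risk:.1f}, autonomy={mean_auto:.1f}")
```

Plotting each group's mean risk against its mean autonomy is what makes the "frontier" clusters (high on both axes) stand out.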
— Secret sauce:
- Two complementary lenses: API gives breadth; Claude Code gives depth. Together they show both single steps and whole journeys.
- Agent self-checks: Training agents to ask clarifying questions turns autonomy from a cliff into a staircase.
- Privacy-preserving labeling: Using the model to label context (with opt-outs) scales insight while respecting user privacy.
- Focus on real behavior: Measuring how people actually use agents uncovers patterns that lab tests alone miss.
04 Experiments & Results
🍞 Hook: When you test a new bike, you don’t just weigh it—you ride it uphill, downhill, and over bumps to see what really happens.
🥬 The Concept: The study measured autonomy and oversight signals in real usage to see what changes with experience and task difficulty.
- How it works: (1) Track turn durations, approvals, interrupts, clarifications, (2) score risk and autonomy per tool call, (3) compare across user experience and complexity, (4) contrast with lab benchmarks.
- Why it matters: Numbers gain meaning when tied to behavior—how humans and agents actually share control.
🍞 Anchor: It’s not just “the bike is fast”; it’s “the bike stays steady on rough roads,” which is what you feel when you ride.
— The Tests and Why:
- Turn duration: Do agents get longer stretches of independent work? Because longer turns hint at more autonomy.
- Auto-approve rate: Do users trust the agent to run steps without asking every time? Because that speeds work but needs safety.
- Interrupt rate: Do experienced users step in more or less? Because targeted interventions signal smarter oversight.
- Clarification rate: Does the agent pause more on hard tasks? Because self-braking is crucial for safety.
- Risk/autonomy mapping: Where are actions low-risk and reversible vs higher-stakes? Because oversight should match stakes.
The Competition (Context):
- Lab capability: METR estimated a top model could complete tasks that would take a human about five hours—under ideal, no-interruption conditions.
- Real usage: In Claude Code, the 99.9th percentile turn lasted ~45 minutes, with a median around 45 seconds—showing big gaps between lab potential and in-the-wild patterns.
The Scoreboard—What Happened (with context):
- Longest turns nearly doubled: The 99.9th percentile of turn duration grew from under 25 minutes to over 45 minutes in about three months. Like going from a short study session to nearly a full class period without stopping.
- Median stayed short: Around 45 seconds, which fits rapid growth with many newer users who supervise closely.
- Experience reshapes oversight: New users used full auto-approve in ~20% of sessions; experienced users (hundreds of sessions) did so in over 40%, yet they also interrupted more (from ~5% to ~9% of turns). That’s like trusting your teammate to run plays but still calling out when you spot an opening.
- Agent raises its hand: On complex tasks, the agent asked for clarification more than twice as often as humans interrupted—evidence of uncertainty awareness.
- Risk landscape: Most API tool calls were low-risk and reversible; 80% showed some safeguard, ~73% seemed to have a human in the loop, and only ~0.8% looked irreversible (like emailing a customer). Still, a few clusters touched higher-stakes areas (security, finance, medical data).
- Domain spread: Nearly 50% of tool calls were in software engineering; smaller but growing activity appeared in business ops, customer support, sales, finance, and e-commerce.
Surprising or Notable Findings:
- More experience = more auto-approve AND more interrupts: This seems contradictory but actually signals a smarter strategy—monitor and step in when it matters rather than approve every small step.
- Agent-initiated stops exceed human interrupts on hard tasks: The agent often limits its own autonomy when the job gets tricky, which can reduce harm.
- Smooth growth across model releases: Longest turns increased steadily rather than spiking only with new versions, suggesting product design and user trust also drive autonomy—not just raw capability.
🍞 Hook: It’s one thing to know a car can go 150 mph on a test track; it’s another to see how fast people actually drive to school.
🥬 The Concept: Lab benchmarks show ceiling capability, while real-world logs show chosen behavior under oversight.
- How it works: (1) Compare idealized success (no human, no consequences) vs (2) practical runs (humans approve/interrupt, agent asks questions).
- Why it matters: Safety and usefulness depend on chosen behavior, not just possible behavior.
🍞 Anchor: A roller coaster rated for extreme speed still runs at family-safe settings during the day; both facts are true and important in different ways.
05 Discussion & Limitations
🍞 Hook: If you’re letting a friend drive your go-kart, you’d want a speedometer, good brakes, and a way to tap them on the shoulder—plus a track marshal watching.
🥬 The Concept: Limits and needs help us use agents wisely—know what the method can’t see, what it requires, and where it may fail.
- How it works: (1) Name the limits: one provider’s data, an API view of single steps rather than whole sessions, coding-heavy sessions that may not generalize, AI-assisted labels that can err, and a short time window; (2) weigh every finding against those limits.
- Why it matters: These constraints mean results are informative but not universal; we should treat them as early signals, not final truth.
🍞 Anchor: It’s like weather from one city this season—you learn a lot, but you don’t assume it’s the same everywhere all year.
Limitations (specific):
- Scope: Only Anthropic’s systems; other ecosystems may differ.
- Fragmented view: API shows isolated tool calls; Claude Code shows full sessions but mostly coding tasks.
- Labeling noise: AI-generated risk, autonomy, and human-involvement scores (computed after honoring user opt-outs) can err and may overcount human involvement.
- Overrepresentation: Workflows with many steps (like repeated file edits) appear more often in the sample.
- Context gaps: Some “risky” actions may be simulations or evaluations; downstream human checks may exist where logs can’t see them.
Required Resources:
- Privacy-preserving monitoring infrastructure, safe data pipelines, and model-assisted labeling tools.
- Product features: clear logs, real-time steering, easy interrupts, and safe defaults for permissions.
- Teams to maintain oversight dashboards and iterate policies.
When NOT to Use High Autonomy:
- High-stakes, hard-to-undo actions without strong safeguards (e.g., medical record edits, financial transfers) and clear human accountability.
- Domains where verifying outputs needs rare expertise and quick feedback isn’t possible.
- Early pilots without monitoring—start with tighter oversight and expand gradually.
Open Questions:
- How to reliably link tool calls into sessions across diverse products while preserving privacy?
- How to calibrate agent clarifications so they ask at the right times—not too often, not too rarely?
- Will non-coding domains see the same oversight shift (approve-every-step → monitor-and-intervene)?
- How should policy focus on outcomes (effective oversight) instead of prescribing exact interaction patterns?
06 Conclusion & Future Work
🍞 Hook: Imagine building a shared dashboard for you and your robot helper that shows what it’s doing, how risky each step is, and gives you a big pause button.
🥬 The Concept: This paper measures how people and agents actually share control in the wild and finds autonomy rising, oversight getting smarter, and agents asking for help more on hard tasks.
- How it works: (1) Study single actions across many customers (API) and full workflows in a coding agent (Claude Code), (2) score risk, autonomy, and human involvement, (3) track turn lengths, approvals, interrupts, and clarifications.
- Why it matters: Real-life data shows where safety and speed meet—so we can design better guardrails and grant autonomy where it’s earned.
🍞 Anchor: Like a coach learning when to let the team run and when to call time-out, guided by real game footage—not just practice drills.
3-Sentence Summary:
- The authors analyzed millions of agent actions to see how much independence agents actually get, how oversight changes with user experience, and where risks appear.
- They found extreme autonomous stretches are getting longer, experienced users both auto-approve more and interrupt more, and agents themselves pause to ask questions more often on hard tasks.
- Most actions are low-risk and reversible, but frontier uses in healthcare, finance, and security are emerging, so we need better post-deployment monitoring and designs that enable effective human intervention.
Main Achievement:
- A practical, privacy-aware way to measure agent autonomy, oversight, and risk in real deployments—and clear evidence that autonomy in practice is co-created by the model, the user, and product design.
Future Directions:
- Build cross-industry, privacy-preserving methods to link actions into sessions; improve uncertainty calibration so agents ask at the right times; and expand measurements beyond software into higher-stakes fields.
Why Remember This:
- Because the safest and most useful agents won’t come just from smarter models—they’ll come from measuring real behavior, teaching agents to raise their hands when unsure, and giving humans simple, strong controls to guide autonomy.
Practical Applications
- Add real-time interrupt and approve controls with clear activity logs in any agent product.
- Enable graded autonomy modes (low/medium/high) that expand as trust and performance improve.
- Train agents to detect uncertainty and ask clarifying questions before high-impact actions.
- Use risk and autonomy scoring to route tasks: low-risk tasks can be auto-approved; higher-risk tasks require human checks.
- Build dashboards that track turn duration, interrupt rates, and clarifications to monitor autonomy drift over time.
- Set default safeguards (permissions, irreversible-action blocks) and relax them only with evidence from monitoring data.
- Pilot agents in low-risk domains first, then expand to higher-stakes uses as oversight proves effective.
- Cluster common actions to find frontier areas needing extra guardrails or policy review.
- Run A/B tests on oversight UX (e.g., plan previews, batched approvals) to reduce interruptions without losing safety.
- Create playbooks for experienced users to shift from step-by-step approvals to monitor-and-intervene workflows.
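The risk-based routing bullet above can be sketched as a simple threshold policy; the thresholds and mode names here are illustrative assumptions, not a recommended standard:

```python
def route(risk: int, reversible: bool) -> str:
    """Map a 1-10 risk score to an oversight mode (illustrative thresholds)."""
    if risk >= 8 or (risk >= 5 and not reversible):
        return "require_human_approval"     # high stakes: a human must click
    if risk >= 5:
        return "monitor_and_allow_interrupt"  # medium: run, but watchable
    return "auto_approve"                    # low stakes and undoable: go

print(route(risk=2, reversible=True))   # auto_approve
print(route(risk=6, reversible=True))   # monitor_and_allow_interrupt
print(route(risk=6, reversible=False))  # require_human_approval
```

Treating reversibility as a separate input, rather than folding it into the risk score, keeps the rare hard-to-undo actions behind a human check even when their nominal risk looks moderate.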