"What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step Processing
Key Summary
- The study tested how an in-car AI helper should talk while it works on long, multi-step tasks.
- They compared two styles: staying quiet until the end (No Intermediate) versus giving step-by-step plans and mini-results (Planning & Results).
- With 45 people in a car simulator, intermediate updates made the AI feel faster, more trustworthy, and more pleasant to use.
- Surprisingly, step-by-step updates also lowered mental load by reducing frustration compared to a big final info dump.
- These benefits held both when people were sitting still and when they were doing a simple driving task at the same time.
- Longer tasks felt slower overall, but intermediate updates buffered that slowdown and kept users feeling in the loop.
- Interviews showed people want an adaptive approach: start very transparent to build trust, then get briefer as the AI proves reliable.
- People also want transparency to come back for new, unclear, or high-stakes tasks, plus a quick way to mute or expand details on demand.
- Design takeaway: use content-bearing intermediate updates (not just “working on it”), pace them sensibly, and let users adjust verbosity.
- These insights can guide other agentic assistants beyond cars, especially for tasks that take many seconds and use different attention channels.
Why This Research Matters
In cars, your eyes and mind should stay on the road, not worrying whether the assistant understood you. Short, meaningful updates during multi-step tasks make waiting feel shorter and reduce the stress of “Is it stuck?” This helps drivers feel safer and more in control without being buried in a big speech at the end. As assistants enter more parts of life, from homes to workplaces to wearables, these findings offer a blueprint for talking just enough, at the right time. An adaptive approach means the assistant learns your comfort level, gets briefer as it proves itself, and becomes more transparent when the stakes or uncertainty rise. The result is technology that earns trust while respecting attention.
Detailed Explanation
01 Background & Problem Definition
You know how when you ask a friend to plan a trip, you don’t want silence for a long time, you want little check-ins like, “Found the flights,” or “Comparing hotels now”? Computers that use big language models (LLMs) are starting to act more like that friend. They don’t just answer one quick question; they can break a big job into smaller steps and do them for you. In a car, that might mean: find a contact, grab their address, check your battery, then add a charging stop.
🍞 Top Bread (Hook) Imagine ordering a pizza and hearing nothing for 30 minutes. Even if the pizza arrives hot, the wait feels worrying. Now imagine you get short updates: “We’re mixing the dough,” “It’s in the oven,” “Out for delivery.” The total time is the same, but it feels better.
🥬 Filling (New Concept 1: Large Language Model, LLM)
- What it is: An LLM is a computer program that understands and produces human-like language.
- How it works (simple steps): 1) Reads your words, 2) Predicts likely next words using patterns it learned, 3) Writes helpful, on-topic replies, 4) Can plan steps or call tools if it’s an agent.
- Why it matters: Without LLMs, assistants can’t understand flexible requests like “Plan my trip and add a charging stop near the halfway point.”
🍞 Bottom Bread (Anchor): When you say, “Text Mom I’ll be there at 6,” the LLM figures out you want to message your mother and drafts a polite text.
🥬 Filling (New Concept 2: User Experience)
- What it is: User experience (UX) is how a person feels when using a system.
- How it works: 1) You interact, 2) You notice speed, clarity, and comfort, 3) You form an opinion: easy or annoying.
- Why it matters: If the UX feels bad, slow or confusing, you stop trusting or using the assistant.
🍞 Anchor: A voice assistant that explains progress calmly feels friendlier than one that’s silent and then dumps a huge speech at the end.
🥬 Filling (New Concept 3: Trust in AI)
- What it is: Trust is believing the AI will do the right thing reliably.
- How it works: 1) The AI behaves clearly, 2) It does what it says, 3) You see consistent good results, 4) You rely on it.
- Why it matters: Without trust, people second-guess the AI or ignore its help.
🍞 Anchor: If your car assistant has often been right about traffic detours, you’ll accept its new route faster next time.
🥬 Filling (New Concept 4: Cognitive Load)
- What it is: Cognitive load is how hard your brain is working.
- How it works: 1) You take in info, 2) You hold it in memory, 3) You make choices; more info at once makes it heavier.
- Why it matters: High load makes people tired, miss details, or make mistakes, which is dangerous when driving.
🍞 Anchor: Hearing 10 facts in 10 seconds is harder than hearing the same 10 facts spread out in short chunks.
🥬 Filling (New Concept 5: Feedback Timing)
- What it is: Feedback timing is when the assistant gives you updates.
- How it works: 1) It can stay quiet until the end, 2) Or it can share small updates along the way, 3) Or do a mix.
- Why it matters: Bad timing feels like a black box; good timing keeps you calm and informed.
🍞 Anchor: A barista saying “Almost done, just frothing the milk” helps you wait more comfortably than silence.
The problem before this research: Many agent-like systems either talk too little (you wait in silence and worry) or too much (they narrate every tiny step and distract you). This is extra tricky in cars, where your attention must stay on the road. Prior tricks like spinning wheels or “working on it” beeps help a bit, but they don’t tell you what the AI is actually doing, so you still feel uncertain. And when tasks are longer and have more steps, the amount of information grows, which can overload the driver if dumped at once.
Failed attempts and gaps: Silent systems reduce interruptions but lead to “ambiguous silence” (is it stuck or still working?). Super-chatty systems are transparent but can overload your attention. What was missing: clear, tested guidance on how much to say and when to say it for multi-step tasks in attention-sensitive places like cars.
Real stakes in daily life: In a car, bad feedback design can increase stress, make waiting feel longer, and even distract you. Good design can keep you confident, informed, and safer, like hearing “Step 2/4: Found the address,” instead of nothing, or instead of a giant final speech when you’re trying to focus on driving.
02 Core Idea
The “Aha!” in one sentence: Giving meaningful, bite-sized updates during a long task makes the assistant feel faster, more trustworthy, and easier to use than being silent and answering only at the end.
Three analogies:
- Cooking assistant: Instead of vanishing and returning with a full meal, the chef says, “Chopping veggies… simmering sauce… pasta’s ready,” which keeps you relaxed and in sync.
- Package tracking: Seeing “Shipped → In transit → Out for delivery” feels better than silence until the doorbell rings.
- School project: A teammate who posts small progress notes helps the group more than one who uploads everything at midnight with no warning.
Before vs. After:
- Before: Many agentic systems either stayed silent (you worried or checked repeatedly) or over-explained (you got distracted). Long tasks felt longer, and big final dumps were heavy to process.
- After: Short, content-bearing updates during the task reduce the sting of waiting, make the system’s plan visible, and spread info into easier chunks. People feel it’s faster, safer, and more reliable.
Why it works (intuition):
- Progress signals reduce uncertainty: Knowing “what’s happening now” makes the same wait feel shorter.
- Grounding understanding: Updates confirm the assistant not only heard you but is doing the right steps (“I’m extracting your friend’s address now”).
- Cognitive chunking: Small bites are easier than a single large info dump, lowering frustration and mental strain.
- Trust calibration: Seeing steps and mini-results teaches you when to rely on the system and when to double-check.
Building blocks (with simple “sandwich” intros for new concepts):
🍞 Top Bread (Hook) You know how trying to pat your head and rub your belly at the same time takes effort?
🥬 Filling (New Concept 6: Dual-Task Paradigm)
- What it is: The dual-task paradigm is when you do two things at once to study how they affect each other.
- How it works: 1) Set a main task, 2) Add a second task, 3) Measure how performance changes.
- Why it matters: It reveals how much “brain budget” is left for assistant feedback when you’re busy.
🍞 Bottom Bread (Anchor): Keeping a car lane position while listening to a voice assistant tests how much attention you can share.
🍞 Top Bread (Hook) Think of a helpful robot that can plan and act without you micromanaging every step.
🥬 Filling (New Concept 7: Agentic AI Assistant)
- What it is: An agentic AI assistant breaks your request into steps and takes actions (like searching, checking, planning) to finish the job.
- How it works: 1) Understands your goal, 2) Plans steps, 3) Uses tools, 4) Combines results, 5) Reports back.
- Why it matters: Without agency, complex requests would stall because the system couldn’t work through the parts.
🍞 Bottom Bread (Anchor): “Plan my route to Ella’s house and add a charging stop if my battery is under 20%.” The agent looks up Ella’s address, checks your battery, finds a charger, and builds the route.
🍞 Top Bread (Hook) You know how puzzles with 3 pieces vs. 6 pieces feel different to solve?
🥬 Filling (New Concept 8: Task Complexity)
- What it is: Task complexity is how many steps or decisions are needed.
- How it works: 1) Count steps, 2) See how long each step takes, 3) Notice how much info you must track.
- Why it matters: More complex tasks need more careful feedback to keep people oriented.
🍞 Bottom Bread (Anchor): 3-step planning (contact → address → route) is simpler than 6-step planning (plus battery check, charger choice, and schedule fit).
đ Top Bread (Hook) Itâs nicer when a friend gives short updates than when they vanish and then tell you a whole story at once.
🥬 Filling (New Concept 9: Intermediate Feedback)
- What it is: Intermediate feedback is when the assistant shares planned steps and mini-results during the task.
- How it works: 1) Preview the plan, 2) Give short updates at steady intervals, 3) Show small wins, 4) End with a clean summary.
- Why it matters: It reduces worry, keeps you aligned, and spreads the mental work.
🍞 Bottom Bread (Anchor): “Step 2/4: Got Ella’s address. Next, checking your battery level.”
🍞 Top Bread (Hook) Like a friend who knows when to be chatty and when to keep it short.
🥬 Filling (New Concept 10: Adaptive Verbosity)
- What it is: Adaptive verbosity means changing how much the assistant says based on the situation and your history with it.
- How it works: 1) Start transparent to build trust, 2) Get briefer as reliability is shown, 3) Expand again for new, unclear, or high-stakes tasks, 4) Let users quickly mute or ask for more detail.
- Why it matters: It balances transparency with focus, so you get just the right amount of info.
🍞 Bottom Bread (Anchor): If you often ask for the same route, the assistant keeps it short; if today’s route is tricky, it gives more details and checks choices.
03 Methodology
At a high level: Spoken request → Assistant processes a multi-step task → Gives either no updates or short updates → Final answer → People rate speed, workload, experience, and trust.
Study setup in plain steps:
- Participants and environment: 45 people sat in a full-size car mockup. They used a voice assistant shown on a center screen (with sound) and, in some tasks, did a simple lane-keeping driving simulation with a mouse.
- Two feedback styles:
- No Intermediate (NI): Acknowledge the request, then silence until the end, followed by one big final response.
- Planning & Results (PR): Share planned steps and mini-results during the task, with a short final summary.
- Two task lengths:
- Medium (about 26 seconds total): 3 assistant steps.
- Long (about 45 seconds total): 6 assistant steps.
- Two contexts:
- Stationary (single task): Just talk to the assistant.
- Driving (dual task): Keep the lane with the mouse while using the assistant.
- Measures (after tasks or blocks of tasks):
- Perceived speed (1–7): How fast did it feel?
- Task load (NASA-RTLX: mental demand, time pressure, frustration): How hard did it feel?
- User experience (UEQ+: attractiveness, dependability, risk handling): How good and safe did it feel?
- Trust (S-TIAS): Confidence, reliability, trustworthiness.
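For context on the task-load measure: a Raw TLX (RTLX) score is conventionally the unweighted mean of the subscale ratings. A minimal sketch using the three subscales listed above; the 0-100 ratings are made up, and the helper name is an assumption:

```python
# Sketch of Raw TLX (NASA-RTLX) scoring: the unweighted mean of the
# subscale ratings. The example ratings are illustrative, not study data.

def rtlx(mental_demand: float, time_pressure: float, frustration: float) -> float:
    """Raw TLX score: simple average of the three subscales used here."""
    return (mental_demand + time_pressure + frustration) / 3

score = rtlx(40, 30, 20)   # hypothetical 0-100 ratings -> 30.0
```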
Why each step exists (and what breaks without it):
- Two feedback styles: Needed to test if intermediate updates truly help. Without the contrast, we can’t tell if step-by-step is better than final-only.
- Two lengths: Needed to see if longer, more complex tasks make updates more valuable. Without this, we’d miss how time changes feelings.
- Two contexts: Needed to check whether multitasking changes preferences. Without this, results might not apply to real driving.
- Multiple measures: Speed, brain effort, experience, and trust each capture a different part of what “good” feels like. Without the mix, we might improve one thing (speed) but harm another (trust) and not notice.
Example with actual data flow (medium task, PR style):
- Seconds 0–2: “I’m planning…” (click + visual) acknowledges hearing you.
- 5s: “Step 1/3: Found contact.”
- 10s: “Step 2/3: Extracted address.”
- 15s: “Step 3/3: Checked battery (20%). Planning a charging stop.”
- 20–26s: Final short summary: “Here’s your route with a charger near halfway.” If the same task ran NI style, you’d get the initial “planning” cue, then silence, then a longer final monologue starting a bit earlier so both styles finish at the same time.
Example with actual data flow (long task, PR style):
- 6 short updates (every ~5 seconds) showing plan + mini-results (e.g., contact found, address extracted, battery checked, 2 charging options compared, optimal stop chosen, route integrated), then a short final summary.
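The fixed-interval pacing in both examples can be sketched as a tiny scheduler. The `pr_schedule` helper is a hypothetical illustration, not the study's software; the step texts mirror the medium-task example above:

```python
# Illustrative sketch of the fixed-interval "Planning & Results" update
# schedule: one content-bearing update roughly every 5 seconds.
# The function and its name are assumptions for illustration.

def pr_schedule(steps: list[str], interval: float = 5.0) -> list[tuple[float, str]]:
    """Return (time_in_seconds, utterance) pairs for each mini-result."""
    total = len(steps)
    return [(interval * (i + 1), f"Step {i + 1}/{total}: {step}")
            for i, step in enumerate(steps)]

updates = pr_schedule(["Found contact", "Extracted address",
                       "Checked battery (20%). Planning a charging stop."])
# first update: (5.0, "Step 1/3: Found contact")
```

A 6-step long task would simply pass six step descriptions, yielding updates at roughly 5, 10, 15, 20, 25, and 30 seconds.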
What makes this method clever (the secret sauce):
- Fixed timing for updates (about every 5 seconds) keeps the experience predictable and under common attention thresholds.
- Content-bearing updates (not just “working on it”) give grounding: proof the system is doing the right things.
- Multimodal feedback (audio + on-screen text) supports different attention needs: voice for eyes-on-road, brief visuals for recall.
- Final answers are synchronized so both styles end together, making the comparison fair.
Why the intermediate updates matter (what breaks without them):
- Without them, people face ambiguous silence and worry: Is it stuck? Did it understand me?
- Without content in updates, people still don’t know what’s happening; a spinner helps less than “Step 2/4: Extracting address.”
- Without spreading information into chunks, the big final dump can feel heavy and frustrating, especially in dual-task settings.
Secret checks and balances:
- Counterbalanced order (who saw which style first) avoids unfair advantages.
- Task text and timings were standardized to ensure the two feedback styles were truly comparable.
- Measures were placed after each task or block to capture fresh reactions without overloading participants.
04 Experiments & Results
The test: Measure how feedback timing (NI vs. PR) affects perceived speed, task load, user experience, and trust, across different task lengths (medium vs. long) and contexts (stationary vs. driving).
The competition: Two strategies went head-to-head.
- NI (No Intermediate): Quiet during processing, one big answer at the end.
- PR (Planning & Results): Short updates along the way, then a brief wrap-up.
The scoreboard with context:
- Perceived speed: PR felt much faster, with a large improvement (d_z ≈ 1.01). Think of it like jumping from a B- to an A+ in “feels fast.” Longer tasks made all systems feel slower, but PR cushioned that drop notably, especially when sitting still (no second task).
- Task load: Surprisingly, PR lowered task load a bit (d_z ≈ -0.26), mainly by reducing frustration. Even though PR “talked more,” the chunking made the work feel lighter than one big monologue.
- User experience: PR improved attractiveness, dependability, and risk handling, with a medium overall UX boost (d_z ≈ 0.54). People liked it more, felt more in control, and thought it handled risks better.
- Trust: PR increased trust modestly (d_z ≈ 0.38), especially on reliability and trustworthiness. Confidence also improved, but the strongest gains were in seeing the system as consistently dependable.
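The d_z values above are within-subject Cohen's d effect sizes: the mean of each participant's PR-minus-NI difference divided by the standard deviation of those differences. A minimal sketch with made-up ratings (not the study's data):

```python
# Sketch of the within-subject effect size d_z used in the results above:
# mean of paired differences divided by their standard deviation.
# The five ratings per condition below are invented for illustration.
import statistics

def cohens_dz(pr_scores: list[float], ni_scores: list[float]) -> float:
    """Paired-samples effect size: mean(diff) / sd(diff)."""
    diffs = [p - n for p, n in zip(pr_scores, ni_scores)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

# Hypothetical perceived-speed ratings (1-7) from five participants:
dz = cohens_dz([6, 5, 6, 7, 5], [4, 4, 5, 5, 4])
```

A d_z around 0.2 is conventionally read as small, 0.5 as medium, and 0.8 as large, which is why the ≈1.01 speed effect counts as large and the ≈0.38 trust effect as small-to-medium.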
Surprising findings:
- More updates did not increase mental workload; they actually reduced frustration. This suggests that short, meaningful updates fight the “black box” feeling.
- Driving vs. stationary context didn’t change the overall pattern dramatically. The benefits of PR showed up in both, with a small trend toward higher workload while driving, as expected, but not enough to flip the story.
- People more familiar with LLMs saw bigger jumps in trust and UX with PR, possibly because the step-by-step style matches how they imagine LLMs think (like chain-of-thought).
Concrete meaning of the numbers:
- Large improvement in perceived speed means many users genuinely felt the assistant was snappier with the same total time, because visible progress beats silence.
- Medium improvement in UX means it wasn’t just faster-feeling; it felt better designed, more dependable, and safer.
- Small-to-medium improvements in trust mean users were more willing to rely on PR-style assistants.
- The workload dip (mostly lower frustration) means the same information is easier to digest when it arrives in small chunks.
Bottom line: If your assistant does multi-step work that takes tens of seconds, short content-bearing updates make people feel calmer, quicker, and more confident than a single end-of-task speech.
05 Discussion & Limitations
Limitations:
- The participants were from one automotive company. They varied in age and tech familiarity, but results should still be confirmed with broader groups.
- The driving was a controlled lane-keeping task in a simulator. It increased attention demand consistently but can’t match the twists and stresses of real traffic.
- The update timing (about every 5 seconds) and content were fixed for fairness. Real assistants might need smarter, context-aware pacing.
- Feedback was always audio plus visuals. Other mixes (like haptics or visual-only) weren’t tested and could change preferences.
Required resources to use this approach:
- An agentic assistant that can plan steps and expose them cleanly (e.g., via tool-calling hooks).
- A feedback scheduler that decides what to say and when, with guardrails to avoid talking over more important sounds.
- Multimodal UI pieces: speech for eyes-on-road, brief visuals, plus a quick mute/expand control for users.
- Simple logic for adaptive verbosity (start high, shrink with proven reliability, expand for ambiguity/high stakes).
When not to use (or use carefully):
- Extremely long tasks (many minutes) where constant updates would be overwhelming; here, background mode with occasional checkpoints is better.
- Cases where the assistant’s channel fights the main task (e.g., audio updates during intense conversation or critical listening tasks). Let users mute or switch channels.
- Very low-stakes, super-short tasks where an update would cost more attention than it saves.
Open questions:
- How to detect ambiguity and stakes reliably so the system knows when to expand details?
- How to estimate “demonstrated reliability” cleanly from interaction history and tie it to verbosity settings?
- What’s the best mix of audio, visual, and haptic cues for different driving and social scenarios?
- Where is the time boundary between “keep me engaged with updates” and “just work silently in the background” for multi-minute agents?
06 Conclusion & Future Work
Three-sentence summary: This study shows that during long, multi-step tasks, in-car assistants that share short, meaningful progress updates feel faster, more trustworthy, and easier to use than assistants that stay silent until the end. These updates also lower frustration by spreading information into smaller bites. People prefer an adaptive style: start transparent to build trust, get briefer as reliability is proven, and re-expand for new, unclear, or high-stakes tasks, with simple controls to mute or see more.
Main achievement: Clear, controlled evidence that content-bearing intermediate feedback improves perceived speed, user experience, trust, and even reduces task load compared to final-only responses, across both single-task and dual-task settings.
Future directions: Build real-time policies that pace updates smartly, estimate reliability from user history, detect ambiguity and stakes, and choose the best channel mix (audio/visual/haptic) for each moment. Also, study transitions for multi-minute agents where background processing makes more sense than continuous narration.
Why remember this: The way an agent talks while it works is as important as the final answer. Small, well-timed, meaningful updates make long tasks feel shorter, safer, and more trustworthy, especially when your attention is precious, like in a car.
Practical Applications
- Design in-car voice assistants to show a short plan and give mini-results about every 5 seconds during multi-step tasks.
- Use content-bearing updates (e.g., “Step 2/4: Extracting address”) instead of vague fillers like “Working on it.”
- Provide a big mute/brief/expand control so users can quickly adjust verbosity during music, conversations, or difficult driving.
- Start with high transparency for new users, then automatically shorten updates as the system demonstrates reliability.
- Detect high-stakes or ambiguous actions (e.g., messaging, payments) and increase confirmations and detail on those steps.
- Coordinate audio and brief on-screen text so drivers can hear updates hands-free and glance later if needed.
- Cap update length and number: prefer a few crisp sentences per update rather than dense paragraphs.
- For very long jobs (minutes), switch to background mode with periodic checkpoints or a visual timeline instead of constant narration.
- Log user accept/interrupt/correction patterns to estimate demonstrated reliability and tune verbosity over time.
- Offer a one-shot voice command like “Quieter,” “Details,” or “Summarize” to instantly override the current policy.
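The logging idea above (estimating demonstrated reliability from accept/interrupt/correction patterns) can be sketched as a simple accept rate. The event labels and the formula are assumptions for illustration, not a prescribed metric:

```python
# Hypothetical sketch of estimating "demonstrated reliability" from logged
# user reactions. Event labels ("accept", "interrupt", "correct") and the
# simple accept-rate formula are illustrative assumptions.

def demonstrated_reliability(events: list[str]) -> float:
    """Fraction of completed tasks the user accepted without intervening."""
    if not events:
        return 0.0    # no history yet: treat the system as unproven
    accepted = sum(1 for e in events if e == "accept")
    return accepted / len(events)

r = demonstrated_reliability(["accept", "accept", "interrupt", "accept"])
# r == 0.75; as the accept rate rises over time, updates can get briefer
```

A real system would likely weight recent events more heavily and track reliability per task type, but even this crude rate gives the verbosity policy something concrete to condition on.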