"What Are You Doing?": Effects of Intermediate Feedback from Agentic LLM In-Car Assistants During Multi-Step Processing
Key Summary
- The study tested how an in-car AI helper should talk while it works on long, multi-step tasks.
- They compared two styles: staying quiet until the end (No Intermediate) versus giving step-by-step plans and mini-results (Planning & Results).
- With 45 people in a car simulator, intermediate updates made the AI feel faster, more trustworthy, and more pleasant to use.
- Surprisingly, step-by-step updates also lowered mental load by reducing frustration compared to a big final info dump.
- These benefits held both when people were sitting still and when they were doing a simple driving task at the same time.
- Longer tasks felt slower overall, but intermediate updates buffered that slowdown and kept users feeling in the loop.
- Interviews showed people want an adaptive approach: start very transparent to build trust, then get briefer as the AI proves reliable.
- People also want transparency to come back for new, unclear, or high-stakes tasks, plus a quick way to mute or expand details on demand.
- Design takeaway: use content-bearing intermediate updates (not just “working on it”), pace them sensibly, and let users adjust verbosity.
- These insights can guide other agentic assistants beyond cars, especially for tasks that take many seconds and use different attention channels.
Why This Research Matters
In cars, your eyes and mind should stay on the road, not worrying whether the assistant understood you. Short, meaningful updates during multi-step tasks make waiting feel shorter and reduce the stress of “Is it stuck?” This helps drivers feel safer and more in control without being buried in a big speech at the end. As assistants enter more parts of life, from homes to workplaces to wearables, these findings offer a blueprint for talking just enough, at the right time. An adaptive approach means the assistant learns your comfort level, gets briefer as it proves itself, and becomes more transparent when the stakes or uncertainty rise. The result is technology that earns trust while respecting attention.
Detailed Explanation
01 Background & Problem Definition
You know how when you ask a friend to plan a trip, you don’t want silence for a long time, you want little check-ins like, “Found the flights,” or “Comparing hotels now”? Computers that use big language models (LLMs) are starting to act more like that friend. They don’t just answer one quick question; they can break a big job into smaller steps and do them for you. In a car, that might mean: find a contact, grab their address, check your battery, then add a charging stop.
🍞 Top Bread (Hook) Imagine ordering a pizza and hearing nothing for 30 minutes. Even if the pizza arrives hot, the wait feels worrying. Now imagine you get short updates: “We’re mixing the dough,” “It’s in the oven,” “Out for delivery.” The total time is the same, but it feels better.
🥬 Filling (New Concept 1: Large Language Model, LLM)
- What it is: An LLM is a computer program that understands and produces human-like language.
- How it works (simple steps): 1) Reads your words, 2) Predicts likely next words using patterns it learned, 3) Writes helpful, on-topic replies, 4) Can plan steps or call tools if it’s an agent.
- Why it matters: Without LLMs, assistants can’t understand flexible requests like “Plan my trip and add a charging stop near the halfway point.”
🍞 Bottom Bread (Anchor): When you say, “Text Mom I’ll be there at 6,” the LLM figures out you want to message your mother and drafts a polite text.
🥬 Filling (New Concept 2: User Experience)
- What it is: User experience (UX) is how a person feels when using a system.
- How it works: 1) You interact, 2) You notice speed, clarity, and comfort, 3) You form an opinion: easy or annoying.
- Why it matters: If the UX feels bad, slow or confusing, you stop trusting or using the assistant.
🍞 Anchor: A voice assistant that explains progress calmly feels friendlier than one that’s silent and then dumps a huge speech at the end.
🥬 Filling (New Concept 3: Trust in AI)
- What it is: Trust is believing the AI will do the right thing reliably.
- How it works: 1) The AI behaves clearly, 2) It does what it says, 3) You see consistent good results, 4) You rely on it.
- Why it matters: Without trust, people second-guess the AI or ignore its help.
🍞 Anchor: If your car assistant has often been right about traffic detours, you’ll accept its new route faster next time.
🥬 Filling (New Concept 4: Cognitive Load)
- What it is: Cognitive load is how hard your brain is working.
- How it works: 1) You take in info, 2) You hold it in memory, 3) You make choices; more info at once makes it heavier.
- Why it matters: High load makes people tired, miss details, or make mistakes, which is dangerous when driving.
🍞 Anchor: Hearing 10 facts in 10 seconds is harder than hearing the same 10 facts spread out in short chunks.
🥬 Filling (New Concept 5: Feedback Timing)
- What it is: Feedback timing is when the assistant gives you updates.
- How it works: 1) It can stay quiet until the end, 2) Or it can share small updates along the way, 3) Or do a mix.
- Why it matters: Bad timing feels like a black box; good timing keeps you calm and informed.
🍞 Anchor: A barista saying “Almost done, just frothing the milk” helps you wait more comfortably than silence.
The problem before this research: Many agent-like systems either talk too little (you wait in silence and worry) or too much (they narrate every tiny step and distract you). This is extra tricky in cars, where your attention must stay on the road. Prior tricks like spinning wheels or “working on it” beeps help a bit, but they don’t tell you what the AI is actually doing, so you still feel uncertain. And when tasks are longer and have more steps, the amount of information grows, which can overload the driver if dumped at once.
Failed attempts and gaps: Silent systems reduce interruptions but lead to “ambiguous silence” (is it stuck or still working?). Super-chatty systems are transparent but can overload your attention. What was missing: clear, tested guidance on how much to say and when to say it for multi-step tasks in attention-sensitive places like cars.
Real stakes in daily life: In a car, bad feedback design can increase stress, make waiting feel longer, and even distract you. Good design can keep you confident, informed, and safer, like hearing “Step 2/4: Found the address,” instead of nothing, or instead of a giant final speech when you’re trying to focus on driving.
02 Core Idea
The “Aha!” in one sentence: Giving meaningful, bite-sized updates during a long task makes the assistant feel faster, more trustworthy, and easier to use than being silent and answering only at the end.
Three analogies:
- Cooking assistant: Instead of vanishing and returning with a full meal, the chef says, “Chopping veggies… simmering sauce… pasta’s ready,” which keeps you relaxed and in sync.
- Package tracking: Seeing “Shipped → In transit → Out for delivery” feels better than silence until the doorbell rings.
- School project: A teammate who posts small progress notes helps the group more than one who uploads everything at midnight with no warning.
Before vs. After:
- Before: Many agentic systems either stayed silent (you worried or checked repeatedly) or over-explained (you got distracted). Long tasks felt longer, and big final dumps were heavy to process.
- After: Short, content-bearing updates during the task reduce the sting of waiting, make the system’s plan visible, and spread info into easier chunks. People feel it’s faster, safer, and more reliable.
Why it works (intuition):
- Progress signals reduce uncertainty: Knowing “what’s happening now” makes the same wait feel shorter.
- Grounding understanding: Updates confirm the assistant not only heard you but is doing the right steps (“I’m extracting your friend’s address now”).
- Cognitive chunking: Small bites are easier than a single large info dump, lowering frustration and mental strain.
- Trust calibration: Seeing steps and mini-results teaches you when to rely on the system and when to double-check.
Building blocks (with simple “sandwich” intros for new concepts):
🍞 Top Bread (Hook) You know how trying to pat your head and rub your belly at the same time takes effort?
🥬 Filling (New Concept 6: Dual-Task Paradigm)
- What it is: The dual-task paradigm is when you do two things at once to study how they affect each other.
- How it works: 1) Set a main task, 2) Add a second task, 3) Measure how performance changes.
- Why it matters: It reveals how much “brain budget” is left for assistant feedback when you’re busy.
🍞 Bottom Bread (Anchor): Keeping a car lane position while listening to a voice assistant tests how much attention you can share.
🍞 Top Bread (Hook) Think of a helpful robot that can plan and act without you micromanaging every step.
🥬 Filling (New Concept 7: Agentic AI Assistant)
- What it is: An agentic AI assistant breaks your request into steps and takes actions (like searching, checking, planning) to finish the job.
- How it works: 1) Understands your goal, 2) Plans steps, 3) Uses tools, 4) Combines results, 5) Reports back.
- Why it matters: Without agency, complex requests would stall because the system couldn’t work through the parts.
🍞 Bottom Bread (Anchor): “Plan my route to Ella’s house and add a charging stop if my battery is under 20%.” The agent looks up Ella’s address, checks your battery, finds a charger, and builds the route.
🍞 Top Bread (Hook) You know how puzzles with 3 pieces vs. 6 pieces feel different to solve?
🥬 Filling (New Concept 8: Task Complexity)
- What it is: Task complexity is how many steps or decisions are needed.
- How it works: 1) Count steps, 2) See how long each step takes, 3) Notice how much info you must track.
- Why it matters: More complex tasks need more careful feedback to keep people oriented.
🍞 Bottom Bread (Anchor): 3-step planning (contact → address → route) is simpler than 6-step planning (plus battery check, charger choice, and schedule fit).
đ Top Bread (Hook) Itâs nicer when a friend gives short updates than when they vanish and then tell you a whole story at once.
🥬 Filling (New Concept 9: Intermediate Feedback)
- What it is: Intermediate feedback is when the assistant shares planned steps and mini-results during the task.
- How it works: 1) Preview the plan, 2) Give short updates at steady intervals, 3) Show small wins, 4) End with a clean summary.
- Why it matters: It reduces worry, keeps you aligned, and spreads the mental work.
🍞 Bottom Bread (Anchor): “Step 2/4: Got Ella’s address. Next, checking your battery level.”
🍞 Top Bread (Hook) Like a friend who knows when to be chatty and when to keep it short.
🥬 Filling (New Concept 10: Adaptive Verbosity)
- What it is: Adaptive verbosity means changing how much the assistant says based on the situation and your history with it.
- How it works: 1) Start transparent to build trust, 2) Get briefer as reliability is shown, 3) Expand again for new, unclear, or high-stakes tasks, 4) Let users quickly mute or ask for more detail.
- Why it matters: It balances transparency with focus, so you get just the right amount of info.
🍞 Bottom Bread (Anchor): If you often ask for the same route, the assistant keeps it short; if today’s route is tricky, it gives more details and checks choices.
03 Methodology
At a high level: Spoken request → Assistant processes a multi-step task → Gives either no updates or short updates → Final answer → People rate speed, workload, experience, and trust.
Study setup in plain steps:
- Participants and environment: 45 people sat in a full-size car mockup. They used a voice assistant shown on a center screen (with sound) and, in some tasks, did a simple lane-keeping driving simulation with a mouse.
- Two feedback styles:
- No Intermediate (NI): Acknowledge the request, then silence until the end, followed by one big final response.
- Planning & Results (PR): Share planned steps and mini-results during the task, with a short final summary.
- Two task lengths:
- Medium (about 26 seconds total): 3 assistant steps.
- Long (about 45 seconds total): 6 assistant steps.
- Two contexts:
- Stationary (single task): Just talk to the assistant.
- Driving (dual task): Keep the lane with the mouse while using the assistant.
- Measures (after tasks or blocks of tasks):
- Perceived speed (1–7): How fast did it feel?
- Task load (NASA-RTLX: mental demand, time pressure, frustration): How hard did it feel?
- User experience (UEQ+: attractiveness, dependability, risk handling): How good and safe did it feel?
- Trust (S-TIAS): Confidence, reliability, trustworthiness.
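For context on the task-load measure: a Raw TLX (RTLX) score is conventionally the unweighted mean of the subscale ratings. A minimal sketch using the three subscales listed above; the 0-100 ratings are made up, and the helper name is an assumption:

```python
# Sketch of Raw TLX (NASA-RTLX) scoring: the unweighted mean of the
# subscale ratings. The example ratings are illustrative, not study data.

def rtlx(mental_demand: float, time_pressure: float, frustration: float) -> float:
    """Raw TLX score: simple average of the three subscales used here."""
    return (mental_demand + time_pressure + frustration) / 3

score = rtlx(40, 30, 20)   # hypothetical 0-100 ratings -> 30.0
```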
Why each step exists (and what breaks without it):
- Two feedback styles: Needed to test if intermediate updates truly help. Without the contrast, we can’t tell if step-by-step is better than final-only.
- Two lengths: Needed to see if longer, more complex tasks make updates more valuable. Without this, we’d miss how time changes feelings.
- Two contexts: Needed to check whether multitasking changes preferences. Without this, results might not apply to real driving.
- Multiple measures: Speed, brain effort, experience, and trust each capture a different part of what “good” feels like. Without the mix, we might improve one thing (speed) but harm another (trust) and not notice.
Example with actual data flow (medium task, PR style):
- Seconds 0–2: “I’m planning…” (click + visual) acknowledges hearing you.
- 5s: “Step 1/3: Found contact.”
- 10s: “Step 2/3: Extracted address.”
- 15s: “Step 3/3: Checked battery (20%). Planning a charging stop.”
- 20–26s: Final short summary: “Here’s your route with a charger near halfway.” If the same task ran NI style, you’d get the initial “planning” cue, then silence, then a longer final monologue starting a bit earlier so both styles finish at the same time.
Example with actual data flow (long task, PR style):
- 6 short updates (every ~5 seconds) showing plan + mini-results (e.g., contact found, address extracted, battery checked, 2 charging options compared, optimal stop chosen, route integrated), then a short final summary.
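The fixed-interval pacing in both examples can be sketched as a tiny scheduler. The `pr_schedule` helper is a hypothetical illustration, not the study's software; the step texts mirror the medium-task example above:

```python
# Illustrative sketch of the fixed-interval "Planning & Results" update
# schedule: one content-bearing update roughly every 5 seconds.
# The function and its name are assumptions for illustration.

def pr_schedule(steps: list[str], interval: float = 5.0) -> list[tuple[float, str]]:
    """Return (time_in_seconds, utterance) pairs for each mini-result."""
    total = len(steps)
    return [(interval * (i + 1), f"Step {i + 1}/{total}: {step}")
            for i, step in enumerate(steps)]

updates = pr_schedule(["Found contact", "Extracted address",
                       "Checked battery (20%). Planning a charging stop."])
# first update: (5.0, "Step 1/3: Found contact")
```

A 6-step long task would simply pass six step descriptions, yielding updates at roughly 5, 10, 15, 20, 25, and 30 seconds.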
What makes this method clever (the secret sauce):
- Fixed timing for updates (about every 5 seconds) keeps the experience predictable and under common attention thresholds.
- Content-bearing updates (not just “working on it”) give grounding: proof the system is doing the right things.
- Multimodal feedback (audio + on-screen text) supports different attention needs: voice for eyes-on-road, brief visuals for recall.
- Final answers are synchronized so both styles end together, making the comparison fair.
Why the intermediate updates matter (what breaks without them):
- Without them, people face ambiguous silence and worry: Is it stuck? Did it understand me?
- Without content in updates, people still don’t know what’s happening; a spinner helps less than “Step 2/4: Extracting address.”
- Without spreading information into chunks, the big final dump can feel heavy and frustrating, especially in dual-task settings.
Secret checks and balances:
- Counterbalanced order (who saw which style first) avoids unfair advantages.
- Task text and timings were standardized to ensure the two feedback styles were truly comparable.
- Measures were placed after each task or block to capture fresh reactions without overloading participants.
04 Experiments & Results
The test: Measure how feedback timing (NI vs. PR) affects perceived speed, task load, user experience, and trust, across different task lengths (medium vs. long) and contexts (stationary vs. driving).
The competition: Two strategies went head-to-head.
- NI (No Intermediate): Quiet during processing, one big answer at the end.
- PR (Planning & Results): Short updates along the way, then a brief wrap-up.
The scoreboard with context:
- Perceived speed: PR felt much faster, with a large improvement (d_z ≈ 1.01). Think of it like jumping from a B- to an A+ in “feels fast.” Longer tasks made all systems feel slower, but PR cushioned that drop notably, especially when sitting still (no second task).
- Task load: Surprisingly, PR lowered task load a bit (d_z ≈ -0.26), mainly by reducing frustration. Even though PR “talked more,” the chunking made the work feel lighter than one big monologue.
- User experience: PR improved attractiveness, dependability, and risk handling, with a medium overall UX boost (d_z ≈ 0.54). People liked it more, felt more in control, and thought it handled risks better.
- Trust: PR increased trust modestly (d_z ≈ 0.38), especially on reliability and trustworthiness. Confidence also improved, but the strongest gains were in seeing the system as consistently dependable.
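The d_z values above are within-subject Cohen's d effect sizes: the mean of each participant's PR-minus-NI difference divided by the standard deviation of those differences. A minimal sketch with made-up ratings (not the study's data):

```python
# Sketch of the within-subject effect size d_z used in the results above:
# mean of paired differences divided by their standard deviation.
# The five ratings per condition below are invented for illustration.
import statistics

def cohens_dz(pr_scores: list[float], ni_scores: list[float]) -> float:
    """Paired-samples effect size: mean(diff) / sd(diff)."""
    diffs = [p - n for p, n in zip(pr_scores, ni_scores)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

# Hypothetical perceived-speed ratings (1-7) from five participants:
dz = cohens_dz([6, 5, 6, 7, 5], [4, 4, 5, 5, 4])
```

A d_z around 0.2 is conventionally read as small, 0.5 as medium, and 0.8 as large, which is why the ≈1.01 speed effect counts as large and the ≈0.38 trust effect as small-to-medium.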
Surprising findings:
- More updates did not increase mental workload; they actually reduced frustration. This suggests that short, meaningful updates fight the “black box” feeling.
- Driving vs. stationary context didn’t change the overall pattern dramatically. The benefits of PR showed up in both, with a small trend toward higher workload while driving, as expected, but not enough to flip the story.
- People more familiar with LLMs saw bigger jumps in trust and UX with PR, possibly because the step-by-step style matches how they imagine LLMs think (like chain-of-thought).
Concrete meaning of the numbers:
- Large improvement in perceived speed means many users genuinely felt the assistant was snappier with the same total time, because visible progress beats silence.
- Medium improvement in UX means it wasn’t just faster-feeling; it felt better designed, more dependable, and safer.
- Small-to-medium improvements in trust mean users were more willing to rely on PR-style assistants.
- The workload dip (mostly lower frustration) means the same information is easier to digest when it arrives in small chunks.
Bottom line: If your assistant does multi-step work that takes tens of seconds, short content-bearing updates make people feel calmer, quicker, and more confident than a single end-of-task speech.
05 Discussion & Limitations
Limitations:
- The participants were from one automotive company. They varied in age and tech familiarity, but results should still be confirmed with broader groups.
- The driving was a controlled lane-keeping task in a simulator. It increased attention demand consistently but can’t match the twists and stresses of real traffic.
- The update timing (about every 5 seconds) and content were fixed for fairness. Real assistants might need smarter, context-aware pacing.
- Feedback was always audio plus visuals. Other mixes (like haptics or visual-only) weren’t tested and could change preferences.
Required resources to use this approach:
- An agentic assistant that can plan steps and expose them cleanly (e.g., via tool-calling hooks).
- A feedback scheduler that decides what to say and when, with guardrails to avoid talking over more important sounds.
- Multimodal UI pieces: speech for eyes-on-road, brief visuals, plus a quick mute/expand control for users.
- Simple logic for adaptive verbosity (start high, shrink with proven reliability, expand for ambiguity/high stakes).
When not to use (or use carefully):
- Extremely long tasks (many minutes) where constant updates would be overwhelming; here, background mode with occasional checkpoints is better.
- Cases where the assistant’s channel fights the main task (e.g., audio updates during intense conversation or critical listening tasks). Let users mute or switch channels.
- Very low-stakes, super-short tasks where an update would cost more attention than it saves.
Open questions:
- How to detect ambiguity and stakes reliably so the system knows when to expand details?
- How to estimate “demonstrated reliability” cleanly from interaction history and tie it to verbosity settings?
- What’s the best mix of audio, visual, and haptic cues for different driving and social scenarios?
- Where is the time boundary between “keep me engaged with updates” and “just work silently in the background” for multi-minute agents?
06 Conclusion & Future Work
Three-sentence summary: This study shows that during long, multi-step tasks, in-car assistants that share short, meaningful progress updates feel faster, more trustworthy, and easier to use than assistants that stay silent until the end. These updates also lower frustration by spreading information into smaller bites. People prefer an adaptive style: start transparent to build trust, get briefer as reliability is proven, and re-expand for new, unclear, or high-stakes tasks, with simple controls to mute or see more.
Main achievement: Clear, controlled evidence that content-bearing intermediate feedback improves perceived speed, user experience, trust, and even reduces task load compared to final-only responses, across both single-task and dual-task settings.
Future directions: Build real-time policies that pace updates smartly, estimate reliability from user history, detect ambiguity and stakes, and choose the best channel mix (audio/visual/haptic) for each moment. Also, study transitions for multi-minute agents where background processing makes more sense than continuous narration.
Why remember this: The way an agent talks while it works is as important as the final answer. Small, well-timed, meaningful updates make long tasks feel shorter, safer, and more trustworthy, especially when your attention is precious, like in a car.
Practical Applications
- Design in-car voice assistants to show a short plan and give mini-results about every 5 seconds during multi-step tasks.
- Use content-bearing updates (e.g., “Step 2/4: Extracting address”) instead of vague fillers like “Working on it.”
- Provide a big mute/brief/expand control so users can quickly adjust verbosity during music, conversations, or difficult driving.
- Start with high transparency for new users, then automatically shorten updates as the system demonstrates reliability.
- Detect high-stakes or ambiguous actions (e.g., messaging, payments) and increase confirmations and detail on those steps.
- Coordinate audio and brief on-screen text so drivers can hear updates hands-free and glance later if needed.
- Cap update length and number: prefer a few crisp sentences per update rather than dense paragraphs.
- For very long jobs (minutes), switch to background mode with periodic checkpoints or a visual timeline instead of constant narration.
- Log user accept/interrupt/correction patterns to estimate demonstrated reliability and tune verbosity over time.
- Offer a one-shot voice command like “Quieter,” “Details,” or “Summarize” to instantly override the current policy.
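The logging idea above (estimating demonstrated reliability from accept/interrupt/correction patterns) can be sketched as a simple accept rate. The event labels and the formula are assumptions for illustration, not a prescribed metric:

```python
# Hypothetical sketch of estimating "demonstrated reliability" from logged
# user reactions. Event labels ("accept", "interrupt", "correct") and the
# simple accept-rate formula are illustrative assumptions.

def demonstrated_reliability(events: list[str]) -> float:
    """Fraction of completed tasks the user accepted without intervening."""
    if not events:
        return 0.0    # no history yet: treat the system as unproven
    accepted = sum(1 for e in events if e == "accept")
    return accepted / len(events)

r = demonstrated_reliability(["accept", "accept", "interrupt", "accept"])
# r == 0.75; as the accept rate rises over time, updates can get briefer
```

A real system would likely weight recent events more heavily and track reliability per task type, but even this crude rate gives the verbosity policy something concrete to condition on.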