Learning Personalized Agents from Human Feedback
Key Summary
- AI helpers often don’t know new users’ tastes and can’t keep up when those tastes change.
- This paper introduces PAHF, a simple loop where the agent asks before acting, acts using remembered preferences, then learns from corrections after acting.
- The key is explicit per-user memory plus two feedback channels: questions before action and corrections after action.
- Theory shows you need both: asking first avoids early mistakes, while post-action feedback fixes confident-but-wrong beliefs when preferences drift.
- They test PAHF in two worlds: a robot-like setting (bringing items, placing things) and online shopping with tricky, near-miss products.
- Across four phases (learn, test, drift, re-learn), PAHF beats no-memory and single-channel baselines in success and error reduction.
- Pre-action helps right away with new users; post-action enables fast adaptation after tastes change; together they work best.
- Memory is kept deliberately simple (notes + embeddings) to isolate the value of feedback; it works with SQLite or FAISS backends.
- Results: PAHF achieves top success rates across domains and adapts quickly after persona shifts.
- Takeaway: Keep the user in the loop, remember what matters, and update that memory when reality changes.
Why This Research Matters
Real people change their minds, and AI helpers must keep up to stay useful and trustworthy. PAHF turns everyday conversations—quick questions before acting and short corrections after acting—into a living, per-user memory. That means fewer wrong deliveries at home, smarter shopping choices online, and tools that feel tailored without constant micromanagement. Because the memory is explicit and simple, teams can audit, edit, and protect it, helping with transparency and privacy. Over time, this approach reduces user friction (fewer do-overs), increases satisfaction (the agent “gets me”), and enables safer, more aligned autonomy in both digital and physical settings.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you’re a new student in a school cafeteria. The lunch staff doesn’t know what you like yet, and even you might switch from loving pizza to loving salad next month. If the staff never asks you and never updates their notes about you, they’ll keep serving the wrong thing.
🥬 Filling (The Actual Concept — Feedback mechanisms):
- What it is: Feedback mechanisms are ways an AI can ask questions and receive corrections from people.
- How it works:
- Before acting, the AI can ask a clarifying question when it’s unsure.
- After acting, the AI can listen to corrections and learn from mistakes.
- It stores what it learns so future actions are better.
- Why it matters: Without feedback, the AI guesses. With feedback, the AI learns the user’s preferences and fixes errors quickly.
🍞 Bottom Bread (Anchor): A robot hears “Bring my favorite drink” but sees Coke and Sprite. It asks, “Which is your favorite?” The human says “Sprite,” so the robot brings Sprite and remembers that choice for next time.
🍞 Top Bread (Hook): You know how you might keep a sticky note on your desk that says, “Mom likes dark chocolate”? That small note saves you from guessing wrong.
🥬 Filling (The Actual Concept — Explicit memory):
- What it is: Explicit memory is a personal notes drawer for each user that the agent can read and update.
- How it works:
- The agent retrieves a few relevant notes for the current task.
- If new info appears, it either updates an old note or adds a new one.
- Notes are kept short, clear, and per-user.
- Why it matters: Without explicit memory, the agent forgets what you like and repeats mistakes. With it, the agent builds a living profile that grows over time.
🍞 Bottom Bread (Anchor): The agent stores “Kate’s favorite drink is Sprite” after a chat. Next time Kate says “Bring my favorite drink,” the agent instantly knows: grab Sprite.
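The notes drawer can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `MemoryStore` class, its method names, and the word-overlap ranking (a stand-in for real embeddings) are all assumptions made for the example.

```python
class MemoryStore:
    """A tiny per-user drawer of short preference notes (illustrative sketch)."""

    def __init__(self):
        self.notes = {}  # user_id -> list of short note strings

    def add(self, user_id, note):
        self.notes.setdefault(user_id, []).append(note)

    def retrieve(self, user_id, query, k=3):
        # Rank notes by word overlap with the query; a real system would
        # use embedding similarity instead of this toy scoring.
        def overlap(note):
            return len(set(note.lower().split()) & set(query.lower().split()))
        ranked = sorted(self.notes.get(user_id, []), key=overlap, reverse=True)
        return ranked[:k]

mem = MemoryStore()
mem.add("kate", "Kate's favorite drink is Sprite")
mem.add("kate", "Kate prefers snacks stored in the top drawer")
top = mem.retrieve("kate", "Bring my favorite drink", k=1)
```

With the two notes above, the drink note wins the overlap ranking, so the agent instantly knows: grab Sprite.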
🍞 Top Bread (Hook): Sometimes instructions are fuzzy, like “Put that snack away” when there are three snacks in front of you. You know there’s missing info.
🥬 Filling (The Actual Concept — Partial observability):
- What it is: Partial observability means the agent can’t see or know everything it needs to act correctly.
- How it works:
- The agent checks memory for clues.
- If it still lacks key details, it asks a targeted question.
- It uses the answer to act and to update memory for later.
- Why it matters: Without fixing missing info, the agent makes avoidable mistakes on new users or unclear tasks.
🍞 Bottom Bread (Anchor): “Put that drink in the drawer.” Two drinks are present. The agent asks, “Pepsi or orange soda?” and then picks the right one.
🍞 Top Bread (Hook): Your favorite cereal today might not be your favorite next semester. People change, and notes from last month can become wrong.
🥬 Filling (The Actual Concept — Preference drift):
- What it is: Preference drift is when a person’s likes and dislikes change over time or with context (like being sleepy or it being morning).
- How it works:
- The agent acts using what it believes you like.
- If you correct it (“I like coffee now”), the agent updates the note.
- Future actions follow the updated preference.
- Why it matters: Without handling drift, the agent stays confidently wrong, because old notes silently mislead it.
🍞 Bottom Bread (Anchor): Yesterday, “Avery’s favorite: herbal tea.” Today Avery says, “Now I prefer coffee.” The agent updates the note and brings coffee next time.
The world before: Lots of AI assistants depended on frozen, pre-collected logs or static user profiles. That worked only when users matched the past data and didn’t change. This caused three everyday problems: cold-start users (no history to go on), missed live corrections (most systems didn’t update on the fly), and evolving tastes (static profiles grew stale).
The problem: Real life is interactive and changing. An agent needs to (a) avoid early, obvious mistakes when it’s unsure, and (b) recover fast when its confident beliefs become outdated.
Failed attempts:
- Only reading old logs: can’t help with new users or new tastes.
- Fixed personas: too coarse and can’t capture fine-grained, evolving rules.
- One-off fine-tuning: expensive, slow, and still goes stale.
The gap: We needed a simple, reliable way for agents to keep the user in the loop as the main source of truth, continuously turning Q&A and corrections into an up-to-date, per-user memory.
Real stakes:
- At home: a helper robot that puts things where you like them today—not last winter.
- Online shopping: fewer wrong purchases and returns because the agent checks details you care about.
- Work tools: calendars, emails, and documents arranged to match your evolving habits without constant micromanagement.
- Accessibility: users with changing needs get timely adjustments.
- Trust: when the agent asks smart questions first and learns from mistakes later, people feel heard and respected.
02 Core Idea
🍞 Top Bread (Hook): Imagine a great waiter who first checks your order (“Still prefer lemonade?”), serves quickly, and if you say “Actually, switch to iced tea,” updates your regular order for next time.
🥬 Filling (The Actual Concept — PAHF, the key insight):
- What it is: PAHF (Personalized Agents from Human Feedback) is a simple three-step loop—ask before acting if needed, act using memory, then learn from corrections to update memory.
- How it works:
- Pre-action: Retrieve notes; if unsure, ask a clarifying question and write down the answer.
- Action: Combine the instruction, the scene, and the notes to choose and execute the best action.
- Post-action: If the user corrects you, update or replace the note so it’s right next time.
- Why it matters: Without pre-questions, you make avoidable early mistakes. Without post-corrections, you stay stuck on outdated beliefs. Together, you avoid both traps.
🍞 Bottom Bread (Anchor): “Bring my favorite drink.” Day 1: agent asks and learns “Sprite,” writes note. Day 30: user says “Now I like coffee,” agent updates note and serves coffee next time.
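The three-step loop can be sketched as one function. Everything here is an illustrative simplification: the function name, the plain-dictionary memory, and the two callbacks standing in for the user are assumptions, not the paper's code.

```python
def pahf_step(user, instruction, options, memory, ask_user, get_correction):
    # Pre-action: consult memory; if the stored note doesn't settle the
    # choice among the available options, ask one clarifying question.
    note = memory.get(user)
    if note not in options:
        note = ask_user(f"Which do you prefer: {', '.join(options)}?")
        memory[user] = note  # write the answer down immediately
    # Action: act on the (possibly just-learned) preference.
    action = note
    # Post-action: a correction overwrites the stale belief for next time.
    correction = get_correction(action)
    if correction:
        memory[user] = correction
    return action

memory = {}
# Day 1: empty memory, so the agent asks and learns "Sprite".
day1 = pahf_step("kate", "Bring my favorite drink", ["Coke", "Sprite"],
                 memory, ask_user=lambda q: "Sprite",
                 get_correction=lambda a: None)
# Day 30: the preference drifts; the correction updates the note.
day30 = pahf_step("kate", "Bring my favorite drink", ["Coke", "Sprite", "coffee"],
                  memory, ask_user=lambda q: "Sprite",
                  get_correction=lambda a: "coffee")
```

After the day-30 correction, the memory holds "coffee", so the next request is served correctly without asking again.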
🍞 Top Bread (Hook): Think of learning to skateboard with a coach: you get tips before trying a move and feedback after you fall. Both are important.
🥬 Filling (The Actual Concept — Dual feedback channels):
- What it is: Two ways for the agent to learn—pre-action clarifications (before) and post-action corrections (after).
- How it works:
- Before: ask a small, targeted question to fill missing info.
- After: accept concise corrections and overwrite stale beliefs.
- Keep both signals in explicit memory per user.
- Why it matters: Only-before misses drift; only-after causes lots of trial-and-error. Both together minimize total mistakes.
🍞 Bottom Bread (Anchor): Before: “Do you prefer webOS or Roku TV?” After: “I’ve changed my mind—now it’s webOS,” so the agent updates and stops recommending Roku.
🍞 Top Bread (Hook): You know how you sometimes need to ask one tiny question to avoid a big mistake?
🥬 Filling (The Actual Concept — Pre-Action Interaction):
- What it is: A brief, smart question before acting when the instruction is ambiguous or memory is empty.
- How it works:
- Check notes; if they don’t settle the choice, ask exactly what’s missing.
- Record the answer as a concise note.
- Use it immediately to act.
- Why it matters: This prevents “oops” moments on brand-new users or fuzzy tasks.
🍞 Bottom Bread (Anchor): “Buy headphones that fit my style.” Agent asks: “Bluetooth, wired, or RF?” User: “RF.” None of the offered options is RF, so the agent wisely chooses “Do not buy.”
🍞 Top Bread (Hook): Remember when you confidently grabbed the wrong lunch because your friend switched favorites and didn’t tell you?
🥬 Filling (The Actual Concept — Post-Action Feedback Integration):
- What it is: Learning from corrections after acting, especially when old beliefs made you confidently wrong.
- How it works:
- Listen for a targeted correction (“I now prefer coffee”).
- Detect if it’s real personalized info.
- Update or replace the old note so future actions change.
- Why it matters: Only this after-the-fact signal can fix silent, stale beliefs that feel “certain” to the agent.
🍞 Bottom Bread (Anchor): The agent brings tea because the note says “tea.” User says, “Now it’s coffee.” The note becomes “favorite drink: coffee,” and the next action matches.
🍞 Top Bread (Hook): A good tutor doesn’t just teach once; they keep adjusting as you grow.
🥬 Filling (The Actual Concept — Continual personalization):
- What it is: The agent keeps learning and updating your profile over many tasks and days.
- How it works:
- Start from zero for a new user.
- Add notes from clarifications and corrections.
- Revise notes whenever preferences drift.
- Why it matters: One-and-done training can’t keep up with real humans who change.
🍞 Bottom Bread (Anchor): Over weeks, the agent learns “Alex prefers recycling,” “likes herbal tea in the morning,” “coffee after workouts,” and keeps these rules fresh.
Before vs After: Previously, agents read static profiles and guessed; after PAHF, agents build and revise a living memory. Previously, agents stumbled with new users and changing tastes; after PAHF, they ask smartly up front and correct quickly afterward.
Why it works (intuition):
- Ambiguity is cheap to fix with one good question.
- Stale certainty is dangerous; only a post-action correction reveals that the old belief is wrong now.
- Combining both minimizes total errors over time.
Building blocks:
- Explicit, per-user notes store preferences.
- A tiny retrieval step pulls the right notes for the current task.
- A salience check filters out “thanks!” and keeps only real preferences.
- An update rule merges or replaces notes when drift happens.
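Two of those building blocks—the salience check and the update rule—can be sketched directly. The keyword cues below are a toy stand-in for the LLM judge the paper uses, and the colon-prefixed "topic: value" note format is an assumption made for the example.

```python
def is_salient(utterance):
    # Toy salience check: keep only utterances that state a preference.
    # A real system would ask an LLM judge; keyword cues are a stand-in.
    cues = ("prefer", "favorite", "like", "now it's", "don't want")
    return any(cue in utterance.lower() for cue in cues)

def update_notes(notes, new_note, topic_of):
    # Update rule: replace any note on the same topic (drift), else add.
    topic = topic_of(new_note)
    kept = [n for n in notes if topic_of(n) != topic]
    return kept + [new_note]

# Assumed "topic: value" note format for illustration.
topic = lambda note: note.split(":")[0].strip().lower()
notes = ["favorite drink: Sprite", "recycling: yes"]
notes = update_notes(notes, "favorite drink: coffee", topic)
```

Here "thanks!" carries no preference and is filtered out, while the coffee correction replaces the stale Sprite note instead of piling up next to it.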
03 Methodology
At a high level: Instruction + Observation + Memory → (A) Retrieve notes and maybe ask → (B) Choose and execute action → (C) Listen to corrections and update memory → Output (result + improved memory).
Step-by-step recipe with why each step exists and an example:
- Read the scene and the request
- What happens: The agent sees the user instruction (e.g., “Bring my favorite drink”) and the current environment (e.g., Coke and Sprite are on the counter) or product options (A/B/C/D in shopping).
- Why it exists: You can’t personalize if you don’t know what task and choices you have.
- Example: “Bring my favorite drink” with drinks Coke and Sprite available.
- Retrieve a few relevant notes (lightweight memory read)
- What happens: The agent searches the user’s note drawer for small, relevant preferences (top-k short notes). It then distills them into a tiny, task-focused summary.
- Why it exists: If a note already answers the question (“favorite drink: Coke”), there’s no need to ask.
- Example: It finds “Kate’s favorite drink is Coke” and summarizes that.
- Decide: Is this ambiguous?
- What happens: The agent checks if the memory and scene fully determine the right action. If not, it crafts one targeted question (budget: one question per task in the benchmarks).
- Why it exists: Ambiguity is the main driver of early mistakes; one good question avoids a wrong move.
- Example: No notes found for a new user? Ask: “Which drink is your favorite—Coke or Sprite?”
- Pre-Action Interaction (if needed)
- What happens: The agent asks the smallest clarifying question, receives the answer, and writes a concise note immediately (e.g., “favorite drink: Sprite”).
- Why it exists: This prevents trial-and-error when info is simply missing.
- Example: User: “Sprite.” Agent saves: “Favorite drink: Sprite.”
- Choose and execute the action
- What happens: The agent’s policy blends the instruction, scene, retrieved notes, and any pre-action answer to pick the final action.
- Why it exists: Personalization must translate into a concrete, correct action.
- Example: With “favorite drink: Sprite,” it picks up the Sprite.
- Observe the outcome; listen for corrections
- What happens: If the user says “Thanks,” nothing changes. If the user says “Actually, I like coffee now,” the agent treats this as post-action feedback.
- Why it exists: Only after acting can you discover that a once-correct belief is now wrong (preference drift).
- Example: “I now prefer coffee.”
- Filter and write the feedback (post-action update)
- What happens: A simple “salience detector” keeps only feedback with real preference info. Then the agent either updates a near-duplicate note (merge/replace) or adds a new one.
- Why it exists: Keeps memory clean and up to date; prevents clutter from polite chatter.
- Example: Replace “favorite drink: Sprite” with “favorite drink: coffee.”
- Repeat over time (continual personalization)
- What happens: Across many tasks, the memory grows and changes with the user.
- Why it exists: People change; the system must too.
- Example: Weeks later, the agent remembers coffee unless the user flips again.
The secret sauce: The dual-channel loop with explicit notes. Asking one good question early saves a lot of avoidable errors; accepting and writing down corrections stops you from staying confidently wrong. The memory is intentionally simple (short natural-language notes + embeddings, per user) so the improvement clearly comes from using both feedback channels, not from fancy memory tricks.
Under the hood details (kept simple):
- Memory backends: either a tiny SQLite table on disk or a FAISS vector index in memory. Both expose the same API: add note, retrieve top-k, detect near-duplicates (for update vs add), replace in place, and list by ID.
- Retrieval: Compute an embedding of the current task (instruction + scene or options) and find the most similar notes. Feed a short, cleaned summary into the agent’s context.
- Writing: Any time a user gives personalized info (before or after action), use a judge to decide if it’s meaningful. If yes, summarize it concisely, then either merge with a similar note or add a new one.
- Acting: The agent follows a reasoning-and-acting style (think: plan, then do), choosing either to ask, to select an option (A/B/C), or to abstain (D) in shopping if none match the user’s requirements.
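The backend API described above (add, retrieve top-k, detect near-duplicates, replace in place, list by ID) can be sketched in one small class. This is a hypothetical single-backend stand-in, not the paper's SQLite or FAISS code; `difflib` string similarity substitutes for real embedding similarity, and the 0.6 threshold is an invented example value.

```python
from difflib import SequenceMatcher

class NoteIndex:
    """Sketch of the shared memory-backend API: add, top-k retrieve,
    near-duplicate detection, in-place replace, list by ID."""

    def __init__(self, dup_threshold=0.6):
        self.notes = {}      # note_id -> text
        self.next_id = 0
        self.dup_threshold = dup_threshold

    @staticmethod
    def _sim(a, b):
        # Stdlib stand-in for cosine similarity over embeddings.
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def add(self, text):
        nid = self.next_id
        self.notes[nid] = text
        self.next_id += 1
        return nid

    def retrieve(self, query, k=3):
        ranked = sorted(self.notes.items(),
                        key=lambda kv: self._sim(kv[1], query), reverse=True)
        return [text for _, text in ranked[:k]]

    def near_duplicate(self, text):
        # ID of the closest existing note, if it is similar enough.
        best = max(self.notes.items(),
                   key=lambda kv: self._sim(kv[1], text), default=None)
        if best and self._sim(best[1], text) >= self.dup_threshold:
            return best[0]
        return None

    def write(self, text):
        # Update-or-add rule: replace a near-duplicate in place, else add.
        nid = self.near_duplicate(text)
        if nid is not None:
            self.notes[nid] = text
            return nid
        return self.add(text)

idx = NoteIndex()
idx.add("favorite drink: Sprite")
idx.write("favorite drink: coffee")  # replaces the near-duplicate note
```

Because `write` routes the coffee correction through near-duplicate detection, the Sprite note is overwritten in place rather than left to contradict the new one.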
End-to-end example (embodied):
- Input: “Bring my favorite drink,” scene has Coke and Sprite, memory empty.
- Retrieve: No notes → ambiguous.
- Pre-action: Ask “Which drink is your favorite?” → “Sprite.” Write note.
- Act: Pick up Sprite.
- Post-action: Later, user says “Now I prefer coffee.” Update note.
- Output: Correct action now and better actions in the future.
End-to-end example (shopping):
- Input: “Help me buy a TV I’d like,” options list panel type and platform.
- Retrieve: Knows user prefers OLED, but platform unknown → ambiguous.
- Pre-action: Ask “webOS, Roku, Fire TV, or other?” → “webOS.”
- Act: Pick the option with OLED + webOS; if none match all features, choose D (don’t buy).
- Post-action: If user later flips to “I now prefer QD-OLED,” replace the note.
04 Experiments & Results
🍞 Top Bread (Hook): Imagine testing a coach who both asks you smart questions before practice and listens to your feedback after games. You’d check if they learn your style quickly and if they adapt when your style changes.
🥬 Filling (The Actual Concept — The four-phase test):
- What it is: A step-by-step evaluation to see if agents can learn from scratch and then adapt when preferences drift.
- How it works:
- Phase 1 (Initial Learning): Start with empty memory; interact and learn.
- Phase 2 (Initial Personalization Test): Test what the agent learned—no feedback now.
- Phase 3 (Adaptation to Drift): Same kinds of tasks, but user preferences are changed; the agent must fix stale beliefs using feedback.
- Phase 4 (Adapted Personalization Test): Test again with the new, evolved preferences—no feedback.
- Why it matters: Separating “learn from zero” and “adapt after change” shows whether the agent handles both real-world needs.
🍞 Bottom Bread (Anchor): Day 1: learn your pizza topping. Day 10: test if they remember. Day 20: you switch toppings; coach must adapt. Day 30: test if they now get the new topping right.
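The four-phase protocol can be sketched as a tiny test harness. The `ToyAgent` below is an invented one-preference learner used only to make the harness runnable; the real benchmarks use full LLM agents and far richer tasks.

```python
class ToyAgent:
    """Invented toy: remembers one preference, learns only when feedback is on."""
    def __init__(self):
        self.belief = None
    def act(self, truth, feedback_on):
        success = self.belief == truth
        if feedback_on and not success:
            self.belief = truth  # the correction fixes the stale belief
        return success

def run_four_phases(agent, truths, drifted_truths):
    # Phases 1 and 3 allow feedback (learn / adapt); 2 and 4 freeze it (test).
    return {
        "p1_learn": [agent.act(t, feedback_on=True)  for t in truths],
        "p2_test":  [agent.act(t, feedback_on=False) for t in truths],
        "p3_adapt": [agent.act(t, feedback_on=True)  for t in drifted_truths],
        "p4_test":  [agent.act(t, feedback_on=False) for t in drifted_truths],
    }

res = run_four_phases(ToyAgent(), ["Sprite", "Sprite"], ["coffee", "coffee"])
```

Even this toy shows the pattern the benchmark is after: one mistake at the start of learning, a clean Phase 2, one mistake right after drift, and a clean Phase 4 once the belief is updated.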
Domains and difficulty:
- Embodied Manipulation: Everyday home/office tasks—choose the right item or put it in the right place—often with context rules (sleepy, cold weather, sharing, etc.). Personas are quirky to avoid generic guesses.
- Online Shopping: Choose A/B/C or D (don’t buy) using multi-feature rules. Options are “near-misses” that almost fit but have one disqualifying feature—forcing careful checking.
Baselines compared:
- No Memory: Acts without a notes drawer.
- Pre-Action Only: Can clarify before acting, but cannot update after mistakes.
- Post-Action Only: Never clarifies, only learns reactively after errors.
- PAHF: Both pre-action questions and post-action updates.
Metrics made meaningful:
- Success Rate (SR): Percent of tasks done right. Think: your report card grade.
- Feedback Frequency (FF): How often the agent needed to ask or got corrected. Think: how much coaching happened.
- Average Cumulative Personalization Error (ACPE): Average error over time within a phase. Think: how many wrong turns you made on the hike so far.
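SR and ACPE can be made concrete with a short sketch. The SR formula is straightforward; the exact ACPE formula isn't spelled out here, so the reading below (average of the running error count across a phase) is an assumption chosen to match the "wrong turns so far" intuition.

```python
def success_rate(outcomes):
    # SR: fraction of tasks done right, as a percentage.
    return 100 * sum(outcomes) / len(outcomes)

def acpe(outcomes):
    # Assumed ACPE: average the cumulative error count over the phase,
    # so early mistakes weigh more (they stay in every later prefix).
    errors, cumulative = 0, 0
    for ok in outcomes:
        errors += 0 if ok else 1
        cumulative += errors
    return cumulative / len(outcomes)

early_fail = [False, False, True, True]  # two mistakes up front
late_fail  = [True, True, False, False]  # same two mistakes, but late
```

Both sequences have SR = 50%, yet `early_fail` scores a worse ACPE than `late_fail`: under this reading, the metric rewards agents that get mistakes out of the way quickly, which is exactly why pre-action questions lower it.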
Key scoreboard highlights (selected):
- Embodied domain (Phase 2/Phase 4 SR):
- No memory: 32.3% / 44.8%
- Pre-action only: 54.1% / 35.7%
- Post-action only: 67.9% / 68.3%
- PAHF: 70.5% / 68.8% (best overall)
- Online shopping (Phase 2/Phase 4 SR):
- No memory: 27.8% / 27.0%
- Pre-action only: 34.4% / 56.0%
- Post-action only: 38.9% / 66.9%
- PAHF: 41.3% / 70.3% (best overall)
What the numbers mean:
- Phase 1 (learning): Agents that can ask before acting get a head start—fewer early mistakes (lower ACPE). It’s like getting hints on a tricky puzzle.
- Phase 3 (drift): Agents that can learn after mistakes adapt rapidly. Pre-action alone struggles here because it stops asking once it “thinks” it knows you; only post-action corrections can unstick stale beliefs.
- Across all phases: PAHF combines both strengths—fast starts and fast adaptation—scoring the highest success rates.
Surprising findings:
- Pre-action only can fall behind even the no-memory baseline after drift in embodied tasks. Why? It becomes confidently wrong and doesn’t ask again.
- Post-action only nearly matches PAHF after drift because it updates quickly—but it pays with lots of early, avoidable trial-and-error.
- Keeping memory simple (short notes + retrieval) was enough to show big gains—so the benefit truly comes from the dual feedback loop, not fancy memory tricks.
Conclusion from experiments: Asking small questions early prevents many mistakes; listening to corrections later fixes stale beliefs. The best agent does both and writes it all down per user.
05 Discussion & Limitations
Limitations (honest view):
- Noisy or inconsistent feedback: Real people sometimes change their minds mid-task, misspeak, or misremember. The current salience filter is basic; stronger conflict detection and “are you sure?” follow-ups would help.
- One-question budget: The benchmarks allow only one clarification per task. That keeps user effort low but can leave some tricky, multi-feature shopping choices under-specified.
- Simple memory: Short notes and embeddings work well and are easy to reproduce, but very long or highly structured histories might need more advanced, hierarchical memory.
- Reasoning ceiling: The shopping benchmark is intentionally hard with near-miss options. Even with PAHF, success rates leave room to grow.
Required resources:
- An LLM agent capable of reasoning-and-acting.
- A small per-user memory store (SQLite or FAISS) for notes and embeddings.
- Lightweight LLM prompts for salience detection and note merging.
- Optional: a simulator for training, or real user interaction loops.
When not to use it:
- If preferences are fixed and fully known upfront, simpler static profiles or fine-tuning may suffice.
- If you cannot collect any feedback (no questions, no corrections), PAHF’s main advantage disappears.
- If privacy rules forbid storing per-user notes and no privacy-preserving alternative is possible.
Open questions:
- Robustness to noisy or even adversarial feedback: How to detect conflicts, ask follow-ups, and maintain calibrated confidence?
- Multi-turn clarification: What is the right trade-off between more questions and user burden? Can the agent learn an optimal “question budget” policy?
- Richer memory: How do we plug in structured or hierarchical memory without losing simplicity? Can we compress long histories while keeping key preference rules?
- Generalization: How well does PAHF transfer across domains beyond robots and shopping (e.g., education, healthcare) while respecting safety and privacy?
06 Conclusion & Future Work
Three-sentence summary: PAHF is a simple, powerful loop that lets AI agents ask before acting, act using an explicit per-user memory, and learn from corrections after acting. Theory proves that asking first prevents early ambiguity errors while post-action learning is essential to fix confident-but-wrong beliefs after preferences drift. Experiments in robots and shopping show that only the combination—explicit memory plus both feedback channels—delivers the strongest, most reliable personalization over time.
Main achievement: Turning live interaction into the primary learning signal and showing, with clean theory and data, that dual feedback channels are complementary and necessary for continual personalization.
Future directions: Make feedback handling robust to noise and conflicts, learn multi-turn questioning policies that balance accuracy and user effort, integrate richer memory architectures, and extend to sensitive domains with strong privacy guarantees.
Why remember this: It’s a recipe that matches real life—people change. By keeping the user in the loop, remembering what matters, and updating that memory when reality changes, AI can feel less like a guesser and more like a thoughtful helper.
Practical Applications
- Smart home robots that place, fetch, and tidy according to your current habits and room layouts.
- Shopping assistants that ask one key question (e.g., platform, mount, sensor) and avoid mismatched buys by selecting “don’t buy” when nothing fits.
- Scheduling/email agents that learn your evolving meeting, notification, and writing style preferences over time.
- Health and wellness coaches that adjust recommendations as sleep, energy, or goals shift, with explicit note updates.
- Educational tutors that ask clarifying questions on learning goals and adapt to changing interests or difficulties.
- IT helpdesk bots that learn team-specific tool settings, shortcuts, and exception rules, updating when standards change.
- Travel planners that learn evolving seat, meal, and layover preferences, asking before booking and updating after feedback.
- Customer support triage agents that learn per-customer history and preferred resolutions, adapting to policy changes.
- Content recommenders that keep explicit, user-editable notes, reducing stale suggestions and improving trust.
- Personal finance assistants that learn your risk and category preferences and adapt as your budget priorities change.