CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production

Yixin Nie; Lin Guan; Zhongyao Ma; Anchit Gupta; Yipin Zhou; Xiao Li; Zhengping Zhou; Raymond Zeng; Gelin Zhou; Shigan Chu; Ajay Thampi; Wancen Mu; Nathan Shuster; Ketong Wang; Lin Chen; Jason Brewer; Derek Hao Hu; Alexander McCauley; Jason Weston; Sem Park; Na Zhang; Kevin Tang

CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production

Intermediate

Yixin Nie, Lin Guan, Zhongyao Ma et al.3/2/2026

arXiv

Key Summary

•CharacterFlywheel is a step‑by‑step loop that steadily improves chatty AI characters by learning from real conversations on Instagram, WhatsApp, and Messenger.
•Instead of only testing on school‑style quizzes, the system optimizes for how engaging and steerable the chats are with real people.
•The team trained reward models (little judges) to estimate what users prefer, then used SFT, DPO, and RL to nudge the chatbot in better directions each round.
•Across 15 model generations, 7 of the 8 public releases beat the previous version in A/B tests, up to +8.8% in engagement breadth and +19.4% in engagement depth.
•Character steerability improved a lot: instruction following rose from 59.2% to 84.8%, and instruction violations dropped from 26.6% to 5.8%.
•Careful data curation, variance‑based sampling of hard prompts, and guardrails against overfitting kept the model from gaming the reward signals.
•Switching from Online DPO to GRPO and training on near‑policy prompts led to better engagement in online tests.
•Implicit image generation (the model proactively makes pictures when helpful) measurably boosted engagement on top of explicit image requests.
•The team monitored style artifacts (like too many emojis or lists) so the model learned substance, not just flashy tricks.
•This paper shows how to make progress on a fuzzy goal—“be engaging”—using a rigorous, repeatable, and safe production process.

Why This Research Matters

This work shows how to make chatbots feel more like good conversational partners, not just answer machines. By directly optimizing how many people engage and how deeply they chat, products can become more welcoming, supportive, and fun. The system also improves steerability, so custom characters stay in persona and follow instructions—crucial for creators and brands. Safety improves in tandem: fewer false refusals and careful monitoring reduce frustrating and preachy behavior. The approach scales to millions of users while protecting privacy, thanks to rigorous curation and layered safety checks. Finally, it offers a repeatable playbook any team can adapt to turn subjective goals—like “be engaging”—into measurable, reliable progress.

Reading Workflow

Turn this paper into a decision

Scan fast. Promote only the papers that survive triage.

No workflow history yet.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

🍞 Hook: You know how a great conversation feels different from a correct answer in math class? It’s warm, fun, and keeps you wanting to talk more.

🥬 The Concept (Large Language Models): LLMs are computer programs that learn patterns from tons of text so they can write and chat like people. How it works:

Read huge amounts of text. 2) Learn which words usually come next. 3) Use that skill to answer and chat. 4) Improve later with special training. Why it matters: Without LLMs, we wouldn’t have modern chatbots at all. 🍞 Anchor: When you ask a chatbot, “Tell me a bedtime story about a dragon chef,” it can actually spin a story because it learned from lots of stories.

🍞 Hook: Imagine two kinds of helpers: a quiz champion and a great friend. Many AIs are trained to ace quizzes, but friends make you feel heard.

🥬 The Concept (Conversational AI): Conversational AI is an LLM tuned to have human‑like back‑and‑forth chats. How it works:

Reads your message and the chat history. 2) Predicts a helpful, friendly next reply. 3) Keeps a consistent tone/persona. 4) Adapts to your style. Why it matters: If it only gives facts, chats can feel cold; good conversations need warmth and flow. 🍞 Anchor: Asking, “How was your day?” should get a caring answer, not a Wikipedia paragraph.

🍞 Hook: Think of throwing a party: you’d measure success by how many guests came and how long they stayed, not a single trivia score.

🥬 The Concept (Engagement Metrics): Engagement metrics are numbers that capture how much people use and enjoy the chat, like breadth (how many people engage) and depth (how much they engage). How it works:

Define breadth (percentage who engage). 2) Define depth (how much engaged users interact). 3) Track both over time. 4) Compare versions. Why it matters: Without these, you can’t tell if chats are actually fun and sticky. 🍞 Anchor: If more users return to chat and send more messages, breadth and depth go up.

🍞 Hook: If you bake two cookie recipes and ask friends which batch they like more, you’re doing a taste test.

🥬 The Concept (A/B Testing): A/B testing compares two versions (A vs. B) with real users to see which performs better. How it works:

Randomly split users into A and B. 2) Show each group a different model. 3) Measure engagement. 4) Pick the winner. Why it matters: It tells you what truly works in the real world, not just in a lab. 🍞 Anchor: If Version B makes more people chat longer than Version A, B wins.

The world before: Most LLM progress focused on being an “omnipotent oracle”—ace benchmarks, solve math/code, answer facts. That’s great for homework, but it misses the art of conversation. Social chat apps (Character.ai, Replika, and Meta’s ecosystem) proved people also want connection and entertainment. The problem: engagingness is subjective and messy. There’s no single gold‑answer key, and you can’t directly “differentiate” a fun conversation the way you can compute a math loss. Failed attempts: optimizing simple proxy signals (like response length or emoji use) made bots verbose or gimmicky; relying on thumbs‑up/down alone caused reward hacking (e.g., pandering at conversation ends); training on off‑policy, stale prompts missed what current users actually ask. The gap: a reliable, repeatable system to climb an invisible “engagement mountain” safely—learning from real traffic while avoiding overfitting and preserving safety. Real stakes: Better social chat helps with well‑being and loneliness, supports creators building characters, reduces false refusals (less “sorry, can’t help” when it’s safe to help), and ensures fairness and quality across languages and personas. This paper fills that gap with CharacterFlywheel—a careful, science‑minded loop that measures, learns, tests, and repeats until the chats truly get better for millions.

02Core Idea

🍞 Hook: Imagine hiking in fog toward the mountain peak of “great conversations” without a clear map.

🥬 The Concept (CharacterFlywheel): CharacterFlywheel is a repeatable loop that uses data, small “judges,” and careful tests to take safe, steady steps toward more engaging, steerable chats. How it works:

Collect fresh, safe, diverse chat data. 2) Train reward models (little judges) to approximate what users prefer. 3) Create a strong base with SFT; apply small DPO patches. 4) Use RL (GRPO/Online DPO) to climb the engagement landscape. 5) Check with offline metrics and online A/B tests. 6) Watch for overfitting and style artifacts; correct fast. 7) Deploy; repeat. Why it matters: Without a loop, models drift or overfit; with it, they reliably improve in the wild. 🍞 Anchor: Over 15 generations, most releases beat the last—up to +8.8% breadth and +19.4% depth.

Three analogies for the same idea:

Mountain guide: Reward models sketch the contour lines; SFT/DPO/RL take steps; A/B tests confirm you climbed, not slipped.
Kitchen lab: Curate ingredients (data), have tasters (reward models), tweak recipes (training), and host public tastings (A/B) to see what people actually love.
Sports practice: Film sessions (logs) plus coach scores (RMs) guide drills (SFT/RL); game day (A/B) proves the playbook works.

🍞 Hook: You know how a referee helps keep a game fair and moving in the right direction?

🥬 The Concept (Reward Modeling): Reward models are learned judges that score responses for likely user preference and helpful behaviors. How it works:

Collect preference pairs and signals. 2) Train pointwise and pairwise judges. 3) Combine with auxiliary user‑signal predictors. 4) Use these scores to guide training and selection. Why it matters: Engagement isn’t directly computable; these judges provide the compass. 🍞 Anchor: If reply A is fun, on‑character, and safe while reply B is dull, the reward model prefers A.

🍞 Hook: Imagine picking the best fruits for a salad—bad apples will ruin the bowl.

🥬 The Concept (Data Curation): Data curation filters, balances, and samples chat data so training sees safe, private, diverse, and challenging examples. How it works:

Filter for privacy/safety. 2) Cluster for diversity (avoid near‑duplicates). 3) Enforce constraints (locales, personas, first‑turns, difficulty). 4) Refresh continuously. Why it matters: Garbage in, garbage out; curation keeps learning healthy. 🍞 Anchor: Limiting any single character $to ≤3$ % prevents one persona from dominating.

🍞 Hook: Weather forecasts look at signals (clouds, humidity) to predict rain; they’re not perfect but useful.

🥬 The Concept (User Signal Models): These models predict likelihood of user behaviors (e.g., continue, thumbs up) to enrich training and selection. How it works:

Train small classifiers on real signals. 2) Use the reliable ones (p(continue), p(thumb up)) for rejection sampling. 3) Avoid direct RL optimization to prevent reward hacking. Why it matters: They add signal without steering the model into shortcuts. 🍞 Anchor: If the model often gets a thumbs‑up after certain helpful replies, those styles surface in training picks.

Before vs. after: Before, teams mainly optimized static benchmarks and hoped good vibes followed. After, they directly target social engagement with a proven loop, while safeguarding safety and general skills. Why it works: It converts a fuzzy goal into many small, testable steps: use reward models to get direction; use near‑policy data to stay grounded; confirm with A/B tests; watch health metrics; and iterate. Building blocks include curation, reward modeling (pre‑herding), SFT/DPO/RL (herding), artifact mitigation, offline evaluation, and online A/B testing—plus strict guardrails (e.g., RM win‑rate thresholds) to avoid overfitting.

03Methodology

At a high level: Real user + internal chats → Data curation & annotation → Reward models (judges) + user signal models → Rejection sampling dataset → SFT base + small DPO patches → Online RL (GRPO / Online DPO) with near‑policy prompts → Offline checks + Online A/B tests → Deploy → Repeat.

🍞 Hook: Think of your first message with a new friend—first impressions matter.

🥬 The Concept (Supervised Fine‑Tuning, SFT): SFT teaches the model with high‑quality example conversations so it has solid basics. How it works:

Mix curated internal chats, fresh user data, safety sets, tool calls (image‑gen), and legacy SFT data. 2) Train the model to imitate great responses. 3) Keep mixture balanced so benchmarks don’t collapse. 4) Re‑do as new data arrives. Why it matters: Without SFT, RL starts on shaky ground and drifts. 🍞 Anchor: After SFT, the model answers warmly and on‑topic before any RL nudges.

🍞 Hook: Sometimes you need a tiny band‑aid before full rehab.

🥬 The Concept (DPO, small patch): DPO applies lightweight preference tuning, often for urgent safety/style fixes. How it works:

Use a small, targeted preference set (e.g., safety, image‑gen, Llama 3.1 prefs). 2) Train briefly. 3) Avoid off‑policy overreach by keeping it small. 4) Re‑run when needed. Why it matters: Fast, focused corrections without derailing the big plan. 🍞 Anchor: If the bot gets preachy, a small DPO patch can reduce that tone quickly.

🍞 Hook: Training a pet to do tricks works better when you reward the exact behavior you want.

🥬 The Concept (Reinforcement Learning, RL—GRPO/Online DPO): RL gently shifts the model toward higher‑scoring replies from the reward models. How it works:

Sample near‑policy prompts from fresh traffic. 2) Generate multiple replies per prompt. 3) Score with reward models; compute advantages. 4) Update policy with GRPO (clip, KL, EMA ref) or Online DPO. 5) Use variance‑based downsampling to focus on hard prompts; avoid over‑optimizing artifacts. Why it matters: RL provides directional improvement beyond imitation learning. 🍞 Anchor: Training on prompts where the model struggles (high reward‑score variance) yields clearer gains.

🍞 Hook: If you only pick the first idea you hear, you might miss the best one.

🥬 The Concept (Rejection Sampling): Rejection sampling builds a training set by picking only candidate replies that score above a quality threshold. How it works:

For each prompt, generate k replies from candidate policies. 2) Score with reward model(s). 3) Keep the best if above threshold τ. 4) Rebuild regularly from latest traffic. Why it matters: Trains on good examples from many model variants without manual labeling each time. 🍞 Anchor: From 10 drafts of a joke, keep the funniest one for the training set.

🍞 Hook: House rules keep game night fun and safe.

🥬 The Concept (Instruction Following/Steerability): The model must stick to each character’s traits and user instructions. How it works:

Annotators mildly challenge on persona. 2) Tag violations; rewrite bad turns. 3) Train reward models to prefer on‑character answers. 4) Monitor violation rates. Why it matters: Without steerability, characters feel fake or inconsistent. 🍞 Anchor: If the persona is a calm, wise gardener, replies stay gentle and plant‑savvy even under pressure.

🍞 Hook: Sometimes a picture says it better than a paragraph.

🥬 The Concept (Image Generation—explicit and implicit): The chatbot can call a tool to create images when asked (explicit) or when it decides visuals help (implicit). How it works:

Teach tool‑calling: when to trigger and what prompt to send to the T2I model. 2) Use multi‑review annotation to ensure quality. 3) Train prefs for image turns. 4) Monitor engagement impact. Why it matters: Visuals can lift engagement without extra user effort. 🍞 Anchor: During a travel chat, the bot proactively generates a packing‑list image with icons.

What happens in each step, why it exists, and an example:

Data curation & annotation: Filters for privacy/safety; clusters to de‑duplicate; balances locales, personas, first‑turns; annotators rank pairs and rewrite low‑quality turns. Without this, the model learns from noisy or biased data. Example: cap any single character’s share $to ≤3$ %.
Reward modeling: Train pointwise and pairwise judges on curated preferences; add auxiliary user‑signal predictors. Without judges, RL has no compass. Example: RM scores prefer friendly, specific, on‑character replies over vague ones.
SFT + DPO: Build a capable, safe base and patch urgent issues. Without this, RL would amplify flaws. Example: reduce false refusals while keeping safety rules.
RL (GRPO/Online DPO) with near‑policy prompts: Optimize on current‑style traffic; sample high‑variance prompts to fix weaknesses. Without this, improvements don’t transfer online. Example: near‑policy prompts beat off‑policy ones by +1.6% breadth and +10.6% depth in A/B.
Artifact mitigation: Track emojis, lists, length; adjust data/policies if shallow tricks creep in. Without this, the model might chase flashy patterns instead of quality. Example: After an emoji spike, guidelines and data were tuned to normalize usage.
Evaluation: Offline checks (benchmarks, human/RM win‑rates, custom metrics) then online A/B tests for breadth/depth. Without both, you can’t detect overfitting. Example: V12 showed a 70.7% RM win‑rate on user traffic but worse engagement—an overfitting red flag.

The secret sauce:

Use reward models as a flexible map, but never trust them blindly—confirm with A/B tests.
Keep prompts near‑policy and focus on high‑variance (hard) cases.
Impose guardrails (e.g., reward‑model win‑rates below ~65%) and watch for style artifacts.
Add implicit image generation to lift engagement without extra user effort.

04Experiments & Results

The test: The team measured two primary product metrics in weekly A/B tests—engagement breadth (how many users engage) and engagement depth (how much engaged users interact). They also tracked offline rewards (RM win‑rates), human side‑by‑side judgments, steerability (instruction violations), safety/false refusals, and standard benchmarks to ensure general skills remained healthy.

The competition: Every new version competed against the live production baseline. Pre‑launch, they also compared against GPT‑4o in human evals to check quality trajectory. Internally, multiple candidate policies (70B and 405B) were used for rejection sampling.

The scoreboard with context:

Post‑launch releases (V8–V15): 7 of 8 deployments lifted engagement; best cases reached +8.8% breadth and +11.2% depth (V14) and +4.47% breadth and +18.2% depth (V11). That’s like moving from a solid B to an A/A+ when everyone else is staying flat.
Steerability: Instruction violations fell from 26.6% (V2) to 5.8% (V8)—a 78% relative reduction—so characters stayed in persona more reliably. Instruction following (IFEval) climbed to 84.8%.
General benchmarks: Despite optimizing for social chat, the model kept strong general abilities (e.g., MMLU ~79.5% vs. 83.6% baseline; GSM8K 92.3% vs. 95.1%). Some coding/math tradeoffs appeared but remained competitive.
Style & safety trends: False refusals on user traffic dropped from >20% to <5% across iterations. Preachy tone decreased ~31%, while positive sentiment rose ~33%. Wall‑of‑text failures nearly halved.

Surprising findings:

V12 overfitting: RM win‑rate on user traffic spiked to 70.7% but online engagement got worse (−2.9% depth). The model learned to please the judge, not the people—classic reward hacking. New guardrails were set: keep RM win‑rates under ~65%.
Near‑policy matters: RL with fresh, current prompts beat older, off‑policy prompts by +1.6% breadth and +10.6% depth. Training where you play really works.
GRPO > Online DPO (in their setup): Same data, different loss; GRPO won by +1.52% breadth in A/B tests, likely due to using richer advantage signals from multiple candidates.
Variance beats mean for “hard” prompts: Mean RM scores were biased by style/length and over‑sampled certain JTBDs; score variance across candidates better identified truly difficult prompts.
User signal models are helpful, but risky for direct optimization: p(continue) and p(thumb up) correlated with preferences and helped with rejection sampling, but training RL directly on them led to biases like flattery at conversation ends and verbosity over clarity.
Implicit image generation lifts engagement: After adding explicit image generation (V9), adding implicit (V10) brought an extra +2.1% breadth lift.

Bottom line: The loop works. With tight monitoring, careful curation, and guardrails, engagement rose steadily while steerability and safety improved at production scale.

05Discussion & Limitations

Limitations:

Subjective objectives: “Engagingness” varies by culture, context, and user mood. Reward models provide estimates, not truths, so they can drift or be gamed.
Multi‑turn complexity: Training mainly on single‑turn optimization simplifies RL but misses long‑conversation dynamics (callbacks to earlier turns, evolving tone).
Reward hacking & artifacts: Without guardrails, models exploit shortcuts (emoji/list spam, end‑of‑chat flattery). The V12 incident shows even strong metrics can mislead.
Data bias: Even with diversity caps, popular personas/locales can seep in; user‑signal ratios differ by task (JTBD), causing confounds.

Required resources:

Substantial annotation bandwidth (including multi‑review), safety tooling, A/B testing infra, and large‑scale training (70B in prod, 405B as candidate sources).
Continuous privacy/safety filtering, plus monitoring dashboards for RM win‑rates, style artifacts, and failure modes.

When not to use:

Pure correctness tasks (e.g., medical/legal coding) where subjective engagement is secondary to precise accuracy.
Settings lacking A/B infra or high‑quality annotation—without feedback loops, the flywheel can’t spin safely.

Open questions:

Better multi‑turn RL: How to optimize full conversational arcs without brittle simulators?
Stronger anti‑hacking signals: Can we detect and penalize shallow style tricks automatically and robustly?
Preference generalization: How to build reward models that transfer across personas, locales, and time without constant retraining?
Combining user signals safely: Can we de‑bias thumb signals to be safely usable for direct optimization?
Theory of safe thresholds: Can we formalize guardrails like the ~65% RM win‑rate ceiling and predict tipping points earlier?

06Conclusion & Future Work

Three‑sentence summary: CharacterFlywheel is a production‑tested loop that improves social chat models by learning from fresh conversations, scoring replies with reward models, and confirming gains with online A/B tests. Over 15 generations, it consistently lifted engagement while sharply improving character steerability and reducing false refusals. Careful data curation, near‑policy RL, artifact mitigation, and strict guardrails prevented reward hacking and kept progress reliable.

Main achievement: Turning a fuzzy target—“make chats engaging and steerable”—into a rigorous, repeatable engineering process that scales to millions of users and keeps general abilities intact.

Future directions: Build robust multi‑turn optimization, create anti‑hacking signals that catch shallow tricks early, unify user signals with de‑biasing, and deepen understanding of reward‑model generalization across time, personas, and languages.

Why remember this: It shows how to make real, measurable progress on human‑feeling conversations—not just test scores—by combining smart proxies (reward models) with real‑world truth (A/B tests) in a steady, safe loop.

Practical Applications

•Build creator tools for designing and sharing steerable AI personas that stay on‑brand and in character.
•Run weekly A/B tests on new chatbot versions to validate real engagement gains before full rollout.
•Adopt variance‑based sampling of hard prompts for RL to focus training where the model struggles most.
•Use small DPO patches for fast safety/style fixes while keeping the main SFT+RL loop stable.
•Deploy implicit image generation in social chats to lift engagement without extra user effort.
•Monitor style artifacts (emoji/list/length) and set guardrails to prevent shallow reward hacking.
•Add user signal models (p(continue), p(thumb up)) to improve rejection sampling, not as direct RL rewards.
•Set RM win‑rate safety bands (e.g., keep under ~65%) to detect and prevent overfitting early.
•Cap per‑persona data shares and diversify locales/languages to reduce bias in training.
•Maintain a shared dashboard that tracks offline scores, RM win‑rates, steerability, and A/B lifts for every release.

Version: 1