Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction
Key Summary
- This paper tests whether an AI can realistically guess what a specific social media user would comment when they see a new post.
- It introduces a task called Conditioned Comment Prediction (CCP) that compares an AI's predicted comment to the real comment a user actually wrote.
- Two ways to tell the AI about the user are tested: explicit conditioning (a written profile or "bio") and implicit conditioning (the user's real past comments).
- Across English, German, and Luxembourgish, using real behavioral history works better than using a written biography.
- Fine-tuning the models helps a lot with format and length, but in low-resource languages it can hurt true meaning-matching (form-content decoupling).
- After fine-tuning, giving a separate biography becomes mostly unnecessary because the model can figure out the user's style and stance from history alone (latent inference).
- Llama 3.1-8B was the most stable baseline, while other models often wrote too much unless fine-tuned.
- The task is hard: exact comment prediction is a "high-entropy" problem where many answers could fit, so scores are modest even when models improve.
- The paper provides practical guidelines: prefer real behavioral traces over made-up personas, and use fine-tuning mainly to fix structure and format, especially in English.
- Ethical risks are real (e.g., impersonation), so the authors restrict release and urge careful, responsible use.
Why This Research Matters
Better user simulation can improve research on how ideas spread, how communities react to news, and how platform designs might help or harm conversations. Policymakers can test what-if scenarios more safely if simulations reflect real person-level behavior. Product teams can design moderation and recommendation tools that respect authentic styles and stances. Educators and social scientists get clearer, validated methods to study discourse without over-trusting merely "plausible" text. However, because these tools can also be misused for impersonation or manipulation, the paper's validation focus and practical guardrails help keep progress responsible.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) Imagine you're trying to copy how your best friend texts. You don't just need to know they're 13 and like soccer. You need to see how they actually text (short jokes, lots of emojis, or long debates) so you can sound like them for real.
🥬 Filling (The Actual Concept: Large Language Models, LLMs)
- What it is: Large Language Models (LLMs) are computer programs trained to read and write human-like text.
- How it works (recipe):
- Read tons of text from the internet and books.
- Learn patterns of words, styles, and topics.
- When given a prompt, continue with likely words that fit style and meaning.
- Why it matters: Without LLMs, we can't even begin to simulate social media users' comments at scale.
🍞 Bottom Bread (Anchor) When you ask an LLM to write a comment to a news post, it can write a short, casual reply just like people do online.
🍞 Top Bread (Hook) You know how a diary shows what someone really thinks, not just what they say about themselves?
🥬 Filling (The Actual Concept: User History)
- What it is: User History is a collection of a person's past posts and replies.
- How it works:
- Gather a user's real past comments paired with the posts they replied to.
- Feed several of these examples to the model.
- Ask the model to predict the next reply for a new post.
- Why it matters: Without real examples, the model may write something that sounds okay but doesn't match the specific person.
🍞 Bottom Bread (Anchor) If a user always writes short, sarcastic replies, showing the model their past comments helps it keep the sarcasm and the shortness.
The World Before Before this work, many researchers used LLMs to create pretend online users. These pretend users were guided by short descriptions like "You are a conservative voter who prefers short, polite comments." This can make text that looks believable at a glance. But it was unclear if the model could truly match what a particular real person would say to a specific post. Most evaluations were about "plausibility" (does it look like a comment?) instead of "operational validity" (does it match the exact person's real behavior?).
The Problem As LLMs move from exploration to being "silicon subjects" in social science, we must check if they can predict how actual people respond to content. The real test is: can a model, given a person's past behavior, guess their next comment to a new post? If not, any simulation of crowds, debates, or policy reactions may rest on shaky ground.
Failed Attempts
- Explicit-only prompting: Telling the model, "You are Alex, age 30, who likes topic X," often creates long, neat paragraphs that fit the description, but not the platform's style (too long, too formal) and not the specific user's stance.
- General plausibility checks: Human raters might say, "This looks fine," but that misses whether it matches what that real user actually wrote.
The Gap We need a direct, person-level test. That's what this paper creates: a benchmark task called Conditioned Comment Prediction (CCP), where the model predicts a user's reply to a given stimulus (post/article), and we compare it to the real reply the user made. This measures whether the model captures the individual's style and meaning, not just general plausibility.
Real Stakes
- Research accuracy: If models can't match actual people, then studies on public opinion, discourse, or interventions could be misleading.
- Platform tools: Moderation, recommendation, and design experiments using simulated users need reliable behavior, not just nice-looking text.
- Policy and safety: Good validation prevents overconfidence in simulations that might affect real decisions.
- Ethics: Stronger simulation powers can be misused for impersonation or manipulation, so getting measurement and guardrails right matters.
🍞 Top Bread (Hook) Imagine you're asked to "act like a gamer" versus being shown 10 chat logs from that gamer. Which helps you copy them better?
🥬 Filling (The Actual Concept: Explicit vs. Implicit Conditioning)
- What it is: Explicit conditioning gives the model a written profile (bio); implicit conditioning gives the model real past examples.
- How it works:
- Explicit: Provide a concise biography with traits, style, and beliefs.
- Implicit: Provide up to 30 real stimulus–response pairs.
- Ask the model to create a new reply for a new stimulus.
- Why it matters: Without the right conditioning, the model may ignore platform style, over-explain, or miss the user's true stance.
🍞 Bottom Bread (Anchor) A bio may say "You write short replies," but showing 10 real short replies makes the model reliably short and on-point.
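The two conditioning modes above can be sketched as two prompt builders. This is a minimal illustration; the exact prompt wording and the `Post:`/`Reply:` labels are assumptions for this sketch, not the paper's actual templates.

```python
def explicit_prompt(biography, stimulus):
    # Explicit conditioning: describe the user in words, then ask for a reply.
    return f"{biography}\n\nNow write your reply to this post:\n{stimulus}\nReply:"

def implicit_prompt(history_pairs, stimulus, max_pairs=30):
    # Implicit conditioning: show up to 30 real stimulus-reply pairs as examples,
    # then present the new stimulus for the model to continue.
    shots = "\n".join(f"Post: {s}\nReply: {r}" for s, r in history_pairs[-max_pairs:])
    return f"{shots}\nPost: {stimulus}\nReply:"
```

With implicit conditioning the model sees the user's actual habits (length, slang, stance) instead of a second-hand description of them.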
Finally, the paper tests three 8B-parameter models (Llama 3.1, Qwen3, Ministral) across English (high-resource), German (medium), and Luxembourgish (low-resource). It compares prompting strategies and fine-tuning to see what truly improves person-level prediction.
02 Core Idea
🍞 Top Bread (Hook) You know how a music teacher can tell a pianist's style just by listening to a few songs? After that, they can guess how the pianist might play a new tune.
🥬 Filling (The Actual Concept: Conditioned Comment Prediction, CCP)
- What it is: CCP is a task where an AI tries to predict how a specific user will comment on a new post, using either a bio, a history of past comments, or both.
- How it works:
- Give the model a stimulus (a post or news article).
- Provide conditioning: a generated biography (explicit), a set of past comments (implicit), both, or none (control).
- Ask the model to write the userās next comment.
- Compare the modelās comment to the real one the user wrote.
- Why it matters: Without CCP, we can't judge if models truly simulate individuals; we'd only know if they write "plausible" comments, not "the user's" comment.
🍞 Bottom Bread (Anchor) If a real user replied "LOL, sure…" to a climate headline, CCP checks whether the AI also produces something short and skeptical for that user.
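The CCP loop (condition, generate, compare) can be sketched in a few lines. Here `generate` stands in for any LLM call, and the word-overlap score is only a toy stand-in for the paper's BLEU/ROUGE/embedding metrics; the field names are hypothetical.

```python
def ccp_score(test_users, generate, build_prompt):
    """Average a toy unigram-recall score over held-out (stimulus, reply) targets."""
    scores = []
    for user in test_users:
        stimulus, real_reply = user["target"]          # the held-out last pair
        prompt = build_prompt(user["history"], stimulus)
        predicted = generate(prompt)                   # any LLM call
        pred = set(predicted.lower().split())
        ref = set(real_reply.lower().split())
        scores.append(len(pred & ref) / max(len(ref), 1))
    return sum(scores) / len(scores)
```

The point of the structure is that the comparison target is the real user's reply, not a generic plausibility judgment.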
The "Aha!" Moment in One Sentence The key insight is that conditioning on real behavioral history lets models perform latent inference (quietly inferring the user's style and stance) so well that explicit biographies become mostly redundant after fine-tuning.
Three Analogies
- Detective: Instead of reading a suspect's self-description, a detective watches their actions and infers who they are (implicit beats explicit).
- Chef: You don't need the recipe's story if you already tasted 10 of the chef's dishes; you can predict their flavor choices.
- Coach: Film study (examples) reveals an athlete's real habits better than their bio; you can predict what they'll do in the next play.
Before vs. After
- Before: People trusted neat personas in prompts and judged success by text that looked like a comment.
- After: We have a grounded test (CCP) and clear evidence: real behavior beats bios; fine-tuning fixes structure; but in low-resource languages it may harm meaning-match.
Why It Works (Intuition, not equations)
- Implicit conditioning gives direct evidence: the exact way this person responds to certain topics and tones. The model can align length, slang, stance, and rhythm naturally.
- Fine-tuning teaches the model the platform's rules (short, chatty, reply-focused) so it doesn't drift into long essays.
- However, when the model's language skills are weak (low-resource languages), fine-tuning may polish surface form (length, n-grams) without truly understanding the meaning, causing a form-content split.
🍞 Top Bread (Hook) Imagine learning about someone by reading a short story about them versus scrolling through their actual messages.
🥬 Filling (The Actual Concept: Generated Biography, Explicit Conditioning)
- What it is: A short text profile that describes how a user writes, what they believe, and how they behave online.
- How it works:
- Use a large model to read up to 30 real comments.
- Summarize key traits: basics, language habits, worldview, behavior.
- Give this bio to the simulation model as instructions.
- Why it matters: In theory, it's easy and cheap. But without examples, models may get the platform style wrong (too long, too formal) or miss fine-grained habits.
🍞 Bottom Bread (Anchor) A bio might say "You're brief and ironic," but without examples, the model could still write a long paragraph that doesn't feel like a short, snappy reply.
🍞 Top Bread (Hook) Think of learning to dribble by watching game clips; no one needs to explain it if you've seen enough plays.
🥬 Filling (The Actual Concept: User History, Implicit Conditioning)
- What it is: Feeding the model real pairs of posts and the user's actual replies.
- How it works:
- Show up to 30 examples of how the user reacted to different posts.
- Ask for a new reply to a new post.
- The model infers patterns (stance, style, length) on its own.
- Why it matters: The model learns true habits from reality, not guesses from a summary.
🍞 Bottom Bread (Anchor) If the user often uses "LOL" and emojis, the model picks that up and keeps it in the new reply.
🍞 Top Bread (Hook) Picture a friend who can figure out your mood just from how you text, without you saying it.
🥬 Filling (The Actual Concept: Latent Inference)
- What it is: The model quietly figures out hidden traits (style, stance) from observed behavior, not from labels.
- How it works:
- See many examples of how one person responds.
- Spot stable patterns (shortness, humor, topics that trigger agreement/disagreement).
- Use those patterns to guide the next reply.
- Why it matters: It makes extra biographies unnecessary after fine-tuning, saving effort and often improving fidelity.
🍞 Bottom Bread (Anchor) After reading 20 of your replies, the model knows you prefer two-sentence responses with a wink emoji, even if no bio ever said so.
🍞 Top Bread (Hook) Imagine a teacher who helps a student practice exactly the kind of test they'll take.
🥬 Filling (The Actual Concept: Supervised Fine-Tuning, SFT)
- What it is: A way to train a model on many example inputs and the correct outputs, so it learns the task format and style.
- How it works:
- Collect many stimulus–reply examples.
- Train the model to map inputs to the right kind of outputs.
- Adjust the model's behavior so it writes the right length and tone for platform replies.
- Why it matters: Without SFT, models can be too long or formal. With SFT, they stabilize and follow the rules of the platform reply format.
🍞 Bottom Bread (Anchor) Before SFT, the model might write a mini-essay; after SFT, it writes a short, casual comment like a real reply on X or a news site.
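One common way to package such training data is a chat-style record per held-out reply: history pairs as prior turns, the target reply as the final assistant message. This is a sketch modeled on widely used SFT formats, not the paper's exact schema; the function name and JSON layout are assumptions.

```python
import json

def to_sft_example(history, stimulus, target_reply):
    """Build one SFT training record from a user's history and target reply."""
    messages = []
    for past_stimulus, past_reply in history:
        messages.append({"role": "user", "content": past_stimulus})
        messages.append({"role": "assistant", "content": past_reply})
    messages.append({"role": "user", "content": stimulus})
    # The real reply is the supervision label the model is trained to produce.
    messages.append({"role": "assistant", "content": target_reply})
    return json.dumps({"messages": messages})
```

Training on many such records is what pushes models toward the platform's short, casual reply format.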
03 Methodology
High-Level Recipe At a high level: Stimulus (a post/article) → [Provide conditioning: none, biography, history, or both] → Model generates predicted reply → Compare to the real reply using several metrics.
Step-by-Step
- Data Preparation
- What happens: The team collected real comments and replies from three places: English and German tweets/replies (X), and Luxembourgish news-site comments (RTL). They grouped each user's replies with the original post or article they answered.
- Why this step exists: We need matched pairs (what the user saw, what the user said) to teach and test prediction for a specific person.
- Example: For User A, keep the last 30 stimulus–reply pairs (if available). The 30th reply will be the hidden target to predict.
- Conditioning Strategies
- What happens: The model gets different types of context before predicting the target reply: a) Control: No biography, no history. b) Biography (Explicit): A profile generated from up to 30 past comments (but not including the target). c) History (Implicit): Up to 30 real stimulus–reply examples shown as chat turns. d) Bio + History: Both signals together.
- Why this step exists: Different strategies test whether the model learns better from descriptions or from real examples.
- Example: A History-only prompt might include 10 short past replies showing the user often says "LOL" and keeps it brief.
- Models and Training (SFT)
- What happens: Three open 8B models (Llama 3.1, Qwen3, Ministral) are tested. Each is also fine-tuned with one epoch using the same settings (e.g., sequence length, optimizer). This aligns models to the platform reply style.
- Why this step exists: To see if fine-tuning stabilizes structure and improves alignmentāand whether this depends on the language.
- Example: Before SFT, a model might produce 5 sentences for a reply; after SFT, it keeps to 1–2 sentences like real users.
- Test Split and Generation
- What happens: Users are split so that all of a user's data lives in train or test, never both. For each test user, the last reply is withheld. The model sees their prior replies (when applicable) and must predict the held-out one.
- Why this step exists: This avoids leaking the answer and fairly tests whether the model can generalize a userās behavior.
- Example: If a user has 12 past replies, the model sees up to 11 examples, then predicts reply 12.
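The user-level split and last-reply holdout can be sketched as below. The hash-bucket trick and the 15% test fraction are illustrative choices for this sketch, not the paper's method (the paper reports fixed counts of 3,800 train and 650 test users per language).

```python
import hashlib

def split_users(user_ids, test_fraction=0.15):
    """Assign each user wholly to train or test so no user's data leaks across."""
    train, test = [], []
    for uid in sorted(user_ids):
        # Deterministic bucket in [0, 100) derived from the user id.
        bucket = int(hashlib.sha256(uid.encode()).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else train).append(uid)
    return train, test

def holdout_target(pairs):
    """pairs: chronological (stimulus, reply) list; the last pair is the target."""
    return pairs[:-1], pairs[-1]
```

Splitting by user rather than by comment is what prevents the model from having seen the test user's style during training.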
- Metrics
- What happens: The modelās predicted reply is compared to the real one using:
- BLEU: Overlap of short word sequences (n-grams).
- ROUGE-1: Overlap of single words (unigrams).
- Embedding Distance: How close the meanings are using a sentence embedding model (lower is better).
- Length Ratio: How close the length is to the real reply length.
- Why this step exists: We need to check style/words (BLEU/ROUGE), meaning (embedding distance), and structure (length ratio) separately, because a reply can score well on one while failing another.
- Example with math for Length Ratio: The formula is Length Ratio = (tokens in predicted reply) / (tokens in real reply). For a numeric example, if the model generates 20 tokens and the real reply has 25 tokens, then Length Ratio = 20 / 25 = 0.8.
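The Length Ratio metric is simple enough to state directly in code; this sketch just mirrors the worked arithmetic (20 tokens vs. 25 tokens) from the text.

```python
def length_ratio(predicted_tokens, reference_tokens):
    """Length Ratio = len(predicted) / len(reference); 1.0 means a length match."""
    return len(predicted_tokens) / max(len(reference_tokens), 1)

# Worked example from the text: 20 generated tokens vs. 25 real tokens.
assert length_ratio(["tok"] * 20, ["tok"] * 25) == 0.8
```

A ratio far above 1.0 flags the "mini-essay" failure mode that SFT is meant to fix.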
- Repeated Runs
- What happens: Each model is run five times (same temperature and max tokens), and scores are averaged.
- Why this step exists: Text generation is a bit random; averaging captures stable performance, not lucky hits.
- Example: If BLEU scores are 0.07, 0.08, 0.09, 0.08, 0.08, the mean is 0.08.
Concrete Data Examples
- English X: 7.79M tweets; users chosen via political engagement cues; up to 3,200 tweets/replies per user available.
- German X: 3.38M tweets tied to political discourse.
- Luxembourgish RTL: 1.02M moderated comments from 21,427 users on news articles.
- Standardized filters: No URLs/images/GIFs; users need at least four interactions; keep only last 30 history items per user.
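The standardized filters listed above can be sketched as a small preprocessing function. The URL regex and tuple layout are assumptions for illustration; the thresholds (at least four interactions, last 30 items) come from the text.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def filter_user(pairs):
    """pairs: chronological (stimulus, reply) tuples for one user.
    Returns the cleaned history, or None if the user is too sparse to keep."""
    # Drop replies containing URLs (the text also excludes images/GIFs,
    # which would need media metadata not modeled in this sketch).
    clean = [(s, r) for s, r in pairs if not URL_PATTERN.search(r)]
    if len(clean) < 4:       # users need at least four interactions
        return None
    return clean[-30:]       # keep only the most recent 30 history items
```

Filtering before splitting keeps sparse users out of both train and test, since too few examples make latent inference unreliable.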
What Breaks Without Each Step
- Without user-level grouping: The model can't learn person-specific habits.
- Without clean splits: Data leakage fakes high scores.
- Without history: The model may write plausible, but generic, off-format replies.
- Without SFT: Many models get too wordy or formal, misfitting platform style.
- Without multiple metrics: You might think good word overlap means good meaning, which isn't always true.
The Secret Sauce
- Using real user histories lets the model perform latent inference, spotting hidden patterns (style, stance, length) without being told in a bio.
- SFT makes zero-history (cold start) behavior stable: even with no examples, the model writes in the right format and length.
- Careful multilingual testing reveals a crucial warning: in low-resource languages, SFT can polish form while hurting meaning alignment (form-content decoupling).
04 Experiments & Results
The Test
- Goal: Predict a specific user's held-out reply to a real post/article.
- Why these metrics: BLEU and ROUGE-1 check word/phrase overlap; Embedding Distance checks meaning similarity; Length Ratio ensures the model fits platform length norms.
- Setup: For each language (English, German, Luxembourgish), train on 3,800 users, test on 650 users. Generate five runs per model/condition.
The Competition
- Base models: Llama 3.1-8B-Instruct, Qwen3-8B, Ministral-8B-Instruct.
- Fine-tuned versions: Same models after one-epoch SFT on the CCP task.
- Conditioning: Control (none), Biography-only, History-only, Bio+History.
The Scoreboard (with context)
- English (high-resource): After fine-tuning with Bio+History, Llama 3.1 reached a BLEU of .083 and an embedding distance of .397. That's like moving from "sounds okay" to "noticeably closer to the real user," though still far from perfect, because many different comments could fit a post.
- German (mid-resource): BLEU improved (≈ 0.095), but embedding distance hovered around .50. Translation: better word overlap and phrasing, but no clear gain in meaning match.
- Luxembourgish (low-resource): BLEU stayed tiny (≈ 0.009), and embedding distance got worse after SFT (e.g., Llama 3.1: 0.579 → 0.605). This shows form-content decoupling: cleaner length and n-grams but weaker meaning match.
- Length Ratio: Base Qwen3 and Ministral often wrote far too long, but SFT pulled their length ratios near 1.0, making them usable format-wise.
Surprising Findings
- Biography-only can fail badly without SFT: Base models may write giant, off-format replies in English, showing that profiles alone don't teach platform style.
- After SFT, History-only and Bio+History are nearly the same: explicit bios become mostly redundant because the model can infer the user persona from behavior (latent inference).
- More history keeps helping: Up to 29 examples showed no clear saturation; each extra real example adds signal.
Putting Numbers in Plain Words
- English gains feel like upgrading from a "generic commenter" to "this specific user, but still imperfect." BLEU ~0.08 is modest, but beating biographical baselines and control shows that user history matters.
- German shows style polishing without clear meaning gains: the rhythm and phrases come out right, but not always the stance.
- Luxembourgish shows the warning sign: models can look right (shorter, plausible surface) but mean wrong (worse embedding distance). That's a big deal for trust.
Ablations That Clarify
- Conditioning study (English, Llama 3.1):
- Control + SFT: Better ROUGE-1 (≈ 0.207) from topic/style adaptation, but meaning (embedding .418) lags user-conditioned setups (≈ 0.399 with History).
- Biography-only (Base): Structural meltdown (.907). Biography-only (FT): Fixed length (≈ 0.935), still not beating History-only on meaning.
- History-only (FT) vs. Bio+History (FT): Nearly tied on meaning; bios are redundant after SFT.
- History length sensitivity: Base models need a few examples (≈ 5) to stop over-talking; FT models stay stable even at zero history (cold start solved), but more history still boosts scores.
Takeaways
- Use real behavioral histories. They carry the richest signal for both structure and stance.
- Use SFT to lock in reply format and length; this is especially important in English, but in low-resource languages, watch for meaning degradation.
- Don't rely on bios alone, especially without SFT; they can mislead structure and tone.
- Expect modest absolute scores because exact comment prediction is a many-possible-answers problem.
05 Discussion & Limitations
Limitations
- Low-resource fragility: In Luxembourgish, SFT improved surface form but hurt semantic alignment (higher embedding distance). That means you can get comments that look right but miss the user's true intent, a serious risk for research conclusions.
- Metric coverage: BLEU/ROUGE/embedding distance are helpful but can miss subtle persona failures (tone shifts, irony misuse). Human checks or task-specific metrics might be needed for sensitive studies.
- Model scale: Only 8B models were tested. Larger models might have stronger multilingual grounding and avoid form-content decoupling, but that remains to be proven.
- Language comparability: Datasets differ (platforms, topics, moderation), so cross-language score gaps may partly reflect dataset unpredictability, not only model capability.
Required Resources
- Data: Enough real user histories (ideally 5–30 examples per user) to let the model learn latent patterns.
- Compute: A single modern GPU (e.g., 48GB VRAM) was sufficient for SFT with careful settings (paged AdamW, 8-bit optimizers).
- Tooling: A reliable embedding model for semantic evaluation, plus code for clean user-level splits and formatting prompts.
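The semantic evaluation mentioned above boils down to cosine distance between sentence vectors (lower = closer meaning). A real setup would obtain the vectors from a trained sentence-embedding model; the stdlib-only sketch below shows just the distance computation, with placeholder vectors.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors (lower is closer)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Identical vectors -> distance 0; orthogonal vectors -> distance 1.
```

This is the quantity behind the paper's "embedding distance" scores (e.g., the 0.579 → 0.605 degradation in Luxembourgish), modulo the specific embedding model used.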
When NOT to Use
- Extremely sparse users (fewer than 4 interactions): The model can't infer stable habits; predictions become guesswork.
- High-stakes meaning tasks in low-resource languages without careful validation: You might get well-shaped but wrong-meaning outputs.
- Persona-only setups without SFT: Expect off-format, overly long replies that don't reflect real platform behavior.
Open Questions
- Scale effects: Can larger models fix form-content decoupling in low-resource languages?
- New objectives: Can training directly minimize semantic distance (e.g., DPO tailored to user histories) improve meaning match beyond cross-entropy?
- Richer actions: Can we simulate likes, shares, or multi-modal reactions reliably, not just text replies?
- Stability over time: How stable are simulated personas across many turns and topics? Do they drift?
- Data thresholds: Exactly how many examples per user are needed for reliable convergence in different languages and domains?
06 Conclusion & Future Work
Three-Sentence Summary This paper introduces Conditioned Comment Prediction (CCP), a direct test of whether LLMs can predict a specific user's reply to a new post. It shows that conditioning on real behavioral history (implicit) beats written biographies (explicit), and that fine-tuning stabilizes format and length; yet in low-resource languages, fine-tuning can polish form while harming meaning alignment. The result is a practical, validated path: prefer authentic digital traces, use SFT for structure, and be cautious in low-resource settings.
Main Achievement The paper provides an operational, person-level benchmark (CCP) and strong evidence that latent inference from real histories is the most reliable way to simulate individual online behavior, rendering explicit bios largely unnecessary after fine-tuning.
Future Directions
- Test bigger models to see if they fix form-content decoupling in low-resource languages.
- Explore training objectives that directly optimize semantic alignment to the target user.
- Expand beyond text replies to multi-modal and non-verbal actions, and study long-run persona stability.
Why Remember This It reframes user simulation from "plausible pretend personas" to "measurable person-level prediction." It also delivers a safety warning: good-looking outputs can still miss the user's true meaning, especially in low-resource languages. Finally, it gives clear, practical guidance (use real history, fine-tune for structure) and sets a standard for more trustworthy social simulations.
Practical Applications
- Design social media experiments that rely on user histories to forecast reactions to new policy announcements.
- Evaluate content moderation strategies by simulating realistic reply behaviors rather than generic comments.
- Prototype recommendation changes and estimate how different user segments might respond before live A/B tests.
- Support public opinion research by validating person-level prediction rather than only aggregate plausibility.
- Build safer synthetic datasets by anchoring on real behavior patterns while masking identities.
- Improve customer support bots to mirror a client's communication style learned from past interactions.
- Create robust classroom simulations to teach online discourse analysis with operational validity checks.
- Benchmark new LLMs for social tasks using CCP as a standard person-level test.
- Use SFT to stabilize reply format and length in domain-specific chat systems.
- Diagnose low-resource risks by testing for form-content decoupling before deployment.