Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction
Key Summary
- This paper tests whether an AI can realistically guess what a specific social media user would comment when they see a new post.
- It introduces a task called Conditioned Comment Prediction (CCP) that compares an AI's predicted comment to the real comment a user actually wrote.
- Two ways to tell the AI about the user are tested: explicit conditioning (a written profile or "bio") and implicit conditioning (the user's real past comments).
- Across English, German, and Luxembourgish, using real behavioral history works better than using a written biography.
- Fine-tuning the models helps a lot with format and length, but in low-resource languages it can hurt true meaning-matching (form-content decoupling).
- After fine-tuning, giving a separate biography becomes mostly unnecessary because the model can figure out the user's style and stance from history alone (latent inference).
- Llama 3.1-8B was the most stable baseline, while other models often wrote too much unless fine-tuned.
- The task is hard: exact comment prediction is a "high-entropy" problem where many answers could fit, so scores are modest even when models improve.
- The paper provides practical guidelines: prefer real behavioral traces over made-up personas, and use fine-tuning mainly to fix structure and format, especially in English.
- Ethical risks are real (e.g., impersonation), so the authors restrict release and urge careful, responsible use.
Why This Research Matters
Better user simulation can improve research on how ideas spread, how communities react to news, and how platform designs might help or harm conversations. Policymakers can test what-if scenarios more safely if simulations reflect real person-level behavior. Product teams can design moderation and recommendation tools that respect authentic styles and stances. Educators and social scientists get clearer, validated methods to study discourse without over-trusting merely "plausible" text. However, because these tools can also be misused for impersonation or manipulation, the paper's validation focus and practical guardrails help keep progress responsible.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) Imagine you're trying to copy how your best friend texts. You don't just need to know they're 13 and like soccer. You need to see how they actually text (short jokes, lots of emojis, or long debates) so you can sound like them for real.
🥬 Filling (The Actual Concept: Large Language Models, LLMs)
- What it is: Large Language Models (LLMs) are computer programs trained to read and write human-like text.
- How it works (recipe):
- Read tons of text from the internet and books.
- Learn patterns of words, styles, and topics.
- When given a prompt, continue with likely words that fit style and meaning.
- Why it matters: Without LLMs, we can't even begin to simulate social media users' comments at scale.
🍞 Bottom Bread (Anchor) When you ask an LLM to write a comment to a news post, it can write a short, casual reply just like people do online.
🍞 Top Bread (Hook) You know how a diary shows what someone really thinks, not just what they say about themselves?
🥬 Filling (The Actual Concept: User History)
- What it is: User History is a collection of a person's past posts and replies.
- How it works:
- Gather a user's real past comments paired with the posts they replied to.
- Feed several of these examples to the model.
- Ask the model to predict the next reply for a new post.
- Why it matters: Without real examples, the model may write something that sounds okay but doesn't match the specific person.
🍞 Bottom Bread (Anchor) If a user always writes short, sarcastic replies, showing the model their past comments helps it keep the sarcasm and the shortness.
The World Before Before this work, many researchers used LLMs to create pretend online users. These pretend users were guided by short descriptions like "You are a conservative voter who prefers short, polite comments." This can make text that looks believable at a glance. But it was unclear if the model could truly match what a particular real person would say to a specific post. Most evaluations were about "plausibility" (does it look like a comment?) instead of "operational validity" (does it match the exact person's real behavior?).
The Problem As LLMs move from exploration to being "silicon subjects" in social science, we must check if they can predict how actual people respond to content. The real test is: can a model, given a person's past behavior, guess their next comment to a new post? If not, any simulation of crowds, debates, or policy reactions may rest on shaky ground.
Failed Attempts
- Explicit-only prompting: Telling the model, "You are Alex, age 30, who likes topic X," often creates long, neat paragraphs that fit the description, but not the platform's style (too long, too formal) and not the specific user's stance.
- General plausibility checks: Human raters might say, "This looks fine," but that misses whether it matches what that real user actually wrote.
The Gap We need a direct, person-level test. That's what this paper creates: a benchmark task called Conditioned Comment Prediction (CCP), where the model predicts a user's reply to a given stimulus (post/article), and we compare it to the real reply the user made. This measures whether the model captures the individual's style and meaning, not just general plausibility.
Real Stakes
- Research accuracy: If models can't match actual people, then studies on public opinion, discourse, or interventions could be misleading.
- Platform tools: Moderation, recommendation, and design experiments using simulated users need reliable behavior, not just nice-looking text.
- Policy and safety: Good validation prevents overconfidence in simulations that might affect real decisions.
- Ethics: Stronger simulation powers can be misused for impersonation or manipulation, so getting measurement and guardrails right matters.
🍞 Top Bread (Hook) Imagine you're asked to "act like a gamer" versus being shown 10 chat logs from that gamer. Which helps you copy them better?
🥬 Filling (The Actual Concept: Explicit vs. Implicit Conditioning)
- What it is: Explicit conditioning gives the model a written profile (bio); implicit conditioning gives the model real past examples.
- How it works:
- Explicit: Provide a concise biography with traits, style, and beliefs.
- Implicit: Provide up to 30 real stimulus–response pairs.
- Ask the model to create a new reply for a new stimulus.
- Why it matters: Without the right conditioning, the model may ignore platform style, over-explain, or miss the user's true stance.
🍞 Bottom Bread (Anchor) A bio may say "You write short replies," but showing 10 real short replies makes the model reliably short and on-point.
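The two conditioning modes above can be sketched as two prompt builders. This is a minimal illustration; the exact prompt wording and the `Post:`/`Reply:` labels are assumptions for this sketch, not the paper's actual templates.

```python
def explicit_prompt(biography, stimulus):
    # Explicit conditioning: describe the user in words, then ask for a reply.
    return f"{biography}\n\nNow write your reply to this post:\n{stimulus}\nReply:"

def implicit_prompt(history_pairs, stimulus, max_pairs=30):
    # Implicit conditioning: show up to 30 real stimulus-reply pairs as examples,
    # then present the new stimulus for the model to continue.
    shots = "\n".join(f"Post: {s}\nReply: {r}" for s, r in history_pairs[-max_pairs:])
    return f"{shots}\nPost: {stimulus}\nReply:"
```

With implicit conditioning the model sees the user's actual habits (length, slang, stance) instead of a second-hand description of them.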
Finally, the paper tests three 8B-parameter models (Llama 3.1, Qwen3, Ministral) across English (high-resource), German (medium), and Luxembourgish (low-resource). It compares prompting strategies and fine-tuning to see what truly improves person-level prediction.
02 Core Idea
🍞 Top Bread (Hook) You know how a music teacher can tell a pianist's style just by listening to a few songs? After that, they can guess how the pianist might play a new tune.
🥬 Filling (The Actual Concept: Conditioned Comment Prediction, CCP)
- What it is: CCP is a task where an AI tries to predict how a specific user will comment on a new post, using either a bio, a history of past comments, or both.
- How it works:
- Give the model a stimulus (a post or news article).
- Provide conditioning: a generated biography (explicit), a set of past comments (implicit), both, or none (control).
- Ask the model to write the userās next comment.
- Compare the modelās comment to the real one the user wrote.
- Why it matters: Without CCP, we can't judge if models truly simulate individuals; we'd only know if they write "plausible" comments, not "the user's" comment.
🍞 Bottom Bread (Anchor) If a real user replied "LOL, sure…" to a climate headline, CCP checks whether the AI also produces something short and skeptical for that user.
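The CCP loop (condition, generate, compare) can be sketched in a few lines. Here `generate` stands in for any LLM call, and the word-overlap score is only a toy stand-in for the paper's BLEU/ROUGE/embedding metrics; the field names are hypothetical.

```python
def ccp_score(test_users, generate, build_prompt):
    """Average a toy unigram-recall score over held-out (stimulus, reply) targets."""
    scores = []
    for user in test_users:
        stimulus, real_reply = user["target"]          # the held-out last pair
        prompt = build_prompt(user["history"], stimulus)
        predicted = generate(prompt)                   # any LLM call
        pred = set(predicted.lower().split())
        ref = set(real_reply.lower().split())
        scores.append(len(pred & ref) / max(len(ref), 1))
    return sum(scores) / len(scores)
```

The point of the structure is that the comparison target is the real user's reply, not a generic plausibility judgment.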
The "Aha!" Moment in One Sentence The key insight is that conditioning on real behavioral history lets models perform latent inference (quietly inferring the user's style and stance) so well that explicit biographies become mostly redundant after fine-tuning.
Three Analogies
- Detective: Instead of reading a suspect's self-description, a detective watches their actions and infers who they are (implicit beats explicit).
- Chef: You don't need the recipe's story if you already tasted 10 of the chef's dishes; you can predict their flavor choices.
- Coach: Film study (examples) reveals an athlete's real habits better than their bio; you can predict what they'll do in the next play.
Before vs. After
- Before: People trusted neat personas in prompts and judged success by text that looked like a comment.
- After: We have a grounded test (CCP) and clear evidence: real behavior beats bios; fine-tuning fixes structure; but in low-resource languages it may harm meaning-match.
Why It Works (Intuition, not equations)
- Implicit conditioning gives direct evidence: the exact way this person responds to certain topics and tones. The model can align length, slang, stance, and rhythm naturally.
- Fine-tuning teaches the model the platform's rules (short, chatty, reply-focused) so it doesn't drift into long essays.
- However, when the model's language skills are weak (low-resource languages), fine-tuning may polish surface form (length, n-grams) without truly understanding the meaning, causing a form-content split.
🍞 Top Bread (Hook) Imagine learning about someone by reading a short story about them versus scrolling through their actual messages.
🥬 Filling (The Actual Concept: Generated Biography, Explicit Conditioning)
- What it is: A short text profile that describes how a user writes, what they believe, and how they behave online.
- How it works:
- Use a large model to read up to 30 real comments.
- Summarize key traits: basics, language habits, worldview, behavior.
- Give this bio to the simulation model as instructions.
- Why it matters: In theory, it's easy and cheap. But without examples, models may get the platform style wrong (too long, too formal) or miss fine-grained habits.
🍞 Bottom Bread (Anchor) A bio might say "You're brief and ironic," but without examples, the model could still write a long paragraph that doesn't feel like a short, snappy reply.
🍞 Top Bread (Hook) Think of learning to dribble by watching game clips; no one needs to explain it if you've seen enough plays.
🥬 Filling (The Actual Concept: User History, Implicit Conditioning)
- What it is: Feeding the model real pairs of posts and the user's actual replies.
- How it works:
- Show up to 30 examples of how the user reacted to different posts.
- Ask for a new reply to a new post.
- The model infers patterns (stance, style, length) on its own.
- Why it matters: The model learns true habits from reality, not guesses from a summary.
🍞 Bottom Bread (Anchor) If the user often uses "LOL" and emojis, the model picks that up and keeps it in the new reply.
🍞 Top Bread (Hook) Picture a friend who can figure out your mood just from how you text, without you saying it.
🥬 Filling (The Actual Concept: Latent Inference)
- What it is: The model quietly figures out hidden traits (style, stance) from observed behavior, not from labels.
- How it works:
- See many examples of how one person responds.
- Spot stable patterns (shortness, humor, topics that trigger agreement/disagreement).
- Use those patterns to guide the next reply.
- Why it matters: It makes extra biographies unnecessary after fine-tuning, saving effort and often improving fidelity.
🍞 Bottom Bread (Anchor) After reading 20 of your replies, the model knows you prefer two-sentence responses with a wink emoji, even if no bio ever said so.
🍞 Top Bread (Hook) Imagine a teacher who helps a student practice exactly the kind of test they'll take.
🥬 Filling (The Actual Concept: Supervised Fine-Tuning, SFT)
- What it is: A way to train a model on many example inputs and the correct outputs, so it learns the task format and style.
- How it works:
- Collect many stimulus–reply examples.
- Train the model to map inputs to the right kind of outputs.
- Adjust the model's behavior so it writes the right length and tone for platform replies.
- Why it matters: Without SFT, models can be too long or formal. With SFT, they stabilize and follow the rules of the platform reply format.
🍞 Bottom Bread (Anchor) Before SFT, the model might write a mini-essay; after SFT, it writes a short, casual comment like a real reply on X or a news site.
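One common way to package such training data is a chat-style record per held-out reply: history pairs as prior turns, the target reply as the final assistant message. This is a sketch modeled on widely used SFT formats, not the paper's exact schema; the function name and JSON layout are assumptions.

```python
import json

def to_sft_example(history, stimulus, target_reply):
    """Build one SFT training record from a user's history and target reply."""
    messages = []
    for past_stimulus, past_reply in history:
        messages.append({"role": "user", "content": past_stimulus})
        messages.append({"role": "assistant", "content": past_reply})
    messages.append({"role": "user", "content": stimulus})
    # The real reply is the supervision label the model is trained to produce.
    messages.append({"role": "assistant", "content": target_reply})
    return json.dumps({"messages": messages})
```

Training on many such records is what pushes models toward the platform's short, casual reply format.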
03 Methodology
High-Level Recipe At a high level: Stimulus (a post/article) → [Provide conditioning: none, biography, history, or both] → Model generates predicted reply → Compare to the real reply using several metrics.
Step-by-Step
- Data Preparation
- What happens: The team collected real comments and replies from three places: English and German tweets/replies (X), and Luxembourgish news-site comments (RTL). They grouped each user's replies with the original post or article they answered.
- Why this step exists: We need matched pairs (what the user saw, what the user said) to teach and test prediction for a specific person.
- Example: For User A, keep the last 30 stimulus–reply pairs (if available). The 30th reply will be the hidden target to predict.
- Conditioning Strategies
- What happens: The model gets different types of context before predicting the target reply: a) Control: No biography, no history. b) Biography (Explicit): A profile generated from up to 30 past comments (but not including the target). c) History (Implicit): Up to 30 real stimulus–reply examples shown as chat turns. d) Bio + History: Both signals together.
- Why this step exists: Different strategies test whether the model learns better from descriptions or from real examples.
- Example: A History-only prompt might include 10 short past replies showing the user often says "LOL" and keeps it brief.
- Models and Training (SFT)
- What happens: Three open 8B models (Llama 3.1, Qwen3, Ministral) are tested. Each is also fine-tuned with one epoch using the same settings (e.g., sequence length, optimizer). This aligns models to the platform reply style.
- Why this step exists: To see if fine-tuning stabilizes structure and improves alignmentāand whether this depends on the language.
- Example: Before SFT, a model might produce 5 sentences for a reply; after SFT, it keeps to 1–2 sentences like real users.
- Test Split and Generation
- What happens: Users are split so that all of a user's data lives in train or test, never both. For each test user, the last reply is withheld. The model sees their prior replies (when applicable) and must predict the held-out one.
- Why this step exists: This avoids leaking the answer and fairly tests whether the model can generalize a userās behavior.
- Example: If a user has 12 past replies, the model sees up to 11 examples, then predicts reply 12.
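The user-level split and last-reply holdout can be sketched as below. The hash-bucket trick and the 15% test fraction are illustrative choices for this sketch, not the paper's method (the paper reports fixed counts of 3,800 train and 650 test users per language).

```python
import hashlib

def split_users(user_ids, test_fraction=0.15):
    """Assign each user wholly to train or test so no user's data leaks across."""
    train, test = [], []
    for uid in sorted(user_ids):
        # Deterministic bucket in [0, 100) derived from the user id.
        bucket = int(hashlib.sha256(uid.encode()).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else train).append(uid)
    return train, test

def holdout_target(pairs):
    """pairs: chronological (stimulus, reply) list; the last pair is the target."""
    return pairs[:-1], pairs[-1]
```

Splitting by user rather than by comment is what prevents the model from having seen the test user's style during training.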
- Metrics
- What happens: The modelās predicted reply is compared to the real one using:
- BLEU: Overlap of short word sequences (n-grams).
- ROUGE-1: Overlap of single words (unigrams).
- Embedding Distance: How close the meanings are using a sentence embedding model (lower is better).
- Length Ratio: How close the length is to the real reply length.
- Why this step exists: We need to check style/words (BLEU/ROUGE), meaning (embedding distance), and structure (length ratio) separately, because a reply can score well on one while failing another.
- Example with math for Length Ratio: The formula is Length Ratio = (tokens in predicted reply) / (tokens in real reply). For a numeric example, if the model generates 20 tokens and the real reply has 25 tokens, then Length Ratio = 20 / 25 = 0.8.
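The Length Ratio metric is simple enough to state directly in code; this sketch just mirrors the worked arithmetic (20 tokens vs. 25 tokens) from the text.

```python
def length_ratio(predicted_tokens, reference_tokens):
    """Length Ratio = len(predicted) / len(reference); 1.0 means a length match."""
    return len(predicted_tokens) / max(len(reference_tokens), 1)

# Worked example from the text: 20 generated tokens vs. 25 real tokens.
assert length_ratio(["tok"] * 20, ["tok"] * 25) == 0.8
```

A ratio far above 1.0 flags the "mini-essay" failure mode that SFT is meant to fix.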
- Repeated Runs
- What happens: Each model is run five times (same temperature and max tokens), and scores are averaged.
- Why this step exists: Text generation is a bit random; averaging captures stable performance, not lucky hits.
- Example: If BLEU scores are 0.07, 0.08, 0.09, 0.08, 0.08, the mean is 0.08.
Concrete Data Examples
- English X: 7.79M tweets; users chosen via political engagement cues; up to 3,200 tweets/replies per user available.
- German X: 3.38M tweets tied to political discourse.
- Luxembourgish RTL: 1.02M moderated comments from 21,427 users on news articles.
- Standardized filters: No URLs/images/GIFs; users need at least four interactions; keep only last 30 history items per user.
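The standardized filters listed above can be sketched as a small preprocessing function. The URL regex and tuple layout are assumptions for illustration; the thresholds (at least four interactions, last 30 items) come from the text.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def filter_user(pairs):
    """pairs: chronological (stimulus, reply) tuples for one user.
    Returns the cleaned history, or None if the user is too sparse to keep."""
    # Drop replies containing URLs (the text also excludes images/GIFs,
    # which would need media metadata not modeled in this sketch).
    clean = [(s, r) for s, r in pairs if not URL_PATTERN.search(r)]
    if len(clean) < 4:       # users need at least four interactions
        return None
    return clean[-30:]       # keep only the most recent 30 history items
```

Filtering before splitting keeps sparse users out of both train and test, since too few examples make latent inference unreliable.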
What Breaks Without Each Step
- Without user-level grouping: The model can't learn person-specific habits.
- Without clean splits: Data leakage fakes high scores.
- Without history: The model may write plausible, but generic, off-format replies.
- Without SFT: Many models get too wordy or formal, misfitting platform style.
- Without multiple metrics: You might think good word overlap means good meaning, which isn't always true.
The Secret Sauce
- Using real user histories lets the model perform latent inference, spotting hidden patterns (style, stance, length) without being told in a bio.
- SFT makes zero-history (cold start) behavior stable: even with no examples, the model writes in the right format and length.
- Careful multilingual testing reveals a crucial warning: in low-resource languages, SFT can polish form while hurting meaning alignment (form-content decoupling).
04 Experiments & Results
The Test
- Goal: Predict a specific user's held-out reply to a real post/article.
- Why these metrics: BLEU and ROUGE-1 check word/phrase overlap; Embedding Distance checks meaning similarity; Length Ratio ensures the model fits platform length norms.
- Setup: For each language (English, German, Luxembourgish), train on 3,800 users, test on 650 users. Generate five runs per model/condition.
The Competition
- Base models: Llama 3.1-8B-Instruct, Qwen3-8B, Ministral-8B-Instruct.
- Fine-tuned versions: Same models after one-epoch SFT on the CCP task.
- Conditioning: Control (none), Biography-only, History-only, Bio+History.
The Scoreboard (with context)
- English (high-resource): After fine-tuning with Bio+History, Llama 3.1 reached a BLEU of .083 and an embedding distance of .397. That's like moving from "sounds okay" to "noticeably closer to the real user," though still far from perfect, because many different comments could fit a post.
- German (mid-resource): BLEU improved (≈ 0.095), but embedding distance hovered around .50. Translation: better word overlap and phrasing, but no clear gain in meaning match.
- Luxembourgish (low-resource): BLEU stayed tiny (≈ 0.009), and embedding distance got worse after SFT (e.g., Llama 3.1: 0.579 → 0.605). This shows form-content decoupling: cleaner length and n-grams but weaker meaning match.
- Length Ratio: Base Qwen3 and Ministral often wrote far too long, but SFT pulled their length ratios near 1.0, making them usable format-wise.
Surprising Findings
- Biography-only can fail badly without SFT: Base models may write giant, off-format replies in English, showing that profiles alone don't teach platform style.
- After SFT, History-only and Bio+History are nearly the same: explicit bios become mostly redundant because the model can infer the user persona from behavior (latent inference).
- More history keeps helping: Up to 29 examples showed no clear saturation; each extra real example adds signal.
Putting Numbers in Plain Words
- English gains feel like upgrading from a "generic commenter" to "this specific user, but still imperfect." BLEU ~0.08 is modest, but beating biographical baselines and control shows that user history matters.
- German shows style polishing without clear meaning gains: the rhythm and phrases come out right, but not always the stance.
- Luxembourgish shows the warning sign: models can look right (shorter, plausible surface) but mean wrong (worse embedding distance). That's a big deal for trust.
Ablations That Clarify
- Conditioning study (English, Llama 3.1):
- Control + SFT: Better ROUGE-1 (≈ 0.207) from topic/style adaptation, but meaning (embedding .418) lags user-conditioned setups (≈ 0.399 with History).
- Biography-only (Base): Structural meltdown (.907). Biography-only (FT): Fixed length (≈ 0.935), still not beating History-only on meaning.
- History-only (FT) vs. Bio+History (FT): Nearly tied on meaning; bios are redundant after SFT.
- History length sensitivity: Base models need a few examples (≈ 5) to stop over-talking; FT models stay stable even at zero history (cold start solved), but more history still boosts scores.
Takeaways
- Use real behavioral histories. They carry the richest signal for both structure and stance.
- Use SFT to lock in reply format and length; this is especially important in English, but in low-resource languages, watch for meaning degradation.
- Don't rely on bios alone, especially without SFT; they can mislead structure and tone.
- Expect modest absolute scores because exact comment prediction is a many-possible-answers problem.
05 Discussion & Limitations
Limitations
- Low-resource fragility: In Luxembourgish, SFT improved surface form but hurt semantic alignment (higher embedding distance). That means you can get comments that look right but miss the user's true intent, a serious risk for research conclusions.
- Metric coverage: BLEU/ROUGE/embedding distance are helpful but can miss subtle persona failures (tone shifts, irony misuse). Human checks or task-specific metrics might be needed for sensitive studies.
- Model scale: Only 8B models were tested. Larger models might have stronger multilingual grounding and avoid form-content decoupling, but that remains to be proven.
- Language comparability: Datasets differ (platforms, topics, moderation), so cross-language score gaps may partly reflect dataset unpredictability, not only model capability.
Required Resources
- Data: Enough real user histories (ideally 5–30 examples per user) to let the model learn latent patterns.
- Compute: A single modern GPU (e.g., 48GB VRAM) was sufficient for SFT with careful settings (paged AdamW, 8-bit optimizers).
- Tooling: A reliable embedding model for semantic evaluation, plus code for clean user-level splits and formatting prompts.
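The semantic evaluation mentioned above boils down to cosine distance between sentence vectors (lower = closer meaning). A real setup would obtain the vectors from a trained sentence-embedding model; the stdlib-only sketch below shows just the distance computation, with placeholder vectors.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors (lower is closer)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Identical vectors -> distance 0; orthogonal vectors -> distance 1.
```

This is the quantity behind the paper's "embedding distance" scores (e.g., the 0.579 → 0.605 degradation in Luxembourgish), modulo the specific embedding model used.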
When NOT to Use
- Extremely sparse users (fewer than 4 interactions): The model can't infer stable habits; predictions become guesswork.
- High-stakes meaning tasks in low-resource languages without careful validation: You might get well-shaped but wrong-meaning outputs.
- Persona-only setups without SFT: Expect off-format, overly long replies that don't reflect real platform behavior.
Open Questions
- Scale effects: Can larger models fix form-content decoupling in low-resource languages?
- New objectives: Can training directly minimize semantic distance (e.g., DPO tailored to user histories) improve meaning match beyond cross-entropy?
- Richer actions: Can we simulate likes, shares, or multi-modal reactions reliably, not just text replies?
- Stability over time: How stable are simulated personas across many turns and topics? Do they drift?
- Data thresholds: Exactly how many examples per user are needed for reliable convergence in different languages and domains?
06 Conclusion & Future Work
Three-Sentence Summary This paper introduces Conditioned Comment Prediction (CCP), a direct test of whether LLMs can predict a specific user's reply to a new post. It shows that conditioning on real behavioral history (implicit) beats written biographies (explicit), and that fine-tuning stabilizes format and length; yet in low-resource languages, fine-tuning can polish form while harming meaning alignment. The result is a practical, validated path: prefer authentic digital traces, use SFT for structure, and be cautious in low-resource settings.
Main Achievement The paper provides an operational, person-level benchmark (CCP) and strong evidence that latent inference from real histories is the most reliable way to simulate individual online behavior, rendering explicit bios largely unnecessary after fine-tuning.
Future Directions
- Test bigger models to see if they fix form-content decoupling in low-resource languages.
- Explore training objectives that directly optimize semantic alignment to the target user.
- Expand beyond text replies to multi-modal and non-verbal actions, and study long-run persona stability.
Why Remember This It reframes user simulation from "plausible pretend personas" to "measurable person-level prediction." It also delivers a safety warning: good-looking outputs can still miss the user's true meaning, especially in low-resource languages. Finally, it gives clear, practical guidance (use real history, fine-tune for structure) and sets a standard for more trustworthy social simulations.
Practical Applications
- Design social media experiments that rely on user histories to forecast reactions to new policy announcements.
- Evaluate content moderation strategies by simulating realistic reply behaviors rather than generic comments.
- Prototype recommendation changes and estimate how different user segments might respond before live A/B tests.
- Support public opinion research by validating person-level prediction rather than only aggregate plausibility.
- Build safer synthetic datasets by anchoring on real behavior patterns while masking identities.
- Improve customer support bots to mirror a client's communication style learned from past interactions.
- Create robust classroom simulations to teach online discourse analysis with operational validity checks.
- Benchmark new LLMs for social tasks using CCP as a standard person-level test.
- Use SFT to stabilize reply format and length in domain-specific chat systems.
- Diagnose low-resource risks by testing for form-content decoupling before deployment.