
How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

Intermediate
Jiahao Yuan, Yike Xu, Jinyong Wen et al. Ā· 2/11/2026
arXiv

Key Summary

  • Decoder-only language models can be great at making user profiles (embeddings), but how we let them look at the sequence—called attention masking—changes how smart those profiles are.
  • Causal attention (only looking left) is safe for generation but misses future clues; bidirectional attention (looking both ways) makes richer profiles but can train unstably if you switch too fast.
  • This paper compares causal, hybrid, and full bidirectional masks in one fair setup using contrastive learning and large real-world Alipay data.
  • The key idea is Gradient-Guided Soft Masking (GG-SM), a warm-up that slowly opens ā€œfuture visionā€ based on which tokens the gradients say are most important.
  • GG-SM makes the switch from causal to bidirectional smooth and stable, leading to better user embeddings on 9 practical tasks (average AUC 0.7745), beating popular general embedding models and strong user-modeling baselines.
  • Hybrid masks help a bit, but GG-SM with bidirectional attention works best for quality while keeping compatibility with decoder pretraining.
  • The method is parameter-efficient and especially good on tasks that need long-range context, like preference and sensitivity predictions.
  • Training curves show GG-SM converges more steadily than simple schedulers, meaning the model learns faster and more reliably.
  • The study highlights that not just the final mask, but the transition path to it, is crucial for decoder-only LLMs used as user encoders.

Why This Research Matters

Better user embeddings help apps understand people more fairly and accurately. This means smarter recommendations (fewer annoying suggestions), earlier detection of churn or loss of interest, and more helpful guidance at the right time. For businesses, it improves campaign relevance and reduces waste, since models can capture long-range habits instead of just recent clicks. For users, it enhances privacy-aware personalization by making more sense from less data noise. The key innovation—opening attention gradually and wisely—can be reused in other sequence problems, like healthcare timelines or learning progress, improving decisions that touch everyday life. Overall, smoother, smarter learning turns messy behavior logs into helpful, respectful experiences.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you’re trying to guess what your friend will do next at a theme park. If you only remember the last ride they took, you’ll miss patterns like ā€œshe always eats before a roller coaster.ā€ You need the whole story, not just the latest bit.

🄬 The Concept (User Representation Learning): It’s a way for computers to build a compact profile (an embedding) of a person from many clues—purchases, app taps, searches—so the system can understand and help them better.

  • How it works: (1) Collect a user’s many activities over time and across apps; (2) Turn each activity into numbers; (3) Mix them into one vector that captures habits and preferences; (4) Use that vector to make predictions (like which features they’ll use or what they’ll like next).
  • Why it matters: Without a good user embedding, apps make poor guesses, like recommending bus tickets to someone who always takes the subway.

šŸž Anchor: Your music app remembering you love upbeat pop on weekdays and calm jazz on Sundays is user representation learning at work.

šŸž Hook: You know how a storyteller makes up a tale one sentence at a time, always using what was just said to decide what comes next?

🄬 The Concept (Decoder-Only LLMs): These are language models that generate or read sequences one token at a time, mainly using what came before.

  • How it works: (1) Read the previous tokens; (2) Predict the next one; (3) Repeat; (4) Build understanding step by step.
  • Why it matters: They’re great for interactive systems where new signals come in continuously (like real-time app activity).

šŸž Anchor: Chatbots that keep answering as you keep typing rely on decoder-only LLMs to understand and respond on the fly.

šŸž Hook: Think of using a flashlight in a dark room. You choose where to shine it and where not to.

🄬 The Concept (Attention Masking): It tells the model where it’s allowed to ā€œlookā€ in the sequence and where it must ā€œlook away.ā€

  • How it works: (1) Mark which tokens each position can see; (2) Hide some tokens (mask them); (3) Let attention focus only on visible parts; (4) Compute an output using those focused views.
  • Why it matters: Without the right mask, the model might cheat during training or miss vital context during understanding.

šŸž Anchor: In a quiz, covering the answer with your hand keeps you honest; that’s attention masking preventing peeking at the future.

šŸž Hook: Reading a mystery book only page by page without peeking ahead keeps things fair, but you might miss big-picture hints.

🄬 The Concept (Causal Attention): The model can only use tokens to the left (past) and can’t see the future.

  • How it works: (1) For each word, allow attention only to earlier words; (2) Block all future words; (3) Predict or embed using past-only info; (4) Repeat for every position.
  • Why it matters: It matches how decoder-only models were pretrained and keeps generation stable—no future peeking.

šŸž Anchor: When composing a message, you can’t read the reply yet; you write using only what has been said.

The world before: People used encoder models (which see both past and future) to learn rich user embeddings, but those models need the full sequence upfront. That’s hard in real apps where data arrives little by little. Decoder-only LLMs can work in streams, but they were trained with causal attention, which might limit how well they combine far-away clues into one powerful user profile.

The problem: No one had carefully compared how different masks (causal, hybrid, and bidirectional) affect user embeddings in one fair setup. Also, suddenly switching a decoder-only model from ā€œpast-onlyā€ to ā€œsee everythingā€ can make training bumpy and the results worse.

Failed attempts: People tried (1) staying causal—safe but misses future context; (2) going fully bidirectional right away—rich context but unstable training; (3) hybrid tricks—better but complicated and not consistently strong; (4) simple schedulers that slowly open the mask—helps, but still not smooth enough.

The gap: We needed a stable, data-driven way to transition from causal to bidirectional so decoder-only LLMs could learn truly strong user embeddings without losing their pretraining benefits.

Real stakes: Better embeddings mean apps can predict churn, tailor recommendations, plan marketing fairly, and avoid irrelevant nudges. That makes experiences smoother for users and decisions smarter for businesses.

02Core Idea

šŸž Hook: Picture teaching a new puppy to cross a busy street. You don’t drop the leash and say ā€œrun anywhere.ā€ You loosen it bit by bit, paying attention to where the puppy feels safest.

🄬 The Concept (Bidirectional Attention): Letting the model see both past and future tokens to build a fuller understanding.

  • How it works: (1) For each position, allow attention to all tokens; (2) Weigh what matters most from both sides; (3) Form a richer, context-aware representation.
  • Why it matters: Without looking both ways, the model misses future clues—like a later search or purchase—that make a user’s pattern clearer.

šŸž Anchor: When guessing what a story is really about, you look at both earlier and later chapters, not just what came before one page.

šŸž Hook: Think of mixing hot and cold water to get the perfect bath temperature—you don’t turn the faucets from cold to hot instantly.

🄬 The Concept (Hybrid Attention): A blend—often bidirectional in the user-history block but causal after, so some tokens see more, others stay sequential.

  • How it works: (1) Pick a segment (like history) to see both ways; (2) Keep future tokens causal; (3) Let a helper (like an MLP or a global token) guide attention; (4) Train embeddings with this mixed view.
  • Why it matters: It tries to balance rich understanding and compatibility with generation, but tuning it can be tricky.

šŸž Anchor: It’s like letting a class discuss freely within their table group (history) while still following turn-taking rules for the rest of the room (future).

šŸž Hook: Imagine dimmer switches on ceiling lights. Instead of flipping everything to full brightness, you brighten the bulbs that help you see best first.

🄬 The Concept (Gradient-Guided Soft Masking, GG-SM): A warm-up that opens future attention softly, guided by gradient signals about which tokens matter most, before smoothly moving to full bidirectional.

  • How it works: (1) Start with causal attention; (2) Measure which future tokens push learning the most (using gradient strength); (3) Give those tokens partial visibility first (soft masks); (4) After warm-up, linearly open to full bidirectionality.
  • Why it matters: Jumping straight to full future visibility can confuse a model trained to look only left. GG-SM teaches it ā€œwhere to lookā€ and ā€œhow much,ā€ making training steady and embeddings stronger.

šŸž Anchor: When learning a new song, you first focus on the bars where you make the most mistakes. Then, once those are comfy, you play the whole song smoothly.

Aha! moment in one sentence: Don’t just decide which tokens the model can see—decide how to open that visibility over time using the model’s own gradients as a guide.
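A rough sketch of that idea, assuming a per-token gradient magnitude is available (this is illustrative, not the paper's exact formula):

```python
# Gradient-guided soft masking, sketched: the causal half stays fully
# visible, while each future token gets partial visibility proportional to
# its (normalized) gradient magnitude, capped at 1.0. `alpha` controls how
# far the warm-up has opened; grad_norms is a stand-in for real gradients.

def soft_mask(n, grad_norms, alpha):
    total = sum(grad_norms) or 1.0
    importance = [g / total for g in grad_norms]
    mask = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j <= i:
                mask[i][j] = 1.0  # past: always visible
            else:
                # future: high-gradient tokens become visible first
                mask[i][j] = min(1.0, alpha * importance[j])
    return mask
```

With grad_norms like [0.1, 0.1, 0.9, 0.9] and a small alpha, only the high-gradient future tokens gain partial visibility; raising alpha gradually opens the rest.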

Multiple analogies:

  • Traffic cop: Causal is a one-way street; bidirectional is a two-way boulevard; GG-SM is timing the green lights so traffic starts flowing safely and faster.
  • Puzzle building: Instead of dumping all puzzle pieces at once, GG-SM lets in the most helpful future pieces first, so the picture forms without overwhelm.
  • Flashlight dimmer: You brighten the spots that matter (high-gradient tokens) before turning on all the lights.

Before vs. after:

  • Before: Causal is safe but incomplete; immediate bidirectional can be shaky; hybrids are fussy; schedulers help but aren’t smart about which tokens to open first.
  • After: GG-SM gives a smart, stable path to full context, consistently yielding better user embeddings across many tasks.

Why it works (intuition): Gradients act like a pressure map showing which hidden tokens would most reduce the training error. By letting those spots become visible first, the model learns to rely on the most informative future clues, preventing chaos from a sudden context flood.

Building blocks:

  • Datasets that pair user histories with future behaviors or QA alignments.
  • A dual-tower encoder (same backbone) to embed users and answers.
  • A contrastive objective that pulls matched pairs together and pushes others apart.
  • Mask recipes: causal, hybrid, bidirectional.
  • GG-SM: gradient-guided warm-up, then a linear schedule to full bidirectionality.

03Methodology

At a high level: Multimodal user history (and optional query) → modality encoders + adapters → decoder-only LLM tower to get a user embedding at the <USER> token; answer text → same LLM tower to get an answer embedding; apply attention masking (causal/hybrid/bidirectional) with GG-SM warm-up; train with contrastive loss; output: robust user embeddings.

šŸž Hook: Think of packing a suitcase. You group clothes (shirts, pants, socks) before zipping the bag.

🄬 The Concept (Modality-Specific Encoding): Different data types (bills, app taps, searches, tables) are encoded by small specialists, then mapped into the LLM’s space.

  • How it works: (1) Each modality goes through its own encoder; (2) Lightweight adapters align them to the LLM’s token space; (3) Concatenate in a standard template with tags; (4) Feed into the LLM.
  • Why it matters: Without this, the LLM would see a jumble of unrelated formats and learn a messy embedding.

šŸž Anchor: It’s like labeling boxes ā€œkitchen,ā€ ā€œbooks,ā€ and ā€œtoysā€ before loading the moving truck so everything fits the new house.

Step-by-step recipe:

  1. Standardized input template
  • What happens: Wrap each modality with clear tags: <bill>...</bill>, <search>...</search>, etc., append optional user query, then a special <USER> token where we read out the embedding.
  • Why: Tags teach the model structure; <USER> marks the ā€œcollect everything hereā€ spot.
  • Example: <bill>[grocery:$30, 2024-11-20]</bill><search>[ā€˜movie tickets’]</search> ... <USER>
  2. Dual-tower encoding with a shared LLM backbone
  • What happens: The left tower reads user+query up to <USER> and outputs the user embedding; the right tower reads the answer text and outputs the answer embedding. Both towers share weights but process inputs separately.
  • Why: Contrastive learning needs two views—user side and answer side—and the shared backbone keeps them compatible.
  • Example: Left produces a 1024-d user vector; right produces a 1024-d answer vector.
  3. Attention masking strategies
  • What happens: Choose a mask: causal (left-only), hybrid (mixed), or bidirectional (see all). With GG-SM, start causal, softly open future tokens by gradient importance, then linearly reach full bidirectionality.
  • Why: The mask controls how context is combined; GG-SM avoids a rough jump that can hurt learning.
  • Example: Early in training, only a few high-gradient future tokens near <USER> get partial visibility; later, all tokens are visible.
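Step 1's tagged template can be sketched as a small string builder (the tag format follows the examples above; the exact template is an assumption):

```python
# Wrap each modality in its tag, append the optional query, then end with
# the special <USER> token where the embedding is read out.

def build_input(modalities, query=None):
    parts = [f"<{name}>{content}</{name}>" for name, content in modalities]
    if query:
        parts.append(query)
    parts.append("<USER>")
    return "".join(parts)

print(build_input([("bill", "[grocery:$30, 2024-11-20]")],
                  query="Any weekend movie deals?"))
# <bill>[grocery:$30, 2024-11-20]</bill>Any weekend movie deals?<USER>
```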

šŸž Hook: Learning which questions to peek at on a test can make you overconfident unless you peek the right way.

🄬 The Concept (Contrastive Learning): Learn by pulling matching pairs close and pushing mismatched ones apart.

  • How it works: (1) Compute similarity between the user embedding and its correct answer (positive); (2) Compare against other answers and other users (negatives); (3) Increase the positive score, decrease negatives; (4) Repeat across the batch.
  • Why it matters: Without contrast, embeddings collapse or fail to capture what makes users uniquely similar or different.

šŸž Anchor: Sorting socks: you find the right pair (pull together) and keep different socks apart (push away).

  4. InfoNCE loss with smart negatives
  • What happens: Use InfoNCE to compute a softmax over similarities, include same-side negatives (user–user, answer–answer), and mask out false negatives with a margin.
  • Why: Same-side negatives make embeddings sharper; masking false negatives prevents punishing true similarities.
  • Example: If two users are extremely similar and pass the margin, they aren’t forced apart.
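A hedged sketch of the loss for a single user row; the margin rule and names here are illustrative, not the paper's exact recipe:

```python
import math

# InfoNCE for one user: softmax over similarities to all candidate
# answers, keeping the positive and only those negatives that are clearly
# less similar than the positive. Negatives within the margin are treated
# as possible false negatives and dropped from the denominator rather
# than pushed away.

def info_nce_row(sims, pos_idx, margin=0.1):
    pos = sims[pos_idx]
    kept = [s for j, s in enumerate(sims)
            if j == pos_idx or s < pos - margin]
    denom = sum(math.exp(s) for s in kept)
    return -math.log(math.exp(pos) / denom)

# the 0.85 negative is within the margin of the 0.9 positive, so it is
# excluded instead of being punished
loss = info_nce_row([0.9, 0.1, 0.85], pos_idx=0)
```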

šŸž Hook: If you try to sprint from zero to top speed, you stumble; a warm-up helps.

🄬 The Concept (Scheduler vs. Gradient-Guided Soft Masking): A scheduler opens attention on a timer; GG-SM opens it where gradients say it helps most.

  • How it works: (1) During warm-up, compute gradients to see which future tokens would reduce loss most; (2) Assign soft weights (not all-or-nothing); (3) Freeze those weights at warm-up end; (4) Interpolate to full open with a simple linear schedule.
  • Why it matters: Time-based opening can expose noisy or unhelpful tokens too early; gradient-guided opening focuses learning where it counts.

šŸž Anchor: A coach watches which muscles are tight and designs stretches for those first, then expands to a full workout.

  5. Training details that keep things stable
  • What happens: Large batch contrastive training, AdamW optimizer, cosine decay, LoRA for efficient fine-tuning; identical settings across masks for a fair comparison.
  • Why: Fair apples-to-apples tests show differences come from masking, not from extra tricks.
  • Example: Same backbone, same steps, different masking recipes.

Concrete walk-through with data

  • Input: A user’s 90-day history: paid utilities, rode public transit, searched ā€œdiscount movies,ā€ used a food app; optional query: ā€œAny weekend movie deals?ā€; then <USER>.
  • Encoding: Modality encoders produce embeddings, adapters align them, LLM processes sequence under the current mask.
  • GG-SM warm-up: Early on, gradients show ā€œdiscount moviesā€ and ā€œtransit ridesā€ near weekends matter most; those future tokens get soft visibility first.
  • Contrastive step: Pair with an answer like ā€œYes—half-price shows after 6 pm near you.ā€ The model pulls this pair closer than random answers.
  • Over time: Visibility opens to all tokens; the user embedding becomes a compact, context-rich vector.

Secret sauce

  • GG-SM uses the model’s own learning pressure (gradients) to decide where to look next. That makes the transition smoother than a blind schedule, stabilizes optimization, and yields better final bidirectional embeddings.

04Experiments & Results

šŸž Hook: If two students both get 87%, it means more when the rest of the class got around 75%—context turns numbers into meaning.

🄬 The Concept (AUC – Area Under the ROC Curve): A score from 0.5 (coin flip) to 1.0 (perfect) showing how well a model separates positives from negatives across thresholds.

  • How it works: (1) Rank examples by confidence; (2) Sweep a threshold; (3) Plot true-positive vs. false-positive rates; (4) Measure the area under that curve.
  • Why it matters: It’s threshold-free and shows overall discriminative power, great for imbalanced tasks.

šŸž Anchor: If you can almost always place the ā€œwill clickā€ users above the ā€œwon’t clickā€ users in a list, your AUC is high.

The test: 9 real-world, binary user cognition tasks from Alipay in three domains: (1) User Prediction (concert click, login, MAU loss), (2) Behavior Preference (transit, consumption power, food, movie), (3) Marketing Sensitivity (achievement, physical). Models are trained once to make embeddings, then we do linear probing—a simple classifier on top.

šŸž Hook: Imagine you learn to summarize books, and we test your summaries by seeing how well a simple quiz uses them to pick the right answer.

🄬 The Concept (Linear Probing): Train a tiny classifier on frozen embeddings to see how much useful info they hold.

  • How it works: (1) Freeze embeddings; (2) Fit a linear model per task; (3) Evaluate AUC; (4) Compare across models.
  • Why it matters: It tests representation quality, not just model capacity to memorize.

šŸž Anchor: It’s like checking if good notes help you pass a quiz without letting you rewrite the whole textbook.

Competition: We compare against strong general-purpose embedding models (e.g., Llama-embed-nemotron, KaLM-Embedding, Qwen3-Embedding), classic user models (MSDP, One4all, CPC), and LLM-based user models (FOUND, InstructUE). We also test the same decoder backbone under masking variants: Causal, Hybrid (several flavors), Bidirectional, Bidirectional + Scheduler, and Bidirectional + GG-SM (ours).

Scoreboard with context:

  • Our GG-SM reaches an average AUC of 0.7745 across all 9 tasks. Think of that as getting an A while many strong alternatives land more like B/B+.
  • It consistently edges out bidirectional without warm-up and scheduler-only methods, showing that how you open future attention matters, not just that you open it.
  • It also beats larger, general embedding models on these user tasks, highlighting that domain alignment and transition stability can trump raw parameter count.

Task-level color:

  • Behavior Preference: Notable gains on food/movie interest and transit, where long-range context (e.g., weekend searches aligning with purchases) matters.
  • Marketing Sensitivity: GG-SM is strong where latent intent is subtle; the gradient-guided opening seems to capture nuanced traits.
  • User Prediction: Solid improvements over causal and hybrid masks; bidirectional context helps anticipate near-future actions.

Surprising findings:

  • Hybrid masks provide only modest and inconsistent boosts—likely because extra parameters or control tokens add complexity without reliably improving alignment.
  • Training stability is visibly better with GG-SM: loss curves are smoother and converge faster than with a plain scheduler.
  • More parameters do not automatically win. On sparse, non-linguistic behavioral logs, carefully guided attention opening can beat much larger general models.

Takeaway: The transition path is a first-class design choice. GG-SM’s gradient-informed warm-up turns decoder-only LLMs into stronger bidirectional encoders for user understanding.

05Discussion & Limitations

Limitations

  • Domain scope: Experiments center on Alipay’s ecosystems and tasks; results may shift in domains with very different behavior patterns (e.g., education or health logs).
  • Data synthesis reliance: The QA alignment set uses LLM-generated pairs and post-processing; quality depends on prompt design and the base generator.
  • Full bidirectionality at inference: This maximizes embedding quality but is not directly generative; hybrid setups are better if you need generation and embeddings at the same time.
  • Gradient noise: Early gradients can be noisy; while soft masks smooth things, extremely short sequences or tiny batches may reduce the benefit.
  • Compute: Large-batch contrastive training and gradient-based masking add overhead; smaller setups may need lighter approximations.

Required resources

  • A capable decoder-only backbone and modality encoders/adapters.
  • Substantial GPU time (e.g., multi-GPU training with large batches) for stable contrastive learning.
  • Access to diverse user histories with careful privacy and compliance controls.
  • Engineering for a standardized input template and dual-tower retrieval pipeline.

When not to use

  • Real-time text generation where strict causality must be preserved at inference.
  • Ultra-short histories where bidirectional context adds little.
  • Settings where privacy prevents assembling multi-modal histories into one view.

Open questions

  • Can we estimate token importance without full gradients (e.g., using attention entropy or cheap saliency) to cut costs?
  • How does GG-SM interact with very long contexts (128k+) and memory-efficient attention variants?
  • Can we co-train for both generation and embeddings with a shared hybrid policy that adapts per task at inference?
  • How does the approach generalize to other domains (e.g., healthcare trajectories) and languages with different tokenization patterns?
  • Could curriculum learning pair with GG-SM to schedule which data, not just which tokens, to reveal when?

06Conclusion & Future Work

Three-sentence summary

  • Decoder-only LLMs can produce excellent user embeddings, but attention masking—and especially how we transition from causal to bidirectional—shapes their ultimate quality.
  • Gradient-Guided Soft Masking uses gradients to softly and smartly open future attention before a simple linear schedule completes the move to full bidirectionality.
  • This stable transition delivers stronger, more transferable user representations across nine real-world tasks, outperforming larger general embedding models and prior user-modeling baselines.

Main achievement

  • Turning the transition path itself into a learnable lever: GG-SM consistently stabilizes training and boosts bidirectional embedding quality without discarding decoder pretraining benefits.

Future directions

  • Cheaper importance signals (beyond gradients), better hybrid policies for joint generation+embedding, scaling to ultra-long contexts, and replication across very different behavior domains.

Why remember this

  • Not just what the model sees, but when and how it’s allowed to see it, can make the difference between average and best-in-class user understanding. GG-SM shows that gentle, guided visibility creates smarter, steadier learning for decoder-only LLMs used as user encoders.

Practical Applications

  • Personalized news or product feeds that consider both recent and upcoming seasonal patterns.
  • Early churn prediction for subscription services using long-range behavior cues.
  • Marketing sensitivity modeling to avoid over-targeting users who dislike frequent ads.
  • Context-aware search ranking that blends history with inferred future needs (e.g., weekends).
  • Financial app guidance (budget tips, bill reminders) personalized by multi-month activity trends.
  • Smart notifications that prioritize moments when a user is most receptive (e.g., transit top-ups).
  • Cold-to-warm onboarding that adapts as more user signals arrive, improving steadily over days.
  • Preference profiling for media (movies, food) that respects subtle, periodic interests.
  • A/B testing of campaigns using stable embeddings to detect real lifts, not noise.
  • Cross-domain transfer of embeddings to new tasks (e.g., from click prediction to retention).
Tags: decoder-only LLM, attention masking, causal attention, bidirectional attention, hybrid attention, contrastive learning, user embedding, InfoNCE, gradient-guided soft masking, scheduler, dual-tower encoder, AUC, linear probing, LoRA, multimodal user behavior