How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning
Key Summary
- Decoder-only language models can be great at making user profiles (embeddings), but how we let them look at the sequence, called attention masking, changes how smart those profiles are.
- Causal attention (only looking left) is safe for generation but misses future clues; bidirectional attention (looking both ways) makes richer profiles but can train unstably if you switch too fast.
- This paper compares causal, hybrid, and full bidirectional masks in one fair setup using contrastive learning and large real-world Alipay data.
- The key idea is Gradient-Guided Soft Masking (GG-SM), a warm-up that slowly opens "future vision" based on which tokens the gradients say are most important.
- GG-SM makes the switch from causal to bidirectional smooth and stable, leading to better user embeddings on 9 practical tasks (average AUC 0.7745), beating popular general embedding models and strong user-modeling baselines.
- Hybrid masks help a bit, but GG-SM with bidirectional attention works best for quality while keeping compatibility with decoder pretraining.
- The method is parameter-efficient and especially good on tasks that need long-range context, like preference and sensitivity predictions.
- Training curves show GG-SM converges more steadily than simple schedulers, meaning the model learns faster and more reliably.
- The study highlights that not just the final mask, but the transition path to it, is crucial for decoder-only LLMs used as user encoders.
Why This Research Matters
Better user embeddings help apps understand people more fairly and accurately. This means smarter recommendations (fewer annoying suggestions), earlier detection of churn or loss of interest, and more helpful guidance at the right time. For businesses, it improves campaign relevance and reduces waste, since models can capture long-range habits instead of just recent clicks. For users, it enhances privacy-aware personalization by making more sense from less data noise. The key innovation, opening attention gradually and wisely, can be reused in other sequence problems, like healthcare timelines or learning progress, improving decisions that touch everyday life. Overall, smoother, smarter learning turns messy behavior logs into helpful, respectful experiences.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're trying to guess what your friend will do next at a theme park. If you only remember the last ride they took, you'll miss patterns like "she always eats before a roller coaster." You need the whole story, not just the latest bit.
The Concept (User Representation Learning): It's a way for computers to build a compact profile (an embedding) of a person from many clues, such as purchases, app taps, and searches, so the system can understand and help them better.
- How it works: (1) Collect a user's many activities over time and across apps; (2) Turn each activity into numbers; (3) Mix them into one vector that captures habits and preferences; (4) Use that vector to make predictions (like which features they'll use or what they'll like next).
- Why it matters: Without a good user embedding, apps make poor guesses, like recommending bus tickets to someone who always takes the subway.
Anchor: Your music app remembering you love upbeat pop on weekdays and calm jazz on Sundays is user representation learning at work.
Hook: You know how a storyteller makes up a tale one sentence at a time, always using what was just said to decide what comes next?
The Concept (Decoder-Only LLMs): These are language models that generate or read sequences one token at a time, mainly using what came before.
- How it works: (1) Read the previous tokens; (2) Predict the next one; (3) Repeat; (4) Build understanding step by step.
- Why it matters: They're great for interactive systems where new signals come in continuously (like real-time app activity).
Anchor: Chatbots that keep answering as you keep typing rely on decoder-only LLMs to understand and respond on the fly.
Hook: Think of using a flashlight in a dark room. You choose where to shine it and where not to.
The Concept (Attention Masking): It tells the model where it's allowed to "look" in the sequence and where it must "look away."
- How it works: (1) Mark which tokens each position can see; (2) Hide some tokens (mask them); (3) Let attention focus only on visible parts; (4) Compute an output using those focused views.
- Why it matters: Without the right mask, the model might cheat during training or miss vital context during understanding.
Anchor: In a quiz, covering the answer with your hand keeps you honest; that's attention masking preventing peeking at the future.
Hook: Reading a mystery book only page by page without peeking ahead keeps things fair, but you might miss big-picture hints.
The Concept (Causal Attention): The model can only use tokens to the left (the past) and can't see the future.
- How it works: (1) For each token, allow attention only to earlier tokens; (2) Block all future tokens; (3) Predict or embed using past-only info; (4) Repeat for every position.
- Why it matters: It matches how decoder-only models were pretrained and keeps generation stable: no future peeking.
Anchor: When composing a message, you can't read the reply yet; you write using only what has been said.
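To make "left-only" concrete, here is a tiny generic sketch (plain Python, not the paper's code) that builds a causal mask, where 1 means "position i may attend to position j":

```python
# Build a causal (lower-triangular) attention mask for a short sequence.
# mask[i][j] == 1 means position i may attend to position j.
# This is a generic illustration, not any specific model's implementation.

def causal_mask(seq_len):
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Position 0 sees only itself, while the last position sees every earlier token; that triangular shape is exactly the "no future peeking" rule.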
The world before: People used encoder models (which see both past and future) to learn rich user embeddings, but those models need the full sequence upfront. That's hard in real apps where data arrives little by little. Decoder-only LLMs can work in streams, but they were trained with causal attention, which might limit how well they combine far-away clues into one powerful user profile.
The problem: No one had carefully compared how different masks (causal, hybrid, and bidirectional) affect user embeddings in one fair setup. Also, suddenly switching a decoder-only model from "past-only" to "see everything" can make training bumpy and the results worse.
Failed attempts: People tried (1) staying causal: safe, but it misses future context; (2) going fully bidirectional right away: rich context, but unstable training; (3) hybrid tricks: better, but complicated and not consistently strong; (4) simple schedulers that slowly open the mask: they help, but are still not smooth enough.
The gap: We needed a stable, data-driven way to transition from causal to bidirectional so decoder-only LLMs could learn truly strong user embeddings without losing their pretraining benefits.
Real stakes: Better embeddings mean apps can predict churn, tailor recommendations, plan marketing fairly, and avoid irrelevant nudges. That makes experiences smoother for users and decisions smarter for businesses.
02 Core Idea
Hook: Picture teaching a new puppy to cross a busy street. You don't drop the leash and say "run anywhere." You loosen it bit by bit, paying attention to where the puppy feels safest.
The Concept (Bidirectional Attention): Letting the model see both past and future tokens to build a fuller understanding.
- How it works: (1) For each position, allow attention to all tokens; (2) Weigh what matters most from both sides; (3) Form a richer, context-aware representation.
- Why it matters: Without looking both ways, the model misses future clues, like a later search or purchase, that make a user's pattern clearer.
Anchor: When guessing what a story is really about, you look at both earlier and later chapters, not just what came before one page.
Hook: Think of mixing hot and cold water to get the perfect bath temperature; you don't turn the faucets from cold to hot instantly.
The Concept (Hybrid Attention): A blend, often bidirectional within the user-history block but causal afterward, so some tokens see more while others stay sequential.
- How it works: (1) Pick a segment (like the history) to see both ways; (2) Keep future tokens causal; (3) Let a helper (like an MLP or a global token) guide attention; (4) Train embeddings with this mixed view.
- Why it matters: It tries to balance rich understanding with compatibility with generation, but tuning it can be tricky.
Anchor: It's like letting a class discuss freely within their table group (the history) while still following turn-taking rules for the rest of the room (the future).
Hook: Imagine dimmer switches on ceiling lights. Instead of flipping everything to full brightness, you brighten the bulbs that help you see best first.
The Concept (Gradient-Guided Soft Masking, GG-SM): A warm-up that opens future attention softly, guided by gradient signals about which tokens matter most, before smoothly moving to full bidirectionality.
- How it works: (1) Start with causal attention; (2) Measure which future tokens push learning the most (using gradient strength); (3) Give those tokens partial visibility first (soft masks); (4) After warm-up, linearly open to full bidirectionality.
- Why it matters: Jumping straight to full future visibility can confuse a model trained to look only left. GG-SM teaches it "where to look" and "how much," making training steady and embeddings stronger.
Anchor: When learning a new song, you first focus on the bars where you make the most mistakes. Then, once those are comfortable, you play the whole song smoothly.
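A minimal sketch of the warm-up idea, assuming per-token gradient magnitudes are already available. The function name, the `budget` knob, and the list-based mask are our illustrative choices, not the paper's implementation; a real model would use framework tensors:

```python
# Illustrative sketch of gradient-guided soft masking (GG-SM).
# During warm-up, future positions receive partial visibility in
# proportion to their normalized gradient magnitude, so the most
# informative future tokens are revealed first. Names are ours.

def soft_future_mask(seq_len, grad_magnitude, budget):
    """mask[i][j] in [0, 1]: the causal part stays fully visible,
    future token j gets visibility budget * normalized gradient."""
    total = sum(grad_magnitude) or 1.0
    weight = [g / total for g in grad_magnitude]
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if j <= i:                  # causal part: always visible
                row.append(1.0)
            else:                       # future: soft, gradient-weighted
                row.append(min(1.0, budget * weight[j]))
        mask.append(row)
    return mask

# Token 2 carries the strongest learning signal, so it opens first.
grads = [0.1, 0.2, 0.6, 0.1]
mask = soft_future_mask(4, grads, budget=2.0)
print([round(v, 2) for v in mask[0]])  # → [1.0, 0.4, 1.0, 0.2]
```

Note how the high-gradient token is already fully visible to position 0 while low-gradient tokens remain mostly hidden; after warm-up, a schedule would lift every entry toward 1.0.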
Aha! moment in one sentence: Don't just decide which tokens the model can see; decide how to open that visibility over time, using the model's own gradients as a guide.
Multiple analogies:
- Traffic cop: Causal is a one-way street; bidirectional is a two-way boulevard; GG-SM is timing the green lights so traffic starts flowing safely and faster.
- Puzzle building: Instead of dumping all puzzle pieces at once, GG-SM lets in the most helpful future pieces first, so the picture forms without overwhelm.
- Flashlight dimmer: You brighten the spots that matter (high-gradient tokens) before turning on all the lights.
Before vs. after:
- Before: Causal is safe but incomplete; immediate bidirectional can be shaky; hybrids are fussy; schedulers help but aren't smart about which tokens to open first.
- After: GG-SM gives a smart, stable path to full context, consistently yielding better user embeddings across many tasks.
Why it works (intuition): Gradients act like a pressure map showing which hidden tokens would most reduce the training error. By letting those spots become visible first, the model learns to rely on the most informative future clues, preventing chaos from a sudden context flood.
Building blocks:
- Datasets that pair user histories with future behaviors or QA alignments.
- A dual-tower encoder (same backbone) to embed users and answers.
- A contrastive objective that pulls matched pairs together and pushes others apart.
- Mask recipes: causal, hybrid, bidirectional.
- GG-SM: gradient-guided warm-up, then a linear schedule to full bidirectionality.
03 Methodology
At a high level: Multimodal user history (and an optional query) → modality encoders + adapters → decoder-only LLM tower to get a user embedding at the <USER> token; answer text → the same LLM tower to get an answer embedding; apply attention masking (causal/hybrid/bidirectional) with GG-SM warm-up; train with a contrastive loss; output: robust user embeddings.
Hook: Think of packing a suitcase. You group clothes (shirts, pants, socks) before zipping the bag.
The Concept (Modality-Specific Encoding): Different data types (bills, app taps, searches, tables) are encoded by small specialists, then mapped into the LLM's space.
- How it works: (1) Each modality goes through its own encoder; (2) Lightweight adapters align them to the LLM's token space; (3) Concatenate in a standard template with tags; (4) Feed into the LLM.
- Why it matters: Without this, the LLM would see a jumble of unrelated formats and learn a messy embedding.
Anchor: It's like labeling boxes "kitchen," "books," and "toys" before loading the moving truck so everything fits the new house.
Step-by-step recipe:
- Standardized input template
- What happens: Wrap each modality with clear tags: <bill>...</bill>, <search>...</search>, etc., append optional user query, then a special <USER> token where we read out the embedding.
- Why: Tags teach the model structure; <USER> marks the "collect everything here" spot.
- Example: <bill>[grocery:$30, 2024-11-20]</bill><search>["movie tickets"]</search> ... <USER>
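The template step can be sketched as a small helper. The tag names follow the example above, while `build_template` itself is our illustrative name, not a function from the paper:

```python
# Sketch of the standardized input template: each modality is wrapped
# in tags, the optional query is appended, and a trailing <USER> token
# marks where the embedding is read out.

def build_template(modalities, query=None):
    parts = [f"<{name}>{content}</{name}>" for name, content in modalities]
    if query:
        parts.append(query)
    parts.append("<USER>")          # readout position for the embedding
    return "".join(parts)

prompt = build_template(
    [("bill", "[grocery:$30, 2024-11-20]"), ("search", '["movie tickets"]')],
    query="Any weekend movie deals?",
)
print(prompt)
```

Whatever the exact tag vocabulary, the key design point is the fixed, parseable layout ending in <USER>, so the model always knows where to aggregate.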
- Dual-tower encoding with a shared LLM backbone
- What happens: The left tower reads user+query up to <USER> and outputs the user embedding; the right tower reads the answer text and outputs the answer embedding. Both towers share weights but process inputs separately.
- Why: Contrastive learning needs two viewsāuser side and answer sideāand the shared backbone keeps them compatible.
- Example: Left produces a 1024-d user vector; right produces a 1024-d answer vector.
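A toy sketch of the dual-tower structure: one shared encoder maps both the user prompt and the answer text into the same space, and cosine similarity scores the pair. The hash-style `encode` below is a stand-in for the shared LLM backbone, purely for illustration:

```python
import math

# Toy dual-tower setup. Both "towers" call the same encode() function
# (shared weights), each on its own input, and the pair is scored by
# cosine similarity of the unit-length vectors.

def encode(text, dim=8):
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[(i + ord(ch)) % dim] += 1.0     # crude stand-in features
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]          # unit-length embedding

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))

user_emb = encode("<bill>grocery</bill><USER>")   # left tower
ans_emb = encode("half-price shows after 6 pm")   # right tower
score = cosine(user_emb, ans_emb)
```

Only the structure matters here: two inputs, one set of weights, one comparable vector space, which is what lets the contrastive loss compare them directly.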
- Attention masking strategies
- What happens: Choose a mask: causal (left-only), hybrid (mixed), or bidirectional (see all). With GG-SM, start causal, softly open future tokens by gradient importance, then linearly reach full bidirectionality.
- Why: The mask controls how context is combined; GG-SM avoids a rough jump that can hurt learning.
- Example: Early in training, only a few high-gradient future tokens near <USER> get partial visibility; later, all tokens are visible.
Hook: Learning which questions to peek at on a test can make you overconfident unless you peek the right way.
The Concept (Contrastive Learning): Learn by pulling matching pairs close and pushing mismatched ones apart.
- How it works: (1) Compute the similarity between the user embedding and its correct answer (the positive); (2) Compare against other answers and other users (the negatives); (3) Increase the positive score, decrease the negatives; (4) Repeat across the batch.
- Why it matters: Without contrast, embeddings collapse or fail to capture what makes users uniquely similar or different.
Anchor: Sorting socks: you find the right pair (pull together) and keep different socks apart (push away).
- InfoNCE loss with smart negatives
- What happens: Use InfoNCE to compute a softmax over similarities, including same-side negatives (user-user, answer-answer), and mask out false negatives with a margin.
- Why: Same-side negatives make embeddings sharper; masking false negatives prevents punishing true similarities.
- Example: If two users are extremely similar and pass the margin, they arenāt forced apart.
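A minimal InfoNCE sketch with that margin rule. The variable names, the temperature value, and the exact masking condition are our assumptions for illustration, not the paper's formulation:

```python
import math

# Minimal InfoNCE with false-negative masking: any negative whose
# similarity exceeds (positive - margin) is dropped from the softmax
# denominator instead of being pushed apart.

def info_nce(pos_sim, neg_sims, temperature=0.1, margin=0.05):
    kept = [s for s in neg_sims if s < pos_sim - margin]
    logits = [pos_sim / temperature] + [s / temperature for s in kept]
    m = max(logits)                               # for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    # -log softmax probability of the positive pair
    return -(pos_sim / temperature - m - math.log(denom))

# A near-duplicate negative (0.83 vs positive 0.85) is masked out,
# so the loss comes only from the genuinely dissimilar negatives.
loss = info_nce(0.85, [0.83, 0.40, 0.10])
print(round(loss, 4))  # → 0.0116
```

With all negatives masked, the loss collapses to zero: two truly similar users are simply not forced apart, which is exactly the false-negative protection described above.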
Hook: If you try to sprint from zero to top speed, you stumble; a warm-up helps.
The Concept (Scheduler vs. Gradient-Guided Soft Masking): A scheduler opens attention on a timer; GG-SM opens it where gradients say it helps most.
- How it works: (1) During warm-up, compute gradients to see which future tokens would reduce the loss most; (2) Assign soft weights (not all-or-nothing); (3) Freeze those weights at the end of warm-up; (4) Interpolate to fully open with a simple linear schedule.
- Why it matters: Time-based opening can expose noisy or unhelpful tokens too early; gradient-guided opening focuses learning where it counts.
Anchor: A coach watches which muscles are tight and designs stretches for those first, then expands to a full workout.
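The contrast between the two policies can be sketched in a few lines; both function names and their arguments are illustrative, not from the paper:

```python
# A plain scheduler opens every future token by the same time-based
# factor; GG-SM freezes per-token soft weights at the end of warm-up
# and then linearly interpolates each weight toward full visibility.

def scheduler_visibility(step, total_steps):
    # same opening factor for every future token, regardless of content
    return min(1.0, step / total_steps)

def ggsm_visibility(frozen_weight, step, warmup_steps, total_steps):
    if step <= warmup_steps:
        return frozen_weight        # warm-up: gradient-derived weight
    # linear interpolation from the frozen weight to fully open (1.0)
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    return frozen_weight + t * (1.0 - frozen_weight)

# Halfway through post-warm-up training, a token frozen at 0.4
# visibility has opened to 0.7; by the end it reaches 1.0.
mid = ggsm_visibility(0.4, step=150, warmup_steps=100, total_steps=200)
print(round(mid, 2))  # → 0.7
```

The difference is per-token: the scheduler gives one global dial, while GG-SM starts each token from its own gradient-earned level, so informative tokens reach full visibility sooner.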
- Training details that keep things stable
- What happens: Large batch contrastive training, AdamW optimizer, cosine decay, LoRA for efficient fine-tuning; identical settings across masks for a fair comparison.
- Why: Fair apples-to-apples tests show differences come from masking, not from extra tricks.
- Example: Same backbone, same steps, different masking recipes.
Concrete walk-through with data
- Input: A user's 90-day history: paid utilities, rode public transit, searched "discount movies," used a food app; optional query: "Any weekend movie deals?"; then <USER>.
- Encoding: Modality encoders produce embeddings, adapters align them, LLM processes sequence under the current mask.
- GG-SM warm-up: Early on, gradients show "discount movies" and "transit rides" near weekends matter most; those future tokens get soft visibility first.
- Contrastive step: Pair with an answer like "Yes, half-price shows after 6 pm near you." The model pulls this pair closer than random answers.
- Over time: Visibility opens to all tokens; the user embedding becomes a compact, context-rich vector.
Secret sauce
- GG-SM uses the model's own learning pressure (gradients) to decide where to look next. That makes the transition smoother than a blind schedule, stabilizes optimization, and yields better final bidirectional embeddings.
04 Experiments & Results
Hook: If two students both get 87%, it means more when the rest of the class got around 75%; context turns numbers into meaning.
The Concept (AUC, Area Under the ROC Curve): A score from 0.5 (coin flip) to 1.0 (perfect) showing how well a model separates positives from negatives across thresholds.
- How it works: (1) Rank examples by confidence; (2) Sweep a threshold; (3) Plot the true-positive rate against the false-positive rate; (4) Measure the area under that curve.
- Why it matters: It's threshold-free and shows overall discriminative power, which is great for imbalanced tasks.
Anchor: If you can almost always place the "will click" users above the "won't click" users in a list, your AUC is high.
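The steps above amount to a rank statistic: AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative (ties count half). A generic sketch:

```python
# AUC computed directly from its rank interpretation: the fraction of
# (positive, negative) pairs that the model orders correctly, with
# ties counted as half a win. Generic illustration, O(n^2) on purpose.

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ordered correctly.
print(auc([0.9, 0.8, 0.7, 0.2], [1, 0, 1, 0]))  # → 0.75
```

Real evaluation code would use a sorting-based O(n log n) version, but the pairwise form makes the "separating positives from negatives" meaning explicit.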
The test: 9 real-world, binary user cognition tasks from Alipay in three domains: (1) User Prediction (concert click, login, MAU loss), (2) Behavior Preference (transit, consumption power, food, movie), (3) Marketing Sensitivity (achievement, physical). Models are trained once to produce embeddings; then we do linear probing, a simple classifier on top.
Hook: Imagine you learn to summarize books, and we test your summaries by seeing how well a simple quiz uses them to pick the right answer.
The Concept (Linear Probing): Train a tiny classifier on frozen embeddings to see how much useful information they hold.
- How it works: (1) Freeze the embeddings; (2) Fit a linear model per task; (3) Evaluate AUC; (4) Compare across models.
- Why it matters: It tests representation quality, not just a model's capacity to memorize.
Anchor: It's like checking whether good notes help you pass a quiz without letting you rewrite the whole textbook.
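A tiny end-to-end sketch of linear probing, with hand-made 2-d "embeddings" standing in for the frozen user vectors; all names and hyperparameters here are illustrative:

```python
import math

# Linear probing sketch: the embeddings are frozen; only a tiny
# logistic classifier (weights + bias) is trained on top of them.

def train_probe(embeddings, labels, lr=0.5, epochs=200):
    w = [0.0] * len(embeddings[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))       # sigmoid
            g = p - y                            # gradient of log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Linearly separable toy "embeddings": the probe recovers the split.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = [1, 1, 0, 0]
w, b = train_probe(X, y)
assert predict(w, b, [0.85, 0.15]) > 0.5
assert predict(w, b, [0.15, 0.85]) < 0.5
```

Because only `w` and `b` are trained, a high probe AUC can only come from information already present in the frozen embeddings, which is the whole point of the evaluation.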
Competition: We compare against strong general-purpose embedding models (e.g., Llama-embed-nemotron, KaLM-Embedding, Qwen3-Embedding), classic user models (MSDP, One4all, CPC), and LLM-based user models (FOUND, InstructUE). We also test the same decoder backbone under masking variants: Causal, Hybrid (several flavors), Bidirectional, Bidirectional + Scheduler, and Bidirectional + GG-SM (ours).
Scoreboard with context:
- Our GG-SM reaches an average AUC of 0.7745 across all 9 tasks. Think of that as getting an A while many strong alternatives land more like B/B+.
- It consistently edges out bidirectional without warm-up and scheduler-only methods, showing that how you open future attention matters, not just that you open it.
- It also beats larger, general embedding models on these user tasks, highlighting that domain alignment and transition stability can trump raw parameter count.
Task-level color:
- Behavior Preference: Notable gains on food/movie interest and transit, where long-range context (e.g., weekend searches aligning with purchases) matters.
- Marketing Sensitivity: GG-SM is strong where latent intent is subtle; the gradient-guided opening seems to capture nuanced traits.
- User Prediction: Solid improvements over causal and hybrid masks; bidirectional context helps anticipate near-future actions.
Surprising findings:
- Hybrid masks provide only modest and inconsistent boosts, likely because extra parameters or control tokens add complexity without reliably improving alignment.
- Training stability is visibly better with GG-SM: loss curves are smoother and converge faster than with a plain scheduler.
- More parameters do not automatically win. On sparse, non-linguistic behavioral logs, carefully guided attention opening can beat much larger general models.
Takeaway: The transition path is a first-class design choice. GG-SM's gradient-informed warm-up turns decoder-only LLMs into stronger bidirectional encoders for user understanding.
05 Discussion & Limitations
Limitations
- Domain scope: Experiments center on Alipay's ecosystem and tasks; results may shift in domains with very different behavior patterns (e.g., education or health logs).
- Data synthesis reliance: The QA alignment set uses LLM-generated pairs and post-processing; quality depends on prompt design and the base generator.
- Full bidirectionality at inference: This maximizes embedding quality but is not directly generative; hybrid setups are better if you need generation and embeddings at the same time.
- Gradient noise: Early gradients can be noisy; while soft masks smooth things, extremely short sequences or tiny batches may reduce the benefit.
- Compute: Large-batch contrastive training and gradient-based masking add overhead; smaller setups may need lighter approximations.
Required resources
- A capable decoder-only backbone and modality encoders/adapters.
- Substantial GPU time (e.g., multi-GPU training with large batches) for stable contrastive learning.
- Access to diverse user histories with careful privacy and compliance controls.
- Engineering for a standardized input template and dual-tower retrieval pipeline.
When not to use
- Real-time text generation where strict causality must be preserved at inference.
- Ultra-short histories where bidirectional context adds little.
- Settings where privacy prevents assembling multi-modal histories into one view.
Open questions
- Can we estimate token importance without full gradients (e.g., using attention entropy or cheap saliency) to cut costs?
- How does GG-SM interact with very long contexts (128k+) and memory-efficient attention variants?
- Can we co-train for both generation and embeddings with a shared hybrid policy that adapts per task at inference?
- How does the approach generalize to other domains (e.g., healthcare trajectories) and languages with different tokenization patterns?
- Could curriculum learning pair with GG-SM to schedule which data, not just which tokens, to reveal when?
06 Conclusion & Future Work
Three-sentence summary
- Decoder-only LLMs can produce excellent user embeddings, but attention masking, and especially how we transition from causal to bidirectional, shapes their ultimate quality.
- Gradient-Guided Soft Masking uses gradients to softly and smartly open future attention before a simple linear schedule completes the move to full bidirectionality.
- This stable transition delivers stronger, more transferable user representations across nine real-world tasks, outperforming larger general embedding models and prior user-modeling baselines.
Main achievement
- Turning the transition path itself into a learnable lever: GG-SM consistently stabilizes training and boosts bidirectional embedding quality without discarding decoder pretraining benefits.
Future directions
- Cheaper importance signals (beyond gradients), better hybrid policies for joint generation+embedding, scaling to ultra-long contexts, and replication across very different behavior domains.
Why remember this
- Not just what the model sees, but when and how it's allowed to see it, can make the difference between average and best-in-class user understanding. GG-SM shows that gentle, guided visibility creates smarter, steadier learning for decoder-only LLMs used as user encoders.
Practical Applications
- Personalized news or product feeds that consider both recent and upcoming seasonal patterns.
- Early churn prediction for subscription services using long-range behavior cues.
- Marketing sensitivity modeling to avoid over-targeting users who dislike frequent ads.
- Context-aware search ranking that blends history with inferred future needs (e.g., weekends).
- Financial app guidance (budget tips, bill reminders) personalized by multi-month activity trends.
- Smart notifications that prioritize moments when a user is most receptive (e.g., transit top-ups).
- Cold-to-warm onboarding that adapts as more user signals arrive, improving steadily over days.
- Preference profiling for media (movies, food) that respects subtle, periodic interests.
- A/B testing of campaigns using stable embeddings to detect real lifts, not noise.
- Cross-domain transfer of embeddings to new tasks (e.g., from click prediction to retention).