RecGPT-V2 Technical Report
Key Summary
- RecGPT-V2 turns a recommender system into a smart team: a planner, several specialists, and a fair judge that all work together.
- It shrinks long user histories into tiny "atoms" the AI can read faster, cutting GPU use by about 60% without losing meaning.
- A hierarchical multi-agent setup avoids duplicated thinking and boosts unique finds (exclusive recall) from 9.39% to 10.99%.
- Meta-prompting creates explanations that adapt to seasons, trends, and user style, raising diversity by 7.3% and human acceptance by 13.0%.
- A constrained reinforcement learning strategy teaches the model to hit accuracy first while keeping diversity and length within smart limits (+24.1% tag accuracy).
- An Agent-as-a-Judge breaks judging into steps (like humans do) and then distills those judgments into rewards to train the model further.
- Online A/B tests on Taobao show steady gains: +3.01% CTR, +3.64% IPV, +2.11% TV, and +11.46% novelty exposure.
- Engineering upgrades (prefill-decode separation, FP8 attention kernels) raise utilization and throughput massively.
- The system catches seasonal shifts (like Halloween and winter gear) earlier and explains recommendations more helpfully.
- Overall, RecGPT-V2 makes LLM-powered intent reasoning both practical at scale and more aligned with what people value.
Why This Research Matters
RecGPT-V2 makes recommendations both smarter and kinder to your time by understanding intent, season, and style, not just past clicks. It reduces computation dramatically, which cuts costs and energy use while enabling fast, frequent updates as trends change. Explanations become clearer and more engaging, building trust because you see the reason behind each suggestion. Novelty exposure increases, so users discover fresh items instead of being stuck in filter bubbles. A fairer, multi-step judge turns human-like evaluation into steady model improvements, keeping quality high as tastes evolve. All of this has been proven in live traffic on a massive platform, showing it's practical, not just theoretical.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how a friend who really knows you can suggest the perfect movie for tonight? Not just any movie you liked before, but one that fits your mood, the weather, and what's trending.
The Concept (Recommender systems): A recommender system is a tool that predicts what you'll like based on your past actions and context. How it works:
- Collect your past clicks, searches, and purchases.
- Learn patterns from many users and items.
- Predict which items you're likely to enjoy now.
Why it matters: Without recommender systems, you'd scroll forever; with them, you get relevant choices quickly.
Anchor: When Taobao shows you winter boots as the temperature drops, that's the recommender noticing context, not just history.
The world before: For years, recommenders mainly matched patterns in logs. First came matrix factorization (finding hidden tastes in big tables), then deep neural networks (finding complex patterns). These systems were great at "who liked what," but not so great at "why this, right now?" They usually didn't reason about your intent, like "You're planning a Halloween party" or "It's turning cold."
Hook: Imagine writing a book report by copying sentences instead of explaining the story in your own words.
The Concept (LLM reasoning in recommendation): Using large language models (LLMs) lets recommenders explain and reason about user intent in natural language. How it works:
- Read user history and context as text.
- Think step-by-step about likely goals (fitness, gifting, seasonal needs).
- Generate tags (like topics) and explanations that align with those goals.
Why it matters: Without reasoning, the system treats all clicks the same and misses the "why," leading to generic or late recommendations.
Anchor: Asking "Why recommend hot cocoa mix?" gets the answer "Because it's getting colder and you searched for 'cozy mugs' yesterday."
RecGPT-V1 was a big step: it brought LLMs into user interest mining and item tag prediction, turning guesses into explainable intents. But four problems slowed it down in the real world:
- Computational waste and duplicated thinking: Multiple LLM routes each re-read long histories (often 32K tokens) and produced overlapping candidates (about 13.46% duplicates).
- Boring explanations: Fixed templates made one-size-fits-all messages that ignored real-time signals like weather or holidays.
- Training on static, supervised data: The world changes quickly; fixed datasets don't capture live tradeoffs like relevance vs. diversity vs. novelty.
- One-shot judging: LLM-as-a-Judge gave a single score without doing human-like, step-by-step reviewing across multiple dimensions, so it missed nuances.
Failed attempts:
- "Just add more routes" broadened coverage but amplified duplication and cost.
- "Use one template for explanations" stayed safe but stale.
- "Supervised fine-tune harder" learned yesterday's patterns, not tomorrow's shifts.
- "Let a single LLM score the result" was cheap, but less aligned with human grading.
Hook: Think of packing your whole room into a tiny suitcase for a trip.
The Concept (Item tags): An item tag is a short, meaningful label that captures what an item is or why it fits a user (e.g., "wool blend cardigan," "kids' Halloween costume"). How it works:
- Read item info and user context.
- Predict short, specific labels.
- Use these tags to fetch matching items fast.
Why it matters: Without clear tags, retrieval is noisy and misses intent.
Anchor: Labeling "adjustable dumbbell set" beats a vague tag like "workout gear" when matching a home-fitness seeker.
The gap RecGPT-V2 fills:
- Organize thinking with a hierarchy (planner → experts → arbiter) to avoid duplication.
- Compress long histories into compact "atoms" so the AI reads faster with little meaning loss.
- Generate adaptive prompts to craft timely, informative explanations.
- Train with constrained reinforcement learning so accuracy rises without sacrificing diversity and clarity.
- Judge like humans (multi-step, multi-dimension), then distill that judgment into rewards to keep improving automatically.
Real stakes:
- For users: more timely, varied, and understandable suggestions (less scrolling, more delight).
- For sellers: better matching and higher conversion (more of the right eyes on the right items).
- For platforms: lower compute bills (~60% GPU savings), better novelty exposure (NER +11.46%), and steadier long-term engagement.
- For the planet: fewer wasted GPU cycles mean less energy use.
- For everyone: clearer explanations build trust, because you see not just the "what" but the "why now."
02 Core Idea
Aha in one sentence: Organize the recommender like a well-coached team, compress the playbook, and train with a fair, step-by-step judge so the system is faster, smarter, and better aligned with people.
Three analogies:
- Orchestra: a conductor (planner) guides sections (experts), then a final ear (arbiter) blends them; a critic (judge) explains what worked; the orchestra practices under clear rules (constrained RL) while reading compact scores (compressed tokens).
- Kitchen: a head chef assigns dishes to specialists, a pass checks plates, a taster rates by flavor, freshness, and presentation; recipes are shortened notes, not essays.
- Sports team: a coach sets plays, players cover lanes without colliding, a referee calls fouls, and training focuses on scoring while meeting conditions (stay in bounds, pass enough).
Hook: You know how a school project goes smoother when one person plans, classmates handle the parts they're best at, and someone checks the final poster?
The Concept (Hierarchical Multi-Agent System): A planner assigns focused tasks to expert agents, and an arbiter picks the best combined result. How it works:
- Planner reads compressed user/context signals and creates intent "personas."
- Each expert generates tags for its persona (e.g., fashion, kids, health).
- Arbiter chooses a non-redundant, high-value set of tags for retrieval.
Why it matters: Without a hierarchy, experts duplicate work and miss coverage; with it, reasoning is coordinated and efficient.
Anchor: For a parent before Halloween, experts might propose "kids' costume," "hydrating lotion," and "cardigan"; the arbiter picks the top mix.
Hook: Imagine replacing a 12-word item title with a single powerful sticker that means the same thing to the model.
The Concept (Hybrid Representation Inference): Compress long texts into "atomic" vectors the LLM can ingest alongside normal words. How it works:
- Embed item/search texts into dense vectors.
- Project vectors via a small adaptor so the frozen LLM understands them.
- Mix these atoms with naturalâlanguage context (profile, time, weather).
- Train the adaptor so compressed prompts produce the same answers as full text.
Why it matters: Without compression, the model wastes compute rereading long histories; with atoms, it runs faster at similar quality.
Anchor: A 21,349-token history shrinks to about 5,158 tokens while keeping what matters.
Hook: If you tell a storyteller "make it playful and timely," their story changes without you writing the whole script.
The Concept (Meta-Prompting): First write a style guide from context, then write the explanation following that guide. How it works:
- Synthesize a style (tone, audience, season/trend cues).
- Generate the explanation conditioned on that style.
- Evaluate across seven dimensions (incl. timeliness, informativeness, attractiveness).
Why it matters: Without meta-prompts, explanations sound samey and miss the moment.
Anchor: "Playful, holiday-friendly" yields "Trick or Treat? Here comes the dark magic!" for a Halloween item.
Hook: Think of a game where you only score when you play well and follow key rules.
The Concept (Constrained Reinforcement Learning with CRS): Optimize for accuracy, but only count the points when diversity, alignment, and length pass thresholds. How it works:
- Define rewards: accuracy (main), plus gates for alignment, diversity, and length.
- If the gates pass, the accuracy reward flows; if not, it's zeroed.
- Train with GRPO to improve while staying near the supervised base.
- Use the same idea for explanations (alignment main, diversity gated).
Why it matters: Without constraints, easy objectives steal training; with constraints, the model improves what matters first.
Anchor: The system won't celebrate 10 similar tags; it learns to be right and varied. The gating rule is written out below.
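To make the gating concrete, here is one way to write the constrained reward for a candidate output y. The individual reward terms and the thresholds (the tau values and length bounds) are illustrative assumptions drawn from the description above, not the paper's exact formulation:

```latex
% Gated (CRS-style) reward: accuracy only counts when every gate passes.
R(y) = R_{\mathrm{acc}}(y) \cdot
       \mathbb{1}\!\left[ R_{\mathrm{align}}(y) \ge \tau_{\mathrm{align}} \right] \cdot
       \mathbb{1}\!\left[ R_{\mathrm{div}}(y) \ge \tau_{\mathrm{div}} \right] \cdot
       \mathbb{1}\!\left[ L_{\min} \le |y| \le L_{\max} \right]
```

If any indicator is zero, the whole reward is zero, so the optimizer cannot trade accuracy away for cheap diversity or length wins.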
Hook: When teachers grade essays, they check relevance, clarity, facts, and style before giving a final grade.
The Concept (Agent-as-a-Judge): Multiple small evaluators score different dimensions, then a senior reviewer combines them into Superior/Average/Bad. How it works:
- Sub-judges rate relevance, diversity, coherence, etc.
- Detect defects first; if any critical dimension fails, mark the output Bad.
- Otherwise, decide between Superior and Average by thresholds.
Why it matters: One-shot scores miss nuance; step-by-step judging matches humans better.
Anchor: A great explanation that's timely and factual gets "Superior"; one with a fact error drops to "Bad." A minimal sketch of this logic follows.
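Below is a tiny Python sketch of that defect-then-excellence rule. The dimension names, the set treated as critical, and the numeric thresholds are assumptions for illustration, not the paper's exact rubric:

```python
from dataclasses import dataclass

@dataclass
class SubScores:
    """Sub-judge scores in [0, 1] for one explanation; dimension names are assumed."""
    relevance: float
    factuality: float
    coherence: float
    diversity: float
    timeliness: float

CRITICAL = ("relevance", "factuality", "coherence")  # defects here are disqualifying (assumption)
DEFECT_THRESHOLD = 0.5                               # assumed cutoff for a critical defect
SUPERIOR_THRESHOLD = 0.8                             # assumed bar for "Superior"

def senior_review(s: SubScores) -> str:
    """Defect-then-excellence: screen for critical failures first, then grade the rest."""
    # Phase 1: any critical dimension below the defect bar -> Bad, however good the rest is.
    if any(getattr(s, dim) < DEFECT_THRESHOLD for dim in CRITICAL):
        return "Bad"
    # Phase 2: no defects -> Superior if every dimension clears the excellence bar, else Average.
    return "Superior" if all(v >= SUPERIOR_THRESHOLD for v in vars(s).values()) else "Average"

print(senior_review(SubScores(0.9, 0.95, 0.9, 0.85, 0.9)))  # -> Superior
print(senior_review(SubScores(0.9, 0.30, 0.9, 0.95, 0.9)))  # -> Bad (factual defect)
```

The two phases capture the intuition that no amount of style can rescue a factually broken explanation.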
Hook: Gold stars from a fair judge guide you to improve next time.
The Concept (Judge-as-a-Reward): Distill agent judgments into a lightweight model that outputs smooth reward scores for training. How it works:
- Start from the judge, replace the head with a scalar scorer.
- Train listwise: S > A > B across groups.
- Use the scores as rewards in RL; share prefixes to speed training.
Why it matters: Without dense rewards, RL is noisy or slow; distilled rewards speed learning and preserve human preferences.
Anchor: The scorer consistently ranks "season-aware, specific, safe" explanations above generic ones, pushing the model to write better.
Before vs. After:
- Before: Multiple routes reread everything, explanations were template-like, training fought itself, and judging was one-shot.
- After: A coordinated team reads compact histories, writes adaptive explanations, learns under clear constraints, and judges like humans, leading to +3.01% CTR, +11.46% novelty, and ~60% compute savings.
Why it works (intuition): Divide-and-conquer reduces duplicate reasoning; compression preserves meaning while cutting cost; constraints stop easy-but-wrong progress; process-oriented judging provides reliable, trainable feedback. Together, they align the system with how people decide, not just how logs look.
03 Methodology
At a high level: User/context input → Hybrid Representation Inference → Hierarchical Multi-Agent Reasoning (Planner → Experts → Arbiter) → Retrieval + Dynamic Explanations → Agentic Evaluation → Reward Distillation → Constrained RL fine-tuning → Output (ranked items + explanations).
Step 1: Hybrid Representation Inference
Hook: Packing a big library into a set of smart index cards you can skim fast.
The Concept (Atomized Entity Compression): Turn long item titles and queries into single atomic vectors the LLM can read like tokens. How it works:
- Embed each entity (item/query) with a strong embedding model.
- Project it with a tiny adaptor so the frozen LLM understands it.
- Replace long texts in the prompt with these [entity] atoms while keeping key natural-language context (e.g., age, location, weather).
- Train the adaptor so compressed prompts reproduce the same answers as full-text prompts across QA and production tasks.
Why it matters: Without atoms, 10K-30K token prompts throttle throughput; with atoms, prompts shrink ~7× while preserving function.
Anchor: A 12-token Chinese title becomes one [entity]; an entire 21,349-token history compresses to ~5,158 tokens. A minimal adaptor sketch follows.
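A minimal PyTorch sketch of the adaptor idea, assuming a frozen embedding model that yields one vector per entity and a frozen LLM that accepts precomputed input embeddings (e.g., a Hugging Face model's `inputs_embeds` argument). The module layout, dimensions, and the behavior-matching training note are illustrative, not the production design:

```python
import torch
import torch.nn as nn

class EntityAdaptor(nn.Module):
    """Projects a frozen text-embedding vector into the (frozen) LLM's embedding space."""
    def __init__(self, embed_dim: int = 1024, llm_hidden: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(            # small trainable bridge; embedder and LLM stay frozen
            nn.Linear(embed_dim, llm_hidden),
            nn.GELU(),
            nn.Linear(llm_hidden, llm_hidden),
        )

    def forward(self, entity_vecs: torch.Tensor) -> torch.Tensor:
        # (batch, num_entities, embed_dim) -> (batch, num_entities, llm_hidden)
        return self.proj(entity_vecs)

adaptor = EntityAdaptor()
word_embs = torch.randn(1, 40, 4096)          # stand-in for embedded natural-language context
entity_vecs = torch.randn(1, 30, 1024)        # 30 behaviors, one precomputed vector each
atoms = adaptor(entity_vecs)                  # each atom now occupies a single token slot
inputs_embeds = torch.cat([word_embs, atoms], dim=1)   # mixed prompt fed to the frozen LLM

# Training idea (sketch): ask the frozen LLM the same questions with the full-text prompt and
# with this compressed prompt, and update only the adaptor until the answers match.
```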
Infrastructure add-ons (disaggregated serving + fast attention):
Hook: Think of a pit stop crew: one team changes tires fast (prefill), another handles the driver's requests steadily (decode).
The Concept (Disaggregated Prefill-Decode): Run the long, parallel prefill on a big GPU pool, then hand off KV caches to decode workers for autoregressive generation. How it works:
- Prefill computes attention over long inputs in parallel (compute-heavy).
- Transfer KV cache efficiently to decode workers.
- Decode generates tokens sequentially (memory-heavy) with fewer GPUs.
- Swap in FP8-friendly attention kernels (e.g., XQA) for speed on modern GPUs.
Why it matters: Without separation, one phase bottlenecks the other; with it, MFU and throughput jump dramatically.
Anchor: Prefill QPS up ~69× and decode TPS up ~7× versus the old stack; MFU improves ~53% overall with other changes. A toy sketch of the handoff pattern follows.
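As a mental model only, here is a framework-free toy of the handoff pattern: a prefill worker does one heavy parallel pass over the long prompt and ships its KV cache, and a separate decode worker reuses that cache to generate tokens one at a time. Real deployments do this across GPU pools with FP8 attention kernels and high-bandwidth cache transfer; every function and shape below is a simplification:

```python
import numpy as np

def prefill(prompt_tokens: np.ndarray, d: int = 64) -> dict:
    """Compute-heavy phase: process the whole long prompt in parallel, return its KV cache."""
    rng = np.random.default_rng(0)
    return {
        "keys": rng.standard_normal((len(prompt_tokens), d)),     # stand-in per-token K projections
        "values": rng.standard_normal((len(prompt_tokens), d)),   # stand-in per-token V projections
    }

def decode(kv_cache: dict, max_new_tokens: int = 5, d: int = 64) -> list:
    """Memory-heavy phase: generate tokens one by one, attending over the transferred cache."""
    rng = np.random.default_rng(1)
    out = []
    for _ in range(max_new_tokens):
        q = rng.standard_normal(d)                       # query for the next token
        attn = np.exp(kv_cache["keys"] @ q / np.sqrt(d))
        attn /= attn.sum()
        ctx = attn @ kv_cache["values"]                  # attention readout
        out.append(int(ctx.argmax()))                    # toy "sampling"
        kv_cache["keys"] = np.vstack([kv_cache["keys"], q])      # cache keeps growing while decoding
        kv_cache["values"] = np.vstack([kv_cache["values"], ctx])
    return out

cache = prefill(np.arange(5000))   # long history handled once on the prefill pool
print(decode(cache))               # short generation runs on a cheaper decode worker
```

Because the two phases stress different resources, they can be scaled and scheduled independently once the cache handoff is cheap.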
Step 2: Hierarchical Multi-Agent System (HMAS)
Hook: A coach (planner) studies the playbook, assigns roles to players (experts), and a captain (arbiter) picks the winning combination.
The Concept (Planner → Experts → Arbiter): Coordinate reasoning so experts cover different angles without overlap. How it works:
- Global Planner reads compressed behaviors + profile + environment (weather/trends/season) and generates K personas.
- Distributed Experts use their persona to propose item tags (fashion, kids, health, etc.).
- Decision Arbiter jointly selects top-N tags that maximize relevance, specificity, and complementarity.
Why it matters: Without this structure, experts repeat one another; with it, exclusive recall rises and compute drops.
Anchor: Cooler weather + Halloween incoming → personas propose "wool blend cardigan," "kids' hydrating lotion," "kids' costume," "adjustable dumbbells"; the arbiter picks a balanced set. An orchestration sketch follows.
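A minimal orchestration sketch of the planner → experts → arbiter flow. The persona names, the `call_llm` placeholder, and the greedy de-duplication rule are assumptions for illustration; the production arbiter reasons jointly over relevance, specificity, and complementarity rather than word overlap:

```python
def call_llm(prompt: str) -> list[str]:
    """Placeholder for a role-specific LLM call; canned outputs keep the sketch runnable offline."""
    canned = {
        "planner": ["autumn fashion shopper", "parent preparing for Halloween", "home-fitness beginner"],
        "autumn fashion shopper": ["wool blend cardigan", "knit scarf", "wool cardigan"],
        "parent preparing for Halloween": ["kids' Halloween costume", "kids' hydrating lotion"],
        "home-fitness beginner": ["adjustable dumbbell set", "yoga mat"],
    }
    return canned.get(prompt, [])

def hmas(compressed_history: str, context: str, top_n: int = 4) -> list[str]:
    # (compressed_history and context are unused in this toy; in production they are folded into
    #  the planner and expert prompts alongside the [entity] atoms.)
    personas = call_llm("planner")                                # 1) Planner proposes intent personas.
    candidates = [tag for p in personas for tag in call_llm(p)]   # 2) Experts propose tags per persona.
    selected, seen_words = [], set()
    for tag in candidates:                                        # 3) Arbiter keeps a non-redundant set.
        words = set(tag.split())
        if len(words & seen_words) >= 2:                          # crude redundancy check
            continue
        selected.append(tag)
        seen_words |= words
        if len(selected) == top_n:
            break
    return selected

print(hmas("[entity] x 30", "cooling weather, Halloween approaching"))
# e.g. ['wool blend cardigan', 'knit scarf', "kids' Halloween costume", "kids' hydrating lotion"]
```

Note that "wool cardigan" is dropped by the redundancy check, which is the point of giving one agent the final say.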
Step 3: Training the Experts (SFT → Constrained RL)
Hook: First learn the rules with examples, then practice under a scoreboard that only counts good-and-legal points.
The Concept (Constrained RL with CRS for tags): Improve accuracy only when diversity, alignment, and length pass gates. How it works:
- Supervised fine-tune on persona-aligned data (a mix of behaviors, trends, weather, and general instruction following).
- Switch to GRPO: sample multiple outputs per prompt, score them, and update relative to a base model with a KL guardrail.
- Use rewards: accuracy (main), alignment (RM), diversity (embedding distance), and length. Multiply accuracy by binary gates for the others.
- Train until accuracy and diversity both improve stably.
Why it matters: Without gates, easy rewards hijack training; with gates, the model becomes both right and richly varied.
Anchor: HR@30 climbs from 26.29% (V1) to 32.60% with CRS. A GRPO-style sketch with gated rewards follows.
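A compact sketch of how the gated reward plugs into GRPO-style training: sample a group of candidate tag sets per prompt, score each with the constrained reward, and convert group-relative (normalized) rewards into advantages. The reward fields, thresholds, and scores below are made up for illustration, and the KL penalty to the base model is only noted in a comment:

```python
import numpy as np

def crs_reward(sample, tau_align=0.7, tau_div=0.5, len_range=(3, 12)):
    """Constrained reward: accuracy flows only if alignment, diversity, and length pass their gates."""
    gates = (
        sample["align"] >= tau_align,
        sample["div"] >= tau_div,
        len_range[0] <= sample["length"] <= len_range[1],
    )
    return sample["acc"] if all(gates) else 0.0

def group_advantages(group):
    """GRPO-style: normalize rewards within the sampled group (zero mean, unit variance)."""
    r = np.array([crs_reward(s) for s in group], dtype=float)
    return (r - r.mean()) / (r.std() + 1e-6)

group = [  # four sampled tag sets for the same prompt (toy scores)
    {"acc": 0.90, "align": 0.8, "div": 0.6, "length": 8},  # good and legal -> positive advantage
    {"acc": 0.95, "align": 0.9, "div": 0.3, "length": 8},  # accurate but repetitive -> gated to 0
    {"acc": 0.60, "align": 0.8, "div": 0.7, "length": 7},  # legal but less accurate
    {"acc": 0.70, "align": 0.5, "div": 0.8, "length": 9},  # misaligned -> gated to 0
]
print(group_advantages(group))
# In GRPO the token log-probs of each sample are weighted by its advantage, with an extra
# KL penalty that keeps the updated policy close to the supervised base model.
```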
Step 4: Retrieval with Multi-Interest Users and Traffic Allocation
- Multi-interest encoding collects several user vectors to match different intent strands.
- A quadratic program balances exploratory items (from cognitive channel) with business targets (utility channel) under exposure and revenue constraints.
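A small quadratic-program sketch of that exposure split, using the cvxpy library: stay close to a balanced exposure target while meeting a revenue floor for the utility channel and an exploration floor for the cognitive channel. The objective, constraints, and numbers are a stand-in for the production formulation, which the report does not spell out here:

```python
import cvxpy as cp
import numpy as np

revenue = np.array([0.2, 0.5, 0.8, 0.3, 0.9, 0.4])   # expected revenue per unit exposure (toy)
cognitive = np.array([1, 1, 0, 1, 0, 0])             # 1 = exploratory/cognitive, 0 = utility channel
target = np.full(len(revenue), 1.0 / len(revenue))   # "ideal" balanced exposure distribution

x = cp.Variable(len(revenue), nonneg=True)           # exposure share per item group
problem = cp.Problem(
    cp.Minimize(cp.sum_squares(x - target)),         # quadratic objective: stay near the target mix
    [
        cp.sum(x) == 1,                              # exposure shares form a distribution
        revenue @ x >= 0.5,                          # business floor: enough expected revenue
        cognitive @ x >= 0.3,                        # exploration floor: enough novel exposure
    ],
)
problem.solve()
print(np.round(x.value, 3))                          # allocation balancing exploration and business
```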
Step 5: Dynamic Explanation Generation (Meta-Prompting + RL)
Hook: Set the mood board first, then write the caption.
The Concept (Meta-Prompting): Two-stage generation: style synthesis, then style-conditioned explanation. How it works:
- Build a style guide from user interests, item traits, and situational signals (timeliness, audience, tone).
- Generate the explanation following that guide.
- Evaluate on seven dimensions (adds timeliness, informativeness, attractiveness to V1's four).
Why it matters: Without styles, texts repeat and ignore the moment.
Anchor: For a kids' costume near Halloween: "Playful, festive, parent-friendly" → "Trick or treat? Here comes the dark magic!" A minimal two-stage sketch follows.
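A minimal two-stage sketch of meta-prompting: first synthesize a style guide from the context, then condition the explanation on it. The `call_llm` placeholder, the prompt wording, and the canned outputs are assumptions so the example runs offline; in production both stages are real model calls:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for the explanation model; canned replies keep the sketch runnable offline."""
    if "Write a short style guide" in prompt:
        return "Tone: playful and festive. Audience: parents. Cue: Halloween is only days away."
    return "Trick or treat? Here comes the dark magic! A comfy costume your kid can wear all night."

def explain(user_interests: str, item: str, situation: str) -> str:
    # Stage 1 (meta-prompt): turn context into a style guide instead of reusing a fixed template.
    style = call_llm(
        f"Write a short style guide (tone, audience, timeliness cues) for recommending "
        f"'{item}' to a user interested in {user_interests}, given: {situation}."
    )
    # Stage 2: generate the explanation conditioned on that freshly written style guide.
    return call_llm(
        f"Following this style guide -- {style} -- explain in one sentence why '{item}' fits right now."
    )

print(explain("kids' clothing, party supplies", "kids' Halloween costume", "late October, cooling weather"))
```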
Hook: Like poetry night where fresh words get brownie points, but the poem still has to be meaningful.
The Concept (Preference-Aware RL for explanations): Optimize alignment as the main reward, gated by diversity. How it works:
- Keep a small FIFO memory of recent texts; rare tokens score higher (IDF-like diversity).
- Use a reward model (from judge distillation) to score human-aligned quality.
- CRS gates: only count alignment when diversity passes a threshold.
- Train with GRPO for stable gains.
Why it matters: Without this, you get flowery sameness or bland safety; with it, texts are varied, timely, and useful.
Anchor: Diversity rises ~7.3%, quality acceptance ~13.0%. A small diversity-reward sketch follows.
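A small sketch of the diversity signal and gate described above: a FIFO memory of recent explanations, an IDF-like score that rewards tokens the memory has rarely seen, and a CRS gate that only lets the alignment reward through when diversity clears a threshold. The whitespace tokenizer, window size, and threshold are illustrative assumptions:

```python
import math
from collections import Counter, deque

MEMORY = deque(maxlen=200)              # FIFO window of recently generated explanations

def diversity_score(text: str) -> float:
    """IDF-like novelty: tokens that appear in few remembered texts score higher."""
    doc_freq = Counter()
    for past in MEMORY:
        doc_freq.update(set(past.split()))
    n_docs = max(len(MEMORY), 1)
    tokens = text.split()
    if not tokens:
        return 0.0
    idf = [math.log((n_docs + 1) / (doc_freq[t] + 1)) for t in tokens]
    return sum(idf) / len(idf)

def explanation_reward(text: str, alignment: float, div_threshold: float = 0.5) -> float:
    """CRS gate for explanations: alignment (from the distilled reward model) only counts
    when the text is sufficiently novel relative to recent outputs."""
    reward = alignment if diversity_score(text) >= div_threshold else 0.0
    MEMORY.append(text)                 # remember this text so future repeats score lower
    return reward

print(explanation_reward("Trick or treat? Dark magic for your little wizard!", alignment=0.9))  # 0.9
print(explanation_reward("Trick or treat? Dark magic for your little wizard!", alignment=0.9))  # 0.0, repeat is gated
```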
Step 6: Agentic Evaluation and Reward Distillation
Hook: A panel of judges checks different things; then a head judge announces the final medal; later, a slim "judge memory" helps coach practice.
The Concept (Agent-as-a-Judge → Judge-as-a-Reward): Evaluate like humans, then compress that wisdom into a fast reward model. How it works:
- Sub-judges assess each dimension; the senior reviewer assigns S/A/B using defect-then-excellence logic.
- Distill to a scalar reward model trained listwise (S > A > B) for fine-grained signals.
- Use prefix sharing in batching to speed training.
- Feed the rewards back into GRPO for both tags and explanations.
Why it matters: Without process-oriented judging, feedback is noisy; with it, learning aligns with human taste efficiently.
Anchor: Human agreement improves, and listwise rewards beat pointwise on both tag HR@30 and explanation quality. A minimal listwise-distillation sketch follows.
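A minimal PyTorch sketch of listwise distillation: a scalar scoring head is trained so that, within each judged group, Superior outranks Average, which outranks Bad. A Plackett-Luce-style listwise loss over the grade ordering is shown as one reasonable choice; the real reward model scores text through the judge's backbone, whereas this toy uses random feature vectors:

```python
import torch
import torch.nn as nn

class RewardScorer(nn.Module):
    """Scalar reward head distilled from the agentic judge's graded outputs."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.head(feats).squeeze(-1)            # one scalar score per candidate

def listwise_loss(scores: torch.Tensor) -> torch.Tensor:
    """Plackett-Luce negative log-likelihood; `scores` must be ordered best-first (S, A, B)."""
    loss = scores.new_zeros(())
    for i in range(len(scores) - 1):
        loss = loss - torch.log_softmax(scores[i:], dim=0)[0]   # best remaining item should win
    return loss

torch.manual_seed(0)
model = RewardScorer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

group_feats = torch.randn(3, 128)   # features of a Superior, an Average, and a Bad explanation (toy)
for _ in range(100):
    optimizer.zero_grad()
    loss = listwise_loss(model(group_feats))   # push score(S) > score(A) > score(B)
    loss.backward()
    optimizer.step()

print(model(group_feats).tolist())             # scores should now come out in decreasing order
```

The resulting scorer returns a dense scalar for any candidate, which is exactly the shape of feedback GRPO needs as a reward.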
Secret sauce:
- Divide-and-conquer reasoning (HMAS) prevents duplication.
- Atomized compression keeps meaning while slashing tokens.
- CRS guards against reward conflicts, stabilizing RL.
- Agentic judging plus listwise reward distillation creates a self-improving flywheel.
04 Experiments & Results
The tests:
- Online A/B on Taobao's "Guess What You Like" (1% traffic per group for two weeks) in item and feed scenarios.
- Offline item-tag prediction (HR@30) comparing Base, SFT, and GRPO with sum vs. constrained rewards.
- Explanation diversity and human-rated quality across seven dimensions.
- Judge alignment: LLM-as-a-Judge (V1) vs. Agent-as-a-Judge (V2).
The competition: RecGPT-V1 (baseline) vs. RecGPT-V2 variants (Base, SFT, GRPO-SUM, GRPO-CRS). For evaluation, a one-shot judge vs. a multi-agent judge, and pointwise vs. listwise reward models.
Scoreboard with context:
- Online engagement (Item scenario): IPV +3.64% (more item views), CTR +3.01% (clicks per impression), TV +2.11% (money spent), GMV +3.39%, ATC +3.47%. That's like a class consistently moving from B+ to A- across all subjects.
- Online engagement (Feed scenario): CTR +1.50%, GMV +1.53%, with other metrics slightly up. In a noisy mixed feed, stable gains are hard; this is a solid win.
- Novelty Exposure Rate (NER): +11.46% (item) and +4.49% (feed). That's like discovering many more "new to you" items without tanking clicks.
- Offline tag prediction (HR@30): V1 at 26.29% → V2 Base 23.08% (it still needs domain alignment) → V2 SFT 29.20% → V2 GRPO-CRS 32.60%. CRS turns the dial past both V1 and SFT, showing that constraints unlock RL's benefits.
- Explanation generation: Diversity +7.30%, quality acceptance +13.04% vs. V1; users find the texts fresher and more helpful.
- Agentic evaluation: Human-judge agreement improves in accuracy and F1 across model sizes, especially for explanation quality. Multi-step judging better mirrors human reasoning.
- Compute and serving: ~60% GPU savings overall, MFU up ~53%, prefill QPS up ~69×, decode TPS up ~7×. That's like upgrading from a bike to an e-bike with smooth gears.
Surprising findings:
- More novelty didn't hurt short-term metrics; it slightly improved them. With better intent coverage, exploration pays off.
- Sum-based rewards sometimes worsened accuracy late in training as diversity "won" the gradient battle. CRS fixed this by gating.
- Seasonal timeliness showed up clearly: V2 caught Halloween and winter trends earlier, reflected in pre-holiday product view rates.
- A stronger judge didn't just grade better; once distilled, it trained better models, evidence that good evaluation fuels good learning.
Takeaway: Coordinated reasoning + compression + constrained RL + agentic judging produce consistent online lifts and major cost cuts, with explanations people prefer.
05 Discussion & Limitations
Limitations:
- Compression fidelity: If embeddings or the adaptor miss rare nuances (e.g., niche product attributes), atoms could drop meaning; full-text fallbacks may be needed for edge cases.
- Threshold tuning: CRS requires sensible gates (alignment/diversity/length). Poor thresholds can block learning or let conflicts slip through.
- Judge drift: Agentic judges trained on current standards may drift as trends change; periodic recalibration and human spot checks remain necessary.
- Compute during training: While inference is cheaper, multi-agent training, reward modeling, and RL still require significant GPU hours.
- External LLM reliance: Some bootstrapping (e.g., QA generation, category alignment) depends on strong external models; quality varies by availability.
- Domain transfer: V2 is proven on Taobao; other domains (music, news) may need tailored personas, rewards, and evaluation dimensions.
Required resources:
- Modern GPUs with FP8-friendly kernels and a serving stack that supports prefill-decode disaggregation.
- Strong embedding models and a lightweight adaptor training loop.
- Data plumbing for environmental signals (weather, trends, seasons) and safe logging.
- A small but high-quality human annotation budget to seed the judge and validate standards.
When not to use:
- Tiny catalogs with very low variance (simplicity wins).
- Extreme real-time constraints with no budget for LLM latency (use classic retrieval/ranking only).
- Highly sensitive privacy settings where rich histories or contexts can't be used (prefer on-device or differentially private methods).
Open questions:
- Can planner-expert-arbiter and RL be trained end-to-end, with personas emerging automatically?
- How to auto-tune CRS thresholds online to adapt to shifting objectives and seasons?
- Can the judge detect and resist adversarial prompts or spammy item metadata?
- What's the causal, long-term effect of higher novelty on retention and satisfaction across cohorts?
- How to make explanations both attractive and scrupulously factual at scale in multi-modal (image/video) settings?
- Can multilingual, cross-market versions keep the same gains with localized personas and trends?
06 Conclusion & Future Work
3-sentence summary: RecGPT-V2 reorganizes LLM-powered recommendation into a coordinated team (planner, experts, arbiter) that reasons over compact "atomic" histories, writes adaptive explanations, and learns under a fair, multi-step judge. Constrained reinforcement learning grows accuracy while preserving diversity and style, and a distilled reward model turns careful judgments into fast training signals. Engineering upgrades make the whole system practical at scale, delivering higher clicks, purchases, and novelty with about 60% less GPU usage.
Main achievement: Showing that agentic intent reasoning, paired with hybrid compression, meta-prompting, constrained RL, and Agent-as-a-Judge evaluation, can both improve online metrics and slash compute on a large industrial platform.
Future directions: Jointly train planner-experts-arbiter end-to-end; auto-tune CRS gates online; extend agentic judging to multimodal content; localize personas across languages and markets; strengthen robustness to noisy or adversarial inputs.
Why remember this: It's a blueprint for making recommenders think more like people: coordinated, concise, and fair, so they can serve fresher, clearer, and timelier choices while being efficient enough for real-world scale.
Practical Applications
- E-commerce homepages that react instantly to holidays and weather with apt, explained picks.
- Streaming platforms that balance fan favorites with fresh discoveries without lowering engagement.
- News or content feeds that adapt tone and recommendations to events while avoiding repetition.
- Grocery apps that suggest meal kits based on season, diet, and recent carts, with helpful cooking tips.
- Fitness platforms that match gear and plans to user goals and local climate, with motivating blurbs.
- Education portals that recommend courses and study materials with context-aware encouragements.
- Travel sites that propose itineraries and gear tied to local events and weather, with clear justifications.
- Marketplace search that surfaces new, relevant sellers while keeping results specific and safe.
- Advertising systems that improve targeting and rationale while controlling diversity and safety constraints.
- Retail apps that auto-allocate exposure between exploration and conversion based on business goals.