
P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

Intermediate
Pinyi Zhang, Ting-En Lin, Yuchuan Wu et al. · 2/12/2026
arXiv

Key Summary

  • This paper introduces P-GenRM, a personalized generative reward model that judges AI answers using a custom scorecard built just for each user and situation.
  • It turns messy preference clues (like chat history and optional stated tastes) into clear evaluation chains: a short persona plus a weighted scoring rubric.
  • At test time, it scales up in two ways: sampling multiple personal scorecards for the same user and borrowing signals from similar users (prototypes) to reduce noise.
  • The model is trained in three stages: supervised fine-tuning to learn the evaluation-chain format, reinforcement learning to improve reasoning quality, and curriculum learning to handle harder cases.
  • Users are clustered offline into prototypes with embeddings and K-means, then refined with a history-aware attention mechanism so prototypes become strong priors.
  • Across personalized benchmarks, P-GenRM sets a new state of the art, improving average accuracy by about 2.31% over prior best models.
  • Test-time user-based scaling adds another ~3% boost, showing strong personalization gains with only modest extra compute.
  • An 8B P-GenRM even beats prior 70B models on average, and it generalizes well to new users with sparse histories (cold-start).
  • Results show that simply adding more similar-user ratings doesn’t always help; quality matters more than quantity because preferences are highly individual.
  • The approach is interpretable (personas and rubrics are visible), fairer across groups (high macro accuracy), and practically fast enough for real-world use.

Why This Research Matters

Personalized AI that explains its reasoning is safer and more useful in real life. P-GenRM builds a clear persona and rubric for each situation, so people can understand why a response wins. It remains accurate even when the user is new, by borrowing just enough wisdom from similar users without drowning out individuality. This balance helps assistants write better emails, tutors teach more effectively, and support agents match customer tone. The method is also fairer, with strong macro accuracy across different user groups, not just the majority. Overall, it shows how to make personalization practical: structure the problem, then scale smartly at use time.

Detailed Explanation


01Background & Problem Definition

You know how a good teacher adapts to each student—some need quick summaries, others want deep explanations? Early large language models (LLMs) tried to be helpful to everyone in the same way, like a teacher giving the same lesson to every student, every day. That worked okay for general politeness and safety, but it felt generic when you wanted the model to match your own style and needs.

Let’s lay a few building blocks using the Sandwich pattern for clarity:

Top Bread (Hook): Imagine your music app that learns you love upbeat songs in the morning but calm tracks at night. Filling (Concept – Generative Models): A generative model is a program that creates things like text by predicting what comes next, step by step. It works by learning patterns from lots of examples and then sampling likely outputs. It matters because modern assistants write, reason, and chat by generating words on the fly. Bottom Bread (Anchor): Chatbots like those answering your homework or emails are powered by generative models.

Top Bread (Hook): Think of a friend who likes short messages while driving but longer ones when chilling. Filling (Concept – User Preferences in AI): User preferences are the personal likes and dislikes that guide what kind of answer someone wants. AI can spot them from explicit clues (like “be concise”) and implicit clues (like past choices). This matters because without preferences, one-size-fits-all answers feel off. Bottom Bread (Anchor): A cooking fan might want recipes with precise measurements, while another wants high-level tips.

Top Bread (Hook): Picture a judge holding up scorecards at a skating contest. Filling (Concept – Reward Model): A reward model scores AI answers so the system learns which ones are better. It works by comparing two responses and picking the one that matches the goal. It matters because the reward model steers training and decisions; bad rewards lead to bad behavior. Bottom Bread (Anchor): If you ask for a quick summary, the reward model should favor the shorter, clearer answer.

Top Bread (Hook): Imagine we teach a puppy with treats for good behavior. Filling (Concept – RLHF): Reinforcement Learning from Human Feedback teaches models using reward signals from people. The model tries, gets a score, and updates to do better next time. It matters because it aligns the model with human values rather than just copying text. Bottom Bread (Anchor): If the AI gives a safe, helpful answer, the reward goes up; if it’s unhelpful, the reward goes down.

The world before: Alignment mainly targeted broad goals like helpfulness and harmlessness. That’s great for safety, but not great for taste—some people like bullet points; others prefer stories. In open-ended chats, judging quality depends on the person and situation (are you in a hurry? is the topic sensitive?).

The problem: Two big roadblocks kept popping up. First, static rules: many systems squished the rich variety of user preferences into a tiny, fixed checklist, missing scenario shifts (like concise while commuting, expressive at home). Second, cold-start: with new users who have only a little history, models struggled to guess their tastes.

Failed attempts: Researchers tried prompts stuffed with demographics, hand-written values, or persona summaries. These helped a bit but stayed rigid or overfit to a few axes. Others trained personalized reward models that still leaned on fixed dimensions, or needed lots of labeled data for each user.

The gap: We needed a system that (1) turns messy clues (explicit and implicit) into clear, situation-aware scorecards, (2) keeps flexibility across scenarios, and (3) generalizes to new users with little data by borrowing wisdom from similar people, without drowning in noise.

Real stakes: In daily life, this means a chatbot that actually matches your voice when drafting emails, a tutor that knows how you learn, and a helper that adjusts tone on sensitive topics. In the workplace, it can reflect team norms automatically; in healthcare, it supports clearer, kinder, safer communication. Personalization isn’t a luxury—it’s how AI becomes truly useful and respectful.

This paper’s answer is P-GenRM, a Personalized Generative Reward Model that builds a transparent evaluation chain (persona + weighted rubric) for each situation, then smartly scales at test time by combining multiple personal scorecards and a small dose of insight from similar users (prototypes). That combo tackles both noise and cold-start, while keeping the system explainable.

02Core Idea

The Aha! Moment in one sentence: Turn scattered preference signals into a clear, scenario-specific evaluation chain (persona + weighted rubric), and at test time, scale judgments by sampling multiple personal rubrics and blending in a few from similar users to reduce noise and handle cold-start.

Multiple analogies (three ways):

  • Music judge: Instead of using one generic rulebook, the judge creates a custom scorecard for your concert today (more weight on rhythm if you love dance tracks). Then, they peek at a few judges with similar taste to double-check scoring consistency.
  • Restaurant critic: Before tasting, the critic writes a checklist tailored to you (spice level, plating, portion). They rate each dish against the checklist and also glance at notes from critics with similar palates.
  • School rubric: A teacher builds a project rubric matched to how you learn (clarity vs. creativity weighting). They grade with that rubric and also consider how similar students scored to stabilize marks.

Let’s introduce the main concepts with Sandwich explanations:

Top Bread (Hook): You know how a good coach first explains the game plan before the match? Filling (Concept – Evaluation Chains): An evaluation chain is a step-by-step judging script that starts with a short persona and moves to a weighted scoring rubric. It works by first writing what this user likely values now, then assigning weights, then scoring each response against those weights. It matters because structure beats guesswork; without it, scoring is vague and inconsistent. Bottom Bread (Anchor): For a request about workplace email tone, the chain might weight professionalism 40%, clarity 30%, empathy 20%, and brevity 10%.
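To make the weighted rubric concrete, here is a tiny Python sketch. The criteria and weights mirror the email example above, but the per-criterion scores are made-up illustrations, not values from the paper:

```python
# Hypothetical sketch: score two candidate responses against a weighted rubric.
def rubric_score(criterion_scores, weights):
    """Weighted average of per-criterion scores (each rated 0-10)."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(weights[c] * criterion_scores[c] for c in weights)

# Rubric for a workplace-email request, as in the anchor example above.
weights = {"professionalism": 0.4, "clarity": 0.3, "empathy": 0.2, "brevity": 0.1}

# Toy per-criterion ratings for two candidate replies.
response_a = {"professionalism": 9, "clarity": 7, "empathy": 6, "brevity": 8}
response_b = {"professionalism": 6, "clarity": 9, "empathy": 8, "brevity": 9}

score_a = rubric_score(response_a, weights)  # 0.4*9 + 0.3*7 + 0.2*6 + 0.1*8 = 7.7
score_b = rubric_score(response_b, weights)  # 0.4*6 + 0.3*9 + 0.2*8 + 0.1*9 = 7.6
winner = "A" if score_a > score_b else "B"   # A wins: professionalism weighs most
```

Notice how the same two responses could swap ranks under a different user's weights; that is exactly why the rubric must be built per user and per scenario.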

Top Bread (Hook): Think of a tailor measuring you before sewing. Filling (Concept – P-GenRM): P-GenRM is a generative reward model that crafts the evaluation chain for each user and situation, then uses it to score answers. It works by reading your history plus any stated preferences, writing a brief persona and rubric, and producing scores with reasoning. It matters because it adapts per user and per context while staying interpretable. Bottom Bread (Anchor): For a coding question, it might emphasize correctness and examples; for a poem, it might emphasize imagery and flow.

Top Bread (Hook): Imagine checking a photo with different filters to see which looks best. Filling (Concept – Test-time User-based Scaling): This is adjusting the judging process during use-time by creating multiple personal rubrics (individual level) and also importing a few rubric views from similar users (prototype level). It works by averaging these views to reduce randomness and fill gaps when history is sparse. It matters because it boosts accuracy without retraining the model. Bottom Bread (Anchor): For a new user with little history, the system samples several plausible personal rubrics and blends in a few from a similar-user group to steady the score.

Top Bread (Hook): Think of a clubhouse where people who like similar games hang out. Filling (Concept – User Prototypes): User prototypes are clusters that represent typical preference patterns across users. They’re built from embeddings and refined with attention to highlight what matters for current queries. They matter because they let the system transfer knowledge to new users. Bottom Bread (Anchor): If you like concise, factual answers with light empathy, you might be closest to the “clear-and-calm” prototype.

Before vs. After:

  • Before: Personalized reward models often forced rich tastes into a small fixed grid, and stumbled with new users.
  • After: P-GenRM composes a new, scenario-aware rubric every time, then scales at test time using both personal samples and prototype neighbors. The result is more accurate, fairer, and robust to cold-start.

Why it works (intuition, no math):

  • Structure improves signal: turning hints into a persona + weighted rubric clarifies what to look for and why.
  • Multiple views cancel noise: sampling multiple personal rubrics and adding a few from similar users averages out random errors.
  • Prior knowledge helps: prototypes encode common patterns, so new users can borrow them.

Building blocks:

  • Persona-guided Scoring Induction (SFT) to learn the format and habit of building chains.
  • Criteria-based Reasoning Enhancement (RL) to improve chain quality, especially when explicit preferences are missing.
  • Hard-negative-aware Curriculum Learning to toughen the model against tricky, subjective cases.
  • Prototype initialization and refinement to create useful, history-aware priors.
  • Dual-granularity scaling at test time to combine personal diversity and group wisdom.

03Methodology

High-level recipe: Input (query + user history + optional stated preferences) → Preference Modeling (persona + criteria) → Scoring Rubric (weights) → Score each candidate response → Aggregate with test-time scaling (personal samples + prototype neighbors) → Final decision.
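The recipe above can be sketched as a short pipeline. Every function here is a hypothetical stand-in: in the real system, building the chain and scoring each response are LLM generations, not the toy heuristics below:

```python
# Minimal control-flow sketch of the P-GenRM recipe (stubs, not the real model).
def build_chain(query, history, stated_prefs):
    # Stub: a real system would generate a persona and weighted rubric with an LLM.
    persona = "prefers concise, factual answers"
    rubric = {"conciseness": 0.6, "factuality": 0.4}
    return persona, rubric

def score(response, rubric):
    # Stub: rate each criterion with toy heuristics, then take the weighted sum.
    feats = {"conciseness": 10 - min(len(response.split()), 10),
             "factuality": 5.0}
    return sum(w * feats[c] for c, w in rubric.items())

def pick_best(query, history, prefs, candidates):
    persona, rubric = build_chain(query, history, prefs)
    scores = [score(c, rubric) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

best = pick_best("Summarize AI news", history=[], prefs=None, candidates=[
    "Here is a very long exhaustive rundown of every story today in detail",
    "Top story: new reward model improves personalization.",
])  # the concise candidate wins under this user's rubric
```

The aggregation step (test-time scaling) wraps this core loop; it is sketched separately in the scaling section below.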

Step-by-step with Sandwich explanations for each new concept:

  1. Persona-guided Scoring Induction (PSI, supervised fine-tuning)
  • Top Bread (Hook): You know how coaches run drills before a big game?
  • Filling (Concept): PSI teaches the model to turn preference clues into a structured evaluation chain (persona + weighted rubric) and then score answers. How it works:
    1. Gather implicit signals (user’s chosen vs. rejected past answers) and explicit signals (stated style, if any).
    2. Use an instruction model to produce a “gold” chain: scenario-specific persona, criteria with weights, and scored examples.
    3. Supervised fine-tune P-GenRM to imitate these high-quality chains. Why it matters: Without PSI, the model wouldn’t reliably produce clean, interpretable scorecards.
  • Bottom Bread (Anchor): For a travel-planning user, PSI yields a persona like “likes actionable, budget-aware tips” and a rubric weighting utility, clarity, and safety.
  2. Criteria-based Reasoning Enhancement (CRE, reinforcement learning)
  • Top Bread (Hook): Imagine practicing with a referee who also checks your thinking, not just your final answer.
  • Filling (Concept): CRE improves how well the model’s chain covers the user’s real preferences when explicit notes are missing. How it works:
    1. The model first infers likely explicit criteria from history.
    2. It generates an evaluation chain and scores candidates.
    3. A judge model gives a process reward if the chain covers the criteria well, and an outcome reward if it picks the truly preferred answer. Why it matters: Without CRE, the model might guess a decent final score but with weak reasoning that misses the user’s criteria.
  • Bottom Bread (Anchor): If a user values nuance on ethics questions, CRE rewards chains that explicitly weight “factuality and nuance” higher.

Let’s define a few helper ideas used in CRE:

  • Top Bread (Hook): Think of a game where you earn points for good moves and for winning the match.

  • Filling (Concept – Process Reward vs. Outcome Reward): Process reward scores how well your reasoning (the chain) reflects the user’s criteria; outcome reward checks if you chose the correct preferred response. Both matter because great results come from good reasoning plus correct choices.

  • Bottom Bread (Anchor): A chain that names and weights the user’s top values gets process points; picking the same winner as the ground truth gets outcome points.

  • Top Bread (Hook): Consider a smart practice routine that stops you from overfitting one trick.

  • Filling (Concept – GRPO): GRPO is a stable RL method that nudges the model toward higher-reward chains while keeping it close to a safe reference. It works by sampling chains, scoring them, and updating probabilities, with a regularizer to prevent wild swings. It matters because we want better chains without breaking the model.

  • Bottom Bread (Anchor): The model tries several chain wordings; GRPO upweights the better ones and softly downweights the worse.
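To tie these helper ideas together, here is a hedged sketch of CRE's two rewards feeding a GRPO-style group advantage. The coverage heuristic, the 50/50 reward mix, and all values are assumptions; a real process reward comes from an LLM judge, and the full GRPO update adds a clipped policy ratio and a KL term toward the reference model, both omitted here:

```python
from statistics import mean, pstdev

def process_reward(chain_criteria, true_criteria):
    """Fraction of the user's true criteria that the generated chain covers."""
    return len(set(chain_criteria) & set(true_criteria)) / len(true_criteria)

def outcome_reward(picked, preferred):
    """1 if the chain selected the ground-truth preferred response."""
    return 1.0 if picked == preferred else 0.0

def total_reward(chain_criteria, true_criteria, picked, preferred, alpha=0.5):
    # alpha balances reasoning quality vs. final-answer correctness (assumed 0.5).
    return (alpha * process_reward(chain_criteria, true_criteria)
            + (1 - alpha) * outcome_reward(picked, preferred))

def group_advantages(rewards):
    """GRPO-style normalization: each reward relative to its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [0.0] * len(rewards) if sigma == 0 else [(r - mu) / sigma for r in rewards]

# Four sampled chains for one training example (toy values):
true_criteria = ["factuality", "nuance"]
chains = [(["factuality", "nuance"], "A"),   # good reasoning, right pick
          (["brevity"], "A"),                # weak reasoning, right pick
          (["factuality"], "B"),             # partial reasoning, wrong pick
          (["humor"], "B")]                  # off-target entirely
rewards = [total_reward(c, true_criteria, p, preferred="A") for c, p in chains]
adv = group_advantages(rewards)  # best chain gets the largest positive advantage
```

The point of the normalization is that "good" is defined relative to the group the model just sampled, which keeps the update signal stable without a separate value network.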

  3. Hard-negative-aware Curriculum Learning (RL)
  • Top Bread (Hook): Video games start easy and get harder.
  • Filling (Concept – Curriculum Learning): We train on easier cases first, then mix in more hard negatives—pairs where the difference is subtle or misleading. It matters because without this, the model can freeze on tough, subjective examples.
  • Bottom Bread (Anchor): Disabling process reward here opens more exploration so the model can focus on telling apart tricky look-alike answers.
  4. Prototype modeling (offline): embeddings, K-means, and refinement
  • Top Bread (Hook): Think of turning each user’s taste into a point on a map.

  • Filling (Concept – Embeddings): An embedding is a numerical vector that represents meaning so similar things are close together. We embed each user’s preference description. It matters because math can then group similar users.

  • Bottom Bread (Anchor): Users who like concise, factual answers end up near each other in the embedding space.

  • Top Bread (Hook): Sorting students into study groups based on similar needs.

  • Filling (Concept – K-means Clustering): K-means groups users so each cluster has a center (prototype). It works by repeatedly assigning users to the nearest center, then updating the centers. It matters because prototypes summarize common patterns for transfer.

  • Bottom Bread (Anchor): One prototype might represent “concise-and-professional” preferences; another, “creative-and-empathetic.”

  • Top Bread (Hook): A librarian not only shelves books by category but also highlights the passages most useful to your current question.

  • Filling (Concept – History-aware Attention for Prototype Refinement): The prototype looks at a user’s history and pays extra attention to records most relevant to the current query and the prototype itself. It matters because we need prototypes to be helpful priors for the here-and-now, not just averages.

  • Bottom Bread (Anchor): For a coding query, the prototype highlights past interactions where the user chose answers with working examples.
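Here is a toy end-to-end sketch of this prototype step: cluster tiny 2-D "preference embeddings" with K-means, then refine a prototype with history-aware attention that upweights records relevant to the current query. The vectors, the deterministic init, the dot-product similarity, and the simple 50/50 blend are all illustrative assumptions:

```python
import math

def kmeans(points, k, iters=20):
    centers = list(points[:k])  # simple deterministic init, fine for this toy data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each user to the nearest center (squared Euclidean distance).
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Move each center to the mean of its cluster (keep it if cluster is empty).
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers

def refine(prototype, history, query):
    """Softmax-attend over history records by similarity to the current query."""
    scores = [sum(q * h for q, h in zip(query, rec)) for rec in history]
    exps = [math.exp(s) for s in scores]
    w = [e / sum(exps) for e in exps]
    summary = [sum(wi * rec[d] for wi, rec in zip(w, history))
               for d in range(len(prototype))]
    # Blend the raw prototype with the attention-weighted history summary.
    return [(p + s) / 2 for p, s in zip(prototype, summary)]

# Two obvious taste groups: "concise" users near (0,0), "expressive" near (10,10).
users = [(0.1, 0.2), (0.0, 0.4), (0.3, 0.1), (9.8, 10.1), (10.2, 9.9), (9.9, 10.0)]
protos = sorted(kmeans(users, k=2))  # protos[0] ~ concise, protos[1] ~ expressive

# A coding-flavored query pulls the prototype toward past coding interactions.
refined = refine(protos[0], history=[(0.9, 0.1), (0.1, 0.9)], query=(1.0, 0.0))
```

Real systems would embed textual preference summaries with an embedding model and learn the attention, but the shape of the computation is the same: cluster, then re-weight by relevance to the here-and-now.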

  5. Test-time dual-granularity scaling
  • Top Bread (Hook): Imagine trying on a few outfit variations for yourself and also asking a couple of similar-styled friends for quick opinions before picking.
  • Filling (Concept – Dual-Granularity Scaling): At the individual level, P-GenRM samples multiple plausible personal evaluation chains (m samples). At the prototype level, it gathers a few scores informed by similar users (n neighbors). It then averages these to form the final score. It matters because this reduces randomness and helps new users with little history.
  • Bottom Bread (Anchor): The best results came from something like 16 personal samples + 8 prototype neighbors in experiments—strong accuracy with modest extra compute.
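The aggregation itself is simple to sketch. Averaging the two levels with equal weight is an assumption for illustration; the paper's exact combination may weight individual and prototype views differently:

```python
# Sketch of dual-granularity aggregation: pool m individual-level chain scores
# with n prototype-level scores and average (equal weighting is an assumption).
def dual_granularity_score(individual_scores, prototype_scores):
    pooled = individual_scores + prototype_scores
    return sum(pooled) / len(pooled)

# m = 4 personal rubric samples, n = 2 prototype-neighbor scores (toy values).
final = dual_granularity_score([7.5, 8.0, 7.0, 7.5], [8.0, 7.0])  # -> 7.5
```

Because each sampled rubric is noisy, the mean over many views is steadier than any single one, and the prototype scores act as a prior when the personal samples disagree.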

The secret sauce: combining explicit structure (evaluation chains), improved reasoning (CRE + curriculum), and smart test-time scaling (personal diversity + prototype priors). This trio makes personalization robust, interpretable, and efficient enough for practice.

04Experiments & Results

The test: The authors measured how accurately P-GenRM could pick the better answer for a given user, and how well it generalized to new users with little history. They also checked fairness (macro accuracy across groups), speed, and sensitivity to scaling choices.

Datasets and baselines:

  • Chatbot Arena–Personalized and PRISM–Personalized (combined as PersonalRewardBench): realistic open-ended prompts with user-specific preferences and pairwise choices.
  • LaMP-QA: long-form, personalized Q&A with sparse histories, used to test cold-start generalization.
  • Baselines: In-context LLM-as-a-judge with various personalization prompts (demographics, self-descriptions, persona prompts), standard Bradley–Terry reward models, and specialized personalized reward models (GPO, VPL, PAL, SynthesizeMe), plus strong proprietary systems.

Scoreboard with context:

  • P-GenRM sets a new SOTA across model sizes. On 8B backbones, it beats the previous best by about 2.77% on average; on 70B with LoRA, by about 1.99%. Even more striking, P-GenRM-8B surpasses earlier 70B models by about 1.04% on average—like a well-coached varsity team outplaying last year’s all-stars.
  • Test-time user-based scaling adds ~3% more accuracy on top of P-GenRM, with only modest extra compute. For example, an “Ind-16, Pro-8” setting outperforms a heavier “Ind-32” setting while doing fewer total scaling steps (16 + 8 < 32), showing that mixing in a few good prototype neighbors can be better than just sampling more personal rubrics.
  • Against a strong proprietary judge prompted with the authors’ Persona-guided Scoring Induction (PSI), P-GenRM still wins, demonstrating that a trained generative reward model can outperform carefully prompted judging.

Cold-start generalization (LaMP-QA):

  • With only sparse histories, P-GenRM-8B + (Ind-8, Pro-4) achieves the highest agreement (Spearman correlation) with ground-truth rankings made by advanced LLM judges, even beating much larger models (e.g., Qwen3-235B-A22B). This means the method doesn’t just memorize users—it meaningfully transfers from prototypes and keeps making good calls with little data.

Fairness and breadth:

  • Macro accuracy (averaging accuracy per persona group) remains highest for P-GenRM, suggesting it does not simply overfit to majority tastes. The model also surfaces a richer set of scoring dimensions than fixed checklists (e.g., “Philosophical Engagement,” “Nuance”) when they matter for specific users.
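Macro accuracy can be sketched in a few lines. The persona groups and outcomes below are made up, but they show why macro differs from plain (micro) accuracy when groups are imbalanced, and why a high macro score signals fairness:

```python
# Micro accuracy pools all users; macro averages per-group accuracy so small
# persona groups count as much as large ones (toy data, illustrative only).
def micro_accuracy(results):
    flat = [r for group in results.values() for r in group]
    return sum(flat) / len(flat)

def macro_accuracy(results):
    per_group = [sum(g) / len(g) for g in results.values()]
    return sum(per_group) / len(per_group)

results = {"majority_persona": [1, 1, 1, 1, 1, 1, 1, 0],  # 87.5% on 8 users
           "minority_persona": [1, 0]}                     # 50% on 2 users
micro = micro_accuracy(results)  # 8/10 = 0.80
macro = macro_accuracy(results)  # (0.875 + 0.5) / 2 = 0.6875
```

A model can look strong on micro accuracy while quietly failing minority-taste users; macro accuracy exposes that gap, which is why the authors report it.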

Speed:

  • End-to-end wall-clock shows only a moderate increase with scaling (e.g., from around 14 minutes to 23 minutes on the full test), while still outperforming larger baselines that take longer. Much of the latency is in prompt encoding, which can be shared across samples, making scaling more affordable than it looks.

Surprising findings:

  • More similar-user ratings aren’t always better—beyond a point, adding neighbors can inject noise because personalization is truly individual. Carefully chosen prototype counts (about 50 here) and modest neighbor numbers provided the best balance.
  • Dynamic personas (PSI) consistently beat static persona prompts (like SynthesizeMe), confirming that situation-aware adaptation is key.

Ablations:

  • Removing curriculum learning or either reward (process or outcome) hurts performance, proving each piece contributes meaningfully. Eliminating the entire RL stage causes a large drop, and training without SFT reduces performance to near baseline prompting.

Bottom line: The model is both accurate and practical. It beats prior SOTA, generalizes to new users, remains fair across groups, and stays efficient enough for deployment.

05Discussion & Limitations

Limitations (be specific):

  • Chain generation cost: Producing a full evaluation chain (persona + rubric + scored reasoning) takes longer than spitting out a single scalar. In high-throughput scenarios, a pure scalar model might be faster, though less interpretable.
  • History requirement: The best stability comes from having about three preference examples per user. In true zero-shot settings, the system leans more on prototypes and sampling, which helps but may not fully match the performance with richer histories.
  • Judge-dependence: The process reward uses an LLM-as-a-judge. While helpful, this can bake in judge biases; mixing judges or calibrating them can reduce this risk.
  • Prototype tuning: Choosing the number of prototypes is a trade-off—too few lose nuance; too many add noise. The sweet spot (about 50 here) may vary by domain.

Required resources:

  • A capable base LLM to act as the generative reward model (8B-class or larger for best results), GPUs for SFT/RL training, and an embedding model for prototypes. Inference-time scaling benefits from serving infrastructure (parallel sampling, batching).

When not to use:

  • Extremely low-latency pipelines that cannot afford chain reasoning.
  • Domains where preferences are fully standardized (e.g., strict factual QA with no stylistic nuance) and a fixed, simple reward suffices.
  • Settings with severe privacy constraints where storing any history is disallowed (unless strong on-device or privacy-preserving alternatives are used).

Open questions:

  • Privacy and consent: How to ensure ethical, explainable use of histories and prototypes, including opt-outs and on-device processing?
  • Adaptive prototype counts: Can the system learn how many prototypes to use per domain or even per user segment on the fly?
  • Judge robustness: What’s the best way to ensemble or calibrate process judges to reduce bias and drift over time?
  • Faster chains: Can we distill chain-based scoring into lighter models that preserve interpretability cues and accuracy?
  • Beyond text: How well does this approach transfer to multimodal personalization (images, audio) or to task-level planning?

Overall, P-GenRM offers a strong, interpretable path to personalization. Its main trade-off is a bit more inference work in exchange for clearer reasoning and better alignment with individual users—often a trade worth making.

06Conclusion & Future Work

Three-sentence summary: P-GenRM turns messy preference clues into a clear, scenario-specific evaluation chain (persona + weighted rubric) and uses test-time user-based scaling to combine multiple personal rubrics with a few from similar users. This design reduces noise, handles cold-starts, and stays interpretable, achieving new state-of-the-art results on personalized reward benchmarks with modest extra compute. It also generalizes well to new users and remains fair across groups.

Main achievement: Showing that a generative, chain-based reward model—augmented by dual-granularity test-time scaling—can beat larger prior systems while being more transparent about why it judges answers the way it does.

Future directions: Distill the chain-based method into faster reward models, explore privacy-preserving on-device prototypes, extend to multimodal personalization, and develop judge ensembles that minimize bias. The system could also learn prototype counts and neighbor selection dynamically per query.

Why remember this: P-GenRM demonstrates a practical recipe for truly personal AI—clear rubrics, smarter scaling, and strong results—highlighting that personalization isn’t just about bigger models; it’s about better structure and smarter use-time adaptation.

Practical Applications

  • Email drafting that adapts tone and length per recipient and scenario (formal vs. friendly).
  • Personalized tutoring that weights clarity vs. creativity based on a learner’s style.
  • Customer support triage that matches the brand voice and the customer’s preference for brevity or detail.
  • Healthcare messaging that balances empathy, clarity, and safety according to patient needs.
  • Recommendation explainers that present concise or in-depth reasons based on user taste.
  • Code-review assistants that emphasize correctness, examples, or readability per developer preference.
  • Team documentation that adapts structure and formality to internal style guides and team norms.
  • Knowledge-base search that scores answers by the user’s preferred format (steps, bullets, longform).
  • Education feedback rubrics that reflect each student’s learning goals and growth areas.
  • Onboarding assistants that quickly infer new users’ styles using prototypes for cold-start.
#personalized reward modeling#generative reward model#evaluation chain#persona-guided scoring#test-time scaling#user prototypes#collaborative filtering#reinforcement learning#process reward#outcome reward#curriculum learning#LLM-as-a-judge#preference embeddings#K-means clustering#cold-start generalization