Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training
Key Summary
- The paper introduces Rubric-ARM, a system that teaches two AI helpers (a rubric maker and a judge) to learn together using reinforcement learning so they can better decide which answers people would prefer.
- Instead of giving one blunt score, Rubric-ARM makes a short, clear checklist (a rubric) and uses it to judge answers on multiple dimensions like rules and quality.
- The training takes turns: first improve the judge while keeping the rubric generator fixed, then improve the rubric generator while keeping the judge fixed, which reduces training noise and stabilizes learning.
- A small but important extra reward makes the judge follow a strict output format, preventing messy or incomplete judging steps.
- Across many benchmarks, Rubric-ARM beats strong baselines with about a 4.7% average gain in reward-modeling tasks and also makes downstream policy training (like DPO and GRPO) work better.
- Theoretical analysis shows why training the judge before the rubric generator lowers gradient variance (less training wobble), leading to more stable progress.
- Rubric-ARM generalizes well to new writing tasks not seen during training and shows lower position bias compared to prior judges.
- It is also efficient, running faster than most reasoning-heavy reward models, despite using two 8B components.
- Using Rubric-ARM as the reward in both offline and online RL improves instruction following and human-preference alignment of policy models.
- Overall, the method makes AI evaluation clearer, fairer, and more useful for improving helpful behavior in non-verifiable tasks like creative writing.
Why This Research Matters
Rubric-ARM makes AI judgments clearer and fairer by turning fuzzy "one-number" scores into transparent, structured checklists. That transparency helps users trust why an answer was picked, which is useful in education, customer support, and safety-critical guidance. Because the rubric and judge learn together, the system adapts to new domains without relying on expensive human-authored rubrics. The method also improves the actual writing models, making them better at following instructions and matching human preferences. By reducing bias (like favoring the first or longer response) and increasing stability, it sets a stronger foundation for aligning AI with how people really evaluate quality. The efficiency gains make it practical enough to deploy at scale.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook) You know how when teachers grade a story, they don't just give one number, but look at things like creativity, clarity, and following directions? A single score can miss a lot.
Filling (Concept: Non-verifiable domains)
- What it is: Non-verifiable domains are tasks where there isn't one exact right answer (like creative writing or open-ended advice), so you can't just check with a simple answer key.
- How it works: 1) People read two answers, 2) They say which one they prefer, 3) A system learns to predict these preferences without a clear ground truth.
- Why it matters: If we pretend there's only one "right" answer, the AI can end up favoring long or fancy texts instead of actually helpful or on-request answers.
Bottom Bread (Anchor) Imagine asking for a cheerful birthday poem. There's no single correct poem, but some poems will match the tone, follow length rules, and feel more personal; those are preferred even if there's no answer key.
Top Bread (Hook) Imagine you're judging two cupcakes: one looks pretty, the other tastes better. If you only give one number, it's hard to explain your choice.
Filling (Concept: Reward modeling)
- What it is: Reward modeling is how an AI learns a scoring rule that matches what people like.
- How it works: 1) Collect pairs of answers where humans chose a favorite, 2) Train a model to predict which answer wins, 3) Use that model's signals to improve the AI that writes answers.
- Why it matters: Without a good reward model, the writing AI learns the wrong lesson, like preferring longer answers or buzzwords.
Bottom Bread (Anchor) If people consistently prefer friendly, on-topic answers that follow instructions, the reward model should teach the writer AI to do exactly that.
Top Bread (Hook) You know how a checklist makes grading fairer? It turns fuzzy ideas like "good explanation" into clear steps.
Filling (Concept: Rubrics)
- What it is: A rubric is a short, structured checklist of criteria (like "follows instructions," "polite tone," "accurate facts").
- How it works: 1) For a given prompt, make a rubric with hard rules (objective musts) and principles (quality guidelines), 2) Judge answers against each item, 3) Make a final decision.
- Why it matters: Without rubrics, judges are inconsistent and less transparent; with rubrics, judgments are clearer and more generalizable.
Bottom Bread (Anchor) For "Explain photosynthesis in two sentences," a good rubric checks: exactly two sentences [Hard Rule], explanation is scientifically correct [Principle], and uses age-appropriate words [Principle].
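To make that structure concrete, here is a minimal Python sketch (not taken from the paper) of how a rubric with separate hard rules and principles could be represented; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Rubric:
    """Illustrative rubric container (names are assumptions, not the paper's schema)."""
    prompt: str
    hard_rules: List[str] = field(default_factory=list)   # objective, checkable musts
    principles: List[str] = field(default_factory=list)   # softer quality guidelines

rubric = Rubric(
    prompt="Explain photosynthesis in two sentences.",
    hard_rules=["Response contains exactly two sentences."],
    principles=["Explanation is scientifically correct.",
                "Uses age-appropriate vocabulary."],
)
print(rubric.hard_rules, rubric.principles)
```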
The world before this paper: Most reward models output a single score or a simple "A vs B" choice. That's fast but misses different angles of quality (e.g., rule-following versus tone). Recent work tried rubrics, but often used frozen, prompted models or trained the rubric-maker and judge separately. This caused three problems: 1) Static rubrics didn't adapt to the domain, 2) The judge couldn't learn better from the rubrics over time, 3) The two parts didn't improve each other.
Where attempts fell short: Prompt-only rubrics sometimes drift, overfit, or include irrelevant items. Separately trained judges can ignore the rubric or misuse it. This leads to noisy training signals and position bias (preferring the first or longer response regardless of quality).
The gap: We needed a way to actually learn rubrics and judging together, so that better rubrics make better judging, which then shapes even better rubrics, without the training becoming unstable.
Real stakes: This affects chat helpers, classroom tutors, creative tools, and safety filters. If the reward is fuzzy, models can be verbose, miss constraints, or favor style over substance. A clearer, rubric-based reward can make AI more helpful, fair, and aligned with how people actually evaluate quality.
02 Core Idea
Top Bread (Hook) Imagine two coaches training a soccer player: one sets the drills (the rubric), the other gives the scores (the judge). If they talk and adjust together after each practice, the player improves faster.
Filling (Concept: Rubric-ARM)
- What it is: Rubric-ARM is a system that jointly learns to create good rubrics and to judge answers using those rubrics through reinforcement learning.
- How it works: 1) Start both models with basic skills (warmup), 2) Take turns improving the judge while keeping the rubric-maker fixed, then improving the rubric-maker while keeping the judge fixed, 3) Use human preference labels as the reward signal for correctness, plus a small format reward so the judge follows the checklist structure.
- Why it matters: If you update both at once, training becomes wobbly (non-stationary). Alternating turns reduces noise and lets each part learn from a stable partner.
Bottom Bread (Anchor) For "Write a friendly, 3-bullet shopping list," Rubric-ARM learns to generate a rubric with hard rules (exactly 3 bullets; must be friendly) and principles (clear, relevant items), and the judge uses it to fairly pick the better of two lists.
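Here is a minimal control-flow sketch of that alternating loop, assuming hypothetical `generate`, `decide`, `train_judge_step`, and `train_generator_step` helpers that wrap the underlying RL updates; it shows only the alternation, not the actual training code.

```python
def alternating_rl(rubric_generator, judge, preference_data,
                   train_judge_step, train_generator_step, num_cycles=3):
    """Sketch of Rubric-ARM's alternation: judge first, then rubric generator.
    All model interfaces and training helpers here are assumed, not the paper's API."""
    for _ in range(num_cycles):
        # Phase A: freeze the rubric generator, improve the judge.
        rubric_cache = {ex["prompt"]: rubric_generator.generate(ex["prompt"])
                        for ex in preference_data}  # one cached rubric per prompt
        for ex in preference_data:
            # Reward inside: +1 if the judge matches the human label, plus a format bonus.
            train_judge_step(judge, rubric_cache[ex["prompt"]], ex)

        # Phase B: freeze the judge, improve the rubric generator.
        for ex in preference_data:
            rubric = rubric_generator.generate(ex["prompt"])
            decision = judge.decide(ex["prompt"], rubric, ex["response_a"], ex["response_b"])
            reward = 1.0 if decision == ex["preferred"] else 0.0  # R_r = I[judge correct]
            train_generator_step(rubric_generator, ex["prompt"], rubric, reward)
    return rubric_generator, judge
```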
Top Bread (Hook) Think of cooking with two dials: heat and seasoning. Twisting both randomly makes a mess. Adjust heat first while seasoning stays still, taste, then adjust seasoning. Better, steadier progress.
Filling (Concept: Alternating optimization)
- What it is: Alternating optimization improves two connected pieces by updating one while holding the other fixed, then switching.
- How it works: 1) Freeze the rubric generator; train the judge to be accurate under those rubrics, 2) Freeze the judge; train the rubric generator to produce rubrics that make the judge more accurate, 3) Repeat.
- Why it matters: Updating both at once injects extra randomness; alternating cuts variance, stabilizing learning and improving final accuracy.
Bottom Bread (Anchor) When learning piano, you might first perfect the rhythm (fixed melody), then focus on melody (fixed rhythm). Alternating focus builds a solid song.
Top Bread (Hook) You know how a good referee follows the rulebook closely and explains calls clearly so the game feels fair?
Filling (Concept: Judge and rubric generator)
- What it is: The judge is an LLM that compares two answers under a rubric; the rubric generator is an LLM that creates the rubric for a prompt.
- How it works: Generator writes a checklist; judge evaluates each answer item-by-item and outputs a final decision. Training feedback says whether the judge's decision matches the labeled human preference.
- Why it matters: A judge without a good rubric can be inconsistent; a rubric without a capable judge won't be used well. Together, they align on what people truly prefer.
Bottom Bread (Anchor) For "Explain the water cycle for 5th graders in under 60 words," the generator creates rules about length and simplicity; the judge checks both answers against them and picks the winner.
Multiple analogies for the key idea:
- Classroom: The teacher (judge) uses a well-designed rubric (from curriculum designer) to grade essays; the rubric is improved when grading shows confusion points.
- Sports: A referee (judge) uses a rulebook (rubric); rule tweaks happen when certain plays are hard to call. Better rules → better calls; better calls → better rule tweaks.
- Map and compass: The rubric is the map (what to look for), the judge is the compass (which direction is right). Updating one while trusting the other avoids getting lost.
Before vs after:
- Before: One-score reward models missed multiple quality aspects; rubric methods often froze models or trained parts separately → limited learning and bias.
- After: Rubric-ARM learns rubrics as latent actions and a rubric-conditioned judge together, producing clearer, fairer, and stronger evaluation signals that better guide downstream training.
Why it works (intuition, not equations):
- Fixing one side removes a moving target, shrinking randomness in gradient signals (lower variance), so the other side learns reliably.
- Judging first creates a steady measuring stick; then the rubric-maker can confidently learn which checklists truly help the judge make correct calls.
Building blocks:
- SFT warmup to give both models basic skills.
- Alternating RL updates with a shared objective: judge correctness.
- A small format reward so the judge always follows the checklist steps.
- Efficient sampling (caching rubrics; greedy judge rollouts when training the generator).
- Use of learned rubrics and judge to train actual policy models (writers) in offline (DPO) and online (GRPO) RL.
03 Methodology
At a high level: Prompt + two candidate answers → Rubric generator makes a checklist → Judge reads prompt, rubric, and both answers → Judge decides which answer wins and explains → We reward correctness and good format → We alternate training the judge and the rubric generator → After training, we use the learned judge as a reward to improve a writing policy.
Stage I: SFT Warmup
Top Bread (Hook) Imagine teaching two teammates: first, let the scorer practice with a fixed playbook; then let the playbook writer improve based on what helped the scorer score.
Filling (Concept: SFT warmup)
- What it is: A starting phase where both models learn basic behaviors from examples by next-token prediction.
- How it works: 1) Collect synthetic rubric and judge traces from open datasets, 2) Fine-tune the rubric generator to write clear rubrics, 3) Fine-tune the judge to follow rubrics and output structured decisions.
- Why it matters: Without warmup, RL is unstable because both models start from scratch and produce noisy rubrics and judgments.
Bottom Bread (Anchor) Like learning to ride a bike with training wheels before racing.
Stage II: Alternating Reinforcement Learning
Step A. Train the judge (rubric fixed):
- What happens: Freeze the rubric generator. For each training example (prompt, two answers, human-preferred label), sample one rubric per prompt and cache it. Train the judge so its decision matches the label. Use a shaped reward R_j = R_acc + R_fmt.
- Why this step exists: If the rubric also changed, the target would keep moving, making judge learning shaky.
- Example: Prompt: "Summarize the article in 2 sentences." Rubric has a hard rule: exactly 2 sentences. The judge checks both answers criterion-by-criterion, explains, then selects the winner. Reward gives +1 when the pick matches the label and extra points if the judge follows the required output format.
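As a rough sketch of Step A's shaped reward (the exact weighting in the paper may differ), the judge gets +1 for matching the human label plus a small format bonus computed separately (see the format-reward sketch below):

```python
def judge_step_reward(predicted_winner: str, labeled_winner: str, r_format: float) -> float:
    """R_j = R_acc + R_fmt: accuracy indicator plus a small format bonus.
    The signature and the size of the bonus are illustrative assumptions."""
    r_acc = 1.0 if predicted_winner == labeled_winner else 0.0
    return r_acc + r_format

print(judge_step_reward("A", "A", 0.2))  # correct pick with format bonus -> 1.2
print(judge_step_reward("B", "A", 0.2))  # wrong pick, format bonus only -> 0.2
```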
Top Bread (Hook) You know how a good science lab report has sections: question, method, results, conclusion? If you skip sections, readers get lost.
Filling (Concept: Format reward)
- What it is: A small extra reward that requires the judge to output a strict structure (gatekeeper criterion, per-criterion analysis, final justification, decision).
- How it works: 1) Check that the judge fills each required section, 2) Penalize missing or jumbled sections, 3) Add this to the correctness reward.
- Why it matters: Without it, the judge may skip steps, forget to address each rubric item, or hide the final decision.
Bottom Bread (Anchor) Just like getting points for including every part of a lab report.
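A minimal sketch of such a format check, assuming section headers like "Gatekeeper:", "Analysis:", "Justification:", and "Decision:" (the real section names and bonus size in the paper may differ):

```python
REQUIRED_SECTIONS = ["Gatekeeper", "Analysis", "Justification", "Decision"]  # assumed names

def format_reward(judge_output: str, bonus: float = 0.2) -> float:
    """Give the full bonus only when every required section appears, in order;
    missing or jumbled sections get nothing."""
    positions = [judge_output.find(f"{name}:") for name in REQUIRED_SECTIONS]
    all_present = all(p >= 0 for p in positions)
    in_order = positions == sorted(positions)
    return bonus if (all_present and in_order) else 0.0

sample = "Gatekeeper: passes. Analysis: A meets both rules. Justification: clearer. Decision: A"
print(format_reward(sample))         # -> 0.2
print(format_reward("Decision: A"))  # -> 0.0 (skipped sections)
```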
Step B. Train the rubric generator (judge fixed):
- What happens: Freeze the judge. Generate rubrics for prompts; let the judge decide winners using those rubrics; reward the rubric generator when the judge's decision matches the label (R_r = I[correct]). Use a single greedy judging rollout per rubric for efficiency.
- Why this step exists: Now that the judge is steady, the generator learns which rubrics actually help the judge make correct calls.
- Example: If the judge keeps miscalling when length limits are vague, the generator learns to write clearer, more discriminative hard rules.
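A small sketch of Step B's reward signal, assuming a frozen judge object with a greedy `decide` method (an assumed interface, shown here with a stub only to make the example runnable):

```python
def rubric_generator_reward(judge, prompt, rubric, response_a, response_b, labeled_winner):
    """R_r = I[correct]: one greedy judging rollout per sampled rubric."""
    predicted = judge.decide(prompt, rubric, response_a, response_b, greedy=True)
    return 1.0 if predicted == labeled_winner else 0.0

class StubJudge:
    """Stand-in for the frozen judge, used only for illustration."""
    def decide(self, prompt, rubric, a, b, greedy=True):
        return "A"

print(rubric_generator_reward(StubJudge(), "prompt", "rubric", "answer A", "answer B", "A"))  # -> 1.0
```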
Alternation schedule and stability:
- Always train the judge first in each cycle, then the generator. Theory shows the generator's updates have higher variance early because it explores many rubric wordings. Training the judge first reduces this noise, leading to steadier progress.
Top Bread (Hook) Think of a group game where everyone tries multiple answers to the same prompt, and you compare them fairly within that group.
Filling (Concept: GRPO, the RL optimizer)
- What it is: An efficient, actor-only RL method that normalizes rewards within a prompt's group of samples to reduce variance, using PPO-style clipped updates.
- How it works: 1) Sample several outputs per prompt, 2) Compute within-group advantages, 3) Apply clipped policy updates while keeping a KL guardrail to a reference.
- Why it matters: It stabilizes training and makes learning more sample-efficient.
Bottom Bread (Anchor) Like grading all students' answers to the same question together so the scale is fair.
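A minimal sketch of GRPO's within-group normalization for one prompt's group of sampled rollouts; the PPO-style clipping and KL guardrail are omitted here for brevity.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center and scale each reward by its group's
    mean and standard deviation, so every prompt's group is on a fair scale."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four judge rollouts for one prompt: two correct (with format bonus), two not.
print(group_relative_advantages([1.2, 0.2, 1.2, 0.0]))
```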
Connection to EM (intuition):
- Treat the rubric like a hidden helper. Updating the judge (M-step) learns to be correct given rubrics; updating the generator (amortized E-step) increases probability of rubrics that make the judge correct. Because rubrics and decisions are text sequences, we use policy gradients instead of exact inference.
Using Rubric-ARM to train a writing policy
Top Bread (Hook) Imagine you learn to bake by trying two cakes and asking a skilled judge with a clear checklist which one's better, then adjusting your recipe.
Filling (Concept: Offline DPO with Rubric-ARM)
- What it is: A way to train a policy (the writer) using pairwise preferences labeled by the trained judge under learned rubrics.
- How it works: 1) Sample two responses from the policy for a prompt, 2) Use Rubric-ARM to decide which is better (check both orders to avoid position bias and keep only consistent pairs), 3) Apply DPO updates against a reference policy, optionally in multiple rounds (IterDPO).
- Why it matters: It turns the clearer rubric-based judge into a strong training signal that steadily improves the writer.
Bottom Bread (Anchor) The writer model keeps practicing summaries; the judge picks the better one using rubrics; DPO nudges the writer toward making choices that win more often.
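A sketch of how the pair labeling could work in practice, assuming a `reward_model.judge(prompt, first, second)` interface that returns "A" or "B"; pairs where the two orderings disagree are dropped to control position bias.

```python
def label_pair_for_dpo(reward_model, prompt, resp_1, resp_2):
    """Return (chosen, rejected) only if both presentation orders agree; else None.
    The `reward_model.judge` interface is an assumption for illustration."""
    verdict_fwd = reward_model.judge(prompt, resp_1, resp_2)   # resp_1 shown first
    verdict_rev = reward_model.judge(prompt, resp_2, resp_1)   # resp_2 shown first
    if verdict_fwd == "A" and verdict_rev == "B":
        return resp_1, resp_2          # resp_1 wins under both orders
    if verdict_fwd == "B" and verdict_rev == "A":
        return resp_2, resp_1          # resp_2 wins under both orders
    return None                        # inconsistent verdicts -> discard the pair
```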
Top Bread (Hook) Now picture a live coach who scores each try as you write, helping you improve on the fly.
Filling (Concept: Online RL with Rubric-ARM)
- What it is: Train the writer with RL while Rubric-ARM provides rewards in real time (we use GRPO). A ReMax-style baseline response helps shape rewards fairly.
- How it works: 1) For a prompt, create one greedy baseline response and K sampled responses, 2) Under one rubric, judge both orders (A vs baseline and baseline vs A) to reduce position bias, 3) Reward the sampled response when it beats baseline consistently, 4) Update with GRPO.
- Why it matters: This makes the writer better during training, not just after labeling data offline.
Bottom Bread (Anchor) Like scrimmaging against a baseline team and rewarding plays that clearly outperform it.
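A sketch of that reward shaping under the same assumed judging interface; the exact reward values (+1/0/-1) are illustrative, not necessarily the paper's.

```python
def online_reward(reward_model, prompt, sampled_response, baseline_response):
    """ReMax-style shaping sketch: reward a sampled response only when it beats the
    greedy baseline under both presentation orders; penalize a consistent loss."""
    wins_first = reward_model.judge(prompt, sampled_response, baseline_response) == "A"
    wins_second = reward_model.judge(prompt, baseline_response, sampled_response) == "B"
    if wins_first and wins_second:
        return 1.0     # consistently better than the baseline
    if not wins_first and not wins_second:
        return -1.0    # consistently worse
    return 0.0         # order-dependent verdicts count as a tie
```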
Implementation details that keep it practical:
- Cache one rubric per prompt during judge training to reduce sampling cost.
- Use greedy one-shot judging when training the generator to save compute.
- Randomize response order during training to reduce position bias.
- Keep rubrics concise with hard rules (objective) and principles (quality), improving generalization.
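Two of these tricks are easy to picture in code; the sketch below (with an assumed `generate` interface and "A"/"B" label convention) shows per-prompt rubric caching and random order flipping.

```python
import random

def cached_rubric(cache, rubric_generator, prompt):
    """Sample one rubric per prompt and reuse it for every judge update on that prompt."""
    if prompt not in cache:
        cache[prompt] = rubric_generator.generate(prompt)
    return cache[prompt]

def shuffled_pair(response_a, response_b, label):
    """Randomly flip the presentation order (and the label with it) to curb position bias."""
    if random.random() < 0.5:
        return response_a, response_b, label
    return response_b, response_a, ("B" if label == "A" else "A")
```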
04 Experiments & Results
The test: Can a learned rubric + judge beat strong judges across many tasks and also make writer models better?
Benchmarks and settings:
- Reward-model evaluation: RewardBench (Chat, Chat-Hard), RM-Bench, PPE-IFEval, FollowBench, InfoBench, IFBench, RewardBench2 (Precise-IF, Focus), Arena-Hard, AlpacaEval 2, Creative Writing Benchmark v3, WildBench, and the out-of-distribution WritingPreferenceBench.
- Baselines: Strong white-box judges (JudgeLRM, RRM), reasoning reward models (RM-R1), prior rubric model (Rubric-RM), and training-free prompting baselines (Qwen-3-8B Rubric+Judge). API judges reported for reference.
Scoreboard with context:
- Average reward-modeling performance: Rubric-ARM reaches about 74.8 (vs. 70.1 for Rubric-RM), a +4.7% average gain. With a small ensemble (voting@5), Rubric-ARM gets ~76.2, outperforming API judges (~71.3) and direct judge APIs (~64.9).
- Think of this like getting an A when others are getting B's, and the best ensemble reaches A+.
- WritingPreferenceBench (OOD): Rubric-ARM scores ~63.2, higher than Rubric-RM (~60.3) and strong reasoning RMs like RM-R1-Qwen2.5-7B (~59.8). This shows the rubrics capture criteria that transfer beyond training domains.
- Strict instruction-following (RewardBench2-Precise-IF): Rubric-ARM's structure-aware judging and format reward help a lot, making it better at catching exact constraints (like exact length or required keywords).
Ablations (what makes it tick):
- Training order matters: Switching the order (training the generator first) drops the average from ~74.8 to ~72.4 (-2.4) and hurts most on strict constraints (Precise-IF collapses from ~41.9 to ~24.4). This supports the theory: train the judge first to lower variance.
- Format reward matters: Removing it drops to ~72.6 (-2.2). It especially boosts structure-sensitive metrics, preventing the judge from skipping rubric items.
Downstream policy training (offline):
- DPO/IterDPO with Rubric-ARM labels improves instruction following (IFEval averages ~80.4 with DPO and ~80.8 with IterDPO, best) and open-ended InfoBench (~83.7 DPO, ~85.0 IterDPO, best). On IFBench, Rubric-ARM with IterDPO hits ~35.4, topping iterative baselines.
- Analogy: The writer becomes more of a star student when coached by this clearer judge.
- Human-preference style evaluation: On Arena-Hard and AlpacaEval, DPO via Rubric-ARM leads (AVG ~51.7), and IterDPO via Rubric-ARM improves further (~53.4), best among peers.
- Creative writing: On Creative Writing v3, Rubric-ARM helps policies reach ~39.0 (DPO) and ~39.3 (IterDPO), beating prior creative baselines like RaR (~38.8) and RuscaRL (~38.6).
Online RL (GRPO):
- Training Qwen2.5-7B-Instruct online with Rubric-ARM rewards lifts overall averages from ~46.8 (base) to ~55.4, beating a strong reward baseline (RM-R1 at ~52.3). Gains appear across instruction-following and alignment metrics.
Surprising and notable findings:
- Lower position bias: Randomizing input order during training plus rubric-based structure makes Rubric-ARM's predictions much less sensitive to which answer is shown first. Some baselines swing wildly; Rubric-ARM stays steady.
- Efficiency: Despite using two 8B components (generator + judge), Rubric-ARM runs in ~33.5s on 100 samples, faster than many reasoning-heavy reward models. It trades long chains of thought for short rubric + concise judging.
- Case studies: On a "thumb war" example, baselines get distracted by the word "war," while Rubric-ARM sets a hard rule to address "thumb war" directly and picks the correct answer. On IFBench, Rubric-ARM correctly checks exact paragraph counts and keywords, avoiding judging hallucinations seen in a baseline.
Takeaway: Clear, learned rubrics plus a structured judge produce a stronger, fairer reward signal. This not only boosts evaluation accuracy but also reliably trains better writing models, both offline and online.
05 Discussion & Limitations
Top Bread (Hook) Even great checklists and referees have limits. What if the rules are unclear or the game suddenly changes?
Filling (Concept: Limitations and boundaries)
- What it is: A candid look at where Rubric-ARM might struggle and what it needs.
- How it works: We list constraints (compute, data quality), failure modes (overly rigid rules), and conditions where simpler tools may be enough.
- Why it matters: Knowing edges helps choose the right tool and inspires better future designs.
Bottom Bread (Anchor) Like choosing between a full lab experiment and a quick demo; sometimes the quick demo is fine.
Limitations:
- Compute and data: Alternating RL with two LLMs needs GPU time and curated preference pairs. While efficient compared to long CoT judges, it's heavier than training-free prompting.
- Rubric brittleness: If prompts are highly novel or ambiguous, the generator might write rubrics that are too strict or miss key nuances.
- Short answers and edge cases: For very tiny outputs (e.g., single tokens), rubric detail may not add much and could introduce overhead.
- Partial observability: If the pairwise labels are noisy or biased, the system can learn those biases unless mitigated.
Required resources:
- Two mid-size LLMs (~8B each) for the rubric generator and judge, preference datasets, and RL training pipelines (GRPO/DPO). Inference requires both models at evaluation time.
When not to use:
- Verifiable tasks with exact answers (math with solutions, code with tests). A simple verifier or exact metric is cheaper and more reliable.
- Ultra-low-latency settings where even short rubric + judging passes are too slow.
- Extremely constrained domains where a fixed, human-written rubric already suffices and doesn't drift.
Open questions:
- How to adaptively size rubrics (fewer items when simple, more when complex) without losing accuracy?
- Can we learn to detect and repair biased rubric items automatically?
- How far can we push online learning without drifting rubrics in fast-changing domains?
- Can we fuse verifiable signals (tests) with rubric principles for hybrid tasks?
- Can we compress the two-model pipeline into a single efficient multi-head model without losing interpretability?
06 Conclusion & Future Work
Three-sentence summary: Rubric-ARM jointly learns a rubric generator and a rubric-conditioned judge with alternating reinforcement learning, treating rubrics as latent actions that guide better preference predictions. By stabilizing updates (judge first) and enforcing structured judging with a small format reward, it delivers clearer, more accurate evaluation signals. Used as a reward, it consistently improves downstream policy models in both offline and online RL, with strong generalization and efficiency.
Main achievement: Showing that alternating RL on rubrics-as-rewards (rather than static rubrics or separate pipelines) yields large, reliable gains in non-verifiable domains, while keeping judgments interpretable.
Future directions: Expand rubric generation to broader open-ended tasks, mix rubrics with verifiable tests where possible, automatically de-bias and adapt rubric length to task complexity, and explore single-model architectures that retain interpretability with lower latency.
Why remember this: It turns vague "one-number" judging into transparent checklists that the judge actually learns to use, and it proves that teaching the referee and the rulebook to improve together can make AI far more aligned with human preferences.
Practical Applications
- Train classroom helper AIs to grade student writing using clear rubrics and offer targeted feedback.
- Improve customer support bots so they follow hard constraints (no PII, required steps) while staying helpful and polite.
- Evaluate creative writing assistants fairly across tone, style, and constraint-following without a single blunt score.
- Build safer assistants that enforce hard rules (e.g., disallowed content) before judging softer quality principles.
- Tune enterprise chatbots via DPO/GRPO using Rubric-ARM as the reward to boost instruction following and alignment.
- Create transparent review tools that explain why one answer was preferred, increasing user trust.
- Reduce position and length bias in automated evaluations by enforcing rubric-structured judging.
- Generalize evaluation to new domains (e.g., promotional copy, poetry) with learned, domain-adaptive rubrics.
- Guide iterative policy improvement pipelines (IterDPO) with stable, rubric-based signals.
- Accelerate evaluation by replacing long chains of thought with concise rubrics plus lightweight judging.