Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training
Key Summary
- The paper introduces Rubric-ARM, a system that teaches two AI helpers (a rubric maker and a judge) to learn together using reinforcement learning so they can better decide which answers people would prefer.
- Instead of giving one blunt score, Rubric-ARM makes a short, clear checklist (a rubric) and uses it to judge answers on multiple dimensions like rules and quality.
- The training takes turns: first improve the judge while keeping the rubric generator fixed, then improve the rubric generator while keeping the judge fixed, which reduces training noise and stabilizes learning.
- A small but important extra reward makes the judge follow a strict output format, preventing messy or incomplete judging steps.
- Across many benchmarks, Rubric-ARM beats strong baselines with about a 4.7% average gain in reward-modeling tasks and also makes downstream policy training (like DPO and GRPO) work better.
- Theoretical analysis shows why training the judge before the rubric generator lowers gradient variance (less training wobble), leading to more stable progress.
- Rubric-ARM generalizes well to new writing tasks not seen during training and shows lower position bias compared to prior judges.
- It is also efficient, running faster than most reasoning-heavy reward models, despite using two 8B components.
- Using Rubric-ARM as the reward in both offline and online RL improves instruction following and human-preference alignment of policy models.
- Overall, the method makes AI evaluation clearer, fairer, and more useful for improving helpful behavior in non-verifiable tasks like creative writing.
Why This Research Matters
Rubric-ARM makes AI judgments clearer and fairer by turning fuzzy "one-number" scores into transparent, structured checklists. That transparency helps users trust why an answer was picked, which is useful in education, customer support, and safety-critical guidance. Because the rubric and judge learn together, the system adapts to new domains without relying on expensive human-authored rubrics. The method also improves the actual writing models, making them better at following instructions and matching human preferences. By reducing bias (like favoring the first or longer response) and increasing stability, it sets a stronger foundation for aligning AI with how people really evaluate quality. The efficiency gains make it practical enough to deploy at scale.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook) You know how when teachers grade a story, they don't just give one number, but look at things like creativity, clarity, and following directions? A single score can miss a lot.
Filling (Concept: Non-verifiable domains)
- What it is: Non-verifiable domains are tasks where there isn't one exact right answer (like creative writing or open-ended advice), so you can't just check with a simple answer key.
- How it works: 1) People read two answers, 2) They say which one they prefer, 3) A system learns to predict these preferences without a clear ground truth.
- Why it matters: If we pretend there's only one "right" answer, the AI can end up favoring long or fancy texts instead of actually helpful or on-request answers.
Bottom Bread (Anchor) Imagine asking for a cheerful birthday poem. There's no single correct poem, but some poems will match the tone, follow length rules, and feel more personal; those are preferred even if there's no answer key.
Top Bread (Hook) Imagine you're judging two cupcakes: one looks pretty, the other tastes better. If you only give one number, it's hard to explain your choice.
Filling (Concept: Reward modeling)
- What it is: Reward modeling is how an AI learns a scoring rule that matches what people like.
- How it works: 1) Collect pairs of answers where humans chose a favorite, 2) Train a model to predict which answer wins, 3) Use that model's signals to improve the AI that writes answers.
- Why it matters: Without a good reward model, the writing AI learns the wrong lesson, like preferring longer answers or buzzwords.
Bottom Bread (Anchor) If people consistently prefer friendly, on-topic answers that follow instructions, the reward model should teach the writer AI to do exactly that.
Top Bread (Hook) You know how a checklist makes grading fairer? It turns fuzzy ideas like "good explanation" into clear steps.
Filling (Concept: Rubrics)
- What it is: A rubric is a short, structured checklist of criteria (like "follows instructions," "polite tone," "accurate facts").
- How it works: 1) For a given prompt, make a rubric with hard rules (objective musts) and principles (quality guidelines), 2) Judge answers against each item, 3) Make a final decision.
- Why it matters: Without rubrics, judges are inconsistent and less transparent; with rubrics, judgments are clearer and more generalizable.
Bottom Bread (Anchor) For "Explain photosynthesis in two sentences," a good rubric checks: exactly two sentences [Hard Rule], explanation is scientifically correct [Principle], and uses age-appropriate words [Principle].
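To make that structure concrete, here is a minimal Python sketch (not taken from the paper) of how a rubric with separate hard rules and principles could be represented; the class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Rubric:
    """Illustrative rubric container (names are assumptions, not the paper's schema)."""
    prompt: str
    hard_rules: List[str] = field(default_factory=list)   # objective, checkable musts
    principles: List[str] = field(default_factory=list)   # softer quality guidelines

rubric = Rubric(
    prompt="Explain photosynthesis in two sentences.",
    hard_rules=["Response contains exactly two sentences."],
    principles=["Explanation is scientifically correct.",
                "Uses age-appropriate vocabulary."],
)
print(rubric.hard_rules, rubric.principles)
```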
The world before this paper: Most reward models output a single score or a simple "A vs B" choice. That's fast but misses different angles of quality (e.g., rule-following versus tone). Recent work tried rubrics, but often used frozen, prompted models or trained the rubric-maker and judge separately. This caused three problems: 1) Static rubrics didn't adapt to the domain, 2) The judge couldn't learn better from the rubrics over time, 3) The two parts didn't improve each other.
Where attempts fell short: Prompt-only rubrics sometimes drift, overfit, or include irrelevant items. Separately trained judges can ignore the rubric or misuse it. This leads to noisy training signals and position bias (preferring the first or longer response regardless of quality).
The gap: We needed a way to actually learn rubrics and judging together, so that better rubrics make better judging, which then shapes even better rubrics, without the training becoming unstable.
Real stakes: This affects chat helpers, classroom tutors, creative tools, and safety filters. If the reward is fuzzy, models can be verbose, miss constraints, or favor style over substance. A clearer, rubric-based reward can make AI more helpful, fair, and aligned with how people actually evaluate quality.
02 Core Idea
Top Bread (Hook) Imagine two coaches training a soccer player: one sets the drills (the rubric), the other gives the scores (the judge). If they talk and adjust together after each practice, the player improves faster.
Filling (Concept: Rubric-ARM)
- What it is: Rubric-ARM is a system that jointly learns to create good rubrics and to judge answers using those rubrics through reinforcement learning.
- How it works: 1) Start both models with basic skills (warmup), 2) Take turns improving the judge while keeping the rubric-maker fixed, then improving the rubric-maker while keeping the judge fixed, 3) Use human preference labels as the reward signal for correctness, plus a small format reward so the judge follows the checklist structure.
- Why it matters: If you update both at once, training becomes wobbly (non-stationary). Alternating turns reduces noise and lets each part learn from a stable partner.
Bottom Bread (Anchor) For "Write a friendly, 3-bullet shopping list," Rubric-ARM learns to generate a rubric with hard rules (exactly 3 bullets; must be friendly) and principles (clear, relevant items), and the judge uses it to fairly pick the better of two lists.
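Here is a minimal control-flow sketch of that alternating loop, assuming hypothetical `generate`, `decide`, `train_judge_step`, and `train_generator_step` helpers that wrap the underlying RL updates; it shows only the alternation, not the actual training code.

```python
def alternating_rl(rubric_generator, judge, preference_data,
                   train_judge_step, train_generator_step, num_cycles=3):
    """Sketch of Rubric-ARM's alternation: judge first, then rubric generator.
    All model interfaces and training helpers here are assumed, not the paper's API."""
    for _ in range(num_cycles):
        # Phase A: freeze the rubric generator, improve the judge.
        rubric_cache = {ex["prompt"]: rubric_generator.generate(ex["prompt"])
                        for ex in preference_data}  # one cached rubric per prompt
        for ex in preference_data:
            # Reward inside: +1 if the judge matches the human label, plus a format bonus.
            train_judge_step(judge, rubric_cache[ex["prompt"]], ex)

        # Phase B: freeze the judge, improve the rubric generator.
        for ex in preference_data:
            rubric = rubric_generator.generate(ex["prompt"])
            decision = judge.decide(ex["prompt"], rubric, ex["response_a"], ex["response_b"])
            reward = 1.0 if decision == ex["preferred"] else 0.0  # R_r = I[judge correct]
            train_generator_step(rubric_generator, ex["prompt"], rubric, reward)
    return rubric_generator, judge
```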
Top Bread (Hook) Think of cooking with two dials: heat and seasoning. Twisting both randomly makes a mess. Adjust heat first while seasoning stays still, taste, then adjust seasoning. Better, steadier progress.
Filling (Concept: Alternating optimization)
- What it is: Alternating optimization improves two connected pieces by updating one while holding the other fixed, then switching.
- How it works: 1) Freeze the rubric generator; train the judge to be accurate under those rubrics, 2) Freeze the judge; train the rubric generator to produce rubrics that make the judge more accurate, 3) Repeat.
- Why it matters: Updating both at once injects extra randomness; alternating cuts variance, stabilizing learning and improving final accuracy.
Bottom Bread (Anchor) When learning piano, you might first perfect the rhythm (fixed melody), then focus on melody (fixed rhythm). Alternating focus builds a solid song.
Top Bread (Hook) You know how a good referee follows the rulebook closely and explains calls clearly so the game feels fair?
Filling (Concept: Judge and rubric generator)
- What it is: The judge is an LLM that compares two answers under a rubric; the rubric generator is an LLM that creates the rubric for a prompt.
- How it works: Generator writes a checklist; judge evaluates each answer item-by-item and outputs a final decision. Training feedback says whether the judge's decision matches the labeled human preference.
- Why it matters: A judge without a good rubric can be inconsistent; a rubric without a capable judge won't be used well. Together, they align on what people truly prefer.
Bottom Bread (Anchor) For "Explain the water cycle for 5th graders in under 60 words," the generator creates rules about length and simplicity; the judge checks both answers against them and picks the winner.
Multiple analogies for the key idea:
- Classroom: The teacher (judge) uses a well-designed rubric (from curriculum designer) to grade essays; the rubric is improved when grading shows confusion points.
- Sports: A referee (judge) uses a rulebook (rubric); rule tweaks happen when certain plays are hard to call. Better rules → better calls; better calls → better rule tweaks.
- Map and compass: The rubric is the map (what to look for), the judge is the compass (which direction is right). Updating one while trusting the other avoids getting lost.
Before vs after:
- Before: One-score reward models missed multiple quality aspects; rubric methods often froze models or trained parts separately → limited learning and bias.
- After: Rubric-ARM learns rubrics as latent actions and a rubric-conditioned judge together, producing clearer, fairer, and stronger evaluation signals that better guide downstream training.
Why it works (intuition, not equations):
- Fixing one side removes a moving target, shrinking randomness in gradient signals (lower variance), so the other side learns reliably.
- Judging first creates a steady measuring stick; then the rubric-maker can confidently learn which checklists truly help the judge make correct calls.
Building blocks:
- SFT warmup to give both models basic skills.
- Alternating RL updates with a shared objective: judge correctness.
- A small format reward so the judge always follows the checklist steps.
- Efficient sampling (caching rubrics; greedy judge rollouts when training the generator).
- Use of learned rubrics and judge to train actual policy models (writers) in offline (DPO) and online (GRPO) RL.
03 Methodology
At a high level: Prompt + two candidate answers → Rubric generator makes a checklist → Judge reads prompt, rubric, and both answers → Judge decides which answer wins and explains → We reward correctness and good format → We alternate training the judge and the rubric generator → After training, we use the learned judge as a reward to improve a writing policy.
Stage I: SFT Warmup
Top Bread (Hook) Imagine teaching two teammates: first, let the scorer practice with a fixed playbook; then let the playbook writer improve based on what helped the scorer score.
Filling (Concept: SFT warmup)
- What it is: A starting phase where both models learn basic behaviors from examples by next-token prediction.
- How it works: 1) Collect synthetic rubric and judge traces from open datasets, 2) Fine-tune the rubric generator to write clear rubrics, 3) Fine-tune the judge to follow rubrics and output structured decisions.
- Why it matters: Without warmup, RL is unstable because both models start from scratch and produce noisy rubrics and judgments.
Bottom Bread (Anchor) Like learning to ride a bike with training wheels before racing.
Stage II: Alternating Reinforcement Learning
Step A. Train the judge (rubric fixed):
- What happens: Freeze the rubric generator. For each training example (prompt, two answers, human-preferred label), sample one rubric per prompt and cache it. Train the judge so its decision matches the label. Use a shaped reward R_j = R_acc + R_fmt.
- Why this step exists: If the rubric also changed, the target would keep moving, making judge learning shaky.
- Example: Prompt: "Summarize the article in 2 sentences." Rubric has a hard rule: exactly 2 sentences. The judge checks both answers criterion-by-criterion, explains, then selects the winner. Reward gives +1 when the pick matches the label and extra points if the judge follows the required output format.
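As a rough sketch of Step A's shaped reward (the exact weighting in the paper may differ), the judge gets +1 for matching the human label plus a small format bonus computed separately (see the format-reward sketch below):

```python
def judge_step_reward(predicted_winner: str, labeled_winner: str, r_format: float) -> float:
    """R_j = R_acc + R_fmt: accuracy indicator plus a small format bonus.
    The signature and the size of the bonus are illustrative assumptions."""
    r_acc = 1.0 if predicted_winner == labeled_winner else 0.0
    return r_acc + r_format

print(judge_step_reward("A", "A", 0.2))  # correct pick with format bonus -> 1.2
print(judge_step_reward("B", "A", 0.2))  # wrong pick, format bonus only -> 0.2
```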
Top Bread (Hook) You know how a good science lab report has sections: question, method, results, conclusion? If you skip sections, readers get lost.
Filling (Concept: Format reward)
- What it is: A small extra reward that requires the judge to output a strict structure (gatekeeper criterion, per-criterion analysis, final justification, decision).
- How it works: 1) Check that the judge fills each required section, 2) Penalize missing or jumbled sections, 3) Add this to the correctness reward.
- Why it matters: Without it, the judge may skip steps, forget to address each rubric item, or hide the final decision.
Bottom Bread (Anchor) Just like getting points for including every part of a lab report.
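A minimal sketch of such a format check, assuming section headers like "Gatekeeper:", "Analysis:", "Justification:", and "Decision:" (the real section names and bonus size in the paper may differ):

```python
REQUIRED_SECTIONS = ["Gatekeeper", "Analysis", "Justification", "Decision"]  # assumed names

def format_reward(judge_output: str, bonus: float = 0.2) -> float:
    """Give the full bonus only when every required section appears, in order;
    missing or jumbled sections get nothing."""
    positions = [judge_output.find(f"{name}:") for name in REQUIRED_SECTIONS]
    all_present = all(p >= 0 for p in positions)
    in_order = positions == sorted(positions)
    return bonus if (all_present and in_order) else 0.0

sample = "Gatekeeper: passes. Analysis: A meets both rules. Justification: clearer. Decision: A"
print(format_reward(sample))         # -> 0.2
print(format_reward("Decision: A"))  # -> 0.0 (skipped sections)
```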
Step B. Train the rubric generator (judge fixed):
- What happens: Freeze the judge. Generate rubrics for prompts; let the judge decide winners using those rubrics; reward the rubric generator when the judge's decision matches the label (R_r = I[correct]). Use a single greedy judging rollout per rubric for efficiency.
- Why this step exists: Now that the judge is steady, the generator learns which rubrics actually help the judge make correct calls.
- Example: If the judge keeps miscalling when length limits are vague, the generator learns to write clearer, more discriminative hard rules.
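A small sketch of Step B's reward signal, assuming a frozen judge object with a greedy `decide` method (an assumed interface, shown here with a stub only to make the example runnable):

```python
def rubric_generator_reward(judge, prompt, rubric, response_a, response_b, labeled_winner):
    """R_r = I[correct]: one greedy judging rollout per sampled rubric."""
    predicted = judge.decide(prompt, rubric, response_a, response_b, greedy=True)
    return 1.0 if predicted == labeled_winner else 0.0

class StubJudge:
    """Stand-in for the frozen judge, used only for illustration."""
    def decide(self, prompt, rubric, a, b, greedy=True):
        return "A"

print(rubric_generator_reward(StubJudge(), "prompt", "rubric", "answer A", "answer B", "A"))  # -> 1.0
```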
Alternation schedule and stability:
- Always train the judge first in each cycle, then the generator. Theory shows the generator's updates have higher variance early because it explores many rubric wordings. Training the judge first reduces this noise, leading to steadier progress.
Top Bread (Hook) Think of a group game where everyone tries multiple answers to the same prompt, and you compare them fairly within that group.
Filling (Concept: GRPO, the RL optimizer)
- What it is: An efficient, actor-only RL method that normalizes rewards within a prompt's group of samples to reduce variance, using PPO-style clipped updates.
- How it works: 1) Sample several outputs per prompt, 2) Compute within-group advantages, 3) Apply clipped policy updates while keeping a KL guardrail to a reference.
- Why it matters: It stabilizes training and makes learning more sample-efficient.
Bottom Bread (Anchor) Like grading all students' answers to the same question together so the scale is fair.
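A minimal sketch of GRPO's within-group normalization for one prompt's group of sampled rollouts; the PPO-style clipping and KL guardrail are omitted here for brevity.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center and scale each reward by its group's
    mean and standard deviation, so every prompt's group is on a fair scale."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four judge rollouts for one prompt: two correct (with format bonus), two not.
print(group_relative_advantages([1.2, 0.2, 1.2, 0.0]))
```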
Connection to EM (intuition):
- Treat the rubric like a hidden helper. Updating the judge (M-step) learns to be correct given rubrics; updating the generator (amortized E-step) increases probability of rubrics that make the judge correct. Because rubrics and decisions are text sequences, we use policy gradients instead of exact inference.
Using Rubric-ARM to train a writing policy
Top Bread (Hook) Imagine you learn to bake by trying two cakes and asking a skilled judge with a clear checklist which one's better, then adjusting your recipe.
Filling (Concept: Offline DPO with Rubric-ARM)
- What it is: A way to train a policy (the writer) using pairwise preferences labeled by the trained judge under learned rubrics.
- How it works: 1) Sample two responses from the policy for a prompt, 2) Use Rubric-ARM to decide which is better (check both orders to avoid position bias and keep only consistent pairs), 3) Apply DPO updates against a reference policy, optionally in multiple rounds (IterDPO).
- Why it matters: It turns the clearer rubric-based judge into a strong training signal that steadily improves the writer.
Bottom Bread (Anchor) The writer model keeps practicing summaries; the judge picks the better one using rubrics; DPO nudges the writer toward making choices that win more often.
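A sketch of how the pair labeling could work in practice, assuming a `reward_model.judge(prompt, first, second)` interface that returns "A" or "B"; pairs where the two orderings disagree are dropped to control position bias.

```python
def label_pair_for_dpo(reward_model, prompt, resp_1, resp_2):
    """Return (chosen, rejected) only if both presentation orders agree; else None.
    The `reward_model.judge` interface is an assumption for illustration."""
    verdict_fwd = reward_model.judge(prompt, resp_1, resp_2)   # resp_1 shown first
    verdict_rev = reward_model.judge(prompt, resp_2, resp_1)   # resp_2 shown first
    if verdict_fwd == "A" and verdict_rev == "B":
        return resp_1, resp_2          # resp_1 wins under both orders
    if verdict_fwd == "B" and verdict_rev == "A":
        return resp_2, resp_1          # resp_2 wins under both orders
    return None                        # inconsistent verdicts -> discard the pair
```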
Top Bread (Hook) Now picture a live coach who scores each try as you write, helping you improve on the fly.
Filling (Concept: Online RL with Rubric-ARM)
- What it is: Train the writer with RL while Rubric-ARM provides rewards in real time (we use GRPO). A ReMax-style baseline response helps shape rewards fairly.
- How it works: 1) For a prompt, create one greedy baseline response and K sampled responses, 2) Under one rubric, judge both orders (A vs baseline and baseline vs A) to reduce position bias, 3) Reward the sampled response when it beats baseline consistently, 4) Update with GRPO.
- Why it matters: This makes the writer better during training, not just after labeling data offline.
Bottom Bread (Anchor) Like scrimmaging against a baseline team and rewarding plays that clearly outperform it.
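A sketch of that reward shaping under the same assumed judging interface; the exact reward values (+1/0/-1) are illustrative, not necessarily the paper's.

```python
def online_reward(reward_model, prompt, sampled_response, baseline_response):
    """ReMax-style shaping sketch: reward a sampled response only when it beats the
    greedy baseline under both presentation orders; penalize a consistent loss."""
    wins_first = reward_model.judge(prompt, sampled_response, baseline_response) == "A"
    wins_second = reward_model.judge(prompt, baseline_response, sampled_response) == "B"
    if wins_first and wins_second:
        return 1.0     # consistently better than the baseline
    if not wins_first and not wins_second:
        return -1.0    # consistently worse
    return 0.0         # order-dependent verdicts count as a tie
```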
Implementation details that keep it practical:
- Cache one rubric per prompt during judge training to reduce sampling cost.
- Use greedy one-shot judging when training the generator to save compute.
- Randomize response order during training to reduce position bias.
- Keep rubrics concise with hard rules (objective) and principles (quality), improving generalization.
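Two of these tricks are easy to picture in code; the sketch below (with an assumed `generate` interface and "A"/"B" label convention) shows per-prompt rubric caching and random order flipping.

```python
import random

def cached_rubric(cache, rubric_generator, prompt):
    """Sample one rubric per prompt and reuse it for every judge update on that prompt."""
    if prompt not in cache:
        cache[prompt] = rubric_generator.generate(prompt)
    return cache[prompt]

def shuffled_pair(response_a, response_b, label):
    """Randomly flip the presentation order (and the label with it) to curb position bias."""
    if random.random() < 0.5:
        return response_a, response_b, label
    return response_b, response_a, ("B" if label == "A" else "A")
```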
04 Experiments & Results
The test: Can a learned rubric + judge beat strong judges across many tasks and also make writer models better?
Benchmarks and settings:
- Reward-model evaluation: RewardBench (Chat, Chat-Hard), RM-Bench, PPE-IFEval, FollowBench, InfoBench, IFBench, RewardBench2 (Precise-IF, Focus), Arena-Hard, AlpacaEval 2, Creative Writing Benchmark v3, WildBench, and the out-of-distribution WritingPreferenceBench.
- Baselines: Strong white-box judges (JudgeLRM, RRM), reasoning reward models (RM-R1), prior rubric model (Rubric-RM), and training-free prompting baselines (Qwen-3-8B Rubric+Judge). API judges reported for reference.
Scoreboard with context:
- Average reward-modeling performance: Rubric-ARM reaches about 74.8 (vs. 70.1 for Rubric-RM), a +4.7% average gain. With a small ensemble (voting@5), Rubric-ARM gets ~76.2, outperforming API judges (~71.3) and direct judge APIs (~64.9).
- Think of this like getting an A when others are getting B's, and the best ensemble reaches A+.
- WritingPreferenceBench (OOD): Rubric-ARM scores ~63.2, higher than Rubric-RM (~60.3) and strong reasoning RMs like RM-R1-Qwen2.5-7B (~59.8). This shows the rubrics capture criteria that transfer beyond training domains.
- Strict instruction-following (RewardBench2-Precise-IF): Rubric-ARM's structure-aware judging and format reward help a lot, making it better at catching exact constraints (like exact length or required keywords).
Ablations (what makes it tick):
- Training order matters: Switching the order (training the generator first) drops the average from ~74.8 to ~72.4 (-2.4) and hurts most on strict constraints (Precise-IF collapses from ~41.9 to ~24.4). This supports the theory: train the judge first to lower variance.
- Format reward matters: Removing it drops to ~72.6 (-2.2). It especially boosts structure-sensitive metrics, preventing the judge from skipping rubric items.
Downstream policy training (offline):
- DPO/IterDPO with Rubric-ARM labels improves instruction following (IFEval averages ~80.4 with DPO and ~80.8 with IterDPO, best) and open-ended InfoBench (~83.7 DPO, ~85.0 IterDPO, best). On IFBench, Rubric-ARM with IterDPO hits ~35.4, topping iterative baselines.
- Analogy: The writer becomes more of a star student when coached by this clearer judge.
- Human-preference style evaluation: On Arena-Hard and AlpacaEval, DPO via Rubric-ARM leads (AVG ~51.7), and IterDPO via Rubric-ARM improves further (~53.4), best among peers.
- Creative writing: On Creative Writing v3, Rubric-ARM helps policies reach ~39.0 (DPO) and ~39.3 (IterDPO), beating prior creative baselines like RaR (~38.8) and RuscaRL (~38.6).
Online RL (GRPO):
- Training Qwen2.5-7B-Instruct online with Rubric-ARM rewards lifts overall averages from ~46.8 (base) to ~55.4, beating a strong reward baseline (RM-R1 at ~52.3). Gains appear across instruction-following and alignment metrics.
Surprising and notable findings:
- Lower position bias: Randomizing input order during training plus rubric-based structure makes Rubric-ARM's predictions much less sensitive to which answer is shown first. Some baselines swing wildly; Rubric-ARM stays steady.
- Efficiency: Despite using two 8B components (generator + judge), Rubric-ARM runs in ~33.5s on 100 samples, faster than many reasoning-heavy reward models. It trades long chains of thought for short rubric + concise judging.
- Case studies: On a "thumb war" example, baselines get distracted by the word "war," while Rubric-ARM sets a hard rule to address "thumb war" directly and picks the correct answer. On IFBench, Rubric-ARM correctly checks exact paragraph counts and keywords, avoiding judging hallucinations seen in a baseline.
Takeaway: Clear, learned rubrics plus a structured judge produce a stronger, fairer reward signal. This not only boosts evaluation accuracy but also reliably trains better writing models, both offline and online.
05 Discussion & Limitations
Top Bread (Hook) Even great checklists and referees have limits. What if the rules are unclear or the game suddenly changes?
Filling (Concept: Limitations and boundaries)
- What it is: A candid look at where Rubric-ARM might struggle and what it needs.
- How it works: We list constraints (compute, data quality), failure modes (overly rigid rules), and conditions where simpler tools may be enough.
- Why it matters: Knowing edges helps choose the right tool and inspires better future designs.
Bottom Bread (Anchor) Like choosing between a full lab experiment and a quick demo; sometimes the quick demo is fine.
Limitations:
- Compute and data: Alternating RL with two LLMs needs GPU time and curated preference pairs. While efficient compared to long CoT judges, it's heavier than training-free prompting.
- Rubric brittleness: If prompts are highly novel or ambiguous, the generator might write rubrics that are too strict or miss key nuances.
- Short answers and edge cases: For very tiny outputs (e.g., single tokens), rubric detail may not add much and could introduce overhead.
- Partial observability: If the pairwise labels are noisy or biased, the system can learn those biases unless mitigated.
Required resources:
- Two mid-size LLMs (~8B each) for the rubric generator and judge, preference datasets, and RL training pipelines (GRPO/DPO). Inference requires both models at evaluation time.
When not to use:
- Verifiable tasks with exact answers (math with solutions, code with tests). A simple verifier or exact metric is cheaper and more reliable.
- Ultra-low-latency settings where even short rubric + judging passes are too slow.
- Extremely constrained domains where a fixed, human-written rubric already suffices and doesn't drift.
Open questions:
- How to adaptively size rubrics (fewer items when simple, more when complex) without losing accuracy?
- Can we learn to detect and repair biased rubric items automatically?
- How far can we push online learning without drifting rubrics in fast-changing domains?
- Can we fuse verifiable signals (tests) with rubric principles for hybrid tasks?
- Can we compress the two-model pipeline into a single efficient multi-head model without losing interpretability?
06 Conclusion & Future Work
Three-sentence summary: Rubric-ARM jointly learns a rubric generator and a rubric-conditioned judge with alternating reinforcement learning, treating rubrics as latent actions that guide better preference predictions. By stabilizing updates (judge first) and enforcing structured judging with a small format reward, it delivers clearer, more accurate evaluation signals. Used as a reward, it consistently improves downstream policy models in both offline and online RL, with strong generalization and efficiency.
Main achievement: Showing that alternating RL on rubrics-as-rewards (rather than static rubrics or separate pipelines) yields large, reliable gains in non-verifiable domains, while keeping judgments interpretable.
Future directions: Expand rubric generation to broader open-ended tasks, mix rubrics with verifiable tests where possible, automatically de-bias and adapt rubric length to task complexity, and explore single-model architectures that retain interpretability with lower latency.
Why remember this: It turns vague "one-number" judging into transparent checklists that the judge actually learns to use, and it proves that teaching the referee and the rulebook to improve together can make AI far more aligned with human preferences.
Practical Applications
- Train classroom helper AIs to grade student writing using clear rubrics and offer targeted feedback.
- Improve customer support bots so they follow hard constraints (no PII, required steps) while staying helpful and polite.
- Evaluate creative writing assistants fairly across tone, style, and constraint-following without a single blunt score.
- Build safer assistants that enforce hard rules (e.g., disallowed content) before judging softer quality principles.
- Tune enterprise chatbots via DPO/GRPO using Rubric-ARM as the reward to boost instruction following and alignment.
- Create transparent review tools that explain why one answer was preferred, increasing user trust.
- Reduce position and length bias in automated evaluations by enforcing rubric-structured judging.
- Generalize evaluation to new domains (e.g., promotional copy, poetry) with learned, domain-adaptive rubrics.
- Guide iterative policy improvement pipelines (IterDPO) with stable, rubric-based signals.
- Accelerate evaluation by replacing long chains of thought with concise rubrics plus lightweight judging.