Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Intermediate
Qiyuan Zhang, Yufei Wang, Tianhe Wu et al. · 3/2/2026
arXiv

Key Summary

  • Longer explanations are not always better; the shape of thinking matters.
  • This paper splits reasoning into two styles: Breadth (check many angles) and Depth (verify step by step).
  • They build Mix-GRM, a reward model that can use Breadth for taste-based questions and Depth for fact-based ones.
  • A modular schema (Principle–Judgment–Verdict) turns messy explanations into neat, checkable parts.
  • Training first copies good examples (SFT), then learns from right/wrong final answers (RLVR) to pick the best thinking style for each task.
  • Across five big benchmarks, Mix-GRM sets a new open-source state of the art, beating strong baselines by an average of 8.2%.
  • Breadth helps with subjective preferences; Depth helps with objective correctness—mixing them prevents mismatches.
  • RLVR acts like a switch that amplifies the right style automatically, improving results further.
  • It works well in real use: better reward signals for DPO and better verifiers for choosing the best answer out of many.
  • You get higher accuracy with far less data by structuring thoughts, not just making them longer.

Why This Research Matters

AI judges train other AIs; if the judge likes the wrong things, the student models learn bad habits. This work shows that matching the shape of thinking to the task—Breadth for preferences and Depth for correctness—makes judges fairer and more reliable. It also delivers better results with far less data, lowering costs and speeding up progress. In practical terms, it helps pick safer, clearer chatbot replies and more correct math or code solutions. The approach improves both training-time supervision (DPO) and runtime verification (Best-of-N), which many real systems already use. By focusing on structure over length, teams can build trustworthy evaluators that generalize well across domains.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine two kinds of homework. In art class, your teacher cares about creativity, neatness, and style (many angles). In math class, your teacher cares if each step of your proof is correct (one tight chain). If you grade both with the same checklist, you’ll be unfair to at least one.

🥬 The World Before:

  • What it is: Reward Models (RMs) are AI judges that score answers so other AIs can learn what “good” looks like.
  • How it worked: Many recent RMs became Generative Reward Models (GRMs). Instead of outputting only a score, they explain their reasoning first (a Chain-of-Thought), then give a verdict.
  • Why it mattered: Adding an explanation often made judgments more robust and generalizable, like showing your work in math helps catch mistakes.

🍞 Anchor: When two chatbots answer a question, a GRM might write why one answer is more helpful or safer before choosing A or B.

🍞 Hook: You know how kids sometimes think writing a longer essay will get a better grade? AI did that too with longer Chains-of-Thought.

🥬 The Problem:

  • What it is: People scaled up CoT length for GRMs, assuming “longer = smarter.”
  • How it showed up: Researchers fed models more features, critiques, and steps to make the rationale longer and denser.
  • Why it broke: Studies showed longer doesn’t always help. Different tasks need different thinking shapes: wide coverage vs deep verification. Treating all tasks the same caused mismatches and worse results.

🍞 Anchor: If you grade a poem with a 20-step math proof checklist, the poem won’t get a fair score, no matter how long the checklist is.

🍞 Hook: Imagine two thinking modes. One is brainstorming (many ideas in parallel). The other is proving (one idea checked line by line).

🥬 Failed Attempts:

  • What they tried: 1) Just make CoTs longer via RL; 2) Use big rubrics/checklists; 3) Distill longer critiques from powerful models.
  • Why it didn’t work: These were mostly static and task-agnostic. They expanded quantity, not the right structure. Breadth helps when preferences are multifaceted; depth helps when facts must be verified. Mixing them up confused the judge.
  • What broke without structure: Models rewarded fancy formatting or long-windedness over actual logical correctness—or overly fixated on logic and ignored user tone or language match.

🍞 Anchor: A one-size-fits-all ruler can’t fairly measure both spaghetti length and a chess strategy.

🍞 Hook: Think of Lego. Random piles are messy. Sorted bricks let you build smart shapes for different goals.

🥬 The Gap:

  • What it is: We needed a way to turn messy explanations into tidy, reusable blocks and then assemble the right shape of reasoning (breadth or depth) per task.
  • How it works: First, standardize rationales into Principle–Judgment–Verdict units (the Lego bricks). Then assemble them into Breadth-CoT (many principles in parallel) or Depth-CoT (reason a solution first, then judge grounded in that solution).
  • Why it matters: Without this, we can’t align the thinking style to the task—leading to avoidable errors and wasted tokens.

🍞 Anchor: With labeled Lego bricks, you can snap together a wide bridge (breadth) or a tall tower (depth) when each is needed.

🍞 Hook: Imagine a coach who knows when to practice dribbling (breadth of skills) and when to drill one perfect free throw (depth of one skill).

🥬 Real Stakes:

  • What it is: AI judges decide what other AIs learn. If the judge likes the wrong things, the students (models) copy bad habits.
  • How it works out in life: Customer support quality, code correctness, safety filters, and math solvers all depend on solid judging.
  • Why it matters: The paper shows that choosing the right reasoning mechanism boosts accuracy with far less data and compute, helping build fairer, safer, and more useful systems.

🍞 Anchor: A fair art judge and a fair math judge make better students—and better apps you can trust.

02Core Idea

🍞 Hook: You know how a Swiss Army knife has different tools for different jobs? You don’t use scissors to tighten a screw.

🥬 The “Aha!” Moment (one sentence): Don’t just make explanations longer—teach the reward model to switch between Breadth (cover many principles) and Depth (verify step by step) and use the right one for the right task.

Multiple Analogies:

  1. Library vs. Lab:
  • Breadth is like scanning many book sections to cover all topics (style, tone, safety).
  • Depth is like running a careful experiment to test one claim.
  2. Movie Critic vs. Math Teacher:
  • Breadth weighs acting, music, story, and pacing together.
  • Depth checks each algebra step for correctness.
  3. Detective Squad vs. Forensic Expert:
  • Breadth interviews many witnesses (multiple angles).
  • Depth replays the timeline with precise measurements.

Before vs. After:

  • Before: GRMs mostly stretched CoTs, hoping length would fix judgment quality across tasks.
  • After: Mix-GRM assembles the correct structure (Breadth-CoT for subjective preference, Depth-CoT for objective correctness) and learns to switch automatically with RLVR.
  • What changes: Higher accuracy, better data efficiency, and fewer failures caused by mismatched reasoning styles.

Why It Works (intuition):

  • Tasks differ. Preference tasks are multi-criteria (helpfulness, tone, safety, creativity)—they benefit from checking many principles in parallel (Breadth).
  • Correctness tasks hinge on logical chains—small mistakes break the answer—so they need a single, grounded reasoning trace (Depth) to avoid being fooled by surface polish.
  • RL with verifiable rewards (only the final verdict’s right/wrong) gives a clean signal that nudges the model toward the style that actually leads to correct verdicts—acting like a “switching amplifier.”

Building Blocks (each with a mini sandwich):

  • 🍞 You know how teachers use rubrics with named criteria? 🥬 Principle–Judgment–Verdict Units:

    • What: Break raw rationales into small, standard pieces: a Principle (what to judge), a Judgment (evidence/comparison), and a Sub-Verdict (A or B for that principle).
    • How: Parse messy text into 3–5 atomic units per example so they’re clear and checkable.
    • Why: Without tidy units, you can’t reliably build Breadth or Depth structures. 🍞 Anchor: Like turning a rambling book report into bullet-pointed rubric items.
  • 🍞 Imagine brainstorming with friends, then merging the best ideas without repeats. 🥬 Breadth-CoT (B-COT):

    • What: Parallel aggregation of diverse principles across multiple sampled rationales.
    • How: Sample N rationales, extract units, merge and deduplicate, keep high-frequency, representative principles.
    • Why: Without breadth, preference judgments miss subtle constraints (like language match or cultural sensitivity). 🍞 Anchor: For a Japanese question, Breadth flags that an English answer may be mismatched even if it’s detailed.
  • 🍞 Think of solving the math problem yourself before grading someone else’s solution. 🥬 Depth-CoT (D-COT):

    • What: A self-solve trace first, then judgments grounded in that trace.
    • How: Generate a reasoning trace; re-write judgments using that trace (1–3 key principles), serialize to a coherent evaluation.
    • Why: Without depth, correctness judgments reward fluff and miss hidden logical errors. 🍞 Anchor: In redox, Depth catches that K is more reactive than Mg, so a proposed reaction is impossible.
  • 🍞 Picture learning by example, then getting quizzed on right/wrong answers. 🥬 Mechanism-Adaptive Alignment (SFT + RLVR):

    • What: First teach both styles with the right tasks (SFT), then use RLVR to reinforce picking the best style per case.
    • How: Pair Breadth with preference data and Depth with correctness data during SFT; RLVR then rewards correct final verdicts only.
    • Why: Without the RLVR “switch,” the model under-uses the right style at inference time. 🍞 Anchor: After RLVR, the model chooses Breadth more on taste problems and Depth more on proofs—by itself.

03Methodology

At a high level: Input (instruction + two answers) → Standardize into Principle–Judgment–Verdict units → Synthesize Breadth-CoT or Depth-CoT → Train with SFT on a mixed dataset → Optimize with RLVR (verdict-only rewards) → Output: rationale + verdict, with the model adaptively favoring Breadth for preference tasks and Depth for correctness tasks.
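The flow above can be sketched in a few lines. This is a minimal illustration, not the paper's actual API: the callables stand in for LLM stages, and the explicit routing mirrors how styles are paired with task types during data synthesis and SFT (after RLVR, the trained model picks the style on its own).

```python
def mix_grm_pipeline(instruction, answer_a, answer_b, task_type,
                     standardize, breadth_cot, depth_cot):
    """Sketch of the Mix-GRM data flow. The three callables are
    hypothetical stand-ins for LLM stages, not real functions."""
    units = standardize(instruction, answer_a, answer_b)   # PJV units
    if task_type == "preference":
        return breadth_cot(units)                          # wide checklist
    return depth_cot(instruction, units)                   # self-solve first

# Toy demo with stub stages returning (rationale, verdict) pairs:
rationale, verdict = mix_grm_pipeline(
    "q", "answer A", "answer B", "preference",
    standardize=lambda *args: ["unit"],
    breadth_cot=lambda units: ("wide rationale", "A"),
    depth_cot=lambda q, units: ("deep rationale", "B"),
)
print(verdict)  # A
```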

Step 1: Modular Schema Standardization

  • 🍞 Hook: You know how sorting Lego pieces by shape and color makes building faster and better?
  • 🥬 What: Turn a free-form rationale into 3–5 atomic units: Principle (what we check), Judgment (evidence and comparison), Sub-Verdict (A or B for that principle). How:
    1. Use an LLM to parse raw reasoning text into units.
    2. Keep a small, clean set (3–5) to avoid redundancy.
    3. Ensure each unit is specific (e.g., “Instruction Adherence” with a concrete quote from the answers). Why it matters: Without standard units, later steps can’t combine ideas reliably; you’d just be stacking messy text.
  • 🍞 Anchor: The ramble “B is clearer but A is safer” becomes two units: Clarity (B wins) and Safety (A wins).
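The schema in Step 1 can be sketched as a small data structure. This is a minimal illustration; the field names are ours, not the paper's exact format.

```python
from dataclasses import dataclass

@dataclass
class PJVUnit:
    """One atomic Principle–Judgment–Verdict unit."""
    principle: str  # what to check, e.g. "Clarity"
    judgment: str   # evidence comparing answers A and B
    verdict: str    # "A" or "B" for this principle alone

# The ramble "B is clearer but A is safer" parses into two units:
units = [
    PJVUnit("Clarity", "B uses short sentences and headings; A rambles.", "B"),
    PJVUnit("Safety", "A warns about side effects; B omits the caution.", "A"),
]
print([(u.principle, u.verdict) for u in units])
# [('Clarity', 'B'), ('Safety', 'A')]
```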

Step 2A: Breadth-CoT Synthesis (for Preference)

  • 🍞 Hook: Imagine five friends each list what matters for a great speech—tone, clarity, facts, empathy, and structure—and you merge the best points.
  • 🥬 What: Build a wide, parallel checklist of diverse, representative principles. How:
    1. Sample N independent rationales to explore different angles.
    2. Parse each into units.
    3. Merge and deduplicate with an LLM, preferring high-frequency principles (consensus) and removing noise.
    4. Serialize as a coherent Breadth-CoT: Principle 1 → Judgment → Sub-Verdict; Principle 2 → …; etc. Why it matters: Preference is multi-dimensional; without Breadth, the judge latches onto one loud feature and misses important subtleties (like language match).
  • 🍞 Anchor: In the Japanese-language question, Breadth includes a “Linguistic Alignment” principle that flips the verdict toward the answer in Japanese, even if the English one is detailed.
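The merge-and-deduplicate step can be approximated with a frequency filter. In the paper an LLM does the merging; this toy version only shows the "keep high-frequency, drop one-off noise" idea, with made-up principle names.

```python
from collections import Counter

def merge_principles(sampled_rationales, min_count=2):
    """Merge principles across N sampled rationales, keeping
    high-frequency (consensus) principles and dropping noise.
    Each rationale is a list of (principle, verdict) pairs."""
    counts = Counter(p for units in sampled_rationales for p, _ in units)
    return [p for p, c in counts.most_common() if c >= min_count]

# Three sampled rationales for the Japanese-language example:
samples = [
    [("Linguistic Alignment", "B"), ("Helpfulness", "A")],
    [("Linguistic Alignment", "B"), ("Cultural Sensitivity", "B")],
    [("Linguistic Alignment", "B"), ("Helpfulness", "A")],
]
print(merge_principles(samples))
# ['Linguistic Alignment', 'Helpfulness']
```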

Step 2B: Depth-CoT Synthesis (for Correctness)

  • 🍞 Hook: Before grading a proof, you solve it yourself to see what the right path looks like.
  • 🥬 What: Ground judgments in a self-solved reasoning trace (z). How:
    1. Generate a reasoning trace that solves the problem.
    2. Select up to ~3 most critical principles (e.g., Factual Correctness, Logical Consistency).
    3. Re-write each Judgment strictly referencing the trace (quote <Answer> from the trace), then serialize to a Depth-CoT. Why it matters: Without a ground-truth-like trace, the judge can be fooled by polished but wrong answers.
  • 🍞 Anchor: On a redox question, the trace derives the reactivity order and spots that B proposes an impossible reaction; Depth picks A.
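The solve-then-judge structure of Depth-CoT can be sketched as follows. `solve` and `judge` are hypothetical stand-ins for LLM calls, not the paper's API; the final verdict here is a simple majority over the grounded sub-verdicts.

```python
def depth_cot(question, answer_a, answer_b, solve, judge):
    """Depth-CoT sketch: self-solve first, then ground every
    judgment in that reasoning trace z."""
    trace = solve(question)  # step 1: generate a self-solved trace
    judgments = [
        judge(principle, trace, answer_a, answer_b)  # step 3: grounded in trace
        for principle in ("Factual Correctness", "Logical Consistency")  # <=3 key principles
    ]
    votes = [j["verdict"] for j in judgments]
    return max(set(votes), key=votes.count)  # final verdict

# Toy demo with stub callables (a real system would call an LLM):
verdict = depth_cot(
    "Which proposed reaction is possible?", "Answer A", "Answer B",
    solve=lambda q: "K is more reactive than Mg, so Mg cannot displace K.",
    judge=lambda p, trace, a, b: {"principle": p, "verdict": "A"},
)
print(verdict)  # A
```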

Step 3: Mechanism-Adaptive Alignment (Training)

  • 🍞 Hook: First learn the moves; then play games where only wins count.
  • 🥬 What: Two-stage training—SFT on matched data, then RLVR via GRPO. How:
    1. SFT: Train the model to produce Breadth-CoT on preference tasks and Depth-CoT on correctness tasks, with the right verdict.
    2. RLVR (verifiable rewards): Roll out multiple completions and reward +1 if the final verdict matches the human label, else -1, with KL regularization for stability. Why it matters: SFT gives the model both tools; RLVR teaches when to grab which tool by making verdict accuracy the single target signal.
  • 🍞 Anchor: After RLVR, the model’s outputs on preference tasks show more principles (Breadth), while correctness tasks show longer, trace-anchored judgments (Depth).
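The RLVR stage's reward is simple enough to write down directly. A minimal sketch: the verdict match gives +1/-1 as described above, and a GRPO-style trainer normalizes rewards within each rollout group (the KL regularization against the reference policy lives in the trainer and is not shown).

```python
def verdict_reward(rollout_verdict, gold_verdict):
    """+1 if the rolled-out final verdict matches the human label,
    else -1 (the verifiable reward described in Step 3)."""
    return 1.0 if rollout_verdict == gold_verdict else -1.0

def group_advantages(rewards):
    """GRPO-style sketch: normalize rewards within one rollout group
    so only relative verdict accuracy drives the policy update."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Four rollouts for one comparison whose gold verdict is "A":
rewards = [verdict_reward(v, "A") for v in ["A", "B", "A", "A"]]
print(rewards)  # [1.0, -1.0, 1.0, 1.0]
```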

Example with Actual Data (miniatures):

  • Preference (Language Match): Instruction: Japanese question. A: English, detailed; B: Japanese, concise. Breadth principles: Linguistic Alignment (B), Helpfulness (A), Cultural Sensitivity (B) → Final: B wins.
  • Correctness (Chemistry Redox): Depth trace: Derive metal reactivity order; test each option. Judgment: B proposes K < Mg (wrong); A picks a valid reaction → Final: A wins.

The Secret Sauce:

  • Structural choice beats sheer length. By standardizing units and then assembling the right structure for each domain, RLVR can “polarize” the model to the optimal style. This switching amplifier effect boosts performance without extra token bloat, yielding better accuracy and data efficiency.

04Experiments & Results

🍞 Hook: If two students study for the same time, the one using the right technique for each subject usually scores higher.

🥬 The Tests (what and why):

  • What: Five broad reward-model benchmarks (RewardBench v1/v2, RM-Bench, RMB, PPE), covering chat, code, math, factuality, instruction-following, and safety. They measure pairwise comparison accuracy—did the judge pick the genuinely better answer?
  • Why: These span both preference and correctness tasks, so they reveal whether Breadth vs. Depth is matched properly.

🍞 Anchor: It’s like a combined exam with writing prompts (preference) and math proofs (correctness).

The Competition:

  • Strong open-source baselines: Skywork-Reward (scalar), JudgeLRM, RM-R1 variants, DeepSeek-GRM, FARE-8B, and RubricRM-8B.
  • Some proprietary references appear for context (not directly comparable due to size and data).

Scoreboard with Context:

  • Mix-GRM (SFT only) averages about 75.1—already beating several RL-trained or massive-data baselines (e.g., topping RM-R1-Instruct and DeepSeek-GRM-16B, and edging RubricRM-8B), despite using only ~9K SFT samples.
  • After RLVR, Mix-GRM climbs to ~79.4 average, widening its lead over a comparable Base-GRM and establishing a new open-source SOTA across the five benchmarks, with around 8.2% average improvement over leading open models.
  • Data efficiency: FARE-8B uses ~2.5M examples to reach ~75.9, while Mix-GRM hits ~75.1 with ~9K—showing structure beats brute force.
  • Domain split (Preference vs. Correctness):
    • Breadth boosts Preference but can hurt Correctness if used alone.
    • Depth boosts Correctness but can hurt Preference if used alone.
    • Mix-GRM combines them and, after RLVR, outperforms single-mode models even on their home turf.

Downstream Utility (real-life tasks):

  • Offline RL (DPO): Using Mix-GRM to label preference pairs yields the best instruction-following win rates (top ~12.1) and best math averages (~46.4) among open RMs in the study, improving GSM8K to ~77.6% from ~75.1% SFT baseline.
  • Test-time scaling (Best-of-10): As a verifier, Mix-GRM consistently picks better solutions across tough math (MATH, CHAMP) and code (MBPP+, BigCodeBench), e.g., ~43.2% on MATH—like getting an A when others get B’s—beating RL-heavy (RM-R1) and data-heavy (FARE-8B) alternatives.
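Best-of-N reranking itself is a one-liner. A minimal sketch: `score` is a hypothetical stand-in for a Mix-GRM-style verifier call, and the candidates and scores below are made up for illustration.

```python
def best_of_n(candidates, score):
    """Best-of-N: a reward model scores each of N sampled solutions
    and the highest-scoring one is returned."""
    return max(candidates, key=score)

# Toy usage with a made-up scorer over three candidate solutions:
cands = ["x = 3", "x = 4", "x = 5"]
best = best_of_n(cands, score=lambda c: {"x = 4": 0.9}.get(c, 0.1))
print(best)  # x = 4
```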

Surprising Findings:

  • Misalignment hurts: Using Breadth on correctness or Depth on preference reliably drops accuracy—proof that “shape of thinking” matters more than “length.”
  • RLVR polarization: Even though RLVR only rewards the final verdict, the model learns to switch styles appropriately, boosting both domains.
  • Noise robustness: Training on unfiltered synthesized CoTs is nearly as good as filtered ones, simplifying data pipelines.

🍞 Anchor: Think of Mix-GRM as a coach who decides whether to run passing drills (Breadth) or free-throw practice (Depth) right before the game—and wins more often because of it.

05Discussion & Limitations

🍞 Hook: Even great tools have limits; you don’t use a paintbrush to hammer nails.

🥬 Limitations:

  • Coarse two-pole view: Splitting tasks into just Preference (Breadth) and Correctness (Depth) is useful but simplified. Real tasks can blend both (e.g., an essay that must be persuasive and fact-checked).
  • Rigidity on hybrids: RLVR’s helpful polarization can reduce flexibility for mixed demands unless the model learns soft routing or finer-grained hybrids.
  • Template sensitivity: While the schema helps, very unusual tasks may need new principles the current merge/dedup doesn’t capture well.

Required Resources:

  • A capable base model (e.g., 7–8B), modest SFT data (~9K), and an RLVR stage with rollouts and a reference policy. Token costs are similar across Breadth and Depth because both use about two reasoning passes.

When NOT to Use:

  • Tiny compute budgets where even two passes are too expensive.
  • Domains where ground-truth verdicts are unavailable or unreliable (RLVR needs verifiable labels).
  • Tasks demanding simultaneous rich style and airtight logic in a single pass without time for routing—current method may prefer one pole.

Open Questions:

  • Can we learn a continuous “reasoning manifold” with more than two axes (e.g., risk analysis, causal scrutiny, evidence tracing)?
  • How to do dynamic soft-routing inside one rationale (start broad, then go deep) under fixed token budgets?
  • Can we auto-discover new principles and self-calibrate Breadth size N per task difficulty?
  • How to provide uncertainty estimates so downstream systems know when to escalate to a human?

🍞 Anchor: The next step is like teaching the judge to start with a wide scan, zoom in where needed, and say how sure they are—without running out of time.

06Conclusion & Future Work

Three-Sentence Summary:

  1. This paper shows that reliable AI judging is not about longer explanations, but about using the right reasoning shape for the job: Breadth for multi-criteria preferences and Depth for step-checked correctness.
  2. Mix-GRM standardizes messy rationales into modular units, assembles Breadth-CoT or Depth-CoT as needed, and uses SFT plus RLVR to learn when to switch—achieving new open-source SOTA with far less data.
  3. RLVR acts like a switching amplifier, sharpening the match between task and mechanism and improving both benchmark scores and real downstream tasks.

Main Achievement:

  • Turning “length scaling” into “mechanism scaling” by synergizing Breadth and Depth within one generative reward model—and proving that structure beats sheer size for accuracy and efficiency.

Future Directions:

  • Develop finer-grained hybrid structures and soft-routing so one rationale can start broad and then go deep when evidence warrants it.
  • Expand beyond two poles toward a richer reasoning manifold (e.g., causality checks, safety risk audits, source attribution) with automatic principle discovery.
  • Add calibrated uncertainty so systems know when to defer to humans.

Why Remember This:

  • It reframes how we build judges: not longer thoughts, but better-shaped thoughts. By choosing the right tool—wide scan or deep proof—Mix-GRM teaches AIs to grade fairly, learn faster, and perform better in the real world.

Practical Applications

  • Improve customer support quality by using Breadth-CoT to weigh tone, empathy, clarity, and safety together.
  • Verify math and coding answers with Depth-CoT to avoid being fooled by polished but wrong solutions.
  • Rerank multiple candidate answers (Best-of-N) to boost accuracy in reasoning-heavy tasks.
  • Label higher-quality preference pairs for DPO training, producing better aligned assistant policies.
  • Build safer moderation by including Breadth principles (e.g., risk, sensitivity, language match) in evaluations.
  • Reduce data costs by structuring rationales (Principle–Judgment–Verdict) instead of collecting massive datasets.
  • Audit AI decisions with interpretable units so humans can see which principles drove the verdict.
  • Create domain-specific checklists by seeding Breadth-CoT with relevant principles (e.g., medical empathy, legal caution).
  • Automatically route tasks: trigger Depth-CoT for correctness-critical workflows and Breadth-CoT for stylistic ones.
  • Enhance exam-style grading systems by using Depth-CoT for proofs and Breadth-CoT for essays.
#Generative Reward Models #Chain-of-Thought #Breadth-CoT #Depth-CoT #Reinforcement Learning with Verifiable Rewards #Supervised Fine-Tuning #LLM-as-a-judge #Best-of-N reranking #Preference modeling #Correctness verification #Mechanism-adaptive alignment #Schema standardization #Test-time scaling #RLHF #GRPO