JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

Intermediate
Jiangshan Duo, Hanyu Li, Hailin Zhang et al. · 1/13/2026
arXiv · PDF

Key Summary

  • JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.
  • Compared to standard RLVR, it makes solutions shorter and clearer without adding length penalties.
  • On in-domain math, it raises accuracy by about 3.7 percentage points while cutting generation length by roughly 42%.
  • On out-of-domain tasks like science, coding, and general knowledge, it improves accuracy by about 4.5 percentage points.
  • The judging stage builds an "error radar" that helps the model avoid unhelpful detours when it later generates.
  • Ablations show judging alone is not enough; the two stages in order (judge, then generate) matter.
  • Language analysis shows fewer backtracking words like "but" and "however," suggesting cleaner, more decisive thinking.
  • Perplexity shifts indicate the model’s style really changes during the judging stage, not just its final answers.

Why This Research Matters

JudgeRLVR makes AI answers faster and clearer by teaching the model to spot good and bad solution patterns before it starts talking. This saves real money and time because shorter outputs mean fewer tokens and quicker responses. It also makes results easier to trust: less back-and-forth and more direct logic are simpler to review. The benefits extend beyond math to science questions, coding tasks, and instruction following, improving both accuracy and efficiency. By changing the model’s habits rather than forcing it to be brief, JudgeRLVR avoids the accuracy drop that often comes with length penalties. In classrooms, apps, or coding tools, this means better help that gets to the point and gets it right.

Detailed Explanation


01 Background & Problem Definition

You know how when you’re solving a puzzle, it’s tempting to try lots of random moves until something works? That’s what many AI models used to do: they would write long, wandering explanations, hoping to bump into the right answer eventually.

🍞 Hook: You know how teachers don’t just grade the final answer, but also care about whether your steps make sense?

🥬 The Concept (Reinforcement Learning with Verifiable Rewards, RLVR): RLVR is a way to train models by giving them a reward when the final answer can be automatically checked as correct.

  • How it works:
    1. Give the model a problem with a known correct answer.
    2. Let it try an answer and parse the final result from its text.
    3. If the final result matches the truth, give a reward; if not, no reward.
  • Why it matters: Without RLVR, models might only imitate examples; with RLVR, they can discover strategies by exploring. But if the only signal is the final answer, they often explore in a verbose, meandering way.

🍞 Anchor: Imagine a math app that only says “right” or “wrong” at the end. You’ll try lots of guesses and write long notes, but you won’t learn to spot a bad path early.
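
To make the reward rule concrete, here is a minimal sketch of a verifiable reward check, assuming the final answer is wrapped in \boxed{} and compared by exact string match; the helper names are illustrative, not the paper’s implementation.

```python
import re

def extract_final_answer(text: str) -> str | None:
    """Pull the contents of the last \\boxed{...} from a model response."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the parsed final answer matches the key, else 0.0."""
    predicted = extract_final_answer(response)
    return 1.0 if predicted is not None and predicted == gold_answer.strip() else 0.0

print(verifiable_reward(r"... so the area is \boxed{12}", "12"))  # 1.0
print(verifiable_reward(r"... therefore \boxed{10}", "12"))       # 0.0
```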

Before this paper, RLVR was already the go-to way to boost reasoning in Large Language Models (LLMs). It worked well on math and code because the correctness of an answer can be checked automatically. But there was a hitch: optimizing only for final correctness pushed models toward long, back-and-forth chain-of-thought (CoT) traces—like talking out loud endlessly, revising themselves with lots of “but,” “however,” and “let’s try again.” This style is expensive (more tokens) and often low in information density (more words, not more insight).

🍞 Hook: Imagine narrating every tiny thought while doing long division—painful to listen to and slow to finish.

🥬 The Concept (Chain-of-Thought Dynamics): CoT dynamics describe how a model writes out its step-by-step reasoning.

  • How it works:
    1. The model writes intermediate steps to reach a final answer.
    2. It may branch, backtrack, or correct itself.
    3. These dynamics can be concise and direct or long and looping.
  • Why it matters: Without healthy CoT dynamics, the model wastes time and tokens, and sometimes confuses itself.

🍞 Anchor: Think of a friend who solves a maze by trying every hallway and narrating every turn—even wrong ones. They’ll get there, but slowly.

People tried to tame verbosity with length penalties (for example, some RL variants and training recipes). Those penalties can reduce tokens, but they often chop off useful steps and hurt accuracy. So, teams faced a tough trade-off: shorter solutions that miss important checks, or longer ones that waste time and compute.

🍞 Hook: You know how expert chess players don’t try every move—they rule out bad moves in their heads before touching a piece?

🥬 The Concept (Discriminative Error Awareness): This is the model’s skill at telling good solutions from bad ones before fully committing.

  • How it works:
    1. Show the model candidate answers (short solution write-ups) with known correctness.
    2. Make it classify them as correct or incorrect, and explain briefly why.
    3. Reward it for accurate judgments.
  • Why it matters: Without this error-awareness, the model keeps expanding bad branches and wastes tokens.

🍞 Anchor: A math coach who can instantly spot a shaky step saves the student from doing a whole page of useless work.

What was missing was a way to build that inner coach first—and then let the model generate. That’s the gap this paper fills with JudgeRLVR: a two-stage paradigm where the model first learns to judge solution responses, and only afterward learns to generate solutions with RLVR. No explicit length penalty is used. Any conciseness is a side effect of better judgment.

🍞 Hook: After a spelling bee judge trains on spotting mistakes, they write their own essays with fewer errors.

🥬 The Concept (Feedback Mechanism): This is the loop where the model gets signals—“this verdict was right,” or “this final answer matched the key”—and updates itself.

  • How it works:
    1. Compare predicted verdicts or answers with ground truth.
    2. Give reward for matches.
    3. Adjust the policy to do more of what earned rewards.
  • Why it matters: Without clear feedback, the model can’t improve its decision boundaries or generation habits.

🍞 Anchor: Like getting graded quizzes back each week, so you stop repeating the same error on fractions.

Why should anyone care? Because shorter, cleaner reasoning saves time and money (fewer tokens), is easier to audit, and often generalizes better. In everyday terms, that means faster math help, more reliable coding assistance, and clearer answers across many tasks—even outside math. JudgeRLVR shows that if you teach an AI to judge first, it talks less and says more.

02 Core Idea

You know how great problem-solvers first decide which ideas are worth trying and only then dive in? That’s the heart of this paper.

Aha! Moment in one sentence: If a model first learns to judge which solution responses are valid, it will naturally prune bad paths when it later generates, producing shorter and better reasoning without extra penalties.

Three analogies:

  1. Referee first, player second: A soccer referee trains to spot fouls instantly; when that person later plays, they avoid risky moves and play clean, efficient soccer.
  2. Sorting laundry before washing: You separate whites and colors first; the wash goes smoothly afterward. Judging sorts good/bad solution patterns; generation then runs with fewer mishaps.
  3. GPS with built-in roadblocks: If the GPS already knows which roads are closed, it won’t even suggest them, giving you a direct route.

Before vs After:

  • Before (Vanilla RLVR): The model tries many reasoning branches, backtracks a lot, and often writes long CoTs. It gets rewarded only if the final answer is correct, so it may learn to search more but not necessarily smarter.
  • After (JudgeRLVR): The model has learned discriminative error awareness from judging. When it switches to generating, it uses that internal compass to avoid wrong turns. Outputs are typically shorter and more decisive, with higher or similar accuracy.

Why it works (intuition behind the math):

  • The judging stage builds a discriminative prior: a sense of “what a valid solution response looks like.”
  • This prior reshapes the model’s internal token probabilities so that early generation steps down-weight error-prone branches.
  • In the generating stage, standard RLVR then reinforces these pruned, high-yield paths using final-answer rewards.
  • No explicit length penalty is needed. Fewer detours naturally mean fewer tokens.

Building blocks (explained simply):

🍞 Hook: Imagine interviewing contestants before a contest to learn what a good performance looks like.

🥬 The Concept (Judge-Then-Generate Paradigm): First train the model to judge candidate solution responses; then fine-tune it to generate solutions using RLVR.

  • How it works:
    1. Stage 1 (Judge): Feed the model a problem and a short solution response; it writes a brief critique and outputs a verdict (1 = correct, 0 = incorrect). Reward it when the verdict matches the truth.
    2. Stage 2 (Generate): Initialize from those judge-trained weights. Now let the model solve problems end-to-end. Reward only if the final parsed answer is correct (standard RLVR).
    3. Transfer: The discriminative habits learned in Stage 1 guide Stage 2 to avoid bad branches.
  • Why it matters: Without judging first, the model may keep exploring too widely. Without generating second, the model might judge well but still write overly cautious, long outputs.

🍞 Anchor: A student first learns how to grade sample solutions. Later, when solving their own test, they avoid the mistakes they’ve learned to spot.
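
The skeleton below sketches the two-stage schedule. The policy object and its judge, generate, and update methods are placeholders for illustration; the actual training uses DAPO-style policy-gradient updates, which are abstracted away here.

```python
def judge_reward(verdict: int, true_label: int) -> float:
    """Stage 1 reward: 1 when the 0/1 verdict matches the known label."""
    return 1.0 if verdict == true_label else 0.0

def answer_reward(final_answer: str, gold_answer: str) -> float:
    """Stage 2 reward: standard RLVR check on the parsed final answer."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def train_judge_rlvr(policy, judge_data, gen_data, judge_steps, gen_steps):
    # Stage 1 (Judge): problem + candidate solution response -> critique + verdict.
    for _ in range(judge_steps):
        batch = judge_data.sample()                       # items carry (x, z, true_label)
        rollouts = policy.judge(batch)                    # each rollout has .verdict, .true_label
        rewards = [judge_reward(r.verdict, r.true_label) for r in rollouts]
        policy.update(rollouts, rewards)                  # GRPO/DAPO-style update (abstracted)

    # Stage 2 (Generate): keep the judge-trained weights, then run vanilla RLVR.
    for _ in range(gen_steps):
        batch = gen_data.sample()                         # items carry (x, gold_answer)
        rollouts = policy.generate(batch)                 # each rollout has .final_answer, .gold_answer
        rewards = [answer_reward(r.final_answer, r.gold_answer) for r in rollouts]
        policy.update(rollouts, rewards)
    return policy
```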

Putting it together: JudgeRLVR is not about punishing length—it’s about teaching the model to recognize quality before it speaks. That’s why it achieves a better quality–efficiency trade-off: higher accuracy on several benchmarks and significantly fewer tokens on average, especially in math. And because the idea is general—“learn to filter, then learn to produce”—it also transfers to science questions, coding problems, and beyond.

03 Methodology

At a high level: Problem → Stage 1 (Judge: critique + verdict) → Stage 2 (Generate: full solution) → Final answer (rewarded if correct).

Stage 1: Judging stage (building the inner coach)

  • Input: A math problem x and a candidate solution response z (a relatively concise, cleaned-up solution write-up that ends with an answer string). The ground-truth final answer a★(x) is known.
  • Output: The model writes a short commentary c (its reasoning about z) and then a verdict v ∈ {0,1} (1 = correct, 0 = incorrect).
  • Reward: 1 if v matches the true label ℓ(x,z) = I[a(z) = a★(x)], else 0.
  • Optimization: A GRPO-family policy gradient method (DAPO) updates the same policy that writes the commentary and the verdict.
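
A minimal sketch of the judging-stage reward, assuming the verdict is emitted inside \boxed{} (as the judging prompt described later requests); the parsing details and helper names are assumptions, not the paper’s code.

```python
import re

def parse_verdict(judge_output: str) -> int | None:
    """Pull the last boxed 0/1 verdict from the judge's commentary."""
    found = re.findall(r"\\boxed\{([01])\}", judge_output)
    return int(found[-1]) if found else None

def judge_stage_reward(judge_output: str, candidate_answer: str, gold_answer: str) -> float:
    """Reward 1 when the verdict matches the true label l(x, z) = 1[a(z) = a*(x)]."""
    true_label = int(candidate_answer.strip() == gold_answer.strip())
    return 1.0 if parse_verdict(judge_output) == true_label else 0.0

# Example: the judge rejects a wrong polar answer, matching the true label -> 1.0
print(judge_stage_reward(r"The angle is wrong. \boxed{0}", "(3, 3π/2)", "(3, π/2)"))
```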

Why a separate judging dataset? Because raw, full CoT traces can be extremely long and noisy. By using solution responses (cleaned, direct write-ups), we help the model focus on correctness signals without drowning in distractions.

Data construction for judging (like building a good practice set):

  • Rollouts: For each problem, sample multiple candidate solution responses from diverse models (e.g., MiMo-7B RL and the target SFT model), yielding a mix of correct and incorrect attempts.

🍞 Hook: Think of picking practice questions that are not too easy, not too hard.

🥬 The Concept (Hard Negative Mining): Choose problems where models sometimes succeed and sometimes fail, so the mistakes are informative.

  • How it works:
    1. Measure pass rates for each problem under rollouts.
    2. Keep those with intermediate pass rates (neither 0% nor 100%).
    3. Prioritize “nearly-correct” errors, which teach the judge to be sharp.
  • Why it matters: Without hard negatives, the judge learns less—easy wins or impossible cases don’t teach fine distinctions.

🍞 Anchor: A coach trains you on the tricky moves you almost get right—because that’s where you grow fastest.
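
A small sketch of the pass-rate filter; the thresholds and data structures are assumptions for illustration.

```python
def select_judge_problems(pass_rates: dict[str, float]) -> list[str]:
    """Keep problems with an intermediate pass rate: neither 0% nor 100%."""
    return [pid for pid, rate in pass_rates.items() if 0.0 < rate < 1.0]

# Example: 16 rollouts per problem reduced to a pass rate in [0, 1].
pass_rates = {"p1": 0.0, "p2": 0.5, "p3": 1.0, "p4": 0.8125}
print(select_judge_problems(pass_rates))  # ['p2', 'p4'] -- the informative problems
```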

🍞 Hook: Balance is key—if you only see winners or only see mistakes, your sense of fairness breaks.

🥬 The Concept (Class Balancing): Keep an even number of correct and incorrect samples per problem.

  • How it works:
    1. Subsample to match positives (correct) and negatives (incorrect).
    2. Avoid majority-class bias (e.g., always guessing “incorrect”).
    3. Ensure the judge learns to discriminate, not default.
  • Why it matters: Without balancing, the model may be rewarded for guessing the majority class, not for true understanding.

🍞 Anchor: If a quiz had 90% true answers, you could pass by guessing “true”—but you wouldn’t learn logic.
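
A per-problem balancing sketch; the candidate representation here is an assumption, not the paper’s data format.

```python
import random

def balance_classes(candidates, seed=0):
    """candidates: list of (solution_text, is_correct) pairs for one problem."""
    rng = random.Random(seed)
    pos = [c for c in candidates if c[1]]      # correct solution responses
    neg = [c for c in candidates if not c[1]]  # incorrect solution responses
    k = min(len(pos), len(neg))                # keep equal counts per class
    return rng.sample(pos, k) + rng.sample(neg, k)
```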

Judging prompt and behavior:

  • Prompt gives the question and the proposed answer (solution response) and asks for concise analysis plus a boxed verdict (1/0).
  • The commentary c trains the model to articulate correctness cues (e.g., matching final answer, logical consistency), while the verdict v is what earns the reward.

Stage 2: Generating stage (putting the coach to work)

  • Initialization: Copy weights from the trained judge.
  • Input: A problem x.
  • Output: The full chain-of-thought and a solution response z that ends with a parsed final answer a(z).
  • Reward: 1 if a(z) = a★(x), else 0 (vanilla RLVR).
  • Prompt: “Let’s think step by step and output the final answer within \boxed{}.”

Why this step exists: The judge knows what looks correct, but still needs to practice solving. RLVR in this stage connects discrimination to action: it strengthens the paths that end in verified correct answers while the judge’s prior discourages wasteful branches.

Concrete example (toy):

  • Problem: Convert (0,3) to polar coordinates.
  • During judging stage: The model sees a short solution response claiming r=3 and θ=π/2 and judges it as correct (with a brief why). It also sees a wrong one (e.g., θ=3π/2) and labels it incorrect.
  • During generating stage: When solving from scratch, the model quickly picks r=3 and θ=π/2 without lengthy detours, because paths leading to θ=3π/2 have been implicitly down-weighted.
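
A quick numeric check of the toy example, using only the standard library (not anything from the paper):

```python
import math

x, y = 0.0, 3.0                     # the Cartesian point (0, 3)
r = math.hypot(x, y)                # r = 3.0
theta = math.atan2(y, x)            # theta = pi/2 ≈ 1.5708 rad
print(r, math.isclose(theta, math.pi / 2))   # 3.0 True
```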

Training setup (simplified):

  • Model: Qwen3-30B-A3B (MoE), first SFTed on CoT data for basic reasoning.
  • Data: ~113k math problems with gold answers; judge data built from rollouts (16 paths/problem), hard negatives, and balanced classes.
  • Optimization: DAPO with dynamic sampling (filter trivially easy/hard problems), AdamW, small LR, high token budget but no explicit length penalty.
  • Equal total steps for Vanilla RLVR vs JudgeRLVR; JudgeRLVR splits steps between judging and generating.
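
For orientation, here is an illustrative sketch of two ingredients mentioned above, not the paper’s code: a GRPO-style group-relative advantage and a dynamic-sampling filter that drops prompts whose rollouts are all correct or all wrong.

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def keep_prompt(rewards: list[float]) -> bool:
    """Dynamic sampling: keep only prompts with a mix of successes and failures."""
    return 0.0 < statistics.fmean(rewards) < 1.0

print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # positive for correct rollouts, negative for wrong
print(keep_prompt([1.0, 1.0, 1.0, 1.0]))       # False: trivially easy, filtered out
```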

The secret sauce (why this is clever):

  • It reorders learning: first learn to say “this is a solid solution,” then learn to produce those solid solutions.
  • It removes the need for hand-crafted length penalties by shifting the model’s internal preferences.
  • It yields a style shift: fewer explicit backtracks (“but,” “however,” “wait”), more direct paths.
  • It generalizes: the “avoid detours” instinct helps outside math (science reasoning, coding) too.

04 Experiments & Results

The test: Does judge-then-generate beat vanilla generate-only RLVR on both quality (accuracy) and efficiency (shorter outputs)? The team measured accuracy and average generation length across math and non-math benchmarks.

Competition: Baselines were (1) Base SFT (no RL), to show how much RL changes behavior and ability; and (2) Vanilla RLVR (final-answer-only reward), the strong standard.

Scoreboard with context:

  • In-domain math (AIME24/25, HMMT_feb_2025, BeyondAIME, MATH500): JudgeRLVR improved average accuracy by about +3.7 percentage points while cutting average generation length by ~42% versus Vanilla RLVR. That’s like getting higher grades while writing half as much—clearer and smarter, not just shorter.
    • On challenging sets like HMMT_feb_2025 and BeyondAIME, gains were largest, suggesting that judging helps most when reasoning chains are longer and choices matter more.
    • On MATH500 (a saturated set), JudgeRLVR slightly reduced accuracy but drastically shortened outputs, indicating much lower compute cost for nearly the same correctness.
  • Out-of-domain (GPQA Diamond for science, LiveCodeBenchv6 for coding, MMLU-Redux for knowledge, IFEval for instruction following, ZebraLogic for logic): JudgeRLVR delivered about +4.5 points average accuracy improvement. Length often decreased (GPQA, LiveCodeBenchv6, MMLU-Redux), showing better generalization with fewer tokens. On IFEval and ZebraLogic, accuracy rose but outputs sometimes got longer—likely because strict formatting or rule-checking needs more explicit steps. Importantly, the method aims to reduce unproductive verbosity, not necessary structure.

Surprising findings:

  • Judging alone isn’t enough. In “Judge Only,” accuracy often dropped and outputs got longer on math. The model became cautious (writing out checks), which can bloat tokens. This confirms the two-stage order matters: judge first to learn the filter, then generate to turn that filter into efficient action.
  • Mixing judging and generating simultaneously (1:1) hurt stability. The “Mixed Strategy” sometimes matched or beat JudgeRLVR on a task, but overall produced longer outputs and worse generalization. Interleaving two objectives likely confused the policy, preventing a clean internal “judge-then-generate” routine from forming.

Measuring the mechanism (style and backtracking):

🍞 Hook: You can tell a person’s style by how they talk—so can we do that for models?

🥬 The Concept (Perplexity, PPL): A number that tells how surprising a sequence is to a reference model; lower PPL means the text feels more familiar to that reference.

  • How it works:
    1. Fix a reference (Base SFT) model.
    2. Feed it outputs sampled during training from different methods.
    3. Track PPL over steps; changes signal style shifts.
  • Why it matters: Without a style metric, we can’t tell if the model’s language habits really changed.

🍞 Anchor: If your diary suddenly uses lots of new words, your friend reading it might feel more surprised (higher PPL).
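
A minimal sketch of the PPL measurement using Hugging Face transformers; the reference checkpoint name is a placeholder standing in for the Base SFT model.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_name = "your-base-sft-checkpoint"            # placeholder reference model
tok = AutoTokenizer.from_pretrained(ref_name)
ref = AutoModelForCausalLM.from_pretrained(ref_name).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of a sampled output under the fixed reference model."""
    ids = tok(text, return_tensors="pt").input_ids
    loss = ref(ids, labels=ids).loss              # mean token cross-entropy
    return math.exp(loss.item())

# Track perplexity(output) over training steps: a rising value under the
# Base SFT reference signals a shift in the policy's writing style.
```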

Result: During JudgeRLVR’s judging stage, PPL measured by the Base SFT evaluator increased versus Vanilla RLVR’s flat line. That means the judging stage truly changed the style distribution, not just outcomes.

🍞 Hook: You can spot backtracking by the words people use—like “but,” “however,” or “wait.”

🥬 The Concept (Transition-Word Statistics): Counting certain discourse markers to estimate how much the model is backtracking or contradicting itself.

  • How it works:
    1. Sample many outputs along training.
    2. Count contrast/backtracking markers (e.g., but, however, wait).
    3. Track counts and frequencies over steps.
  • Why it matters: Without this, we only see final answers, not how the model got there.

🍞 Anchor: If a student says “but” every other sentence, they might be second-guessing themselves a lot.
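
A simple counter along these lines; the exact marker list used in the paper may differ from this illustrative one.

```python
import re

BACKTRACK_MARKERS = {"but", "however", "wait", "alternatively", "actually"}

def backtrack_stats(text: str) -> tuple[int, float]:
    """Return (count, frequency) of backtracking markers in one output."""
    words = re.findall(r"[a-z']+", text.lower())
    count = sum(1 for w in words if w in BACKTRACK_MARKERS)
    return count, count / max(len(words), 1)

print(backtrack_stats("Wait, that is wrong. But actually the sum is 12."))  # (3, 0.3)
```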

Result: In JudgeRLVR’s generating stage, both the absolute count and frequency of such markers dropped substantially. This supports the claim that the model moved from external trial-and-error to more internal decision-making, leading to cleaner, more confident reasoning.

Bottom line: JudgeRLVR outperforms vanilla RLVR on accuracy and efficiency in math, generalizes better to new domains, and shows linguistic signs of healthier thinking patterns—fewer detours, more direct routes.

05 Discussion & Limitations

Limitations:

  • Task dependence: Some tasks (e.g., strict instruction following or symbolic logic) may require explicit checks, which can lengthen outputs even as accuracy improves.
  • Judging isn’t a magic wand: The “Judge Only” ablation shows that discrimination alone can lead to cautious verbosity. The second stage is essential to convert judgment into efficient generation.
  • Data construction matters: Good hard negatives and class balance are critical. Poorly curated judge data may teach the wrong cues.
  • Final-answer-only reward remains sparse: While the judge prior helps, the generating stage still relies on a binary correctness signal; nuanced partial-credit feedback could help further.
  • Model size and compute: JudgeRLVR was tested on a strong MoE model with large token budgets. Smaller models or tight budgets may see different trade-offs.

Required resources:

  • A base SFT model with reasonable reasoning skill.
  • Problems with verifiable answers (so correctness can be automatically checked).
  • A rollout pipeline to collect diverse candidate solution responses.
  • RL infrastructure (e.g., DAPO/GRPO-family methods), and enough compute for two-stage training.

When NOT to use:

  • Tasks without reliable automatic verifiers (no clear final answers), where labels are subjective or noisy.
  • Settings demanding ultra-short outputs at any cost—JudgeRLVR reduces unproductive verbosity, but won’t force brevity if structure is necessary.
  • Extremely small datasets with few informative negatives; the judge may not learn sharp boundaries.

Open questions:

  • Can we design richer, process-aware rewards (beyond final answers) that work harmoniously with the judge prior?
  • How far does judgment transfer? Which non-math domains benefit most, and why?
  • Can smaller models gain more (or less) from judge-first training?
  • What’s the best way to build judge datasets at scale—can we automate hard negative mining even more effectively?
  • Could multi-stage variants (judge → plan → generate) improve further, or do they overcomplicate learning?

06 Conclusion & Future Work

Three-sentence summary: JudgeRLVR trains a model to judge solution responses before it learns to generate full answers. This judge-first step teaches the model to spot and prune bad paths, so later generation becomes shorter, clearer, and often more accurate—without explicit length penalties. Across math and other domains, it delivers a better quality–efficiency trade-off and shows linguistic signs of healthier reasoning.

Main achievement: Turning discriminative error awareness into a transferable prior that reshapes generation, yielding around +3.7 pp math accuracy with ~42% fewer tokens, and about +4.5 pp gains out of domain versus vanilla RLVR.

Future directions: Combine judge-first with process-level rewards, explore scaling to more domains, refine data construction (hard negatives and balancing), and test adaptations for smaller models and tighter compute budgets. Investigate extensions like judge → plan → generate, and richer signals than binary correctness.

Why remember this: It reframes efficient reasoning as a two-step dance—learn to judge, then learn to speak. By moving verification upstream, the model stops wasting words on dead ends, giving us faster, clearer, and more generalizable solutions. In simple terms: think like a judge first, and the right answer becomes the only thing left to say.

Practical Applications

  • Build math tutors that deliver concise, correct solutions with less verbose reasoning.
  • Speed up code assistants by pruning unpromising code paths before generating fixes.
  • Create science Q&A helpers that avoid meandering explanations and focus on solid steps.
  • Develop grading assistants that first learn to judge solution quality, then provide model answers aligned with that judgment.
  • Reduce cloud costs in production LLMs by cutting average generation length while maintaining or improving accuracy.
  • Improve instruction-following agents by letting them learn to detect format/constraint violations early.
  • Deploy verification-aware copilots that self-filter weak hypotheses before elaborating on them.
  • Enhance dataset curation by mining hard negatives and balancing classes to train sharper evaluators.
  • Use style diagnostics (PPL and transition-word counts) to monitor and enforce healthy reasoning patterns over time.
#RLVR · #judge-then-generate · #discriminative supervision · #chain-of-thought · #hard negative mining · #class balancing · #policy gradient · #DAPO · #perplexity · #backtracking markers · #reasoning efficiency · #verifiable rewards · #math reasoning · #generalization · #mixture-of-experts