JudgeRLVR: Judge First, Generate Second for Efficient Reasoning
Key Summary
- JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.
- Compared to standard RLVR, it makes solutions shorter and clearer without adding length penalties.
- On in-domain math, it raises accuracy by about 3.7 percentage points while cutting generation length by roughly 42%.
- On out-of-domain tasks like science, coding, and general knowledge, it improves accuracy by about 4.5 percentage points.
- The judging stage builds an "error radar" that helps the model avoid unhelpful detours when it later generates.
- Ablations show judging alone is not enough; the two stages in order (judge, then generate) matter.
- Language analysis shows fewer backtracking words like "but" and "however," suggesting cleaner, more decisive thinking.
- Perplexity shifts indicate the model's style really changes during the judging stage, not just its final answers.
Why This Research Matters
JudgeRLVR makes AI answers faster and clearer by teaching the model to spot good and bad solution patterns before it starts talking. This saves real money and time because shorter outputs mean fewer tokens and quicker responses. It also makes results easier to trust: less back-and-forth and more direct logic are simpler to review. The benefits extend beyond math to science questions, coding tasks, and instruction following, improving both accuracy and efficiency. By changing the model's habits rather than forcing it to be brief, JudgeRLVR avoids the accuracy drop that often comes with length penalties. In classrooms, apps, or coding tools, this means better help that gets to the point and gets it right.
Detailed Explanation
01 Background & Problem Definition
You know how when you're solving a puzzle, it's tempting to try lots of random moves until something works? That's what many AI models used to do: they would write long, wandering explanations, hoping to bump into the right answer eventually.
Hook: You know how teachers don't just grade the final answer, but also care about whether your steps make sense?
The Concept (Reinforcement Learning with Verifiable Rewards, RLVR): RLVR is a way to train models by giving them a reward when the final answer can be automatically checked as correct.
- How it works:
- Give the model a problem with a known correct answer.
- Let it try an answer and parse the final result from its text.
- If the final result matches the truth, give a reward; if not, no reward.
- Why it matters: Without RLVR, models might only imitate examples; with RLVR, they can discover strategies by exploring. But if the only signal is the final answer, they often explore in a verbose, meandering way.
Anchor: Imagine a math app that only says "right" or "wrong" at the end. You'll try lots of guesses and write long notes, but you won't learn to spot a bad path early.
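To make that "right or wrong at the end" signal concrete, here is a minimal sketch of a verifiable reward in Python: extract the final boxed answer from the model's text and compare it with the gold answer. Real verifiers normalize expressions (fractions, units, symbolic forms) far more carefully, so treat this as an illustration rather than the paper's checker.

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Return 1.0 if the model's final \\boxed{...} answer matches the gold answer,
    else 0.0. A minimal RLVR-style verifier sketch."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0                       # no parsable final answer, no reward
    predicted = matches[-1].strip()      # take the last boxed expression
    return 1.0 if predicted == gold_answer.strip() else 0.0

print(verifiable_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
print(verifiable_reward(r"... maybe \boxed{41}", "42"))             # 0.0
```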
Before this paper, RLVR was already the go-to way to boost reasoning in Large Language Models (LLMs). It worked well on math and code because the correctness of an answer can be checked automatically. But there was a hitch: optimizing only for final correctness pushed models toward long, back-and-forth chain-of-thought (CoT) traces, like talking out loud endlessly, revising themselves with lots of "but," "however," and "let's try again." This style is expensive (more tokens) and often low in information density (more words, not more insight).
Hook: Imagine narrating every tiny thought while doing long division: painful to listen to and slow to finish.
The Concept (Chain-of-Thought Dynamics): CoT dynamics describe how a model writes out its step-by-step reasoning.
- How it works:
- The model writes intermediate steps to reach a final answer.
- It may branch, backtrack, or correct itself.
- These dynamics can be concise and direct or long and looping.
- Why it matters: Without healthy CoT dynamics, the model wastes time and tokens, and sometimes confuses itself.
Anchor: Think of a friend who solves a maze by trying every hallway and narrating every turn, even the wrong ones. They'll get there, but slowly.
People tried to tame verbosity with length penalties (for example, some RL variants and training recipes). Those penalties can reduce tokens, but they often chop off useful steps and hurt accuracy. So, teams faced a tough trade-off: shorter solutions that miss important checks, or longer ones that waste time and compute.
Hook: You know how expert chess players don't try every move; they rule out bad moves in their heads before touching a piece?
The Concept (Discriminative Error Awareness): This is the model's skill at telling good solutions from bad ones before fully committing.
- How it works:
- Show the model candidate answers (short solution write-ups) with known correctness.
- Make it classify them as correct or incorrect, and explain briefly why.
- Reward it for accurate judgments.
- Why it matters: Without this error-awareness, the model keeps expanding bad branches and wastes tokens.
Anchor: A math coach who can instantly spot a shaky step saves the student from doing a whole page of useless work.
What was missing was a way to build that inner coach first, and then let the model generate. That's the gap this paper fills with JudgeRLVR: a two-stage paradigm where the model first learns to judge solution responses, and only afterward learns to generate solutions with RLVR. No explicit length penalty is used. Any conciseness is a side effect of better judgment.
Hook: After a spelling bee judge trains on spotting mistakes, they write their own essays with fewer errors.
The Concept (Feedback Mechanism): This is the loop where the model gets signals ("this verdict was right," or "this final answer matched the key") and updates itself.
- How it works:
- Compare predicted verdicts or answers with ground truth.
- Give reward for matches.
- Adjust the policy to do more of what earned rewards.
- Why it matters: Without clear feedback, the model can't improve its decision boundaries or generation habits.
Anchor: Like getting graded quizzes back each week, so you stop repeating the same error on fractions.
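As a toy illustration of "do more of what earned rewards," here is a tiny REINFORCE-style update on a two-choice problem. It is intuition only, not the paper's optimizer (the paper trains with a GRPO-family method, DAPO).

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)   # preferences for two "habits"; only habit 1 earns reward
lr = 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(200):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)        # sample a behavior from the policy
    reward = 1.0 if action == 1 else 0.0   # feedback: did it match the key?
    grad = -probs
    grad[action] += 1.0                    # gradient of log pi(action) w.r.t. logits
    logits += lr * reward * grad           # reinforce rewarded behavior only

print(softmax(logits))  # probability mass has shifted toward the rewarded habit
```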
Why should anyone care? Because shorter, cleaner reasoning saves time and money (fewer tokens), is easier to audit, and often generalizes better. In everyday terms, that means faster math help, more reliable coding assistance, and clearer answers across many tasks, even outside math. JudgeRLVR shows that if you teach an AI to judge first, it talks less and says more.
02 Core Idea
You know how great problem-solvers first decide which ideas are worth trying and only then dive in? That's the heart of this paper.
Aha! Moment in one sentence: If a model first learns to judge which solution responses are valid, it will naturally prune bad paths when it later generates, producing shorter and better reasoning without extra penalties.
Three analogies:
- Referee first, player second: A soccer referee trains to spot fouls instantly; when that person later plays, they avoid risky moves and play clean, efficient soccer.
- Sorting laundry before washing: You separate whites and colors first; the wash goes smoothly afterward. Judging sorts good/bad solution patterns; generation then runs with fewer mishaps.
- GPS with built-in roadblocks: If the GPS already knows which roads are closed, it won't even suggest them, giving you a direct route.
Before vs After:
- Before (Vanilla RLVR): The model tries many reasoning branches, backtracks a lot, and often writes long CoTs. It gets rewarded only if the final answer is correct, so it may learn to search more but not necessarily smarter.
- After (JudgeRLVR): The model has learned discriminative error awareness from judging. When it switches to generating, it uses that internal compass to avoid wrong turns. Outputs are typically shorter and more decisive, with higher or similar accuracy.
Why it works (intuition behind the math):
- The judging stage builds a discriminative prior: a sense of "what a valid solution response looks like."
- This prior reshapes the model's internal token probabilities so that early generation steps down-weight error-prone branches.
- In the generating stage, standard RLVR then reinforces these pruned, high-yield paths using final-answer rewards.
- No explicit length penalty is needed. Fewer detours naturally means fewer tokens.
Building blocks (explained simply):
Hook: Imagine interviewing contestants before a contest to learn what a good performance looks like.
The Concept (Judge-Then-Generate Paradigm): First train the model to judge candidate solution responses; then fine-tune it to generate solutions using RLVR.
- How it works:
- Stage 1 (Judge): Feed the model a problem and a short solution response; it writes a brief critique and outputs a verdict (1 = correct, 0 = incorrect). Reward it when the verdict matches the truth.
- Stage 2 (Generate): Initialize from those judge-trained weights. Now let the model solve problems end-to-end. Reward only if the final parsed answer is correct (standard RLVR).
- Transfer: The discriminative habits learned in Stage 1 guide Stage 2 to avoid bad branches.
- Why it matters: Without judging first, the model may keep exploring too widely. Without generating second, the model might judge well but still write overly cautious, long outputs.
Anchor: A student first learns how to grade sample solutions. Later, when solving their own test, they avoid the mistakes they've learned to spot.
Putting it together: JudgeRLVR is not about punishing length; it's about teaching the model to recognize quality before it speaks. That's why it achieves a better quality–efficiency trade-off: higher accuracy on several benchmarks and significantly fewer tokens on average, especially in math. And because the idea is general ("learn to filter, then learn to produce"), it also transfers to science questions, coding problems, and beyond.
03 Methodology
At a high level: Problem → Stage 1 (Judge: critique + verdict) → Stage 2 (Generate: full solution) → Final answer (rewarded if correct).
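In code form, that pipeline looks roughly like the sketch below. The policy methods (`judge`, `generate`, `update`) and the data samplers are hypothetical placeholders, not the paper's actual training interface; the point is the ordering of the two stages and their rewards.

```python
def judge_reward(verdict: int, label: int) -> float:
    """Stage 1 reward: 1 if the predicted verdict matches the true label."""
    return 1.0 if verdict == label else 0.0

def answer_reward(predicted: str, gold: str) -> float:
    """Stage 2 reward: 1 if the parsed final answer matches the gold answer."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0

def train_judge_rlvr(policy, judge_data, gen_data, judge_steps: int, gen_steps: int):
    # Stage 1 (Judge): critique a candidate solution response, emit a verdict.
    for _ in range(judge_steps):
        problem, candidate, label = judge_data.sample()
        commentary, verdict = policy.judge(problem, candidate)   # hypothetical call
        policy.update(reward=judge_reward(verdict, label))
    # Stage 2 (Generate): start from the judge-trained weights, solve end-to-end.
    for _ in range(gen_steps):
        problem, gold = gen_data.sample()
        solution, final_answer = policy.generate(problem)        # hypothetical call
        policy.update(reward=answer_reward(final_answer, gold))
    return policy
```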
Stage 1: Judging stage (building the inner coach)
- Input: A math problem x and a candidate solution response z (a relatively concise, cleaned-up solution write-up that ends with an answer string). The ground-truth final answer a*(x) is known.
- Output: The model writes a short commentary c (its reasoning about z) and then a verdict v ∈ {0,1} (1 = correct, 0 = incorrect).
- Reward: 1 if v matches the true label ℓ(x,z) = I[a(z) = a*(x)], else 0.
- Optimization: A GRPO-family policy gradient method (DAPO) updates the same policy that writes the commentary and the verdict.
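Concretely, the judging-stage reward boils down to parsing the boxed verdict and comparing it with the label ℓ(x,z). A minimal sketch; the exact output format and parsing rules here are assumptions.

```python
import re

def judge_stage_reward(judge_output: str, true_label: int) -> float:
    """Reward the judge only when its boxed verdict (1 or 0) matches the
    ground-truth label l(x, z) = I[a(z) = a*(x)]."""
    verdicts = re.findall(r"\\boxed\{\s*([01])\s*\}", judge_output)
    if not verdicts:
        return 0.0                  # no parsable verdict, no reward
    return 1.0 if int(verdicts[-1]) == true_label else 0.0

print(judge_stage_reward(r"The final answer matches. \boxed{1}", 1))  # 1.0
print(judge_stage_reward(r"Sign error in step 2. \boxed{1}", 0))      # 0.0
```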
Why a separate judging dataset? Because raw, full CoT traces can be extremely long and noisy. By using solution responses (cleaned, direct write-ups), we help the model focus on correctness signals without drowning in distractions.
Data construction for judging (like building a good practice set):
- Rollouts: For each problem, sample multiple candidate solution responses from diverse models (e.g., MiMo-7B RL and the target SFT model), yielding a mix of correct and incorrect attempts.
Hook: Think of picking practice questions that are not too easy, not too hard.
The Concept (Hard Negative Mining): Choose problems where models sometimes succeed and sometimes fail, so the mistakes are informative.
- How it works:
- Measure pass rates for each problem under rollouts.
- Keep those with intermediate pass rates (neither 0% nor 100%).
- Prioritize "nearly-correct" errors, which teach the judge to be sharp.
- Why it matters: Without hard negatives, the judge learns less; easy wins or impossible cases don't teach fine distinctions.
Anchor: A coach trains you on the tricky moves you almost get right, because that's where you grow fastest.
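A small sketch of that selection step, assuming each problem comes with a list of pass/fail rollout outcomes; the 10-90% thresholds are illustrative, not the paper's values.

```python
def select_judge_problems(rollouts_by_problem: dict[str, list[bool]],
                          low: float = 0.1, high: float = 0.9) -> list[str]:
    """Keep problems with intermediate pass rates, where rollouts mix
    correct and incorrect attempts."""
    selected = []
    for problem, outcomes in rollouts_by_problem.items():
        pass_rate = sum(outcomes) / len(outcomes)
        if low <= pass_rate <= high:   # neither always solved nor never solved
            selected.append(problem)
    return selected

rollouts = {"easy": [True] * 16, "impossible": [False] * 16,
            "informative": [True] * 5 + [False] * 11}
print(select_judge_problems(rollouts))   # ['informative']
```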
Hook: Balance is key; if you only see winners or only see mistakes, your sense of fairness breaks.
The Concept (Class Balancing): Keep an even number of correct and incorrect samples per problem.
- How it works:
- Subsample to match positives (correct) and negatives (incorrect).
- Avoid majority-class bias (e.g., always guessing "incorrect").
- Ensure the judge learns to discriminate, not default.
- Why it matters: Without balancing, the model may be rewarded for guessing the majority class, not for true understanding.
Anchor: If a quiz had 90% true answers, you could pass by guessing "true," but you wouldn't learn logic.
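A sketch of the balancing step for one problem's (solution response, label) pairs; the exact subsampling procedure in the paper may differ.

```python
import random

def balance_per_problem(candidates: list[tuple[str, int]], seed: int = 0):
    """Subsample (solution_response, label) pairs so correct (1) and
    incorrect (0) examples appear in equal numbers for this problem."""
    rng = random.Random(seed)
    positives = [c for c in candidates if c[1] == 1]
    negatives = [c for c in candidates if c[1] == 0]
    k = min(len(positives), len(negatives))
    balanced = rng.sample(positives, k) + rng.sample(negatives, k)
    rng.shuffle(balanced)
    return balanced

pairs = [("r=3, theta=pi/2", 1), ("r=3, theta=3pi/2", 0), ("r=9, theta=pi/2", 0)]
print(balance_per_problem(pairs))   # one positive and one negative remain
```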
Judging prompt and behavior:
- Prompt gives the question and the proposed answer (solution response) and asks for concise analysis plus a boxed verdict (1/0).
- The commentary c trains the model to articulate correctness cues (e.g., matching final answer, logical consistency), while the verdict v is what earns the reward.
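A plausible shape for that judging prompt, written as a Python template. The wording is paraphrased and should be treated as an assumption, not the paper's exact prompt.

```python
JUDGE_PROMPT = """You are given a problem and a proposed solution.
Briefly analyze whether the solution is correct, then output your verdict
within \\boxed{{}}: 1 if the solution is correct, 0 if it is incorrect.

Problem:
{problem}

Proposed solution:
{solution}
"""

print(JUDGE_PROMPT.format(
    problem="Convert the point (0, 3) to polar coordinates.",
    solution="r = 3 and theta = pi/2, so the answer is (3, pi/2).",
))
```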
Stage 2: Generating stage (putting the coach to work)
- Initialization: Copy weights from the trained judge.
- Input: A problem x.
- Output: The full chain-of-thought and a solution response z that ends with a parsed final answer a(z).
- Reward: 1 if a(z) = a*(x), else 0 (vanilla RLVR).
- Prompt: "Let's think step by step and output the final answer within \boxed{}."
Why this step exists: The judge knows what looks correct, but still needs to practice solving. RLVR in this stage connects discrimination to action: it strengthens the paths that end in verified correct answers while the judge's prior discourages wasteful branches.
Concrete example (toy):
- Problem: Convert (0,3) to polar coordinates.
- During the judging stage: The model sees a short solution response claiming r=3 and θ=π/2 and judges it as correct (with a brief why). It also sees a wrong one (e.g., θ=3π/2) and labels it incorrect.
- During the generating stage: When solving from scratch, the model quickly picks r=3 and θ=π/2 without lengthy detours, because paths leading to θ=3π/2 have been implicitly down-weighted.
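For reference, the arithmetic behind this toy example can be checked in a few lines of plain Python (independent of the paper):

```python
import math

def to_polar(x: float, y: float) -> tuple[float, float]:
    """Convert Cartesian (x, y) to polar (r, theta) with theta in [0, 2*pi)."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x) % (2 * math.pi)
    return r, theta

r, theta = to_polar(0, 3)
print(r, theta)                              # 3.0 1.5707963... (= pi/2)
print(math.isclose(theta, math.pi / 2))      # True: the judged-correct candidate
print(math.isclose(theta, 3 * math.pi / 2))  # False: the judged-incorrect one
```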
Training setup (simplified):
- Model: Qwen3-30B-A3B (MoE), first SFTed on CoT data for basic reasoning.
- Data: ~113k math problems with gold answers; judge data built from rollouts (16 paths/problem), hard negatives, and balanced classes.
- Optimization: DAPO with dynamic sampling (filter trivially easy/hard problems), AdamW, small LR, high token budget but no explicit length penalty.
- Equal total steps for Vanilla RLVR vs JudgeRLVR; JudgeRLVR splits steps between judging and generating.
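For a feel of the optimizer side, below is a sketch of group-relative advantages in the GRPO family. DAPO's extra machinery (asymmetric clipping, token-level details) is omitted, and the zero-signal case loosely mirrors the dynamic-sampling filter described above; the specifics are simplifying assumptions.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Normalize each rollout's reward by the mean and std of its group
    (all rollouts sampled for the same problem), GRPO-style."""
    mean, std = rewards.mean(), rewards.std()
    if std < 1e-6:
        # All rollouts tied (all correct or all wrong): no learning signal,
        # echoing the dynamic-sampling filter that drops such problems.
        return np.zeros_like(rewards)
    return (rewards - mean) / std

# 16 rollouts for one problem, 5 correct and 11 incorrect:
rewards = np.array([1.0] * 5 + [0.0] * 11)
print(group_relative_advantages(rewards).round(2))
```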
The secret sauce (why this is clever):
- It reorders learning: first learn to say "this is a solid solution," then learn to produce those solid solutions.
- It removes the need for hand-crafted length penalties by shifting the modelâs internal preferences.
- It yields a style shift: fewer explicit backtracks ("but," "however," "wait"), more direct paths.
- It generalizes: the "avoid detours" instinct helps outside math (science reasoning, coding) too.
04 Experiments & Results
The test: Does judge-then-generate beat vanilla generate-only RLVR on both quality (accuracy) and efficiency (shorter outputs)? The team measured accuracy and average generation length across math and non-math benchmarks.
Competition: Baselines were (1) Base SFT (no RL), to show how much RL changes behavior and ability; and (2) Vanilla RLVR (final-answer-only reward), the strong standard.
Scoreboard with context:
- In-domain math (AIME24/25, HMMT_feb_2025, BeyondAIME, MATH500): JudgeRLVR improved average accuracy by about +3.7 percentage points while cutting average generation length by ~42% versus Vanilla RLVR. That's like getting higher grades while writing half as much: clearer and smarter, not just shorter.
- On challenging sets like HMMT_feb_2025 and BeyondAIME, gains were largest, suggesting that judging helps most when reasoning chains are longer and choices matter more.
- On MATH500 (a saturated set), JudgeRLVR slightly reduced accuracy but drastically shortened outputs, indicating much lower compute cost for nearly the same correctness.
- Out-of-domain (GPQA Diamond for science, LiveCodeBenchv6 for coding, MMLU-Redux for knowledge, IFEval for instruction following, ZebraLogic for logic): JudgeRLVR delivered about +4.5 points average accuracy improvement. Length often decreased (GPQA, LiveCodeBenchv6, MMLU-Redux), showing better generalization with fewer tokens. On IFEval and ZebraLogic, accuracy rose but outputs sometimes got longer, likely because strict formatting or rule-checking needs more explicit steps. Importantly, the method aims to reduce unproductive verbosity, not necessary structure.
Surprising findings:
- Judging alone isn't enough. In "Judge Only," accuracy often dropped and outputs got longer on math. The model became cautious (writing out checks), which can bloat tokens. This confirms the two-stage order matters: judge first to learn the filter, then generate to turn that filter into efficient action.
- Mixing judging and generating simultaneously (1:1) hurt stability. The "Mixed Strategy" sometimes matched or beat JudgeRLVR on a task, but overall produced longer outputs and worse generalization. Interleaving two objectives likely confused the policy, preventing a clean internal "judge-then-generate" routine from forming.
Measuring the mechanism (style and backtracking):
Hook: You can tell a person's style by how they talk. So can we do that for models?
The Concept (Perplexity, PPL): A number that tells how surprising a sequence is to a reference model; lower PPL means the text feels more familiar to that reference.
- How it works:
- Fix a reference (Base SFT) model.
- Feed it outputs sampled during training from different methods.
- Track PPL over steps; changes signal style shifts.
- Why it matters: Without a style metric, we can't tell if the model's language habits really changed.
Anchor: If your diary suddenly uses lots of new words, your friend reading it might feel more surprised (higher PPL).
Result: During JudgeRLVR's judging stage, PPL measured by the Base SFT evaluator increased versus Vanilla RLVR's flat line. That means the judging stage truly changed the style distribution, not just outcomes.
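A generic sketch of how such a PPL check can be computed with the Hugging Face transformers API. The paper uses the Base SFT model as the fixed reference; the small gpt2 checkpoint below is only a stand-in.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, model, tokenizer) -> float:
    """Perplexity of `text` under a fixed reference model:
    exp of the mean token-level cross-entropy."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    return torch.exp(out.loss).item()

# Placeholder reference model; swap in the Base SFT checkpoint in practice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
print(perplexity("Let's think step by step.", model, tokenizer))
```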
Hook: You can spot backtracking by the words people use, like "but," "however," or "wait."
The Concept (Transition-Word Statistics): Counting certain discourse markers to estimate how much the model is backtracking or contradicting itself.
- How it works:
- Sample many outputs along training.
- Count contrast/backtracking markers (e.g., but, however, wait).
- Track counts and frequencies over steps.
- Why it matters: Without this, we only see final answers, not how the model got there.
Anchor: If a student says "but" every other sentence, they might be second-guessing themselves a lot.
Result: In JudgeRLVR's generating stage, both the absolute count and frequency of such markers dropped substantially. This supports the claim that the model moved from external trial-and-error to more internal decision-making, leading to cleaner, more confident reasoning.
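A simple version of that counting, assuming word-level splitting and the three markers named above; the paper's full marker list may be longer.

```python
import re
from collections import Counter

BACKTRACK_MARKERS = ("but", "however", "wait")   # markers named in the analysis

def backtrack_stats(outputs: list[str]) -> tuple[Counter, float]:
    """Count contrast/backtracking markers across sampled outputs and
    report their frequency per 1,000 words."""
    counts, total_words = Counter(), 0
    for text in outputs:
        words = re.findall(r"[a-z']+", text.lower())
        total_words += len(words)
        for w in words:
            if w in BACKTRACK_MARKERS:
                counts[w] += 1
    freq_per_1k = 1000 * sum(counts.values()) / max(total_words, 1)
    return counts, freq_per_1k

counts, freq = backtrack_stats(["Wait, but the sign is wrong. However, r = 3."])
print(counts, round(freq, 1))
```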
Bottom line: JudgeRLVR outperforms vanilla RLVR on accuracy and efficiency in math, generalizes better to new domains, and shows linguistic signs of healthier thinking patterns: fewer detours, more direct routes.
05 Discussion & Limitations
Limitations:
- Task dependence: Some tasks (e.g., strict instruction following or symbolic logic) may require explicit checks, which can lengthen outputs even as accuracy improves.
- Judging isn't a magic wand: The "Judge Only" ablation shows that discrimination alone can lead to cautious verbosity. The second stage is essential to convert judgment into efficient generation.
- Data construction matters: Good hard negatives and class balance are critical. Poorly curated judge data may teach the wrong cues.
- Final-answer-only reward remains sparse: While the judge prior helps, the generating stage still relies on a binary correctness signal; nuanced partial-credit feedback could help further.
- Model size and compute: JudgeRLVR was tested on a strong MoE model with large token budgets. Smaller models or tight budgets may see different trade-offs.
Required resources:
- A base SFT model with reasonable reasoning skill.
- Problems with verifiable answers (so correctness can be automatically checked).
- A rollout pipeline to collect diverse candidate solution responses.
- RL infrastructure (e.g., DAPO/GRPO-family methods), and enough compute for two-stage training.
When NOT to use:
- Tasks without reliable automatic verifiers (no clear final answers), where labels are subjective or noisy.
- Settings demanding ultra-short outputs at any cost: JudgeRLVR reduces unproductive verbosity, but won't force brevity if structure is necessary.
- Extremely small datasets with few informative negatives; the judge may not learn sharp boundaries.
Open questions:
- Can we design richer, process-aware rewards (beyond final answers) that work harmoniously with the judge prior?
- How far does judgment transfer? Which non-math domains benefit most, and why?
- Can smaller models gain more (or less) from judge-first training?
- What's the best way to build judge datasets at scale, and can we automate hard negative mining even more effectively?
- Could multi-stage variants (judge → plan → generate) improve further, or do they overcomplicate learning?
06 Conclusion & Future Work
Three-sentence summary: JudgeRLVR trains a model to judge solution responses before it learns to generate full answers. This judge-first step teaches the model to spot and prune bad paths, so later generation becomes shorter, clearer, and often more accurate, without explicit length penalties. Across math and other domains, it delivers a better quality–efficiency trade-off and shows linguistic signs of healthier reasoning.
Main achievement: Turning discriminative error awareness into a transferable prior that reshapes generation, yielding around +3.7 pp math accuracy with ~42% fewer tokens, and about +4.5 pp gains out of domain versus vanilla RLVR.
Future directions: Combine judge-first with process-level rewards, explore scaling to more domains, refine data construction (hard negatives and balancing), and test adaptations for smaller models and tighter compute budgets. Investigate extensions like judge → plan → generate, and richer signals than binary correctness.
Why remember this: It reframes efficient reasoning as a two-step dance: learn to judge, then learn to speak. By moving verification upstream, the model stops wasting words on dead ends, giving us faster, clearer, and more generalizable solutions. In simple terms: think like a judge first, and the right answer becomes the only thing left to say.
Practical Applications
- Build math tutors that deliver concise, correct solutions with less verbose reasoning.
- Speed up code assistants by pruning unpromising code paths before generating fixes.
- Create science Q&A helpers that avoid meandering explanations and focus on solid steps.
- Develop grading assistants that first learn to judge solution quality, then provide model answers aligned with that judgment.
- Reduce cloud costs in production LLMs by cutting average generation length while maintaining or improving accuracy.
- Improve instruction-following agents by letting them learn to detect format/constraint violations early.
- Deploy verification-aware copilots that self-filter weak hypotheses before elaborating on them.
- Enhance dataset curation by mining hard negatives and balancing classes to train sharper evaluators.
- Use style diagnostics (PPL and transition-word counts) to monitor and enforce healthy reasoning patterns over time.