Likelihood-Based Reward Designs for General LLM Reasoning
Key Summary
- Binary right/wrong rewards for training reasoning in large language models are hard to design and often too sparse to learn from.
- This paper tests a simple alternative: reward the model by how likely (in log-space) it thinks the reference answer is after its reasoning steps.
- Log-probability rewards work well both when answers are checkable (like math) and when they aren't (like long explanations), while plain probability rewards often fail on long answers.
- On math benchmarks, log-prob rewards match or beat standard RL on accuracy and give much better perplexity (they're less confidently wrong).
- On long-form tasks without verifiers, log-prob rewards perform about the same as supervised fine-tuning, while plain probability rewards flatline due to tiny exact-match probabilities.
- A curious effect appears: chain-of-thought initially gets shorter under log-prob rewards; it later recovers in math but stays short in long-form tasks.
- Tricks to keep chains long (KL penalty or length rewards) prevent collapse but hurt performance, suggesting a trade-off.
- JEPO, VeriFree, and average-prob variants are competitive in some verified settings, but only log-prob is reliable across all settings.
- Training with log-prob rewards is compute-friendly (no need to sample final answers to score) and aligns with pretraining's next-token log-likelihood.
- Bottom line: log-likelihood rewards are a simple, scalable, verifier-free recipe for CoT fine-tuning across many tasks.
Why This Research Matters
Many everyday tasks, such as explaining ideas, writing helpful answers, or tutoring, don't have a single "correct" sentence to match. A training method that only works when we can check answers exactly misses these common, useful skills. By using log-probability rewards, we can train models to reason better even on open-ended tasks without special verifiers. This approach also makes models less confidently wrong, which is safer and more trustworthy. Because it aligns with how models were pretrained, it is simpler and more stable to use in practice. In short, it allows building better helpers for real-world questions, not just test-style problems.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how it's easier to learn when your teacher gives you helpful hints instead of only saying "right" or "wrong"?
The Concept: Before this paper, many models learned to reason with a "right/wrong" score. That means the model writes a step-by-step chain-of-thought (CoT), then gets a 1 if the final answer is correct and 0 if not. How it works: (1) The model tries a reasoning path; (2) it gives a final answer; (3) a checker (verifier) says 1 or 0; (4) the model learns from that one bit. Why it matters: If the model rarely gets exactly right answers early on, the reward is almost always 0, so learning is slow and fragile.
Anchor: Imagine practicing math and only hearing a buzzer for "wrong" without seeing which parts were close; you'd improve very slowly.
Hook: Imagine learning to write essays. There's no single "correct" sentence to match, so who decides if it's good?
The Concept: Some tasks are verifiable (like 2+2=4) and others are non-verifiable (like a detailed explanation). How it works: (1) Verifiable tasks can be checked automatically; (2) non-verifiable tasks can't be graded as simply right/wrong; (3) this makes binary rewards hard to use outside math/code. Why it matters: A training method that only works when you can verify exact correctness won't help for long answers, stories, or proofs.
Anchor: Checking a multiple-choice quiz is easy; judging a history essay isn't so clear.
Hook: You know how teachers sometimes grade by "how close" you are, not just right/wrong?
The Concept: Likelihood-based rewards give points based on how likely the model thinks the reference answer is after its reasoning. How it works: (1) For each question, there's a reference answer (the continuation in data); (2) the model writes its CoT; (3) we measure the probability (or log-probability) of the reference answer given that CoT; (4) higher likelihood = higher reward. Why it matters: This gives a smooth, dense signal even when you can't do exact right/wrong checks.
Anchor: It's like partial credit: the closer your final sentence is to the teacher's version, the more points you get.
Hook: Think of whispering a long sentence and asking someone to repeat it perfectly; the chance of an exact match shrinks with length.
The Concept: Plain probability (without logs) collapses on long answers because exact-match probabilities get tiny. How it works: (1) Multiply many token probabilities (each below 1) together → a vanishingly small overall probability; (2) reward becomes near zero; (3) learning flatlines. Why it matters: Non-verifiable, long-form answers need a reward that doesn't vanish with length.
Anchor: If the game is "say my 200-word paragraph exactly," no one scores points, so no one learns.
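To make the collapse concrete, here is a tiny numeric sketch (illustrative numbers only, not taken from the paper): even if the model assigns a fairly high probability of 0.9 to every token of a 200-token reference answer, the exact-match probability is essentially zero, while the summed log-probability stays on a usable scale.

```python
import math

per_token_prob = 0.9   # hypothetical average probability per reference token
answer_length = 200    # tokens in a long reference answer

# Plain probability reward: product of per-token probabilities.
exact_match_prob = per_token_prob ** answer_length
print(f"p(exact answer)  = {exact_match_prob:.3e}")   # ~7e-10: reward is essentially zero

# Log-probability reward: sum of per-token log-probabilities.
log_prob_reward = answer_length * math.log(per_token_prob)
print(f"log p(answer)    = {log_prob_reward:.2f}")    # ~-21.1: a usable training signal

# Per-token average (AvgLogProb) removes the length dependence entirely.
print(f"avg log p/token  = {log_prob_reward / answer_length:.3f}")   # ~-0.105
```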
Hook: You know how in class you learn word-by-word, sentence-by-sentence?
The Concept: Log-probability aligns with pretraining (next-token log-likelihood). How it works: (1) Pretraining teaches the model to assign high log-likelihood to the right next token; (2) using log-prob of the whole answer keeps the training signal consistent; (3) it avoids vanishing by adding logs (tiny numbers become manageable sums). Why it matters: Consistency with pretraining means stable learning and better perplexity.
Anchor: It's like practicing scales on a piano and then using the same skills in a song.
The World Before: LLMs improved at reasoning using CoT and RL with 0/1 rewards, mainly in math/code where answers are easy to check. The Problem: Binary rewards are sparse, sensitive to verifier design, and don't extend to long-form tasks. Failed Attempts: Intrinsic/self-judging rewards often underperform verified correctness; plain probability rewards (like VeriFree) struggle on long outputs. The Gap: We needed a single, scalable reward that works whether or not we can verify. Real Stakes: With a unified reward, we can train reasoning on everyday tasks (explaining, summarizing, and teaching) without special checkers or fragile signals.
02 Core Idea
Hook: Imagine coaching a team using a scoreboard that shows not only win/lose but also how close each play was to scoring.
The Concept: The key insight is simple: reward the model by the log-probability of the reference answer after its chain-of-thought. How it works: (1) The model writes its reasoning (CoT); (2) we compute log p(reference answer | question, CoT); (3) use that number as the RL reward; (4) optimize with standard policy-gradient tricks (like leave-one-out baselines). Why it matters: This one signal works for short, checkable answers and for long, uncheckable ones, and it doesn't vanish with length.
Anchor: It's like awarding points for how "on track" the final play was, not just whether the team won the whole game.
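In symbols (our own notation, restating the recipe above rather than quoting the paper's equations; x is the question, z_i one of G sampled CoTs, y* the reference answer), the reward and its leave-one-out advantage are:

```latex
R(z_i) = \log p_\theta\!\left(y^{*} \mid x,\, z_i\right),
\qquad
A_i = R(z_i) - \frac{1}{G-1}\sum_{j \neq i} R(z_j)
```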
Multiple Analogies:
- Thermometer analogy: Binary reward is "fever or no fever." Log-prob is an actual temperature: continuous and informative.
- Map reading: Exact-match probability is like demanding you step on every identical tile; log-prob is like rewarding you for getting close to the destination, even with small detours.
- Graded rubrics: Instead of all-or-nothing grades, use a rubric that gives partial credit based on closeness; logs make tiny credits add up sensibly.
Before vs After:
- Before: CoT RL needed verifiers; performance was good in math/code but didn't transfer to long-form. Probability rewards often flatlined on long answers.
- After: Log-prob rewards match or beat RL on verified tasks and match SFT on long-form; perplexity improves; the same objective covers both worlds.
Why It Works (intuition, not equations):
- Logs turn tiny products into manageable sums, preventing rewards from collapsing on long answers.
- The reward matches pretraining's next-token log-likelihood, so optimization is stable and familiar to the model.
- Log-prob discourages overconfidence in wrong answers, improving perplexity (less "confidently wrong").
Building Blocks (mini-concepts, each as a sandwich):
Hook: You know how you list steps to solve a puzzle? The Concept: Chain-of-Thought (CoT) is the model's written steps. How it works: It prints thoughts, then an answer. Why it matters: CoT helps with hard reasoning, and we can score how helpful the steps are by checking the final answer's likelihood. Anchor: Like showing your long division steps before the final number.
Hook: Imagine giving points based on how likely a guess is to be right. The Concept: Likelihood-based reward scores how probable the reference answer is given the CoT. How it works: Compute probability (or log-probability) after the CoT; use as reward. Why it matters: Provides a dense training signal without a dedicated verifier. Anchor: Near-miss answers still earn points and guide learning.
Hook: Long sentences make exact repeats hard. The Concept: Log-probability avoids vanishing rewards for long outputs. How it works: Sum token log-probs instead of multiplying tiny probs. Why it matters: The signal stays strong enough to learn. Anchor: Counting steps instead of multiplying small chances.
Hook: Sports teams compare plays within a game. The Concept: Leave-One-Out (RLOO) reduces noise. How it works: Compare each CoT's reward to the others for the same question. Why it matters: Stable training with less randomness. Anchor: Scoring a play relative to teammates' attempts on the same drill.
Hook: Two ways to read a score: per game or per play. The Concept: Per-answer vs per-token averaging. How it works: Average log-probs per whole answer or per token. Why it matters: Controls how much long answers weigh during training/evaluation. Anchor: Averaging the grade for the whole essay vs each sentence.
03 Methodology
At a high level: Question → Model writes a Chain-of-Thought (CoT) → Compute log-probability of the reference answer given that CoT → Use that as the reward → Update the model with RL (RLOO) → Repeat.
Step-by-step (like a recipe):
1. Collect data with reference continuations.
- What: Use datasets with prompts and answers (math, code, long-form). For long-form, the "answer" is the reference continuation in the data.
- Why: We need a target continuation to score likelihood; no verifier required.
- Example: From MATH, a problem and its correct short answer; from Alpaca, a prompt and its long response.
2. Format prompts to elicit CoT.
- What: Use an instruction template that asks the model to think inside <think>…</think> and then give <answer>…</answer>.
- Why: Separates reasoning from the final answer; makes it easy to truncate for scoring.
- Example: "Solve the problem. Think first in <think>…</think>. Then give just the final answer in <answer>…</answer>."
3. Generate groups of CoTs per question (group size G).
- What: For each prompt, sample G independent CoTs (e.g., G=32 for verifiable, G=4 for non-verifiable).
- Why: Comparing CoTs for the same question reduces noise (RLOO/GRPO-style advantages).
- Example: For a math problem, produce 32 thought traces that may differ in steps and length.
4. Compute the log-probability reward (see the code sketch after these steps).
- What: Reward = log p(reference answer | prompt, CoT). Optionally average per token (AvgLogProb) to normalize by length.
- Why: Dense, stable signal aligned with pretraining; avoids vanishing on long answers.
- Example: If the answer is "42," compute the model's log-likelihood of "42" after each CoT and use that as the reward.
5. Estimate advantages with RLOO (leave-one-out).
- What: For each CoT in the group, subtract the average reward of the other CoTs for the same question.
- Why: Stabilizes gradients by judging each attempt relative to its peers.
- Example: If one CoT earns higher log-prob than the others, it gets a positive advantage.
6. Update the model.
- What: Apply a policy-gradient step combining: (a) the RL term that nudges toward higher-reward CoTs, and (b) a direct supervised-like term on the answer (from the log-prob gradient).
- Why: The RL term learns better CoTs; the supervised-like term reinforces the answer tokens in context.
- Example: The model slightly increases chance of producing the higher-reward CoT patterns next time.
7. Decode and evaluate.
- What: Measure success rates (greedy and sampled), log-prob metrics (per-answer/per-token), perplexity, and CoT length.
- Why: Success shows correctness; log-prob/perplexity show confidence calibration; CoT length shows reasoning behavior.
- Example: On MATH, track greedy accuracy and perplexity over training steps.
8. Monte Carlo for log-of-expectation.
- What: The true objective marginalizes over CoTs: log p(answer | prompt) = log E_z p(answer | z). Approximate it with Monte Carlo by averaging probabilities over sampled CoTs and then taking the log (32 samples, MC32, works better than a single sample, MC1).
- Why: The log of an average cannot be computed exactly without summing over all CoTs; MC keeps it practical.
- Example: For each question, average probabilities over 32 CoTs and then take log.
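The sketch below pulls steps 4, 5, 6, and 8 together. It is a minimal illustration under our own naming (sequence_log_prob, rloo_advantages, policy_loss, and mc_log_expectation are hypothetical helpers, and PyTorch is assumed), not the authors' released implementation; shapes assume one question with G sampled CoTs.

```python
import torch

def sequence_log_prob(token_log_probs: torch.Tensor, per_token: bool = False) -> torch.Tensor:
    """Step 4: reward per CoT. token_log_probs is [G, T], holding
    log p(answer token t | prompt, CoT_i, earlier answer tokens) from teacher forcing.
    Returns [G]: the summed log-prob (LogProb) or the per-token average (AvgLogProb)."""
    total = token_log_probs.sum(dim=-1)
    return total / token_log_probs.shape[-1] if per_token else total

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Step 5: leave-one-out advantage = reward minus the mean reward of the other G-1 CoTs."""
    g = rewards.shape[0]
    baseline = (rewards.sum() - rewards) / (g - 1)
    return rewards - baseline

def policy_loss(cot_log_probs: torch.Tensor, advantages: torch.Tensor,
                answer_log_probs: torch.Tensor) -> torch.Tensor:
    """Step 6 (hedged sketch): a REINFORCE-style term that favors higher-reward CoTs,
    plus a supervised-like term on the answer tokens that falls out of the log-prob reward."""
    rl_term = -(advantages.detach() * cot_log_probs).mean()
    sft_like_term = -answer_log_probs.mean()
    return rl_term + sft_like_term

def mc_log_expectation(rewards: torch.Tensor) -> torch.Tensor:
    """Step 8: Monte Carlo estimate of log E_z p(answer | z) from G per-CoT log-probs,
    i.e. log((1/G) * sum_i exp(log p_i)), computed stably with logsumexp."""
    g = rewards.shape[0]
    return torch.logsumexp(rewards, dim=0) - torch.log(torch.tensor(float(g)))

# Toy numbers: G = 4 CoTs for one question, each scoring a 3-token reference answer.
token_log_probs = torch.tensor([
    [-0.1, -0.2, -0.1],   # a CoT after which the reference answer is very likely
    [-0.5, -0.9, -0.4],
    [-1.2, -2.0, -1.5],   # a CoT after which the reference answer looks unlikely
    [-0.3, -0.4, -0.2],
])
rewards = sequence_log_prob(token_log_probs)        # approx [-0.4, -1.8, -4.7, -0.9]
print("advantages:", rloo_advantages(rewards))      # positive wherever a CoT beats its peers
print("log E_z p :", mc_log_expectation(rewards))
cot_log_probs = torch.tensor([-35.0, -42.0, -60.0, -38.0])  # placeholder sum of log pi over CoT tokens
print("loss      :", policy_loss(cot_log_probs, rloo_advantages(rewards), rewards))
```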
What breaks without each step:
- Without CoT formatting, the model's thoughts and answers get mixed; scoring becomes noisy.
- Without likelihood reward, training in non-verifiable settings canāt proceed reliably.
- Without RLOO, gradients are high-variance; learning is unstable.
- Without MC, log-of-expectation is biased by single-sample underestimates.
Concrete data example:
- Prompt: "Find the value of x: 3x + 5 = 17."
- CoT A: "Subtract 5 both sides: 3x=12. Divide by 3: x=4." → High log p("4").
- CoT B: "Try 3: 3*3+5=14; not 17. Try 4: 12+5=17." → Also high but maybe slightly lower. RLOO favors the higher one.
Secret sauce (what's clever):
- Using log-prob aligns post-training with pretraining, improving perplexity and stability.
- It avoids vanishing rewards on long outputs, unlike plain probability.
- It does not need sampling of final answers to score (compute-friendly), only a single forward pass on the reference answer given the CoT.
- Same recipe works for both verified and non-verified domains, simplifying pipelines.
Mini sandwiches for related methods:
Hook: Two ways to count chances: raw chance vs log-chance. The Concept: VeriFree uses raw probability p(answer|CoT); JEPO uses a grouped log-mean-exp across CoTs. How it works: VeriFree scores expected exact-match; JEPO tightens the log-of-expectation estimate with multiple samples. Why it matters: VeriFree can flatline on long answers; JEPO helps estimation but adds compute. Anchor: Counting tiny raindrops (probability) vs using a rain gauge that sums steadily (log-prob).
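For concreteness, here is how these reward variants would be computed from the same per-token log-probabilities (a hedged sketch with our own function name and labels; the avg-prob entry reflects our reading of "RLPR-like", so consult each paper for its exact objective):

```python
import torch

def reward_variants(token_log_probs: torch.Tensor) -> dict:
    """token_log_probs: [G, T] per-token log p(reference token | prompt, CoT_i).
    Returns a reward per CoT for each variant; the JEPO-style entry is a single
    group-level objective shared by all G CoTs."""
    G, T = token_log_probs.shape
    seq_logp = token_log_probs.sum(dim=-1)                        # [G]
    return {
        "prob (VeriFree-style)": seq_logp.exp(),                  # shrinks toward 0 as T grows
        "avg-prob (RLPR-like)":  token_log_probs.exp().mean(dim=-1),
        "log-prob":              seq_logp,
        "avg-log-prob":          seq_logp / T,                    # length-normalized
        "JEPO-style group obj":  torch.logsumexp(seq_logp, dim=0)
                                 - torch.log(torch.tensor(float(G))),   # log-mean-exp
    }
```

On short answers these variants behave similarly; the differences appear as the answer length T grows, which is exactly where the probability-based entries collapse.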
04 Experiments & Results
The Test: The authors measured how well different rewards train reasoning across two verifiable math sets (MATH, DeepScaleR) and two long-form, non-verifiable sets (Alpaca, NuminaProof). They reported:
- Success rate (greedy and sampled)
- Log-prob metrics (per-answer/per-token), plus perplexity
- CoT length over training
The Competition: Baselines included Supervised Fine-Tuning (SFT) and Base RL with 0/1 rewards. Variants tested: Log-prob, AvgLogProb, Probability (VeriFree), AvgProb (RLPR-like), and JEPO.
Scoreboard with context:
- Verifiable (MATH, Qwen-3B): Greedy success ~56.84% with log-prob vs 55.85% base RL, like edging from a solid B+ to an A-. With temperature-1 sampling, all methods dip, and log-prob no longer leads; greedy decoding shows the true gains.
- Perplexity shines: On MATH (Llama-3B), log-prob achieves ~2.21 perplexity vs base RL ~13.87, like moving from guessing wildly to being calmly confident. Probability (without logs) sits in between but much worse than log-prob.
- DeepScaleR shows similar patterns: log-prob-family gets strong greedy success and far better perplexity than base RL and probability-only rewards.
- Non-verifiable (Alpaca, NuminaProof): Log-prob, AvgLogProb, and JEPO match SFT in log-prob/perplexity. Plain probability rewards often flatline due to vanishing exact-match probabilities on long answers.
Surprising findings:
- CoT shortening: Under log-prob rewards, CoTs get much shorter early. In math, they later recover; in long-form, they collapse to very short traces (roughly 5-10 tokens) and stay there, making behavior similar to SFT.
- Stabilizing CoT (via KL penalties or explicit length rewards) prevents collapse but hurts performance, revealing a trade-off between "long visible thinking" and measured quality.
- Warm-starting (teaching the model to answer in the presence of CoTs before RL) stabilizes CoT somewhat but still doesn't beat SFT on long-form within the given compute.
- JEPO vs simple log-prob: With greedy decoding, JEPO wasn't clearly better than basic log-prob despite extra compute; the simpler method often sufficed.
Key numbers (illustrative):
- MATH (Qwen-3B, greedy): Base RL 55.85% vs Log-prob 56.84%.
- MATH (Llama-3B, perplexity): Base RL ~13.87 vs Log-prob ~2.21; SFT ~2.63.
- Non-verifiable (Alpaca, Llama-3B): Log-prob perplexity ~2.56, matching SFT ~2.56; Probability reward struggles.
Takeaways:
- For accuracy on verified tasks, several methods are comparable under greedy decoding; for confidence calibration (perplexity), only log-prob-family shines.
- For long-form, log-prob matches SFT; plain probabilities fail. One reward to rule both worlds: log-prob.
05 Discussion & Limitations
Limitations:
- CoT collapse: Log-prob rewards often shorten CoT dramatically, especially in long-form, and keeping CoT long tends to hurt measured performance.
- Compute sensitivity: JEPO can surpass in theory with more compute (as seen in other work), but here added complexity didn't pay off within budget.
- Metric trade-offs: Optimizing success rate alone can hurt perplexity; optimizing log-prob helps perplexity but may reduce visible CoT.
- Domain scope: Results shown on 3B-scale Llama/Qwen; effects at very large scales may differ.
Required resources:
- Instruction-tuned base model, datasets with reference continuations, RL infrastructure with group sampling (G up to 32 for verified tasks), and standard optimizers.
- Optional verifiers for checked tasks; not required for log-prob training itself.
When NOT to use:
- If you must preserve long, human-readable CoTs on long-form tasks at all costs (log-prob may collapse CoT length without extra penalties, which then hurt performance).
- If your setup depends on temperature-1 sampling accuracy as the main metric; greedy decoding reveals clearer gains for log-prob.
- If you need maximal gains in unverifiable domains but have very limited compute; log-prob will likely match, not exceed, SFT under modest budgets.
Open questions:
- Can we design rewards that keep CoTs informative and long without hurting performance?
- Is there a compute regime where JEPO (or similar grouped objectives) reliably surpasses log-prob on long-form?
- Do larger models or mixed short/long curricula change the CoT collapse pattern?
- Can we measure or leverage "hidden CoT" inside the network to bridge visible-vs-internal reasoning?
- Is there a smooth path of tasks (short → medium → long answers) that preserves gains across lengths?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that rewarding models by the log-probability of the reference answer after their chain-of-thought is a simple, universal signal that works for both verifiable (math/code) and non-verifiable (long-form) tasks. It matches or beats standard RL on accuracy in math while strongly improving perplexity, and it matches SFT on long-form where plain probability rewards fail. Thus, one lightweight, verifier-free objective can train reasoning across many settings.
Main achievement: Establishing log-likelihood rewards as a practical, compute-friendly, and broadly effective CoT fine-tuning method that aligns with pretraining and avoids vanishing rewards on long outputs.
Future directions: Stabilize CoT length without hurting quality; explore larger models and longer training for potential gains beyond SFT on long-form; investigate curricula that blend short and long answers; study hidden CoT mechanisms and better group objectives (e.g., JEPO variants) under higher compute.
Why remember this: When in doubt, use log-prob. It's the rare training signal that is simple, scalable, verifier-free, and strong across both short, checkable answers and long, open-ended explanations.
Practical Applications
- Fine-tune a math tutor model using log-prob rewards to improve accuracy and reduce overconfident mistakes.
- Train a helpdesk assistant on long-form solutions (FAQ continuations) without needing a special verifier.
- Improve code reasoning by rewarding high log-likelihood of reference solutions while keeping perplexity low.
- Build educational models that explain concepts step-by-step but are evaluated by how well their final answers match reference explanations.
- Calibrate models for medical or legal summaries by using log-prob rewards to avoid overconfident wrong statements.
- Use greedy decoding during evaluation to see clearer gains from log-prob training on verifiable tasks.
- Adopt per-token log-prob metrics to monitor stability across datasets with varied answer lengths.
- Warm-start training by exposing the model to CoTs (masked for gradients) before RL to reduce early CoT collapse.
- Deploy a single training pipeline across verified (math/code) and non-verified (long-form) data using the same reward.
- Reduce compute by skipping answer sampling for scoring; compute log p(answer|CoT) in a single forward pass (see the sketch below).
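As a sketch of that last point, the reference answer can be scored with one teacher-forced forward pass, with no sampling of the answer. The snippet below uses Hugging Face transformers with an illustrative model name; it is a minimal example of the idea under our assumptions, not the paper's training code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"   # illustrative choice; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def log_prob_reward(prompt_and_cot: str, reference_answer: str) -> float:
    """Return log p(reference_answer | prompt_and_cot) from a single forward pass."""
    ctx_ids = tokenizer(prompt_and_cot, return_tensors="pt").input_ids
    ans_ids = tokenizer(reference_answer, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits                 # [1, L, vocab]

    # Logits at position t predict token t+1, so shift by one and keep the answer span.
    ans_len = ans_ids.shape[1]
    answer_logits = logits[:, -ans_len - 1:-1, :]        # predictions for the answer tokens
    log_probs = torch.log_softmax(answer_logits, dim=-1)
    token_log_probs = log_probs.gather(-1, ans_ids.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum().item()                  # sum, or .mean() for AvgLogProb

# Example: score the reference answer "4" after a prompt plus chain-of-thought.
cot = "Solve: 3x + 5 = 17. <think>Subtract 5: 3x = 12. Divide by 3: x = 4.</think> <answer>"
print(log_prob_reward(cot, "4"))
```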