
Reinforcement Learning via Self-Distillation

Intermediate
Jonas Hübotter, Frederike Lübeck, Lejs Behric et al. · 1/28/2026
arXiv · PDF

Key Summary

  • The paper teaches large language models to learn from detailed feedback (like error messages) instead of only a simple pass/fail score.
  • It introduces SDPO, a way for a model to become its own teacher by re-reading feedback and then training itself to prefer the improved next-word choices it would make after seeing that feedback.
  • This turns feedback into dense, token-by-token guidance, fixing the classic credit-assignment problem where a single score hides which parts were right or wrong.
  • On the coding benchmark LiveCodeBench v6, SDPO beats a strong RL method (GRPO), reaching higher accuracy and matching GRPO's final level with about four times fewer generations.
  • Even when only pass/fail rewards are available, SDPO cleverly uses successful attempts as implicit feedback for failed ones and still outperforms GRPO.
  • SDPO tends to produce shorter, cleaner reasoning than GRPO, avoiding repetitive filler like 'Hmm' and 'Wait' while being more accurate.
  • At test time on very hard tasks, SDPO can learn on a single question from its own mistakes and discover solutions with about three times fewer tries than best-of-k or multi-turn prompting.
  • SDPO works best with stronger base models that can already learn well from context; with very small models, mixing GRPO and SDPO can help.
  • The method adds only modest compute overhead by re-scoring the same attempt under a feedback-conditioned prompt and uses memory-saving top-K tricks.
  • Overall, SDPO shows that giving models rich, text feedback and letting them teach themselves makes reinforcement learning more informative, faster, and more stable.

Why This Research Matters

Many real tasks provide helpful text feedback—like error logs, judge comments, or tool outputs—so learning to use this information directly unlocks faster progress. SDPO shows you don’t need a bigger external teacher to get dense, helpful guidance; the model can teach itself by rereading its own work with feedback. This reduces computation (fewer generations), saves money (no expensive teacher model), and speeds time to useful accuracy. By delivering shorter, clearer reasoning, SDPO also makes AI outputs easier to trust and use. In tough, high-stakes settings (debugging code, scientific problem solving), SDPO helps find working answers with far fewer tries. Over time, this approach could power more capable, efficient AI assistants that improve themselves from the everyday feedback they already see.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you get a math test back with only a big red 0 or 1 at the top. That tells you if you passed or failed, but not which steps you messed up. Now imagine the teacher also writes notes like 'You divided by zero here' or 'Remember to include units.' Those notes make it way easier to fix your mistakes next time.

🥬 The Concept (Reinforcement Learning): Reinforcement learning (RL) is when a model tries something, gets feedback, and updates how it behaves next time. How it works: 1) The model answers a question. 2) The environment (like a code runner or a quiz) returns some feedback. 3) The model changes its policy (its strategy for picking the next word) using that feedback. Why it matters: If the feedback is just a single number (pass/fail), it's hard to know which exact steps were good or bad.

🍞 Anchor: Think of a spelling bee: if the only feedback is 'correct' or 'wrong,' you don't know which letter you missed. But if someone says 'You swapped the i and e,' you know exactly what to fix.

🍞 Hook: You know how you can get better at puzzles just by seeing a hint like 'Look at row 3' instead of a full solution? That hint narrows down where to look.

🥬 The Concept (Credit Assignment): Credit assignment is figuring out which parts of your answer helped or hurt the final result. How it works: 1) Compare your steps to the outcome. 2) Reward the steps that helped. 3) Penalize the steps that hurt. Why it matters: Without good credit assignment, learning is like guessing which move in a long chess game made you lose.

🍞 Anchor: After baking cookies that taste too salty, you blame the extra teaspoon of salt, not the oven temperature. That’s credit assignment.

🍞 Hook: Picture reading your own essay with your teacher’s margin notes. Suddenly, your own writing makes more sense—what to change and where.

🥬 The Concept (In-Context Learning): In-context learning is when a model improves its next-word choices just by reading extra helpful text in the prompt (like examples or feedback), without changing its weights. How it works: 1) Add feedback to the prompt. 2) The same model now predicts different, usually better, next words. 3) This effect vanishes if you remove the feedback from the prompt. Why it matters: If a model can spot its own mistakes when shown feedback, we can use this ability as a guide for training.

🍞 Anchor: If you solve a riddle better after rereading the hint, that’s in-context learning—your brain didn’t change, but the hint changed how you think.

🍞 Hook: Imagine a video game that only says 'win' or 'lose' at the end. Now imagine it also shows you the trap you fell in and the exact spot you slipped.

🥬 The Concept (Reinforcement Learning with Verifiable Rewards, RLVR): RLVR gives a simple score (often just pass/fail) after an attempt that can be checked by a program (like unit tests for code). How it works: 1) Generate an answer. 2) Run a verifier. 3) Get a score. 4) Nudge the model toward answers that scored higher. Why it matters: A single score hides where the error happened, making learning slow and unstable.

🍞 Anchor: If you submit code and only learn 'tests failed,' you still don’t know which test failed or why.
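
To make "verifiable reward" concrete, here is a minimal sketch of a pass/fail verifier for a toy coding task. The task and the tests are invented for illustration, not taken from the paper; the point is that the returned number says nothing about which check failed or why.

```python
def verify(code: str) -> float:
    """Return 1.0 if the candidate passes all unit tests, else 0.0."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        assert namespace["solve"](3) == 6
        assert namespace["solve"](1) == 1
        return 1.0
    except Exception:
        return 0.0  # the learner never sees which assert failed, or why

print(verify("def solve(n): return sum(range(1, n))"))      # 0.0 -- but where is the bug?
print(verify("def solve(n): return sum(range(1, n + 1))"))  # 1.0
```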

🍞 Hook: It’s much easier to fix LEGO instructions when someone circles the exact step you built wrong instead of just saying 'Your model is incorrect.'

🥬 The Concept (Reinforcement Learning with Rich Feedback, RLRF): RLRF gives you tokenized feedback, like runtime errors or judge comments, not just a score. How it works: 1) Attempt a solution. 2) Get detailed, text feedback describing what went wrong. 3) Use that feedback to guide exactly which words to prefer next time. Why it matters: Rich feedback shrinks the mystery, making credit assignment precise.

🍞 Anchor: A Python error 'ZeroDivisionError on line 73' tells you exactly what to fix instead of just saying 'Wrong answer.'

🍞 Hook: When two students compare a correct solution to a wrong one, they can spot the exact line that needs changing.

🥬 The Concept (Tokenized Feedback): Tokenized feedback means the feedback comes as text tokens (words/subwords) the model can read directly in its prompt. How it works: 1) Capture feedback as text. 2) Put it into the model’s context. 3) Let the model’s next-word probabilities change based on this text. Why it matters: Text feedback is easy for language models to ingest and reason about.

🍞 Anchor: A short note 'Don’t include n in the range' is tokenized feedback the model can read and act on.

🍞 Hook: Before this paper, many training setups gave models only a thumbs-up or thumbs-down. That’s like grading with a stamp, not a pen.

🥬 The Gap: The big problem was turning rich feedback into learning signals without hiring a stronger external teacher model. Past methods either used only scalar rewards (too weak), or required a big teacher (expensive and may cap progress). What was missing was a way for the model to use its own in-context smarts to guide itself.

🍞 Anchor: It’s like recording your own game, rewatching it with commentary (feedback), and then practicing exactly those moves you now realize were better.

Real stakes for daily life:

  • Faster bug-fixing: Code tools that read their own error logs can improve quickly.
  • Better math helpers: Tutors that explain not just the answer but exactly which step to correct.
  • Shorter, clearer explanations: Less rambling, more to the point.
  • Cheaper improvement: No need for a giant expert model hovering over your shoulder.
  • Quicker discovery: On tough problems, getting a solution in fewer tries saves time and compute.

02 Core Idea

🍞 Hook: You know how you can learn a lot just by re-reading your own work with the teacher’s comments in mind—without the teacher rewriting it for you?

🥬 The Aha in one sentence: Let the model become its own teacher by re-scoring its original answer after reading rich feedback, then train it to prefer the feedback-improved next-word choices it now sees.

Multiple analogies:

  1. Replay coach: A basketball player watches their own game with coach notes (feedback), then practices the corrected moves. SDPO turns those corrected moves into the new default.
  2. Recipe taster: A chef tastes a dish, reads the tasting notes (too salty at step 3), and then refines the exact ingredient amounts next time. SDPO adjusts the exact word choices (tokens) at each step.
  3. Map with landmarks: Instead of just 'You arrived or you didn’t,' you get turn-by-turn notes ('wrong turn at 2nd street'), and you update your mental map accordingly.

Before vs. After:

  • Before: With RLVR, the model gets one outcome number; learning is blurry and slow, often leading to overly long, meandering reasoning.
  • After: With SDPO, the model uses tokenized feedback to pinpoint which words should have been different; learning becomes fast, precise, and produces concise reasoning.

Why it works (no equations, just logic):

  • In-context learning already lets the model read feedback and propose better next words. If we freeze that better distribution as a target and gently pull the original model toward it, we turn a one-time insight into a permanent skill.
  • This gives dense credit assignment: each next token gets a specific thumbs-up/down based on whether the feedback-conditioned teacher likes it more or less.
  • No external teacher is needed: the same model, when shown feedback, is surprisingly better at judging its own past choices.

Building blocks (each with the sandwich pattern):

🍞 Hook: When you have both a final grade and margin notes, the notes do most of the teaching. 🥬 RLRF (Reinforcement Learning with Rich Feedback): It feeds the model detailed text feedback after each try. How: 1) Attempt. 2) Get text feedback. 3) Use it to guide updates. Why: It gives clues, not just scores. 🍞 Anchor: 'IndexError at line 12' is more useful than 'Fail.'

🍞 Hook: Think of 'Your future self' explaining which sentence you should have written. 🥬 SDPO (Self-Distillation Policy Optimization): The model, after seeing feedback, acts as a self-teacher whose next-word opinions are distilled back into the student. How: 1) Generate answer. 2) Get feedback. 3) Re-score the same answer under a feedback-augmented prompt (self-teacher). 4) Nudge the model to match those improved probabilities. Why: It turns hindsight into skill. 🍞 Anchor: The model learns to drop '+1' from 'range(1, n + 1)' after reading 'Don’t include n.'

🍞 Hook: Zoom in on the exact Lego piece that doesn’t fit. 🥬 Dense credit assignment: Instead of a single score for the whole answer, each token gets its own tiny reward/penalty based on how the self-teacher would rank it. How: Compare student vs. self-teacher probabilities at every position. Why: Pinpointing errors speeds up learning. 🍞 Anchor: Only the ' + 1 ' piece gets down-weighted; the rest stays.

🍞 Hook: Choosing the top few suspects beats interrogating the whole town. 🥬 Logit-level (top-K) distillation: The model focuses on the most likely next K tokens (plus a tail term) to save memory while keeping the key signal. How: Re-score top-K choices under feedback; update the student toward the teacher’s preferences. Why: Efficient and informative. 🍞 Anchor: Checking the top 100 likely words is enough to learn which one should replace the current choice.

🍞 Hook: It’s like a practice match where your team compares plays to see what worked. 🥬 GRPO baseline: GRPO gives the same advantage to all tokens in a rollout based on the final reward. How: Compare rewards across a group; push up winners, push down losers. Why: Simple, but too coarse when rewards are sparse. 🍞 Anchor: If all attempts get 0, GRPO stalls; there’s no signal.

Putting it together: SDPO uses rich feedback to create a feedback-conditioned self-teacher, compares token-by-token preferences (dense credit), and distills them back into the policy—without an external mentor. That is the core leap.
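
To picture the self-teacher concretely, here is a minimal sketch of how a feedback-conditioned prompt might be assembled before re-scoring. The template wording and function name are assumptions for illustration, not the paper's exact prompt format.

```python
# Minimal sketch of building the feedback-conditioned self-teacher input.
# The template text is illustrative, not the paper's exact prompt.

def build_teacher_prompt(question: str, feedback: str, sample_solution: str | None = None) -> str:
    parts = [question, "Feedback on your previous attempt:\n" + feedback]
    if sample_solution is not None:
        parts.append("A solution that passed the tests:\n" + sample_solution)
    parts.append("Revise your attempt, taking the feedback into account:")
    return "\n\n".join(parts)

question = "Write solve(n) that returns the sum 1 + 2 + ... + n."
teacher_prompt = build_teacher_prompt(
    question, "AssertionError: solve(3) returned 3, expected 6"
)
# The student scores its original answer under `question` alone;
# the self-teacher scores the SAME answer under `teacher_prompt`.
```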

03 Methodology

High-level recipe: Input (Question + Current Policy) → Generate Attempts → Get Rich Feedback → Build Self-Teacher with Feedback → Re-score Original Attempts → Update Student to Match Self-Teacher → Output (Improved Policy)

Step-by-step with the sandwich pattern for key parts:

  1. Sampling rollouts (generate attempts) 🍞 Hook: Try first; learn after. Like taking a practice quiz before seeing the answer key. 🥬 What: The student model answers each question, producing y. How: 1) Pick a batch of questions. 2) Sample G answers per question from the current policy. Why: We need fresh, on-policy attempts so updates match what the model actually does. 🍞 Anchor: For a coding task, the model writes a function for each prompt.

  2. Getting rich feedback 🍞 Hook: Error messages are like sticky notes on your code saying exactly where it broke. 🥬 What: The environment returns tokenized feedback f (runtime errors, failed tests, or judge notes). How: 1) Run the answer. 2) Collect any error text or test summaries. 3) Optionally include a sample solution if one succeeded in the same batch. Why: This text tells us not just that we failed, but how. 🍞 Anchor: 'ZeroDivisionError: division by zero on line 73' and a known passing solution are both helpful.
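
A minimal sketch for step 2, using a toy task and a single hand-written check (both invented for illustration): the environment hands back the exact error text, not just a score.

```python
import traceback

candidate_code = """
def solve(n):
    return sum(range(1, n))  # off-by-one bug: leaves out n
"""

def collect_feedback(code: str) -> tuple[float, str]:
    """Run the candidate against one check and keep the error text, not just 0/1."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        result = namespace["solve"](3)
        assert result == 6, f"solve(3) returned {result}, expected 6"
        return 1.0, "All tests passed."
    except Exception:
        return 0.0, traceback.format_exc()  # rich, tokenized feedback

reward, feedback = collect_feedback(candidate_code)
print(reward)    # 0.0
print(feedback)  # full AssertionError text, ready to paste into the self-teacher prompt
```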

  3. Building the self-teacher (feedback-conditioned re-scoring) 🍞 Hook: Read your answer while holding the teacher’s notes—your judgments change. 🥬 What: The same model re-scores the original answer after reading the feedback (prompt = question + feedback [+ sample solution if available]). How: 1) Concatenate feedback into the prompt. 2) Compute next-token probabilities for the already-generated answer tokens under this new prompt. Why: This reveals, token-by-token, where the model now disagrees with its earlier choices. 🍞 Anchor: After reading 'Don’t include n,' the model lowers probability of '+ 1' inside range.
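
A minimal sketch for step 3, assuming a Hugging Face-style causal LM; `model`, `tokenizer`, `question`, `answer`, and `feedback` are placeholders, and the prompt concatenation is simplified. The same already-generated answer tokens are scored once under the plain question and once under the feedback-augmented prompt.

```python
import torch

def answer_token_logprobs(model, tokenizer, prompt: str, answer: str) -> torch.Tensor:
    """Log-probability of each already-generated answer token, conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    logits = model(input_ids).logits                        # (1, seq_len, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)    # position t predicts token t + 1
    answer_part = logprobs[:, prompt_ids.shape[1] - 1:, :]  # rows that predict the answer tokens
    return answer_part.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)

# Student view:  the answer scored under the bare question.
# Teacher view:  the SAME answer scored under question + feedback (no new generation needed).
# student_lp = answer_token_logprobs(model, tok, question, answer)
# with torch.no_grad():  # gradients are stopped through the self-teacher
#     teacher_lp = answer_token_logprobs(model, tok, question + "\n\n" + feedback, answer)
```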

  4. Dense, logit-level credit assignment 🍞 Hook: A fine paintbrush beats a roller when you need precision. 🥬 What: Compare student vs. self-teacher probabilities at each token, over top-K candidate tokens. How: 1) For each position t, take the student’s top-K tokens. 2) Pull student toward tokens the teacher prefers; push away from those it dislikes. Why: This turns feedback into specific micro-corrections instead of a vague global nudge. 🍞 Anchor: Only the wrong off-by-one token gets a penalty; the rest of the code is left alone or improved.
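
A minimal sketch for step 4 with made-up tensor shapes: at each answer position, compare the teacher's and student's log-probabilities over the student's top-K candidates. The exact form of SDPO's per-token signal may differ; this only shows where dense credit can come from.

```python
import torch

def topk_preference_gap(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        k: int = 100) -> tuple[torch.Tensor, torch.Tensor]:
    """student_logits, teacher_logits: (T, vocab) for the T answer positions."""
    student_logp = torch.log_softmax(student_logits, dim=-1)
    teacher_logp = torch.log_softmax(teacher_logits, dim=-1)
    topk_ids = student_logp.topk(k, dim=-1).indices                  # (T, K) candidate tokens
    gap = teacher_logp.gather(-1, topk_ids) - student_logp.gather(-1, topk_ids)
    # gap > 0: after reading the feedback, the teacher likes this candidate more than
    # the student did, so pull the student toward it; gap < 0: push it away.
    return topk_ids, gap

# Toy numbers: a 3-token answer over a 10-word vocabulary, top-5 candidates per position.
ids, gap = topk_preference_gap(torch.randn(3, 10), torch.randn(3, 10), k=5)
```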

  5. The SDPO loss (distillation objective) 🍞 Hook: Practice the improved move until it becomes your new habit. 🥬 What: Minimize the divergence (e.g., Jensen–Shannon) between the student’s next-token distribution and the self-teacher’s, with gradients stopped through the teacher. How: 1) Compute per-token divergence. 2) Average across positions (with masks). 3) Update student weights via gradient descent. Why: Stopping gradients through the teacher prevents it from drifting to match the student and ignoring feedback. 🍞 Anchor: The student permanently learns the fix it only showed temporarily when feedback was in the prompt.
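
A minimal sketch for step 5: a per-token Jensen–Shannon objective with gradients stopped through the teacher, under assumed shapes. The paper's exact divergence, masking, and top-K handling may differ.

```python
import torch
import torch.nn.functional as F

def sdpo_style_jsd_loss(student_logits: torch.Tensor,   # (T, vocab), requires grad
                        teacher_logits: torch.Tensor,   # (T, vocab), feedback-conditioned
                        mask: torch.Tensor) -> torch.Tensor:  # (T,), 1.0 = real token
    p = F.softmax(student_logits, dim=-1)
    q = F.softmax(teacher_logits.detach(), dim=-1)  # stop-gradient: the teacher is a fixed target
    m = 0.5 * (p + q)
    eps = 1e-8
    kl_pm = (p * (torch.log(p + eps) - torch.log(m + eps))).sum(-1)
    kl_qm = (q * (torch.log(q + eps) - torch.log(m + eps))).sum(-1)
    jsd = 0.5 * (kl_pm + kl_qm)                     # one divergence per answer position
    return (jsd * mask).sum() / mask.sum().clamp(min=1)

# Toy usage: 4 answer positions, 12-token vocabulary, last position masked out as padding.
loss = sdpo_style_jsd_loss(torch.randn(4, 12, requires_grad=True),
                           torch.randn(4, 12),
                           torch.tensor([1.0, 1.0, 1.0, 0.0]))
loss.backward()
```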

  6. Stability tricks (regularized teacher) 🍞 Hook: Training wheels keep you steady while you speed up. 🥬 What: Keep the teacher close to a reference to avoid instability. How: 1) EMA teacher: maintain a moving average of weights. Or 2) Trust-region teacher: interpolate teacher logits with the initial teacher. Why: Prevents runaway drift and stabilizes learning. 🍞 Anchor: Like blending today’s advice with day-one guidelines so changes aren’t too wild.
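
Minimal sketches for step 6's two stabilizers. The decay and blend coefficients here are assumptions for illustration, not the paper's settings.

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.99) -> None:
    """teacher <- decay * teacher + (1 - decay) * student, parameter by parameter."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)

def trust_region_logits(teacher_logits: torch.Tensor,
                        initial_logits: torch.Tensor,
                        alpha: float = 0.8) -> torch.Tensor:
    """Blend the current teacher's logits with the day-one teacher's logits."""
    return alpha * teacher_logits + (1.0 - alpha) * initial_logits
```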

  7. Computation and memory efficiency 🍞 Hook: You don’t need to read the whole dictionary to fix a misspelling. 🥬 What: Use top-K distillation to avoid storing all logits. How: 1) Compute only top-K student logits and matching teacher logits. 2) Add a tail term for the rest. 3) Parallelize re-scoring (no extra generation). Why: Almost no memory overhead and only small time overhead. 🍞 Anchor: Checking the top 100 likely next words is enough to learn efficiently.
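
A minimal sketch for step 7 of the top-K-plus-tail idea, with made-up sizes: keep only the student's top-K token probabilities plus one "everything else" bucket, for both student and teacher, so full-vocabulary targets never need to be stored. The paper's exact bookkeeping may differ.

```python
import torch

def topk_plus_tail(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   k: int = 100) -> tuple[torch.Tensor, torch.Tensor]:
    """Return (T, K + 1) probability tables: K shared token buckets + 1 tail bucket."""
    p = torch.softmax(student_logits, dim=-1)      # (T, vocab)
    q = torch.softmax(teacher_logits, dim=-1)
    topk_ids = p.topk(k, dim=-1).indices           # buckets chosen from the student's top-K
    p_top, q_top = p.gather(-1, topk_ids), q.gather(-1, topk_ids)
    p_tail = (1.0 - p_top.sum(-1, keepdim=True)).clamp(min=0.0)
    q_tail = (1.0 - q_top.sum(-1, keepdim=True)).clamp(min=0.0)
    return torch.cat([p_top, p_tail], dim=-1), torch.cat([q_top, q_tail], dim=-1)

# The compact (K + 1)-bucket tables can be fed to the same per-token divergence as before.
p_buckets, q_buckets = topk_plus_tail(torch.randn(3, 1000), torch.randn(3, 1000), k=8)
```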

  8. SDPO vs GRPO in the same pipeline 🍞 Hook: Swap the measuring stick, keep the workshop. 🥬 What: SDPO can be implemented by replacing the advantage term in a standard RLVR loop. How: 1) GRPO’s per-token advantage is constant within a rollout (based on final reward). 2) SDPO’s advantage is token- and logit-specific (based on feedback-conditioned probabilities). Why: Minimal code change, maximal signal. 🍞 Anchor: Same training loop, but richer, targeted nudges at every token.
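
A minimal sketch for step 8 of the GRPO-style advantage that gets broadcast to every token of a rollout; normalization details vary by implementation. Contrast it with the token- and candidate-specific gaps in the earlier sketches.

```python
import torch

def grpo_token_advantages(rewards: torch.Tensor, num_tokens: list[int]) -> list[torch.Tensor]:
    """rewards: (G,) final scores of G rollouts for the same question."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # one scalar per rollout
    # Every token in rollout g inherits the same scalar advantage adv[g].
    return [torch.full((n,), adv[g].item()) for g, n in enumerate(num_tokens)]

per_token = grpo_token_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]),
                                  num_tokens=[120, 95, 110, 130])
# If every rollout in the group scores 0 (or every one scores 1), adv is all zeros:
# the no-signal case where GRPO stalls. SDPO replaces this constant with the
# targeted, per-token nudges sketched above.
```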

  9. Secret sauce 🍞 Hook: The magic is turning a one-time hint into a lasting skill. 🥬 What: Self-distillation compresses what the model learns from context (feedback in the prompt) into its weights. How: 1) The feedback-conditioned teacher spots mistakes. 2) The student integrates those corrections into its base behavior. Why: You get the best of both worlds—on-policy exploration plus dense, teacher-like supervision without any external teacher. 🍞 Anchor: After seeing 'IndexError' a few times, the model stops writing out-of-range loops even when no feedback is shown next time.

04 Experiments & Results

🍞 Hook: If two runners race and one finishes a lap with fewer steps and faster time, you can tell whose stride is better.

🥬 The Test: The authors measured how quickly and how well models learn on reasoning and coding tasks. They tracked accuracy over time, response length, and how many generations were needed to reach certain scores. Why this matters: It shows whether SDPO turns rich feedback into real, practical gains, not just theory.

🍞 Anchor: On LiveCodeBench v6 (coding with unit tests and error messages), they watched how fast models improved and how high they climbed.

🍞 Hook: A fair contest needs a worthy opponent.

🥬 The Competition: SDPO was compared to GRPO, a strong RLVR baseline that uses outcome rewards. They also tried plain self-teacher SFT (off-policy), and, in some cases, combined SDPO+GRPO advantages. Why this matters: Beating a tuned, modern RL method shows the new idea is genuinely helpful.

🍞 Anchor: Think of GRPO as the current champion that grades by final score, and SDPO as the challenger that reads the teacher’s notes.

🍞 Hook: Don’t just say '87%'; tell me whether that’s top of the class.

🥬 The Scoreboard with context:

  • Coding (LCB v6, Qwen3-8B): SDPO reaches 48.8% validation accuracy vs. GRPO's 41.2%, and it matches GRPO's final level with about 4× fewer generations. That's like getting to the same finish line with one quarter of the steps.
  • Reasoning tasks (SciKnowEval subsets + tool use): Even with only scalar rewards available, SDPO uses successful attempts as feedback for failed ones and outperforms GRPO on aggregate (e.g., 68.8% vs 64.1% in one summary), often learning faster in wall-clock time.
  • Concise reasoning: SDPO’s answers are much shorter (up to 7× shorter on some tasks) while being more accurate, avoiding repetitive filler that GRPO often produces.
  • Scaling: SDPO’s gains grow with model size (Qwen3 family). With very small models, pure SDPO can lag GRPO, but mixing SDPO+GRPO helps.
  • Forgetting: SDPO maintains capabilities on holdout tasks (IFEval, ArenaHard-v2, MMLU-Pro) at least as well as GRPO and better than off-policy SFT on teacher successes.

🍞 Anchor: On code problems, SDPO not only solved more tasks but also got to GRPO’s milestone in a fraction of the tries—like finishing the same puzzle with fewer wrong turns.

🍞 Hook: The biggest surprise is often where the method works when others can’t.

🥬 Surprising Findings:

  • Test-time self-distillation: On very hard questions where base pass@64 < 0.03, SDPO reached the same discovery probability with about 3× fewer attempts than best-of-k or multi-turn prompting. It even solved at least one question neither baseline could solve within 2,750 attempts.
  • Sequence-, token-, and logit-level ablations: Even when compressing SDPO’s signal to a single sequence-level number, it still beat GRPO—showing rich feedback alone is powerful. But the full logit-level, dense credit assignment performed best.
  • Teacher improves: The feedback-conditioned self-teacher gets better during training, and the student eventually surpasses the initial teacher—true bootstrapping without external experts.

🍞 Anchor: For very tough coding puzzles, SDPO learned from each error message and found working solutions where simple resampling or longer chats couldn’t keep up.

05 Discussion & Limitations

🍞 Hook: Even great tools have right and wrong jobs.

🥬 Limitations:

  • Needs in-context skill: SDPO shines when the base model already benefits from reading feedback. On small models with weak in-context learning, SDPO may underperform GRPO.
  • Feedback quality matters: Misleading or uninformative feedback limits SDPO’s value. Garbage in, garbage out.
  • Small compute overhead: Re-scoring with the self-teacher adds time (though much less than extra generation), and some memory for stabilization.

🍞 Anchor: If the teacher's notes are vague, or the student can't make sense of the notes, progress slows.

🍞 Hook: What do you need in your backpack to hike this trail?

🥬 Required resources:

  • A feedback-producing environment (e.g., code runner with error logs or judge text).
  • A base model with decent in-context learning.
  • Training infra that can compute log-probs twice per batch (student and self-teacher), plus optional EMA or trust-region machinery.

🍞 Anchor: A coding sandbox that prints stack traces and a mid-sized LLM are a great starting kit.

🍞 Hook: When not to use the fancy wrench?

🥬 When NOT to use:

  • Very small models or tasks without meaningful textual feedback.
  • Settings where exploration is best guided purely by outcome rewards (e.g., no interpretable intermediates at all) and dense credit from feedback can’t be constructed.
  • Strict low-latency deployments with zero headroom for re-scoring.

🍞 Anchor: If all you ever get is 'pass/fail' with no error text, SDPO loses its edge.

🍞 Hook: What mysteries remain?

🥬 Open questions:

  • Agentic, long-horizon tasks: Can SDPO scale to multi-step tool use with long trajectories and partial observability?
  • Frontier scaling: How do gains change with even stronger base models and multi-task RL at scale?
  • Beyond verifiers: Can feedback from LLM judges or users (without ground-truth verifiers) drive safe, aligned improvements without reward models?
  • Behavior shaping: Why does SDPO so strongly reduce verbosity and circular reasoning? Which prompt templates or divergences amplify these benefits?

🍞 Anchor: The authors saw cleaner reasoning emerge; now we need to map exactly which training choices cause which behaviors.

06 Conclusion & Future Work

🍞 Hook: The best teacher for a model might be… the model itself, rereading its work with helpful notes.

🥬 Three-sentence summary: This paper introduces SDPO, a way for a language model to act as its own teacher by re-scoring its answers after reading rich, tokenized feedback and then learning from those improved token preferences. By turning feedback into dense, per-token guidance, SDPO fixes the credit-assignment bottleneck of pass/fail training and speeds up learning. It beats strong RL baselines on reasoning and coding, produces shorter, clearer solutions, and even accelerates discovery on very hard problems at test time.

Main achievement: Showing that a single model, conditioned on feedback, can provide teacher-quality, dense supervision to itself—without any external reward model or bigger teacher.

Future directions:

  • Push to agentic, long-horizon tasks with many intermediate states and tools.
  • Scale SDPO on large multi-task runs with frontier models to study emergent self-teaching.
  • Explore open-ended alignment using textual feedback without hard verifiers.
  • Analyze how SDPO shapes reasoning styles and how templates/divergences control that.

Why remember this: SDPO flips the script—from 'Did I pass?' to 'What exactly should I fix?'; it shows that rich feedback plus self-distillation can make RL for language models more informative, faster, and more stable while reducing rambling and sharpening reasoning.

Practical Applications

  • Code assistants that quickly learn from their own error messages to fix bugs in fewer iterations.
  • Math tutors that highlight and permanently correct specific mistaken steps, not just final answers.
  • Tool-using agents that improve API calls by reading API errors and adjusting parameters precisely.
  • Customer-support bots that refine responses by learning from supervisor comments or customer sentiment.
  • Data-cleaning pipelines that recognize and fix recurring parsing or schema errors using log feedback.
  • Education apps that turn teacher notes into durable improvements in student-like solution steps.
  • Scientific assistants that iterate on simulation or analysis failures, discovering working setups faster.
  • Competitive programming trainers that learn from failed unit tests to pass more hidden tests efficiently.
  • Test-time learners that adapt to a single hard problem and discover a solution with fewer tries.
  • Long-context workflows that compress lessons from many feedback turns into the model weights.
#Self-Distillation · #Reinforcement Learning with Rich Feedback · #SDPO · #Dense Credit Assignment · #Logit-Level Distillation · #In-Context Learning · #GRPO · #LiveCodeBench v6 · #Test-Time Training · #Discovery@k · #On-Policy Distillation · #Verifiable Rewards · #Runtime Errors · #Unit Tests · #Teacher-Student Learning