Self-Distillation Enables Continual Learning
Key Summary
- This paper shows a simple way for AI models to keep learning new things without forgetting what they already know.
- The trick is called Self-Distillation Fine-Tuning (SDFT), where the model teaches itself by using examples as temporary hints.
- Instead of copying answers from a dataset (off-policy SFT), SDFT makes the model practice on its own outputs (on-policy), which reduces forgetting.
- SDFT uses the same model in two roles: a 'teacher' that sees the example and a 'student' that does not; the student learns to move closer to the teacher.
- This training uses reverse KL divergence to gently nudge the student towards the teacher while staying close to its original abilities.
- Across science Q&A, tool use, and medical reasoning, SDFT beats standard SFT on new tasks and preserves older skills much better.
- In a sequence of three tasks learned one after another, SDFT lets one model stack skills without performance drop-offs.
- SDFT also helps models absorb brand-new facts (like 2025 disasters) and generalize those facts to new questions.
- Bigger models benefit more because they are better at in-context learning, which powers the teacher signal.
- SDFT costs more compute than SFT but can save time overall compared to multi-stage fixes for forgetting.
Why This Research Matters
AI systems in the real world must keep up with changing tools, rules, and facts without losing their prior strengths. SDFT gives teams a practical way to learn from demonstrations on-policy, avoiding the pitfalls of off-policy SFT and the heavy setup of building reward models. This leads to assistants that can adopt new APIs, regulations, or product features while staying reliable on everyday tasks. It also allows models to absorb fresh knowledge (like recent events) and use it flexibly, not just memorized as Q&A pairs. Because bigger models benefit even more, SDFT scales with modern foundation models. The result is safer updates, fewer surprises for users, and faster iteration cycles for developers. In short, it is a pathway to AI that improves continuously like a good student, without forgetting yesterday's lessons.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're great at riding a bike, and later you learn to skateboard. It would be a bummer if learning skateboarding made you forget how to ride a bike! We want AI to learn new skills without losing old ones.
The Concept (Continual Learning): Continual learning means an AI keeps adding new skills and knowledge over time without erasing what it already knows. How it works (big picture):
- The AI gets new examples of tasks or facts.
- It updates itself to do the new task.
- It tries not to damage older abilities while improving the new one. Why it matters: Without continual learning, every update risks breaking something that already worked. Anchor: A helpful tutor-bot should learn a new school subject (like chemistry) and still remember math and history.
Hook: You know how if you only study last year's test key, you might do fine on old questions but freeze on a new version of the test?
The Concept (Supervised Fine-Tuning, SFT): SFT is training a model to imitate correct answers from a fixed dataset of demonstrations. How it works:
- Show the model an input and the expert's correct output.
- Nudge the model to match those outputs token by token.
- Repeat over the whole dataset. Why it matters: SFT is simple and strong, but it is off-policy: the model only sees expert paths, not its own mistakes. When the model later generates its own answers, errors can pile up. Anchor: If you only memorize your teacher's solution steps, you can get lost when your homework question is worded differently.
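For readers who want to see the "match those outputs token by token" step concretely, here is a minimal sketch of the standard SFT loss. It assumes a PyTorch causal language model that returns logits (a Hugging Face-style interface); the tensor names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, demo_mask):
    """Plain SFT: next-token cross-entropy on the expert demonstration tokens.

    input_ids: (batch, seq) prompt followed by the expert answer.
    demo_mask: (batch, seq) 1 where a token belongs to the expert answer, else 0.
    The model never sees its own generations here; that is what makes SFT off-policy.
    """
    logits = model(input_ids).logits                    # (batch, seq, vocab)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    targets = input_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = demo_mask[:, 1:].float()
    return -(token_ll * mask).sum() / mask.sum()        # imitate the expert, token by token
```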
Hook: Practicing your own way helps you get better faster than only watching others.
The Concept (On-Policy Learning): On-policy means the model trains on the situations created by its own current behavior. How it works:
- Let the model answer with its current policy.
- Measure how good that answer is (or compare it to a teacher signal).
- Update the model based on what it actually did. Why it matters: Training where you "live" prevents compounding errors and reduces forgetting. Anchor: Learning to ride a bike on the actual path you'll use is better than only watching videos of other riders.
Hook: Think of sticky notes you add at the top of your notebook to remind you how to solve a kind of problem.
The Concept (In-Context Learning, ICL): In-context learning lets a model adjust its behavior using examples placed inside the prompt, without changing model weights. How it works:
- Put a demonstration before the new question.
- The model uses that demo as a hint about what to do.
- It answers in a way that fits the hint. Why it matters: If ICL is strong, you can turn the model into a capable temporary teacher simply by conditioning on the right example. Anchor: Show the AI one worked tool-use example; it then solves the next tool-use query better by following the pattern it just saw.
Hook: Sometimes we learn best by being our own coach, like reading our own notes out loud.
The Concept (Catastrophic Forgetting): Catastrophic forgetting is when learning something new causes the AI to lose older skills. How it works:
- The model is pushed strongly toward the new data distribution.
- Its parameters drift away from patterns that supported earlier tasks.
- Old abilities get worse, sometimes dramatically. Why it matters: Forgetting ruins reliability; users can't trust updates if older strengths vanish. Anchor: A chatbot trained hard on medical Q&A suddenly gets worse at basic common-sense reasoning.
The world before this paper:
- Foundation models were powerful but mostly "frozen." After deployment, they didn't update themselves to add skills or facts.
- When teams used SFT on new demos, the model often improved at that one task but forgot prior abilities (catastrophic forgetting).
- On-policy reinforcement learning (RL) helps with forgetting but needs a reward signal, which is often unavailable (for example, how do you score the "right" answer to an open question without labels?).
Failed attempts and their issues:
- Pure SFT: Easy, but off-policy; models train on expert answers only and then stumble when facing their own generation paths at test time; forgetting is common.
- IRL (Inverse Reinforcement Learning): Learn a reward from demos, then run RL on-policy. Elegant in theory but hard in practice; needs strong assumptions and can be expensive.
- Context distillation (offline): Teach a student using a teacher that has extra context, but done offline on the teacher's outputs rather than on the student's own rollouts; helps, but still not fully on-policy.
The missing piece:
- A way to get on-policy learning directly from demonstrations, without building a separate reward model.
Real stakes (why you should care):
- Products update constantly: new tools, APIs, and rules appear every week. A smart assistant should learn them without losing old skills.
- News changes: models need to absorb fresh facts (e.g., 2025 events) and use them flexibly in new questions.
- Safety and reliability: fewer surprises when updating models means happier users and safer systems.
This paper's proposal (teaser):
- Use the same model twice: as a teacher (with the demonstration in its prompt) and as a student (without it).
- Make the student practice on its own generations and nudge it toward the teacher's distribution: an on-policy self-distillation loop that keeps older skills intact while adding new ones.
02 Core Idea
Hook: Imagine you're doing homework with a great example on the page. First, you study the example (teacher mode). Then you cover it and try the problem yourself (student mode). Finally, you compare and adjust.
The Concept (Self-Distillation Fine-Tuning, SDFT): SDFT is a way for a model to learn from demonstrations on-policy by using a demonstration-conditioned version of itself as the teacher and an unconditioned version as the student. How it works:
- Teacher: Same model, but with the demo in the prompt (strong guidance via ICL).
- Student: Same model, just the input question (no demo in prompt).
- Generate: Let the student produce its own answer (on-policy).
- Compare: Compute how different the student's token probabilities are from the teacher's (reverse KL).
- Update: Move the student closer to the teacher while staying anchored to its base behavior. Why it matters: You get the benefits of on-policy learning from demos without inventing a reward function or suffering compounding errors. Anchor: In tool use, the teacher sees an example API call and subtly guides probabilities; the student learns to call the right tool on its own next time.
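For the mathematically inclined, the five steps above can be written as one schematic objective. The notation here is an editor's paraphrase of the description in this article (not a formula quoted from the paper): the student π_θ sees only the query x, the teacher is the same model (an EMA copy with weights θ̄) conditioned on the demonstration c, and the rollout y is sampled from the student itself.

```latex
% Schematic SDFT objective (editor's notation, paraphrasing the text above).
% The student pi_theta conditions only on the query x; the EMA teacher pi_{theta-bar}
% additionally conditions on the demonstration c; rollouts y come from the student.
\mathcal{L}_{\mathrm{SDFT}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \big[ \log \pi_\theta(y \mid x) - \log \pi_{\bar\theta}(y \mid x, c) \big]
  = \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\bar\theta}(\cdot \mid x, c) \big)
```

Minimizing this reverse KL on the student's own samples is what "nudge the student toward the teacher" means in practice.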
The "Aha!" in one sentence:
- Use your own model, made wiser by a demo in its prompt, to teach your unchanged self on your own rollouts, so you learn new skills without forgetting old ones.
Three analogies:
- Training wheels: The demo-conditioned teacher is like training wheels that keep your balance while you ride your own bike. Over time, you ride straight without them.
- Whisper coach: During practice, a quiet coach (the teacher) whispers hints based on a perfect example; you repeat the drill yourself and improve.
- Tracing then freehand: First trace over a perfect shape (teacher mode), then draw it freehand (student mode) while adjusting your lines to match the traced style.
Before vs. after:
- Before: SFT copies answers from a fixed dataset; strong on the new task but forgetful and brittle when the model is off the expert path.
- After: SDFT practices on the modelâs own path with a teacher nudge; stronger new-task accuracy, less forgetting, better generalization.
Why it works (intuition, not equations):
- The teacher is close to the base model because it is the same model, just shown a relevant demo. That keeps updates gentle and safe.
- On-policy training avoids the mismatch between "what you trained on" and "what you actually do."
- Reverse KL pulls the student towards the teacher in a way that preserves the studentâs high-confidence regions and prevents wild shifts.
- If you view it like IRL, the teacher defines an implicit reward: "prefer answers that the demo-conditioned version likes more than your current self."
Building blocks (each with a mini sandwich):
- Hook: Like having two hats you put on; one sees the example, the other doesn't. The Concept (Teacher-Student Distillation): A teacher gives a soft target distribution; the student learns to match it. How it works: Compute probabilities for every token from teacher and student; train the student to reduce their difference. Why it matters: Soft signals are richer than hard labels and guide learning smoothly. Anchor: The teacher doesn't just say "B." It says "60% B, 30% C, 10% A," which tells the student how confident to be.
- Hook: Picture spreading jam evenly rather than dumping it in one spot. The Concept (Reverse KL Divergence, kid's version): A way to measure and reduce the gap between two probability "shapes," prioritizing areas where the student is confident but wrong. How it works: Compare token-by-token probabilities; nudge the student so its curve matches the teacher's curve. Why it matters: This keeps the student from drifting too far while still correcting confident mistakes. Anchor: If the student is 90% sure of the wrong tool call, reverse KL pushes hard to fix that.
- Hook: A rolling average keeps your score stable even if one quiz is weird. The Concept (EMA Teacher): Keep a smoothed copy of the student as the teacher, so updates aren't jittery. How it works: After each step, mix a little of the new student into the teacher weights. Why it matters: Prevents instability from chasing noisy changes. Anchor: The EMA teacher rises steadily with learning, avoiding zig-zag behavior. (A small code sketch of the reverse-KL loss and the EMA update follows this list.)
- Hook: Sometimes we learn the "why" behind an example, not just the answer. The Concept (IRL view of SDFT): SDFT is equivalent to maximizing an implicit reward: "prefer what the demo-conditioned self prefers over the current self." How it works: The difference in log-probabilities acts like a reward signal on-policy. Why it matters: You get on-policy learning from demos without building a separate reward model. Anchor: If the teacher's distribution is much sharper for a correct tool call, that becomes a strong reward for the student's action.
- Hook: A clue is most helpful when it's specific to the problem at hand. The Concept (Instance-wise Demonstration Conditioning): The teacher sees a demo tailored to the current input, not a generic rule. How it works: For each query x, pair it with its own demonstration c in the teacher prompt. Why it matters: Fine-grained guidance beats one-size-fits-all instructions. Anchor: For a weather-API question, show a weather call demo; for a calendar-API question, show a calendar call demo.
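Here is the small sketch promised above: a toy, self-contained Python example of the reverse-KL comparison at a single token position and of the EMA teacher update. All numbers (the logits and the mixing rate alpha) are made up for illustration.

```python
import torch
import torch.nn.functional as F

# --- Reverse KL at one token position (toy numbers) ---
# The student is confidently wrong about which tool to call; the teacher prefers the right one.
student_logits = torch.tensor([3.0, 0.5, 0.0])  # e.g. [calendar.find, calendar.get, other]
teacher_logits = torch.tensor([0.2, 3.0, 0.0])  # teacher puts most mass on calendar.get

student_logp = F.log_softmax(student_logits, dim=-1)
teacher_logp = F.log_softmax(teacher_logits, dim=-1)

# Reverse KL = sum_i p_student(i) * (log p_student(i) - log p_teacher(i)).
# It is large exactly where the student is confident and the teacher disagrees.
reverse_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum()
print(f"reverse KL = {reverse_kl.item():.3f}")  # a large value -> strong push to fix this token

# --- EMA teacher update (toy mixing rate) ---
# teacher <- alpha * student + (1 - alpha) * teacher, applied parameter-wise.
alpha = 0.05
teacher_param = torch.tensor([1.00, 2.00])
student_param = torch.tensor([1.40, 1.80])
teacher_param = alpha * student_param + (1 - alpha) * teacher_param
print(teacher_param)  # the teacher moves only slightly toward the student -> steady guidance
```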
03 Methodology
At a high level: Input (query + demonstration dataset) → Build teacher context with the paired demo → Let student generate its own answer (on-policy) → Compare teacher vs. student probabilities (reverse KL) → Update student → Soft-update teacher with EMA → Output: a model that learned the new skill without forgetting.
Step-by-step recipe (with reasons and mini examples):
- Prepare your data
- What happens: You have pairs (x, c) where x is the new question and c is the expert demonstration (or text + answer for knowledge).
- Why this step exists: Without demonstrations, the teacher would have no special hint to produce better guidance.
- Example: x = "Which tool call fetches the user's calendar for July 10?"; c = a correct example of calling the calendar API.
- Build the teacher prompt (demonstration-conditioned)
- What happens: Create a teacher input like: "Question: x. This is an example for a response: c. Now answer with a response of your own, including the thinking process."
- Why this step exists: It activates the model's in-context learning, turning it into a problem-specific teacher.
- Example: For tool use, the teacher sees both the user request and a sample correct tool call; it won't copy c but will steer toward a similar, correct pattern.
- Define the student prompt (query-only)
- What happens: The student sees only x, not the demo.
- Why this step exists: The goal is to make the model perform well without needing the demo at inference time.
- Example: Student gets only the user's tool-use question.
- Generate on-policy from the student
- What happens: Sample the student's answer y to x using its current parameters.
- Why this step exists: On-policy learning means practicing on the student's own rollouts, preventing mismatch.
- Example: The student tries a tool call (maybe incorrect at first), producing a sequence of tokens.
- Compute token log-probabilities from both sides
- What happens: For each token in y, compute the student's probability and the teacher's probability.
- Why this step exists: These soft probabilities contain rich guidance (not just "right/wrong").
- Example: For the function name token, teacher might give 0.9 to calendar.get and 0.1 to calendar.find, while the student gave them backward.
- Minimize reverse KL (teacher vs. student)
- What happens: Calculate a loss that is larger when the student is confidently different from the teacher, and take a gradient step to reduce it.
- Why this step exists: Reverse KL fixes confident mistakes while keeping the model close to its original behavior, reducing forgetting.
- Example: If the student was overly sure about the wrong tool, the loss strongly pushes it toward the teacher's distribution.
- Use an EMA teacher for stability
- What happens: Maintain a smoothed copy of the student weights for the teacher (EMA). After each update: teacher ← α·student + (1 − α)·teacher.
- Why this step exists: Prevents the teacher from changing too fast and destabilizing training.
- Example: Even if one step was noisy, the EMA teacher stays steady and provides consistent guidance.
- Repeat across the dataset with one rollout per example
- What happens: Iterate steps 1–7 over all (x, c) pairs for a few epochs.
- Why this step exists: One on-policy rollout per example is enough and more efficient than multi-sample RL loops.
- Example: Over time, the model steadily improves its tool-call accuracy while preserving earlier skills.
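Putting the recipe together, the sketch below shows one SDFT training step end to end. It is a minimal illustration under several assumptions: a Hugging Face-style causal LM interface, a teacher prompt modeled on the wording quoted in step 2, a single unbatched example, and made-up hyperparameters (model name, learning rate, EMA rate). It is not the paper's released code.

```python
import copy
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"        # illustrative choice of a chat-tuned causal LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
student = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
teacher = copy.deepcopy(student)                # EMA teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-6)
EMA_ALPHA = 0.02                                # made-up EMA mixing rate

def answer_logits(model, prompt_ids, answer_ids):
    """Logits at the positions that predict each answer token, given the prompt."""
    ids = torch.cat([prompt_ids, answer_ids], dim=1)
    logits = model(ids).logits                  # (1, seq, vocab)
    start = prompt_ids.shape[1] - 1             # position that predicts the first answer token
    return logits[:, start:start + answer_ids.shape[1]]

def reverse_kl(student_logits, teacher_logits):
    """Per-position KL(student || teacher) over the vocabulary."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    return (s_logp.exp() * (s_logp - t_logp)).sum(-1)

def sdft_step(x, c):
    # Steps 2-3: the teacher prompt carries the paired demo, the student prompt does not.
    student_prompt = tokenizer(f"Question: {x}\nAnswer:", return_tensors="pt").input_ids
    teacher_prompt = tokenizer(
        f"Question: {x}\nThis is an example for a response: {c}\n"
        "Now answer with a response of your own, including the thinking process.\nAnswer:",
        return_tensors="pt").input_ids

    # Step 4: one on-policy rollout sampled from the student.
    with torch.no_grad():
        rollout = student.generate(student_prompt, do_sample=True, max_new_tokens=256)
    answer_ids = rollout[:, student_prompt.shape[1]:]

    # Steps 5-6: compare the two distributions on the student's own tokens, then update.
    s_logits = answer_logits(student, student_prompt, answer_ids)
    with torch.no_grad():
        t_logits = answer_logits(teacher, teacher_prompt, answer_ids)
    loss = reverse_kl(s_logits, t_logits).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Step 7: soft EMA update keeps the teacher steady.
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(1 - EMA_ALPHA).add_(EMA_ALPHA * s_p)
    return loss.item()
```

Looping sdft_step once over each (x, c) pair for a few epochs corresponds to step 8; a production version would add batching, device placement, masking of spurious leading tokens, and careful generation settings.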
Concrete mini-walkthroughs:
- Tool use: Input: API spec + user request to schedule an event. Teacher (with demo): Prioritizes tokens for the correct function and arguments. Student (no demo): Generates a call; the loss pushes its distribution toward the teacher's. Outcome: Next time, the student more reliably picks the right function and fills arguments correctly.
- Knowledge injection (2025 disasters): Input: Question about a 2025 earthquake; the demonstration includes text + a worked answer. Teacher: Emphasizes correct place names and dates in its token probabilities. Student: Learns those tokens' probabilities even without access to the article later. Outcome: The model answers correctly and can even use the facts in new, indirect questions.
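The two walkthroughs differ mainly in what the demonstration c contains, not in the training loop itself. The tiny helper below makes that concrete; the template wording follows the prompt quoted in step 2 of the recipe, while the function name, the example API call, and the placeholder article text are hypothetical.

```python
def build_prompts(x: str, c: str) -> tuple[str, str]:
    """Return (student_prompt, teacher_prompt) for one (query, demonstration) pair.

    Skill learning: c is a worked example, such as a correct tool call.
    Knowledge injection: c is the source text plus a worked answer.
    Only the teacher prompt contains c; the student must cope with the query alone.
    """
    student_prompt = f"Question: {x}\nAnswer:"
    teacher_prompt = (
        f"Question: {x}\n"
        f"This is an example for a response: {c}\n"
        "Now answer with a response of your own, including the thinking process.\nAnswer:"
    )
    return student_prompt, teacher_prompt

# Tool-use walkthrough: the demonstration is a correct API call (hypothetical API).
build_prompts(
    "Schedule 'Dentist' on July 10 at 3pm.",
    'calendar.create_event(title="Dentist", date="July 10", time="15:00")',
)

# Knowledge-injection walkthrough: the demonstration is article text plus a worked answer.
build_prompts(
    "Which region was hit by the 2025 earthquake described in the article?",
    "Article: <news text about the 2025 earthquake> Worked answer: <region, date, magnitude>",
)
```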
Why this method is clever (the secret sauce):
- It turns demonstrations into an on-policy learning signal without inventing a reward model.
- It uses the same model as both teacher and student, so guidance stays close to what the model already knows, which protects prior skills.
- Reverse KL targets confident mistakes, which are the most harmful ones, improving accuracy while limiting drift.
- EMA teacher ensures stability, avoiding the "chase your own tail" problem.
What would break without each step:
- No teacher conditioning: The student loses a strong, instance-specific guide; learning weakens.
- No on-policy rollout: You're back to off-policy imitation, risking compounding errors and forgetting.
- No reverse KL: Updates can become either too timid (no progress) or too wild (forgetting).
- No EMA: Training can jitter or diverge from chasing noisy updates.
- No token-level probabilities: You lose rich, graded hints and fall back to blunt right/wrong supervision.
04 Experiments & Results
The tests (what and why):
- Skill Learning: Can the model pick up new skills (science Q&A, tool use, medical reasoning) without trashing older abilities?
- Knowledge Acquisition: Can the model absorb brand-new facts from 2025 events and use them in both direct and indirect questions?
- Retention: After training on one task (or a sequence of tasks), how much of the original broad capability suite remains?
The competition (baselines):
- SFT (standard supervised fine-tuning): Simple, off-policy.
- DFT (distillation fine-tuning): Uses importance-weighting tricks to behave more on-policy, but prior skills still degrade.
- Re-invocation: Do SFT, then try to repair damage with extra on-policy distillation on general prompts.
- CPT (continual pretraining): For knowledge injection, train with a next-token loss on the raw corpus.
- Oracle RAG: Upper bound for knowledge questions when you always fetch the right article.
The scoreboard (with context):
- Skill Learning trade-offs (three tasks): SDFT consistently sits in the top-right of the trade-off plots (high new-task accuracy and high retention of prior tasks), while SFT gets good new-task scores but forgets a lot; DFT is better than SFT but not as strong as SDFT; Re-invoke only partially recovers the losses.
- Concrete example (Science Q&A): SDFT reached about 70.2% new-task accuracy vs. 66.2% for SFT, while also preserving prior benchmarks much better (average retention around 64.5% vs. SFT's 53.4%). That's like getting an A- on the new unit while keeping your old grades, instead of getting a B and losing last term's A's.
- Tool Use: Base 42.9%. SFT improves but forgets broadly; SDFT reaches about 70.6% on the new task and keeps strong prior-task scores (roughly 65.4% on average), near the base model's broad capability level (65.5%). That's like leveling up in gadgets without becoming clumsy elsewhere.
- Medical reasoning: SDFT improves accuracy (about 40.2% vs. the base model's 30.1%) and retains broad skills better than the baselines; SFT lags and shortens reasoning responses (a reasoning collapse), while SDFT preserves longer, useful reasoning.
Knowledge Acquisition results (new 2025 facts):
- Strict accuracy (all details correct): SDFT reaches about 89% vs. about 80% for SFT; CPT scores around 93% on some lenient measures but does worse on the strict metric; Oracle RAG hits 100% when you always fetch the perfect article.
- Lenient accuracy (correct with no wrong details): SDFT reaches about 100%, matching Oracle RAG.
- Out-of-distribution (indirect) questions: SDFT scores about 98% vs. a much lower score for SFT. This shows SDFT actually integrates facts into the model's internal memory, rather than only parroting seen Q&A pairs.
Surprising or notable findings:
- Scale helps: Bigger models (with stronger ICL) gain more from SDFT. The gap vs. SFT widens from 3B to 14B.
- Not just entropy collapse: SDFT improves pass@k for many values of k, so it is real skill gain, not merely shrinking the distribution.
- Sequential continual learning: Train on Tool Use → then Science → then Medical. With SDFT, performance stacks and stays; with SFT, performance rises on the current task and drops on earlier ones.
- On-policy is key: Offline distillation from the teacher helps a bit but does not match SDFT. The act of training on the student's own rollouts is crucial.
- Teacher closeness: The demo-conditioned teacher stays much closer (lower KL) to the base model than an SFT-fine-tuned model does, which likely protects prior capabilities.
Takeaway in plain words:
- When the model practices on its own outputs while being gently steered by its demo-conditioned self, it learns new tasks better and keeps old ones strong. That is the core reason SDFT wins across the board.
05 Discussion & Limitations
Limitations (honest and specific):
- Compute cost: Generating on-policy rollouts during training adds overhead: about 2.5× the FLOPs and roughly 4× the wall-clock time vs. plain SFT in these experiments.
- Depends on ICL: If the base model's in-context learning is weak (smaller models), the teacher signal is too faint and SDFT may underperform SFT.
- Spurious markers: The student can inherit phrases like "Based on the text…" from the teacher, even when no text is present; masking initial tokens helps but is a heuristic.
- Hard to cause radical behavior shifts: SDFT preserves style; it is great for adding skills/knowledge, but harder for turning a non-reasoning model into a chain-of-thought model from scratch.
Required resources:
- A dataset of demonstrations (input + worked example or text+answer),
- A model with decent ICL capability (7B+ works better),
- GPU time for on-policy rollouts, and
- Implementation of reverse-KL token-level distillation with an EMA teacher.
When NOT to use it:
- If you have a tiny model with weak ICL: teacher signals will be poor.
- If you need a dramatic style shift (e.g., force long chain-of-thought from a terse model); consider other methods or combine with RL.
- If you have a perfect reward function and strong RL infrastructure, you might prefer direct on-policy RL.
Open questions:
- Can we design better teacher prompts or retrieval to strengthen the teacher without copying artifacts?
- What is the best blend of SDFT and RL signals, applied sequentially or jointly?
- How to robustly handle noisy or non-expert demonstrations?
- Can we provide theoretical guarantees on forgetting bounds under on-policy distillation?
- Are there smarter, cheaper KL estimators or sampling tricks that reduce compute while keeping benefits?
Bottom line:
- SDFT is not a replacement for RL but a practical, complementary tool: it unlocks on-policy learning from demonstrations (often what you have in real life) while safeguarding prior abilities.
06 Conclusion & Future Work
Three-sentence summary:
- This paper introduces Self-Distillation Fine-Tuning (SDFT), where a model learns on-policy from demonstrations by using a demonstration-conditioned version of itself as the teacher and its unconditioned self as the student.
- By minimizing reverse KL on the studentâs own rollouts and using an EMA teacher, SDFT improves new-task accuracy and sharply reduces catastrophic forgetting.
- Across skills and fresh knowledge injection, SDFT consistently outperforms standard SFT and supports true sequential continual learning.
Main achievement:
- A simple, practical recipe to turn demonstration datasets into on-policy learning signals, with no reward model needed, enabling continual learning without wrecking prior capabilities.
Future directions:
- Combine SDFT with on-policy RL (as initialization or mixed signals), design better teacher prompts and retrieval, handle noisy demos, and push to larger models where ICL shines.
Why remember this:
- SDFT reframes "learn from examples" as "practice on your own with a guided nudge," giving you the best of both worlds: stronger new skills and preserved old ones. It is an essential step toward truly ever-learning AI.
Practical Applications
- Update a customer-support bot with new company policies while preserving its existing troubleshooting skills.
- Teach a code assistant new library APIs without degrading its general coding help.
- Add brand-new world facts (post-cutoff events) so an assistant can answer timely questions and reason about them.
- Train a tool-using agent (e.g., calendar, email, CRM) to choose the right tools and arguments more reliably.
- Improve medical or legal Q&A capabilities without harming general reasoning performance.
- Create robust enterprise copilots that continuously learn from internal demos while staying safe and stable.
- Enhance robotics skills from demonstration videos/logs, stacking new skills without losing older ones.
- Strengthen tutoring systems with new curricular modules while keeping prior grade-level help intact.
- Initialize an RL fine-tuning stage from a stronger starting policy obtained via SDFT on demos.
- Upgrade multilingual assistants with domain-specific examples (finance, healthcare) while preserving general language abilities.