
Entropy-Adaptive Fine-Tuning: Resolving Confident Conflicts to Mitigate Forgetting

Intermediate
Muxi Diao, Lele Yang, Wuxuan Gong et al. · 1/5/2026
arXiv · PDF

Key Summary

  ‱ Supervised fine-tuning (SFT) often makes a model great at a new task but worse at its old skills; this paper explains a key reason why and how to fix it.
  ‱ The authors discover 'Confident Conflicts': tokens where the model is very sure of a different answer (low entropy) while the dataset's label gets low probability, which causes big, harmful updates.
  ‱ Their method, Entropy-Adaptive Fine-Tuning (EAFT), uses token-level entropy as a soft gate to reduce learning pressure on these conflicts while fully learning from uncertain parts.
  ‱ EAFT matches standard SFT on the target tasks (like math or tool use) but keeps general skills much better, shrinking forgetting by roughly two to four points across models.
  ‱ Unlike methods that only look at probability, EAFT separates 'I don’t know yet' from 'I’m confidently different,' so it doesn’t amplify destructive gradients.
  ‱ EAFT is simple to drop in: multiply the usual loss by a normalized top‑K entropy (K=20), which is fast and accurate (correlation ≈0.999 with full entropy).
  ‱ Across Qwen and GLM models from 4B to 32B parameters and in math, medical, and agentic domains, EAFT consistently preserves base capabilities while adapting.
  ‱ Pilot masking experiments confirm the cause: just skipping low-entropy, low-probability tokens already reduces forgetting, and EAFT improves on this with soft gating.
  ‱ EAFT is robust to different gating shapes (linear, polynomial, sigmoid) and introduces negligible compute and memory overhead.
  ‱ It’s not ideal for knowledge editing (when we must overwrite priors), but it’s a strong default for domain adaptation and continual learning.

Why This Research Matters

Models are used in the real world where we keep updating them, so we need a safe way to add new skills without breaking old ones. EAFT lets companies fine-tune assistants for math, medicine, or tool use while keeping their general smarts, reducing the risk of surprising regressions. This makes updates more reliable, speeding up product cycles and lowering costly revalidations. In safety-critical areas like healthcare, preserving baseline reasoning while adding domain knowledge helps avoid dangerous mistakes. Because EAFT is simple, fast, and robust, teams can adopt it quickly without adding heavy infrastructure. Over time, this can enable smoother continual learning, where models steadily improve instead of seesawing between gains and losses.

Detailed Explanation


01Background & Problem Definition

Let’s build the story step by step, like good detectives.

Concept toolkit (we’ll use the Sandwich pattern so each idea sticks):

🍞 Hook: You know how your brain is made of lots of tiny messengers working together to recognize patterns, like faces or words? đŸ„Ź The Concept: Neural networks are computer models made of many simple units (neurons) that work together to spot patterns in data. How it works:

  1. Show examples (inputs) and a desired answer (label).
  2. The network makes a guess.
  3. Compare guess to answer and measure error.
  4. Nudge the network to be a bit better next time.
Why it matters: Without this pattern-spotting machine, modern AI couldn’t read, speak, or reason. 🍞 Anchor: Teaching a net to tell cats from dogs by showing many photos and gently correcting mistakes.

🍞 Hook: Imagine rolling a marble down a bumpy hill so it settles into a low spot. đŸ„Ź The Concept: Gradient descent is the step-by-step way we move model parameters toward lower error. How it works: 1) Measure slope of the error; 2) Step downhill a bit; 3) Repeat; 4) Stop when flat enough. Why it matters: Without it, the model wouldn’t learn from mistakes. 🍞 Anchor: Each homework correction makes tomorrow’s quiz answers slightly better.

🍞 Hook: Think of guessing the weather: sunny is likely, snow is unlikely. đŸ„Ź The Concept: A probability distribution assigns a likelihood to each possible outcome. How it works: 1) List outcomes; 2) Assign chances that add to 1; 3) Use the chances to pick or judge outcomes. Why it matters: Models speak by choosing the next word according to these chances. 🍞 Anchor: The model decides whether the next word after “peanut butter and” is “jelly” (high chance) or “spaceship” (low chance).

🍞 Hook: In a mystery book, if you have no clue what happens next, suspense is high; if it’s obvious, suspense is low. đŸ„Ź The Concept: Entropy measures uncertainty in a probability distribution. How it works: 1) If one option dominates, entropy is low; 2) If choices are spread out, entropy is high; 3) We compute a number that grows with unpredictability. Why it matters: It tells us when a model is unsure versus very sure. 🍞 Anchor: After “capital of France is,” entropy is low (Paris is obvious); after “my favorite fruit is,” entropy is higher.
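To make the numbers concrete, here is a tiny illustrative snippet (mine, not from the paper; the probabilities are made up) that computes Shannon entropy for an "obvious" next-word distribution and a spread-out one:

```python
# Minimal sketch: entropy of a 'sure' vs. an 'unsure' next-word distribution.
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

obvious  = [0.95, 0.03, 0.01, 0.01]   # "the capital of France is ..." -> one clear winner
open_end = [0.25, 0.25, 0.25, 0.25]   # "my favorite fruit is ..." -> many plausible options
print(entropy(obvious))   # ~0.25 (low: the model is sure)
print(entropy(open_end))  # ~1.39 = ln(4) (high: the model is unsure)
```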

🍞 Hook: Think of looking at each word the model is about to say and asking, “How sure are you?” đŸ„Ź The Concept: Token-level probability is the model’s confidence in the next specific word; token-level entropy is its overall uncertainty across all options at that step. How it works: 1) Compute probabilities for all words; 2) Read the probability for the chosen/target word; 3) Calculate entropy across the distribution. Why it matters: These numbers let us see if the model is confidently right, confidently different, or just unsure. 🍞 Anchor: If the model thinks “Paris” = 0.95 and entropy is low, it’s very sure.

🍞 Hook: Picture a coach who shows perfect plays and says, “Do it exactly like this.” đŸ„Ź The Concept: Supervised Fine-Tuning (SFT) teaches a model by maximizing the likelihood of teacher-provided answers. How it works: 1) Feed input and the teacher’s answer; 2) Compute cross-entropy loss; 3) Backpropagate to make the teacher’s tokens more likely; 4) Repeat on many examples. Why it matters: It’s the standard, efficient way to specialize models. 🍞 Anchor: Training a general chatbot to be a math tutor by showing correct math solutions and making it imitate them.
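As a rough sketch of that teacher-forcing signal (toy logits and labels of my own, not from the paper), cross-entropy is small where the model already agrees with the teacher and larger where it does not:

```python
# Toy illustration of the SFT loss: per-token cross-entropy on teacher-provided labels.
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.1, 0.1, 0.1, 0.1],    # the model already favors token 0
                       [0.1, 0.1, 0.1, 0.1, 0.1]])   # the model has no preference here
teacher = torch.tensor([0, 3])                        # tokens the dataset wants at each step
loss = F.cross_entropy(logits, teacher, reduction="none")
print(loss)  # ~[0.47, 1.61]: small where model and teacher agree, larger where they don't
```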

🍞 Hook: Now imagine learning to ride a bike by trying, wobbling, and adjusting based on how it goes. đŸ„Ź The Concept: On-policy Reinforcement Learning (RL) improves a model using its own generated attempts, guided by rewards. How it works: 1) The model acts (writes answers); 2) A reward signal scores them; 3) Update the model to make high-reward actions more likely; 4) Repeat with fresh attempts from the updated model. Why it matters: Because it learns within its own “comfort zone,” it often preserves general skills better. 🍞 Anchor: A student tries many math steps, gets a score, and practices more on the ones that worked.

The world before this paper:

  • LLMs got great at new domains using SFT, but often paid an “alignment tax”: they forgot general skills. A model fine-tuned to follow strict tool-call formats might stumble on broad knowledge tests afterward.
  • In contrast, on-policy RL often boosted the new skill yet kept old talents. That contrast puzzled researchers.

The specific problem:

  • Why does SFT so often cause catastrophic forgetting, while on-policy RL doesn’t?
  • Past explanations blamed overfitting or drift, but the exact token-level culprit was unclear.

What people tried and why it fell short:

  • KL-regularized SFT (SFT_KL) tries to hold the model near its base behavior but adds memory/computation and can still push on destructive samples.
  • Probability-based reweighting (like DFT) or token-adaptive schedules (like TALR, FLOW) down/up-weight based on how “hard” or “easy” tokens look—yet they mostly rely on probability, which can confuse two different cases: (a) the model truly doesn’t know yet (good to learn), and (b) the model is confidently different from the label (dangerous to force).

The missing piece this paper adds:

  • Look not just at probability but also at entropy (uncertainty). That lets us separate “uncertain and learnable” from “confidently conflicting” tokens.

🍞 Hook: Imagine a student who is very sure 7×8=54. Forcing them to chant “56” very loudly, instantly, might make them forget other facts they knew. đŸ„Ź The Concept: Confident Conflicts are low-probability, low-entropy tokens—places where the model is very sure in a different answer than the dataset’s label. How it works: 1) The dataset says token A; 2) The model strongly prefers token B (low entropy); 3) Probability of A is low; 4) Cross-entropy pushes a huge correction; 5) Big updates distort useful prior knowledge. Why it matters: These tokens drive catastrophic forgetting when you force-fit them. 🍞 Anchor: The model wants to say “ball,” but the label demands “truncated icosahedron.” Forcing that can warp its learned language patterns.
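To make this concrete, a hypothetical helper (my own illustration, including the function name and thresholds) could flag candidate Confident Conflict positions by checking for a low-probability label together with low entropy:

```python
# Illustrative sketch for spotting potential 'Confident Conflicts':
# the dataset's label is unlikely AND the model is sharply peaked elsewhere.
# Function name and thresholds are my own choices, not from the paper.
import torch

def flag_confident_conflicts(logits, labels, p_max=0.05, h_max=0.5):
    """logits: [seq, vocab]; labels: [seq]. Returns a per-token boolean mask."""
    probs = logits.float().softmax(dim=-1)
    p_label = probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)   # P(label token)
    entropy = -(probs * (probs + 1e-12).log()).sum(dim=-1)         # full entropy (nats)
    return (p_label < p_max) & (entropy < h_max)                   # unlikely label, rigid belief

# Flagged positions are exactly where plain cross-entropy would push hardest
# against a confident prior; EAFT softens these updates instead of forcing them.
```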

Real stakes in daily life:

  • Personal assistants must keep general knowledge while learning new tools—nobody wants a model that masters spreadsheets but forgets basic facts.
  • Medical or legal adaptation must add domain skill without erasing safe, general reasoning.
  • Companies need reliable updates over time (continual learning) that don’t break what already works.

02Core Idea

Here’s the big idea in one line:

  ‱ Aha! Weight the training signal by uncertainty (entropy), not just surprise (probability), so you learn from what you don’t know and stop over-correcting what you’re confidently different about.

Three ways to picture it:

  1. Classroom analogy: The teacher spends more time where a student looks unsure (furrowed brow) and eases off when the student is stubbornly confident, to avoid bulldozing good understanding.
  2. Volume knob: Turn training volume up on high-entropy (I’m not sure) tokens; turn it down on low-entropy (I’m sure, but different) tokens.
  3. Traffic cop: Green light for uncertain tokens to flow through learning; yellow/red for confident conflicts to slow/stop harmful gradient traffic.

Before vs. after:

  • Before: SFT treated all tokens equally. Confident Conflicts got slammed with big gradients, often overwriting useful priors and causing forgetting.
  • After: EAFT gently suppresses gradients where the model is confident-but-different, while fully training where it’s uncertain. You adapt to the new domain and keep your old skills.

Why this works (intuition without equations):

  • Cross-entropy pushes hardest exactly where the model is confidently giving a different answer. That giant push can twist shared representations used across many tasks. Entropy reveals how rigid the current belief is. If entropy is low (rigid), push lightly; if it’s high (flexible), push normally. This changes harmful “wrench turns” into careful “fine-tuning.”

Building blocks (each with Sandwich clarity):

🍞 Hook: You know how sometimes you pick from just a few top choices when you’re sure? đŸ„Ź The Concept: Top‑K entropy computes uncertainty over only the most likely K tokens to save time. How it works: 1) Take the top K probabilities (e.g., K=20); 2) Renormalize; 3) Compute entropy; 4) Normalize by ln(K) to get a 0–1 weight. Why it matters: It’s fast, memory‑light, and almost identical to full entropy (correlation ≈0.999 at K=20). 🍞 Anchor: If “Paris, Lyon, London, 
” are the only real contenders, measuring uncertainty over them is enough.
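A minimal sketch of that gate, assuming next-token logits shaped [..., vocab]; the function name and constants are illustrative, not the authors' released code:

```python
# Top-K normalized entropy: a 0-1 'uncertainty' weight per token position.
import math
import torch

def topk_entropy(logits: torch.Tensor, k: int = 20) -> torch.Tensor:
    probs = logits.float().softmax(dim=-1)
    top = probs.topk(k, dim=-1).values            # keep only the K likeliest tokens
    top = top / top.sum(dim=-1, keepdim=True)     # renormalize so they sum to 1
    h = -(top * (top + 1e-12).log()).sum(dim=-1)  # entropy over the top K
    return h / math.log(k)                        # divide by ln(K): 0 = rigid, 1 = wide open

peaked = torch.tensor([8.0] + [0.0] * 99)   # one obvious answer ("Paris") -> gate near 0
flat   = torch.zeros(100)                   # anything goes                -> gate = 1.0
print(topk_entropy(peaked), topk_entropy(flat))
```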

🍞 Hook: Think of dimming a lamp instead of switching it off. đŸ„Ź The Concept: Soft gating multiplies the usual SFT loss by the (normalized) entropy, smoothly reducing pressure on confident conflicts instead of discarding them. How it works: 1) Compute cross‑entropy; 2) Compute normalized top‑K entropy; 3) Multiply: loss = gate × cross‑entropy; 4) Backpropagate. Why it matters: It preserves useful signals without bulldozing stubborn priors or throwing data away. 🍞 Anchor: A quiet coach’s nudge beats a shouted command when a player is already sure.
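Putting the two pieces together, a hedged sketch of the gated loss might look like the following; the detach and the small epsilon are my own choices, and this is not the authors' released implementation:

```python
# EAFT-style loss sketch: per-token cross-entropy scaled by a top-K entropy gate.
import math
import torch
import torch.nn.functional as F

def eaft_loss(logits: torch.Tensor, targets: torch.Tensor, k: int = 20) -> torch.Tensor:
    """logits: [batch, seq, vocab]; targets: [batch, seq] label token ids."""
    # 1) Standard SFT signal: cross-entropy for every target token.
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).view_as(targets).float()

    # 2) Soft gate: normalized top-K entropy in [0, 1].
    probs = logits.float().softmax(dim=-1)
    top = probs.topk(k, dim=-1).values
    top = top / top.sum(dim=-1, keepdim=True)
    gate = -(top * (top + 1e-12).log()).sum(dim=-1) / math.log(k)

    # 3) Multiply and average: uncertain tokens train fully,
    #    confident conflicts are softly suppressed.
    return (gate.detach() * ce).mean()
```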

🍞 Hook: If a rule works with many dial shapes, it’s probably the rule—not the dial—that matters. đŸ„Ź The Concept: Gating shape robustness means linear, polynomial, or sigmoid gates all work similarly because the key is entropy awareness itself. How it works: 1) Swap gate functions; 2) Results stay on the Pareto frontier; 3) Hard masks help retention but hurt target learning more. Why it matters: It’s the idea (uncertainty-aware gating), not a fragile formula. 🍞 Anchor: Different dimmers still dim; on/off switches can black out the room.

Put simply, EAFT is SFT with an “uncertainty-aware volume knob” at every token, turning up learning where the model is curious and turning it down where it’s stubbornly different. That’s how you adapt without forgetting.

03Methodology

At a high level: Text input → Model forward pass → Token probabilities → Top‑K entropy per token → Normalize to [0,1] gate → Multiply gate × cross‑entropy → Backprop → Updated model.

Step-by-step recipe (with why and a concrete mini‑example):

  1. Prepare inputs and data loader
  • What happens: We stream (prompt, target) pairs from the domain dataset (math, medical, or tool‑use).
  • Why it exists: We need supervised targets to adapt skills.
  ‱ Example: Prompt: “What’s the capital of France?” Target: “
 Paris.”
  2. Forward pass to get token distributions
  • What happens: The model produces, at each position t, a probability distribution over the vocabulary for the next token.
  • Why it exists: We need per‑token probabilities to compute both cross‑entropy (for learning) and entropy (for gating).
  • Example: After “capital of France is,” P(“Paris”)≈0.95, P(“Lyon”)≈0.03, others tiny.
  3. Compute token-level cross‑entropy (standard SFT signal)
  • What happens: Cross‑entropy measures how unlikely the target token is under the model.
  • Why it exists: This is the regular teacher‑forcing signal that pulls the model toward the dataset labels.
  • Example: If the target token is “Paris,” cross‑entropy is small. If the target is a rare, odd choice, it’s large.
  4. Compute token-level top‑K entropy (uncertainty signal)
  • What happens: For each step t, we take the top K tokens (K=20), renormalize, and compute entropy; then normalize by ln(K) so it lies in [0,1]. This is our gate weight H̃_t.
  • Why it exists: Entropy tells us if the model is exploring (high entropy) or rigid (low entropy). Using top‑K makes it fast and memory‑light but still accurate.
  • Example: After “capital of France is,” entropy is low (~0), gate≈0. After “my favorite fruit is,” entropy is higher, gate closer to 1.
  5. Multiply gate × cross‑entropy to form the EAFT loss (a quick numeric check follows this list)
  • What happens: L_t = H̃_t × CE_t. Uncertain tokens keep full learning pressure; confident conflicts get their pressure reduced.
  • Why it exists: This is the heart of EAFT—softly suppressing destructive updates while keeping useful learning.
  • Example A (uncertain): If H̃≈1 and CE≈0.8, L≈0.8 (full training). Example B (confident conflict): If H̃≈0.1 and CE≈4.0, L≈0.4 (greatly softened push).
  6. Backpropagate and update parameters (optimizer: AdamW)
  • What happens: Standard gradient descent steps update the model.
  • Why it exists: To learn from the adjusted (safer) signals.
  • Example: The model becomes better at high‑entropy reasoning steps without twisting its general language knowledge.
  7. Repeat across tokens, batches, and epochs; pick best checkpoint by average performance
  • What happens: Train for ~10 epochs with cosine LR schedule, warmup, batch size 64; choose the checkpoint best on averaged benchmarks.
  • Why it exists: Ensures balanced improvement on target and general capabilities.
  • Example: On math data, improve AIME/GSM8K while keeping MMLU/IFEval/CLUEWSC solid.
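Plugging the toy numbers from Examples A and B above into the gated loss (this is just arithmetic, not a reproduction of the paper's training):

```python
# The gate leaves uncertain tokens alone and softens confident conflicts.
gate_uncertain, ce_uncertain = 1.0, 0.8   # high entropy: full learning pressure
gate_conflict,  ce_conflict  = 0.1, 4.0   # confident conflict: pressure cut ~10x
print(gate_uncertain * ce_uncertain)  # 0.8 -> trained just like plain SFT
print(gate_conflict * ce_conflict)    # 0.4 -> far gentler than the raw 4.0 push
```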

What breaks without each step:

  • Skip cross‑entropy: No supervised signal; you won’t learn the target domain.
  • Skip entropy gate: You’re back to SFT; confident conflicts cause big, harmful gradients and forgetting.
  • Use probability only (no entropy): You can mistake “I don’t know yet” (good to learn) for “I’m confidently different” (dangerous to force), amplifying forgetting.
  • Don’t normalize top‑K: Gating can be unstable across contexts and vocab sizes.

Concrete mini data walk‑through:

  • Suppose the dataset label for a math solution includes a rare symbol sequence the base model doesn’t like. The model’s distribution is very peaky—low entropy—and assigns the label very low probability. Cross‑entropy alone would slam a huge gradient, risking broad representation drift. EAFT reads the low entropy, softly gates the loss, and prevents an outsized parameter shove.

Secret sauce (why it’s clever):

  • It distinguishes epistemic uncertainty (what the model truly hasn’t learned yet) from conflict with strong priors—using entropy, not just probability. That one bit of information flips training from “one-size-fits-all” to “teach where it’s needed; protect where it’s risky.”
  • It’s compute‑friendly: top‑20 entropy matches full entropy (r≈0.999) with negligible memory (<0.4 KB extra), and the code change is a single per‑token multiply.
  • It’s robust: Linear, polynomial, and sigmoid gates all trace the same Pareto frontier; the mechanism—not hyperparameters—does the heavy lifting.

04Experiments & Results

What did they test and why?

  • They measured two things: target‑domain skill (like math or tool use) and general capabilities (like MMLU, IFEval, CLUEWSC). The goal was to see if EAFT could keep the new skill without paying a “forgetting tax.”

Who was the competition?

  • Baselines included: plain SFT; SFT with KL regularization; FLOW; DFT; and TALR—each a strong, recent attempt to stabilize fine‑tuning.

Scoreboard with context:

  • Across Qwen and GLM models from 4B to 32B parameters:
    • Math domain: EAFT matched or was within about 1 point of the best target scores. That’s like getting a 94–96 on the specialty exam when the top peer got 95–96.
    ‱ General capabilities: EAFT consistently preserved more of the base model’s skills. Where standard SFT often dropped several points (e.g., roughly −4.6 overall in one 4B case), EAFT limited the drop to about −1.0, like slipping to an A− instead of falling to a C+.
  • Medical domain: On Qwen3‑4B‑Thinking with Huatuo‑O1 fine‑tuning, standard SFT reduced the general average from 86.1 to 81.3 (−4.8). EAFT held much steadier at 84.5 (−1.6) while slightly edging SFT on the medical target average (73.7 vs 73.6).
  • Agentic tool‑use: On BFCL, SFT scored slightly higher on the target (61.4 vs 60.8) but at a steep cost to general performance (81.1→74.8). EAFT balanced both, keeping general average at 77.5 with only ~1% target gap.

Mechanism checks (does the gate really filter conflicts?):

  • Gradient landscape plots: Under SFT, the bottom‑left (low probability, low entropy) zone—Confident Conflicts—lights up with very strong gradients. Under EAFT, that same zone becomes pale: gradients are near zero there, confirming suppression of destructive updates.
  • Training dynamics: For high‑entropy tokens, EAFT’s loss drops as fast as SFT—so it learns what it doesn’t know. For low‑entropy tokens, SFT forces loss toward zero (memorization), while EAFT keeps it stable, avoiding over‑optimization of stubborn priors.

Surprising or notable findings:

  • Simple masking pilot: Just skipping tokens that are bottom 15% in both probability and entropy already mitigated forgetting a lot, validating the hypothesis that Confident Conflicts drive the damage. But this hurt target learning more—hence the value of soft gating.
  • Gating variants: Linear, polynomial, and sigmoid gates all sat on the Pareto frontier, proving it’s entropy‑awareness—not a specific formula—that matters. Hard masking preserved general skills best but cut target scores more, so soft gates are the sweet spot.
  • Efficiency: Top‑K (K=20) entropy estimates track full entropy almost perfectly (Pearson ≈0.999) with tiny memory cost (<0.4 KB), making EAFT practically free to run compared to SFT.
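If you want to sanity-check that agreement on your own model, a sketch like the one below shows the plumbing; the synthetic logits are only a stand-in, and the paper's ≈0.999 was measured on real LLM distributions, so the toy numbers here are not expected to match it.

```python
# Template for comparing full entropy vs. top-20 entropy across many positions.
import torch

def full_and_topk_entropy(logits: torch.Tensor, k: int = 20):
    p = logits.float().softmax(dim=-1)
    full = -(p * (p + 1e-12).log()).sum(dim=-1)
    top = p.topk(k, dim=-1).values
    top = top / top.sum(dim=-1, keepdim=True)
    approx = -(top * (top + 1e-12).log()).sum(dim=-1)
    return full, approx

# Replace this tensor with real next-token logits collected from your model.
logits = torch.randn(2_000, 5_000) * 3.0
full, approx = full_and_topk_entropy(logits)
r = torch.corrcoef(torch.stack([full, approx]))[0, 1]
print(f"Pearson r between full and top-20 entropy: {r:.4f}")
```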

Bottom line with meaning:

  • EAFT keeps the new talents while protecting old ones. Think of it as learning a new song without forgetting how to play your favorite tunes—something previous methods struggled to balance consistently across sizes and domains.

05Discussion & Limitations

Limitations (be specific):

  • Not for knowledge editing: If you must overwrite a prior (e.g., fix a fact or teach a counterfactual), EAFT will resist by down‑weighting the needed strong update. In those cases, you want the model to change its mind, so standard SFT or editing methods are better.
  • Peak specialization: EAFT aims at Pareto balance, not always topping the target leaderboard. If the only goal is to squeeze out the last point on the target task, even at the cost of general skills, plain SFT might edge it.
  • Confidence calibration: EAFT assumes low entropy means “trust this prior.” If the base model is confidently wrong, EAFT may protect that error. Pairing with calibration could help.

Required resources:

  • Standard SFT stack: 8×A100 GPUs were used in the paper; memory/compute overhead is negligible beyond SFT since top‑K entropy is light and there’s no frozen reference model.
  • Usual training knobs: AdamW, cosine LR, warmup, batch size 64, ~10 epochs, long context where needed.

When not to use it:

  • One‑shot fact replacement/knowledge editing.
  • Counterfactual training where the goal is to deliberately override strong priors.
  • Settings where you want maximum specialization and accept forgetting.

Open questions:

  • Can we auto‑calibrate entropy so we don’t protect confident hallucinations?
  • Can sequence‑level or span‑level entropy gates (not just token‑level) further improve retention?
  • How does EAFT combine with lightweight RL (e.g., RLAIF, RFT) in hybrid training?
  • Can we adapt the gate over training time (curriculum by uncertainty) for even better results?
  • Can similar entropy gating help vision or multimodal models in continual learning?

06Conclusion & Future Work

Three‑sentence summary:

  • The paper finds that catastrophic forgetting in SFT is largely driven by Confident Conflicts—tokens where the model is very sure in a different answer than the label, producing harmful, oversized gradients.
  • It proposes EAFT, which multiplies the usual SFT loss by token‑level (top‑K) entropy, turning down learning where the model is rigid and turning it up where it’s uncertain.
  • Across multiple models and domains, EAFT matches target performance while clearly preserving general capabilities, with minimal compute overhead.

Main achievement:

  • A simple, drop‑in, uncertainty‑aware loss that cleanly separates “learn this” (uncertain) from “don’t bulldoze this” (confidently different), fixing a root cause of forgetting.

Future directions:

  • Add calibration to avoid protecting confident hallucinations; explore span/sequence‑level gating; blend with on‑policy RL or RFT to further reduce alignment tax; and extend to multimodal continual learning.

Why remember this:

  • EAFT upgrades SFT from a one‑volume‑fits‑all teacher to a caring coach who listens for uncertainty and avoids shouting at stubborn strengths. That simple switch—using entropy as a gate—lets models grow new skills without erasing who they already are.

Practical Applications

  ‱ Swap standard SFT loss with EAFT by multiplying token cross-entropy by normalized top‑K entropy (e.g., K=20).
  ‱ Use EAFT for domain adaptation (math, medical, tool use) to retain general skills while matching target performance.
  ‱ Enable continual learning updates by defaulting to EAFT, reducing regressions across releases.
  ‱ Monitor entropy–probability heatmaps to spot Confident Conflicts in your training data.
  ‱ Pair EAFT with light validation on general benchmarks (MMLU, IFEval, CLUEWSC) to select checkpoints that balance retention and specialization.
  ‱ Prefer soft gating over hard masking to avoid losing valuable target signals.
  ‱ For knowledge editing (overwriting priors), temporarily switch back to standard SFT or specialized editing methods, then return to EAFT for routine updates.
  ‱ Keep K≈20 for top‑K entropy to maintain high fidelity (r≈0.999) with negligible overhead.
  ‱ Test alternative gate shapes (linear, polynomial, sigmoid) if desired; default to linear for robustness and simplicity.
  ‱ Log per-token gates to audit when and where the model resisted updates, informing data cleanup or curriculum design.
#Entropy-Adaptive Fine-Tuning#confident conflicts#token-level entropy#catastrophic forgetting#supervised fine-tuning#on-policy reinforcement learning#cross-entropy loss#top-K entropy#dynamic loss reweighting#domain adaptation#continual learning#Qwen#GLM#alignment tax
Version: 1