LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding
Key Summary
- Speculative decoding speeds up big language models by letting a small helper model guess several next words and having the big model check them all at once.
- What really controls the speedup is how often those guesses get accepted by the big model (the acceptance rate).
- People usually train the helper model with KL divergence, which tries to make its guesses look like the big model's, but that does not always maximize acceptance when the helper is small.
- This paper introduces LK losses, new training goals that directly raise the acceptance rate instead of using KL as a proxy.
- There are two LK losses: a hybrid that mixes KL with Total Variation (TV) distance using an adaptive schedule, and a negative log-acceptance loss that scales TV's gradients automatically.
- Across four helper architectures and six big models (8B to 685B), LK losses consistently improve acceptance, often by 8–10% at temperature 1.
- LK losses are simple to add, need no extra compute, and work even when the helper uses a smaller vocabulary.
- They help the most when the helper is much smaller than the big model or when architectures differ a lot (like a dense helper vs. an MoE target).
- Pure TV training from scratch performs poorly due to tiny, unstable gradients; LK fixes this while keeping the direct acceptance focus.
- The authors release data and weights to help others reproduce and build on these results.
Why This Research Matters
LK losses make AI assistants feel faster and smoother without changing their brains (the target model). By training the small helper to get more of its guesses accepted, responses arrive with less waiting, improving user experience in chat, coding, and tutoring. This efficiency reduces cloud costs and energy use, helping companies serve more users per GPU and making deployments greener. On-device or edge AI also benefits because stronger acceptance lets smaller hardware do more. And because LK is a simple drop-in objective with no extra compute, teams can adopt it quickly across many model types.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you and a friend are writing a story together. Your friend (the fast one) quickly suggests several next words, and you (the careful one) check them all at once, keeping the good ones and stopping at the first bad one. This teamwork makes you both faster than if you wrote every word alone.
🥬 The Concept: That is speculative decoding. A small draft model guesses a few tokens ahead, and a large target model verifies them in one go.
- How it works (step by step):
- The draft model proposes K next tokens in a row.
- The target model checks all K at once.
- It accepts tokens in order until it hits the first mismatch, then it stops.
- Generation then continues from there.
- Why it matters: Without this, the big model must generate tokens one by one, waiting on memory each time, which is slow and expensive.
🍎 Anchor: Think of the draft like a sprinter laying down stepping stones, and the target like a judge who lets you keep stepping only as long as the stones are solid.
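The accept-until-first-miss loop above can be sketched in a few lines of Python. This is a toy illustration with hypothetical `draft_sample` and `target_accept` callables standing in for real models, not the paper's implementation:

```python
def speculative_round(draft_sample, target_accept, context, k=4):
    """One speculative round: the draft proposes k tokens in a row,
    then the target keeps them in order until the first rejection."""
    proposed = [draft_sample(context + i) for i in range(k)]  # k guesses ahead
    accepted = []
    for tok in proposed:
        if target_accept(tok):   # target verifies each guess in order
            accepted.append(tok)
        else:
            break                # stop at the first mismatch
    return accepted

# Toy run: the draft proposes tokens 0, 1, 2, 3; this "target" accepts
# only even tokens, so the run stops after the first one.
out = speculative_round(lambda ctx: ctx % 5, lambda t: t % 2 == 0, context=0)
```

In a real system the k verifications happen in a single batched forward pass of the target model; the loop here only mirrors the accept-until-first-miss logic.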
The World Before: Large language models (LLMs) are great at writing, coding, and explaining, but they are slow because they generate text sequentially, one token after another. Even with powerful chips, the real bottleneck is moving data (memory bandwidth), not just computing. Speculative decoding sped things up by letting a smaller helper guess multiple tokens and then having the big model check them in parallel. This preserves the big model's quality while cutting the time per token.
The Problem: The speedup from speculative decoding hinges on acceptance rate: the chance that each guessed token gets kept. If the helper's guesses are often accepted, we move faster; if not, we waste time. Traditionally, people train the helper using KL divergence, which tries to make its guessed probabilities match the big model's. But small helpers (often only 1–5% the size of the target) can't perfectly match the target. In those real, capacity-limited settings, shrinking KL doesn't necessarily maximize acceptance.
Failed Attempts: Researchers tried several divergences: forward KL (mode-covering), reverse KL (mode-seeking), and Total Variation (TV) distance. TV is theoretically perfect for acceptance because acceptance equals 1 − TV. But when training from scratch, pure TV has two problems: (1) gradients become tiny when the helper spreads probability across a huge vocabulary, and (2) the loss surface is non-smooth, making optimization unstable. Others used hybrids or extra signals (like matching hidden states), but results were uneven, often depending heavily on the dataset or starting from a pretrained helper.
The Gap: We needed a training objective that (a) directly optimizes acceptance (not just a proxy), (b) trains stably from scratch, and (c) works across architectures and sizes. In other words: get TV's focus on overlap (which truly drives acceptance) without TV's training headaches.
Real Stakes: Faster LLMs matter in daily life. They reduce waiting time in chat apps, cut cloud costs, save energy, and make on-device AI more feasible. For coding assistants, education tutors, and creative tools, smoother, speedier replies feel more natural. Also, businesses care about throughput (how many users per GPU) and latency (how fast a single answer returns). Raising acceptance directly improves both.
🍎 Anchor: If you're playing a game where you guess the next word and earn points only when the judge accepts it, you should practice getting accepted, not just looking similar to the judge's tastes on paper. LK losses teach exactly that.
02 Core Idea
🍞 Hook: You know how in school, it's better to practice the exact thing the test grades you on, not something related-but-different? Like, if the test scores how many answers the teacher marks as correct, you should train to get marked correct.
🥬 The Concept: The key insight is: train the draft to maximize acceptance rate directly, not just to match the target's distribution.
- How it works:
- Define acceptance for each position as the overlap between the draft's and target's probabilities.
- Use losses that push the draft to increase that overlap.
- Make training stable from scratch by blending smooth guidance (KL) with direct overlap pressure (TV), or by scaling TVās gradients automatically.
- Why it matters: Without targeting acceptance directly, a small helper can waste effort matching details that don't raise acceptance, and speed suffers.
🍎 Anchor: It's like practicing free throws when the game scores free throws, not just practicing dribbling because it's sort of related.
Multiple Analogies:
- Stamp of Approval: The target model is a teacher stamping each token as Accepted or Rejected. LK losses train the student to get more stamps, not just to write like the teacher in general.
- Overlapping Circles: Imagine two paint blobs (probability distributions). The accepted part is the overlap. KL tries to make blobs look similar overall; LK grows the overlap directly.
- Grocery Budget: You can spend money (model capacity) on matching every shelf, or focus on the few items you'll actually buy (top-probability tokens). LK chooses the items that maximize your useful overlap.
Before vs After:
- Before (KL-only): Helpers tried to mimic the entire target distribution, which is ideal only if they're big enough. Small helpers often spread themselves too thin.
- After (LK): Helpers focus on the part that matters: increasing acceptance. This yields longer accepted runs and higher throughput without changing the target model.
Why It Works (intuition without equations): Acceptance equals the sum of the smaller of the two probabilities for each token. Maximizing acceptance means pushing the helper's mass onto places the target already believes in. TV distance measures the total mismatch and directly connects to acceptance (acceptance = 1 − TV). But TV alone has weak, choppy gradients from scratch. So LK comes in two flavors: (a) a hybrid that starts with smooth KL guidance and gradually shifts to TV as the helper improves; (b) a negative log-acceptance loss that automatically scales TV's gradients so they aren't tiny early on.
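The identity behind this intuition, acceptance = 1 − TV, is easy to check numerically. A minimal NumPy sketch using the same toy distributions as the mini-example later in this article:

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])  # target distribution over a 3-token vocab
q = np.array([0.4, 0.5, 0.1])  # draft distribution

acceptance = np.minimum(p, q).sum()   # alpha = sum_i min(p_i, q_i)
tv = 0.5 * np.abs(p - q).sum()        # Total Variation distance

# acceptance = 0.8, tv = 0.2, and acceptance == 1 - tv exactly
assert np.isclose(acceptance, 1.0 - tv)
```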
Building Blocks (with Sandwich explanations):
-
🍞 Hook: Imagine a fast guesser and a careful checker. 🥬 Speculative Decoding: A small draft guesses multiple tokens; the big target checks them in parallel.
- How: Guess K tokens → check all → accept until first miss.
- Why: Speed without losing the target's quality. 🍎 Anchor: Like a sprinter placing stones and a judge approving each step.
-
🍞 Hook: Only kept guesses make you faster. 🥬 Acceptance Rate: The chance a drafted token is accepted by the target.
- How: It's the probability overlap at each step.
- Why: Low acceptance wastes guesses; high acceptance boosts speed. 🍎 Anchor: More green lights in a row means faster driving across town.
-
🍞 Hook: You can try to imitate someone overall, or focus on what they'll actually approve. 🥬 KL Divergence: A measure of how the helper's distribution differs from the target's.
- How: Penalizes mismatch across the whole distribution.
- Why: Smooth to optimize, but not always the best for acceptance when capacity is small. 🍎 Anchor: Copying a whole drawing when the test only grades the smiley face wastes effort.
-
🍞 Hook: Two pies can share more or less slice overlap. 🥬 Total Variation (TV) Distance: Measures total mismatch; acceptance = 1 − TV.
- How: Counts how much probability mass doesn't overlap.
- Why: Perfect for acceptance, but hard to train from scratch (tiny, choppy gradients). 🍎 Anchor: TV tells you exactly how much of your pies' slices line up.
-
🍞 Hook: Train what the test scores. 🥬 LK Losses: New objectives that directly raise acceptance.
- How: Either mix KL+TV with an adaptive schedule, or optimize negative log-acceptance that scales TV's gradients.
- Why: Get TV's directness without its training trouble. 🍎 Anchor: Practice the exact questions that show up on the quiz.
-
🍞 Hook: Start with training wheels, then ride freely. 🥬 Hybrid Objective: A blend Lλ = λ·KL + (1−λ)·TV.
- How: Begin with big λ (more KL), shrink λ as acceptance improves.
- Why: Stable early learning, direct acceptance focus later. 🍎 Anchor: Coach holds the bike at first, then lets go.
-
🍞 Hook: Turn up the volume when it's too quiet. 🥬 Negative Log-Acceptance: Lα = −log(acceptance) scales TV's gradients by 1/acceptance.
- How: When acceptance is small, the gradients get amplified automatically.
- Why: Fixes TV's tiny-gradient problem from scratch. 🍎 Anchor: If the music is faint, raise the knob so you can actually hear it.
03 Methodology
At a high level: Context + Draft logits → (Compute draft distribution q and target distribution p) → (Compute acceptance per position) → (Apply LK loss: hybrid or negative log-acceptance) → Update draft model → Output: a draft that gets accepted more often.
Step-by-step (with Sandwich for key pieces):
-
🍞 Hook: Imagine three bags of marbles labeled: Draft, Target, and Overlap. 🥬 What happens: From the context, the target model outputs logits (turned into p), and the draft outputs logits (turned into q). The overlap is the per-token min(p, q); summing it gives acceptance α at that position.
- Why this step exists: We need p and q to measure how much they agree right now. Without them, we can't compute acceptance.
- Example: If p says token A=0.6, B=0.3, C=0.1 and q says A=0.4, B=0.5, C=0.1, the overlap is min(A)=0.4, min(B)=0.3, min(C)=0.1, sum=0.8 → α=0.8. 🍎 Anchor: The overlap bag is the set of marbles both bags would pick, the accepted part.
-
🍞 Hook: You can warm up with easy drills before hard ones. 🥬 Hybrid LK Loss (Lλ): Lλ = λ·KL(p||q) + (1−λ)·TV(p, q).
- How: Compute both KL and TV; blend them with λ.
- Why: KL gives smooth guidance early; TV pushes overlap later.
- Example: If current α is low, set λ high (e.g., ~1), focusing on KL. As α rises, reduce λ to emphasize TV. 🍎 Anchor: Like practicing basic passes before fast break plays.
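A minimal NumPy sketch of the hybrid objective's forward computation (distributions assumed dense; the real training code would work on logits with automatic differentiation):

```python
import numpy as np

def hybrid_lk_loss(p, q, lam, eps=1e-12):
    """L_lambda = lam * KL(p||q) + (1 - lam) * TV(p, q)."""
    kl = np.sum(p * np.log((p + eps) / (q + eps)))  # forward KL divergence
    tv = 0.5 * np.sum(np.abs(p - q))                # Total Variation distance
    return lam * kl + (1.0 - lam) * tv

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.4, 0.5, 0.1])
loss_early = hybrid_lk_loss(p, q, lam=0.9)  # early training: mostly KL
loss_late = hybrid_lk_loss(p, q, lam=0.1)   # later: mostly TV pressure
```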
-
🍞 Hook: If you can't hear the coach, you won't learn the move. 🥬 Negative Log-Acceptance (Lα): Compute Lα = −log(α). Its gradient equals TV's gradient scaled by 1/α.
- How: When α is small, 1/α is big, amplifying useful signals.
- Why: Fixes TV's tiny early gradients without fancy blending.
- Example: If α=0.1, −log(0.1)≈2.3, and gradients are boosted 10× relative to plain TV. 🍎 Anchor: Turning up the volume so the instructions come through clearly.
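The gradient scaling is simple calculus: d(−log α)/dα = −1/α, so whatever gradient moves α (TV's gradient) is multiplied by 1/α. A quick finite-difference check in plain Python:

```python
import math

def neg_log_acceptance(alpha):
    """L_alpha = -log(acceptance)."""
    return -math.log(alpha)

alpha = 0.1
loss = neg_log_acceptance(alpha)  # -log(0.1), about 2.30

# Finite-difference estimate of dL/d(alpha); it should be close to
# -1/alpha = -10, i.e. a 10x amplification relative to plain TV at alpha=0.1.
h = 1e-6
grad = (neg_log_acceptance(alpha + h) - neg_log_acceptance(alpha - h)) / (2 * h)
```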
-
🍞 Hook: Different steps in a dance need different attention. 🥬 Adaptive Blending (λ schedule): λ = exp(−η · sg[α]) per position.
- How: Compute α per draft head; plug it into the schedule; stop-gradient so λ doesn't wobble.
- Why: Early heads with high α get more TV (small λ); late heads with low α get more KL (big λ). This naturally balances trust and pressure.
- Example: With η=3, α=0.2 → λ≈e^(−0.6)≈0.55; α=0.6 → λ≈e^(−1.8)≈0.17. 🍎 Anchor: You keep training wheels longer on the trickier parts of the ride.
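The schedule values in the example can be reproduced directly (a sketch; `sg` is just the stop-gradient, so at evaluation time λ is a plain function of α):

```python
import math

def lam_schedule(alpha, eta=3.0):
    """Adaptive blend weight: lambda = exp(-eta * alpha).
    alpha is treated as a constant (stop-gradient) during backprop."""
    return math.exp(-eta * alpha)

lam_weak = lam_schedule(0.2)    # low acceptance -> lambda ~ 0.55, leans on KL
lam_strong = lam_schedule(0.6)  # high acceptance -> lambda ~ 0.17, leans on TV
```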
-
🍞 Hook: Focus first where wins count most. 🥬 Per-head weighting: Weight early positions more (e.g., geometric decay γ^(n−1) with γ=0.8).
- How: Multiply each head's loss by its weight before summing.
- Why: Early tokens dominate acceptance length; improving them boosts speed the most.
- Example: Head 1 weight=1.0; Head 2=0.8; Head 3=0.64; and so on. 🍎 Anchor: In relay races, the first leg sets the tone.
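A sketch of the per-head aggregation under this geometric weighting (the per-head loss values here are hypothetical, just to show the bookkeeping):

```python
def head_weights(num_heads, gamma=0.8):
    """Geometric decay: head n (1-indexed) gets weight gamma**(n-1)."""
    return [gamma ** n for n in range(num_heads)]

weights = head_weights(4)               # [1.0, 0.8, 0.64, 0.512]
per_head_losses = [0.1, 0.2, 0.3, 0.4]  # hypothetical per-head LK losses
total = sum(w * l for w, l in zip(weights, per_head_losses))
```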
-
🍞 Hook: If your dictionary is smaller, don't get stuck on missing words. 🥬 Vocabulary Truncation: Many drafters use a reduced vocabulary for speed.
- How: KL can blow up if q=0 where p>0; people mask p to dodge infinities. LK avoids this: tokens outside the draft vocab contribute min(p,0)=0 to acceptance.
- Why: LK optimizes with respect to the real target, no masking hacks.
- Example: If "zebra" isn't in the draft's vocab, it simply doesn't affect α or the loss. 🍎 Anchor: If a library doesn't carry a book, you don't get penalized for not checking it out.
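In code, a truncated vocabulary just means padding q with zeros: out-of-vocab tokens contribute min(p, 0) = 0 to acceptance with no masking needed (toy numbers, not from the paper):

```python
import numpy as np

# Target distribution over 4 tokens; the draft's vocab covers only the first 3.
p_full = np.array([0.5, 0.2, 0.2, 0.1])  # last entry: a token like "zebra"
q_draft = np.array([0.6, 0.2, 0.2])      # draft distribution over its own vocab

# Out-of-vocab tokens get q = 0, so min(p, 0) = 0 there: they simply cannot
# contribute to acceptance, and no KL-style masking hack is required.
q_full = np.concatenate([q_draft, np.zeros(1)])
alpha = np.minimum(p_full, q_full).sum()  # 0.5 + 0.2 + 0.2 + 0.0 = 0.9
```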
-
🍞 Hook: Different tools, same goal. 🥬 Architecture-agnostic training: Works with MEDUSA (parallel heads), MLP speculators, EAGLE-3 (tiny transformer), and prebuilt MTP modules.
- How: The loss only needs p and q; it doesn't care how q was produced.
- Why: You can drop LK into existing pipelines without redesign.
- Example: Fine-tuning DeepSeek-V3's MTP with LK lifted acceptance further than KL. 🍎 Anchor: Whether you use a pencil or pen, you can still write the same A+ essay.
Concrete mini-example (3-token vocab):
- Target p: [cat=0.6, car=0.3, cap=0.1]
- Draft q: [cat=0.4, car=0.5, cap=0.1]
- Acceptance α = 0.4 + 0.3 + 0.1 = 0.8.
- Lα = −log(0.8) ≈ 0.223; gradients push q to move 0.5→0.3 for "car" and 0.4→0.6 for "cat," growing the overlap.
- In the hybrid, early on λ is large, so KL keeps things smooth; later, λ shrinks, and TV fine-tunes overlap.
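The arithmetic in this mini-example checks out in a few lines of plain Python:

```python
import math

p = {"cat": 0.6, "car": 0.3, "cap": 0.1}  # target distribution
q = {"cat": 0.4, "car": 0.5, "cap": 0.1}  # draft distribution

alpha = sum(min(p[t], q[t]) for t in p)   # 0.4 + 0.3 + 0.1 = 0.8
loss = -math.log(alpha)                   # about 0.223
```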
Secret Sauce:
- Directness: Optimize acceptance, the true lever of speed.
- Stability: Early KL guidance or 1/α scaling prevents TV's tiny, jumpy gradients from stalling learning.
- Curriculum: Adaptive λ dials in the right kind of learning at each stage and position.
- Practicality: No extra compute, vocab truncation handled naturally, architecture-agnostic.
04 Experiments & Results
🍞 Hook: If you want to know who runs faster, you time the race on the same track.
🥬 The Test: They measured average acceptance length τ, the expected number of tokens produced in each speculative round (including a guaranteed bonus token). Bigger τ means longer accepted runs and faster generation.
- How: Evaluate on full, common benchmarks across domains: MT-Bench (chat), HumanEval (code), and GSM8K (math).
- Why: These reflect varied real-world usage and reduce noise from tiny samples.
🍎 Anchor: It's like comparing how many clean passes a soccer team can string together before losing the ball.
The Competition:
- Baselines: Forward KL (standard), Total Variation (TV) alone, and public HuggingFace speculator checkpoints.
- Proposed: LK Hybrid (adaptive λ with η typically 3, 10 for MEDUSA) and Negative Log-Acceptance (Lα).
- Settings: Two temperatures: T=0 (greedy) and T=1 (stochastic). Training targeted stochastic acceptance (T=1), their primary evaluation setting.
Scoreboard with Context:
- Across LLaMA-3.1-8B with three different drafters (EAGLE-3, MEDUSA, MLP), both LK variants beat KL on most datasets; Hybrid LK with adaptive λ is best overall.
- Under T=1, relative τ gains commonly reach about 8–10% (like turning a B- into a solid A- in acceptance runs). Example highlights:
- For EAGLE-3 on Qwen3-235B (MoE, very large), Hybrid LK improved mean τ by about +8.2% at T=1 compared to KL.
- On GPT-OSS-120B (MoE), Hybrid LK gained +7.7% at T=1.
- For DeepSeek-V3 (685B) MTP, fine-tuning with Hybrid LK added +5.6% over KL at T=1, even though the module was already strong.
- Smaller targets (like 20B dense) saw smaller but consistent gains (~+3–4%).
- Pure TV performed clearly worse from scratch (like practicing only the trick shot and missing basics). Constant-λ hybrids (e.g., λ=0.5) helped a bit, but not as much as adaptive λ.
- Negative Log-Acceptance (Lα) also beat KL, especially at T=1; the hybrid was generally strongest, but Lα closed the gap when λ scheduling wasn't tuned.
Surprising Findings:
- TV alone looks theoretically perfect but trains poorly from random starts; its gradients are too weak and choppy. LK's fixes (adaptive λ or 1/α scaling) are crucial in practice.
- The gains are largest when the draft is much smaller than the target or when their architectures mismatch (dense draft vs. MoE target). That's exactly when KL's "match everything" instinct misallocates the draft's tiny capacity.
- Adaptive λ per head helps uneven heads: early positions (already good) get more TV fine-tuning; later positions (weaker) get more KL stability.
Why the Numbers Matter:
- A +8–10% jump in τ is like adding an extra accepted token every 10–12 tokens. In real systems, that translates to noticeable speedups, better throughput per GPU, and lower costs, all without touching the target model.
🍎 Anchor: If your relay team consistently adds one more clean handoff before a drop, your total race time improves a lot, even if each runner didn't change shoes.
05 Discussion & Limitations
Limitations:
- 🍞 Hook: Even the best tool has jobs it's not built for. 🥬 What it can't do: When the draft is very large (high-capacity) and can already match the target closely, KL is already strong, so LK's gains shrink. Also, pure TV remains hard from scratch; LK works by managing or scaling TV, not by making TV alone easy. 🍎 Anchor: A power saw shines on big cuts, not on tiny polishing.
Required Resources:
- Target model access (to compute p) during training, and a dataset of prompt-response pairs from that target.
- Usual training stack (e.g., AdamW), no extra compute vs. KL.
- An inference server that implements correct rejection sampling at T>0 (they patched vLLM accordingly).
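For reference, the standard speculative-sampling verification rule at T>0 (from the original speculative decoding literature, sketched here in Python; this is not the authors' vLLM patch) accepts a drafted token with probability min(1, p/q) and, on rejection, resamples from the normalized residual (p − q)+:

```python
import random

def verify_token(token, p, q, rng=random):
    """Accept drafted `token` with prob min(1, p[token]/q[token]);
    on rejection, resample from the residual max(p - q, 0), renormalized.
    This rule keeps the output distribution exactly equal to the target's p."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    weights = [r / z for r in residual]
    return rng.choices(range(len(p)), weights=weights, k=1)[0]

p = [0.6, 0.3, 0.1]  # target distribution
q = [0.4, 0.5, 0.1]  # draft distribution
token = verify_token(0, p, q)  # p/q = 1.5 >= 1, so token 0 is always kept
```

Note how the acceptance probability for each token is min(p, q)/q, which sums (over draft samples) to exactly the overlap α the LK losses optimize.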
When NOT to Use:
- If you only ever decode greedily (T=0) and already have high acceptance with a strong draft, LK's advantage may be minimal.
- If K=1 (no multi-token speculation), acceptance optimization matters less.
- If your pipeline cannot supply target distributions during training, you can't compute acceptance overlaps.
Open Questions:
- Directly optimizing system efficiency (accepted tokens per drafted tokens) rather than just per-position acceptance.
- Learning per-head aggregation weights instead of fixed geometric decay.
- Baking in real deployment knobs (top-k, top-p) into the loss so training matches inference exactly.
- Extending LK ideas to tree-based drafting and other speculative search strategies.
Risks and Practical Notes:
- The hyperparameter η in the λ schedule controls how quickly the blend shifts; different architectures (like MEDUSA) may prefer different η. Fortunately, the method is robust across a reasonable range.
- With very small datasets or severe domain drift, any distillation-style training can underperform; LK is no exception, so ensure your training data matches your deployment.
- The improvements focus on speed efficiency; they preserve target quality because verification remains unchanged.
06 Conclusion & Future Work
Three-Sentence Summary:
- Speculative decoding's speed comes from how many drafted tokens the target accepts, so the best way to train a draft is to maximize acceptance, not just to mimic the target's full distribution.
- LK losses do exactly this: a hybrid KL+TV objective with an adaptive schedule, and a negative log-acceptance loss that scales TV's gradients, both training stably from scratch and directly boosting acceptance.
- Across many targets (8B–685B) and drafters (EAGLE-3, MEDUSA, MLP, MTP), LK consistently improves acceptance length, often by 8–10% at T=1, without extra compute or architectural changes.
Main Achievement:
- Turning the theoretical link (acceptance = 1 − TV) into practical, stable training that outperforms KL in real systems, especially when the draft is much smaller than the target or architectures differ.
Future Directions:
- Optimize a fuller system-efficiency objective, learn per-head weights, and incorporate inference knobs (top-k/top-p) during training.
- Explore LK within tree-based speculative search and across multilingual or domain-specialized settings.
Why Remember This:
- LK losses align training with what truly matters at inference time: being accepted. They make small helpers smarter about where to spend their limited capacity, unlocking faster, cheaper, and greener LLM serving, without touching the main model.
Practical Applications
- Speed up chat assistants by training their draft heads with LK losses to raise acceptance and reduce latency.
- Lower cloud serving costs by improving throughput (more users per GPU) via higher average acceptance length.
- Improve coding assistants (IDE plugins) so multi-token completions are accepted more often, reducing flicker and re-computation.
- Enhance on-device AI experiences (phones, laptops) where smaller draft heads paired with big remote targets benefit from better acceptance.
- Upgrade existing speculators (EAGLE-3, MEDUSA, MLP, MTP) by fine-tuning with LK for immediate acceptance gains.
- Deploy vocabulary-truncated draft heads without special masking tricks, using LK's natural handling of missing tokens.
- Stabilize training from scratch for lightweight drafters by starting with the hybrid LK objective and adaptive λ scheduling.
- Boost performance for large MoE targets (e.g., Qwen3, DeepSeek) where small dense drafters struggle under KL alone.
- Tune acceptance under real sampling (temperature/top-k/top-p) by aligning training objectives with inference behavior.
- Accelerate evaluation pipelines by using chain sampling with LK-trained drafters, then extending to tree-based methods.