
LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding

Intermediate
Alexander Samarin, Sergei Krutikov, Anton Shevtsov et al. Ā· 2/27/2026
arXiv

Key Summary

  • Speculative decoding speeds up big language models by letting a small helper model guess several next words and having the big model check them all at once.
  • What really controls the speedup is how often those guesses get accepted by the big model (the acceptance rate).
  • People usually train the helper model with KL divergence, which tries to make its guesses look like the big model’s—but that does not always maximize acceptance when the helper is small.
  • This paper introduces LK losses, new training goals that directly raise the acceptance rate instead of using KL as a proxy.
  • There are two LK losses: a hybrid that mixes KL with Total Variation (TV) distance using an adaptive schedule, and a negative log-acceptance loss that scales TV’s gradients automatically.
  • Across four helper architectures and six big models (8B to 685B), LK losses consistently improve acceptance—often by 8–10% at temperature 1.
  • LK losses are simple to add, need no extra compute, and work even when the helper uses a smaller vocabulary.
  • They help the most when the helper is much smaller than the big model or when architectures differ a lot (like a dense helper vs. an MoE target).
  • Pure TV training from scratch performs poorly due to tiny, unstable gradients; LK fixes this while keeping the direct acceptance focus.
  • The authors release data and weights to help others reproduce and build on these results.

Why This Research Matters

LK losses make AI assistants feel faster and smoother without changing their brains (the target model). By training the small helper to get more of its guesses accepted, responses arrive with less waiting, improving user experience in chat, coding, and tutoring. This efficiency reduces cloud costs and energy use, helping companies serve more users per GPU and making deployments greener. On-device or edge AI also benefits because stronger acceptance lets smaller hardware do more. And because LK is a simple drop-in objective with no extra compute, teams can adopt it quickly across many model types.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you and a friend are writing a story together. Your friend (the fast one) quickly suggests several next words, and you (the careful one) check them all at once, keeping the good ones and stopping at the first bad one. This teamwork makes you both faster than if you wrote every word alone.

🄬 The Concept: That is speculative decoding. A small draft model guesses a few tokens ahead, and a large target model verifies them in one go.

  • How it works (step by step):
    1. The draft model proposes K next tokens in a row.
    2. The target model checks all K at once.
    3. It accepts tokens in order until it hits the first mismatch, then it stops.
    4. Generation then continues from there.
  • Why it matters: Without this, the big model must generate tokens one by one, waiting on memory each time, which is slow and expensive.
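The accept-until-first-miss loop above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `speculative_round` is a hypothetical helper name, and real systems verify all K tokens in one batched forward pass rather than a Python loop.

```python
import random

def speculative_round(draft_probs, target_probs, tokens):
    """One verification round: keep drafted tokens in order until the
    first rejection, using the standard rejection-sampling rule.

    draft_probs[i]  = q(token_i | context) under the draft model
    target_probs[i] = p(token_i | context) under the target model
    tokens          = the K tokens the draft proposed
    """
    accepted = []
    for q, p, tok in zip(draft_probs, target_probs, tokens):
        # Accept this token with probability min(1, p/q).
        if random.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break  # stop at the first mismatch
    return accepted

# If the draft agrees with the target exactly, everything is accepted;
# if the target gives a drafted token zero probability, it is rejected.
all_kept = speculative_round([0.5, 0.5], [0.5, 0.5], ["the", "cat"])
none_kept = speculative_round([0.5], [0.0], ["zzz"])
```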

šŸž Anchor: Think of the draft like a sprinter laying down stepping stones, and the target like a judge who lets you keep stepping only as long as the stones are solid.

The World Before: Large language models (LLMs) are great at writing, coding, and explaining, but they are slow because they generate text sequentially—one token after another. Even with powerful chips, the real bottleneck is moving data (memory bandwidth), not just computing. Speculative decoding sped things up by letting a smaller helper guess multiple tokens and then having the big model check them in parallel. This preserves the big model’s quality while cutting the time per token.

The Problem: The speedup from speculative decoding hinges on acceptance rate—the chance that each guessed token gets kept. If the helper’s guesses are often accepted, we move faster; if not, we waste time. Traditionally, people train the helper using KL divergence, which tries to make its guessed probabilities match the big model’s. But small helpers (often only 1–5% the size of the target) can’t perfectly match the target. In those real, capacity-limited settings, shrinking KL doesn’t necessarily maximize acceptance.

Failed Attempts: Researchers tried several divergences—forward KL (mode-covering), reverse KL (mode-seeking), and Total Variation (TV) distance. TV is theoretically perfect for acceptance because acceptance equals 1 āˆ’ TV. But when training from scratch, pure TV has two problems: (1) gradients become tiny when the helper spreads probability across a huge vocabulary, and (2) the loss surface is non-smooth, making optimization unstable. Others used hybrids or extra signals (like matching hidden states), but results were uneven, often depending heavily on the dataset or starting from a pretrained helper.

The Gap: We needed a training objective that (a) directly optimizes acceptance (not just a proxy), (b) trains stably from scratch, and (c) works across architectures and sizes. In other words: get TV’s focus on overlap (which truly drives acceptance) without TV’s training headaches.

Real Stakes: Faster LLMs matter in daily life. They reduce waiting time in chat apps, cut cloud costs, save energy, and make on-device AI more feasible. For coding assistants, education tutors, and creative tools, smoother, speedier replies feel more natural. Also, businesses care about throughput (how many users per GPU) and latency (how fast a single answer returns). Raising acceptance directly improves both.

šŸž Anchor: If you’re playing a game where you guess the next word and earn points only when the judge accepts it, you should practice getting accepted—not just looking similar to the judge’s tastes on paper. LK losses teach exactly that.

02Core Idea

šŸž Hook: You know how in school, it’s better to practice the exact thing the test grades you on, not something related-but-different? Like, if the test scores how many answers the teacher marks as correct, you should train to get marked correct.

🄬 The Concept: The key insight is: train the draft to maximize acceptance rate directly, not just to match the target’s distribution.

  • How it works:
    1. Define acceptance for each position as the overlap between the draft’s and target’s probabilities.
    2. Use losses that push the draft to increase that overlap.
    3. Make training stable from scratch by blending smooth guidance (KL) with direct overlap pressure (TV), or by scaling TV’s gradients automatically.
  • Why it matters: Without targeting acceptance directly, a small helper can waste effort matching details that don’t raise acceptance—and speed suffers.

šŸž Anchor: It’s like practicing free throws when the game scores free throws, not just practicing dribbling because it’s sort of related.

Multiple Analogies:

  1. Stamp of Approval: The target model is a teacher stamping each token as Accepted or Rejected. LK losses train the student to get more stamps, not just to write like the teacher in general.
  2. Overlapping Circles: Imagine two paint blobs (probability distributions). The accepted part is the overlap. KL tries to make blobs look similar overall; LK grows the overlap directly.
  3. Grocery Budget: You can spend money (model capacity) on matching every shelf, or focus on the few items you’ll actually buy (top-probability tokens). LK chooses the items that maximize your useful overlap.

Before vs After:

  • Before (KL-only): Helpers tried to mimic the entire target distribution, which is ideal only if they’re big enough. Small helpers often spread themselves too thin.
  • After (LK): Helpers focus on the part that matters: increasing acceptance. This yields longer accepted runs and higher throughput without changing the target model.

Why It Works (intuition without equations): Acceptance equals the sum of the smaller of the two probabilities for each token. Maximizing acceptance means pushing the helper’s mass onto places the target already believes in. TV distance measures the total mismatch and directly connects to acceptance (acceptance = 1 āˆ’ TV). But TV alone has weak, choppy gradients from scratch. So LK comes in two flavors: (a) a hybrid that starts with smooth KL guidance and gradually shifts to TV as the helper improves; (b) a negative log-acceptance loss that automatically scales TV’s gradients so they aren’t tiny early on.
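The identity behind this intuition (acceptance = 1 āˆ’ TV) can be checked numerically on a 3-token toy vocabulary; this is a minimal sketch, not code from the paper:

```python
# Acceptance is the summed per-token overlap min(p, q);
# total variation is half the summed absolute difference.
p = {"cat": 0.6, "car": 0.3, "cap": 0.1}  # target distribution
q = {"cat": 0.4, "car": 0.5, "cap": 0.1}  # draft distribution

acceptance = sum(min(p[t], q[t]) for t in p)   # overlap mass
tv = 0.5 * sum(abs(p[t] - q[t]) for t in p)    # total variation distance

# The two always satisfy acceptance = 1 - TV for valid distributions.
assert abs(acceptance - (1 - tv)) < 1e-12
```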

Building Blocks (with Sandwich explanations):

  • šŸž Hook: Imagine a fast guesser and a careful checker. 🄬 Speculative Decoding: A small draft guesses multiple tokens; the big target checks them in parallel.

    • How: Guess K tokens → check all → accept until first miss.
    • Why: Speed without losing the target’s quality. šŸž Anchor: Like a sprinter placing stones and a judge approving each step.
  • šŸž Hook: Only kept guesses make you faster. 🄬 Acceptance Rate: The chance a drafted token is accepted by the target.

    • How: It’s the probability overlap at each step.
    • Why: Low acceptance wastes guesses; high acceptance boosts speed. šŸž Anchor: More green lights in a row means faster driving across town.
  • šŸž Hook: You can try to imitate someone overall, or focus on what they’ll actually approve. 🄬 KL Divergence: A measure of how the helper’s distribution differs from the target’s.

    • How: Penalizes mismatch across the whole distribution.
    • Why: Smooth to optimize, but not always the best for acceptance when capacity is small. šŸž Anchor: Copying a whole drawing when the test only grades the smiley face wastes effort.
  • šŸž Hook: Two pies can share more or less slice overlap. 🄬 Total Variation (TV) Distance: Measures total mismatch; acceptance = 1 āˆ’ TV.

    • How: Counts how much probability mass doesn’t overlap.
    • Why: Perfect for acceptance—but hard to train from scratch (tiny, choppy gradients). šŸž Anchor: TV tells you exactly how much of your pies’ slices line up.
  • šŸž Hook: Train what the test scores. 🄬 LK Losses: New objectives that directly raise acceptance.

    • How: Either mix KL+TV with an adaptive schedule, or optimize negative log-acceptance that scales TV’s gradients.
    • Why: Get TV’s directness without its training trouble. šŸž Anchor: Practice the exact questions that show up on the quiz.
  • šŸž Hook: Start with training wheels, then ride freely. 🄬 Hybrid Objective: A blend LĪ» = λ·KL + (1āˆ’Ī»)Ā·TV.

    • How: Begin with big Ī» (more KL), shrink Ī» as acceptance improves.
    • Why: Stable early learning, direct acceptance focus later. šŸž Anchor: Coach holds the bike at first, then lets go.
  • šŸž Hook: Turn up the volume when it’s too quiet. 🄬 Negative Log-Acceptance: Lα = āˆ’log(acceptance) scales TV’s gradients by 1/acceptance.

    • How: When acceptance is small, the gradients get amplified automatically.
    • Why: Fixes TV’s tiny-gradient problem from scratch. šŸž Anchor: If the music is faint, raise the knob so you can actually hear it.
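The 1/acceptance amplification in the last building block can be verified directly: since TV = 1 āˆ’ α, the gradient of āˆ’log(α) equals TV's gradient divided by α. A finite-difference sketch on a toy two-token problem (all names here are illustrative, not from the paper's code):

```python
import math

# Target p is fixed; the draft q = softmax(theta) has a single parameter.
p = (0.7, 0.3)

def q_of(theta):
    e = math.exp(theta)
    return (e / (1 + e), 1 / (1 + e))

def alpha(theta):          # acceptance = overlap mass = sum of per-token mins
    return sum(min(pi, qi) for pi, qi in zip(p, q_of(theta)))

def tv(theta):             # total variation = 1 - acceptance
    return 1 - alpha(theta)

def neg_log_alpha(theta):  # the L_alpha objective
    return -math.log(alpha(theta))

# Central finite differences at theta = -1.0 (q still far from p).
h, theta = 1e-6, -1.0
g_tv = (tv(theta + h) - tv(theta - h)) / (2 * h)
g_la = (neg_log_alpha(theta + h) - neg_log_alpha(theta - h)) / (2 * h)

# grad(-log alpha) == grad(TV) / alpha: the automatic volume knob.
assert abs(g_la - g_tv / alpha(theta)) < 1e-4
```

When α is small, dividing by it makes the same TV signal much louder, which is exactly the early-training regime where plain TV stalls.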

03Methodology

At a high level: Context + Draft logits → (Compute draft distribution q and target distribution p) → (Compute acceptance per position) → (Apply LK loss: hybrid or negative log-acceptance) → Update draft model → Output: a draft that gets accepted more often.

Step-by-step (with Sandwich for key pieces):

  1. šŸž Hook: Imagine three bags of marbles labeled: Draft, Target, and Overlap. 🄬 What happens: From the context, the target model outputs logits (turn into p), and the draft outputs logits (turn into q). The overlap is the per-token min(p, q); summing it gives acceptance α at that position.

    • Why this step exists: We need p and q to measure how much they agree right now. Without them, we can’t compute acceptance.
    • Example: If p says token A=0.6, B=0.3, C=0.1 and q says A=0.4, B=0.5, C=0.1, the overlap is min(0.6, 0.4)=0.4 for A, min(0.3, 0.5)=0.3 for B, and min(0.1, 0.1)=0.1 for C; the sum is 0.8, so α=0.8. šŸž Anchor: The overlap bag is the set of marbles both bags would pick—the accepted part.
  2. šŸž Hook: You can warm up with easy drills before hard ones. 🄬 Hybrid LK Loss (LĪ»): LĪ» = λ·KL(p||q) + (1āˆ’Ī»)Ā·TV(p, q).

    • How: Compute both KL and TV; blend them with Ī».
    • Why: KL gives smooth guidance early; TV pushes overlap later.
    • Example: If current α is low, set Ī» high (e.g., ~1), focusing on KL. As α rises, reduce Ī» to emphasize TV. šŸž Anchor: Like practicing basic passes before fast break plays.
  3. šŸž Hook: If you can’t hear the coach, you won’t learn the move. 🄬 Negative Log-Acceptance (Lα): Compute Lα = āˆ’log(α). Its gradient equals TV’s gradient scaled by 1/α.

    • How: When α is small, 1/α is big—amplifying useful signals.
    • Why: Fixes TV’s tiny early gradients without fancy blending.
    • Example: If α=0.1, āˆ’log(0.1)=2.3, and gradients are boosted 10Ɨ relative to plain TV. šŸž Anchor: Turning up the volume so the instructions come through clearly.
  4. šŸž Hook: Different steps in a dance need different attention. 🄬 Adaptive Blending (Ī» schedule): Ī» = exp(āˆ’Ī· Ā· sg[α]) per position.

    • How: Compute α per draft head; plug into the schedule; stop-gradient so Ī» doesn’t wobble.
    • Why: Early heads with high α get more TV (small Ī»); late heads with low α get more KL (big Ī»). This naturally balances trust and pressure.
    • Example: With Ī·=3, α=0.2 → Ī»ā‰ˆe^(āˆ’0.6)ā‰ˆ0.55; α=0.6 → Ī»ā‰ˆe^(āˆ’1.8)ā‰ˆ0.17. šŸž Anchor: You keep training wheels longer on the trickier parts of the ride.
  5. šŸž Hook: Focus first where wins count most. 🄬 Per-head weighting: Weight early positions more (e.g., geometric decay γ^nāˆ’1 with γ=0.8).

    • How: Multiply each head’s loss by its weight before summing.
    • Why: Early tokens dominate acceptance length; improving them boosts speed the most.
    • Example: Head 1 weight=1.0; Head 2=0.8; Head 3=0.64; and so on. šŸž Anchor: In relay races, the first leg sets the tone.
  6. šŸž Hook: If your dictionary is smaller, don’t get stuck on missing words. 🄬 Vocabulary Truncation: Many drafters use a reduced vocabulary for speed.

    • How: KL can blow up if q=0 where p>0; people mask p to dodge infinities. LK avoids this: tokens outside draft vocab contribute min(p,0)=0 to acceptance.
    • Why: LK optimizes with respect to the real target, no masking hacks.
    • Example: If ā€œzebraā€ isn’t in the draft’s vocab, it simply doesn’t affect α or the loss. šŸž Anchor: If a library doesn’t carry a book, you don’t get penalized for not checking it out.
  7. šŸž Hook: Different tools, same goal. 🄬 Architecture-agnostic training: Works with MEDUSA (parallel heads), MLP speculators, EAGLE-3 (tiny transformer), and prebuilt MTP modules.

    • How: The loss only needs p and q; it doesn’t care how q was produced.
    • Why: You can drop LK into existing pipelines without redesign.
    • Example: Fine-tuning DeepSeek-V3’s MTP with LK lifted acceptance further than KL. šŸž Anchor: Whether you use a pencil or pen, you can still write the same A+ essay.

Concrete mini-example (3-token vocab):

  • Target p: [cat=0.6, car=0.3, cap=0.1]
  • Draft q: [cat=0.4, car=0.5, cap=0.1]
  • Acceptance α = 0.4 + 0.3 + 0.1 = 0.8.
  • Lα = āˆ’log(0.8) ā‰ˆ 0.223; gradients push q to move 0.5→0.3 for ā€œcarā€ and 0.4→0.6 for ā€œcat,ā€ growing the overlap.
  • In the hybrid, early on Ī» is large, so KL keeps things smooth; later, Ī» shrinks, and TV fine-tunes overlap.
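The mini-example can be run end to end. This sketch assumes forward KL(p||q) and the exp(āˆ’Ī·Ā·Ī±) schedule with Ī·=3 described above; the stop-gradient is implicit here because α is a plain float rather than a tensor:

```python
import math

p = {"cat": 0.6, "car": 0.3, "cap": 0.1}   # target distribution
q = {"cat": 0.4, "car": 0.5, "cap": 0.1}   # draft distribution

alpha = sum(min(p[t], q[t]) for t in p)    # acceptance: 0.4 + 0.3 + 0.1 = 0.8
tv = 1 - alpha                             # total variation distance
kl = sum(p[t] * math.log(p[t] / q[t]) for t in p)  # forward KL(p || q)

L_alpha = -math.log(alpha)                 # negative log-acceptance, ~0.223

eta = 3.0                                  # schedule sharpness (typical value)
lam = math.exp(-eta * alpha)               # adaptive blend weight lambda
L_hybrid = lam * kl + (1 - lam) * tv       # mostly TV here, since alpha is high
```

With α already at 0.8, Ī» ā‰ˆ 0.09, so the hybrid loss leans almost entirely on TV, matching the schedule's intent: smooth KL guidance early, direct overlap pressure once acceptance is decent.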

Secret Sauce:

  • Directness: Optimize acceptance, the true lever of speed.
  • Stability: Early KL guidance or 1/α scaling prevents TV’s tiny, jumpy gradients from stalling learning.
  • Curriculum: Adaptive Ī» dials in the right kind of learning at each stage and position.
  • Practicality: No extra compute, vocab truncation handled naturally, architecture-agnostic.

04Experiments & Results

šŸž Hook: If you want to know who runs faster, you time the race on the same track.

🄬 The Test: They measured average acceptance length τ—the expected number of tokens produced in each speculative round (including a guaranteed bonus token). Bigger Ļ„ means longer accepted runs and faster generation.

  • How: Evaluate on full, common benchmarks across domains: MT-Bench (chat), HumanEval (code), and GSM8K (math).
  • Why: These reflect varied real-world usage and reduce noise from tiny samples.

šŸž Anchor: It’s like comparing how many clean passes a soccer team can string together before losing the ball.

The Competition:

  • Baselines: Forward KL (standard), Total Variation (TV) alone, and public HuggingFace speculator checkpoints.
  • Proposed: LK Hybrid (adaptive Ī» with Ī· typically 3, 10 for MEDUSA) and Negative Log-Acceptance (Lα).
  • Settings: Two temperatures—T=0 (greedy) and T=1 (stochastic). Training targeted stochastic acceptance (T=1), their primary evaluation setting.

Scoreboard with Context:

  • Across LLaMA-3.1-8B with three different drafters (EAGLE-3, MEDUSA, MLP), both LK variants beat KL on most datasets; Hybrid LK with adaptive Ī» is best overall.
  • Under T=1, relative Ļ„ gains commonly reach about 8–10% (like turning a B- into a solid A- in acceptance runs). Example highlights:
    • For EAGLE-3 on Qwen3-235B (MoE, very large), Hybrid LK improved mean Ļ„ by about +8.2% at T=1 compared to KL.
    • On GPT-OSS-120B (MoE), Hybrid LK gained +7.7% at T=1.
    • For DeepSeek-V3 (685B) MTP, fine-tuning with Hybrid LK added +5.6% over KL at T=1, even though the module was already strong.
    • Smaller targets (like 20B dense) saw smaller but consistent gains (~+3–4%).
  • Pure TV performed clearly worse from scratch (like practicing only the trick shot and missing basics). Constant-Ī» hybrids (e.g., Ī»=0.5) helped a bit, but not as much as adaptive Ī».
  • Negative Log-Acceptance (Lα) also beat KL, especially at T=1; the hybrid was generally strongest, but Lα closed the gap when Ī» scheduling wasn’t tuned.

Surprising Findings:

  • TV alone looks theoretically perfect but trains poorly from random starts—its gradients are too weak and choppy. LK’s fixes (adaptive Ī» or 1/α scaling) are crucial in practice.
  • The gains are largest when the draft is much smaller than the target or when their architectures mismatch (dense draft vs. MoE target). That’s exactly when KL’s "match everything" instinct misallocates the draft’s tiny capacity.
  • Adaptive Ī» per head helps uneven heads: early positions (already good) get more TV fine-tuning; later positions (weaker) get more KL stability.

Why the Numbers Matter:

  • A +8–10% jump in Ļ„ is like adding an extra accepted token every 10–12 tokens. In real systems, that translates to noticeable speedups, better throughput per GPU, and lower costs—all without touching the target model.

šŸž Anchor: If your relay team consistently adds one more clean handoff before a drop, your total race time improves a lot—even if each runner didn’t change shoes.

05Discussion & Limitations

Limitations:

  • šŸž Hook: Even the best tool has jobs it’s not built for. 🄬 What it can’t do: When the draft is very large (high-capacity) and can already match the target closely, KL is already strong, so LK’s gains shrink. Also, pure TV remains hard from scratch; LK works by managing or scaling TV, not by making TV alone easy. šŸž Anchor: A power saw shines on big cuts, not on tiny polishing.

Required Resources:

  • Target model access (to compute p) during training and a dataset of prompt-response pairs generated by that target.
  • Usual training stack (e.g., AdamW), no extra compute vs. KL.
  • An inference server that implements correct rejection sampling at T>0 (they patched vLLM accordingly).

When NOT to Use:

  • If you only ever decode greedily (T=0) and already have high acceptance with a strong draft, LK’s advantage may be minimal.
  • If K=1 (no multi-token speculation), acceptance optimization matters less.
  • If your pipeline cannot supply target distributions during training, you can’t compute acceptance overlaps.

Open Questions:

  • Directly optimizing system efficiency (accepted tokens per drafted tokens) rather than just per-position acceptance.
  • Learning per-head aggregation weights instead of fixed geometric decay.
  • Baking in real deployment knobs (top-k, top-p) into the loss so training matches inference exactly.
  • Extending LK ideas to tree-based drafting and other speculative search strategies.

Risks and Practical Notes:

  • Hyperparameter Ī· for Ī» scheduling affects how quickly the blend shifts; different architectures (like MEDUSA) may prefer different Ī·. Fortunately, the method is robust across a reasonable range.
  • With very small datasets or severe domain drift, any distillation-style training can underperform; LK is no exception—ensure your training data matches your deployment.
  • The improvements focus on speed efficiency; they preserve target quality because verification remains unchanged.

06Conclusion & Future Work

Three-Sentence Summary:

  • Speculative decoding’s speed comes from how many drafted tokens the target accepts, so the best way to train a draft is to maximize acceptance—not just to mimic the target’s full distribution.
  • LK losses do exactly this: a hybrid KL+TV objective with an adaptive schedule, and a negative log-acceptance loss that scales TV’s gradients, both training stably from scratch and directly boosting acceptance.
  • Across many targets (8B–685B) and drafters (EAGLE-3, MEDUSA, MLP, MTP), LK consistently improves acceptance length—often by 8–10% at T=1—without extra compute or architectural changes.

Main Achievement:

  • Turning the theoretical link (acceptance = 1 āˆ’ TV) into practical, stable training that outperforms KL in real systems, especially when the draft is much smaller than the target or architectures differ.

Future Directions:

  • Optimize a fuller system-efficiency objective, learn per-head weights, and incorporate inference knobs (top-k/top-p) during training.
  • Explore LK within tree-based speculative search and across multilingual or domain-specialized settings.

Why Remember This:

  • LK losses align training with what truly matters at inference time: being accepted. They make small helpers smarter about where to spend their limited capacity, unlocking faster, cheaper, and greener LLM serving—without touching the main model.

Practical Applications

  • Speed up chat assistants by training their draft heads with LK losses to raise acceptance and reduce latency.
  • Lower cloud serving costs by improving throughput (more users per GPU) via higher average acceptance length.
  • Improve coding assistants (IDE plugins) so multi-token completions are accepted more often, reducing flicker and re-computation.
  • Enhance on-device AI experiences (phones, laptops) where smaller draft heads paired with big remote targets benefit from better acceptance.
  • Upgrade existing speculators (EAGLE-3, MEDUSA, MLP, MTP) by fine-tuning with LK for immediate acceptance gains.
  • Deploy vocabulary-truncated draft heads without special masking tricks, using LK’s natural handling of missing tokens.
  • Stabilize training from scratch for lightweight drafters by starting with the hybrid LK objective and adaptive Ī» scheduling.
  • Boost performance for large MoE targets (e.g., Qwen3, DeepSeek) where small dense drafters struggle under KL alone.
  • Tune acceptance under real sampling (temperature/top-k/top-p) by aligning training objectives with inference behavior.
  • Accelerate evaluation pipelines by using chain sampling with LK-trained drafters, then extending to tree-based methods.
Tags: speculative decoding, acceptance rate, LK losses, total variation distance, KL divergence, negative log-acceptance, hybrid objective, adaptive blending, EAGLE-3, MEDUSA, MTP, vocabulary truncation, large language model inference, Mixture-of-Experts, throughput and latency