Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction

Intermediate
Jang-Hyun Kim, Dongyoon Han, Sangdoo Yun Ā· 1/25/2026
arXiv Ā· PDF

Key Summary

  • Fast KVzip is a new way to shrink an LLM’s memory (the KV cache) while keeping answers just as accurate.
  • It adds tiny gate modules that score which saved pieces (KV pairs) really matter and which can be safely tossed.
  • The gates look only at the model’s hidden states and use a special ā€œsink-attentionā€ design to make smart keep-or-evict choices.
  • Training the gates is cheap and fast because the big LLM stays frozen; only the small gates learn, guided by task-agnostic reconstruction targets.
  • During prefill and decoding, Fast KVzip keeps a small recent window of tokens and evicts low-score KV entries, cutting memory by up to 70%.
  • It avoids the heavy runtime reconstruction work that slowed earlier methods like KVzip, so prefill is much faster and peak memory is lower.
  • Across Qwen2.5, Qwen3, and Gemma3 models, it holds near-lossless accuracy at just 30–40% KV budget on long-context, code, and math tasks.
  • Decoding overhead is minimized with a 128-token buffer that batches gate decisions, reducing added latency to about 1%.
  • The same gate works across many tasks (generalizes well), thanks to training with reconstruction-based targets instead of narrow, task-specific signals.

Why This Research Matters

Fast KVzip lets language models read very long documents and still respond quickly because it trims memory smartly without hurting answers. That means cheaper servers, smaller GPUs, and even on-device assistants can handle bigger jobs like long contracts, codebases, or research papers. Teams can deploy high-quality chatbots, copilots, and learning assistants with far fewer resources. The method is simple to add, quick to train, and works across many models, so upgrading existing systems is practical. It also lowers energy use and costs by avoiding heavy extra computation during prefill. In math and coding tasks, it preserves the model’s ā€œthinking timeā€ while staying within tight memory budgets.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): Imagine your backpack during a long school day. You want to carry everything you might need, but if you stuff it with every book and toy, it becomes heavy and slow you down.

🄬 Filling (The Actual Concept):

  • What it is: In large language models (LLMs), the KV cache is like that backpack—it stores key and value features from all past tokens so the model can look back quickly using attention.
  • How it works: 1) As the model reads tokens, it saves KV pairs for each attention head. 2) Later, when predicting, it searches these saved pieces to decide what to focus on. 3) The more tokens you have, the bigger the KV cache grows. 4) This speeds up attention, but the cache grows linearly with context length and quickly dominates memory.
  • Why it matters: Without smart management, the KV cache makes long-context inference expensive: more GPU memory, higher latency, and sometimes not even fitting into device RAM.

šŸž Bottom Bread (Anchor): Think of reading a 200-page story. If you bookmark every single sentence forever, your bookmark pile gets huge and you spend more time flipping than reading. You need a way to keep only the important bookmarks.

šŸž Top Bread (Hook): You know how, when you study, you highlight the key sentences and skim the rest? That’s attention.

🄬 Filling (The Actual Concept):

  • What it is: The attention mechanism lets an LLM focus more on relevant past tokens and less on unimportant ones.
  • How it works: 1) For a new token, the model forms a query. 2) It compares that query with stored keys to get importance scores. 3) It uses the top-scoring values to guide the next prediction. 4) This repeats for each layer and head.
  • Why it matters: Without attention, the model would treat every past word equally, wasting compute and missing the key clues.

šŸž Bottom Bread (Anchor): When asked ā€œWhat’s the capital of France?ā€, attention points hard at ā€œcapitalā€ and ā€œFrance,ā€ not filler words like ā€œthe,ā€ leading to ā€œParis.ā€

šŸž Top Bread (Hook): Picture splitting your homework into chunks so your desk never gets too messy.

🄬 Filling (The Actual Concept):

  • What it is: Chunked prefill processes very long inputs in manageable pieces so memory peaks stay low.
  • How it works: 1) Break the long context into chunks (e.g., 16K tokens). 2) For each chunk, compute attention and update the KV cache. 3) Optionally evict or compress as you go to avoid peak spikes. 4) Move on to the next chunk.
  • Why it matters: Without chunking, the KV cache can balloon during prefill, causing memory overflows or slowdowns.

šŸž Bottom Bread (Anchor): It’s like slicing a giant pizza so everyone can eat without juggling a floppy whole pie.

The World Before: LLMs got great at long-context tasks, but KV memory grew with every extra token. Serving 100K–300K token contexts strained GPUs: high memory, higher latency, and sometimes impossible deployments. Basic eviction methods tossed entries quickly (fast but often inaccurate), while precise ones like KVzip rebuilt context at runtime (accurate but doubled prefill cost).

The Problem: Can we keep accuracy close to full-cache while evicting most KV entries—and do it with almost no extra runtime cost?

Failed Attempts: 1) Heuristic sparsity during inference (e.g., SnapKV) works for the current query but may break on future queries. 2) Predefined attention shapes (like sliding windows) reduce compute but need retraining or fine-tuning. 3) KVzip stays accurate by reconstructing attention but slows prefill a lot.

The Gap: A method that predicts which KV entries will be useful later, without expensive reconstruction at inference, and that generalizes across tasks.

Real Stakes: Faster, cheaper LLMs that can read longer contexts, run on smaller GPUs or even edge devices, and help with things we care about—coding help, studying big documents, or solving long math problems—without waiting forever.

02Core Idea

šŸž Top Bread (Hook): You know how a good coach can tell early in a game which plays will matter later? They watch the team’s posture and movement, not the whole replay every time.

🄬 Filling (The Actual Concept):

  • What it is: Fast KVzip adds tiny gate modules that look at hidden states to score how important each KV head’s cache will be, then evict low-scoring KV entries on the fly.
  • How it works: 1) At each layer, a lightweight ā€œsink-attentionā€ gate reads the current hidden state. 2) It outputs per-head importance scores in [0,1]. 3) During prefill/decoding, we keep recent tokens (a local window) and the highest-scoring older KV entries under a memory budget. 4) Gates are trained once, using reconstruction-based targets, with the large LLM frozen. 5) At inference, we only run the cheap gates—no reconstruction—so overhead is negligible.
  • Why it matters: This keeps accuracy near full-cache while evicting up to ~70% of KV entries, avoids double prefill cost, and works across tasks and models.

šŸž Bottom Bread (Anchor): It’s like cleaning your notes as you study: you keep the freshest lines and the ones your neon highlighter says are crucial, recycling the rest so your notebook stays lean.

Multiple Analogies:

  1. Librarian analogy: The gate is a smart librarian who tags which reference cards will be checked out again. Cards with low future use get archived, freeing shelf space.
  2. Backpack analogy: The gate is your daily packer—keep today’s worksheets and the study guide, leave last month’s flyers at home.
  3. Spotlight analogy: The gate is a lighting tech for a play—keep bright lights on key scenes and dim the rest, saving power without losing the story.

Before vs After:

  • Before: You either paid with accuracy (fast toss) or with time (reconstruct). Long contexts meant pain: big memory, slow prefill.
  • After: You predict future usefulness directly from hidden states. You keep a short recent window and top-scoring KVs, matching KVzip-like accuracy with a fraction of the runtime cost.

Why It Works (intuition): Hidden states compactly summarize what the model already knows at each layer; they often reveal which heads and past tokens will matter later. A sink-attention gate compares the hidden state to a small set of learned ā€œsinkā€ keys, acting like reusable prototypes of important patterns. Training with reconstruction-derived targets avoids overfitting to any one task, so the gate generalizes.

šŸž Top Bread (Hook): Think of a tidy toolbox where every tool has a spot.

🄬 Filling (The Actual Concept – Building Blocks):

  • What it is: Fast KVzip is built from four parts: per-layer gates (sink-attention), a keep-or-evict rule under a budget, a recency window, and a lightweight training recipe with frozen LLM.
  • How it works: 1) Per-layer gate reads hidden states, outputs head-wise importance. 2) Eviction keeps recent tokens and top-scoring older KVs. 3) A small decoding buffer (128 tokens) batches gate calls, slashing latency overhead. 4) Train gates using reconstruction-based targets; only gate params update.
  • Why it matters: Each piece reduces memory or prevents accuracy loss; remove any and you break either speed, stability, or quality.

šŸž Bottom Bread (Anchor): Imagine sorting a photo album. You always keep the last few pages (recent memories) and the photos with gold stars (high scores), and you file away the rest so your album closes flat.

03Methodology

At a high level: Text tokens → Transformer forward pass + per-layer gate scoring → Keep recent + top-scored KV entries under budget → Model outputs with a compact KV cache.

Step-by-step (prefill):

  1. Chunk the input. Use chunked prefill (e.g., 16K tokens per chunk) to keep peak memory low.
  • Why this step exists: Very long inputs can blow up KV memory if processed at once.
  • Example: A 160K-token book is split into ten 16K chunks.
  2. Compute attention and KV features for the chunk.
  • Why: You still need the regular model outputs; we’re not changing the LLM’s computations.
  • Example: For chunk #3, each layer produces new key/value tensors per head.
  3. Score importance with the gate at each layer.
  • What happens: A tiny gate reads the current hidden state h and produces head-wise scores s in [0,1]. It uses low-rank projections and compares h to a learned set of S sink keys per layer (sink-attention). A small bias per group can adjust head-level strength. Hyperparameters are small and shared across models (e.g., S=16, projection D'=16).
  • Why: Scores estimate how useful each head’s cached KVs will be later, so we can safely evict.
  • Example: For layer 12, head 1 gets 0.92 (keep more), head 3 gets 0.12 (good eviction candidate).
  4. Evict under a budget while keeping a local window.
  • What happens: Maintain a sliding recent window (e.g., 4K tokens in prefill) unconditionally. For older tokens, keep the highest-scoring KV entries until you hit your budget (e.g., 30–40% of full cache). Apply this per layer/head, compatible with uniform or non-uniform caches.
  • Why: Recency often matters in attention; the window preserves local dependencies. Scores protect globally useful long-range tokens.
  • Example: With a 30% budget, you keep the last 4K tokens plus the top-scored 8K older tokens, evicting the rest.
  5. Move to the next chunk.
  • Why: By evicting as you go, peak KV memory stays low throughout prefill.
  • Example: Peak memory drops compared to naive chunking, and prefill time is lower than reconstruction-based methods.
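
To see why evicting as you go keeps the peak low, here is a toy accounting of cached positions during chunked prefill. It only counts entries (no real tensors), and applying the budget to the tokens seen so far is a simplifying assumption.

```python
def peak_kv_entries(num_tokens, chunk=16_000, window=4_000, budget=0.30, evict_each_chunk=True):
    """Count cached KV positions during chunked prefill and report the peak."""
    cached = peak = seen = 0
    while seen < num_tokens:
        step = min(chunk, num_tokens - seen)
        seen += step
        cached += step                              # new KV pairs appended for this chunk
        peak = max(peak, cached)
        if evict_each_chunk:                        # keep the window plus a budgeted share of tokens seen so far
            cached = min(cached, window + int(budget * seen))
    return peak

print(peak_kv_entries(160_000, evict_each_chunk=False))  # naive chunking: peak = 160,000 entries
print(peak_kv_entries(160_000, evict_each_chunk=True))   # evict as you go: peak = 63,200 entries
```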

Step-by-step (decoding):

  1. Keep a 128-token buffer of recent hidden states.
  • Why: Calling the gate every token adds overhead. Buffering lets you compute gate scores in batches, cutting latency overhead to ~1%.
  • Example: Every 128 tokens, update scores and eviction decisions together.
  2. Maintain a local decoding window (e.g., 128 tokens) and evict to stay within a fixed KV budget covering both prompt and generated tokens.
  • Why: Reasoning often needs nearby context; the window guarantees it, and the scores reserve space for the most useful older bits.
  • Example: On math problems, accuracy stays near full-cache at ~4K budget, unlike early-stopping strategies.
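
A minimal sketch of the decoding-side buffering, assuming a hypothetical `gate` callable that maps a batch of hidden states to scores; in the real system each batched call would also trigger the eviction step.

```python
import torch

BUFFER = 128                                         # gate decisions are batched every 128 tokens

def buffered_gate_scores(hidden_states, gate):
    """hidden_states: (T, D), one hidden state per generated token."""
    scores, buf = [], []
    for h in hidden_states:
        buf.append(h)
        if len(buf) == BUFFER:                       # score (and evict) for the whole buffer at once
            scores.append(gate(torch.stack(buf)))
            buf = []
    if buf:                                          # flush whatever is left at the end
        scores.append(gate(torch.stack(buf)))
    return torch.cat(scores)

gate = lambda hs: torch.sigmoid(hs.mean(dim=-1))     # stand-in gate producing scores in (0, 1)
out = buffered_gate_scores(torch.randn(300, 64), gate)
print(out.shape)                                     # torch.Size([300]) from 3 gate calls instead of 300
```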

The Gate Architecture (sink-attention):

  • Inputs: hidden state h ∈ R^D at each layer.
  • Projections: low-rank projections produce k ∈ R^{HƗD'} and q ∈ R^{GƗHƗD'} (H heads, G grouped-query size); weighted normalization stabilizes them.
  • Scoring: For each group j, compute attention-like weights between q_j and the per-layer learned sink keys, and between q_j and the actual head representations; combine with a nonnegative bias b_j to get per-head scores s ∈ [0,1]. Average across groups.
  • Why it’s clever: The learned sink keys act like reusable prototypes of ā€œimportantā€ attention patterns, letting a tiny module approximate future head utility without touching the main model.
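
A schematic PyTorch sketch of a sink-attention gate in the spirit of this description. The low-rank projections, learned sink keys, and nonnegative per-group bias follow the text; the exact normalization and the way the key, sink, and bias terms combine into a [0, 1] score are simplifying assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SinkAttentionGate(nn.Module):
    """Per-layer gate: hidden states in, per-head importance scores in (0, 1) out.
    H heads, G query groups, S sink keys, low-rank dimension D' (all illustrative)."""
    def __init__(self, d_model, n_heads, n_groups, n_sinks=16, d_low=16):
        super().__init__()
        self.H, self.G, self.Dp = n_heads, n_groups, d_low
        self.k_proj = nn.Linear(d_model, n_heads * d_low, bias=False)
        self.q_proj = nn.Linear(d_model, n_groups * n_heads * d_low, bias=False)
        self.sinks = nn.Parameter(torch.randn(n_sinks, d_low))     # learned "prototype" keys
        self.bias = nn.Parameter(torch.zeros(n_groups))            # per-group bias, kept nonnegative below

    def forward(self, h):                                          # h: (T, d_model)
        T = h.shape[0]
        k = F.normalize(self.k_proj(h).view(T, self.H, self.Dp), dim=-1)
        q = F.normalize(self.q_proj(h).view(T, self.G, self.H, self.Dp), dim=-1)
        key_logit = (q * k.unsqueeze(1)).sum(-1)                   # (T, G, H): q_j vs. this head's key
        sink_logit = torch.einsum("tghd,sd->tghs", q, self.sinks)  # (T, G, H, S): q_j vs. sink keys
        denom = key_logit.exp() + sink_logit.exp().sum(-1) + F.softplus(self.bias)[None, :, None]
        score = key_logit.exp() / denom                            # sinks + bias act as a learned denominator
        return score.mean(dim=1)                                   # average over groups -> (T, H)

gate = SinkAttentionGate(d_model=1024, n_heads=8, n_groups=4)
print(gate(torch.randn(5, 1024)).shape)                            # torch.Size([5, 8])
```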

Training the Gates (no LLM backprop):

  • Targets: Use reconstruction-based maximum attention scores from KVzip’s pre-deployment process; this is task-agnostic and generalizes well.
  • Optimization: Freeze the LLM. Precompute hidden states + target scores. Train each layer’s gate independently using SGD with a binary cross-entropy loss; this parallelizes across layers and samples.
  • Data: Sample ~1M tokens from FineWeb-Edu, mixing sequences of 10K–30K tokens with stitched 100K-token sequences; no overlap with evaluations.
  • Efficiency: Under 1 H100 hour even for 14B models; gate storage is small (e.g., ~0.18–0.30 GB for 8B–14B models).
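
A minimal sketch of this training recipe: the LLM stays frozen, hidden states and reconstruction-derived targets are precomputed offline, and only the gate learns via SGD with a binary cross-entropy loss. The random tensors and the tiny stand-in gate (a linear layer plus sigmoid in place of the sink-attention module sketched above) are placeholders for illustration.

```python
import torch
import torch.nn.functional as F

# Precomputed offline for one layer (LLM frozen, no backprop through it):
hidden = torch.randn(4096, 1024)     # hidden states, shape (N, d_model)
targets = torch.rand(4096, 8)        # reconstruction-based importance targets in [0, 1], shape (N, H)

# Stand-in gate; the real module would be the sink-attention gate sketched earlier.
gate = torch.nn.Sequential(torch.nn.Linear(1024, 8), torch.nn.Sigmoid())
opt = torch.optim.SGD(gate.parameters(), lr=1e-2)

for step in range(100):
    idx = torch.randint(0, hidden.shape[0], (256,))              # random mini-batch
    scores = gate(hidden[idx])                                   # per-head scores in (0, 1)
    loss = F.binary_cross_entropy(scores, targets[idx])          # match the soft targets
    opt.zero_grad()
    loss.backward()
    opt.step()
# Each layer's gate trains independently like this, so layers (and samples) parallelize easily.
```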

Design Choices that Matter:

  • Gate inputs: Hidden states beat key states and pre-RoPE key states; adding position encodings hurt, so recency is handled via the local window instead.
  • Architecture: Sink-attention outperforms linear/MLP and variants without learnable denominators, especially on retrieval tasks.
  • Hyperparameters: Performance is robust to projection size; sufficient sink keys (e.g., S≄16) help most.
  • Inference knobs: Local windows of 1K–8K in prefill give similarly strong results; a decoding window of 128 works well.

Secret Sauce:

  • Predicting future utility directly from hidden states with a tiny, attention-like gate.
  • Using reconstruction-derived targets that generalize across tasks, avoiding domain overfitting.
  • Always preserving a small recent window so local dependencies remain intact.
  • Batching gate decisions during decoding to shrink overhead to ~1%.

Concrete Toy Example:

  • Suppose tokens 1..20,000 arrive in prefill with a 30% budget and a 4,000-token window. After chunk 1 (tokens 1..16K), you keep tokens 12,001..16,000 (the window) and the top-scored tokens from 1..12,000 until you hit ~30% cache usage; the rest are evicted. After chunk 2 (16,001..20,000), you again keep 16,001..20,000 plus the best of the earlier tokens under budget. The model’s answers stay as accurate as with the full cache, but memory and time are much lower.

šŸž Top Bread (Hook): You know how you learn better when a coach gives you general skills instead of training for just one test?

🄬 Filling (The Actual Concept – Task-Agnostic Reconstruction Objective):

  • What it is: A training goal that teaches gates to recover broadly useful attention patterns, not just for one task.
  • How it works: 1) Reconstruct the context (offline) to compute which KV entries got the highest attention. 2) Use these as targets for the gate. 3) Train on general text (not narrow QA/math) so the signal isn’t biased.
  • Why it matters: Without task-agnostic targets, gates can overfit and fail on other tasks.

šŸž Bottom Bread (Anchor): Like practicing core dribbling and passing instead of only rehearsing one play—you’ll perform well in any game.

04Experiments & Results

The Test: Researchers measured whether Fast KVzip could keep accuracy near full-cache while slashing KV memory, and whether it could do so both in prefill-heavy long-context tasks and decoding-heavy reasoning. They also measured speed (prefill time, decoding latency) and peak memory.

The Competition: Baselines included KVzip (accurate but slow due to reconstruction), SnapKV (fast heuristics), Expected Attention, DuoAttention (structured heads), R-KV (decoding compression), and TrimKV (gated retention trained on specific tasks). Models spanned Qwen2.5-7B/14B-1M, Qwen3-8B/14B and FP8, and Gemma3-12B with hybrid attention.

Scoreboard with Context:

  • Long Context (prefill-intensive): On SCBench (12 datasets including retrieval, QA, and code), Fast KVzip matched or beat all baselines and achieved near-lossless performance at 30–40% KV budget. On RULER-4K and KVPress benchmarks, it reached KVzip-level accuracy without KVzip’s 2Ɨ prefill overhead. Peak memory and prefill time dropped significantly (think: going from a stuffed locker to a neatly organized shelf).
  • Decoding (reasoning-intensive): On AIME24 and MATH with Qwen3-8B/14B, Fast KVzip stayed near full-cache accuracy with ~4K KV budget, outperforming R-KV and far surpassing early-stopping-of-thinking strategies that prematurely cut reasoning and tank accuracy. It shows that preserving the right KV entries lets the model ā€œthink long enoughā€ without huge memory.
  • Across Models: Averaged over many tasks, Fast KVzip consistently maintained high relative performance at low budgets on Qwen2.5, Qwen3 (including FP8 quantization), and Gemma3 (with sliding-window hybrids). This indicates strong generality.

Surprising Findings:

  • Sometimes accuracy improves slightly over full-cache, likely a denoising effect: evicting low-value KVs can make attention cleaner.
  • Gates trained on hidden states generalized better than those trained on key or pre-RoPE key states.
  • Reconstruction-based targets generalized across tasks much better than next-token prediction or instruction QA targets; narrow targets overfit and underperform on other tasks.

Efficiency Details:

  • Training: Under ~1 H100 GPU hour for 14B-scale models; gate files are small (ā‰ˆ0.1–0.3 GB depending on model). Only the gates are trained; LLM weights are frozen.
  • Inference overhead: Decoding gate calls are batched with a 128-token buffer, bringing added latency down to about 1% on average. Prefill stays fast because no runtime reconstruction is needed.

Plain-English Take: Fast KVzip keeps what matters, tosses what doesn’t, and does it so quickly you barely notice—yet the answers stay just as good as before.

05Discussion & Limitations

Limitations:

  • Frozen LLM weights: The method does not adapt the base model; if a model’s internal patterns are unusual, gate training alone may be suboptimal.
  • Pre-deployment prep: You need an offline pass to compute reconstruction-based targets (though it’s done with parallel forward passes, not expensive backprop through the LLM).
  • Not a cure-all: If your task heavily depends on rare, long-range references not captured by hidden-state cues, aggressive eviction may still hurt.
  • Architecture coverage: While tested on several strong families (Qwen2.5/3, Gemma3), there remain untested architectures and extreme settings (e.g., ultra-tiny models or very exotic attention variants).

Required Resources:

  • One modern GPU (e.g., H100 80GB) can train the gates in under about an hour for 14B models; smaller models are faster.
  • A modest amount of general-text data (~1M tokens sufficed here) for gate training, plus the ability to run reconstruction target extraction.

When NOT to Use:

  • If you already operate at extremely short contexts where KV memory is trivial, the gains may be negligible.
  • In one-off, narrow domains with radically different attention needs (e.g., specialized symbolic tasks), you might want domain-specific gate retraining.
  • If your deployment is bounded by compute per token but not memory, other optimizations (e.g., kernel-level speedups) may bring more benefit than KV compression.

Open Questions:

  • Joint training: What if we co-train gates with the base model during pretraining—could we unlock even more structured, hardware-friendly sparsity?
  • Multi-action gating: Beyond evict-or-keep, could gates choose among skip-compute, store-compressed, or defer decisions for selective retrieval?
  • Adaptive windows: Can we learn the local window size per layer/head dynamically instead of fixing it?
  • Robustness across modalities: How do gates behave with multimodal inputs (audio, vision) and agentic tool use where context structure shifts?
  • Theory: Can we better characterize when hidden states are sufficient statistics for future KV utility across layers and heads?

06Conclusion & Future Work

Three-Sentence Summary: Fast KVzip compresses an LLM’s KV cache using tiny sink-attention gates that predict which entries will matter later, based only on hidden states. Trained once with task-agnostic, reconstruction-derived targets and a frozen LLM, the gates keep accuracy near full-cache while evicting up to ~70% of KV entries. Unlike reconstruction-at-inference approaches, Fast KVzip keeps prefill fast, slashes peak memory, and adds only about 1% latency in decoding.

Main Achievement: Showing that future KV usefulness can be decoded from hidden states with a lightweight, learned gating mechanism—achieving KVzip-level accuracy without KVzip-level runtime cost.

Future Directions: Co-train gates during pretraining for even more structured sparsity; extend gating to multi-choice actions (skip/evict/store/selectively-recall); and learn dynamic local windows per layer/head. Explore multimodal and agentic settings to test generality further.

Why Remember This: It turns a hard trade-off (accuracy vs. speed/memory) into a win-win by moving smart decisions to a tiny, learned module—making long-context LLMs cheaper, faster, and more broadly deployable without losing their smarts.

Practical Applications

  • Serve long-context chatbots that summarize 100K+ token documents on a single GPU using a 30–40% KV budget.
  • Deploy code assistants that navigate large repositories without losing accuracy by evicting low-importance KV entries.
  • Run math reasoning models that keep near-full accuracy at ~4K KV budget, avoiding expensive early stopping of thinking.
  • Retrofit existing Qwen or Gemma deployments with gates trained in under an hour to slash peak prefill memory and time.
  • Enable on-device or edge inference by cutting KV memory up to 70%, making long-context use feasible on smaller hardware.
  • Combine with sliding-window or hybrid attention: compress only global KVs to keep Gemma-like models efficient.
  • Use FP8 or other quantization plus Fast KVzip to stack memory savings from both weights and KV cache.
  • Speed up retrieval-augmented generation (RAG) by keeping the freshest and most relevant chunks while evicting clutter.
  • Handle multi-round long conversations (MRCR-like) by batching decoding gate decisions with a 128-token buffer to keep latency low.
  • Standardize infrastructure: one task-agnostic gate per model that generalizes across QA, summarization, code, and reasoning.
#KV cache compression Ā· #gated KV eviction Ā· #sink attention Ā· #long-context inference Ā· #chunked prefill Ā· #LLM efficiency Ā· #reconstruction targets Ā· #Qwen3 Ā· #Gemma3 Ā· #FlashAttention-2 Ā· #memory optimization Ā· #decoding buffer Ā· #grouped-query attention Ā· #quantized inference Ā· #latency reduction