Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

Intermediate
Haocheng Xi, Shuo Yang, Yilong Zhao et al. · 2/3/2026
arXiv · PDF

Key Summary

  • Auto-regressive video models make videos one chunk at a time but run out of GPU memory because the KV-cache grows with history.
  • Quant VideoGen (QVG) shrinks that KV-cache by up to about 7× using only 2–4 bits while keeping video quality almost the same.
  • The key trick is to group similar tokens (Semantic-Aware Smoothing) so their numbers become small and easy to compress.
  • Then QVG compresses what’s left in several gentle passes (Progressive Residual Quantization) to lower error even more.
  • Across LongCat-Video, HY-WorldPlay, and Self-Forcing, QVG beats prior KV quantizers like RTN, KIVI, and QuaRot at the same memory.
  • QVG adds less than 4% extra time to generation, so it’s practical for real-time or streaming video.
  • Thanks to QVG, HY-WorldPlay-8B can run on a single RTX 4090 for the first time with strong PSNR (over 29).
  • Longer effective context is possible under the same hardware, improving long-horizon identity, layout, and motion consistency.
  • QVG is training-free, so you don’t need to retrain the model to get the memory savings.
  • This method moves the quality–memory Pareto frontier forward, enabling longer, steadier, and more accessible video generation.

Why This Research Matters

QVG lets long, consistent video generation run on everyday GPUs, not just giant servers, which makes creative tools and research more accessible. It enables live, interactive video worlds and streaming experiences because it keeps latency low while squeezing memory hard. By storing less data, it reduces hardware needs and energy use, which is good for the environment and budgets. Longer effective memory means steadier characters, layouts, and motions over minutes, which viewers notice and appreciate. Teams building world models, simulators, or educational content can now generate richer, longer experiences without retraining their models. Finally, this approach is training-free, so you can adopt it quickly across many existing video diffusion systems.

Detailed Explanation

01 Background & Problem Definition

🍞 Top Bread (Hook) You know how when you film a long birthday party on your phone, the video takes a lot more space than a short clip? Long videos need more memory.

🥬 Filling (The Actual Concept)

  • What it is: Auto-regressive video generation is a way for AI to make videos step by step, using what it already made to decide what comes next.
  • How it works (recipe):
    1. The model creates a few frames (a chunk).
    2. It remembers important details from those frames in a fast memory called the KV-cache.
    3. It uses that memory to make the next chunk, and repeats.
  • Why it matters: Without keeping a good memory of the past, the video forgets who is who or where things are, causing drift and inconsistency over time.

🍞 Bottom Bread (Anchor) Imagine telling a story sentence by sentence. If you forget the hero’s name after a few sentences, the story stops making sense. The KV-cache helps the AI “remember” so the story (video) stays consistent.
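
A minimal Python sketch of this chunk-by-chunk loop. The `model.generate_chunk` and `model.extract_kv` calls are hypothetical stand-ins for whatever backbone you use, not an API from the paper:

```python
# Sketch of auto-regressive chunked video generation.
# `model.generate_chunk` and `model.extract_kv` are hypothetical stand-ins,
# not the API of any model discussed here.

def generate_video(model, prompt, num_chunks, frames_per_chunk):
    kv_cache = []   # grows with every chunk: this is the memory bottleneck
    video = []
    for _ in range(num_chunks):
        # The next chunk is conditioned on everything remembered so far.
        chunk = model.generate_chunk(prompt, kv_cache, frames_per_chunk)
        # Remember Keys/Values from this chunk so later chunks can attend back.
        kv_cache.append(model.extract_kv(chunk))
        video.extend(chunk)
    return video
```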

🍞 Top Bread (Hook) Think about a sticky-note wall full of clues for a mystery. The more clues you add, the more space you need to keep them.

🥬 Filling (The Actual Concept)

  • What it is: The KV-cache is where a transformer model stores Key and Value vectors from past tokens so it can attend back to them quickly.
  • How it works (recipe):
    1. For every layer and head, the model saves K and V for each token it generated.
    2. As the video grows longer or higher resolution, the number of tokens explodes.
    3. All those K and V tensors must stay in fast GPU memory for speed.
  • Why it matters: The KV-cache quickly becomes the biggest thing in memory (often tens of GB), blocking long videos on common GPUs and forcing short context windows that harm quality.

🍞 Bottom Bread (Anchor) LongCat-Video at 480p needs about 34 GB of KV for just 5 seconds—more than a single RTX 5090 can hold—so longer generations hit an instant wall.
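
To see why the cache balloons, here is a back-of-the-envelope size formula in Python. The layer count, head count, head dimension, and token count below are illustrative assumptions, not LongCat-Video's actual configuration:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Total bytes for K and V across all layers (BF16 stores 2 bytes per value)."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value

# Illustrative numbers only (assumed, not the paper's exact configuration):
# 40 layers, 32 KV heads of dim 128, ~50k tokens of video history.
gb = kv_cache_bytes(40, 32, 128, 50_000) / 1e9
print(f"~{gb:.0f} GB of KV-cache in BF16")  # roughly 33 GB before any compression
```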

🍞 Top Bread (Hook) Imagine flipbooks: most pages are almost the same as the one before, just a tiny motion change.

🥬 Filling (The Actual Concept)

  • What it is: Spatiotemporal redundancy means nearby spots in space or time look very similar, so there’s repeated information we can compress.
  • How it works (recipe):
    1. Adjacent frames often share the same background and shapes.
    2. Neighboring patches in a frame often have similar textures and colors.
    3. Their hidden tokens in the model also end up very similar.
  • Why it matters: If most tokens are similar, we can store them more efficiently by focusing on the small differences instead of the big shared parts.

🍞 Bottom Bread (Anchor) In a skateboarding scene, the mountain and road barely change from frame to frame; only the skater moves a bit. That sameness is compressible.

🍞 Top Bread (Hook) Have you ever tried to squeeze a suitcase shut because one big, lumpy sweater makes the whole bag bulge?

🥬 Filling (The Actual Concept)

  • What it is: Quantization turns big floating-point numbers into small integers (like 2-bit or 4-bit) to save memory.
  • How it works (recipe):
    1. Pick a scale so numbers fit in a tiny range.
    2. Round each value to its nearest small integer.
    3. Save the small integers and the scale.
  • Why it matters: If a few values are huge (outliers), the scale must be big, and rounding becomes rough, making quality worse.

🍞 Bottom Bread (Anchor) It’s like shrinking all clothes to small sizes. If one coat is massive, you have to choose a big size for everything, and then most clothes fit poorly. That’s what outliers do to standard quantization.
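
A tiny numpy sketch of symmetric round-to-nearest quantization shows exactly this: one outlier sets the scale, and every small value rounds badly at 2 bits:

```python
import numpy as np

def quantize(x, bits):
    """Symmetric round-to-nearest: scale so the largest |value| fits the integer range."""
    qmax = 2 ** (bits - 1) - 1                   # 1 for 2-bit, 7 for 4-bit
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.2, -0.3, 0.1, 9.0])              # one huge outlier
q, scale = quantize(x, bits=2)
print(np.abs(dequantize(q, scale) - x))          # the small values all round to 0
```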

The world before: Many state-of-the-art video diffusion systems used bidirectional attention, which is great for short, polished clips but bad for streaming because you must process future and past together before committing frames. Auto-regressive video generation fixed the streaming problem by committing frames as you go, but created a new one: a KV-cache that grows with history, dominating GPU memory.

The problem: To fit memory, people used short context windows (like 20-ish frames). This shrinks the model’s working memory, and videos start to drift: faces morph, layouts misalign, and motion becomes inconsistent.

Failed attempts: Teams tried LLM-style KV quantizers (RTN, KIVI, QuaRot) directly on video models. But video KV activations are more irregular across tokens and channels, making those methods lose quality—especially at 2 bits. Others tried recomputation, CPU offloading, or lowering resolution, but these slow things down or hurt fidelity.

The gap: We need a training-free, video-aware KV quantization that tames outliers, exploits video redundancy, and preserves long-horizon consistency.

Real stakes: With a memory-smart approach, we can stream longer, steadier videos on everyday GPUs, enable live interactive worlds, run world models for longer horizons, and lower costs and energy by using less hardware.

02 Core Idea

🍞 Top Bread (Hook) Imagine packing a suitcase by first grouping similar clothes (all T-shirts together), smoothing them into neat stacks, and then compressing what’s left in a few gentle squeezes instead of one big crush.

🥬 Filling (The Actual Concept)

  • What it is: Quant VideoGen (QVG) is a training-free way to compress the KV-cache for auto-regressive video so it fits in much less memory without making the video look worse.
  • How it works (recipe):
    1. Semantic-Aware Smoothing (group similar tokens and subtract their shared average so leftovers are tiny).
    2. Progressive Residual Quantization (compress the leftovers in multiple stages, from coarse to fine).
    3. Store only low-bit residuals plus small metadata (centroids, assignments, scales), and reconstruct on the fly.
  • Why it matters: This reduces outliers, shrinks the dynamic range, and makes 2–4 bit quantization behave well for video, pushing the quality–memory frontier forward.

🍞 Bottom Bread (Anchor) With QVG, HY-WorldPlay-8B runs on a single RTX 4090 for the first time while keeping PSNR over 29. That’s like packing a week of clothes into a carry-on and still looking sharp all trip.

Aha! moment in one sentence: If you first line up similar tokens and remove what they share, the remaining numbers become small and steady enough that super-low-bit quantization works great.

Three analogies:

  1. Classroom notes: Group the same facts together, write a clean summary (centroid), and only keep small sticky notes for the extra details (residuals). Then file the sticky notes in a few passes to keep them tidy.
  2. Painting: Sketch big shapes first (centroids), then add finer strokes layer by layer (progressive residuals), storing each layer compactly.
  3. Lego building: Build a sturdy base (group averages), then snap on small detail bricks over a few stages; the pile stays compact but the model keeps all the detail.

Before vs After:

  • Before: Either keep a short context (drift!) or blow past memory limits. LLM-style quantizers struggled at 2-bit on video.
  • After: Up to ~7× KV savings with near-lossless quality and <4% latency overhead; longer effective context fits in the same GPU, improving long-horizon identity, layout, and motion.

Why it works (intuition): Quantization error grows with the largest value in a group. Video KV tensors have uneven, outlier-heavy ranges that wreck low-bit rounding. By clustering semantically similar tokens and subtracting the centroid, we remove the large shared parts, leaving small, uniform residuals that quantize cleanly. Doing this progressively catches remaining details in smaller, gentler steps, further lowering error.

Building blocks (mini “sandwiches”):

🍞 Top Bread (Hook) You know how teammates with similar skills can share the same playbook?

🥬 Filling (The Actual Concept)

  • What it is: Semantic-Aware Smoothing groups tokens by similarity (k-means), then subtracts each group’s centroid to get small residuals.
  • How it works (recipe):
    1. Take a chunk of tokens (spanning a few frames).
    2. Cluster them into C groups based on hidden features.
    3. Compute each group’s centroid (average vector).
    4. Subtract the centroid from the group’s tokens to form residuals with low magnitude and uniform distribution.
  • Why it matters: Small, uniform residuals quantize well at 2–4 bits; big outliers don’t set the scale anymore.

🍞 Bottom Bread (Anchor) It’s like averaging the homework answers of a study group, then only writing down each student’s tiny differences from that average.
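
A minimal numpy sketch of this group-then-subtract step, using a few rounds of plain k-means. The cluster count and feature space here are simplified relative to the paper's implementation:

```python
import numpy as np

def kmeans(x, num_groups, iters=10, seed=0):
    """Tiny Lloyd's k-means: returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), num_groups, replace=False)]
    for _ in range(iters):
        # Assign each token to its nearest centroid.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Recompute each centroid as its group's mean (skip empty groups).
        for g in range(num_groups):
            if np.any(assign == g):
                centroids[g] = x[assign == g].mean(axis=0)
    return centroids, assign

def semantic_aware_smoothing(tokens, num_groups=256):
    """Subtract each group's centroid so what remains is small and centered near zero."""
    centroids, assign = kmeans(tokens, num_groups)
    residuals = tokens - centroids[assign]
    return residuals, centroids, assign
```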

🍞 Top Bread (Hook) Think of sculpting: remove big chunks first, then smooth finer bumps over a few passes.

🥬 Filling (The Actual Concept)

  • What it is: Progressive Residual Quantization repeats smoothing and subtracting on the leftovers, stage by stage.
  • How it works (recipe):
    1. Stage 1: Group, subtract, quantize residuals.
    2. Stage 2+: Treat those residuals as new inputs; group, subtract, quantize again.
    3. Store low-bit residuals plus centroids/assignments per stage.
  • Why it matters: The first stage drops most error; later stages catch small details, giving a smooth quality–memory trade-off.

🍞 Bottom Bread (Anchor) Like saving a photo first as a clear thumbnail, then adding extra layers of detail—each layer is tiny, but together they look nearly lossless.
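
A minimal sketch of the multi-stage version, reusing the `semantic_aware_smoothing`, `quantize`, and `dequantize` helpers sketched earlier. It follows one plausible reading of the recipe above: smooth repeatedly, then store only the final low-bit residuals plus per-stage centroids and assignments:

```python
def progressive_residual_quantize(tokens, stages=4, num_groups=256, bits=2):
    """Smooth in several passes, then quantize whatever is left."""
    residual = tokens
    stage_centroids, stage_assignments = [], []
    for _ in range(stages):
        residual, centroids, assign = semantic_aware_smoothing(residual, num_groups)
        stage_centroids.append(centroids)
        stage_assignments.append(assign)
    q, scale = quantize(residual, bits)   # only the final residuals are stored low-bit
    return q, scale, stage_centroids, stage_assignments

def reconstruct(q, scale, stage_centroids, stage_assignments):
    """Dequantize, then add centroids back from the last stage to the first."""
    x = dequantize(q, scale)
    for centroids, assign in zip(reversed(stage_centroids), reversed(stage_assignments)):
        x = x + centroids[assign]
    return x
```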

Practical boosts: QVG adds centroid caching (warm-start k-means using last chunk), a fused dequant+add-centroid kernel, and per-group FP8 scales to keep speed high while memory stays low.
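
Centroid caching is easy to picture: instead of a cold random initialization, seed k-means for the new chunk with the previous chunk's centroids so only a couple of refinement iterations are needed. A hypothetical variant of the `kmeans` helper above:

```python
import numpy as np

def kmeans_warm_start(x, prev_centroids, iters=2):
    """Warm-start Lloyd iterations from the last chunk's centroids (sketch only)."""
    centroids = prev_centroids.copy()
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for g in range(len(centroids)):
            if np.any(assign == g):
                centroids[g] = x[assign == g].mean(axis=0)
    return centroids, assign
```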

03 Methodology

At a high level: Input video tokens → group similar tokens (Semantic-Aware Smoothing) → subtract centroids to get small residuals → (optionally repeat for T stages) → quantize residuals to INT2/INT4 and store → when needed, dequantize and add back centroids from last stage to first → output reconstructed KV for attention.

Step-by-step (with why and an example):

  1. Chunk the stream
  • What happens: Split the growing KV-cache into manageable chunks (e.g., 12–20 latent frames per chunk) and process one chunk at a time.
  • Why it exists: Keeps latency low and avoids reprocessing the whole history; fits streaming.
  • Example: HY-WorldPlay uses 12-frame chunks; Self-Forcing uses 16.
  2. Semantic-based grouping (k-means)
  • What happens: For the N tokens in the chunk, run k-means (e.g., C=256) on their d-dimensional features to make C groups of similar tokens.
  • Why it exists: Similar tokens share large values in similar channels; grouping lets us represent shared parts once (the centroid) instead of many times.
  • Example data: If three sky patches across adjacent frames have vectors like [9.3, -8.8, 3.9, -5.4, ...], [9.2, -8.3, 3.4, -5.3, ...], [9.0, -8.3, 3.3, -5.2, ...], their average (centroid) captures the big numbers.
  3. Centroid subtraction → residuals
  • What happens: For each group, subtract the centroid from every member token to get residuals with smaller magnitudes, centered near zero.
  • Why it exists: Quantization error scales with the biggest number; centroid subtraction reduces those big numbers dramatically.
  • Example: After subtraction, the three sky residuals might look like [+0.3, -0.5, +0.4, 0.0, ...]—much easier to quantize.
  4. Progressive Residual Quantization (optional stages T>1)
  • What happens: Treat the residuals as new inputs; repeat grouping and subtraction for 2–4 stages (QVG-Pro uses S=4) to capture fine details.
  • Why it exists: The first pass removes most error (~5.8× MSE reduction); later passes pick up small remaining structure with diminishing returns.
  • Example: Stage 1 cleans 80–90% of the error; stages 2–4 nibble down the rest for near-lossless quality at slightly higher metadata cost.
  5. Low-bit quantization (INT2/INT4)
  • What happens: Quantize the final residuals using per-group scales (stored in compact FP8) and round to 2- or 4-bit integers.
  • Why it exists: This is where the big memory win happens; quantized values dominate storage (≥65%).
  • Example: With INT2, each residual value uses just 2 bits; block size B=64 gives the highest compression, while B=16 gives the best quality.
  6. Store compact metadata
  • What happens: Save (a) quantized residuals, (b) per-group scales, (c) group centroids per stage, and (d) assignment indices π (uint8 with C=256).
  • Why it exists: Needed to reconstruct the original KV on demand with minimal error.
  • Example: In QVG-Pro (4 stages, B=16), more metadata is stored than in QVG (1 stage, B=64), but the video looks even closer to BF16.
  7. Fast reconstruction (dequantization)
  • What happens: When attending to history, a fused kernel dequantizes residuals and adds back the right centroids per stage (from last to first), keeping intermediates in registers.
  • Why it exists: Avoids multiple memory trips and keeps overhead under 4% end-to-end.
  • Example: For one token i, pick its group at stage T, add its centroid, then stage T−1, and so on until the original-space KV is recovered (see the end-to-end sketch after this list).
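
Putting the pieces together, a small end-to-end demo of the flow above on synthetic data, reusing the numpy sketches from the earlier sections (the real system uses fused CUDA/Triton kernels, per-group FP8 scales, and per-block quantization, all omitted here):

```python
import numpy as np

# Synthetic chunk of KV tokens: a few shared "semantic" patterns plus small
# per-token detail, mimicking spatiotemporal redundancy (illustration only).
rng = np.random.default_rng(0)
base = rng.normal(scale=5.0, size=(8, 64))
tokens = base[rng.integers(0, 8, size=1024)] + rng.normal(scale=0.3, size=(1024, 64))

q, scale, cents, assigns = progressive_residual_quantize(tokens, stages=2, num_groups=8, bits=2)
recon = reconstruct(q, scale, cents, assigns)

naive_q, naive_scale = quantize(tokens, bits=2)      # plain 2-bit round-to-nearest
naive = dequantize(naive_q, naive_scale)

print("smoothed + 2-bit MSE:", np.mean((recon - tokens) ** 2))
print("plain 2-bit MSE     :", np.mean((naive - tokens) ** 2))   # much larger
```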

Secret sauce (why it’s clever):

  • It moves difficult, outlier-heavy numbers into clean centroids and turns the rest into tiny, uniform residuals that 2-bit rounding can handle.
  • It exploits video’s natural redundancy (space and time) rather than fighting it.
  • It’s training-free and plug-in: no model retraining or data collection required.

What breaks without each step:

  • No grouping: Outliers force big scales; 2-bit quality craters.
  • No centroid subtraction: Residuals stay large; error remains high.
  • No progressive stages: Fine details get lost or banded.
  • No fused reconstruction: Latency spikes; streaming stalls.
  • No centroid caching: k-means startup becomes a bottleneck.

Concrete mini example:

  • Suppose a group of 64 tokens has a max absolute value ≈ 10 before smoothing.
  • After centroid subtraction, max abs might drop to ≈ 0.5.
  • With 2-bit quantization, scale fits the 0.5 spread well; rounding error is small.
  • A second stage might drop residual spread to ≈ 0.2, tightening quality further.
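
The same arithmetic in a few lines, reusing the `quantize`/`dequantize` helpers from earlier; the spreads are the illustrative numbers above, not measured values:

```python
import numpy as np

rng = np.random.default_rng(1)
before = rng.uniform(-10.0, 10.0, size=64)   # spread ~10 before smoothing
after = rng.uniform(-0.5, 0.5, size=64)      # spread ~0.5 after centroid subtraction

for name, values in [("before smoothing", before), ("after smoothing", after)]:
    q, scale = quantize(values, bits=2)
    rmse = np.sqrt(np.mean((dequantize(q, scale) - values) ** 2))
    print(f"{name}: 2-bit RMSE ~ {rmse:.3f}")  # error shrinks with the spread
```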

“Sandwich” refreshers for the two main blocks:

🍞 Top Bread (Hook) You know how sorting Lego bricks by color before building makes everything faster and neater?

🥬 Filling (The Actual Concept)

  • What it is: Semantic-Aware Smoothing sorts tokens by similarity and removes the common part.
  • How it works: Group → average → subtract → small, tidy residuals.
  • Why it matters: Small, tidy numbers are perfect for tiny integers like 2-bit.

🍞 Bottom Bread (Anchor) It’s like pulling out all the blue bricks (centroid), then only noting which ones are slightly darker or lighter (residuals).

🍞 Top Bread (Hook) Think of shrinking a picture in steps: first make a small thumbnail, then add sharpness and texture layers.

🥬 Filling (The Actual Concept)

  • What it is: Progressive Residual Quantization compresses leftovers in a few small passes.
  • How it works: Stage-by-stage grouping and subtraction, then quantize.
  • Why it matters: Each stage adds detail without blowing up memory.

🍞 Bottom Bread (Anchor) After stage 1 your video already looks good; by stage 4 it’s almost indistinguishable from the original while still tiny.

04 Experiments & Results

The test: Measure whether QVG keeps videos faithful to a BF16 reference while slashing KV memory, and whether it holds up over long horizons (hundreds of frames). They report fidelity (PSNR, SSIM, LPIPS), perceptual scores (VBench: background consistency, subject consistency, image and aesthetic quality), compression ratio, and end-to-end latency.
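
For reference, PSNR is the main fidelity number quoted below; a minimal implementation against a BF16 reference frame looks like this (SSIM and LPIPS need dedicated libraries and are omitted):

```python
import numpy as np

def psnr(reference, test, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_value ** 2 / mse))
```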

The competition: Prior KV quantizers designed for LLMs—Round-to-Nearest (RTN), KIVI, and QuaRot—using the same block size (e.g., 16) for fair comparison.

Models and setup: LongCat-Video-13B (continuation with a 73-frame context, generates 20-frame chunks), HY-WorldPlay-8B (12-frame chunks), Self-Forcing-Wan-1.3B (16-frame chunks). Hardware: mainly NVIDIA H100; notable demo on a single RTX 4090 for HY-WorldPlay-8B.

Scoreboard (with context):

  • LongCat-Video-13B, INT2:
    • QVG-Pro: PSNR ≈ 30.38 with ~5.0× compression—like getting a solid A while shrinking memory to one-fifth.
    • QVG: PSNR ≈ 28.72 with ~6.94× compression—like getting an A− and fitting nearly seven times more in memory.
    • Baselines (RTN/KIVI/QuaRot at ~6.4×): PSNR ≈ 20–22—more like a D/C−; visibly degraded.
  • HY-WorldPlay-8B, INT2:
    • QVG-Pro: PSNR ≈ 31.56 at ~5.2× compression—an A while still very compact.
    • QVG: PSNR ≈ 29.17 at ~7.05×—an A− with the tightest packing.
    • Baselines around PSNR ≈ 24–25 at ~6.4×—B− or worse, noticeable artifacts.
  • INT4 settings:
    • All methods improve, but QVG and QVG-Pro still lead or match the best fidelity while offering better compression flexibility.

Long-horizon behavior (Self-Forcing):

  • Image Quality tracked every 50 frames out to ~700 frames. BF16 drifts modestly; RTN/KIVI/QuaRot drop sharply after ~100 frames under INT2.
  • QVG and QVG-Pro stay near-lossless across the whole sequence, indicating stronger resistance to drift thanks to longer effective memory within the same budget.

Error analysis and ablations:

  • Semantic-Aware Smoothing reduces MSE for Keys by ~6.9× and for Values by ~2.6× across INT2/INT4, confirming it tames the dynamic range.
  • First progressive stage gives the biggest MSE drop (~5.8×); later stages add smaller gains.
  • Block size trade-off: B=64 yields the best compression; B=16 yields the best quality. QVG (S=1, B=64) maximizes compression; QVG-Pro (S=4, B=16) maximizes fidelity.

Latency and practicality:

  • End-to-end overhead: ~2.1% on LongCat, ~1.5% on HY-WorldPlay, ~4.3% on Self-Forcing. Small enough for live or interactive use.
  • Memory accounting: Quantized values dominate (≥65%); metadata (centroids, assignments, scales) is modest and tunable.

Surprising findings:

  • With QVG, some setups can use longer effective context under the same hardware budget and even surpass BF16 quality measured under the original smaller cache budget.
  • HY-WorldPlay-8B runs on a single RTX 4090 for the first time while maintaining strong fidelity—broadening access.
  • Keys benefit more than Values from smoothing (larger MSE drop), suggesting different distributions and room for value-specific tweaks.

05 Discussion & Limitations

Limitations:

  • Reliance on redundancy: When videos have rapid chaotic motion, flashing lights, or abrupt cuts, token similarity drops and savings vs quality can shrink.
  • Metadata overhead: Multi-stage setups (QVG-Pro) store more centroids and assignments; still small, but not free.
  • K-means cost: Even with centroid caching, clustering adds some latency. Very tight real-time loops on small GPUs may notice it.
  • Fixed grouping: k-means is unsupervised and not learned; a learned, content-aware compressor might do better but would need training.
  • Values vs Keys: Values show more irregularity; a specialized value pathway might further help.

Required resources:

  • A modern GPU with enough memory for the model weights plus quantized KV, plus a bit for metadata; CUDA/Triton kernels as implemented.
  • For best performance: fused dequant kernels, centroid caching, and per-group FP8 scales; pre-RoPE key caching helps.

When NOT to use:

  • If your videos are very short or your context window is tiny, the added complexity may not pay off.
  • If you do offline, non-streaming generation and don’t mind large memory or slow recomputation.
  • If your content has extremely low redundancy (e.g., heavy noise or constant jump cuts) and you demand perfect reconstruction.

Open questions:

  • Can learned grouping or vector quantization tailored to video distributions beat k-means while staying lightweight?
  • Adaptive control: choose bit-width, number of stages, and block size on the fly based on content difficulty.
  • Token importance: mix precision by saliency (e.g., faces, moving subjects) without manual labels.
  • Combine with sparsity/pruning or temporal key dropping for even larger savings.
  • Extend to 3D/360° video or video+audio, where cross-modal redundancy could help.

06 Conclusion & Future Work

Three-sentence summary: QVG compresses the KV-cache for auto-regressive video by first grouping similar tokens and subtracting their centroids, then quantizing the small residuals—even in just 2–4 bits—while keeping quality high. A progressive, multi-stage version further trims error, and efficient kernels keep added latency under ~4%. This shifts the quality–memory Pareto frontier, enabling much longer, steadier videos on everyday GPUs.

Main achievement: A training-free, video-aware KV quantization method that delivers up to ~7× memory savings with near-lossless fidelity and practical runtime.

Future directions: Content-adaptive staging and bit-width, learned semantic grouping, value-specific handling, and integration with sparsity or learned codebooks; extensions to 3D or multi-modal generation.

Why remember this: It shows that respecting video’s natural redundancy—rather than forcing LLM tricks onto it—unlocks low-bit KV caching that is both accurate and fast, making long, consistent video generation far more accessible.

Practical Applications

  • Run HY-WorldPlay-8B or similar models on a single consumer GPU (e.g., RTX 4090) for prototyping and demos.
  • Enable real-time, long-horizon streaming video generation for live events or interactive shows.
  • Improve long-form narrative consistency in story videos by fitting longer context into the same GPU memory.
  • Deploy world-model rollouts for robotics or simulation with longer, steadier horizons on limited hardware.
  • Power interactive game or VR experiences where low latency and long memory are both essential.
  • Reduce cloud serving costs by shrinking KV memory footprint across many concurrent video streams.
  • Support mobile or edge servers with tighter memory budgets via 2–4 bit KV caching and fast dequant kernels.
  • Enhance video editing tools that need frame-consistent effects over long timelines without re-rendering overhead.
  • Combine with sparsity or attention pruning to stack efficiency gains for even longer videos.
  • Use adaptive stage/block settings to balance quality and memory per scene (e.g., calm scenes at 2-bit, action scenes at 4-bit).
#Quant VideoGen (QVG)#KV-cache quantization#2-bit quantization#auto-regressive video diffusion#spatiotemporal redundancy#Semantic-Aware Smoothing#Progressive Residual Quantization#k-means clustering#residual coding#PSNR#HY-WorldPlay#LongCat-Video#Self-Forcing#Pareto frontier#GPU memory optimization