Efficient Autoregressive Video Diffusion with Dummy Head
Key Summary
- This paper finds that about 1 out of every 4 attention heads in autoregressive video diffusion models mostly looks only at the current frame and almost ignores the past, wasting memory and time.
- The authors call these heads “dummy heads” and show that removing their caches barely hurts quality (about a 0.26% drop) while already speeding up generation.
- They introduce Dummy Forcing, a training-free way to give each attention head only the context it truly needs, cutting redundancy safely.
- Three tools power Dummy Forcing: Heterogeneous Memory Allocation (different head types get different context), Dynamic Head Programming (an optimal, greedy way to label head types), and Packed Attention Forward (packs head groups to reduce kernel launches and safely raise dummy head counts).
- On standard short video generation, Dummy Forcing hits real-time 24.3 FPS with about 1.4× end-to-end speedup and almost no quality loss.
- At higher resolutions like 1080P, the speedup rises to 2.0× while preserving quality, showing that the benefits grow with sequence length.
- For long-context storytelling, the method reallocates saved memory to useful heads, enabling up to 6.58× longer effective context at similar runtime.
- The idea is model-agnostic and training-free, working with Self Forcing, LongLive, and even large 14B-parameter setups.
- It combines well with other accelerators (e.g., TeaCache), reaching over 30 FPS, proving it is complementary rather than competitive.
- Bottom line: smarter, per-head context gives faster video generation without retraining and with tiny or no quality trade-offs.
Why This Research Matters
Video tools are moving from slow, offline rendering to interactive creation where every second counts. Dummy Forcing makes state-of-the-art video generators faster without retraining, so creators can iterate ideas in real time. It lowers compute and memory needs, which reduces costs and energy use, making high-quality generation more accessible. For long stories and high-resolution outputs, it sustains speed while keeping characters and styles consistent across scenes. It also plays well with other accelerators, stacking gains to reach smooth, 24–30+ FPS experiences. Overall, it turns a hidden inefficiency into a practical speed boost that benefits users today.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine watching a friend draw a flipbook. If they draw every page before you can peek, you wait a long time. Wouldn't it be better if they showed each page as soon as it’s ready?
🥬 The Concept (Autoregressive video diffusion - the world before): Autoregressive video diffusion models generate videos frame by frame, cleaning noise step by step so you can see frames as they’re made. How it works:
- The model learns to turn noisy images into clean frames.
- It creates frame 1, then frame 2 using frame 1 as helpful context, and so on.
- It remembers earlier frames in a special fast memory (the KV cache) so it can reuse them. Why it matters: Without frame-by-frame generation and memory, you would have to wait until all frames are finished, making interactive or real-time video creation hard. 🍞 Anchor: Live-stream-like video generators that keep producing the next second while you’re already watching the last one are using this idea.
🍞 Hook: You know how during a group project, different teammates focus on different tasks? One checks past notes, another summarizes them, and another polishes the final slide.
🥬 The Concept (Attention mechanism and multi-head attention): Attention lets AI decide which parts of the input matter most right now, and multi-head attention is like a team of mini-attentions, each with a different specialty. How it works:
- For each token (like a video patch), the model asks, “Which other tokens help me best?”
- Each head gives its own answer (scores) and mixes information accordingly.
- The model combines heads to form a richer understanding. Why it matters: Without attention (and multiple heads), the model would treat everything equally, missing important context like motion across frames. 🍞 Anchor: When generating “a cat jumping,” some heads focus on the cat’s shape now, others on where it was a moment ago, and others on background consistency.
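To make the mechanics concrete, here is a minimal multi-head attention sketch in PyTorch; the shapes, random weights, and missing masking are illustrative simplifications, not the paper's exact module:

```python
# Minimal multi-head attention sketch (illustrative; no masking, dropout, or caching).
import torch
import torch.nn.functional as F

def multi_head_attention(x, wq, wk, wv, wo, num_heads):
    """x: (batch, tokens, dim). Each head attends independently; outputs are then merged."""
    b, t, d = x.shape
    head_dim = d // num_heads
    # Project to queries/keys/values and split the feature dimension across heads.
    q = (x @ wq).view(b, t, num_heads, head_dim).transpose(1, 2)  # (b, heads, t, head_dim)
    k = (x @ wk).view(b, t, num_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(b, t, num_heads, head_dim).transpose(1, 2)
    # Each head scores "which other tokens help me" and mixes values accordingly.
    out = F.scaled_dot_product_attention(q, k, v)                 # (b, heads, t, head_dim)
    # Concatenate heads back into one vector per token and project out.
    return out.transpose(1, 2).reshape(b, t, d) @ wo

dim, num_heads = 64, 4
x = torch.randn(1, 8, dim)                                        # 8 tokens (e.g., video patches)
wq, wk, wv, wo = (torch.randn(dim, dim) * dim ** -0.5 for _ in range(4))
print(multi_head_attention(x, wq, wk, wv, wo, num_heads).shape)   # torch.Size([1, 8, 64])
```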
🍞 Hook: Picture a backpack with pockets. Old notes go in one pocket so you can quickly check what happened before.
🥬 The Concept (KV cache): The KV cache is fast memory that stores past keys and values so the model can quickly look back at previous frames. How it works:
- When a frame is processed, its key/value features are saved.
- Later frames can ‘look back’ by attending to these cached features.
- This avoids recomputing from scratch and makes long videos feasible. Why it matters: Without the cache, long videos would get very slow and costly because the model would repeatedly redo the same work. 🍞 Anchor: While drawing frame 30, the model can instantly glance at summaries from frames 1–29 in its cache, instead of rereading the whole flipbook.
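A toy sketch of that idea, assuming PyTorch; the per-frame shapes and the plain Python list used as storage are illustrative, not the model's real cache layout:

```python
# Minimal KV-cache sketch: keys/values of past frames are stored once and reused
# when later frames attend back to them (illustrative shapes).
import torch
import torch.nn.functional as F

class FrameKVCache:
    def __init__(self):
        self.keys, self.values = [], []          # one (heads, tokens, head_dim) tensor per frame

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def gather(self):
        # Concatenate all cached frames along the token axis.
        return torch.cat(self.keys, dim=1), torch.cat(self.values, dim=1)

heads, tokens, head_dim = 2, 4, 8
cache = FrameKVCache()
for frame in range(3):
    q, k, v = (torch.randn(heads, tokens, head_dim) for _ in range(3))
    cache.append(k, v)                           # store this frame's keys/values once
    k_all, v_all = cache.gather()                # look back over everything cached so far
    out = F.scaled_dot_product_attention(q, k_all, v_all)
    print(frame, out.shape, k_all.shape[1])      # cached context grows: 4, 8, 12 tokens
```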
🍞 Hook: Think of standing by a long parade—you won’t look at the entire street all the time; you peek at a window of recent floats and maybe remember one special float from the start.
🥬 The Concept (Sliding window with a sink frame and neighbor frames): To stay efficient, models often keep only a nearby ‘window’ of recent frames plus a special ‘sink’ frame that anchors the look and style. How it works:
- Choose a small moving window of recent frames (neighbors).
- Keep one sink frame fixed (like the first frame) as a global reference.
- As new frames arrive, the window slides forward. Why it matters: Without this, caches would grow huge on long videos, making attention too slow and memory-hungry. 🍞 Anchor: The sink frame is like a style board you never throw away; the neighbor frames are the last few pages you keep open for quick reference.
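A tiny helper showing which cached frames such a policy keeps; the frame indexing and window size below are assumptions for illustration:

```python
# Sliding window with a sink frame (illustrative): always keep frame 0 as the sink,
# plus the most recent `window` frames; everything in between can be evicted.
def frames_to_keep(current_frame, window):
    sink = [0]
    neighbors = list(range(max(1, current_frame - window), current_frame))
    return sink + neighbors

for i in (3, 10, 40):
    print(i, frames_to_keep(i, window=3))
# 3  -> [0, 1, 2]        (window not yet full)
# 10 -> [0, 7, 8, 9]
# 40 -> [0, 37, 38, 39]
```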
The problem: Even with these smart tools, the authors discovered a surprisingly common inefficiency inside multi-head attention. About 25% of the heads mostly stare at the current frame (over 80% of their attention), almost ignoring the past—even though history is available. That wastes cache space and compute.
Failed attempts and gaps: Prior speedups either pruned tokens per step (which costs time to decide what to prune), skipped diffusion steps (limited gains for already-fast diffusion), or used sparse patterns designed for fixed lengths (hard to apply to shifting, causal windows). What was missing was a head-aware, frame-level view: not all heads need the same context, and some need almost none.
Real stakes: Faster generation matters for live creativity tools, interactive storytelling, mobile or edge devices with limited memory, and high-resolution or long videos where costs explode. Trimming waste without retraining can make these experiences smoother, cheaper, and greener.
02 Core Idea
🍞 Hook: You know how a soccer team wins by assigning roles—some defend, some pass, some shoot—and not everyone chases the ball at once?
🥬 The Concept (Dummy heads): Dummy heads are attention heads that mostly look at the current frame and barely use history, so they don’t really help with cross-frame context. How it works:
- Measure how much each head attends to sink, neighbor, and current frames.
- Heads that put most attention on the current frame are tagged as dummy.
- Treat their context differently to save memory and time. Why it matters: If you let every head keep long history, you waste compute. If you remove context from the wrong heads, you lose motion consistency. 🍞 Anchor: Like benching players who keep ball-watching; you free space for teammates who actually defend or pass.
🍞 Hook: Imagine packing your schoolbag. You don’t put your entire bookshelf inside—you only bring the books you need for today’s classes.
🥬 The Concept (Dummy Forcing): Dummy Forcing is a training-free recipe that assigns each head only the context it truly uses, cutting redundant cache while keeping quality. How it works:
- Profile heads to see whether they prefer sink, neighbor, or current frames.
- Classify each head into one of three types and give each type the minimal helpful context.
- Pack compatible heads together to reduce the number of attention calls.
- Optionally, give dummy heads a tiny extra frame (i−1) so boundary mistakes don’t hurt. Why it matters: Without per-head context control, you pay a ‘one-size-fits-all’ tax that slows everything, especially at high resolution or long videos. 🍞 Anchor: After applying Dummy Forcing, 1080P videos can run up to 2.0× faster with quality intact.
The “Aha!” in one sentence: Not every attention head needs the same past—by right-sizing each head’s context and packing compatible work, we get big speedups with tiny or no quality loss.
Three analogies:
- Orchestra: Some instruments carry the melody (neighbor heads), one keeps the rhythm steady (sink heads), and some just add a quick sparkle (dummy heads). Give each their specific sheet music instead of a full symphony book.
- Library: Power readers want old archives (neighbor heads), one shelf holds the master style guide (sink), and skimmers just need the current page (dummy). Don’t make skimmers borrow the whole archive.
- Sports: Defenders (neighbor) watch the opponent’s last moves; a captain (sink) maintains team shape; a striker (dummy) focuses on the now. Don’t force every player to review the whole game history every minute.
Before vs. After:
- Before: Every head could access the same long window. Some heads barely used it, but we still paid the cost.
- After: Heads get tailored context (or none), caches shrink, kernel launches drop, and FPS rises (e.g., 24.3 FPS vs. 17.6 FPS).
Why it works (intuition): Video dependencies decay with time—nearby frames matter most for motion; a persistent sink anchors style; and a subset of heads specialize in current-frame refinement. If we stop overfeeding history to heads that won’t eat it, we save a lot without starving the model.
Building blocks (preview):
- Heterogeneous Memory Allocation (HMA): Sink heads see sink+current; neighbor heads see a short local window+current; dummy heads see current (and optionally i−1 packed).
- Dynamic Head Programming (DHP): An optimal, greedy labeling of heads into sink/neighbor/dummy that maximizes kept attention mass while forcing a chosen number of dummy heads.
- Packed Attention Forward (PAF): Pack sink and (slightly-extended) dummy heads in one call to cut overhead and safely increase dummy heads.
03 Methodology
At a high level: Input video → profile attention heads (who looks where?) → classify heads optimally (sink / neighbor / dummy) → give each head just-enough context (HMA) → pack compatible heads to cut launches (PAF) → generate the next frame faster with quality preserved.
Step 0 — Compute per-head frame attention scores (the diagnostic):
- What happens: For each attention head, we estimate how its queries from the current frame distribute attention over (a) sink, (b) neighbor window, and (c) current frame tokens, yielding a triple [α_sink, α_neighbor, α_current]. We can sample a subset of queries for speed.
- Why it exists: Without measuring head preferences, we’d be guessing which heads need history and which don’t.
- Example: A head has [0.35, 0.55, 0.10] → strong neighbor interest; another has [0.05, 0.10, 0.85] → likely dummy.
🍞 Hook: Like checking which classmates borrow old notes, which skim the current page, and which always consult the course guide. 🥬 The Concept (Frame attention score): It’s a 3-number summary per head telling how much it cares about sink, neighbor, and current frames. How it works:
- Read the head’s attention map for the current step.
- Sum scores over tokens belonging to sink, neighbor, and current groups.
- Normalize so they add up to 1. Why it matters: Without this, we can’t tailor context; we’d risk cutting needed history or keeping waste. 🍞 Anchor: A head with α_current ≈ 0.9 is clearly a current-only “skimmer.”
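A minimal sketch of how such a score could be computed from an attention map, assuming PyTorch; the token-group boundaries and head count below are illustrative, not the paper's exact layout:

```python
# Per-head frame attention score sketch: split each head's attention mass into
# [α_sink, α_neighbor, α_current] (illustrative token grouping).
import torch

def frame_attention_scores(attn, n_sink, n_neighbor):
    """attn: (heads, queries, kv_tokens) attention weights of the current frame's queries,
    with kv tokens ordered [sink | neighbor window | current frame]."""
    sink = attn[:, :, :n_sink].sum(dim=-1)                          # mass on the sink frame
    neighbor = attn[:, :, n_sink:n_sink + n_neighbor].sum(dim=-1)   # mass on the local window
    current = attn[:, :, n_sink + n_neighbor:].sum(dim=-1)          # mass on the current frame
    scores = torch.stack([sink, neighbor, current], dim=-1).mean(dim=1)  # average over queries
    return scores / scores.sum(dim=-1, keepdim=True)                # normalize rows to sum to 1

heads, queries, n_sink, n_neighbor, n_current = 8, 16, 4, 12, 4
attn = torch.softmax(torch.randn(heads, queries, n_sink + n_neighbor + n_current), dim=-1)
scores = frame_attention_scores(attn, n_sink, n_neighbor)           # (heads, 3)
print(scores[0])   # e.g. tensor([0.21, 0.59, 0.20]) -> a neighbor-leaning head
```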
Step 1 — Dynamic Head Programming (DHP):
- What happens: Using each head’s F[h] = [α_sink, α_neighbor, α_current], we assign a head to sink, neighbor, or dummy to maximize retained attention mass, while forcing a chosen number N of dummy heads. This can be solved optimally by a simple greedy rule.
- Why it exists: Dummy head positions are mostly stable, but not perfectly. Re-deciding per shot with a fast, near-constant overhead improves robustness.
- Example: If a head’s max(α_sink, α_neighbor) is tiny, it’s a great dummy candidate; if α_neighbor is large, keep it as neighbor.
🍞 Hook: Picking teams so the class project scores highest overall. 🥬 The Concept (Dynamic Head Programming): An optimization that labels heads to keep as much useful attention as possible while meeting a target number of dummy heads. How it works:
- For each head, compute ℓ_h = max(α_sink, α_neighbor) (the opportunity cost of making it dummy).
- Sort ℓ_h across heads; pick the N smallest as dummy (least to lose).
- Among the rest, choose sink vs. neighbor by whichever (α_sink vs. α_neighbor) is bigger. Why it matters: Without optimal labeling, we might prune context from important heads and hurt motion consistency. 🍞 Anchor: Like benching the kids who least reduce team performance if they sit out, and assigning others to their best-fit roles.
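The greedy rule can be written in a few lines. Below is a sketch assuming the Step-0 scores are available as a (heads, 3) tensor; the toy numbers mirror the mini-example given later in this section:

```python
# Dynamic Head Programming sketch: greedy labeling of heads into sink/neighbor/dummy.
import torch

def dynamic_head_programming(F_scores, num_dummy):
    """F_scores: (heads, 3) rows of [α_sink, α_neighbor, α_current].
    Returns one label per head: 0 = sink, 1 = neighbor, 2 = dummy."""
    # Opportunity cost of making a head dummy: the larger of its sink/neighbor mass.
    loss_if_dummy = F_scores[:, :2].max(dim=-1).values
    # The num_dummy heads with the least to lose become dummy.
    dummy_idx = torch.argsort(loss_if_dummy)[:num_dummy]
    # Remaining heads go to whichever of sink (0) / neighbor (1) holds more of their mass.
    labels = (F_scores[:, 1] > F_scores[:, 0]).long()
    labels[dummy_idx] = 2
    return labels

# Toy profile for 8 heads (each row sums to 1).
F_scores = torch.tensor([
    [0.55, 0.35, 0.10], [0.10, 0.52, 0.38], [0.02, 0.05, 0.93], [0.04, 0.10, 0.86],
    [0.60, 0.25, 0.15], [0.20, 0.58, 0.22], [0.05, 0.12, 0.83], [0.03, 0.08, 0.89],
])
print(dynamic_head_programming(F_scores, num_dummy=4))   # tensor([0, 1, 2, 2, 0, 1, 2, 2])
```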
Step 2 — Heterogeneous Memory Allocation (HMA):
- What happens: Each class of head gets a different, right-sized context: sink heads see [sink, current], neighbor heads see [local window, current], dummy heads see [current] (or [i−1, current] if packed).
- Why it exists: Different heads specialize; sharing one big context wastes memory and compute.
- Example: With a total window size of L=4, a neighbor head might see only the last 3 frames plus the current one, while sink heads skip those 3 and see just the sink frame plus the current one.
🍞 Hook: Packing lunch boxes differently: the big eater gets a full meal, the snacker gets a small snack, and the water-only friend just gets water. 🥬 The Concept (Heterogeneous Memory Allocation): Give each head only the frames it truly uses. How it works:
- Partition heads into sink/neighbor/dummy.
- Build per-head attention with the minimal needed frame groups.
- Run attention per group and place outputs back to the original head positions. Why it matters: Without it, the cache grows and attention calls are heavier than necessary. 🍞 Anchor: Cutting dummy and sink redundancy yielded up to 27.8% effective cache length (vs. baseline 100%) and large speedups.
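One way the per-group attention could look, sketched in PyTorch; the Python-level grouping loop, shapes, and window handling are illustrative, not the paper's fused kernels:

```python
# Heterogeneous Memory Allocation sketch: each head group attends over only the
# frames it needs (illustrative shapes; frame_kv[-1] is the current frame).
import torch
import torch.nn.functional as F

def hma_attention(q, frame_kv, labels, window):
    """q: (heads, T, d) current-frame queries. frame_kv: list of (k, v) per frame, each
    (heads, T, d); frame_kv[0] is the sink, frame_kv[-1] the current frame.
    labels: per-head role, 0 = sink, 1 = neighbor, 2 = dummy."""
    out = torch.empty_like(q)
    contexts = {
        0: frame_kv[:1] + [frame_kv[-1]],      # sink heads: [sink, current]
        1: frame_kv[-(window + 1):],           # neighbor heads: [local window, current]
        2: frame_kv[-1:],                      # dummy heads: [current] only
    }
    for role, frames in contexts.items():
        idx = (labels == role).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        k = torch.cat([f[0][idx] for f in frames], dim=1)   # gather only the needed frames
        v = torch.cat([f[1][idx] for f in frames], dim=1)
        out[idx] = F.scaled_dot_product_attention(q[idx], k, v)
    return out

heads, T, d = 8, 4, 16
frame_kv = [(torch.randn(heads, T, d), torch.randn(heads, T, d)) for _ in range(6)]
labels = torch.tensor([0, 1, 2, 2, 0, 1, 2, 2])              # from the DHP sketch above
print(hma_attention(torch.randn(heads, T, d), frame_kv, labels, window=3).shape)
```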
Step 3 — Packed Attention Forward (PAF):
- What happens: We give dummy heads a tiny extra frame (i−1) so edge cases are safer, and we pack sink and (now-extended) dummy heads into a single attention call, reducing kernel launches.
- Why it exists: Simply cranking up the number of dummy heads can misclassify boundary heads and hurt quality; packing fixes that and also reduces runtime overhead.
- Example: Instead of three attention kernel calls (sink, neighbor, dummy), we do two (pack sink+dummy, and neighbor), which often outweighs the small cost of adding frame i−1 for dummy heads.
🍞 Hook: Combining two quick errands into one trip saves time even if you add a tiny detour. 🥬 The Concept (Packed Attention Forward): Merge compatible head groups into fewer attention calls while giving dummy heads a tiny safety net. How it works:
- Extend dummy context to [i−1, i]; sink stays [sink, i].
- Pack sink and dummy heads into one call; run neighbor heads in another.
- Profit from fewer launches and safer dummy expansion. Why it matters: Without packing, increasing dummy heads risks quality drops and extra overhead from separate calls. 🍞 Anchor: With packing, models supported >50% dummy heads without noticeable quality loss and achieved 24.3 FPS.
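A sketch under one plausible reading of the packing trick: because sink heads ([sink, current]) and extended dummy heads ([i−1, current]) now attend over the same number of tokens, their keys and values can be stacked along the head axis and run in a single batched call; the names, shapes, and this stacking strategy are assumptions for illustration, not the paper's kernel implementation:

```python
# Packed Attention Forward sketch: one attention launch covers sink heads and
# (i-1)-extended dummy heads (illustrative stacking strategy).
import torch
import torch.nn.functional as F

heads, T, d = 8, 4, 16

def rand_kv():
    return torch.randn(heads, T, d), torch.randn(heads, T, d)

def packed_sink_dummy_attention(q, sink_kv, prev_kv, cur_kv, sink_idx, dummy_idx):
    """q: (heads, T, d). sink_kv / prev_kv / cur_kv: (k, v) tensors of a single frame each."""
    idx = torch.cat([sink_idx, dummy_idx])                   # heads sharing this packed call
    k = torch.cat([
        torch.cat([sink_kv[0][sink_idx], cur_kv[0][sink_idx]], dim=1),    # sink heads: [sink, i]
        torch.cat([prev_kv[0][dummy_idx], cur_kv[0][dummy_idx]], dim=1),  # dummy heads: [i-1, i]
    ], dim=0)
    v = torch.cat([
        torch.cat([sink_kv[1][sink_idx], cur_kv[1][sink_idx]], dim=1),
        torch.cat([prev_kv[1][dummy_idx], cur_kv[1][dummy_idx]], dim=1),
    ], dim=0)
    return idx, F.scaled_dot_product_attention(q[idx], k, v)  # one launch for both groups

sink_kv, prev_kv, cur_kv = rand_kv(), rand_kv(), rand_kv()
sink_idx, dummy_idx = torch.tensor([0, 4]), torch.tensor([2, 3, 6, 7])
idx, out = packed_sink_dummy_attention(torch.randn(heads, T, d), sink_kv, prev_kv, cur_kv,
                                        sink_idx, dummy_idx)
print(idx, out.shape)   # 6 packed heads, each over 2 frames (2*T tokens) of context
```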
Step 4 — Generate frames and iterate:
- What happens: With these assignments and packed calls, the model denoises the next frame, updates caches as needed, and repeats.
- Why it exists: The whole pipeline must stay streaming, real-time, and memory-friendly.
- Example: On standard 832×480 videos, the method hits 1.4× end-to-end speedup; on 1080P, up to 2.0×.
Secret sauce:
- Per-head, frame-level thinking: It’s not token-pruning every step (which adds overhead), but once-per-shot head labeling and fixed rules that are cheap and robust.
- Role-specific memory: Treat heads like specialists so each gets just the context it uses.
- Kernel packing: Fewer, smarter attention calls across heads bumps speed further.
What breaks without each step:
- No head profiling → you prune the wrong context and harm quality.
- No DHP → static, hand-picked dummy heads miss variations across prompts.
- No HMA → you keep redundant frames and lose speed.
- No PAF → you can’t safely raise dummy ratios or cut kernel overhead.
Concrete data mini-example:
- Suppose 8 heads; their max(α_sink, α_neighbor) values are [0.55, 0.52, 0.05, 0.10, 0.60, 0.58, 0.12, 0.08]. For N=4, DHP picks heads with 0.05, 0.08, 0.10, 0.12 as dummy; the rest become sink/neighbor by whichever is larger. Then HMA gives each head the right context, and PAF packs sink+dummy into one call.
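Running that selection in plain Python (an illustrative recap of the same greedy rule):

```python
# Mini-example worked through: pick the 4 heads with the smallest max(α_sink, α_neighbor).
costs = [0.55, 0.52, 0.05, 0.10, 0.60, 0.58, 0.12, 0.08]
dummy = sorted(range(len(costs)), key=lambda h: costs[h])[:4]
print(sorted(dummy))   # [2, 3, 6, 7] -> the heads with costs 0.05, 0.10, 0.12, 0.08
```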
04 Experiments & Results
The test: Evaluate speed (FPS) and quality (VBench/VBench-Long: Quality, Semantic, Total) across short 5s videos, long 30s+ videos, high resolutions (720P/1080P), and long-context storytelling (larger caches).
The competition: Compare with Self Forcing and LongLive baselines (both strong autoregressive diffusions), and with speedup ideas from adjacent areas: R-KV and Infinipot-V (KV pruning), TeaCache (fewer diffusion steps), plus prior non-AR systems (e.g., SkyReels-V2) and others for context.
Scoreboard highlights with context:
- Short videos (832×480):
- Baseline Self Forcing: 17.6 FPS, Total ≈ 84.00.
- Dummy Forcing: 24.3 FPS (≈ 1.4×), Total ≈ 83.90 (about 0.1 drop) — that’s like getting another 40% speed but still an A on quality.
- Competing KV pruning (R-KV, Infinipot-V) gave only ~1.1× because per-step token selection time eroded the win.
- TeaCache got ~1.2× but diffusion steps were already few, limiting gains.
- Long videos (30s, VBench-Long):
- Self Forcing: 17.6 FPS → Ours: 24.3 FPS (1.4×), with only a small quality change (e.g., Total 83.53 → roughly 83.19, depending on setup), showing stability over time.
- High-resolution (720P/1080P):
- 720P: 1.6× speedup with near-identical quality.
- 1080P: up to 2.0× speedup with essentially preserved quality. This matches the intuition that bigger sequences benefit more from per-head context trimming.
- Long-context storytelling:
- Sliding-window baselines only remembered ~36 frames. Dummy Forcing reallocated saved cache to neighbor heads, achieving up to 6.58× longer effective context at similar runtime and better consistency across shots.
Surprising findings:
- Naively removing all caches for about 25% of heads barely hurts (≈0.26% drop), revealing strong head specialization.
- Dummy head positions are fairly stable across prompts, AR steps, and denoising steps (core-set ratio up to ~0.92), making one-shot labeling feasible.
- Packed Attention Forward makes it possible to set >50% heads as dummy without noticeable quality loss because the [i−1] safety frame catches boundary cases.
- Combining with TeaCache yields over 30 FPS, proving orthogonality.
Module-level profiling:
- A single self-attention layer saw up to ~1.7× speedup as context length grew, aligning with the whole-model gains at high resolution or long sequences.
Large-scale model case:
- On a 14B-parameter RealTime model with 1600 heads, setting 800 as dummy maintained or even improved total scores on long videos while speeding up, showing the method scales.
Bottom line: Across tasks, resolutions, and models, Dummy Forcing reliably trades near-zero quality for sizable speed—especially when the sequence is long, which is exactly where users need acceleration most.
05 Discussion & Limitations
Limitations:
- Head labeling depends on measuring attention maps; while labels are quite stable, unusual prompts or architectures could shift head roles more than expected.
- The method is currently training-free; light fine-tuning might enable even more aggressive compression or better quality at very high dummy ratios.
- If you force too many dummy heads without packing (i−1) context, you can prune context-critical heads and hurt motion consistency.
Required resources:
- Access to attention maps once per shot for quick head profiling (≈100 ms typical).
- A GPU runtime that supports efficient per-head grouping and two packed attention calls.
- Enough VRAM for the smaller, tailored caches (less than the baseline needs) and the standard diffusion steps.
When NOT to use:
- Tiny clips with very short context, where baseline is already real-time and the profiling overhead might not pay back much.
- Highly nonstationary content where head roles change rapidly within the same shot (rare, but if so, you may need to re-label heads more often).
- Architectures without clear multi-head specialization or without KV caching.
Open questions:
- Why do dummy heads emerge so strongly during training? Can training objectives encourage even cleaner role separation?
- How far can packing and per-head context trimming go with light fine-tuning? Could some heads be fully removed?
- Can similar ideas accelerate other causal video generators or world models that have different attention layouts?
- What are the best adaptive schedules for re-labeling heads during extremely long interactive sessions?
Overall: Dummy Forcing shows that per-head, frame-aware memory pays off. Future work may fuse this with training-time sparsity or distillation to push efficiency and quality even further.
06 Conclusion & Future Work
Three-sentence summary: The paper discovers that many attention heads in autoregressive video diffusion barely use historical frames, creating waste. It introduces Dummy Forcing—a training-free way to give each head just the context it needs and to pack compatible heads—boosting speed while keeping quality. This yields up to 2.0× speedups (24.3 FPS at 832×480, 2.0× at 1080P) with tiny or no quality loss, and enables much longer effective context.
Main achievement: Turning a one-size-fits-all cache into a role-aware, per-head memory plan—validated by an optimal greedy head-labeling algorithm and a practical packing trick—so the model runs faster without retraining.
Future directions:
- Add light fine-tuning so models natively learn to place most context work on a smaller subset of heads.
- Explore automated schedules for re-labeling during extra-long, interactive sessions.
- Extend the idea to world models and other video architectures or to mixed-modality models (audio+video).
Why remember this: It’s a clean, portable insight—heads have roles; give them only what they need. When sequences get long (high-res or long videos), smart per-head context pays bigger dividends than global pruning, making real-time, consistent video generation far more practical.
Practical Applications
- Enable real-time preview in video editing apps so creators can see changes instantly at 24+ FPS.
- Speed up high-resolution (720P/1080P) ad and trailer generation on limited hardware.
- Improve consistency across multi-shot storytelling by reallocating saved memory to longer context.
- Power faster interactive animation tools for classrooms and workshops with modest GPUs.
- Reduce cloud inference costs for video platforms by shrinking per-head context and cache sizes.
- Boost responsiveness of live, prompt-to-video demos at events or product launches.
- Combine with step-skipping methods to reach 30+ FPS for broadcast-grade workflows.
- Deploy on edge devices (laptops, creator tablets) where memory is tight but real-time is desired.
- Accelerate batch production of short social clips without sacrificing quality.