
End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Intermediate
Yuwei Guo, Ceyuan Yang, Hao He et al. Ā· 12/17/2025
arXiv Ā· PDF

Key Summary

  • This paper fixes a common problem in video-making AIs where tiny mistakes snowball over time and ruin long videos.
  • The authors introduce Resampling Forcing, a way to train without any teacher model by letting the AI practice with its own slightly messed-up past frames.
  • They simulate real-world mistakes during training using self-resampling, so the model learns to correct errors instead of spreading them.
  • A causal mask keeps time flowing forward, so future frames never sneak information back into the past.
  • They also add history routing, which lets the model look back only at the most helpful past frames, saving memory and compute.
  • The method trains end-to-end from scratch and matches or beats methods that rely on giant teacher models.
  • On long videos (like 15 seconds), it stays stable and consistent, avoiding the 'drift' that hurts other models.
  • It works well even with 75% sparse attention, keeping quality almost the same while making generation more efficient.
  • This approach is a step toward reliable world-simulation videos that follow cause-and-effect rules.

Why This Research Matters

Long videos often fall apart when AI models rely on their own slightly wrong outputs, but this method teaches them to recover instead of crumble. It removes the need for massive future-peeking teachers, keeping cause-and-effect intact and training practical. By routing attention to the few most relevant past frames, it makes long video generation more efficient while staying consistent. That means better world-simulation videos that obey physics, like liquids filling up instead of magically draining. Creators can make longer, steadier videos, and interactive tools can stay responsive without losing quality. Over time, this brings us closer to reliable, controllable video AIs for education, entertainment, and simulation.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine filming a long domino chain falling. If you place even one domino a little crooked, the mistake can spread and the whole show collapses. That's how many video AIs used to behave: a tiny error in one frame could slowly wreck the rest of the video.

🄬 The World Before: AI video makers got good at creating short, pretty clips when they could look at all frames at once (bidirectional models) or when the training setup fed them the perfect past (teacher forcing). But real life is causal: the present depends on the past, not the future. If we want AIs that can predict what comes next (like a world simulator), we need them to generate videos one step at a time—frame after frame—using only what has already happened. That is called autoregressive video generation.

šŸž Anchor: Think of a storyteller who only knows the story up to the last page that’s been read, not the ending. They must keep telling the story one page at a time without peeking ahead.

šŸž Hook: You know how baking cookies from scratch requires following steps in order—mix, scoop, bake—and you can’t taste cookies before the dough is baked? AIs that make videos need that kind of one-way timeline too.

🄬 The Problem: During training, many systems used teacher forcing, where the model always sees the true, perfect past frames. But during real use (inference), the model only has its own previous outputs, which are never perfect. This mismatch is called exposure bias. Small flaws in earlier frames become the ingredients for the next frames, and the mistakes keep compounding, causing long videos to blur, drift in color, or even break physical logic (like liquid levels going down while you’re still pouring).

šŸž Anchor: It’s like practicing piano on a perfectly tuned instrument, then performing on a slightly out-of-tune one. If you never practiced correcting for imperfections, your performance goes off-key quickly.

šŸž Hook: Imagine a strict coach who only lets you shoot basketballs from perfect passes in practice. Then, in a real game, your teammates throw wobbly passes, and you miss, miss, miss. Practice didn’t match reality.

🄬 Failed Attempts: People tried adding a little random noise to past frames, hoping to mimic mistakes. But random noise isn’t the same as the model’s real errors, so the training still didn’t match reality well. Others used post-training tricks like Self Forcing: roll out a whole video with the model, then use a giant teacher model (that can see forward and backward in time) or a discriminator to nudge the student closer to the real distribution. That helped, but it required big extra models, risked leaking future information (breaking causality), and made scaling from scratch hard.

šŸž Anchor: It’s like a student always consulting a future-knowing oracle during practice. They get good grades, but they don’t actually learn to reason step-by-step.

šŸž Hook: Picture a library where every time you add a new page to a story, your stack of earlier pages gets taller. Looking back through all of it becomes slower and slower.

🄬 Another Challenge: As videos grow longer, the model has to attend to more and more past frames. Full attention over all history becomes heavy and slow. A common shortcut is a sliding window (only look back a bit), but that can miss important long-term details and increase drifting.

šŸž Anchor: Like trying to solve a mystery while only allowed to reread the last two pages—you’ll miss clues planted early on.

šŸž Hook: Imagine training wheels that wean you off help as you learn to balance. That’s the missing piece for video AIs: a way to practice on the kinds of imperfect inputs they will actually face.

🄬 The Gap: We needed a teacher-free, end-to-end way to (1) simulate realistic model-made mistakes during training, (2) keep strict cause-and-effect (no peeking into the future), and (3) handle long histories efficiently without losing important context.

šŸž Anchor: The paper’s solution gives the model a controlled taste of its own imperfect history during training and teaches it to recover gracefully—like a cyclist learning to wobble and straighten out without falling.

02 Core Idea

šŸž Hook: You know how good rock climbers practice falling safely so they can recover without panic? This paper teaches video AIs to handle their own little 'falls' and keep climbing the long video wall.

🄬 The 'Aha!' Moment (one sentence): Train the model on its own realistically degraded past frames (self-resampling) while still asking it to predict the clean, correct next frame, so small errors stop snowballing.

Multiple Analogies:

  1. Weather forecaster: Instead of always practicing with perfect instruments, the forecaster also trains using slightly faulty readings—so when the real instruments act up, they still predict tomorrow well.
  2. Language storyteller: Rather than only reading from an error-free script, the storyteller also practices continuing the tale from lines they themselves wrote earlier—including typos—so the plot stays on track.
  3. Basketball: You practice shooting off imperfect passes you might actually get in a game, not just perfect chest passes in drills.

Before vs After:

  • Before: Teacher forcing made training too clean; at test time, the AI used its imperfect past frames and gradually drifted into blur or nonsense (error accumulation).
  • After: With Resampling Forcing, the AI sees realistic, self-made imperfections during training, learns to correct them, and keeps video quality stable over long timelines.
  • Before: Long histories slowed attention or forced small sliding windows that lost important context.
  • After: History routing lets the model selectively look back at only the most helpful frames, keeping computation steady while preserving global consistency.

Why It Works (intuition, not equations):

  • The root issue is the mismatch between clean training inputs and messy test-time inputs. If you train only on clean, you panic when it’s messy. If you also train on messy (but in a realistic way), you stay calm and correct course.
  • Self-resampling creates the kind of errors the model actually makes during generation, not random noise. So fixing those errors is the exact skill the model needs.
  • Keeping the target clean (predict the true next frame) avoids rewarding the model for copying its own mistakes. It learns to denoise history—not to normalize errors.
  • A causal mask enforces time’s arrow: no cheating by looking at the future.
  • History routing finds the most relevant clues from the past, so the model stays consistent without drowning in old frames.

Building Blocks:

  • Autoregressive Video Generation: make each frame using only past frames.
  • Teacher Forcing (for warmup only): start learning with perfect past to stabilize early training.
  • Causal Masking: make attention only look backwards.
  • Diffusion Loss: measure how well the model predicts the clean target frame from its noisy version.
  • Self-Resampling: simulate realistic history errors by partially re-generating past frames with the current model.
  • History Routing: dynamically pick the top-k most helpful past frames to attend to, saving compute while keeping consistency.

šŸž Anchor: In action, the model first lightly scrambles past frames and re-makes them with its current skill (introducing its own style of mistakes), then tries to predict the true next frame given that imperfect history. Over time, it learns to spot and fix those mistakes, just like a reader who can still understand a paragraph even if a few words are smudged.

03 Methodology

High-Level Overview: Input video clip and text prompt → (A) Simulate realistic history errors with self-resampling → (B) Train with frame-level diffusion loss under a causal mask using the degraded history as condition → (C) Optionally use history routing to select only the most helpful past frames → Output: a model that stays stable over long videos.

Step A: Self-Resampling (simulate errors the model really makes)

šŸž Hook: Imagine practicing violin while someone slightly detunes a few strings to mimic real-stage conditions—you learn to adjust by ear.

🄬 What it is: A way to degrade past frames in a realistic, model-like way by partially redoing their denoising trajectory with the current model.

How it works:

  1. Pick a simulation timestep ts that decides how strongly to degrade history (too small = too faithful, too big = drift risk). ts is drawn from a logit-normal distribution and gently shifted to favor moderate strength.
  2. Add noise to each past ground-truth frame up to ts.
  3. Using the current model, autoregressively denoise the remaining steps (no gradients), producing a slightly imperfect, model-like version of each past frame.
  4. Use these degraded frames as the history condition for training the next frames.

Why it matters: Without realistic errors, the model learns on perfect inputs and crumbles when facing its own imperfections at test time.

šŸž Anchor: Like rewriting a paragraph from a smudged photocopy you made yourself, then practicing fixing the smudges as you write the next paragraph.
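To make the procedure concrete, here is a minimal PyTorch-style sketch of self-resampling under flow-matching conventions. The function names, the model's call signature, the number of re-denoising steps, and the exact timestep-shifting formula are illustrative assumptions, not the paper's code.

```python
import torch

def sample_sim_timestep(shift: float = 3.0) -> float:
    """Draw t_s in (0, 1) from a logit-normal, then shift it toward
    moderate degradation strength (the shift form and value are assumptions)."""
    t = torch.sigmoid(torch.randn(()))             # logit-normal sample in (0, 1)
    t = shift * t / (1.0 + (shift - 1.0) * t)      # assumed timestep-shifting form
    return float(t)

@torch.no_grad()  # resampling only builds conditioning; no gradients flow
def self_resample_history(model, history, text_emb, num_steps: int = 4):
    """Degrade clean past frames with the model's own style of errors.

    history: (F, C, H, W) ground-truth past frames (latents).
    Returns a slightly imperfect, model-like version of the history.
    """
    t_s = sample_sim_timestep()
    noise = torch.randn_like(history)
    # Steps 1-2: noise every past frame up to t_s (flow-matching interpolation).
    noised = (1.0 - t_s) * history + t_s * noise
    # Step 3: re-denoise the remaining trajectory frame by frame, conditioning
    # each frame on the already-resampled ones so errors can compound over time.
    degraded = []
    for f in range(history.shape[0]):
        x, t = noised[f : f + 1], t_s
        prev = torch.cat(degraded) if degraded else None
        for step in range(num_steps):
            dt = t / (num_steps - step)
            v = model(x, t, history=prev, text=text_emb)  # predicted velocity
            x, t = x - dt * v, t - dt
        degraded.append(x)
    # Step 4: the caller uses this as the history condition for training.
    return torch.cat(degraded, dim=0)
```

The sketch mirrors the list above: noise up to ts, re-denoise causally with no gradients, and hand the result to training as conditioning.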

Step B: Causal Training with Frame-Level Diffusion Loss

šŸž Hook: You know how you can’t use tomorrow’s homework answers to solve today’s worksheet? Same rule here.

🄬 What it is: Train each frame in parallel, but only let it look at past frames (enforced by a causal mask), and judge it with diffusion loss against the true clean target.

How it works:

  1. For each target frame, create a noisy version at a training timestep ti.
  2. Feed the noisy target frame and the degraded history frames into a Diffusion Transformer with a causal mask (no peeking at the future).
  3. Predict the velocity (the direction to denoise) and compute the diffusion loss against the known clean target.
  4. Do this for all frames in parallel (thanks to masking) and update parameters.

Why it matters: Without the causal mask, the model might accidentally learn to use future info; without diffusion loss to a clean target, it could learn to copy its own mistakes instead of fixing them.

šŸž Anchor: It’s like solving a puzzle for each page of a comic book using only earlier pages, not later ones, and checking against the original clean page.
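The sketch below shows how the frame-parallel diffusion loss could look, with an explicit block-causal mask. Shapes, the velocity-target convention, and the model signature are assumptions, kept consistent with the self-resampling sketch above.

```python
import torch
import torch.nn.functional as F

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Block lower-triangular mask: tokens of frame i may attend only to
    frames <= i, so no information flows backwards in time."""
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_ids[:, None] >= frame_ids[None, :]   # (T, T) bool, True = allowed

def training_step(model, clean_frames, degraded_history, text_emb):
    """clean_frames: (F, C, H, W) ground-truth targets.
    degraded_history: self-resampled past frames used as conditioning."""
    num_frames = clean_frames.shape[0]
    # 1) Independent training timestep per frame, so every frame is a target.
    t = torch.rand(num_frames, 1, 1, 1)
    noise = torch.randn_like(clean_frames)
    noisy = (1.0 - t) * clean_frames + t * noise       # flow-matching interpolation
    target_velocity = noise - clean_frames             # the "direction to denoise"
    # 2-3) All frames are predicted in parallel; a mask like frame_causal_mask,
    #      applied inside the transformer's attention, blocks any peek at the future.
    pred_velocity = model(noisy, t, history=degraded_history, text=text_emb)
    # 4) Diffusion loss against the clean target keeps the model from learning
    #    to copy its own history mistakes.
    return F.mse_loss(pred_velocity, target_velocity)
```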

Step C: Teacher-Forcing Warmup (brief training wheels)

šŸž Hook: When you first learn to ride a bike, training wheels help you not crash immediately.

🄬 What it is: Start with teacher forcing for a short period so the model learns basic causal generation before practicing with self-made imperfections.

How it works:

  1. Train with perfect past frames for a small number of steps.
  2. Switch to self-resampling once the model can produce meaningful frames.

Why it matters: Without warmup, early random outputs produce unhelpful degraded histories, slowing or stalling learning.

šŸž Anchor: First learn balance on smooth ground, then practice bumps and turns.
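A possible two-phase schedule, reusing the sketch functions above; the warmup length, data loader, and optimizer are placeholders, not values from the paper.

```python
WARMUP_STEPS = 2_000   # illustrative warmup length, not the paper's number

for step, (clean_frames, text_emb) in enumerate(data_loader):  # assumed loader
    if step < WARMUP_STEPS:
        history = clean_frames                                  # teacher forcing: perfect past
    else:
        history = self_resample_history(model, clean_frames, text_emb)  # Resampling Forcing
    loss = training_step(model, clean_frames, history, text_emb)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```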

Step D: History Routing (efficient long-horizon memory)

šŸž Hook: Think of a detective who, instead of rereading the whole case file, jumps back to the few most relevant clues.

🄬 What it is: A parameter-free way to pick the top-k most relevant past frames for each query token so attention cost stays almost constant as videos grow.

How it works:

  1. For each query token in the current frame, compute similarity with compact descriptors (mean-pooled keys) of each past frame.
  2. Select the top-k frames with highest similarity (done per head and token, so different tokens can pick different frames).
  3. Use a two-branch attention: one branch attends inside the current frame; the other attends only to the selected history frames.
  4. Fuse results via a stable log-sum-exp trick so it matches a single softmax over the union.

Why it matters: Without routing, attention cost grows with video length; with a naive sliding window, the model may miss long-term links and drift.

šŸž Anchor: Like scanning the table of contents and flipping to the two chapters that matter most for your question.
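Below is a single-head sketch of this routing. Tensor shapes, the top-k value, and variable names are assumptions; a real implementation would run this per head inside the attention layer with fused kernels.

```python
import torch

def history_routing_attention(q, k_cur, v_cur, k_hist, v_hist, top_k: int = 2):
    """q:              (Tq, d) query tokens of the current frame
    k_cur, v_cur:   (Tc, d) keys/values of the current frame
    k_hist, v_hist: (Fh, Tf, d) keys/values of each past frame
    Output matches one softmax over {current frame} plus the routed frames."""
    d = q.shape[-1]
    # 1) Compact per-frame descriptors: mean-pooled keys of each history frame.
    descriptors = k_hist.mean(dim=1)                                     # (Fh, d)
    # 2) Per-token routing scores and top-k frame selection.
    scores = q @ descriptors.T                                           # (Tq, Fh)
    top_idx = scores.topk(min(top_k, scores.shape[-1]), dim=-1).indices  # (Tq, k)
    k_sel = k_hist[top_idx].flatten(1, 2)                                # (Tq, k*Tf, d)
    v_sel = v_hist[top_idx].flatten(1, 2)
    # 3) Two branches: within-frame attention and routed-history attention.
    logits_cur = (q @ k_cur.T) / d ** 0.5                                # (Tq, Tc)
    logits_hist = torch.einsum("td,tnd->tn", q, k_sel) / d ** 0.5        # (Tq, k*Tf)
    out_cur = torch.softmax(logits_cur, dim=-1) @ v_cur                  # (Tq, d)
    out_hist = torch.einsum("tn,tnd->td",
                            torch.softmax(logits_hist, dim=-1), v_sel)   # (Tq, d)
    # 4) Log-sum-exp fusion: weight each branch by its share of the total
    #    softmax normalizer, reproducing a single softmax over the union.
    lse = torch.stack([torch.logsumexp(logits_cur, dim=-1),
                       torch.logsumexp(logits_hist, dim=-1)], dim=-1)    # (Tq, 2)
    w = torch.softmax(lse, dim=-1)
    return w[:, :1] * out_cur + w[:, 1:] * out_hist
```

The log-sum-exp fusion keeps the two branches mathematically equivalent to one softmax over the current frame plus the selected history frames, so sparsity changes only which keys are considered, not how they are weighted.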

Concrete Mini-Example:

  • Prompt: ā€œA red balloon floating down a street for 15 seconds.ā€
  • Self-resampling picks ts ā‰ˆ moderate. Past frames are lightly noised and re-denoised by the current model, adding realistic wiggles (model-style errors).
  • The model predicts the next frame from this slightly imperfect history, but it is graded against the clean, true target frame.
  • History routing selects, say, the very first frame (color reference) and the most recent 2 frames (motion continuity), keeping the balloon’s hue and position consistent over time.

Secret Sauce:

  • Simulate the exact kind of mistakes the model makes (self-resampling), and train it to fix them while keeping a clean target.
  • Enforce causality strictly (causal mask) to protect physical logic.
  • Route to the most relevant history (top-k) to scale long videos without losing global consistency.

04 Experiments & Results

The Test: The authors evaluate how well the model keeps videos stable over time (temporal quality), how good they look (visual quality), and how well they match the text prompt (text quality). They especially focus on longer videos (15 seconds) where error accumulation usually shows up.

The Competition: They compare against strong recent methods, including SkyReels-V2, MAGI-1, NOVA, Pyramid Flow, CausVid, Self Forcing, and LongLive. Several of these rely on big teacher models or relax strict causality; others struggle with long-term drift.

The Scoreboard (with context): Using VBench metrics, they split 15-second videos into 0–5 s, 5–10 s, and 10–15 s to see how quality holds up over time.

  • Overall 0–15 s: Their method reaches temporal ā‰ˆ 91.2 and visual ā‰ˆ 64.7, which is like getting an A in staying consistent while keeping a solid B+ in appearance—without any teacher model.
  • Mid 5–10 s and late 10–15 s: Their scores remain strong and competitive, showing that quality does not collapse as the video gets longer.
  • With 75% sparse attention via history routing, quality drops only slightly relative to full attention, which is like running much faster while losing almost no accuracy.

Why these results matter:

  • Many strict autoregressive models fall apart over long stretches; here, stability stays high because the model has been trained to fix its own kinds of mistakes.
  • Competing distillation-based methods often need a giant teacher (e.g., 14B parameters) that can leak future info, which can break causality. This approach avoids that and still matches long-video quality.

Qualitative Observations:

  • Progressive degradation (blur, color shifts) is much slower or absent in this method compared to baselines.
  • Physical consistency is better. In a 'milk pouring' scene, some distillation methods briefly violate physics (liquid level goes down while pouring continues), while this method preserves monotonic filling—because it never learned from a future-peeking teacher.

Ablations (what components matter):

  • Error Simulation: Autoregressive self-resampling beats just adding random noise and also beats resampling all frames in parallel. Why? It captures both per-frame imperfections and how errors compound across time.
  • Timestep Shifting (how strong to degrade history): Too weak and errors accumulate; too strong and content drifts. A moderate shift balances faithfulness and flexibility.
  • History Strategies: Dynamic routing (top-5 or even top-1) outperforms a sliding window at the same sparsity. Routing lets different tokens pick different long-term references, preserving global consistency better.

Training & Inference Setup (for completeness):

  • Base model: WAN2.1-1.3B adapted to causal attention; trained first with a short teacher-forcing warmup, then with Resampling Forcing on 5 s and 15 s videos, and finally fine-tuned with history routing.
  • Inference: Consistent sampler settings across frames; history routing optionally enabled for speed.

Surprising Findings:

  • Even extreme sparsity (top-1) degrades quality only modestly, and performs better than a fixed sliding window of size 1—because routing chooses informative frames rather than blindly picking the nearest past.
  • Routing frequency shows the model often selects both very early 'anchor' frames and very recent frames—a smart blend of global identity and local continuity.

05 Discussion & Limitations

Limitations:

  • Speed: Diffusion needs multiple denoising steps at inference, so real-time generation may require later acceleration (e.g., few-step distillation or faster samplers).
  • Training Memory: The model processes diffusion samples plus history, which increases training memory use. There’s room for architectural streamlining.
  • Parameter Choices: The strength of history degradation (timestep shifting) must be chosen sensibly; extreme values can either let errors accumulate or cause drift.
  • Scope: The paper focuses on text-to-video generation; extensions to complex interactive control or multimodal conditioning may need extra components.

Required Resources:

  • A capable GPU cluster for training on long videos (e.g., 15s, hundreds of frames) with attention and diffusion steps.
  • Datasets of sufficiently diverse videos to learn robust temporal behavior.
  • Efficient attention kernels (e.g., FlashAttention) to make routing practical at scale.

When NOT to Use:

  • If you must generate videos in strict real time without any acceleration techniques, classic diffusion may be too slow.
  • If your application requires deliberate future-aware editing (e.g., bidirectional refinement of an entire clip), a strictly causal model is not the best fit.
  • If your history is extremely short and always clean (e.g., two or three frames), the benefits of self-resampling may be smaller.

Open Questions:

  • Can we fully automate the scheduling of degradation strength (ts shifting) based on model confidence?
  • How far can history routing sparsity go (beyond top-1) while maintaining global coherence in minute-long videos?
  • Can we combine self-resampling with improved samplers or diffusion distillation to reach real-time performance without losing causality?
  • How well does the approach generalize to interactive applications where users edit frames on the fly or provide new conditions mid-generation?

06 Conclusion & Future Work

Three-Sentence Summary: The paper introduces Resampling Forcing, a teacher-free training method that lets an autoregressive video diffusion model practice with its own realistically degraded history so it learns to correct mistakes instead of amplifying them. A strict causal mask keeps time flowing forward, and history routing lets the model look back only at the most relevant past frames, making long videos efficient and consistent. The method matches or beats teacher-dependent baselines on long videos while preserving physical logic and temporal stability.

Main Achievement: Demonstrating that end-to-end, teacher-free training with self-resampling can close the train–test gap (exposure bias) in autoregressive video diffusion, stabilizing long-horizon generation without future leakage or giant teacher models.

Future Directions: Pair self-resampling with accelerated samplers or few-step distillation for real-time use; scale history routing to even longer horizons and richer modalities; and explore adaptive strategies that tune degradation strength based on model confidence.

Why Remember This: It’s a clean, scalable recipe for teaching video models to handle the messiness they’ll actually face, turning error accumulation into error correction—and bringing us closer to trustworthy, causally consistent world-simulating videos.

Practical Applications

  • Generate long, stable videos for storytelling without visual drift or physics breaks.
  • Create interactive video tools where users edit frame-by-frame and the model stays consistent over time.
  • Simulate game worlds that respond realistically to player actions with preserved causality.
  • Produce training data for robotics by rolling out long, physically consistent visual scenarios.
  • Develop educational content (science demos, experiments) where cause and effect must hold across many frames.
  • Power video summarization-by-generation, keeping key identities (faces, objects) consistent across long spans.
  • Enable efficient long-horizon generation on limited hardware via history routing (top-k attention).
  • Fine-tune existing video models to reduce error accumulation and improve long-term temporal quality.
  • Prototype video-based forecasting systems (e.g., weather-like visuals) that don’t drift over time.
#autoregressive video diffusion, #exposure bias, #teacher forcing, #causal masking, #self-resampling, #resampling forcing, #history routing, #sparse attention, #diffusion transformer, #long-horizon video generation, #temporal consistency, #error accumulation, #top-k routing, #logit-normal timestep sampling, #world simulation