LIVE: Long-horizon Interactive Video World Modeling

Intermediate
Junchao Huang, Ziyang Ye, Xinting Hu et al. Ā· 2/3/2026

Key Summary

  • LIVE is a new way to train video-making AIs so their mistakes don’t snowball over long videos.
  • Instead of copying a teacher model, LIVE asks the model to roll forward in time and then roll back to where it started, like walking to the park and retracing your steps home.
  • This forward-then-back rule (cycle consistency) gently forces the model to keep its errors small, because big errors make it impossible to get back home.
  • LIVE unifies older training tricks (Teacher Forcing and Diffusion Forcing) and adds a simple dial that controls how much real vs. generated video the model sees during training.
  • A step-by-step ā€œcurriculumā€ starts easy (mostly real frames) and gets harder (more generated frames), building error tolerance safely.
  • On three tests (RealEstate10K homes, UE game engine scenes, and Minecraft gameplay), LIVE stays stable much longer than other methods.
  • Compared to baselines, LIVE keeps image quality steady as rollouts get 2–4Ɨ longer, like holding an A grade while others drop to C.
  • LIVE needs no big teacher model, so it’s simpler and cheaper to deploy once trained.
  • It works in interactive settings (with camera moves or actions), not just passive video generation.

Why This Research Matters

Long interactive videos power experiences we care about, like smooth game playthroughs, stable drone flyovers, and reliable robot camera views. When errors spiral, these experiences quickly break—footage jitters, scenes drift, and agents get lost. LIVE makes long-horizon stability a learnable skill, so quality stays steady even as videos stretch from seconds to minutes. Because it needs no large teacher model, it is simpler and more practical to deploy in new domains. This can lower costs, speed up iteration, and broaden access to robust video world modeling. In short, LIVE helps turn flashy demos into dependable everyday tools.

Detailed Explanation


01 Background & Problem Definition

You know how when you’re telling a long story, a tiny mix-up early on can confuse the ending? Computers that make videos have the same problem: small mistakes can add up over time and ruin long scenes.

šŸž Top Bread (Hook): Imagine a friend retelling a chain of whispers. At the start, the message is clear. But each whisper adds a tiny error, and by the end it’s a mess. 🄬 Filling (The Actual Concept): Autoregressive video generation makes the next frame using the previous ones, over and over. It works step-by-step like a whisper chain.

  • What it is: A way for AI to create videos one frame at a time, each new frame depending on recent frames and any actions (like camera moves).
  • How it works: (1) Look at a small window of recent frames, (2) read the action or camera info, (3) predict the next frame, (4) slide the window and repeat.
  • Why it matters: Without good control, small prediction slips compound, making long videos drift, blur, or go off-track (a minimal code sketch of this step-by-step loop follows below). 🍞 Bottom Bread (Anchor): Think of a stop-motion movie: if each photo is placed a hair off, the final animation wobbles a lot.
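
To make the frame-by-frame loop concrete, here is a minimal sketch in Python. The sliding-window size and the `predict_next_frame` function are hypothetical stand-ins, not the paper’s actual architecture or settings.

```python
import numpy as np

WINDOW = 4  # hypothetical sliding-window length, not the paper's setting

def predict_next_frame(context, action):
    """Stand-in for the learned model: averages the context frames and
    nudges the result by the action, just so the loop runs end to end."""
    return np.mean(context, axis=0) + 0.01 * action

def rollout(prompt_frames, actions):
    """Autoregressive generation: one frame at a time, conditioned on a
    sliding window of the most recent frames plus the current action."""
    frames = list(prompt_frames)
    for action in actions:
        context = np.stack(frames[-WINDOW:])               # (1) recent frames only
        next_frame = predict_next_frame(context, action)   # (2)+(3) read the action, predict
        frames.append(next_frame)                          # (4) slide the window, repeat
    return np.stack(frames)

# Usage: 3 real prompt frames, then 5 action-conditioned predictions.
prompts = np.zeros((3, 8, 8))   # tiny 8x8 "frames" purely for illustration
actions = np.ones((5, 8, 8))
video = rollout(prompts, actions)
print(video.shape)              # (8, 8, 8): 3 prompt frames + 5 generated
```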

The world before LIVE:

  • Video diffusion models got good at short bursts and whole-clip generation with great quality. But bidirectional models typically generate all frames together and aren’t interactive frame-by-frame, so they’re slow and fixed-length.
  • Autoregressive (AR) video models are naturally interactive and real-time, great for games and robotics. But they face exposure bias: trained on perfect past frames, tested on their own imperfect outputs.

šŸž Top Bread (Hook): You know how practicing with a spotless worksheet doesn’t prepare you for a messy real exam? 🄬 Filling: Exposure bias is the gap between training on perfect inputs and testing on your imperfect ones.

  • What: The model never learns to handle its own mistakes.
  • How: During training, it conditions on ground-truth frames; during inference, it only has its noisy guesses.
  • Why it matters: Errors snowball, and long videos degrade (the tiny numeric sketch below shows how the compounding plays out). 🍞 Bottom Bread (Anchor): It’s like a GPS tested only on straight roads; on real twisty roads, it gets lost fast.
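
To see the train–test mismatch in numbers, here is a toy sketch (not from the paper): a predictor with a fixed 2% per-step error barely hurts when every step is fed clean ground truth, but the same error compounds fast when the model feeds on its own outputs.

```python
def predict(prev):
    """Toy one-step predictor with a small systematic error, standing in
    for an imperfect learned model."""
    return prev * 1.02  # overshoots by 2% every step

TRUE_VALUE = 1.0
STEPS = 20

# Teacher-forced view: every step conditions on the clean ground truth,
# so the per-step error never gets the chance to compound.
tf_error = abs(predict(TRUE_VALUE) - TRUE_VALUE)

# Inference view: each step conditions on the model's own previous output,
# so the same 2% slip compounds step after step.
x = TRUE_VALUE
for _ in range(STEPS):
    x = predict(x)
ar_error = abs(x - TRUE_VALUE)

print(f"error after {STEPS} teacher-forced steps: {tf_error:.3f}")  # 0.020
print(f"error after {STEPS} self-rollout steps:  {ar_error:.3f}")   # ~0.49
```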

Failed attempts and their limits:

  • Teacher Forcing (TF): Always feed ground-truth frames during training. Simple, but creates a big train–test mismatch.
  • Diffusion Forcing (DF): Add noise to ground-truth frames to mimic imperfect inputs. Better for short runs, but noised truth still isn’t the same as genuine rolled-out frames with real errors.
  • Self-Forcing (SF): Generate your own rollouts and match them to a strong pretrained teacher’s distribution. It helps, but needs a heavy teacher model and doesn’t truly bound how errors grow over very long horizons.

šŸž Top Bread (Hook): Imagine learning to ride a bike only on perfect pavement (TF), then on slightly sandy pavement (DF), or by copying a pro rider’s style (SF). 🄬 Filling: Each helps, but none guarantees you can ride miles without wobbling out of control.

  • What’s missing: A rule that keeps wobble growth under control no matter how far you go.
  • Why the gap matters: In real use—games, robotics, camera control—you need stability for hundreds of frames, not just a dozen. 🍞 Bottom Bread (Anchor): If your drone camera drifts after a few seconds, you can’t film a smooth minute-long shot.

The gap LIVE fills:

  • A training objective that directly teaches recovery from your own rollouts—no teacher needed.
  • A way to keep errors bounded so long videos stay steady.
  • A unified view of TF/DF/LIVE with a simple curriculum that smoothly increases difficulty without collapsing training.

02 Core Idea

The ā€œAha!ā€ in one sentence: If the model must be able to roll forward and then reliably roll back to the starting point, it can’t let errors grow unbounded in the first place.

Explain with three analogies:

  1. Walk-and-retrace: You walk from home to the park (forward), then retrace your exact path back home (reverse). If you wander too far off the path on the way out, you won’t find home on the way back—so you naturally learn to keep close to the path.
  2. Draw-and-erase: You sketch a picture, then erase strokes in reverse order to return to a blank page. If your lines get too messy, clean erasing becomes impossible, so you learn to keep lines tidy.
  3. Jenga build-and-unbuild: You stack blocks up, then remove them in reverse. If you place blocks sloppily, you can’t cleanly unstack later—so you learn careful placement that stays stable.

šŸž Top Bread (Hook): You know how a boomerang should come back if you throw it right? 🄬 Filling (Cycle-Consistency Objective):

  • What it is: A rule that says, ā€œFrom your own generated future, you must be able to recover your original past.ā€
  • How it works: (1) Start with some real frames, (2) roll forward to create new frames, (3) reverse time and conditions, add a bit of randomness so it’s not trivial, and (4) train the model to reconstruct the original frames using diffusion loss.
  • Why it matters: If forward errors get too big, the model can’t reconstruct the start. Training punishes that, so the model learns to keep errors bounded (a minimal training-step sketch follows this list). 🍞 Bottom Bread (Anchor): Like tossing a boomerang: if it doesn’t come back, you adjust how you throw until it does.
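
Here is a minimal sketch of one such training step in Python, under heavy simplifications: `model_next` and `model_recover` are toy stand-ins for the real networks, and a plain MSE replaces the diffusion loss described later, so this illustrates the cycle, not the paper’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def model_next(context, action):
    """Toy stand-in for the learned next-frame predictor."""
    return 0.9 * context[-1] + 0.1 * action

def model_recover(noisy_targets, context):
    """Toy stand-in for the denoiser that reconstructs frames given a
    (reversed, noisy) rollout as context."""
    return 0.5 * noisy_targets + 0.5 * np.mean(context, axis=0)

def cycle_consistency_loss(real_prompts, actions, noise_scale=0.1):
    # (1) Start from a few real prompt frames.
    frames = list(real_prompts)
    # (2) Roll forward: each new frame conditions on the model's own outputs.
    for a in actions:
        frames.append(model_next(np.stack(frames), a))
    generated = frames[len(real_prompts):]
    # (3) Reverse time and add independent per-frame noise so recovery is not
    #     trivial (a full version would also reverse the actions/poses).
    reversed_ctx = np.stack(generated[::-1])
    reversed_ctx = reversed_ctx + noise_scale * rng.standard_normal(reversed_ctx.shape)
    # (4) Try to recover the original prompts from the reversed rollout and
    #     score the attempt (the paper uses a diffusion noise-prediction loss;
    #     plain MSE keeps this sketch short).
    targets = np.stack(real_prompts)
    noisy_targets = targets + noise_scale * rng.standard_normal(targets.shape)
    recovered = model_recover(noisy_targets, reversed_ctx)
    return np.mean((recovered - targets) ** 2)

prompts = [np.zeros((4, 4)) for _ in range(3)]
actions = [np.ones((4, 4)) for _ in range(5)]
print(f"cycle-consistency loss: {cycle_consistency_loss(prompts, actions):.4f}")
```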

Before vs. After:

  • Before: We tried to make models robust by giving them noisy inputs or by copying a big teacher. Errors still crept up with longer videos.
  • After (LIVE): The model treats forward and reverse as a cycle. If it can always get back, then it never drifts too far out. Long-horizon quality stays stable.

Why it works (intuition behind the math):

  • The recovery objective creates a ceiling on how much the model can deviate—too much deviation makes recovery impossible, so gradients push to reduce drift.
  • Reversing and adding random per-frame noise stop the model from ā€œcheatingā€ by copying a single clean context frame; it must truly learn to be robust to realistic, messy rollouts.
  • This turns long-horizon stability from a hope into a trained skill.

šŸž Top Bread (Hook): You know how training wheels come off only after you can balance? 🄬 Filling (Progressive Training Curriculum):

  • What it is: A step-by-step schedule that starts easy (mostly real frames as context) and gradually gets harder (more generated frames with real errors).
  • How it works: Begin with p = T (Teacher Forcing, all ground-truth context), then slowly lower p to allow more rollout frames in context, while always enforcing the forward-then-reverse recovery.
  • Why it matters: Jumping straight to ā€œall rolloutā€ can crash training; gradual exposure builds error tolerance safely (a tiny schedule sketch follows this list). 🍞 Bottom Bread (Anchor): Like moving from training wheels to two wheels: first balance, then pedal alone, then steer around corners.
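
A minimal sketch of what such a schedule could look like, with made-up numbers for the window size, the floor on p, and the decay rule; the paper’s actual schedule may differ.

```python
def prompt_schedule(step, total_steps, T=8, p_min=1):
    """Linearly lower the number of real prompt frames p from T (pure
    teacher forcing) toward p_min as training progresses. The window size,
    p_min, and linear decay are illustrative choices."""
    frac = min(step / max(total_steps, 1), 1.0)
    return max(p_min, min(T, round(T - frac * (T - p_min))))

# Early steps use mostly real context; later steps mostly rollout context.
for step in (0, 2500, 5000, 7500, 10000):
    print(step, prompt_schedule(step, total_steps=10000))
# prints p = 8, 6, 4, 3, 1: the context gradually shifts from ground-truth
# frames toward the model's own generated frames.
```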

Building blocks (in simple pieces):

  • Autoregressive backbone with causal attention and a sliding window for real-time interactivity.
  • Diffusion loss to supervise frame-level noise prediction efficiently in parallel.
  • Cycle-consistency that ties forward generation to reverse recovery.
  • Random per-frame noise in the reversed context to prevent trivial solutions and teach robust recovery.
  • A unified view with a single ā€œp dialā€ that morphs TF → DF → LIVE, enabling smooth pretraining and post-training.

03 Methodology

At a high level: Input (some real frames + actions/poses) → Forward rollout (generate future frames) → Reverse with noise (use generated frames as context, time-reversed) → Recover the original frames with diffusion loss → Update the model.

Step A: Forward rollout from ground-truth prompts 🍞 Hook: Imagine watching the first few seconds of a scene, then predicting what happens next. 🄬 Concept:

  • What: Use the first p real frames plus their future actions/poses to generate the remaining frames in the window.
  • How: With causal attention, the model looks only backward within a fixed window and predicts the next frames. During training, since we already know future actions/poses, we can generate the T āˆ’ p frames efficiently in parallel (the attention-mask sketch after this list shows what ā€œlooks only backwardā€ means).
  • Why it matters: This creates realistic, self-made context with genuine model errors—not just noised truth—so training sees what will happen at test time. 🍞 Anchor: Like forecasting the next 5 moves in chess after seeing the first 3.
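
To show mechanically what causal attention over a sliding window means, here is a minimal mask-building sketch; the frame count and window length are arbitrary choices, not the paper’s settings.

```python
import numpy as np

def causal_window_mask(num_frames, window):
    """mask[i, j] is True when frame i may attend to frame j: only past or
    current frames (causal), and only the most recent `window` of them."""
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for i in range(num_frames):
        start = max(0, i - window + 1)
        mask[i, start:i + 1] = True
    return mask

# 8 frames, window of 4: frame 6 can see frames 3-6 but nothing after it.
print(causal_window_mask(8, 4).astype(int))
```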

Step B: Reverse generation context and add noise 🍞 Hook: You walked to the park; now turn around to go home. 🄬 Concept:

  • What: Reverse the generated sequence and also reverse the corresponding actions/poses.
  • How: Flip the time order of the generated frames so the context reads (x_T, x_{T-1}, …, x_{p+1}), reverse the matching actions/poses, then inject random per-frame noise so the model can’t just copy a single clean frame; it must handle varied, messy inputs.
  • Why it matters: Without noise, the nearest reversed frame might be too clean, letting the model ā€œcheat.ā€ Noise forces real recovery skill, not shortcuts (a small bookkeeping sketch follows this list). 🍞 Anchor: It’s like turning off the ā€œsnap-to-gridā€ in a drawing app—you must really learn to draw straight lines yourself.
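
A minimal sketch of this reversal-and-noise bookkeeping on toy arrays; the noise scale and the per-frame random scaling are illustrative assumptions, not the paper’s choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_and_noise(generated_frames, generated_actions, max_noise=0.3):
    """Flip the generated segment (and its actions) in time, then add an
    independent, randomly scaled noise to every reversed frame so no single
    context frame stays clean enough to copy from."""
    rev_frames = generated_frames[::-1].copy()
    rev_actions = generated_actions[::-1].copy()
    # A different noise scale per frame: some frames end up messier than others.
    scales = rng.uniform(0.0, max_noise, size=(len(rev_frames), 1, 1))
    rev_frames = rev_frames + scales * rng.standard_normal(rev_frames.shape)
    return rev_frames, rev_actions

frames = np.arange(5)[:, None, None] * np.ones((5, 4, 4))    # stand-ins for frames 4..8
actions = np.arange(5)[:, None, None] * np.ones((5, 4, 4))
rev_f, rev_a = reverse_and_noise(frames, actions)
print(rev_a[:, 0, 0])   # [4. 3. 2. 1. 0.]: the time order is flipped
```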

Step C: Recover the original prompts via diffusion loss 🍞 Hook: Can you get back to the exact starting snapshot from those noisy, reversed frames? 🄬 Concept:

  • What: Train the model to reconstruct the original p prompt frames given the reversed, noised rollout as context.
  • How: Use frame-level diffusion loss (noise prediction) so all positions can be supervised efficiently in parallel. This ties recovery quality directly to parameter updates.
  • Why it matters: If forward rollouts drift too far, recovery fails, producing higher loss. Gradients then encourage keeping forward errors small and making recovery strong (a minimal noise-prediction sketch follows this list). 🍞 Anchor: Like practicing a piano piece forward and backward; if backward falls apart, you slow down and fix the forward fingering until both directions are clean.
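
Here is a minimal sketch of a frame-level noise-prediction loss of this kind, using the standard diffusion forward process; the timestep schedule and `toy_noise_predictor` are placeholders, not the paper’s networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy alpha-bar schedule over a handful of diffusion timesteps.
NUM_T = 10
alpha_bar = np.linspace(0.99, 0.05, NUM_T)

def toy_noise_predictor(noisy_frames, context, timesteps):
    """Placeholder for the network that predicts the added noise for every
    frame in parallel, conditioned on the (reversed, noisy) context."""
    return noisy_frames - np.mean(context)   # not a real denoiser

def diffusion_loss(clean_frames, context):
    n = len(clean_frames)
    t = rng.integers(0, NUM_T, size=n)                   # one timestep per frame
    a = alpha_bar[t][:, None, None]
    eps = rng.standard_normal(clean_frames.shape)        # the true added noise
    noisy = np.sqrt(a) * clean_frames + np.sqrt(1.0 - a) * eps
    eps_hat = toy_noise_predictor(noisy, context, t)     # all frames supervised at once
    return np.mean((eps_hat - eps) ** 2)                 # noise-prediction MSE

prompts = rng.standard_normal((3, 4, 4))     # the frames to recover
context = rng.standard_normal((5, 4, 4))     # reversed, noisy rollout
print(f"diffusion loss: {diffusion_loss(prompts, context):.3f}")
```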

Step D: The unified ā€œp dialā€ and curriculum 🍞 Hook: You know the difficulty slider in video games? 🄬 Concept:

  • What: p is the number of real frames used as prompts. p = T is Teacher Forcing (all real context), p = T with added noise is Diffusion Forcing, and p < T is LIVE (context includes generated frames with true errors).
  • How: Start with p = T to warm up; gradually decrease p to expose the model to more of its own rollout errors; keep enforcing cycle-consistency throughout.
  • Why it matters: A gentle slope prevents crashes. The model learns error tolerance step-by-step (a small context-builder sketch follows this list). 🍞 Anchor: First you practice with training wheels (p high), then ride mostly solo (p low), but always with a safe route home (cycle consistency).
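
A minimal sketch of the ā€œp dialā€ as a context builder, with a toy rollout standing in for the model’s own generations; the noise handling is deliberately simplified.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_rollout(prompt_frames, num_new):
    """Stand-in for the model's own forward rollout (with its real errors)."""
    last = prompt_frames[-1]
    return np.stack([last + 0.05 * (k + 1) * rng.standard_normal(last.shape)
                     for k in range(num_new)])

def build_context(real_frames, p, add_noise=False):
    """One dial, three regimes:
       p == T, no noise  -> Teacher Forcing (all ground-truth context)
       p == T, add_noise -> Diffusion Forcing (noised ground truth)
       p <  T            -> LIVE-style (context contains the model's own
                            generated frames, carrying genuine rollout errors)."""
    T = len(real_frames)
    context = real_frames[:p]
    if p < T:
        context = np.concatenate([context, toy_rollout(context, T - p)])
    if add_noise:
        context = context + 0.1 * rng.standard_normal(context.shape)
    return context

real = rng.standard_normal((8, 4, 4))                  # T = 8 ground-truth frames
tf_ctx   = build_context(real, p=8)                    # Teacher Forcing
df_ctx   = build_context(real, p=8, add_noise=True)    # Diffusion Forcing
live_ctx = build_context(real, p=3)                    # LIVE: 3 real + 5 generated
print(tf_ctx.shape, df_ctx.shape, live_ctx.shape)      # all (8, 4, 4)
```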

Concrete mini example:

  • Inputs: 8-frame window (T=8), prompts p=3, camera poses known for all 8 frames.
  • Forward: Generate frames 4–8 using frames 1–3 as context.
  • Reverse: Order becomes 8,7,6,5,4; actions/poses reversed; add random noise to each reversed frame.
  • Recover: Train to reconstruct frames 1–3 using diffusion loss while attending over the reversed, noisy context.
  • Update: Backpropagate loss; next batch, maybe set p=2 to make it slightly harder (the index bookkeeping for this example is spelled out below).
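
The same mini example as a few lines of index bookkeeping (frame numbers only, no model), just to make the forward, reversed, and recovery orderings explicit.

```python
T, p = 8, 3
frames = list(range(1, T + 1))          # frames 1..8

prompts = frames[:p]                    # [1, 2, 3]        real prompt frames
generated = frames[p:]                  # [4, 5, 6, 7, 8]  forward rollout
reversed_ctx = generated[::-1]          # [8, 7, 6, 5, 4]  time-reversed context
recover_targets = prompts               # train to reconstruct [1, 2, 3]

print(prompts, generated, reversed_ctx, recover_targets)
```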

Secret sauce:

  • The cycle makes stability an explicit, learnable skill.
  • Random per-frame noise in the reversed context blocks trivial solutions and teaches robustness to real rollout errors.
  • The p-curriculum unifies old methods and provides a safe ramp to long-horizon strength without a heavy teacher model.

04 Experiments & Results

The test: Can the model keep video quality steady as we ask for longer and longer rollouts in interactive settings?

  • Why this matters: In real life and games, you often need hundreds of frames while responding to actions (camera moves, controls).
  • Metrics: FID and LPIPS (perceptual quality), PSNR and SSIM (fidelity and structural consistency); a tiny PSNR sketch follows below.
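
As a reference point for the fidelity metrics, here is a minimal PSNR computation from its definition (FID and LPIPS need pretrained networks, so they are not sketched); the [0, 1] image range is an assumption.

```python
import numpy as np

def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value];
    higher means the generated frame is closer to the reference."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

ref = np.random.default_rng(0).random((64, 64, 3))
noisy = np.clip(ref + 0.05 * np.random.default_rng(1).standard_normal(ref.shape), 0, 1)
print(f"PSNR: {psnr(ref, noisy):.1f} dB")
```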

The competition: Methods that do not rely on a big interactive teacher model during training.

  • Baselines: CameraCtrl, DFoT (history-guided diffusion), GF (geometry forcing), and NFD with Teacher Forcing (TF) or Diffusion Forcing (DF). All share a similar backbone where noted, isolating training strategy.

Scoreboard with context:

  • RealEstate10K (real home tours, diverse camera motion): LIVE’s FID stays stable as rollouts grow from 32 → 64 → 128 → 200 frames, roughly like holding an A while others slide from B to C when the test gets longer. LIVE also shows stronger PSNR/SSIM and lower LPIPS at long horizons, meaning clearer and more consistent frames.
  • UE Engine Videos (realistic game engine scenes): LIVE consistently improves over TF/DF across 64–400+ frames. Think of it as fewer jitters and less drift in long fly-throughs.
  • Minecraft (interactive gameplay): LIVE outperforms TF/DF from short (32) to long (200) frames, showing it can handle action-conditioned rollouts typical in games.

Surprising findings:

  • Post-training a converged DF model with LIVE doesn’t just nudge metrics; improvements grow at longer horizons. That means LIVE especially shines when things usually fall apart.
  • FID across 128 vs. 200 frames converges to nearly the same value for LIVE—quality stops depending on how long you roll out, indicating bounded error accumulation in practice.
  • Ablations confirm each ingredient matters: remove cycle-consistency and long-horizon quality drops sharply; skip random timestep noise and recovery becomes too easy at first but fails later; keep p fixed and you miss the smooth ramp-up that stabilizes training.

Plain-English takeaway:

  • LIVE turns long-horizon stability from wishful thinking into a trained behavior. It’s like teaching a runner not just to sprint the first 50 meters, but to pace perfectly for the full race.

05 Discussion & Limitations

Limitations:

  • LIVE reduces but does not completely remove drift in the hardest scenes (e.g., chaotic motion, large lighting shifts, or rare events); it enforces a practical bound, not perfection.
  • It assumes access to action/pose signals (camera moves, controls) and a fixed sliding window; extremely long memory beyond the window may still require external memory systems.
  • Training uses substantial compute (e.g., 32 H100 GPUs in the paper) and benefits from a good DF checkpoint to start—lighter setups may need longer to converge.
  • Reverse recovery presumes you can meaningfully reverse conditions; settings without clear reversible controls might need adaptations.

Required resources:

  • A diffusion-transformer backbone with causal attention, a VAE for latent video space, and enough GPUs for multi-step diffusion training.
  • Datasets with aligned actions/poses for interactive modeling; otherwise, controls must be inferred.

When NOT to use LIVE:

  • Pure text-to-video without interactivity, where full-sequence bidirectional models may be simpler and excel at global coherence.
  • Ultra-creative, high-variance storytelling where exact recoverability is unimportant and diversity trumps stability.
  • Domains without usable control signals or where reversing conditions is ill-defined.

Open questions:

  • Can we obtain formal error bounds and adaptive schedules that tune p automatically per sequence difficulty?
  • How to combine LIVE with long-term memory modules for minute-scale rollouts beyond the sliding window?
  • Can geometry-aware or 3D-consistent signals further tighten the error bound (e.g., fusing with GF-like priors)?
  • What happens at massive scale (Sora/Genie-level data) and in multi-agent interactive worlds?
  • Can we extend LIVE to text-to-video interactivity or multimodal controls while keeping training lightweight?

06 Conclusion & Future Work

Three-sentence summary:

  • LIVE trains video models to go forward in time and then come back to where they started, using a cycle-consistency objective so errors can’t grow unchecked.
  • A simple dial (p) unifies old training tricks and powers a gentle curriculum that safely builds error tolerance without any teacher model.
  • As a result, LIVE keeps video quality steady across much longer rollouts in interactive settings like camera control and gameplay.

Main achievement:

  • LIVE turns long-horizon stability into a learnable, enforced behavior—achieving state-of-the-art long-run performance while removing the need for teacher distillation.

Future directions:

  • Scale LIVE to larger datasets and more varied controls, pair with memory or geometry priors, and explore adaptive curricula that tune difficulty on the fly.

Why remember this:

  • The core idea (ā€œif you must get back, you won’t drift farā€) is simple, powerful, and general. It’s a recipe for reliability in any sequence model that risks drifting over time.

Practical Applications

  • Game cameras that remain stable and consistent across long play sessions, even with complex player actions.
  • Virtual tours (real estate, museums) where camera paths stay smooth over hundreds of frames.
  • Robot vision rollouts for navigation or manipulation, keeping scene understanding stable as tasks get longer.
  • Drone and cinematography planning tools that preview long shots without visual drift.
  • Autonomous driving simulators that sustain coherent scenes over extended driving scenarios.
  • Sports strategy simulators that maintain visual and positional consistency across many plays.
  • Education labs where students interact with virtual worlds that respond reliably over long sessions.
  • Virtual production and previsualization that require steady long takes for scene blocking and timing.
  • Security and monitoring simulations that forecast future views consistently from past footage and actions.
Tags: cycle consistency Ā· autoregressive video diffusion Ā· exposure bias Ā· bounded error accumulation Ā· causal attention Ā· sliding window Ā· teacher forcing Ā· diffusion forcing Ā· self-forcing Ā· reverse generation Ā· curriculum learning Ā· RealEstate10K Ā· Minecraft Ā· UE Engine Ā· FID Ā· PSNR Ā· SSIM Ā· LPIPS