
LoL: Longer than Longer, Scaling Video Generation to Hour

Intermediate
Justin Cui, Jie Wu, Ming Li et al. Ā· 1/23/2026
arXiv Ā· PDF

Key Summary

  • This paper fixes a big problem in long video-making AIs where the video keeps snapping back to the beginning, like a movie stuck on rewind.
  • The problem comes from how the AI marks time using a clock-like system called RoPE that repeats its pattern after a while.
  • When many attention heads look the same way at the same time, they all stare at the starting frames (the sink), causing a sudden reset called sink-collapse.
  • The authors add tiny, head-by-head time shifts (multi-head RoPE jitter) so the heads stop moving in lockstep and don’t all fall back to the sink together.
  • This fix is training-free, lightweight, and keeps video quality and motion strong while stopping the resets.
  • With a causal VAE and streaming tricks, the system can keep generating video indefinitely, even up to 12 hours, in real time.
  • Compared to other fixes (like PI, NTK, YARN, and RIFLEx), this method avoids repetition without freezing the motion.
  • It runs on a 1.3B-parameter model with a rolling window of recent frames, making it practical and efficient.
  • The best jitter strength they found is about 0.8, and it works best when applied to all attention heads.
  • This approach opens the door to stable, endless streaming videos for storytelling, education, simulations, and more.

Why This Research Matters

Endless, stable AI video unlocks new kinds of live storytelling, education, and entertainment without constant human editing. Creators can stream scenes that evolve naturally for hours—think nature cams, sports-style drone flights, or explorable worlds—without jittery resets. Teachers can produce long, coherent visual lessons and science demos that carry on without looping or drifting. Simulations for training (aviation, medicine, robotics) can run continuously in real time, improving realism and practice time. Game studios and virtual world builders can prototype persistent environments that don’t break after a few minutes. Social platforms can host continuous AI-generated channels, reducing production cost while keeping viewers engaged.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): Imagine watching a super long cartoon made by a computer—an hour or more—without it glitching, looping, or jumping back to the first scene. Wouldn’t that feel like magic?

🄬 Filling (The Actual Concept)

  • What it is: This paper studies how to make AI create very long videos continuously, without getting confused and repeating itself.
  • How it works: The authors look at why long videos made by autoregressive models (which build one frame after another) sometimes ā€œsnap backā€ to the beginning. They discover a timing problem inside the model’s attention system and propose a simple fix that doesn’t need retraining.
  • Why it matters: Without a fix, long videos glitch into loops, ruining stories, live streams, or lessons.

šŸž Bottom Bread (Anchor): Think of a live nature documentary generated on the fly. If the view keeps jumping back to the first cliff scene every minute, you can’t enjoy the journey.

Now, let’s set the stage—the world before this paper:

  1. The World Before
  • You know how a flipbook shows motion when you flip the pages in order? Early video AIs were like very careful artists who could paint only short flipbooks very beautifully. Models like Sora, Hunyuan, Wan, and Veo made stunning short clips but were too expensive to stretch into long movies.
  • To go longer, researchers switched to autoregressive generation: the model makes the next frame by looking at the frames it just made. This saves compute and allows streaming, like making a flipbook one fresh page at a time.

šŸž Sandwich Concept: Generative Models šŸž Hook: You know how a baker starts with dough and bakes a cake from scratch? 🄬 Concept: A generative model is a computer program that creates new things—like images or videos—from noise and rules it learned.

  • How it works: (1) Learn patterns from lots of examples. (2) Start with random noise. (3) Slowly shape that noise into a video that matches a prompt. (4) Repeat for each next frame.
  • Why it matters: Without generative models, we couldn’t make fresh, never-seen-before videos. šŸž Anchor: Type ā€œa red kite flying over a beachā€ and the model paints those moving scenes from scratch.

šŸž Sandwich Concept: Autoregression šŸž Hook: Imagine writing a story one sentence at a time, always re-reading the last lines to decide what to write next. 🄬 Concept: Autoregression means creating the next frame using recent past frames.

  • How it works: (1) Generate a frame. (2) Store it as context. (3) Use that context to make the next frame. (4) Keep going.
  • Why it matters: Without it, long videos would be too slow or too costly, because you’d have to look at the entire future and past at once. šŸž Anchor: A drone-flying video made in real-time where each new frame depends on the just-produced frames.
  2. The Problem
  • As videos get longer, errors build up and the story can wobble. A known trick called an attention sink keeps a few very early frames in memory to stabilize the flow.

šŸž Sandwich Concept: Attention Sink šŸž Hook: You know how a stage manager keeps a spotlight on a key actor so the audience always knows where to look? 🄬 Concept: An attention sink keeps a few first frames always available so the model can anchor its attention and stay on track.

  • How it works: (1) Save the first few frames. (2) Never throw them away. (3) Let the model peek at them anytime. (4) Use them as a steady reference point.
  • Why it matters: Without it, the model can drift and forget the scene’s style or subjects. šŸž Anchor: A waterfall scene stays consistently ā€œwateryā€ because the first splashy frames remain visible to the model.

But something bad happens: sink-collapse. The video keeps snapping back to those saved frames, causing abrupt resets.

šŸž Sandwich Concept: Sink-Collapse šŸž Hook: Ever watched a video that keeps jumping back to the same moment like a stuck YouTube buffer? 🄬 Concept: Sink-collapse is when many frames suddenly look like the first saved frames again, causing a jarring reset.

  • How it works: (1) The model often checks the early frames. (2) When timing signals line up in an unlucky way, (3) multiple attention heads all over-focus on those early frames, (4) and the current scene gets overwritten.
  • Why it matters: Without fixing it, long videos loop or stutter, ruining coherence. šŸž Anchor: A wingsuit flight suddenly teleports back to the takeoff cliff at specific times.
  3. Failed Attempts
  • Position Interpolation (PI) stretches timing signals but often freezes motion—videos feel stuck.
  • NTK and YARN tweak frequencies differently—each helps a bit but either doesn’t stop the resets or dampens motion too much.
  • RIFLEx adjusts a single frequency dimension, which helps bidirectional models, but here the collapse comes from many dimensions and many heads acting together.
  4. The Gap
  • We needed a fix that: (a) stops synchronized attention heads from all staring at the sink at once, (b) keeps motion lively, and (c) requires no retraining.
  5. Real Stakes
  • Long storytelling, sports, classes, livestreams, simulations, and games all need smooth, never-resetting motion. If the video keeps snapping back, audiences lose trust and creators can’t deliver.

šŸž Sandwich Concept: Rotary Position Embedding (RoPE) šŸž Hook: Imagine a clock’s hands going around and around—after 12 hours, they look the same again. 🄬 Concept: RoPE is a way to mark positions (time/space) by rotating features like clock hands, so the model knows how far apart tokens are.

  • How it works: (1) Give each position a set of rotating ā€œangles.ā€ (2) Compare positions by how much their angles differ. (3) This encodes relative order. (4) But because it’s trigonometric, patterns repeat after a while.
  • Why it matters: Without position info, the model can’t tell frame 50 from frame 5. šŸž Anchor: Two frames far apart might accidentally look ā€œcloseā€ on the clock if their hands point almost the same way—leading to confusion later.
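
To make the clock picture concrete, here is a tiny NumPy sketch (an illustration, not code from the paper) of RoPE's rotary angles: each feature pair is a "hand" with its own period, and at certain offsets several hands nearly return to where they started, which is exactly the confusion described above.

```python
import numpy as np

def rope_angles(pos, dim=8, base=10000.0):
    """Rotary angle of each 'clock hand' at a position: pos * base**(-2*i/dim)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return pos * inv_freq

# Each hand comes back around after 2*pi / frequency positions.
inv_freq = 10000.0 ** (-np.arange(0, 8, 2) / 8)
print("hand periods (in positions):", np.round(2 * np.pi / inv_freq, 1))

# Two positions 63 steps apart: several hands have completed almost whole turns,
# so on those components the two distant positions look nearly identical (cos close to 1).
gap = rope_angles(163) - rope_angles(100)
print("cos(angle gap) per hand:", np.round(np.cos(gap), 3))
```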

šŸž Sandwich Concept: Multi-Head Attention šŸž Hook: Think of a soccer game watched by many cameras—each camera focuses on different players. 🄬 Concept: Multi-head attention gives the model many ā€œviewsā€ (heads) to look at different parts of the sequence at once.

  • How it works: (1) Split features into heads. (2) Each head computes attention. (3) Combine the results. (4) Get a richer understanding.
  • Why it matters: If all heads look the same way, you lose diversity and can miss important details. šŸž Anchor: If every camera locks onto the goalie at the same time, you miss the striker scoring.
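
For readers who like to see the "many cameras" idea in code, here is a toy self-attention with the heads split out; it uses no learned projections and is purely illustrative, not the paper's model.

```python
import numpy as np

def multi_head_self_attention(x, num_heads=4):
    """Toy multi-head self-attention (no learned projections) to show the head split."""
    T, D = x.shape
    d = D // num_heads
    heads = x.reshape(T, num_heads, d).transpose(1, 0, 2)   # (heads, T, d): one "camera" per head
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d)  # each head scores every frame pair
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)                # softmax per head
    out = weights @ heads                                     # each head mixes frames its own way
    return out.transpose(1, 0, 2).reshape(T, D), weights

x = np.random.randn(6, 16)                 # 6 toy "frames", 16 features
_, w = multi_head_self_attention(x)
print(w.shape)                             # (4, 6, 6): four different views of the same frames
```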

02Core Idea

  1. The Aha! Moment (one sentence): Tiny, per-head time shifts in RoPE (multi-head RoPE jitter) keep attention heads from moving in lockstep, so they don’t all snap back to the sink at the same time.

šŸž Sandwich Concept: Sink-Collapse (recap for clarity) šŸž Hook: Like a song that keeps jumping back to the chorus at the wrong moments. 🄬 Concept: Sink-collapse is a reset caused by repeating timing patterns making many heads stare at the saved first frames.

  • How it works: RoPE’s periodic angles re-align after long distances; several heads then strongly weight sink frames together, forcing a scene reset.
  • Why it matters: It ruins long-form coherence. šŸž Anchor: A flying scene suddenly shows the same cliff face again at frames 132 and 201, across different prompts.

šŸž Sandwich Concept: Multi-Head RoPE Jitter šŸž Hook: You know how a marching band avoids echoing footsteps by having each row start a tiny bit off-beat? 🄬 Concept: Multi-head RoPE jitter means giving each attention head a slightly different RoPE base, so their internal clocks don’t line up perfectly.

  • How it works:
    1. For each attention head, nudge the RoPE base (like 10,000 Ɨ (1 ± tiny amount)).
    2. This slightly shifts the phase of that head’s timing angles.
    3. Because each head is shifted differently, they’re unlikely to all re-align with the sink at once.
    4. The diversity of focus returns, and resets are suppressed.
  • Why it matters: Without jitter, heads synchronize, stare at the sink together, and cause abrupt resets. šŸž Anchor: If each drummer taps with a tiny timing offset, the stadium never hears a single booming echo; it hears rich, smooth rhythm.
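
A minimal sketch of the jitter idea, following the Īø Ɨ (1 + σ·ε_h) recipe described above; the function name, tensor shapes, and the uniform sampling of ε_h are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def per_head_rope_freqs(num_heads=12, dim_per_head=64, base=10000.0, sigma=0.8, seed=0):
    """Per-head RoPE frequencies with jittered bases: base_h = base * (1 + sigma * eps_h)."""
    rng = np.random.default_rng(seed)
    eps = rng.uniform(-1.0, 1.0, size=num_heads)     # one fixed offset per head (assumed uniform)
    bases = base * (1.0 + sigma * eps)               # each head now has its own "clock speed"
    exponents = np.arange(0, dim_per_head, 2) / dim_per_head
    return bases[:, None] ** (-exponents[None, :])   # (num_heads, dim_per_head // 2)

inv_freq = per_head_rope_freqs()
# The rotary angle for position t in head h is t * inv_freq[h]; because the heads' clocks
# differ slightly, their phase re-alignment moments rarely coincide.
print(inv_freq.shape)
```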
  2. Multiple Analogies
  • Orchestra analogy: Many violins tuning to the same pitch can create a harsh whine; slightly detuning each violin makes the sound fuller and avoids painful resonance.
  • Traffic analogy: If every car takes the same detour at once, you get a jam; spreading cars across small alternate routes keeps traffic flowing.
  • Classroom analogy: If all students copy the same first page of notes when confused, the lecture resets; encouraging students to check different resources prevents group backsliding.
  3. Before vs After
  • Before: Long videos hit hidden ā€œecho points.ā€ Motion either stutters back to the start (sink-collapse) or, with some fixes, becomes too slow and stiff.
  • After: The video keeps flowing. Motion stays lively, and those echo points no longer trigger a mass reset.
  4. Why It Works (intuition, no math)
  • RoPE is like many spinning hands on many clocks (one per feature dimension). Over a long time, some hands point the same way again—phase re-alignment.
  • If all attention heads share the exact same clocks, they re-align together and jointly choose the sink.
  • By giving each head slightly different clocks, their re-alignment moments rarely overlap, so there’s no all-at-once dogpile on the sink.
  5. Building Blocks
  • Detect the cause: Collapses appear at local peaks of ā€œphase concentrationā€ around the sink (many RoPE components line up together).
  • Understand the mechanism: Inter-head attention homogenization—many heads simultaneously assign high weight to sink frames.
  • The fix: Head-wise RoPE base jitter breaks synchrony, restoring diversity.
  • Keep it streaming: Use a local attention window and dynamic RoPE/noise sampling so the process can continue forever.
  • Decode efficiently: A causal VAE lets us decode with a sliding window, so memory stays low while the video grows long.

šŸž Sandwich Concept: Causal VAE šŸž Hook: Imagine telling a story where each new sentence depends only on what you’ve already said, not on future pages you haven’t written yet. 🄬 Concept: A causal VAE is a compressor–decompressor for videos that respects time order, so you can decode frames as you go.

  • How it works: (1) Compress frames into latent tokens. (2) Generate new latents causally. (3) Decode only the recent window you need. (4) Slide forward over time.
  • Why it matters: Without causality, you’d need the whole video in memory to decode, which breaks streaming. šŸž Anchor: Like watching a live game—you don’t download the entire match first; you just see the latest seconds smoothly.
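
Here is a small, hypothetical sketch of sliding-window decoding; `decode_fn` is a stand-in for a causal VAE decoder, and the window size is arbitrary. Because the window has a fixed size, memory stays flat no matter how long the stream runs.

```python
from collections import deque

def stream_decode(latent_stream, decode_fn, window=4):
    """Sliding-window decoding sketch. `decode_fn` stands in for a causal VAE decoder
    (hypothetical interface): it maps a short span of latents to pixel frames."""
    recent = deque(maxlen=window)             # bounded memory: only the latest latents are kept
    for latent in latent_stream:              # latents arrive one step at a time
        recent.append(latent)
        frames = decode_fn(list(recent))      # causal: decoding never needs future latents
        yield frames[-1]                      # emit only the newest frame, then slide forward
```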

šŸž Sandwich Concept: Noise Sampling šŸž Hook: Like adding a pinch of spice to keep a huge pot of soup tasting fresh all the way through dinner. 🄬 Concept: Noise sampling picks fresh random seeds over time so the model keeps generating varied, non-stale content.

  • How it works: (1) Start with noise. (2) Denoise to a frame. (3) For new frames, use new or rolled noise consistent with the window. (4) Avoid patterns that drift into loops.
  • Why it matters: Without it, long sequences can become predictable and flat, making resets or dullness more likely. šŸž Anchor: A 12-hour underwater scene keeps revealing new jellyfish schools instead of repeating the same swirl.

03Methodology

At a high level: Text prompt + initial noise → (A) Streaming attention with sink frames and per-head RoPE jitter → (B) Generate latent frames autoregressively → (C) Sliding-window causal VAE decoding → Output video (infinite length possible).

Step-by-step recipe

  1. Prepare the ingredients (inputs)
  • Prompt: e.g., ā€œA cinematic third-person shot of a wingsuit flyer racing down a mountain valley.ā€
  • Initial noise: random latent tensors to start diffusion.
  • A small number of early frames kept as anchors (sink frames) while we roll through time with a local attention window (e.g., 12 latents total, first 3 are sinks).

šŸž Sandwich Concept: Memory of Recent Frames (KV-like cache, simplified) šŸž Hook: Like keeping the last few photos on your phone screen so you can compare as you take the next shot. 🄬 Concept: The model holds a small rolling set of recent features, plus a few very first frames, to guide each new frame.

  • How it works: (1) Keep sinks + recent window. (2) Use them to compute attention for the next frame. (3) Slide forward, dropping the oldest (except sinks). (4) Repeat.
  • Why it matters: Without a rolling memory, the model can’t stay coherent or work in real time. šŸž Anchor: A drone keeps looking at the last seconds of footage and the very first reference clip to stay steady.
  2. Add the secret seasoning: Multi-head RoPE jitter
  • For each attention head h, slightly scale the RoPE base: Īø → Īø Ɨ (1 + σ Ā· ε_h), with ε_h drawn from [āˆ’1, 1].
  • Apply these per-head frequencies to rotate queries/keys before attention, just like normal RoPE—but with tiny, unique offsets per head.
  • Use a jitter intensity σ around 0.8 for strong mitigation without noticeable motion loss.
  • Jitter all heads for best results; partial jitter helps but is weaker.
  3. Compute attention with local window + sinks
  • Attention considers: (a) the just-generated frames (self-focus), (b) recent frames inside the window, (c) the sink frames.
  • With jitter, heads distribute focus more diversely; they do not all fixate on sinks simultaneously.
  4. Generate the next latent frames
  • Use diffusion steps (e.g., 4-step distilled model) to denoise the next latent conditioned on the window.
  • Keep moving forward frame by frame or in small chunks (e.g., 3 latent frames per step).
  5. Decode with a causal VAE using a sliding window
  • Only decode the most recent span needed for playback, not the entire history—this saves memory.
  • Because the VAE is causal in time, decoding frame t doesn’t require future frames.
  6. Stream forever (infinite generation)
  • Keep the local attention window fixed size; roll it forward.
  • Dynamically sample noise and apply RoPE with head-wise jitter every step.
  • Because attention is local and decoding is sliding, memory stays bounded while time grows.
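
Putting the recipe together, the sketch below shows the shape of the streaming loop under assumed interfaces (`denoise_chunk`, `decode_fn`, and the prompt embedding are placeholders); the window and sink sizes mirror the toy numbers above. It is a schematic of the flow, not the authors' code.

```python
import numpy as np

def generate_stream(prompt_emb, denoise_chunk, decode_fn, num_chunks,
                    window=12, num_sinks=3, chunk_latents=3):
    """Schematic streaming loop. `denoise_chunk` and `decode_fn` are placeholder callables;
    per-head RoPE jitter is assumed to live inside the attention used by `denoise_chunk`."""
    sinks, recent = [], []
    for _ in range(num_chunks):
        noise = np.random.randn(chunk_latents, 16, 32, 32)      # fresh noise each step (toy shape)
        context = sinks + recent                                # anchors + rolling recent memory
        new_latents = denoise_chunk(prompt_emb, context, noise) # e.g., a few distilled diffusion steps
        for lat in new_latents:
            if len(sinks) < num_sinks:
                sinks.append(lat)                               # the first few latents become sinks
            else:
                recent.append(lat)
                if len(recent) > window - num_sinks:
                    recent.pop(0)                               # slide: drop oldest, keep the sinks
        yield decode_fn(new_latents)                            # sliding-window causal VAE decode
```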

Why each step exists (what breaks without it)

  • Without sinks: more drift and style loss over minutes.
  • Without jitter: heads synchronize and trigger sink-collapse at certain indices (e.g., 132, 201).
  • Without local window: compute and memory explode over time.
  • Without causal VAE: decoding requires non-causal context, breaking real-time streaming.
  • Without fresh noise: motion can grow dull or fall into repetitive patterns.

Concrete mini-example (toy)

  • Window size = 12 latent frames; sinks = first 3 frames.
  • Normal RoPE: at frame 132, several heads’ phases re-align with sinks → attention spikes on sinks → scene teleports back.
  • With jitter: head A’s clock shifts slightly earlier, head B’s later, head C’s hardly at all. At frame 132, their peaks don’t coincide → combined attention stays balanced → no teleport.
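
The same toy story in numbers: using a simple alignment proxy (the mean cosine of the rotary angle gaps; an illustration only, not the paper's phase-concentration metric), identical heads all agree at a shared re-alignment offset, while jittered heads no longer do.

```python
import numpy as np

def alignment(delta_t, base, dim=8):
    """Proxy score: mean cosine of the rotary angle gap at offset delta_t.
    High values mean many 'clock hands' point back toward the sink at once."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return float(np.cos(delta_t * inv_freq).mean())

shared_offset = 63   # an offset where several hands of a base-10000, dim-8 clock nearly re-align
print([round(alignment(shared_offset, 10000.0), 2) for _ in range(4)])
# identical heads all agree (about 0.95 each): they would dogpile on the sink together

rng = np.random.default_rng(0)
jittered_bases = 10000.0 * (1 + 0.8 * rng.uniform(-1, 1, size=4))
print([round(alignment(shared_offset, b), 2) for b in jittered_bases])
# jittered heads score that same offset differently, so no single moment triggers a mass snap-back
```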

The Secret Sauce

  • It’s just a tiny change to RoPE per head—no retraining, almost zero overhead.
  • It tackles the real cause: synchronized, periodic phase re-alignment across heads.
  • It preserves motion and quality because it doesn’t globally slow or stretch time; it only de-synchronizes the heads’ clocks.

šŸž Sandwich Concept: Streaming (Putting it all together) šŸž Hook: Like a moving sidewalk that never ends—you just keep stepping forward while the scenery stays smooth. 🄬 Concept: Streaming generation makes each new part of the video from recent context while playing it out in real time.

  • How it works: (1) Keep a fixed window. (2) Add new frames at the end. (3) Drop the oldest (but keep sinks). (4) Decode and display as you go.
  • Why it matters: Without streaming, you can’t make super long videos live; you’d need to precompute everything. šŸž Anchor: A 12-hour wingsuit flight plays continuously, never pausing to ā€œload the rest.ā€

04Experiments & Results

The Test: What and why

  • They measured how often and how badly the video resets to the sink frames using a stricter version of the No-Repeat score (normalized L2 distance to sinks) and reported the worst drop (Sink-Collapse Max) and average drop (Sink-Collapse Avg).
  • They also checked motion liveliness (Dynamic Degree) and overall quality/alignment using VBench metrics.
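
As a rough illustration of the measurement (the exact No-Repeat and Sink-Collapse formulas are defined in the paper; this is only a hedged sketch), one can track a normalized L2 distance between each new frame and the sink frames and watch for sudden dips.

```python
import numpy as np

def distance_to_sinks(frame, sink_frames):
    """Normalized L2 distance from a generated frame to its nearest sink frame.
    A sudden dip over time would flag a reset toward the saved first frames."""
    dists = [np.linalg.norm(frame - s) / (np.linalg.norm(s) + 1e-8) for s in sink_frames]
    return min(dists)

# Toy usage: frames and sinks as flat arrays; real evaluation works on decoded video frames.
sinks = [np.random.randn(1024) for _ in range(3)]
print(round(distance_to_sinks(np.random.randn(1024), sinks), 3))
```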

The Competition: Who they compared against

  • Positional methods: PE (naive extrapolation), PI (interpolation), NTK-aware scaling, YARN, and RIFLEx (strong for bidirectional models).
  • State-of-the-art autoregressive models: LongLive and Self-Forcing++ (both use attention sinks and local windows), plus broader baselines like NOVA, Pyramid Flow, SkyReels-V2, MAGI-1, CausVid, and Self-Forcing.

The Scoreboard (with context)

  • In LongLive, naive PE collapsed hard: Sink-Collapse Max 73.06 and Avg 30.54—like flunking the ā€œno-loopā€ test.
  • LoL (this paper’s method) cut those to Max 16.67 and Avg 3.93—like turning a failing grade into a solid B+/A- on the toughest part, while keeping motion lively (Dynamic Degree 35.27, about as active as the original).
  • In Self-Forcing++, PE again did poorly (Max 68.07, Avg 34.11). LoL reduced them to Max 22.70 and Avg 6.12, a major improvement, while keeping motion strong (Dynamic Degree of about 81.20 as reported, a high score at this model scale).
  • PI and YARN reduced resets but at a cost: motion often looked stiff or slowed (think smooth but boring). NTK kept more motion but didn’t stop resets enough. RIFLEx helped bidirectional models but didn’t fix this multi-head, multi-dimension collapse in autoregressive streaming.
  • Bottom line: LoL reaches collapse scores comparable to PI (good anti-repeat) but preserves motion like PE/NTK (good dynamics)—the best of both worlds.

Surprising Findings

  • Collapses appeared at the same indices (e.g., 132 and 201) across different prompts and even across different training paradigms (LongLive and Self-Forcing++). That points to a structural timing issue, not prompt content.
  • It wasn’t a single frequency dimension causing the trouble; adjusting only one (as in RIFLEx) didn’t fix it. The problem is collective: many RoPE components and many heads lining up together.
  • Changing the global RoPE base (e.g., 6000 → 20000) only moved where collapse happened; it didn’t cure it. Jitter prevented the mass alignment rather than merely postponing it.

Practical Settings They Found

  • Jitter intensity σ ā‰ˆ 0.8 worked best: strong mitigation with minimal motion/quality loss.
  • Jitter all heads for the strongest robustness; partial jitter helps but less so.
  • With jitter + causal VAE + local attention, they streamed videos up to 12 hours with little quality decay, in real time on a single H100-class GPU.

Real-time, Infinite Feel

  • Because attention is local and decoding is sliding-window, memory stays bounded; because jitter prevents collapses, there’s no hidden ā€œstop line.ā€ Together, the system can, in principle, run indefinitely.

05Discussion & Limitations

Limitations (honest assessment)

  • Long-term memory: The model doesn’t remember far-back details over hours. If a character leaves for a long time and returns, exact identity consistency can slip.
  • Base capacity: They use a distilled ~1.3B model; ultimate visual richness is capped by this backbone.
  • Rare local maxima: RoPE’s periodic nature means local phase peaks still exist; jitter suppresses collapse strongly but doesn’t mathematically eliminate all chances under all settings.
  • Controls: Complex camera or action controls can be improved; integrating stronger control signals is future work.

Required Resources

  • A 1.3B-parameter diffusion transformer with local attention and a causal 3D VAE, running at ~16–20 FPS on a single high-end GPU (e.g., H100) for streaming.
  • Implementation of per-head RoPE jitter (tiny code change), local windowing with sinks, and sliding-window decoding.

When NOT to Use

  • If you need guaranteed long-term subject identity over many hours without any drift, this alone isn’t enough—you need explicit memory modules or tracking.
  • If your pipeline relies on global attention across the entire history (not local), the compute cost may dominate regardless of jitter.
  • If your goal is ultra-precise timing replication (like scientific time-series reconstruction), per-head jitter may conflict with that requirement.

Open Questions

  • Can we design non-periodic or adaptive positional embeddings that avoid phase re-alignment entirely?
  • Could training-time strategies (e.g., masking sinks near phase peaks) further harden the model?
  • How do sparse/linear attention and external memory modules interact with jitter to give hour-long identity consistency?
  • What are the best control signals (text+motion+trajectory) to steer long scenes without drift?
  • How does this extend to multi-camera or multi-agent settings where coherence must hold across views?

06Conclusion & Future Work

3-Sentence Summary

  • Long video generators often snap back to their first frames because RoPE’s repeating timing and synchronized attention heads make them all stare at the sink together (sink-collapse).
  • A tiny, training-free fix—multi-head RoPE jitter—gives each head a slightly different timing base, preventing mass re-alignment and preserving smooth, lively motion.
  • Combined with local attention and a causal VAE for sliding-window decoding, this enables real-time, hours-long (even effectively infinite) video generation with little quality decay.

Main Achievement

  • Turning a fragile, collapse-prone streaming system into a robust, indefinitely long generator by breaking inter-head synchronization at the positional-embedding level—without retraining.

Future Directions

  • Explore non-periodic or adaptive positional embeddings, train-time defenses (e.g., sink masking near phase peaks), stronger control signals, larger base models, and hybrid memory modules for hour-scale identity consistency.

Why Remember This

  • It’s a simple idea with an outsized impact: tiny per-head time shifts fix a deep structural failure mode. This unlocks long, coherent, real-time AI video—bringing endless storytelling, education, and simulation within practical reach.

Practical Applications

  • 24/7 AI nature channels that evolve smoothly without looping back to the start.
  • Long-form educational videos (e.g., history timelines, science labs) generated on the fly without resets.
  • Real-time cinematic flythroughs (cities, mountains, oceans) for tourism and virtual tours.
  • Live narrative streams where stories unfold for hours with consistent style and motion.
  • Simulation training (drone piloting, driving, surgery) with continuous, coherent scenarios.
  • Background ambience generators for events, retail, or wellness spaces that never visibly repeat.
  • Pre-visualization for film and game scenes that run long to explore pacing and camera moves.
  • Continuous promotional displays in stores or exhibitions that remain fresh over entire days.
  • Prototype persistent open worlds for games with stable day-long sessions.
  • Research testbeds for studying long-horizon planning and perception under endless video streams.
#sink-collapse #Rotary Position Embedding #RoPE jitter #multi-head attention homogenization #infinite video generation #streaming video diffusion #causal VAE #attention sink #phase re-alignment #KV cache #Self-Forcing++ #LongLive #position interpolation (PI) #NTK-aware RoPE #YARN #RIFLEx