
VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

Beginner
Yifei Yu, Xiaoshan Wu, Xinting Hu et al. Ā· 12/4/2025
arXiv Ā· PDF

Key Summary

  • VideoSSM is a new way to make long, stable, and lively videos by giving the model two kinds of memory: a short-term window and a long-term state-space memory.
  • The short-term window keeps recent details sharp (like faces and small motions) without losing any information.
  • The long-term state-space memory compresses everything that happened before, updates itself every step, and prevents the story from drifting or looping.
  • A smart router blends these two memories, adding more global memory gradually as the video gets longer, so scenes stay coherent but don’t freeze.
  • The model learns from a strong teacher on short clips, then practices on its own for long videos using a special correction loss (DMD) to fix mistakes.
  • Compared to other autoregressive video models, VideoSSM scores higher in consistency and stays more dynamic over a full minute of generation.
  • It supports interactive prompt changes mid-video by refreshing its local memory while keeping the global story consistent.
  • The method runs in linear time with video length, so it scales to long videos without exploding compute.
  • User studies prefer VideoSSM because it balances long-term stability with natural, non-repetitive motion.
  • This hybrid memory idea creates a blueprint for reliable, real-time long video generation in many applications.

Why This Research Matters

Long videos need both detailed short-term memory and a steady long-term storyline; VideoSSM delivers both, so characters remain themselves while actions keep evolving. This enables live storytelling, tutorials, and documentaries that flow smoothly for minutes without awkward repeats or identity flips. Robots and simulators can use it to create stable, changing worlds for training over time. Creators can adjust prompts mid-video to direct scenes like a live director, and the system adapts without breaking continuity. Because it scales linearly with video length, it’s practical for real-time applications instead of just offline demos. People prefer its results, meaning it’s not just a technical win—it feels right to viewers.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook): You know how when you film a school play, it’s easy to record a few seconds perfectly, but keeping the whole 60-minute show steady, clear, and on-topic is much harder? The camera can wobble, the actors move, and things can start to look weird over time.

🄬 Filling (The Actual Concept):

  • What it is: Long video generation is teaching an AI to make a whole movie, one frame after another, without forgetting who’s who or what’s happening.
  • How it works (before this paper): Many video AIs made short clips using powerful transformers and diffusion, looking at all frames together; great for quality, but too slow and heavy for streaming and for very long videos.
  • Why it matters: Without the right design for long memory, the AI’s story drifts, characters morph, and scenes repeat like a stuck record.

šŸž Bottom Bread (Anchor): Imagine asking an AI to create a one-minute video of a red ball rolling across a park. After 10 seconds, the ball might change shape, the park might shift, or the ball might start looping the same motion unless the AI remembers the whole story correctly.

—

Following the concept order that makes learning easiest, let’s introduce each key idea using the Sandwich pattern.

  1. State-Space Model (SSM) šŸž Hook: Imagine a tiny journal you carry every day. You don’t write every detail, just a summary that you keep updating so you remember the important parts. 🄬 Concept:
  • What it is: An SSM is a way to keep a compact, constantly updated summary of what has happened so far.
  • How it works:
    1. Start with an empty summary (the ā€œstateā€).
    2. Each new moment updates the state with useful, new information.
    3. Old information gently fades, but not all at once.
    4. When you need context, you read from the state instead of the whole history.
  • Why it matters: Without an SSM, remembering everything gets too slow or too big; with it, you keep only what you really need. šŸž Anchor: In a long soccer video, the SSM remembers which team is pushing forward and where the ball tends to go, without storing every pixel from every past frame.
  2. Key-Value (KV) Caching šŸž Hook: Think of sticky notes on your desk that save you from flipping through the whole textbook every time. 🄬 Concept:
  • What it is: KV caching stores recent facts (keys and values) so the model can quickly look them up.
  • How it works:
    1. When a frame is processed, it writes down helpful notes (K and V).
    2. New frames can read these notes fast instead of recomputing.
    3. To keep memory small, old notes get thrown out in a sliding window.
  • Why it matters: Without caching, the model would be too slow for streaming. šŸž Anchor: While generating second 12, the model reuses notes from seconds 10–12 to keep a character’s face consistent.
  3. Autoregressive Diffusion šŸž Hook: Imagine writing a story one sentence at a time, where each new sentence depends on the ones you already wrote. 🄬 Concept:
  • What it is: Autoregressive diffusion makes the next frames step by step, always using what it just created as context.
  • How it works:
    1. Add noise to make predicting harder (training time).
    2. Learn to remove noise and guess the clean frame.
    3. At generation, create a frame, feed it back in, then make the next one.
  • Why it matters: Without going step by step, you can’t do real-time or interactive video. šŸž Anchor: A dog runs; the model uses the last frame’s dog pose to predict the next smooth step.
  4. Causal Attention šŸž Hook: When telling a story, you can only use things that have already happened, not the future. 🄬 Concept:
  • What it is: Causal attention lets each new frame look only at past frames.
  • How it works:
    1. Mask off the future so the model can’t peek ahead.
    2. Attend to past tokens for context.
    3. Use a sliding window to keep it efficient.
  • Why it matters: Without causality, you can’t stream; with only a small window, you may forget long-ago facts. šŸž Anchor: At frame 30, the model can read frames 20–29, not frame 31.
  5. Local Memory (Sliding Window) šŸž Hook: Think of the last few steps you just took; you remember them clearly. 🄬 Concept:
  • What it is: Local memory keeps the most recent frames with perfect detail.
  • How it works:
    1. Store K and V for the last L frames.
    2. Attend within this window for crisp motion and appearance.
    3. Drop older frames from the window when new ones arrive.
  • Why it matters: Without a perfect short-term memory, tiny details (like blinking) get messy. šŸž Anchor: The sparkle on water stays sharp because recent frames are right in the window.
  6. Global Memory šŸž Hook: A diary doesn’t list every second, but it keeps the storyline straight. 🄬 Concept:
  • What it is: Global memory summarizes everything older than the local window.
  • How it works:
    1. Take tokens that fall out of the window.
    2. Compress them into a compact state (using SSM).
    3. Update that state each step and retrieve it when needed.
  • Why it matters: Without it, the story drifts or repeats because the model forgets earlier events. šŸž Anchor: The AI remembers the hero still wears a red scarf even after 40 seconds. (A toy code sketch of this local-plus-global memory idea appears right after this list.)
  7. Attention Sink (a previous fix) šŸž Hook: Pinning a reference photo on the wall helps you stay on style—but if you only look at that one photo, you might keep drawing the same pose. 🄬 Concept:
  • What it is: Attention sink keeps some very early tokens always visible as anchors.
  • How it works:
    1. Save the first frames as fixed reference tokens.
    2. Always attend to them plus the local window.
    3. Use them to stabilize identity and background.
  • Why it matters: It reduces drift, but can freeze motion and cause repetition. šŸž Anchor: A boy looks the same, but he keeps doing the same tiny steps over and over.
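
If you like seeing ideas as code, here is a tiny Python sketch of the two memories from the list above: a sliding window that keeps recent frames exactly, and a single compressed state that absorbs everything older. The class, the vector shapes, and the fixed gate values are our own simplifications for illustration, not the paper’s implementation.

```python
import numpy as np

class ToyHybridMemory:
    """Toy illustration: a lossless local window plus one compressed global state."""

    def __init__(self, window_size=3, dim=8):
        self.window_size = window_size     # L: how many recent frames stay lossless
        self.local = []                    # local memory: exact recent frame features
        self.state = np.zeros(dim)         # global memory: a single fixed-size summary

    def add_frame(self, feat, inject=0.5, decay=0.95):
        """Store the new frame; compress whatever falls out of the window."""
        self.local.append(feat)
        if len(self.local) > self.window_size:
            evicted = self.local.pop(0)
            # Gated update: fade the old summary a little, mix in part of the evicted frame.
            self.state = decay * self.state + inject * evicted

    def read(self):
        """A generator would attend over both memories when predicting the next frame."""
        return self.local, self.state


# Frames 1-3 stay in the window untouched; from frame 4 on, older frames are compressed.
mem = ToyHybridMemory(window_size=3, dim=8)
for t in range(1, 8):
    mem.add_frame(np.full(8, float(t)))
recent, summary = mem.read()
print(len(recent), summary[:3])   # 3 exact frames kept, plus one compact summary of frames 1-4
```

Notice that the window never grows past L and the summary never grows at all; that is where the linear-time scaling mentioned in the Key Summary comes from.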

The world before this paper: Great short clips, but long videos suffered from three big problems—error build-up, motion drifting, and content repetition. Failed attempts included relying only on small windows (forgetting the past) or on static sinks (over-stabilizing and looping). The missing piece was a dynamic, always-updating long-term memory. That’s the gap VideoSSM fills.

Real stakes: This matters for digital storytelling (consistent characters), live sports or news highlighting (continuous action), robotics and simulation (stable worlds over time), education (long lessons with coherent visuals), and interactive tools where you can change prompts mid-video without breaking the scene.

02 Core Idea

šŸž Top Bread (Hook): Imagine making a class movie. One friend remembers the last few lines perfectly (short-term), and another keeps a running summary of the whole plot (long-term). If they work together, the story stays sharp and consistent the whole time.

🄬 Filling (The Actual Concept):

  • What it is: The key idea is hybrid memory—combine a local, lossless sliding window with a global, compressed state-space memory that updates every step.
  • How it works (recipe):
    1. Local memory keeps recent frames crystal clear for small details and motion.
    2. As older frames leave the window, a global SSM compresses them into a compact state with learnable gates: inject new info, gently decay old info.
    3. When generating the next frame, the model reads both memories.
    4. A position-aware router gradually mixes in more global memory as the video goes on.
  • Why it matters: Without hybrid memory, you either forget the past (drift) or over-repeat the early scene (freeze). Hybrid memory keeps the story coherent and lively. šŸž Bottom Bread (Anchor): In a 60-second beach video, waves keep flowing naturally (not looping), the sky color stays consistent, and the surfer remains the same person throughout.

Aha! moment in one sentence: Treat long video generation like a living process with two cooperating memories—short-term for fine details and long-term for the storyline—updated at every step.

Three analogies:

  • Librarian and Notebook: The librarian (local memory) keeps the last few opened books on the desk; the notebook (global memory) writes a summary of all earlier books so you don’t forget the plot.
  • Backpack and Journal: The backpack carries today’s essentials (local), while the journal tracks your whole trip (global) so you don’t walk in circles.
  • Band and Conductor: The band plays the current bar (local), but the conductor remembers the symphony’s theme (global) so the music doesn’t wander.

Before vs After:

  • Before: Sliding windows alone forgot far-back facts; attention sinks remembered too hard and caused repetition.
  • After: A dynamic summary (SSM) remembers just enough of everything, while the window keeps details sharp; the router balances them over time, avoiding both drift and freeze.

Why it works (intuition, no equations):

  • Gated Updates: An injection gate decides how much new information to add; a decay gate gently fades old info so the memory doesn’t overflow.
  • Novelty-Only Update: The model tries to predict what’s already known; it stores only the truly new parts, keeping the memory compact and fresh.
  • Output Gating: When reading global memory, another gate controls how strongly it influences the current decision, preventing overpowering the crisp local details.
  • Position-Aware Fusion: Early on, the model relies more on local detail; as time passes, it trusts global memory more, because there’s more past to remember.
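
Here is a small numeric sketch of those gates in action. The matrix form, the fixed alpha and beta values, and the output-gate constant are simplified assumptions rather than the paper’s exact equations, but they show why repeating known information barely changes the memory while genuinely new information does.

```python
import numpy as np

def gated_delta_update(state, key, value, alpha, beta):
    """Novelty-only write into an associative memory (simplified gated delta rule).

    state : (d, d) memory matrix
    key   : (d,) what this moment "looks like"
    value : (d,) what this moment "contains"
    alpha : decay gate in [0, 1], how much old memory survives
    beta  : injection gate in [0, 1], how much of the surprise gets written
    """
    predicted = state @ key            # what the memory already expects for this key
    surprise = value - predicted       # store only the genuinely new part
    return alpha * state + beta * np.outer(surprise, key)

def gated_read(state, query, out_gate):
    """Read the global memory, scaled so it cannot drown out crisp local detail."""
    return out_gate * (state @ query)

# Repeating the same (key, value) pair barely changes the memory after the first write,
# while the decay gate keeps the stored value from growing without bound.
d = 4
state = np.zeros((d, d))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 2.0, 0.0, 0.0])
for _ in range(3):
    state = gated_delta_update(state, k, v, alpha=0.98, beta=0.5)
print(gated_read(state, k, out_gate=0.3))   # a scaled-down version of the stored value
```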

Building Blocks (each introduced with Sandwich):

  1. Hybrid State-Space Memory šŸž Hook: You know how a good teacher remembers both today’s lesson and the big picture for the semester? 🄬 Concept:
  • What it is: One module that holds both a short-term window and a long-term SSM state, used together.
  • How it works: Keep last-L frames losslessly; compress older frames into a state; retrieve both; fuse with a router.
  • Why it matters: Keeps videos coherent for minutes without repeating. šŸž Anchor: A cartoon hero keeps the same outfit and personality for a whole episode while still acting in new, exciting ways.
  2. Gated Delta (Novelty) Update šŸž Hook: If you already know 2+2=4, hearing it again doesn’t help; only new facts should update your memory. 🄬 Concept:
  • What it is: Update memory using only the part that wasn’t predictable from the past, and decay the rest gently.
  • How it works: Estimate what’s expected; subtract it; store the surprise; apply slow forgetting.
  • Why it matters: Prevents memory from bloating or getting stuck. šŸž Anchor: When a new character appears, the memory grows; when the same background continues, it doesn’t over-update.
  3. Memory Retrieval with Output Gating šŸž Hook: Sometimes you need a quick reminder; other times you need the full story. 🄬 Concept:
  • What it is: A way to read the global state and control how much of it to use right now.
  • How it works: Align the current question with the memory; normalize and gate the result.
  • Why it matters: Avoids drowning out sharp local details with too much history. šŸž Anchor: While a dancer spins, global memory keeps style consistent, but local memory makes the spin crisp.
  4. Position-Aware Router (Fusion) šŸž Hook: At the start of a movie, you only need a little backstory; near the end, the whole plot matters. 🄬 Concept:
  • What it is: A learned gate that increases the influence of global memory as more frames accumulate.
  • How it works: Compute a time-based weight; mix local and global accordingly.
  • Why it matters: Prevents early overpowering and late forgetting. šŸž Anchor: In a cooking video, early steps focus on chopping details; later, the router brings back the recipe’s big plan so the dish finishes right.
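
The router itself can be sketched in a few lines. The sigmoid schedule and its constants below are illustrative stand-ins; in the paper the gate is learned rather than hard-coded.

```python
import numpy as np

def position_aware_fuse(local_out, global_out, frame_idx, total_frames, w=6.0, b=3.0):
    """Blend local and global features with a weight that grows as the video gets longer.

    Early frames rely almost entirely on the crisp local window; later frames mix in
    more of the global summary. The sigmoid schedule and the w, b constants are
    illustrative stand-ins for the learned gate described above.
    """
    progress = frame_idx / max(total_frames, 1)          # ~0 at the start, ~1 near the end
    gate = 1.0 / (1.0 + np.exp(-(w * progress - b)))     # small early, larger later
    return local_out + gate * global_out

# The same global signal counts for little at frame 2 and much more at frame 55.
local = np.ones(4)
glob = np.full(4, 0.5)
print(position_aware_fuse(local, glob, frame_idx=2,  total_frames=60))
print(position_aware_fuse(local, glob, frame_idx=55, total_frames=60))
```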

03 Methodology

High-level pipeline: Prompt → Latent video tokens → AR DiT with hybrid memory → Frames out (streaming)

Step-by-step recipe (each key step includes what/why/example):

  1. Input and Tokenization šŸž Hook: Imagine turning a big picture into puzzle pieces so it’s easier to work with. 🄬 Concept:
  • What it is: Convert frames into compact latent tokens the model can process.
  • How it works: An encoder turns images into tokens; text prompts become condition tokens; everything flows into the AR DiT.
  • Why it matters: Working in latents is faster and lighter than pixels. šŸž Anchor: An 832Ɨ480 frame becomes a set of tokens the transformer can handle efficiently.
  2. Local Memory: Sliding-Window Self-Attention šŸž Hook: You remember the last few words you heard most clearly. 🄬 Concept:
  • What it is: Keep K/V for the most recent L frames (plus optional sink, if used) and attend within that window.
  • How it works: For the current frame t, compute Q/K/V; append K/V to cache; attend over [tāˆ’L+1…t].
  • Why it matters: Preserves fine details and smooth motion between neighboring frames. šŸž Anchor: With L=3, frame 7 attends to frames 5–7 to keep a character’s hand motion smooth.
  3. Gate Caching for Global Memory šŸž Hook: When writing a journal, you decide how much to add and how fast old notes fade. 🄬 Concept:
  • What it is: Learn two gates per time step—an injection gate (how much new info to add) and a decay gate (how much to forget).
  • How it works: From the hidden state just before a token leaves the window, compute β (inject) and α (decay); store them in a gates cache.
  • Why it matters: Without gates, the global memory would either overflow or forget too fast. šŸž Anchor: In a marathon video, β is higher when a new runner appears; α controls how the memory slowly forgets earlier scenery.
  4. Global Memory State Update (SSM + Gated Delta) šŸž Hook: Only write what’s new, and lightly fade the old—like tidy note-taking. 🄬 Concept:
  • What it is: Compress evicted tokens into a fixed-size state using novelty-only updates and controlled decay.
  • How it works:
    1. Average the evicted tokens and their gates.
    2. Predict what should be expected from the old state.
    3. Subtract to get the surprise part and add that to memory.
    4. Apply a gentle decay using the accumulated decay signal.
  • Why it matters: Keeps the memory small, stable, and up-to-date over minutes. šŸž Anchor: As a parade passes, the memory absorbs new floats without re-storing the same marching band details.
  5. Memory Retrieval with Output Gating šŸž Hook: When answering a quiz, sometimes you peek at your notes, but not always the whole notebook. 🄬 Concept:
  • What it is: Read from global memory and control how much to use now.
  • How it works: Align the current query with the memory; normalize; multiply by a learned gate; pass forward.
  • Why it matters: Prevents global info from washing out crisp local motion. šŸž Anchor: In a dance scene, retrieval preserves the dancer’s style (global) while local keeps footwork sharp.
  6. Position-Aware Fusion (Router) šŸž Hook: Early in a journey, you don’t need much map history; later, you rely on it. 🄬 Concept:
  • What it is: A time-aware gate that mixes local and global outputs.
  • How it works: Compute a gate from the relative position in the context; add local and gated-global to get the fused hidden state.
  • Why it matters: Keeps early frames agile and later frames consistent. šŸž Anchor: In a nature documentary, the first seconds focus on the bird’s features; later, the router keeps its species traits consistent across scenes. (Steps 2–6 are pulled together in the code sketch after this list.)
  7. Training, Stage 1: Causal Model Distillation (Self-Forcing style) šŸž Hook: A coach (teacher) shows you perfect short plays; you practice them until they’re second nature. 🄬 Concept:
  • What it is: Learn from a high-quality bidirectional teacher on short clips but in a causal way.
  • How it works:
    1. Use the teacher’s short-clip trajectories as targets.
    2. Train the student to predict them frame by frame.
    3. Let gradients flow through the hybrid memory so it learns to use both memories.
  • Why it matters: Gives the student strong short-term skills before tackling very long sequences. šŸž Anchor: The model nails 5-second clips with great fidelity, setting a solid foundation.
  8. Training, Stage 2: Long Video Distillation with DMD Loss šŸž Hook: After you can do drills, you scrimmage a full game and get feedback on tricky moments. 🄬 Concept:
  • What it is: Practice long, self-generated rollouts, then apply a special correction (DMD) on random short windows.
  • How it works:
    1. Generate long videos autoregressively to fill both local and global memories.
    2. Randomly pick a short window and compare the student’s distribution to the teacher’s using DMD.
    3. Update the model so it can recover from errors that appear in long runs.
  • Why it matters: Fixes drift and artifacts that only show up over long horizons. šŸž Anchor: In a 60-second city scene, the model learns to keep cars consistent and streets realistic even after many frames.
  9. Interactive Prompt Switching with Local Recache šŸž Hook: Mid-story, a director can shout, ā€œNew scene!ā€ and the crew adapts without forgetting who the hero is. 🄬 Concept:
  • What it is: When the user changes the prompt, refresh the local window while preserving the global storyline.
  • How it works: Rebuild local KV cache under the new prompt; keep the global state to maintain identity and setting.
  • Why it matters: Avoids leftover semantics from old prompts and enables smooth, responsive changes. šŸž Anchor: A boy walking on grass starts to run when you change the prompt; he remains the same boy, the lawn stays the same place, motion stays smooth.
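
To see how steps 2 through 6 fit together inside a single generation step, here is a heavily simplified, single-head stand-in in Python. The random projections, scalar gates, and the final ā€œdenoiseā€ line are toy placeholders (there is no real diffusion sampling here); what matters is the order of operations: attend locally, compress evicted frames into the state, read the state through a gate, and fuse with a position-aware weight.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, T = 8, 3, 12                        # feature dim, local window length, frames to roll out
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))

kv_window = []                            # local memory: exact (key, value) pairs for recent frames
state = np.zeros((D, D))                  # global memory: compressed, fixed-size SSM-style state

def attend_local(query, kv):
    """Tiny dot-product attention over the sliding window (step 2)."""
    keys = np.stack([k for k, _ in kv])
    vals = np.stack([v for _, v in kv])
    scores = keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ vals

x = rng.standard_normal(D)                # stand-in for the first frame's latent features
for t in range(1, T + 1):
    q, k, v = Wq @ x, Wk @ x, Wv @ x
    kv_window.append((k, v))

    if len(kv_window) > L:                # eviction: the oldest frame leaves the lossless window
        old_k, old_v = kv_window.pop(0)
        alpha, beta = 0.97, 0.5           # decay / injection gates (steps 3-4); constants here, learned in the paper
        surprise = old_v - state @ old_k  # novelty-only update: keep only what the state did not expect
        state = alpha * state + beta * np.outer(surprise, old_k)

    local_out = attend_local(q, kv_window)   # crisp short-term context
    global_out = 0.3 * (state @ q)           # gated read of the long-term summary (step 5)
    router = min(1.0, t / T)                 # position-aware weight grows over time (step 6)
    fused = local_out + router * global_out

    x = np.tanh(fused)                       # toy stand-in for "denoise the next frame from this context"
    # On a prompt switch (step 9), one would clear kv_window (local recache) but keep `state`.

print("frames in window:", len(kv_window), "| global state norm:", round(float(np.linalg.norm(state)), 3))
```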

Concrete mini-example (L=3): Frames 1–3 fill the window. At frame 4, frame 1 is evicted; its info is compressed into global memory. At frame 7, frames 1–4 have been compacted globally. The model predicts frame 7 using frames 5–7 (local) plus the learned global summary (1–4). If you switch the prompt at frame 30, the local window refreshes for the new instruction, while the global state keeps the scene’s identity consistent.
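
A few lines of Python reproduce the bookkeeping in this mini-example:

```python
L = 3
window, compressed = [], []
for frame in range(1, 8):
    window.append(frame)
    if len(window) > L:
        compressed.append(window.pop(0))   # evicted frames are folded into the global summary
    print(f"frame {frame}: local={window}, global covers {compressed or 'nothing yet'}")
# frame 7 prints: local=[5, 6, 7], global covers [1, 2, 3, 4]
```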

Secret sauce:

  • Dynamic, gated SSM updates store only the new bits and forget smoothly.
  • Local–global fusion changes over time so the model stays both lively and stable.
  • Train short with a strong teacher, then practice long with targeted DMD corrections—just like drills plus scrimmages.
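
For the curious, here is a rough sketch of what the Stage-2 correction could look like on one randomly chosen window of a long rollout. The linear noising and the teacher/critic callables are assumptions standing in for a full DMD setup; treat this as the shape of the idea, not the paper’s training code.

```python
import torch
import torch.nn.functional as F

def dmd_window_correction(window_frames, teacher_pred, critic_pred, noise_level=0.5):
    """Rough sketch of a DMD-style correction on one short window of a long rollout.

    window_frames : tensor of generated frames sliced from the rollout (requires grad)
    teacher_pred  : callable giving the teacher's denoised estimate of a noisy input (assumed)
    critic_pred   : callable giving a critic's denoised estimate, trained on the student's samples (assumed)
    The linear noising below is a simplification of a proper diffusion noise schedule.
    """
    noisy = window_frames + noise_level * torch.randn_like(window_frames)
    with torch.no_grad():
        # The difference of the two denoised estimates points away from the teacher's
        # distribution; stepping against it nudges the generated frames toward the teacher.
        grad = critic_pred(noisy) - teacher_pred(noisy)
    target = (window_frames - grad).detach()
    return 0.5 * F.mse_loss(window_frames, target)
```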

04 Experiments & Results

The test: The team evaluated both short 5-second clips and long 60-second videos on VBench, measuring subject and background consistency, motion smoothness, aesthetics, flicker, and a special ā€œDynamic Degreeā€ (how lively and non-repetitive the motion is). They also ran a user study to check real human preferences for long videos.

The competition: VideoSSM was compared to strong autoregressive baselines like Self Forcing, CausVid, LongLive, and SkyReels-V2, as well as other capable systems. Some rivals use bigger models or attention sinks to stabilize long-range behavior.

The scoreboard with context:

  • Short videos: VideoSSM reached a Total score around 83.95 and Quality around 84.88 among AR models—like scoring an A when many others are getting solid Bs. This shows the new memory doesn’t hurt short-term fidelity and actually helps.
  • Long 60-second videos: VideoSSM achieved the top Subject and Background Consistency among AR methods, which means characters and scenes stayed recognizable and stable over time (think: faces don’t morph; rooms don’t teleport). Crucially, Dynamic Degree was about 50.50—much higher than certain sink-based baselines—so the motion didn’t freeze or loop. This is like keeping a steady storyline but still letting exciting things happen.
  • Motion smoothness and low flicker were on par with or better than strong baselines, showing the hybrid memory doesn’t introduce jitter.
  • Aesthetics stayed high, so stability didn’t come at the cost of looking good.

Surprising findings:

  • Attention sinks do reduce drift, but they can over-stabilize, leading to repeated or nearly static motion. VideoSSM’s dynamic global memory avoided that trap, producing minute-long videos that remained both coherent and genuinely evolving.
  • In challenging scenes (like underwater swimming or busy food shots), VideoSSM kept identity stable without hallucinating duplicates or collapsing into stillness.
  • In a user study with 40 participants across 32 one-minute videos, VideoSSM received the highest share of first-place votes and the best average ranking, suggesting people notice and prefer the balance of stability and liveliness.

Takeaway: The hybrid memory design doesn’t just maintain consistency; it preserves the feeling that time is moving forward—essential for real storytelling and interactive use.

05 Discussion & Limitations

Limitations:

  • Very long horizons (e.g., many minutes or hours) still demand careful memory tuning; even gated forgetting can accumulate small biases over time.
  • The system is distilled from a strong teacher; if the teacher is weak in certain styles or domains, the student inherits those blind spots.
  • Abrupt, extreme prompt changes can still cause brief hiccups; interactive recache helps, but perfect transitions remain challenging.
  • Highly precise 3D camera control or exact geometry is not explicitly modeled; this approach targets perceptual realism rather than full 3D reconstruction.

Required resources:

  • A ~1.4B-parameter AR DiT with hybrid memory; GPU memory for the sliding window, gates cache, and SSM state.
  • Training involves teacher trajectories (short clips) and long self-rollouts with DMD correction; this needs significant compute and data streaming.
  • For real-time generation, efficient KV cache handling and memory retrieval must be well-engineered.

When not to use:

  • If you can afford full, offline, bidirectional generation for a short clip (and don’t need streaming), a traditional DiT may be simpler.
  • If your content intentionally loops (e.g., GIF-like patterns), attention sinks or simple windows might be enough.
  • For tasks needing exact 3D accuracy (SLAM-level precision), a world-model with explicit geometry could be a better fit.

Open questions:

  • Can the model learn its global memory structure without a teacher, purely from long videos?
  • How to adapt the router to different content types (dialog scenes vs. action scenes) automatically?
  • Can multi-modal signals (audio beats, script timing) guide memory gating to improve pacing and scene transitions?
  • How to detect and correct slow-burn errors (tiny drifts) during ultra-long runs without human intervention?

06 Conclusion & Future Work

Three-sentence summary: VideoSSM generates long, stable, and lively videos by combining a local, lossless window with a global, gated state-space memory that updates at every step. A position-aware router blends these memories so the model neither forgets the past nor gets stuck repeating it. Distillation on short clips plus long self-rollouts with DMD corrections makes the system robust for minute-scale, interactive generation.

Main achievement: Showing that a dynamic, compressed global memory—paired with a crisp local window—solves the drift-vs-freeze dilemma in autoregressive video, enabling consistent yet non-repetitive long-form generation with linear-time scaling.

Future directions: Add camera-aware or geometric priors when needed; fuse audio and other modalities to guide memory; explore teacher-free or self-supervised long-memory learning; and extend the method to long-form video editing and controllable story arcs.

Why remember this: Hybrid memory reframes long video generation as a living process of remembering just enough—keeping your short-term details sharp and your long-term story straight—so videos can keep going, stay consistent, and still surprise you.

Practical Applications

  • Live, interactive story generation where viewers can change the plot mid-scene without breaking continuity.
  • Educational videos that keep consistent diagrams, characters, and settings over entire lessons.
  • Sports highlight synthesis that preserves team identities and game flow over long segments.
  • Robotics simulation that maintains a coherent world for extended training episodes.
  • Marketing and product demos that evolve naturally while keeping brand elements stable.
  • Long-form nature or travel videos that avoid looping and keep motion realistic.
  • Game trailer or cutscene creation where character identity and scene layout must persist.
  • News explainer videos with smooth transitions between topics while retaining visual consistency.
  • Prototype long-form video editors that can insert, extend, or modify scenes without breaking continuity.
#autoregressive video diffusion Ā· #state-space model Ā· #hybrid memory Ā· #sliding-window attention Ā· #KV cache Ā· #long video generation Ā· #temporal consistency Ā· #gated delta update Ā· #distribution matching distillation (DMD) Ā· #prompt-adaptive generation Ā· #memory retrieval Ā· #position-aware routing Ā· #linear-time scalability Ā· #interactive video generation