MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
Key Summary
- MemFlow is a new way for AI to remember the right parts of a long video story while it keeps making new parts, so characters and scenes stay consistent.
- It uses Narrative Adaptive Memory (NAM) to look back and pull only the most relevant past frames based on the current text prompt.
- It also uses Sparse Memory Activation (SMA) to focus attention on just the most important memory pieces, so it stays fast.
- MemFlow plugs into existing streaming text-to-video systems that already use a KV cache, so it’s easy to adopt.
- Compared to strong baselines, it keeps better long-term consistency and prompt-following, even when new characters or scenes appear.
- It reaches 18.7 FPS on a single NVIDIA H100 with only a 7.9% slowdown versus no-memory models, showing strong efficiency.
- On 60-second multi-prompt tests, MemFlow leads on overall quality and maintains high consistency without drifting or duplicating subjects.
- Ablations show both NAM and SMA are needed: NAM boosts coherence while SMA keeps speed high with minimal quality loss.
- Too much memory can hurt: keeping a balanced memory size works best for smooth narratives.
- MemFlow proves that adaptive, prompt-aware memory beats fixed rules for long video storytelling.
Why This Research Matters
MemFlow helps AI make long videos that feel like real stories, not stitched-together fragments. This means creators can guide scenes with new prompts—adding characters, changing locations—and the video still remembers who is who and what just happened. With near real-time speed, it’s practical for live editing, previews, and interactive storytelling. It reduces common annoyances like duplicated people, color drift, and off-topic scenes after prompt switches. The approach can also inspire memory designs in other long-sequence tasks like podcasts, documentaries, or gameplay generation. Ultimately, it makes powerful video tools more reliable and usable for artists, educators, and everyday users.
Detailed Explanation
01 Background & Problem Definition
You know how when you tell a long story, you have to remember who the characters are, what they look like, and where they are so you don’t mix things up? Making long videos with AI has the same challenge: the AI has to keep track of details over time.
🍞 Top Bread (Hook): Imagine building a Lego movie one small scene at a time. If you forget what the hero looked like in the last scene, the next scene might be wrong. 🥬 The Concept: autoregressive video generation
- What it is: It’s a way to make videos piece by piece, where each new chunk depends on what was made before.
- How it works:
- Generate a short clip (a small set of frames).
- Use that clip as context to make the next clip.
- Repeat until the whole video is done.
- Why it matters: Without it, creating long videos would be too heavy for computers to handle all at once. 🍞 Bottom Bread (Anchor): Like writing a comic strip panel by panel, using the last panel to decide what happens next.
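To make the loop concrete, here is a minimal Python sketch of chunk-by-chunk generation. The `model.generate_chunk` call and its arguments are hypothetical placeholders, not MemFlow's actual API; the point is only the shape of the rollout.

```python
import torch

def generate_long_video(model, prompt, num_chunks, frames_per_chunk=16):
    """Minimal sketch: build a long video one chunk at a time, feeding each
    finished chunk back in as context for the next (autoregressive rollout)."""
    context = []  # the "story so far": previously generated chunks
    video = []
    for _ in range(num_chunks):
        # Each new chunk is conditioned on the prompt AND on what came before.
        chunk = model.generate_chunk(      # hypothetical method, not a real API
            prompt=prompt,
            context=context,
            num_frames=frames_per_chunk,
        )
        video.append(chunk)
        context.append(chunk)              # the new chunk becomes history
    return torch.cat(video, dim=0)         # (num_chunks * frames_per_chunk, C, H, W)
```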
🍞 Top Bread (Hook): You know how your eyes and brain focus on the important parts of a picture first, like a face in a crowd? 🥬 The Concept: attention mechanisms
- What it is: A tool that lets AI focus more on the most relevant parts of the input.
- How it works:
- Look at all tokens (pieces of information).
- Score how related each token is to the current goal.
- Give more weight to high-scoring tokens.
- Use the weighted mix to make the next prediction.
- Why it matters: Without attention, the AI treats everything as equally important and gets confused. 🍞 Bottom Bread (Anchor): When asked “Where is the puppy?”, attention helps the AI focus on tokens about “puppy,” not the background sky.
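For reference, here is standard scaled dot-product attention in PyTorch; this is the generic mechanism described above, not MemFlow-specific code.

```python
import math
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Standard scaled dot-product attention.
    q: (num_queries, d); k, v: (num_tokens, d)."""
    scores = q @ k.T / math.sqrt(q.shape[-1])  # how related each token is to each query
    weights = F.softmax(scores, dim=-1)        # high-scoring tokens get more weight
    return weights @ v                         # weighted mix used for the next prediction

q = torch.randn(4, 64)    # 4 query tokens (e.g., patches of the frame being generated)
k = torch.randn(100, 64)  # 100 context tokens (past frames, prompt, ...)
v = torch.randn(100, 64)
out = attention(q, k, v)  # (4, 64)
```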
🍞 Top Bread (Hook): If you try to carry all your school books at once, you get slow and tired. 🥬 The Concept: memory efficiency
- What it is: Using just enough memory in smart ways so the AI stays fast and accurate.
- How it works:
- Store only helpful past information.
- Compress or select what to keep.
- Reuse it quickly during generation.
- Why it matters: Without efficiency, long videos would be too slow or even crash the GPU. 🍞 Bottom Bread (Anchor): Like packing only the essentials in a backpack so you can walk quickly.
Before MemFlow, people tried fixed rules to remember the past. Some kept only the first chunk forever. Others compressed old frames the same way every time. Others hid memory inside small learned modules. These helped a bit but broke down when the story changed—like when a new character arrived or the scene switched from beach to forest. The AI didn’t know which past moments still mattered.
🍞 Top Bread (Hook): When you search a big book, you don’t read every page—you jump to the sections that match your question. 🥬 The Concept: semantic retrieval
- What it is: Finding past information that best matches the meaning of the current request.
- How it works:
- Turn the current text prompt into tokens (questions).
- Compare these tokens to tokens stored from past frames.
- Score which past frames are most semantically related.
- Pick the top ones to use now.
- Why it matters: Without it, the AI brings in random history and gets off-track. 🍞 Bottom Bread (Anchor): If the prompt says “the woman in a casual sweater,” the system grabs the frames where that exact woman appears.
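A minimal sketch of prompt-driven retrieval, assuming we already have embedding vectors for the prompt tokens and for each stored frame's visual tokens; plain cosine similarity stands in here for whatever scoring the real system uses.

```python
import torch
import torch.nn.functional as F

def retrieve_relevant_frames(prompt_tokens, frame_tokens, top_k=3):
    """prompt_tokens: (P, d) embeddings of the current prompt's tokens.
    frame_tokens: list of (N_i, d) visual-token embeddings, one tensor per past frame.
    Returns the indices of the top_k most semantically related frames."""
    scores = []
    for tokens in frame_tokens:
        # Cosine similarity between every prompt token and every visual token.
        sim = F.cosine_similarity(
            prompt_tokens.unsqueeze(1), tokens.unsqueeze(0), dim=-1
        )  # (P, N_i)
        # One score per frame: each prompt token's best match, averaged.
        scores.append(sim.max(dim=1).values.mean())
    scores = torch.stack(scores)
    return scores.topk(min(top_k, len(frame_tokens))).indices
```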
🍞 Top Bread (Hook): A chef doesn’t grab every ingredient in the pantry—just the ones needed for the dish. 🥬 The Concept: visual token retrieval
- What it is: Selecting only the visual features from past frames that help make the next frames.
- How it works:
- Split past frames into visual tokens.
- Score how helpful each token is for the new prompt.
- Keep the best ones and discard the rest.
- Why it matters: Without this, the model wastes time on unhelpful visuals and may copy mistakes. 🍞 Bottom Bread (Anchor): For “a red ball rolls,” it focuses on red-ball tokens, not tree-leaf tokens.
🍞 Top Bread (Hook): Think of a quick-reference notebook where you keep the answers you’ll likely need again. 🥬 The Concept: Key-Value (KV) cache
- What it is: A fast storage of attention keys and values from earlier steps so the model can reuse them.
- How it works:
- When generating, each layer creates keys and values.
- Save them for future attention lookups.
- When a new chunk needs context, read from this cache.
- Why it matters: Without a KV cache, the model would recompute everything and slow down a lot. 🍞 Bottom Bread (Anchor): Like sticky notes on your desk you can glance at instead of rereading a whole textbook.
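A toy sketch of a per-frame KV cache for a single attention layer; real implementations shard this per layer and per attention head, but the bookkeeping is the same idea.

```python
import torch

class FrameKVCache:
    """Toy per-frame key/value store for one attention layer."""

    def __init__(self):
        self.keys = []    # one (tokens_per_frame, d) tensor per stored frame
        self.values = []

    def append_frame(self, k, v):
        # Save the keys/values produced while generating this frame.
        self.keys.append(k)
        self.values.append(v)

    def gather(self, frame_ids):
        # Reuse stored context instead of recomputing it from pixels.
        k = torch.cat([self.keys[i] for i in frame_ids], dim=0)
        v = torch.cat([self.values[i] for i in frame_ids], dim=0)
        return k, v

cache = FrameKVCache()
cache.append_frame(torch.randn(256, 64), torch.randn(256, 64))  # frame 0
cache.append_frame(torch.randn(256, 64), torch.randn(256, 64))  # frame 1
k, v = cache.gather([0, 1])  # (512, 64) each, ready to drop into attention
```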
The gap this paper fills is clear: we need a memory that can adapt to the current prompt, pick the right past moments, and stay efficient. Why care? Because real videos are long and interactive. Users might say “Now the dog runs to the boy” and then “Switch to a snowy forest.” If the AI forgets who the boy is, or duplicates the dog, the story breaks. Better memory means smoother movies, clearer instruction-following, and faster tools for creators.
🍞 Top Bread (Hook): You can remember the whole plot of a movie because your brain summarizes what matters across time. 🥬 The Concept: long-context capabilities
- What it is: The ability of a model to use distant past information when creating the present.
- How it works:
- Store representative cues from earlier scenes.
- Retrieve the right ones when needed.
- Mix short-term (recent) and long-term (older) info.
- Why it matters: Without long context, characters drift and scenes feel random. 🍞 Bottom Bread (Anchor): The same child and the same dog look consistent from minute 1 to minute 5.
🍞 Top Bread (Hook): DJs adjust the music live based on the crowd’s vibe. 🥬 The Concept: streaming long-tuning strategy
- What it is: Training the model in rolling clips so it learns to manage memory while generating long videos.
- How it works:
- Generate a short clip.
- Use a strong teacher model to guide corrections on that clip.
- Roll forward, repeating this over many clips.
- Why it matters: Without training in a streaming way, the model won’t learn how to keep consistency over long runs. 🍞 Bottom Bread (Anchor): Like practicing a long song by mastering each section in order, not just random pieces.
🍞 Top Bread (Hook): If you keep adding more furniture to a room, it gets crowded and hard to move. 🥬 The Concept: computational overhead
- What it is: The extra compute time and memory a method adds.
- How it works:
- More context = more attention cost.
- Smarter selection = less cost for similar quality.
- Balance is key for real-time speed.
- Why it matters: Without controlling overhead, interactive video tools can’t run fast enough. 🍞 Bottom Bread (Anchor): MemFlow keeps 18.7 FPS and only slows ~7.9% versus no-memory, so it stays responsive.
In short, the world before MemFlow had solid short clips but struggled with long, changing stories. MemFlow’s adaptive memory changes that by choosing what to remember based on the current prompt and by focusing compute only where it counts.
02 Core Idea
The “Aha!” in one sentence: Use the current text prompt to fetch only the most relevant past frames into memory, then sparsely activate just the most helpful tokens so the model stays consistent and fast.
Three analogies to see it clearly:
- Detective file system: You ask a question about the case (the prompt), pull the right folders (relevant frames), and then skim only the highlighted lines (sparse tokens) to make your next move.
- Movie continuity editor: Before filming the next scene, you check past shots of the same character and location, then look only at the matching wardrobe and set notes, not the entire production history.
- Smart backpack: You pack for the next hike (prompt) by taking only the gear you’ll actually need, and you keep the items you’ll reach for most on top for easy access.
🍞 Top Bread (Hook): When you write a sequel chapter, you reread the parts of the last chapter that match your new plot twist. 🥬 The Concept: Narrative Adaptive Memory (NAM)
- What it is: A memory that adapts to the current prompt by retrieving the most semantically relevant past frames and updating with a compact summary of the latest chunk.
- How it works:
- Turn the current prompt into textual tokens.
- Compare these tokens to visual tokens stored in a memory bank built from prior chunks’ KV caches.
- Score and pick the top-matching frames (semantic retrieval).
- Add a prototype from the just-finished chunk (its first frame’s KV) to keep the newest context.
- Why it matters: Without NAM, the model may pull the wrong history, causing character duplication or scene mismatch. 🍞 Bottom Bread (Anchor): If the prompt says “the child hugs the dog,” NAM fetches frames where that exact child and dog appeared before and adds the newest chunk’s summary.
🍞 Top Bread (Hook): When solving a big puzzle, you don’t consider every piece at once—you try a few that fit the shape. 🥬 The Concept: Sparse Memory Activation (SMA)
- What it is: A way to speed up by letting each query attend only to the top-k most relevant memory frames.
- How it works:
- Build tiny summaries (mean-pooled descriptors) for the current chunk and each stored frame.
- Compute relevance scores by inner product between the current summary and each frame summary.
- Pick the top-k frames.
- Run attention only on those selected frames’ keys and values.
- Why it matters: Without SMA, attention over all memory slows things down and may include noisy, off-topic history. 🍞 Bottom Bread (Anchor): Like highlighting just a few key lines in a textbook before answering a question.
Before vs. After:
- Before: Fixed memory rules, like “keep the first chunk,” fail when stories change. Compression without selection keeps the wrong stuff. Efficiency hacks often cut important details.
- After: NAM brings in only what matches the current prompt; SMA keeps compute tight. Result: better consistency, better prompt-following, and near real-time speed.
Why it works (the intuition):
- Prompt-anchored retrieval aligns memory with the user’s current intention, so the right characters and scenes are reused.
- A prototype for the latest chunk keeps short-term continuity.
- Sparse activation prunes irrelevant or error-prone memory, limiting drift and reducing cost.
- Together, they balance short-term and long-term cues without overwhelming attention.
Building blocks (simple pieces that click together):
- KV cache: a fast store of past attention keys/values.
- Prompt tokens: the “question” guiding what to fetch.
- Semantic retrieval: scores that find matching past frames.
- Prototype update: a compact summary from the latest chunk.
- Sparse selection: top-k gating to make attention light and focused.
- AR-diffusion backbone: the engine that turns memory + prompt into the next video chunk.
🍞 Top Bread (Hook): You need a big memory to recall early scenes and a short memory to remember what just happened. 🥬 The Concept: content consistency
- What it is: Keeping subjects, backgrounds, and actions steady and believable across long videos.
- How it works:
- Reuse the right past visuals.
- Align them with the current prompt.
- Avoid mixing in unrelated or erroneous history.
- Why it matters: Without consistency, viewers see duplicates, color drifts, or sudden, odd changes. 🍞 Bottom Bread (Anchor): The same woman in the sweater stays the same person after multiple prompt switches.
🍞 Top Bread (Hook): Artists smooth paint strokes to make a scene look natural. 🥬 The Concept: diffusion models
- What it is: A way to generate images/video by gradually refining noisy data into clean frames.
- How it works:
- Start from noise.
- Use a trained model to denoise toward the target.
- Repeat for a few steps (distilled) or many steps (classic) to get a sharp video.
- Why it matters: Without diffusion, outputs often look less natural or stable. 🍞 Bottom Bread (Anchor): The video becomes smooth and detailed as noise is removed.
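A heavily simplified denoising loop, just to show the shape of the idea: the update rule here is a generic toy interpolation, not the distilled few-step sampler the paper actually uses, and `denoiser(x, t)` is a placeholder that returns the model's guess at the clean frames.

```python
import torch

@torch.no_grad()
def sample_chunk(denoiser, shape, num_steps=4):
    """Toy few-step sampler: start from noise and repeatedly ask the model
    for a cleaner estimate, blending toward it each step."""
    x = torch.randn(shape)              # pure noise
    for step in range(num_steps, 0, -1):
        t = step / num_steps            # rough noise level in (0, 1]
        x0_pred = denoiser(x, t)        # placeholder: model's guess at the clean frames
        x = x + (x0_pred - x) / step    # move partway toward the prediction
    return x                            # after the final step, x equals the last prediction
```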
🍞 Top Bread (Hook): Choose-your-own-adventure stories change based on your choices. 🥬 The Concept: interactive video generation
- What it is: Users give new prompts at any time, and the model updates the story in real-time.
- How it works:
- Read the new prompt.
- Retrieve matching history.
- Generate the next chunk that follows both the prompt and the past.
- Why it matters: Without interactivity, creators can’t steer or edit stories on the fly. 🍞 Bottom Bread (Anchor): “Now add a deer” leads to a scene where the same man meets the deer without losing who he is.
03 Methodology
At a high level: Text prompts + past video → [Narrative Adaptive Memory: retrieve + update] → [Sparse Memory Activation: select] → AR-diffusion generates next chunk → update memory → repeat.
Inputs and Outputs:
- Inputs: current text prompt, KV cache memory bank from earlier chunks, and a local window of recent frames.
- Output: the next video chunk (T frames) that fits the prompt and stays consistent with the story.
Step 0: The backbone and KV cache
- What happens: The model is an autoregressive diffusion transformer (AR-diffusion). As each chunk is generated, every transformer layer creates keys and values that get saved into a memory bank (KV cache) by frame.
- Why it exists: The KV cache saves compute and stores visual context the model can reuse.
- Example: After generating clip N showing “a boy and a red ball on the beach,” those keys/values are stored, tagged by frames.
Step 1: Narrative Adaptive Memory (NAM) – Retrieval
- What happens:
- Turn the current prompt (for clip N+1) into textual tokens.
- For each stored frame in the memory bank, compute how much the text tokens “attend to” the frame’s visual tokens—this is a semantic relevance score.
- Keep only the top-scoring historical frames (top-k by layer) as the most relevant context.
- Why it exists: Different prompts need different past cues; fixed memory rules can’t adapt.
- Example: Prompt says “the child hugs the dog.” The system retrieves frames where that same child and dog appear together, not random beach shots without them.
Step 2: NAM – Update with a prototype of the latest chunk
- What happens:
- Summarize the immediately previous chunk by taking the KV of its first frame as a compact prototype.
- Concatenate this prototype with the retrieved historical frames to form the updated memory bank for the next generation step.
- Why it exists: Keeps the freshest context while preventing memory from exploding in size.
- Example: If the last chunk showed the child turning toward the dog, the prototype captures that up-to-the-minute pose without storing all frames.
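A minimal sketch covering both Step 1 (keep only the retrieved frames) and Step 2 (append the prototype), assuming the memory bank stores one cached key/value tensor per frame for a given layer. `relevant_ids` would come from a prompt-based scoring step like the retrieval sketch earlier; all names here are illustrative rather than MemFlow's actual code.

```python
import torch

def nam_update_memory(memory_keys, memory_values, relevant_ids,
                      last_chunk_keys, last_chunk_values):
    """memory_keys / memory_values: lists of (N, d) cached tensors, one per historical frame.
    relevant_ids: indices of the top-matching frames for the current prompt (Step 1).
    last_chunk_keys / last_chunk_values: per-frame KV lists for the chunk just generated."""
    # Step 1: keep only the prompt-relevant historical frames.
    kept_keys = [memory_keys[i] for i in relevant_ids]
    kept_values = [memory_values[i] for i in relevant_ids]
    # Step 2: append a prototype of the latest chunk (its first frame's KV),
    # so the freshest context survives without storing every frame.
    kept_keys.append(last_chunk_keys[0])
    kept_values.append(last_chunk_values[0])
    return torch.cat(kept_keys, dim=0), torch.cat(kept_values, dim=0)
```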
Step 3: Sparse Memory Activation (SMA) – Selection for efficient attention
- What happens:
- Create tiny mean-pooled descriptors for the current chunk’s query and each frame in memory.
- Compute inner-product relevance scores between the current descriptor and each frame descriptor.
- Pick the top-k frames and run attention only on those frames’ keys/values.
- Why it exists: Attention cost grows with context size; pruning to the best frames keeps speed high and filters noisy history.
- Example: For “place the star on the tree,” it selects frames that include the same people and the tree, skipping irrelevant beach frames.
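A minimal sketch of this selection step, assuming per-frame mean-pooled descriptors and inner-product scoring as described above; the function and variable names are illustrative, and real code would also handle attention heads and layers.

```python
import math
import torch
import torch.nn.functional as F

def sma_attend(query_tokens, frame_keys, frame_values, top_k=3):
    """query_tokens: (Q, d) tokens of the chunk being generated.
    frame_keys / frame_values: lists of (N, d) cached tensors, one per memory frame."""
    # Tiny summaries: mean-pool the current queries and each frame's keys.
    q_desc = query_tokens.mean(dim=0)                              # (d,)
    frame_desc = torch.stack([k.mean(dim=0) for k in frame_keys])  # (num_frames, d)
    # Inner-product relevance between the chunk summary and each frame summary.
    scores = frame_desc @ q_desc                                   # (num_frames,)
    top = scores.topk(min(top_k, len(frame_keys))).indices
    # Attend only to the selected frames' keys and values.
    k = torch.cat([frame_keys[i] for i in top], dim=0)
    v = torch.cat([frame_values[i] for i in top], dim=0)
    weights = F.softmax(query_tokens @ k.T / math.sqrt(k.shape[-1]), dim=-1)
    return weights @ v
```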
Step 4: Generate the next chunk with AR-diffusion
- What happens:
- Use the prompt and the selected memory to condition the denoising steps.
- Produce T frames for the new chunk.
- Save their KV to the memory bank for future steps.
- Why it exists: This is where images turn from noise into a coherent video that matches both the prompt and past.
- Example: The model creates a scene smoothly showing the child hugging the same dog with matching colors and outfits.
Training: Streaming long-tuning with Self-Forcing and DMD
- What happens:
- The student model generates a short clip while rolling forward.
- A strong teacher guides it with Distribution Matching Distillation (DMD), aligning the student’s output distribution to the teacher’s.
- NAM and SMA are used during this training so the model learns to retrieve and select memory under real rollout conditions.
- Why it exists: Training in the same way you will run the model (streaming) teaches it to manage memory across long sequences.
- Example: Over a 60s training sequence with prompt switches every 10s, the model practices pulling correct past characters and pruning irrelevant frames.
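A very rough sketch of the streaming long-tuning loop. The loss shown is only a placeholder that pulls the student's denoising output toward the teacher's, standing in for the actual Distribution Matching Distillation objective, and the `generate_chunk` / `denoise` methods are hypothetical.

```python
import torch
import torch.nn.functional as F

def streaming_long_tune(student, teacher, optimizer, prompts):
    """Toy rollout: generate chunk after chunk with the student while a frozen
    teacher supplies a target for each chunk (placeholder for the DMD objective)."""
    memory = None
    for prompt in prompts:                  # prompts switch partway through the video
        # Student generates the next chunk using its adaptive memory (NAM + SMA inside).
        chunk, memory = student.generate_chunk(prompt, memory)  # hypothetical methods
        with torch.no_grad():
            teacher_pred = teacher.denoise(chunk.detach(), prompt)
        # Placeholder distillation loss: pull the student's output toward the teacher's.
        loss = F.mse_loss(student.denoise(chunk.detach(), prompt), teacher_pred)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Detach so gradients don't flow back through earlier chunks.
        memory = [m.detach() for m in memory]
```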
Concrete data walk-through:
- Suppose prompts switch every 10 seconds across 60s: beach child with balloon → dog runs to ball → child hugs dog → now indoors decorating a Christmas tree → star on top → keep decorating.
- NAM retrieval ensures each new chunk pulls the frames that matter (e.g., the same child/dog for early clips, then the same people/tree indoors later).
- The prototype update keeps the most recent chunk summarized so short-term motions are preserved.
- SMA makes attention look only at the best-matching frames, so inference remains fast.
The secret sauce:
- Prompt-anchored memory: Memory retrieval uses the current prompt to score past frames, tightly aligning memory with intent.
- Prototype compression: Using the first frame’s KV as a chunk prototype is a simple but powerful way to capture what just happened without heavy compute.
- Relevance-gated attention: SMA’s top-k frame selection keeps both quality and speed by avoiding attention on noisy or off-topic memory.
What breaks without each step:
- Without NAM retrieval: The model may attend to the wrong history (duplicate characters, off-scene backgrounds).
- Without prototype update: The model misses immediate context (awkward jumps between adjacent chunks).
- Without SMA: Computation balloons; speed drops; more irrelevant context slips in, increasing drift.
04 Experiments & Results
The test: Does MemFlow keep stories consistent and follow prompts across long, changing videos—while staying fast enough for interactive use?
Competitors: SkyReels-V2, Self Forcing, LongLive, and FramePack. For fairness, these were adapted where needed to handle multiple prompts by switching prompts mid-generation.
What was measured and why:
- Quality Score (VBench-Long): overall perceptual quality, like getting an A for how good it looks.
- Consistency Score: how well subjects and backgrounds stay the same when they should.
- Aesthetic Score: how pleasing the video looks.
- CLIP Score (per 10s segment): how well each segment matches its prompt text, especially important when prompts switch.
- Throughput (FPS): speed matters for real-time creation.
Scoreboard with context:
- On 60-second multi-prompt tests, MemFlow achieves the best overall quality among compared models. Think of it like scoring 85.0 when others cluster around 81–84.
- Consistency is top-tier (around 96.6), rivaling or beating methods that sometimes “cheat” consistency by reducing motion variety.
- Aesthetic scores are strong, showing less error build-up over time.
- CLIP scores per segment show strong prompt-following even after multiple switches, meaning it remembers the right people and places when the story changes.
- Speed: MemFlow runs at 18.7 FPS on a single NVIDIA H100. That’s only about a 7.9% slowdown versus a no-memory baseline but far faster than heavy architectures (e.g., more than 38× faster than SkyReels-V2 in the authors’ setting).
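As a quick sanity check on how those two numbers relate (assuming the 7.9% slowdown is measured against the no-memory model's throughput), the implied baseline speed is roughly 20.3 FPS:

```python
memflow_fps = 18.7
slowdown = 0.079                              # relative to the no-memory baseline
baseline_fps = memflow_fps / (1 - slowdown)
print(f"Implied no-memory baseline: {baseline_fps:.1f} FPS")  # about 20.3 FPS
```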
Ablations that make the numbers meaningful:
- Memory mechanism comparison (60s multi-prompt):
- w/o Memory: Attends only to recent frames; results in abrupt scene changes and lower CLIP after switches.
- Frame Sink (keep first chunk): Helps at the start but fails as the story evolves (duplicates, mismatches).
- NAM only: Best overall consistency and semantic alignment but slightly slower.
- NAM + SMA (full MemFlow): Nearly the same quality as NAM-only but with better speed (e.g., 18.7 vs 17.6 FPS), proving SMA’s efficiency win.
- Memory capacity study: Bigger isn’t always better. As memory size grows too large, the model can over-rely on distant context and lose short-term flow, making CLIP unstable. A balanced setting (e.g., b=3) achieved the best stability.
Single-prompt scenarios:
- Short (5s): MemFlow matches or surpasses state-of-the-art quality with the best semantic alignment among models of similar size/resolution, while running in real-time.
- Long (30s): The advantages grow. MemFlow showed higher total, quality, and semantic scores than SkyReels-V2, FramePack, Self Forcing, and LongLive, and still stayed efficient.
Surprising findings:
- Simple prototype works: Using just the first frame’s KV of the previous chunk as a prototype was enough to preserve immediate context.
- Selective memory helps quality: SMA didn’t just speed up inference; it also filtered noisy history, reducing error accumulation.
- Right-size memory wins: Keeping memory neither too small nor too big led to the best prompt alignment and stability.
What the results mean in everyday terms:
- MemFlow is like a good storyteller who remembers the right details and doesn’t get bogged down rereading the whole book, so the tale stays smooth and on-topic, even as the plot twists.
05 Discussion & Limitations
Limitations:
- Slight speed cost: MemFlow is a bit slower than a no-memory baseline (about 7.9%) due to retrieval and activation steps.
- Memory size sensitivity: If you make the memory too large, the model can lean too much on far-away history and lose short-term smoothness.
- Retrieval quality depends on prompts: Vague prompts can fetch less relevant history, weakening consistency.
- Base-model dependency: Performance inherits strengths/weaknesses from the underlying AR-diffusion backbone and tokenizer.
Required resources:
- A GPU with enough VRAM to store a modest memory bank of KV caches (the authors used an NVIDIA H100 for 18.7 FPS).
- A pre-trained AR-diffusion model with KV cache support and a teacher model for distillation during long-tuning.
- Streaming long-tuning data with multi-prompt scripts.
When NOT to use it:
- Ultra-tiny devices or strict real-time budgets where even a ~8% slowdown is unacceptable.
- Tasks needing exact replay of every frame’s detail (e.g., forensic reconstruction) where a single-frame prototype may be too lossy.
- Prompts that constantly and drastically change with no recurring elements; NAM’s benefits shrink when nothing ties across time.
Open questions:
- Better prototypes: Could a learned, tiny prototype beat the first-frame heuristic without adding much cost?
- Multi-signal retrieval: Can we combine text, audio, and motion cues to pick even better memory?
- Adaptive k: Could the model learn to change how many frames it selects based on scene complexity or uncertainty?
- Error-aware pruning: Can SMA downweight frames likely containing past mistakes?
- Generalization: How well does MemFlow scale to much longer durations (minutes) and higher resolutions while keeping speed and quality?
06 Conclusion & Future Work
In three sentences: MemFlow keeps long video stories consistent by using the current prompt to fetch the right past frames (NAM) and by attending only to the most relevant memory (SMA). It balances long-term coherence with speed, achieving high quality and strong prompt-following at near real-time FPS. This shows adaptive, prompt-aware memory is the key to stable, interactive long video generation.
Main achievement: Proving that narrative-adaptive retrieval plus sparse activation beats fixed memory strategies for maintaining coherence across prompt switches—without sacrificing efficiency.
Future directions: Learn smarter prototypes, fuse multiple signals (text, motion, audio) for retrieval, and adapt the number of selected frames dynamically. Explore longer videos and higher resolutions while keeping speed. Apply the idea to other sequence tasks like long-form audio or multimodal storytelling.
Why remember this: MemFlow turns memory from a blunt, fixed tool into a sharp, adaptive one. It remembers what matters, forgets what doesn’t, and stays fast—so your characters don’t clone, your scenes don’t drift, and your story feels like a movie, not a jumble.
Practical Applications
- Interactive filmmaking: Steer characters and scenes mid-generation while keeping continuity.
- Advertising storyboards: Update product shots or slogans on the fly without breaking visual identity.
- Education videos: Add new examples or settings while preserving consistent teachers/characters.
- Game cutscenes: Insert quests or character re-entries with stable looks and environments.
- News and documentary reels: Extend segments with new prompts while keeping subjects consistent.
- Social media content: Rapidly create long, themed videos with prompt switches for trends or holidays.
- Virtual production: Preview scene changes live on set and maintain continuity between takes.
- Creative writing to video: Turn evolving scripts into coherent multi-scene videos.
- Animation drafts: Keep character models consistent across many shots as the script evolves.
- Customer support demos: Insert new product steps mid-video while retaining branding and UI consistency.