
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

Beginner
Zhaochong An, Menglin Jia, Haonan Qiu et al. · 12/8/2025
arXiv · PDF

Key Summary

  • OneStory is a new way to make long videos from many shots that stay consistent with the story, characters, and places across time.
  • It treats multi-shot video creation like writing the next paragraph in a book: generate the next shot using all previous shots and the current caption.
  • A Frame Selection module picks the most helpful frames from all earlier shots, so the model remembers what matters and ignores distractors.
  • An Adaptive Conditioner then shrinks those chosen frames into a small, smart “memory” that the generator can use directly without getting overwhelmed.
  • The team built a 60K-video dataset with referential, shot-by-shot captions (like “the same man…”) to teach the model how stories naturally flow.
  • Compared to fixed-window attention and keyframe-only methods, OneStory keeps characters and environments more consistent and follows complex prompts better.
  • Ablations show both parts (Frame Selection and Adaptive Conditioner) are necessary, and even a tiny memory budget helps a lot.
  • It works for both text-to-multi-shot and image-to-multi-shot settings, and even generalizes beyond its training domain.
  • This approach is efficient, scalable, and practical for creators who need minute-long, coherent video stories.

Why This Research Matters

Videos in real life are made of many shots that must fit together smoothly, or the story breaks. OneStory helps AI keep track of who’s who and where things are, even after scene changes, so longer videos stay coherent. This lets creators make minute-long narratives that feel professional without heavy manual editing. Teachers can build multi-scene explainers where characters and objects remain consistent across cuts. Marketers can maintain brand identity when switching angles, locations, or products. Filmmakers and game designers can quickly previsualize scenes with reliable character and environment continuity. Overall, it moves AI video closer to how people actually tell stories.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how a movie has lots of different shots—close-ups, wide views, and new places—but it still feels like one story? Viewers expect the same people and places to make sense across those shots.

🥬 The Concept (Multi-Shot Video Generation, MSV): MSV is about making videos with many separate shots that together tell a single, clear story.

  • How it works (before this paper): Most AI video models made just one continuous clip, not a whole sequence of shots.
  • Why it matters: Without MSV, AI can make a pretty clip, but not a full scene-by-scene story; characters and places go off-model or change accidentally. 🍞 Anchor: Think of making a school play from several scenes; the same actors must look the same even when the lights, costumes, or locations change.

🍞 Hook: Imagine you have a photo and want to bring it to life like a short movie. 🥬 The Concept (Image-to-Video, I2V): I2V turns a single picture into a short video, guided by text.

  • How it works: Start with a photo, add instructions like “walk forward and wave,” and the model animates the picture.
  • Why it matters: I2V gives strong, stable visuals (faces, clothes, scenery) to build on. 🍞 Anchor: It’s like flipping a drawing into a flipbook where the drawing moves.

🍞 Hook: You know how a day goes morning → afternoon → night, and what you do depends on what already happened? 🥬 The Concept (Temporal Conditioning): Temporal conditioning means using what happened earlier to decide what should happen next.

  • How it works: Read the past frames or shots, keep track of important parts, and guide the next frames.
  • Why it matters: Without it, the story resets each time and forgets what already happened. 🍞 Anchor: When you play a board game, your next move depends on the last move; otherwise it’s chaos.

The world before OneStory:

  • AI video got great at single-shot beauty (sharp images, smooth motion).
  • But real stories need multiple shots and long memory: characters disappear and reappear, locations shift, cameras zoom and cut.
  • Existing multi-shot methods struggled to remember the right things across shots.

The specific problem:

  • Long-range context is hard: the model must remember what stays the same (identity, scene) and what changes (angle, action).
  • Two big prior strategies had flaws:
    1. Fixed-window attention: look at only a few recent shots. Problem: the window slides and old-but-important shots get forgotten.
    2. Keyframe conditioning: make one image per shot, then expand to video. Problem: a single image can’t carry complex, evolving story cues.

Failed attempts (why they fell short):

  • Fixed windows = memory loss as the story grows longer.
  • Single keyframes = thin context; can’t carry subtle relationships or reappearances.
  • Edit-and-extend tricks (copy last frame and expand) = drift and confusion when the scene truly changes.

The gap:

  • We need memory that is both global (can look across all past shots) and compact (small enough to be efficient), and smart enough to pick only what matters.
  • We also need training data that reflects how real stories are told: shot-level captions that refer to what came before (like “the same man…”), not just a fixed global script.

🍞 Hook: Imagine writing a class story where each new paragraph should connect to earlier parts, even if you skipped a few pages. 🥬 The Concept (Referential Captions): Referential captions are shot-by-shot descriptions that refer back to earlier shots (“the same man,” “the same park”), anchoring the story.

  • How it works: First caption each shot; then rewrite later captions to reference what came before.
  • Why it matters: Without these references, the model can’t easily know which past details to keep the same. 🍞 Anchor: It’s like using pronouns correctly in English—“she” and “that place” tell you who and where you’re talking about without repeating everything.
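To make the rewrite step concrete, here is a tiny, invented example (the captions are made up for illustration and are not taken from the paper's dataset):

```python
# Hypothetical example of the referential rewrite described above.
# Per-shot captions are first written independently, then later ones are
# rewritten to point back at earlier shots ("the same man", "the same park").
raw_captions = [
    "A man in a blue jacket walks through a park.",
    "A man in a blue jacket sits on a bench by a lake, next to a potted plant.",
    "Close-up of a potted plant on a bench.",
]

referential_captions = [
    "A man in a blue jacket walks through a park.",
    "The same man sits on a bench by a lake in the same park, next to a potted plant.",
    "Close-up of the same potted plant on the bench from the previous shot.",
]
```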

Real stakes (why you should care):

  • Creators want minute-long videos that don’t break character.
  • Teachers want scene-by-scene explainers that stay consistent.
  • Marketers need brand and character continuity across multiple cuts.
  • Everyday users want highlight reels that feel like one story, not random clips.
  • Game and film previsualization needs reliable multi-shot coherence to plan scenes quickly.

02Core Idea

🍞 Hook: Imagine telling a bedtime story one page at a time. You look back at earlier pages to remember who’s who, then write the next page that fits. 🥬 The Concept (Next-shot Generation): OneStory treats multi-shot video as “write the next shot” using all previous shots and the current caption.

  • How it works: Build a memory from earlier shots, pick the most relevant frames, shrink them smartly, and feed them to the generator to produce the next shot; repeat.
  • Why it matters: Without this next-step focus, the model either forgets older-but-important shots or uses too little context. 🍞 Anchor: It’s like adding a new comic panel after rereading the last few panels to keep the plot and characters straight.

The “Aha!” in one sentence: Keep a smart, tiny backpack of only the most relevant memories from all past shots, and use it to write the very next shot—over and over.

Three analogies:

  1. Librarian with bookmarks: Instead of rereading the whole book, you bookmark only the pages that explain today’s chapter, then summarize them compactly.
  2. Travel scrapbook: From hundreds of photos, you pick the few that best remind you of who you were with and where you were, then carry pocket-sized prints to plan the next stop.
  3. Cooking with a pantry: You don’t dump the whole pantry into the pot; you select key ingredients and chop them to the right sizes so the dish tastes right without waste.

Before vs. After:

  • Before: Models either looked back only a short distance or used a single keyframe; both missed long, complex threads.
  • After: OneStory scans all past shots, picks the most relevant frames (even if far back), compresses them just enough, and conditions generation directly.

Why it works (intuition):

  • Relevance beats recency: A frame from Shot 1 may matter more to Shot 6 than anything from Shot 5.
  • Compact beats bulky: If the memory is too big, it slows or confuses the model; if too small, it forgets. Adaptive shrinking finds the sweet spot.
  • Autoregressive focus: Writing one next shot at a time with fresh memory lets the model correct course and stay on plot.

Building blocks (with sandwich explanations):

🍞 Hook: Think of choosing the best photos for a class slideshow from a giant folder. 🥬 The Concept (Frame Selection Module): The model scores past frames and picks only the most relevant ones for the next shot.

  • How it works: It uses the current caption to first understand what’s needed, then queries a bank of all past frames and scores them; top-scoring frames are selected.
  • Why it matters: Without picking the right frames, the model either carries useless baggage or misses key identity/location cues. 🍞 Anchor: If the next shot returns to the main hero, the selector pulls earlier frames where that hero is clear, not random park shots.
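A minimal PyTorch sketch of this idea follows. It is only an illustration of caption-primed selection, not the paper's exact architecture; the module name `FrameSelector`, the two attention layers, and the way attention weights become scores are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class FrameSelector(nn.Module):
    """Sketch: score historical frames against the current caption, keep the top-K."""

    def __init__(self, dim=512, num_queries=8, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # learnable query tokens
        self.read_caption = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.read_memory = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, caption_tokens, memory_frames, top_k=4):
        # caption_tokens: (B, L, dim) text features of the current shot's caption
        # memory_frames:  (B, N, dim) one feature per frame from all previous shots
        B = caption_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # 1) prime the queries with the caption so they know what to look for
        q, _ = self.read_caption(q, caption_tokens, caption_tokens)
        # 2) attend over the whole memory bank and read off per-frame attention mass
        _, attn = self.read_memory(q, memory_frames, memory_frames)
        scores = attn.mean(dim=1)                                   # (B, N) relevance per frame
        # 3) keep only the top-K most relevant historical frames
        _, top_idx = scores.topk(top_k, dim=-1)
        idx = top_idx.unsqueeze(-1).expand(-1, -1, memory_frames.size(-1))
        selected = torch.gather(memory_frames, 1, idx)
        return selected, top_idx, scores
```

For instance, calling `FrameSelector()(torch.randn(1, 12, 512), torch.randn(1, 300, 512))` would return the 4 best-matching of 300 historical frames, wherever in the story they occurred.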

🍞 Hook: Imagine packing for a trip—space is limited, so you fold clothes neatly and only bring what you need. 🥬 The Concept (Adaptive Memory): A smart summary of the chosen frames that keeps what’s important and drops the rest.

  • How it works: Assign more detail to the most important frames and less detail to others; pack them into a compact set of tokens.
  • Why it matters: Without adaptive memory, the model either slows down with too much data or loses context by overshrinking everything equally. 🍞 Anchor: You roll your favorite hoodie tightly (high detail) and bring fewer socks (lower detail) so your bag closes.

🍞 Hook: When you chop veggies, you dice some small and leave others bigger, depending on the recipe. 🥬 The Concept (Adaptive Conditioner): It “chops” the selected frames into tokens of different sizes based on importance and feeds them straight into the generator.

  • How it works: Important frames get fine-grained tokens; less important ones get coarser tokens; then all tokens are concatenated with the generator’s current noisy tokens.
  • Why it matters: Without right-sized tokens, the generator can’t balance efficiency with expressiveness. 🍞 Anchor: A close-up face frame gets tiny, detailed “dice,” while a background landscape gets bigger “chunks.”
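Below is a rough sketch of that "chopping" idea, assuming a simple two-level scheme with one fine and one coarse patch size chosen by an importance score; the real conditioner's patch sizes, importance estimation, and token projection are not reproduced here.

```python
import torch
import torch.nn.functional as F

def adaptive_patchify(frame_latent, importance, fine=2, coarse=8):
    """Sketch: important frames get many small tokens, others get a few big ones.

    frame_latent: (C, H, W) latent of one selected frame
    importance:   score in [0, 1]; higher means finer patches (more tokens)
    """
    patch = fine if importance > 0.5 else coarse
    tokens = F.unfold(frame_latent.unsqueeze(0), kernel_size=patch, stride=patch)
    # (1, C*patch*patch, num_patches) -> (num_patches, C*patch*patch);
    # a real conditioner would also project each token to the model width
    return tokens.transpose(1, 2).squeeze(0)

face = torch.randn(16, 32, 32)   # crucial identity frame
sky = torch.randn(16, 32, 32)    # background-only frame
print(adaptive_patchify(face, importance=0.9).shape)  # 256 fine tokens
print(adaptive_patchify(sky, importance=0.2).shape)   # 16 coarse tokens
```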

🍞 Hook: Ever write a story where each new sentence refers to what you already said, like “the same girl walked into the same cafe”? 🥬 The Concept (Referential Captions): Shot-level captions that point back to earlier shots help the model know what to keep the same.

  • How it works: Captions are rewritten to include references like “the same man,” guiding the selector and memory.
  • Why it matters: Without references, the model may drift—new clothes, new faces, or wrong places. 🍞 Anchor: “The same plant” tells the model to recall earlier plant frames for the close-up.

🍞 Hook: Building with LEGO? You make parts separately, then snap them together. 🥬 The Concept (Compositional Generation): Later shots can combine characters or places introduced in different earlier shots.

  • How it works: The memory pulls relevant frames for each part and fuses them in the next shot.
  • Why it matters: Without this, the model fumbles when two story threads meet. 🍞 Anchor: Two people introduced in earlier shots appear together naturally in a final scene.

03Methodology

At a high level: Input (all previous shots + current caption) → Frame Selection (pick relevant frames) → Adaptive Conditioner (compress into tokens) → Generator (DiT) attends to both noise and context → VAE Decoder → Output next shot. Repeat.
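The loop below is a compact sketch of this pipeline. The helper callables (`encode_frames`, `select_frames`, `adaptive_patchify`, `dit_denoise`, `vae_decode`) are hypothetical stand-ins passed in as arguments, not the authors' actual API.

```python
def generate_story(captions, encode_frames, select_frames, adaptive_patchify,
                   dit_denoise, vae_decode, first_image=None):
    """Sketch of next-shot generation: each shot conditions on memory from all earlier shots."""
    memory_bank = []                                  # latent frames from all shots so far
    if first_image is not None:                       # image-to-multi-shot: seed memory with the image
        memory_bank.extend(encode_frames(first_image))

    shots = []
    for caption in captions:
        selected = select_frames(memory_bank, caption)        # relevance-aware Frame Selection
        context_tokens = adaptive_patchify(selected)          # Adaptive Conditioner: compact memory
        latent_shot = dit_denoise(caption, context_tokens)    # DiT attends to noise + context jointly
        shot = vae_decode(latent_shot)                        # decode the next shot
        shots.append(shot)
        memory_bank.extend(encode_frames(shot))               # refresh memory for the next step
    return shots
```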

Step-by-step with what, why, and examples:

  1. Build a historical memory bank
  • What happens: Encode every prior shot into latent “frame features” using a 3D VAE and concatenate them into one big memory across all time.
  • Why it exists: You need a place to search; otherwise, you only see the most recent shots and forget important old ones.
  • Example: For Shots 1–5 already made, all their frames sit in a memory bank so Shot 6 can look anywhere, not just Shot 5.
  2. Understand the current goal using the caption
  • What happens: The model has a few learnable query tokens. They first read the current caption (e.g., “the same woman, now in a close-up”) to grasp intent.
  • Why it exists: If you don’t know what you’re looking for, you can’t search well.
  • Example: If the caption says “close-up of the same plant,” the model knows to look for frames where the plant appears, not the person.
  3. Query the memory, score, and select frames (Frame Selection)
  • What happens: The queries, now primed by the caption, attend over the whole memory and assign a relevance score to each historical frame. The top-K frames are selected.
  • Why it exists: Without selection, the model carries too much or the wrong context, hurting consistency and speed.
  • Example: For Shot 6, the selector might pick a clear face frame from Shot 1 (even if far back), and a wide scene frame from Shot 3 that matches the location.
  4. Turn selected frames into compact context tokens (Adaptive Conditioner)
  • What happens: Each selected frame is “patchified” (chopped into tokens). Important frames get fine patches (more tokens); less important get coarse patches (fewer tokens). All tokens are concatenated.
  • Why it exists: You want rich detail where it matters and efficiency elsewhere; uniform compression either wastes compute or loses key details.
  • Example: A crucial identity frame (face) gets many small tokens; a sky background gets a few big tokens.
  5. Inject context into the generator (DiT), as sketched in the code after this list
  • What happens: Concatenate the context tokens with the current shot’s noisy tokens and feed the whole set into the diffusion transformer so attention flows across both.
  • Why it exists: Direct joint attention lets the generator “look at” the memory while denoising the next shot. If you don’t integrate them, the generator can’t use memory effectively.
  • Example: When denoising a close-up, attention leans on the detailed face tokens to keep identity stable.
  6. Decode and move forward autoregressively
  • What happens: The denoised latent frames are decoded to a video shot. Append it to history, update memory, and move on to the next caption.
  • Why it exists: This loop lets the story scale to many shots, with fresh, relevant memory at each step.
  • Example: After creating Shot 6, the model is ready for Shot 7, now remembering both early and recent key frames.
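As promised in step 5, here is a minimal sketch of how simple concatenation lets attention flow across the memory tokens and the shot being denoised. It shows a single toy transformer block; the actual DiT stacks many such blocks and adds timestep and text conditioning, and the names here are illustrative.

```python
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    """Toy block: noisy shot tokens and context (memory) tokens attend to each other."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, noisy_tokens, context_tokens):
        # noisy_tokens:   (B, N_noise, dim) the shot currently being denoised
        # context_tokens: (B, N_ctx, dim)   adaptive memory from selected past frames
        x = torch.cat([context_tokens, noisy_tokens], dim=1)   # one joint sequence
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)                              # attention spans memory and noise
        x = x + h
        x = x + self.mlp(self.norm2(x))
        return x[:, context_tokens.size(1):, :]                # keep only the denoised positions

block = JointAttentionBlock()
out = block(torch.randn(2, 256, 512), torch.randn(2, 64, 512))  # 256 noisy + 64 memory tokens
print(out.shape)                                                # torch.Size([2, 256, 512])
```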

The secret sauce:

  • Relevance-aware selection beats fixed windows: pull from anywhere in the past if it matters.
  • Importance-guided patchification: spend tokens where needed, save tokens where not.
  • Simple, powerful conditioning: just concatenate memory tokens with noisy tokens so the transformer’s attention does the heavy lifting.

Training “recipe” details:

  • Unified three-shot training: Most data has 2–3 shots, which can destabilize training. The authors “inflate” two-shot videos into three-shot ones by inserting a synthetic middle shot (from another video or an augmented version of the first) so training always predicts the third shot from the first two. Why: A uniform setup stabilizes learning of cross-shot context. Example: Train on (Shot A, Synthetic B, Shot C) to predict Shot C.
  • Decoupled-to-coupled curriculum: Early in training the selector is untrained, so its picks are essentially random. At first, they therefore sample frames uniformly from the real shot to form the context (ignoring the selector); after warm-up, they switch to full selector-driven conditioning. Why: Avoid bad early choices that confuse the generator. Example: Like using teacher-chosen examples before letting the student pick.
  • Selector supervision with pseudo-labels: They use feature similarities (e.g., DINOv2, CLIP) to create targets that indicate which historical frames are more relevant to the current shot, helping the selector learn faster. Why: Makes selection more reliable than learning from scratch. Example: Frames clearly unrelated get negative labels; augmented frames get partial relevance.
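To illustrate the pseudo-label idea, the sketch below builds soft relevance targets from cosine similarity between pooled frame features (for example, DINOv2 or CLIP embeddings). The paper's exact labeling rules (negatives for clearly unrelated frames, partial relevance for augmented ones) are only approximated here.

```python
import torch
import torch.nn.functional as F

def relevance_pseudo_labels(history_feats, target_feats, temperature=0.1):
    """Sketch: soft relevance of each historical frame to the current shot.

    history_feats: (N, D) one pooled feature per historical frame (e.g., DINOv2/CLIP)
    target_feats:  (M, D) pooled features of the current shot's frames
    """
    h = F.normalize(history_feats, dim=-1)
    t = F.normalize(target_feats, dim=-1)
    sim = h @ t.T                              # (N, M) cosine similarity matrix
    per_frame = sim.max(dim=1).values          # best match of each past frame to the current shot
    return F.softmax(per_frame / temperature, dim=0)   # soft targets for the selector's scores

# toy usage with random features standing in for real embeddings
labels = relevance_pseudo_labels(torch.randn(20, 768), torch.randn(8, 768))
print(labels.shape)   # torch.Size([20]); unrelated frames get near-zero weight
```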

Concrete mini-walkthrough:

  • Inputs: Prior Shots 1–3, Caption 4: “The same man now speaks by the lake; medium shot.”
  • Memory: All frames from Shots 1–3 are encoded and stored.
  • Selection: Caption-primed queries score frames; it picks a clear face from Shot 1 and a wide lake frame from Shot 3.
  • Conditioning: The face frame is finely patchified; the lake frame coarsely; tokens are concatenated with Shot 4’s noise.
  • Generation: The DiT attends to both the face (to keep identity) and the lake (to match place), then decodes a medium shot of the same man by the lake.

04Experiments & Results

The test (what they measured and why):

  • They evaluated both text-to-multi-shot (T2MSV) and image-to-multi-shot (I2MSV), focusing on two big ideas: within-shot quality and across-shot coherence.
  • Within-shot (intra-shot): subject consistency, background consistency, aesthetic quality, and motion dynamics—because each shot should look good and stable.
  • Across-shot (inter-shot): character and environment consistency—because people and places must remain themselves as the story moves.
  • Semantic alignment: does each shot match its caption? Critical for following the script.

The competition (baselines):

  • Fixed-window attention (e.g., Mask^2DiT): sees a limited set of recent shots.
  • Keyframe conditioning (StoryDiffusion + I2V): one image per shot, then expand to video.
  • Edit-and-extend (e.g., FLUX + I2V): use the last frame and try to continue forward.

The scoreboard (with context):

  • On T2MSV, OneStory’s inter-shot coherence averaged about 0.581 across character and environment, beating others that clustered around ~0.54–0.57. Think of this like getting an A while others get B’s.
  • Semantic alignment also improved, meaning OneStory followed complex, evolving captions more faithfully.
  • In I2MSV, OneStory again led on both across-shot and within-shot metrics. It kept faces/places consistent and produced appealing, correctly moving shots.
  • Even with a tiny context budget (about one latent-frame worth of tokens), OneStory performed strongly, and adding a bit more memory gave further gains. That’s like packing just one extra index card of notes and still acing the test.

Surprising or notable findings:

  • Relevance matters more than recency: Selecting older-but-pertinent frames outperformed grabbing the latest frames uniformly or by default “most recent.”
  • Adaptive patchification punches above its weight: Spending detail on important frames and compressing the rest gave big wins without heavy compute.
  • Generalization: Despite being trained mainly on human-centric data, the model handled out-of-domain scenes reasonably well, suggesting the adaptive memory travels.

Qualitative highlights:

  • Reappearance: When a character returns after an intervening shot, OneStory recalls identity and keeps clothes/face stable.
  • Zoom-ins: It localizes small objects (like a plant) when moving from wide shots to close-ups.
  • Composition: It merges separate threads (two characters introduced in different shots) into a convincing joint scene.

Dataset and benchmark:

  • ~60K multi-shot videos with referential captions, constructed by shot detection, two-stage captioning (then referential rewriting), and multi-stage filtering for quality.
  • Benchmarks with classic story patterns: main-subject consistency; insert-and-recall; and compositional generation. This stresses long-range memory, distractor robustness, and thread-merging.

05Discussion & Limitations

Limitations (honest take):

  • Domain bias: Training data is human-centric, so unusual domains (wild animals, abstract art, medical imagery) may be weaker.
  • Base-model dependence: Built on a pretrained I2V model; its strengths and weaknesses flow into OneStory.
  • Very long sequences: While memory is adaptive, extremely long or hour-scale narratives may still overtax the system.
  • Caption reliance: If captions are vague or incorrect, frame selection can latch onto the wrong memories.
  • Fine-grained continuity: Micro-details (tiny accessories, precise props under occlusions) can still drift across complex cuts.

Required resources:

  • A capable pretrained I2V base (e.g., Wan2.1), multi-GPU training (authors used 128×A100 for one epoch), and substantial storage for memory and dataset.
  • For inference, good GPU memory is helpful but the adaptive conditioner keeps costs manageable.

When not to use:

  • Ultra-long surveillance or sports broadcasts where hours of footage must remain strictly synchronized.
  • Cases requiring precise audio-visual alignment (lip sync with real audio) not modeled here.
  • Highly technical scenes needing exact geometry/physics over many cuts.

Open questions:

  • Audio and dialogue: How to carry voice, prosody, and lip sync across shots with the same adaptive memory?
  • Interactive editing: Can users pin or veto certain frames in memory for creative control?
  • Scaling memory: How to grow memory across tens or hundreds of shots without losing speed?
  • Better supervision: Beyond CLIP/DINO pseudo-labels, can richer supervision teach even sharper selection?
  • Evaluation: Can we design human-centric coherence metrics that correlate even better with viewer perception?

06Conclusion & Future Work

Three-sentence summary:

  • OneStory turns multi-shot video creation into a next-shot task powered by an adaptive memory that is both global (searches all past shots) and compact (packs only what matters).
  • A Frame Selection module finds the most relevant historical frames, and an Adaptive Conditioner compresses them into right-sized tokens that the generator can attend to directly.
  • Trained on a 60K referential-caption dataset, OneStory beats strong baselines in both text-to- and image-to-multi-shot settings, delivering coherent, controllable long-form storytelling.

Main achievement:

  • Showing that relevance-driven selection plus importance-guided compression enables practical, scalable narrative coherence across discontinuous shots—without bloating compute.

Future directions:

  • Integrate audio and speech consistency, scale memory to even longer narratives, expand beyond human-centric data, and add interactive controls for creators.
  • Improve frame-selection supervision and evaluation metrics to align even more with human judgment.

Why remember this:

  • It reframes the problem (write the next shot) and introduces adaptive memory as a simple, powerful lens for long-form video generation. The idea—select what matters, compress wisely, condition directly—can influence how we build story-aware generative systems well beyond video.

Practical Applications

  • Create short films or trailers with consistent characters across many shots.
  • Produce educational multi-scene videos where key objects and people persist correctly across cuts.
  • Generate marketing clips that keep brand colors, logos, and spokesperson identity stable shot-to-shot.
  • Storyboard and previsualize complex scenes by iteratively generating the next shot from evolving prompts.
  • Make product demos that return to the same item after intervening shots without changing its look.
  • Assemble travel vlogs where the same travelers and landmarks reappear consistently across locations.
  • Design social-media series (episodic reels) that remain on-character across multiple scene changes.
  • Prototype game cutscenes that merge separate character threads into shared scenes reliably.
  • Build museum or classroom exhibits with multi-shot narratives that zoom in on details while staying coherent.
  • Automate highlight reels that preserve identity and setting even when switching camera angles.
#multi-shot video generation#adaptive memory#frame selection#next-shot generation#image-to-video (I2V)#diffusion transformer (DiT)#patchification#referential captions#narrative coherence#long-range context#autoregressive video#semantic alignment#character consistency#environment consistency#video storytelling