StoryMem: Multi-shot Long Video Storytelling with Memory

Intermediate
Kaiwen Zhang, Liming Jiang, Angtian Wang et al. · 12/22/2025
arXiv · PDF

Key Summary

  ‱ StoryMem is a new way to make minute‑long, multi‑shot videos that keep the same characters, places, and style across many clips.
  ‱ It teaches a single‑shot video model to remember by saving a few special frames (keyframes) from earlier shots in a small memory bank.
  ‱ These memory frames are plugged back into the model using a simple trick: put the memory next to the current shot’s hidden features (latent concatenation) and mark them as happening in the past (negative RoPE shift).
  ‱ The system only needs light LoRA fine‑tuning on short videos, so it keeps the high picture quality of the original model.
  ‱ Smart picking of memory frames (semantic keyframe selection with CLIP and aesthetic filtering with HPSv3) keeps the memory helpful and not messy.
  ‱ StoryMem can smoothly connect shots (MI2V) and even start from a user’s reference images (MR2V) for personalized stories.
  ‱ On the new ST‑Bench test, StoryMem beats past methods in cross‑shot consistency while staying strong at visual quality and prompt following.
  ‱ User studies also prefer StoryMem’s stories for coherence and natural flow.
  ‱ It still struggles a bit with many similar characters and with very big motion changes between neighboring shots.
  ‱ Overall, StoryMem shows that adding a simple, explicit visual memory lets today’s single‑shot video models tell long, coherent stories.

Why This Research Matters

StoryMem helps AI make videos that feel like real movies instead of a bunch of unrelated clips. This means your hero keeps the same look, the world stays stable, and the mood carries across scenes. Creators can build longer, more believable stories with less effort and without retraining giant models. Brands can keep characters consistent across many shots, and teachers or students can craft coherent visual narratives for projects. By using a tiny, smart memory instead of huge new models, it’s faster, cheaper, and easier to deploy. This shift opens the door to practical minute‑long storytelling for ads, education, entertainment, and personalized content.

Detailed Explanation

01 Background & Problem Definition

🍞 Hook: Imagine you’re making a class movie. Each friend records one short clip. When you stitch them together, the hero’s hair changes color, the cafĂ© turns into a beach, and the mood flips—whoops! The story feels broken, not like one smooth film.

đŸ„Ź The Concept (The World Before): Video AI got very good at making one beautiful clip at a time (single‑shot videos). These clips can look cinematic and follow a prompt closely. But movies aren’t just one clip—they’re many shots that must fit together: same characters, outfits, places, and style across time.

How it worked before and why it wasn’t enough:

  1. Big all‑in‑one models tried to generate all shots together, learning connections across the whole long video. They used heavy attention across every frame of every shot. This worked for consistency but was very expensive to train and run, and needed tons of special multi‑shot data.
  2. Two‑stage keyframe pipelines first made one image per shot (a keyframe), then expanded each image into a short video. This was efficient and used great single‑shot models—but each shot ignored the others. So details drifted (hair, clothes, scenery), and transitions felt stiff.

Why this mattered (The Problem): Real stories need multi‑layered coherence—same character identity over minutes, consistent backgrounds as the camera moves, a steady visual style, and natural transitions. Without that, the audience gets confused.

What people tried (Failed Attempts):

  • Joint multi‑shot training with huge attention blocks: consistent, but quadratic cost (gets much slower as you add more frames) and worse visual quality compared to high‑end single‑shot base models.
  • Decoupled keyframe + expansion: fast and pretty, but blind to history—no memory of earlier shots.

The Gap: We needed a method that:

  • Keeps the stunning image quality of top single‑shot models.
  • Shares context across shots to stay consistent.
  • Trains light, without rare giant multi‑shot datasets.

🍞 Anchor: Think of a flipbook movie. If each page is drawn by a different friend with no notes from the previous page, the hero might randomly grow a hat or swap pets. But if each friend peeks at a few earlier pages before drawing, the story stays steady. That “peek” is the missing ingredient: memory.

— New Concepts —

🍞 Hook: You know how a teacher writes key facts on the board so everyone can stay on track? đŸ„Ź Keyframe‑based storytelling: It tells a story by choosing only the most important frames to represent what’s happening, instead of using every single frame. How it works: (1) Pick key moments as anchors. (2) Use them to plan or expand into shots. Why it matters: Without keyframes, stories can wander or repeat; with them, you get structure without bloat. 🍞 Anchor: Like a comic strip: a few panels are enough to follow the plot.

🍞 Hook: Imagine taking a single cool photo and turning it into a short moving scene. đŸ„Ź Single‑shot video diffusion models: These models turn a prompt (and maybe a first frame) into one high‑quality short video clip. How it works: (1) Start with noise. (2) Step‑by‑step remove noise guided by the prompt until a clip appears. Why it matters: They’re really good at beauty and detail, but they don’t remember previous clips. 🍞 Anchor: You ask for “a cat jumping on a couch,” and you get a gorgeous 5‑second shot—but it knows nothing about the cat from last shot.
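
A minimal Python sketch of that denoising loop may make the idea concrete. The `denoiser` and `text_encoder` callables, the step count, and the simple Euler-style update are illustrative assumptions, not the actual model used in the paper.

```python
import torch

def generate_single_shot(denoiser, text_encoder, prompt, latent_shape, steps=50):
    """Sketch of single-shot text-to-video diffusion sampling.

    `denoiser(x, t, text_emb)` is assumed to predict a velocity pointing from
    noise toward a clean clip; `text_encoder` maps the prompt to an embedding.
    Both are placeholders, not the paper's actual components.
    """
    text_emb = text_encoder(prompt)
    x = torch.randn(latent_shape)            # start with pure noise
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):                   # step by step, remove noise guided by the prompt
        t, t_next = ts[i], ts[i + 1]
        v = denoiser(x, t, text_emb)         # model's guess of how to move toward a clean video
        x = x + (t_next - t) * v             # take a small step in that direction
    return x                                 # clean latents; a VAE decodes them into frames
```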

02 Core Idea

🍞 Hook: Picture a filmmaker carrying a tiny scrapbook with snapshots of earlier scenes. Before filming each new shot, they glance at the scrapbook: “Same hero jacket, same cafĂ© logo, same warm lighting.” The next shot now matches the story so far.

đŸ„Ź The Concept (Aha! in one sentence): StoryMem turns long video storytelling into a repeat‑after‑me process where each new shot is generated while looking at a small, smart memory of keyframes from previous shots.

How it works (big idea):

  1. Keep a compact memory bank of a few keyframes from earlier shots.
  2. Feed those memory frames into a strong single‑shot model as extra context.
  3. Mark those memory frames as coming from the past (negative RoPE shift) so the model treats them like earlier moments.
  4. Lightly fine‑tune the model (LoRA) so it learns to use the memory well.
  5. After each new shot, pick new keyframes to refresh the memory (semantic and aesthetic filters).

Why it matters: Without memory, each shot drifts. With memory, characters, places, and style stay coherent across minutes.

🍞 Anchor: It’s like building a LEGO city one block at a time, while keeping a photo of what you already built. You match the colors and shapes, so the city doesn’t suddenly change style.

Three analogies (same idea, different angles):

  • Scrapbook analogy: A director flips through a tiny album of earlier scenes to keep costumes and lighting consistent.
  • Trail markers: Hikers leave ribbons on trees (keyframes) so anyone who follows (the next shot) stays on the same path.
  • Recipe card: Each cooking step checks the earlier steps’ notes (memory) so the final dish tastes consistent.

Before vs. After:

  • Before: Either super‑heavy joint models or fast but forgetful shot‑by‑shot pipelines.
  • After: A lean, memory‑aware shot‑by‑shot approach that keeps high visual quality while staying consistent.

Why it works (intuition, no equations):

  • The model sees the past (memory frames) and the present (current prompt) together. Because memory frames are slotted as earlier time positions, attention layers naturally connect past looks to current generation.
  • Since we only lightly adapt (LoRA) the pretrained model, we keep its high-fidelity visuals.
  • Smart selection keeps memory small but informative, preventing confusion.

Building blocks (introduced with mini sandwiches):

🍞 Hook: You know how we tape yesterday’s best drawings to the wall so we remember the style? đŸ„Ź Memory‑to‑Video (M2V): It’s a way to generate each new shot while being conditioned on a compact visual memory from earlier shots. How it works: (1) Store keyframes as memory. (2) Encode them and feed them alongside the current shot’s latent features. (3) Use a negative RoPE shift to mark them as past. (4) Lightly fine‑tune the network to read memory. Why it matters: Without M2V, the model forgets; with M2V, it reuses past facts. 🍞 Anchor: Like glancing at a style board before painting the next panel of a mural.

🍞 Hook: Imagine clipping two strips of film side‑by‑side so the editor can compare them. đŸ„Ź Latent concatenation: It’s joining the hidden features (latents) of memory frames with those of the current video so the model can see both at once. How it works: (1) Encode memory frames. (2) Place them next to the current video’s latent timeline. (3) Provide a mask so the model knows what to keep vs. what to generate. Why it matters: Without concatenation, the model can’t directly attend to memory content. 🍞 Anchor: Putting your old homework next to your new page so you match handwriting.
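
Here is a minimal sketch, assuming a (channels, time, height, width) latent layout, of how memory latents could be placed next to empty slots for the new shot along the temporal axis, together with a keep/generate mask. The shapes and helper name are illustrative, not the paper's exact implementation.

```python
import torch

def build_latent_timeline(memory_latents, num_new_frames):
    """Join memory latents with placeholder slots for the shot to generate.

    memory_latents: tensor of shape (C, T_mem, H, W) holding encoded keyframes
    (assumed layout). Returns the joint timeline plus a binary mask:
    1 = memory position to keep, 0 = position the model must generate.
    """
    C, T_mem, H, W = memory_latents.shape
    new_slots = torch.zeros(C, num_new_frames, H, W)        # empty slots for the new shot
    joint = torch.cat([memory_latents, new_slots], dim=1)   # memory first, current shot after
    mask = torch.cat([torch.ones(T_mem), torch.zeros(num_new_frames)])
    return joint, mask
```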

🍞 Hook: Think of labeling photos in an album: pages before today get negative page numbers so you never confuse past and present. đŸ„Ź Negative RoPE shift: It marks memory frames with negative time positions so the model treats them as earlier events. How it works: (1) Assign negative indices to memory frames. (2) Keep current shot starting at zero. (3) The transformer’s attention now bridges past-to-present naturally. Why it matters: Without proper time labels, the model might mix up past and present and lose consistency. 🍞 Anchor: A timeline where everything before zero is history, so the next scene builds correctly on it.
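
The sketch below shows only the time-indexing idea: memory latents get negative positions, the current shot starts at zero, and those indices would then feed the model's rotary position embedding (RoPE). The spacing value is an illustrative choice, not the paper's setting.

```python
import torch

def temporal_positions(num_memory, num_current, spacing=5):
    """Assign time indices so memory reads as 'the past' (sketch).

    With num_memory=2 and spacing=5 this gives [-10, -5] for the memory
    frames and 0..num_current-1 for the new shot.
    """
    mem_pos = -spacing * torch.arange(num_memory, 0, -1)  # e.g. [-10, -5]
    cur_pos = torch.arange(num_current)                   # 0, 1, 2, ...
    return torch.cat([mem_pos, cur_pos])                  # these indices feed RoPE
```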

🍞 Hook: Like tuning a guitar a little so it harmonizes with the band. đŸ„Ź LoRA fine‑tuning: A lightweight way to adjust a big model by adding small low‑rank adapters. How it works: (1) Insert small LoRA modules. (2) Train only those to learn memory usage. (3) Keep the base model’s visual strength. Why it matters: Full retraining is costly and can hurt quality; LoRA keeps it fast and pretty. 🍞 Anchor: Clip‑on training wheels that guide balance without rebuilding the whole bike.
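
A minimal PyTorch sketch of a LoRA adapter wrapping one frozen linear layer: only the small low-rank matrices train, and they start as a no-op so the pretrained behavior is preserved. The rank and scaling values are illustrative defaults, not the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a small trainable low-rank update (sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep the pretrained weights untouched
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)  # project down to rank
        self.up = nn.Linear(rank, base.out_features, bias=False)   # project back up
        nn.init.zeros_(self.up.weight)     # start as a no-op so base quality is preserved
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```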

03 Methodology

At a high level: Script shots → (First shot) generate from text → Extract keyframes for memory → (Next shot) memory + text → Encode memory and mask → Latent concatenation + negative RoPE shift in Video DiT → Generate shot → Update memory → Repeat until all shots.
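
Before the step‑by‑step recipe, here is the whole loop as one hedged Python sketch. `model.generate`, `select_keyframes`, and `memory_bank` are placeholder names standing in for the memory‑aware video model, the keyframe picker, and the memory store, not the paper's actual APIs.

```python
def storymem_pipeline(shot_prompts, model, memory_bank, select_keyframes):
    """End-to-end sketch of the shot-by-shot loop described above."""
    shots = []
    for prompt in shot_prompts:
        # condition the new shot on its text plus whatever is currently remembered
        shot = model.generate(prompt, memory=memory_bank.frames())
        # pick a few representative frames from the finished shot and store them
        memory_bank.add(select_keyframes(shot))
        shots.append(shot)
    return shots
```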

Step‑by‑step, like a recipe:

  1. Inputs and setup
  • What happens: You start with a story script: one short text description per shot, plus optional cut indicators for smooth transitions.
  • Why it exists: A story is many beats; we need text to guide each beat.
  • Example: Shot 1: “A girl in a yellow raincoat runs through a rainy market, warm lights reflecting on puddles.” Shot 2: “She ducks under a red awning, smiling as raindrops slow.”
  2. Generate the first shot (no memory yet)
  • What happens: The base single‑shot video diffusion model creates the first 5‑second clip from its text.
  • Why it exists: We need a starting point before there’s any past to remember.
  • Example: The girl in the yellow raincoat appears in a lively, rainy market.
  3. Extract memory from the shot (semantic keyframe selection + aesthetic filtering)
  • What happens: From the finished shot, we pick a few frames to remember and store them in a small memory bank.
  • Why it exists: Not every frame matters. We want only distinct, clear, representative frames to guide future shots.
  • Example: We keep (a) a clean frame showing the girl’s face and coat, (b) a frame showing the market stalls, (c) a frame with the warm lighting mood.

— New Concepts —

🍞 Hook: When you take notes from a chapter, you don’t copy every sentence—only the important ideas. đŸ„Ź Semantic keyframe selection: Choose the most meaningful, non‑redundant frames using CLIP features to measure how different each frame is from the last chosen keyframe. How it works: (1) Compute CLIP embeddings for frames. (2) Select the first frame, then keep adding a new one only if similarity drops below a threshold (adaptive if too many are picked). Why it matters: Without this, memory fills with look‑alike frames and wastes space. 🍞 Anchor: Like highlighting only new ideas in your textbook, not the same sentence over and over.
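
A minimal sketch of that greedy selection, assuming an `embed` callable that stands in for a CLIP image encoder; the similarity threshold and the cap on kept frames are illustrative, and the paper's adaptive thresholding is only hinted at here.

```python
import torch
import torch.nn.functional as F

def select_keyframes(frames, embed, sim_threshold=0.85, max_keep=4):
    """Greedy semantic keyframe selection (sketch).

    A frame is kept only if it is sufficiently different from the last kept
    keyframe, so the memory holds distinct content instead of look-alikes.
    """
    keyframes = [frames[0]]
    last_emb = embed(frames[0])
    for frame in frames[1:]:
        emb = embed(frame)
        sim = F.cosine_similarity(emb, last_emb, dim=-1)
        if sim.item() < sim_threshold:     # new content: remember it
            keyframes.append(frame)
            last_emb = emb
        if len(keyframes) >= max_keep:     # crude stand-in for the adaptive threshold
            break
    return keyframes
```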

🍞 Hook: You wouldn’t pin a blurry photo on your inspiration board. đŸ„Ź Aesthetic preference filtering: Use a learned score (HPSv3) to filter out low‑quality frames so the memory stays clear and helpful. How it works: (1) Score each candidate. (2) Drop the ones below a threshold. Why it matters: Blurry or noisy frames can mislead the model and hurt quality. 🍞 Anchor: Curating your art wall: only crisp, appealing pictures make it up there.
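
And the matching filter step, as a small sketch: `aesthetic_score` stands in for an HPSv3-style preference scorer, and the cutoff value is an assumption.

```python
def filter_by_aesthetics(candidates, aesthetic_score, min_score=0.5):
    """Drop blurry or low-quality candidates before they enter memory (sketch)."""
    return [frame for frame in candidates if aesthetic_score(frame) >= min_score]

# Typical use: run selection first, then filtering, before updating the memory bank.
# memory_candidates = filter_by_aesthetics(select_keyframes(frames, embed), aesthetic_score)
```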

  4. Prepare the current shot’s conditioning package
  • What happens: Encode memory frames with a 3D VAE to get memory latents. Concatenate them (temporal axis) with empty slots for the new frames. Create a binary mask telling the model which positions are memory (keep) and which are to generate.
  • Why it exists: The model needs memory latents and a clear map of what to preserve vs. synthesize.
  ‱ Example: If we have 2 memory frames and need 16 new frames, we build an 18‑slot latent timeline with the first 2 slots marked as memory.

— New Concepts —

🍞 Hook: Turning a bulky video into a tiny LEGO version you can process fast. đŸ„Ź 3D VAE: An encoder‑decoder that compresses video frames into smaller, learnable hidden codes (latents) across time (3D). How it works: (1) Encode RGB frames into latents. (2) Decode latents back to video after generation. Why it matters: Operate in latent space to be efficient without losing important detail. 🍞 Anchor: Shrinking a poster to a pocket card for easy reference, then printing it full‑size later.
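
A shape-level sketch of the role the 3D VAE plays here; `vae` is a placeholder object, and the compression factors in the comments are common assumptions for video VAEs, not the paper's exact numbers.

```python
def roundtrip_through_vae(vae, video):
    """Show where encoding and decoding sit in the pipeline (sketch).

    video: tensor shaped (B, 3, T, H, W) of RGB frames.
    """
    latents = vae.encode(video)   # much smaller, e.g. (B, C, T/4, H/8, W/8): cheap to attend over
    # ... diffusion and memory conditioning happen in this latent space ...
    frames = vae.decode(latents)  # back to (B, 3, T, H, W) once generation is done
    return latents, frames
```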

🍞 Hook: A to‑do list where some lines are pre‑filled (don’t change) and others are blank (please write). đŸ„Ź Mask‑guided conditional diffusion: The model sees memory regions as fixed context and only generates the masked new frames. How it works: (1) Concatenate noisy video latent + conditional latent + mask as channels. (2) Predict the velocity to denoise only where needed. Why it matters: Without the mask, the model might overwrite memory or ignore it. 🍞 Anchor: A coloring page where some parts are already colored; you fill only the white spaces.
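
A minimal sketch of how the three pieces could be stacked as model input; the channel-wise layout and shapes are illustrative assumptions.

```python
import torch

def assemble_dit_input(noisy_latent, cond_latent, mask):
    """Stack the noisy target, the fixed memory content, and the keep/generate map.

    Assumed shapes: latents (B, C, T, H, W), mask (B, 1, T, H, W) with
    1 = fixed memory position and 0 = position to synthesize.
    """
    # the model then predicts the denoising velocity only where mask == 0
    return torch.cat([noisy_latent, cond_latent, mask], dim=1)
```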

  5. Place memory in the past (negative RoPE shift)
  • What happens: Assign negative time indices to memory latents and start current frames at zero.
  • Why it exists: The transformer must understand that memory is earlier in time so it can attend from present to past correctly.
  • Example: Memory frames get times −10 and −5; the new shot frames are 0..15.
  6. Generate the shot with a memory‑aware Video DiT
  • What happens: The diffusion transformer (DiT) attends over both memory and current frames (plus text) to predict and remove noise step by step.
  • Why it exists: Attention lets the model copy identity, style, and background cues from memory while following the new prompt.
  • Example: The girl’s yellow raincoat, market mood, and color tone carry into the new shot under the awning.

— New Concepts —

🍞 Hook: Think of a super‑smart editor that looks at all frames and the script at once. đŸ„Ź Video DiT (Diffusion Transformer): A transformer that predicts how to denoise video latents, guided by text and conditions. How it works: (1) Self‑attention for within‑video relations. (2) Cross‑attention for text conditioning. (3) Position encodings to track space and time. Why it matters: It’s the engine turning noise into coherent, prompt‑accurate video. 🍞 Anchor: A director who reads the script while watching rehearsal footage to guide the next take.
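
One block of such a transformer might look roughly like the sketch below: self-attention over video-plus-memory tokens, cross-attention to the text prompt, then a feed-forward layer. This is an illustrative block (RoPE omitted, text tokens assumed already projected to the same width), not the paper's architecture or sizes.

```python
import torch
import torch.nn as nn

class DiTBlockSketch(nn.Module):
    """Minimal sketch of one memory-aware Video DiT block."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, text_tokens):
        # self-attention: current-shot tokens can attend to memory tokens that
        # sit at negative time positions (position encoding omitted here)
        h = self.norm1(video_tokens)
        video_tokens = video_tokens + self.self_attn(h, h, h, need_weights=False)[0]
        # cross-attention: condition on the shot's text prompt
        h = self.norm2(video_tokens)
        video_tokens = video_tokens + self.cross_attn(h, text_tokens, text_tokens, need_weights=False)[0]
        # feed-forward refinement
        return video_tokens + self.mlp(self.norm3(video_tokens))
```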

  7. Update the memory bank
  • What happens: Extract new keyframes from the just‑made shot, compare to old memory using CLIP similarity, add only distinct frames, and enforce capacity with a memory‑sink + sliding‑window strategy: some early anchors stay long‑term; recent ones slide to keep short‑term context.
  • Why it exists: Prevent memory bloat and keep both global identity and local continuity.
  • Example: Keep her face + coat as long‑term anchors; keep the red awning briefly for nearby shots.

— New Concepts —

🍞 Hook: A fridge door with limited magnets: old favorite photos stay; new ones rotate in. đŸ„Ź Memory sink + sliding window: Keep a few earliest, most defining keyframes as permanent anchors; manage recent frames in a short rolling window; drop oldest when full. How it works: (1) Fixed anchors. (2) Rolling recent memory. (3) Capacity control. Why it matters: Without it, memory grows messy and slow, or forgets the core identity. 🍞 Anchor: Your class bulletin board: a few permanent rules at the top; weekly updates below.
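
A minimal sketch of the memory-sink plus sliding-window idea; the capacities are illustrative, and the CLIP-similarity check against existing memory (described above) is omitted for brevity.

```python
from collections import deque

class MemoryBank:
    """Memory sink + sliding window (sketch).

    A few earliest keyframes become permanent "sink" anchors; everything else
    lives in a short rolling window that drops its oldest entry when full.
    """
    def __init__(self, sink_size=2, window_size=4):
        self.sink_size = sink_size
        self.sink = []                           # long-term anchors (e.g., the hero's face)
        self.window = deque(maxlen=window_size)  # short-term, recent context

    def add(self, keyframes):
        for frame in keyframes:
            if len(self.sink) < self.sink_size:
                self.sink.append(frame)          # the first defining frames stay forever
            else:
                self.window.append(frame)        # recent frames rotate; the oldest drops out

    def frames(self):
        return self.sink + list(self.window)
```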

  8. Smooth transitions and customization (MI2V and MR2V)
  • What happens: If the script says no cut, reuse the last frame of the previous shot as the first frame of the next (MI2V) for smoother motion. For personalization (MR2V), start the memory with user reference images (characters, places).
  • Why it exists: MI2V reduces jumpy cuts; MR2V lets users keep the same hero or setting across the whole story.
  • Example: Keep the last rainy frame to start the next shot gently; or begin with a reference photo of the girl so her look never drifts.

— New Concepts —

🍞 Hook: To keep the music playing smoothly, don’t stop between songs—blend the last note into the next. đŸ„Ź MI2V: Memory + Image‑to‑Video continuity—reuse the final frame of the previous shot when no cut is intended. How it works: (1) Carry over last frame. (2) Maintain motion continuity. Why it matters: Without it, even consistent looks can feel jumpy. 🍞 Anchor: A dance routine that flows from one move to the next without a pause.

🍞 Hook: Show the artist a portrait before they start so they capture the same person every time. đŸ„Ź MR2V: Memory + Reference‑to‑Video—initialize memory with user‑provided images for consistent, customized stories. How it works: (1) Load references into memory. (2) Generate all shots conditioned on them. Why it matters: Users get their exact character or brand across the entire video. 🍞 Anchor: A cosplay guide sheet used by every photographer to keep the costume accurate.
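
Both modes slot into the same shot-by-shot loop sketched earlier; below is a hedged variant showing where the carried-over frame (MI2V) and the user references (MR2V) enter. Memory refreshing is omitted here, and all component names remain placeholders.

```python
def generate_story(shot_prompts, cut_flags, model, memory_bank, reference_images=None):
    """Shot-by-shot loop with MI2V and MR2V hooks (sketch)."""
    if reference_images:                               # MR2V: seed memory with user-provided images
        memory_bank.add(reference_images)
    shots, last_frame = [], None
    for prompt, is_cut in zip(shot_prompts, cut_flags):
        first_frame = None if is_cut else last_frame   # MI2V: carry the last frame across a soft transition
        shot = model.generate(prompt, memory=memory_bank.frames(), first_frame=first_frame)
        last_frame = shot[-1]
        shots.append(shot)
    return shots
```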

The secret sauce:

  • Explicit, tiny memory that the model can directly attend to.
  • Negative RoPE shift so the transformer naturally treats memory as the past.
  • Lightweight LoRA so we keep the base model’s beauty while adding memory skills.
  • Smart memory curation so the context stays sharp, small, and on‑point.

04 Experiments & Results

🍞 Hook: Think of a school talent show. To judge fairly, you need a program with clear acts, simple rules, and a scoreboard that makes sense to everyone.

đŸ„Ź The Test (What they measured and why): The team built ST‑Bench, a new benchmark with 30 diverse stories, each split into 8–12 shots, including cut indicators. They tested three things:

  1. Aesthetic Quality: How good it looks overall (color, realism, appeal).
  2. Prompt Following: How well the video matches the story text, both globally and per shot.
  3. Cross‑shot Consistency: How well characters, scenes, and style stay steady across shots.

Why it matters: Great storytelling needs to look good, say what you asked for, and stay consistent over time.

🍞 Anchor: It’s like grading a comic for art quality, story faithfulness, and consistent character faces from panel to panel.

The Competition:

  • Independent single‑shot baseline: Wan2.2‑T2V (makes each shot alone; strong visuals, no memory/consistency).
  • Two‑stage keyframe pipelines: StoryDiffusion + Wan2.2‑I2V; IC‑LoRA + Wan2.2‑I2V (use keyframes, then expand; efficient but weak cross‑shot links).
  • Joint multi‑shot model: HoloCine (trains a big model to generate long sequences at once; more consistent than two‑stage, but heavier and sometimes lower visual quality).

Scoreboard with context:

  • Cross‑shot Consistency: StoryMem comes out on top, improving overall consistency by about 28.7% over the independent single‑shot baseline and about 9.4% over the strong joint model (HoloCine). This is like getting an A when others are at B or B+.
  • Aesthetic Quality: Among methods that enforce consistency, StoryMem achieves the highest aesthetic score, close to the independent model that doesn’t try to stay consistent across shots.
  • Prompt Following: StoryMem scores the best on global story alignment among consistency‑focused methods. Its single‑shot alignment is slightly lower because smooth transitions (MI2V) add extra constraints that can nudge the frame away from a super literal per‑shot match.
  • Representative numbers: Aesthetic ≈ 0.613; Overall Consistency ≈ 0.507; Top‑10 relevant pairs ≈ 0.534 (higher is better). Exact values vary per setup but the trend is consistent: StoryMem leads in cross‑shot coherence while keeping visuals strong.

User Study (humans watching videos):

  • People preferred StoryMem over all baselines on most aspects: consistency, narrative flow, and overall preference. The independent model was still liked for single‑shot prettiness but lost on multi‑shot coherence.

Surprising findings and notes:

  • A little memory goes a long way: Even a tiny, well‑curated memory bank can stabilize identity and style across many shots.
  • Smooth transitions matter: Reusing the last frame (MI2V) noticeably improves perceived continuity, even if it slightly lowers strict per‑shot prompt matching.
  • Quality preservation: LoRA fine‑tuning on short clips kept the base model’s cinematic look while adding memory skills, avoiding the common quality drop seen in some joint multi‑shot trainings.

05 Discussion & Limitations

Limitations (be specific):

  • Many similar characters: If several look alike and the prompt is vague, the model may pull the wrong person from memory and mix identities.
  • Big motion changes: When one shot ends fast and the next begins slow (or vice versa), even MI2V can’t fully guarantee a silky transition.
  • Purely visual memory: The stored memory doesn’t include structured text tags for who’s who, so retrieval can be ambiguous in crowded scenes.

Required resources:

  • A strong single‑shot base model (e.g., Wan‑I2V‑style) and a GPU setup that can handle DiT inference with a few extra memory latents.
  • Light LoRA fine‑tuning data: short, semantically related clips (hundreds of thousands for best results, but still far less demanding than full multi‑shot retraining).
  • CLIP and an aesthetic scorer (HPSv3) for selecting and filtering keyframes.

When NOT to use it:

  • Ultra‑precise choreography across long continuous takes where exact motion speed continuity is mandatory (consider specialized motion control tools).
  • Crowded, multi‑character scenes with minimal textual guidance—unless you add clearer per‑shot character descriptions or structured references.
  • Scenarios where you cannot afford even small memory overhead (e.g., ultra‑tight latency constraints without batching).

Open questions:

  • Entity‑aware memory: Can we store and retrieve per‑character slots with names and attributes to remove ambiguity?
  • Multi‑frame continuity: Beyond reusing one frame, can we overlap several frames or velocity cues to match motion speed better?
  • Long‑range planning: Can a lightweight planner decide which memories to keep or drop for entire scenes automatically?
  • Multimodal conditioning: How about adding audio beats or script outlines to shape transitions and pacing?
  • Robustness: How small can the memory be before consistency suffers, and can adaptive memory budgets keep quality while saving compute?

06 Conclusion & Future Work

Three‑sentence summary: StoryMem teaches a powerful single‑shot video model to remember by keeping a tiny bank of keyframes and feeding them back into generation with a negative time shift. This memory‑to‑video design preserves character, scene, and style across many shots while using only light LoRA fine‑tuning, keeping the base model’s cinematic quality. On the new ST‑Bench, it clearly improves cross‑shot consistency over prior methods and is preferred by human viewers.

Main achievement: Proving that an explicit, compact visual memory—plugged in via latent concatenation and negative RoPE shift—can transform single‑shot models into strong multi‑shot storytellers without heavy retraining.

Future directions: Add entity‑aware, text‑linked memory to disambiguate characters; design multi‑frame or motion‑aware transitions for perfectly smooth pacing; explore adaptive memory budgets and smarter selection policies; integrate optional audio or script cues for rhythm and scene flow.

Why remember this: It shows a simple, scalable path from great isolated clips to coherent long stories—by giving models a tiny, smart memory. Instead of building giant all‑at‑once systems, we can reuse today’s best single‑shot models, add memory, and get long‑form narratives that feel like real films. That’s a practical recipe for creators and researchers to make stories that look beautiful and stay true across minutes.

Practical Applications

  ‱ Produce consistent multi‑shot ads where a mascot’s look and brand colors never drift.
  ‱ Create educational story videos that keep the same characters and settings across lessons.
  ‱ Build narrative trailers or animatics that maintain style and identity across many beats.
  ‱ Generate episode recaps where characters remain visually stable scene to scene.
  ‱ Personalize stories by starting from a user’s reference photos (MR2V) for avatars or branded worlds.
  ‱ Improve vlog or travel‑story coherence by keeping landmarks, outfits, and color grading steady.
  ‱ Prototype film scenes quickly by adding memory to keep costumes, props, and lighting consistent.
  ‱ Design game cutscenes where NPC identities and environments carry across chapters.
  ‱ Automate social media series that preserve creator persona and set design over time.
#StoryMem #Memory-to-Video #multi-shot video generation #long video storytelling #video diffusion #latent concatenation #negative RoPE shift #LoRA fine-tuning #keyframe selection #CLIP #HPSv3 #Video DiT #3D VAE #ST-Bench #cross-shot consistency