Memory-V2V: Augmenting Video-to-Video Diffusion Models with Memory
Key Summary
- Memory-V2V teaches video editing AIs to remember what they already changed so new edits stay consistent with old ones.
- It stores past edited videos in a compact memory (video VAE latents), then retrieves only the most relevant ones for the next edit.
- A smart retriever (VideoFOV for camera tasks, DINOv2 for long videos) picks which past edits matter for the current step.
- Dynamic tokenization gives more detail (more tokens) to the most relevant memory clips and fewer tokens to the rest.
- Adaptive token merging compresses unimportant memory tokens based on attention responsiveness, cutting compute by about 30% without losing quality.
- On video novel view synthesis, Memory-V2V makes new camera views match each other better than strong baselines and even improves camera accuracy.
- On long video editing, it keeps looks and motion consistent across videos of 200+ frames that exceed the base model’s time window.
- It works as a lightweight add-on to existing video-to-video diffusion transformers (e.g., ReCamMaster, LucyEdit).
- Ablations show each piece (retrieval, dynamic tokenization, merging) helps; together they bring the biggest jump in consistency and speed.
- Limitations include handling multi-shot videos with big scene changes and quality drift if memory stores imperfect generated frames.
Why This Research Matters
Creators don’t edit videos just once; they tweak them many times or split long videos into chunks. Without memory, these edits drift, causing hats to change colors or backgrounds to wobble across segments. Memory-V2V turns editing into a stable, multi-turn process where new changes respect previous decisions. This leads to professional-looking results for social media, film, education, and simulation, even when videos are longer than the model’s normal limit. It also makes camera-controlled re-renders match each other, which is crucial for 3D realism. Finally, it achieves this while keeping computation practical through smart retrieval and compression.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how when you make a scrapbook, each page you add needs to match the style you already picked (colors, stickers, fonts), or the book starts to look messy? Video editing with AI is like building that scrapbook, one page (or one edit) at a time.
🥬 Filling (The Actual Concept — Video-to-Video Diffusion Models):
- What it is: A video-to-video diffusion model is an AI that takes a source video and transforms it into a new video while keeping key parts (like the subject and motion) intact.
- How it works (recipe):
- Encode the input video into a compact form (latent space).
- Add noise and then denoise step by step using a diffusion transformer (DiT), guided by conditions like text prompts or camera poses.
- Decode back to a finished video (see the code sketch after this block).
- Why it matters: Without this, we’d have to make every video from scratch, losing the original motion, identity, or timing people want to preserve.
🍞 Bottom Bread (Anchor): Imagine turning a phone video of your dog running into a cartoon version where the dog is still your dog and runs the same way — that’s video-to-video diffusion.
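To make the recipe concrete, here is a minimal runnable sketch of the encode, denoise, decode loop. Everything in it is a stand-in: vae_encode, vae_decode, and dit_denoise are hypothetical placeholders with made-up shapes, not the actual VAE or DiT of any model named in this article, and the update rule is deliberately simplified.

```python
import torch

# Hypothetical stand-ins for the real components; shapes and APIs are made up.
def vae_encode(video):                  # video: (F, H, W, 3) -> latents (F, h, w, C)
    return torch.randn(video.shape[0], 8, 8, 16)

def vae_decode(latents):                # latents -> decoded frames
    return torch.rand(latents.shape[0], 64, 64, 3)

def dit_denoise(noisy, t, source_latents, condition):
    # A real DiT would predict the noise/velocity from the noisy latents, the
    # timestep, the source-video latents, and the text/camera condition.
    return torch.zeros_like(noisy)

def edit_video(source_video, condition, num_steps=30):
    source_latents = vae_encode(source_video)      # 1) encode the input video
    z = torch.randn_like(source_latents)           # 2) start from pure noise
    for step in range(num_steps, 0, -1):           # 3) denoise step by step
        t = step / num_steps
        pred = dit_denoise(z, t, source_latents, condition)
        z = z - pred / num_steps                   # crude Euler-style update
    return vae_decode(z)                           # 4) decode back to frames

edited = edit_video(torch.rand(81, 512, 512, 3), condition="make the shirt red")
print(edited.shape)                                # torch.Size([81, 64, 64, 3])
```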
🍞 Top Bread (Hook): Picture filming your toy car and then wishing you could watch it from the other side — even though you never filmed that angle.
🥬 Filling (The Actual Concept — Video Novel View Synthesis):
- What it is: Novel view synthesis makes new camera angles of a scene from a single original video, as if you had more cameras.
- How it works:
- Read the original video and its camera info.
- Ask: “If the camera moved here, what would the scene look like?”
- Generate those new frames while keeping 3D structure and motion consistent.
- Why it matters: Without it, new angles would guess wrong about hidden parts, and different runs wouldn’t match each other.
🍞 Bottom Bread (Anchor): You swing the camera around a statue in your living room, then ask for views from left, right, and behind; the statue should look the same from all new angles.
The world before: Recent video-to-video diffusion tools could change appearance (e.g., “make the shirt red”), adjust motion a bit, or re-render with new camera paths. They did great in a single pass: you give one input video and get one output video. But real editors don’t stop at one pass — they tweak and refine, sometimes across days. This is called multi-turn editing: make an edit, review it, then edit again, and again.
The problem: In multi-turn editing, AIs often forget what they already decided in earlier rounds. If you generate a new camera view today and another view tomorrow, the parts that weren’t visible in the very first input can change unexpectedly between rounds. Or if you cut a long video into smaller chunks to fit the model’s time window, each chunk gets edited, but style and details drift from chunk to chunk.
Failed attempts:
- Re-run the editor independently each time: Fast but each run ignores earlier choices; results don’t match across turns.
- Autoregressive reuse (feed yesterday’s output as today’s input): Keeps things consistent between neighbors (turn t and t+1) but not across far-apart turns (turn 1 vs. turn 3).
- Use heavy 3D reconstructions (CUT3R) or big NVS encoders (LVSM) as “memory”: They didn’t carry enough fine visual detail for the diffusion transformer to reuse reliably.
- Keep piling all past frames in context: Too slow and often worse quality — redundant frames dilute the useful signal.
The gap: No simple, efficient way existed to bring “long-term visual memory” into off-the-shelf video-to-video diffusion models so they could recall prior edits and stay consistent across many rounds, without exploding compute.
Real stakes (why you should care):
- Social media creators often do many rounds of edits; mismatched earrings or changing backgrounds destroy the vibe.
- Film and ads need the same prop, costume, or lighting look to stay stable across shots and days.
- Robotics and simulation care about 3D consistency; if the world shifts randomly between re-renders, training fails.
- Long-form content (vlogs, lectures, gameplay) doesn’t fit into one model pass; chunking causes style drift unless the model can remember.
🍞 Top Bread (Hook): Think of a comic series where the hero’s hair color keeps changing because the artist forgot last week’s page.
🥬 Filling (The Actual Concept — Cross-Iteration Consistency):
- What it is: Making sure all edited videos from different rounds agree on how the same things look and move.
- How it works:
- Keep track of what previous edits looked like.
- Bring the right pieces of that history into the current edit.
- Use them as guidance so new outputs don’t drift.
- Why it matters: Without it, every round can subtly change colors, shapes, or motion — death by a thousand tiny mismatches.
🍞 Bottom Bread (Anchor): If you add a blue hat in segment 1, cross-iteration consistency helps keep that same blue hat in segment 5 — same shape, same shade.
02 Core Idea
🍞 Top Bread (Hook): Imagine you’re building a LEGO city over many afternoons. Each day, you peek at what you built before so new buildings match the old style — same street widths, same brick colors.
🥬 Filling (The Actual Concept — Explicit Memory):
- What it is: A plug-in for video editors that stores and recalls visual details from past edits.
- How it works:
- After each edit, save a compact version of the result (video latents) in a memory cache.
- When making a new edit, retrieve only the most relevant past results.
- Feed them into the diffusion transformer in a careful, not-too-big way.
- Why it matters: Without memory, every new edit starts half-blind, and that’s how drift sneaks in.
🍞 Bottom Bread (Anchor): You dyed the hero’s jacket emerald last week; memory brings that exact color back this week so it matches.
The “Aha!” in one sentence: Don’t throw every past frame at the editor — retrieve only the most relevant previous edits and compress them smartly so the model can remember long-term details without getting overwhelmed.
Three analogies for the same idea:
- Library: Instead of carrying the whole library to your desk, you check out just the best-matching books and use sticky notes to summarize long chapters.
- Cooking: You don’t dump every spice into the pot; you pick the ones that match tonight’s recipe, and you grind only what you need.
- Sports replay: The coach doesn’t rewatch the entire season before a game; they pull a few key clips and a highlight reel.
🍞 Top Bread (Hook): You know how you only bring the notebooks you need to class, not your entire bookshelf?
🥬 Filling (The Actual Concept — Task-Specific Retrieval Mechanisms):
- What it is: A way to pick which old videos matter for the current edit.
- How it works:
- For camera tasks, use VideoFOV: measure how much the target camera path “sees” the same directions as a past video.
- For long edits, use DINOv2 features: find past segments most visually similar to the current source segment.
- Select the top-k and ignore the rest.
- Why it matters: Without retrieval, the model drowns in irrelevant history and slows down or gets confused.
🍞 Bottom Bread (Anchor): If today’s camera looks behind the statue, retrieve past edits that also saw the back of the statue, not the front.
🍞 Top Bread (Hook): Think of zooming in on the important parts of a picture while leaving the background small.
🥬 Filling (The Actual Concept — Dynamic Tokenization):
- What it is: Breaking memory videos into tokens at different sizes based on how relevant they are.
- How it works:
- Give the most relevant memory clips fine-grained tokens (more detail).
- Give less relevant ones coarse tokens (fewer, bigger chunks).
- Keep the total token count efficient so attention stays fast and focused.
- Why it matters: Without dynamic sizing, either you waste compute or you throw away detail where it counts.
🍞 Bottom Bread (Anchor): The top-3 matching past clips get small, detailed patches; the rest get bigger patches to save space.
🍞 Top Bread (Hook): When your backpack is too full, you roll your clothes to fit more.
🥬 Filling (The Actual Concept — Token Compression via Adaptive Token Merging):
- What it is: A learnable way to fuse low-importance memory tokens so the model runs faster while keeping key info.
- How it works:
- Estimate frame responsiveness using attention scores: which memory frames do target queries attend to most?
- For low-responsive frames, merge their tokens with a small neural compressor (not delete them!) at stable mid/late layers.
- Keep high-responsive frames intact.
- Why it matters: Simply discarding tokens breaks motion/appearance; merging preserves essentials and cuts compute (~30% speedup).
🍞 Bottom Bread (Anchor): Instead of tossing your notes, you make a one-page summary for the boring parts and keep full pages for the juicy parts.
Before vs. After:
- Before: Single-turn editors forgot earlier decisions across rounds; chunked long edits drifted in style; trying to feed all history was slow and noisy.
- After: Memory-V2V remembers through compact latents, picks the right history, and compresses the rest, so multi-turn edits and long videos stay coherent and efficient.
Why it works (intuition):
- Represent: Video VAE latents preserve the fine appearance that 3D states (CUT3R) or other NVS states (LVSM) didn’t encode richly enough for a diffusion transformer.
- Retrieve: Relevance selects the needles, not the haystack — camera overlap (VideoFOV) or visual similarity (DINOv2) points right to helpful clips.
- Resize: Dynamic tokens match detail to importance; attention stays focused.
- Merge: Responsiveness-guided merging trims fat without cutting muscle; mid/late-block merging avoids early mistakes.
Building blocks you can stack:
- External memory cache of video latents.
- Task-specific retriever (VideoFOV for viewpoints; DINOv2 for long edits).
- Dynamic tokenizers (1×2×2, 1×4×4, 1×8×8).
- Adaptive token merging at stable DiT blocks (e.g., 10 and 20).
- Camera-aware conditioning per video and careful RoPE index ranges so time positions don’t clash.
03 Methodology
At a high level: Input (current source video + editing instruction/camera) → Retrieve top-k relevant past edited videos from the cache → Dynamically tokenize them by relevance → Compress low-responsive tokens via adaptive merging → Feed everything into the DiT to denoise → Output edited video that matches previous edits.
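Before the individual steps, here is the same flow as a structural code sketch. All four helpers (retrieve_top_k, tokenize, merge_low_responsive, dit_edit) are hypothetical stubs named for illustration only; they mark where Steps A through D would plug in, and the memory cache is just a Python list.

```python
# Hypothetical stubs so the sketch runs; each stands in for a component that the
# steps below describe in detail.
def retrieve_top_k(cache, src, cond, k):      # Step A (VideoFOV or DINOv2)
    return cache[-k:]

def tokenize(videos, strides):                # Step B (dynamic tokenization)
    return videos

def merge_low_responsive(tokens):             # Step C (adaptive token merging)
    return tokens

def dit_edit(src_tokens, mem_tokens, cond):   # Step D (memory-conditioned DiT)
    return f"edit({src_tokens[0]}, {cond})"

def memory_v2v_edit(source_video, condition, memory_cache, k=6):
    """One editing turn with explicit memory (structural sketch only)."""
    retrieved = retrieve_top_k(memory_cache, source_video, condition, k=k)
    mem_tokens = tokenize(retrieved, strides=[(1, 4, 4), (1, 8, 8)])
    src_tokens = tokenize([source_video], strides=[(1, 2, 2)])
    mem_tokens = merge_low_responsive(mem_tokens)
    edited = dit_edit(src_tokens, mem_tokens, condition)
    memory_cache.append(edited)               # remember this result for later turns
    return edited

cache = []
for turn, camera in enumerate(["orbit_left", "orbit_right", "look_behind"]):
    memory_v2v_edit(f"clip_{turn}", camera, memory_cache=cache, k=6)
print(len(cache))                             # 3 edits remembered across turns
```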
🍞 Top Bread (Hook): Think of a neat binder system: store last week’s worksheets, pull only the ones you need, place bookmarks where to read in detail, and skim the rest.
🥬 Filling (The Actual Concept — Latent Video Memory Representation):
- What it is: The system saves each past edited video as VAE latents, not raw pixels or 3D states.
- How it works:
- After finishing an edit x_j, encode it with the video VAE to get E(x_j) (shape F×H×W×C).
- Store E(x_j) plus metadata (e.g., camera path) in the external cache Ω.
- At the next edit, latents are easy to fetch and cheap to process (see the cache sketch after this block).
- Why it matters: Unlike CUT3R/LVSM states, VAE latents carried fine appearance that actually helped the diffusion transformer stay consistent.
🍞 Bottom Bread (Anchor): It’s like saving a small-but-faithful thumbnail of each past edit that’s fast to load and still shows the important details.
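A minimal sketch of what an external cache like Ω might hold; the MemoryEntry layout here is an assumption made for illustration, not a structure the paper specifies. Each entry keeps the VAE latents of a finished edit plus lightweight metadata such as the camera path.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import torch

@dataclass
class MemoryEntry:
    latents: torch.Tensor                       # E(x_j), shape (F, H, W, C)
    camera_path: Optional[torch.Tensor] = None  # per-frame poses (NVS tasks)
    turn: int = 0                               # which editing round produced it

@dataclass
class MemoryCache:
    entries: List[MemoryEntry] = field(default_factory=list)

    def add(self, latents, camera_path=None):
        self.entries.append(MemoryEntry(latents, camera_path, turn=len(self.entries)))

# Usage: after finishing an edit, encode it with the video VAE and store it.
cache = MemoryCache()
cache.add(latents=torch.randn(81, 64, 64, 16))  # dummy latents for the sketch
print(len(cache.entries))                       # 1
```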
Step A — Retrieval tailored to the task:
- For video novel view synthesis: VideoFOV retrieval.
- Build a dense set of sight directions on a sphere at the target camera’s start (64,800 samples).
- Mark which directions fall inside the camera frustum over the whole target path — that’s its FOV set.
- Do the same for each cached video’s camera path.
- Score relevance by how much the two FOV sets overlap/contain each other; pick top-k.
- Why it exists: If you’re about to look at the back of the car, old edits that also saw the back are most useful.
- Example: Target path sees 40,000 sphere points; a memory clip shares 25,000 of them — high overlap wins.
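Below is a simplified numpy sketch of the FOV-overlap idea, not the paper’s exact implementation: sample dense directions on a sphere, mark which ones a pinhole camera sees anywhere along a path, and score a cached clip by how much of the target path’s FOV set it covers. The pose format (3×3 world-to-camera rotations) and the half-angle field-of-view values are assumptions.

```python
import numpy as np

def sphere_directions(n_az=360, n_el=180):
    """Dense unit directions on a sphere (360 x 180 = 64,800 samples)."""
    az = np.deg2rad(np.arange(n_az))                    # azimuth in [0, 360) deg
    el = np.deg2rad(np.arange(n_el) - 89.5)             # elevation in (-90, 90) deg
    az, el = np.meshgrid(az, el, indexing="ij")
    return np.stack([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)], axis=-1).reshape(-1, 3)

def fov_set(rotations, dirs, half_fov_x=0.5, half_fov_y=0.3):
    """Boolean mask of sphere directions seen anywhere along a camera path."""
    seen = np.zeros(len(dirs), dtype=bool)
    for R in rotations:                                 # R: 3x3 world-to-camera
        d_cam = dirs @ R.T                              # directions in camera frame
        z = d_cam[:, 2]
        in_x = np.abs(d_cam[:, 0]) <= np.tan(half_fov_x) * z
        in_y = np.abs(d_cam[:, 1]) <= np.tan(half_fov_y) * z
        seen |= (z > 1e-6) & in_x & in_y
    return seen

def videofov_score(target_set, memory_set):
    """Fraction of the target path's FOV set already covered by a memory clip."""
    return (target_set & memory_set).sum() / max(target_set.sum(), 1)

# Rank two toy cached camera paths against a target path and pick the top-1.
dirs = sphere_directions()
target = fov_set([np.eye(3)], dirs)                     # target camera looks down +z
cached_paths = [[np.eye(3)], [np.diag([-1.0, 1.0, -1.0])]]   # same view vs. behind
scores = [videofov_score(target, fov_set(path, dirs)) for path in cached_paths]
print(scores, np.argsort(scores)[::-1][:1])             # high overlap wins
```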
- For text-guided long video editing: DINOv2-based segment retrieval.
- Split the long video into segments that fit the base model window (e.g., ~81 frames).
- Compute a segment descriptor by averaging DINOv2 features over frames.
- Rank past segments by cosine similarity to the current source segment; always include the most recent segment to keep local continuity.
- Why it exists: Text is too vague (“add a hat”); visual similarity of the source segments is a stronger cue.
- Example: Two hallway segments with similar walls and lighting are more relevant than a kitchen segment.
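A small sketch of this ranking, assuming per-frame DINOv2 features have already been extracted (the arrays below are random stand-ins for real features): average the features over a segment, rank past segments by cosine similarity, and always keep the most recent one.

```python
import numpy as np

def segment_descriptor(frame_features):
    """Average per-frame features (e.g., DINOv2 embeddings) into one unit vector."""
    v = np.asarray(frame_features).mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-8)

def retrieve_segments(current_frames, past_segments, k=3):
    """Indices of the k most relevant past segments, newest one always included."""
    query = segment_descriptor(current_frames)
    sims = [float(segment_descriptor(seg) @ query) for seg in past_segments]
    ranked = list(np.argsort(sims)[::-1])
    most_recent = len(past_segments) - 1
    picked = [most_recent] + [i for i in ranked if i != most_recent]
    return picked[:k]

# Toy usage: 5 past segments of 81 frames, each frame with a 384-d feature vector.
rng = np.random.default_rng(0)
past = [rng.normal(size=(81, 384)) for _ in range(5)]
current = rng.normal(size=(81, 384))
print(retrieve_segments(current, past, k=3))   # e.g., [4, ...] with 4 = most recent
```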
Step B — Dynamic tokenization by relevance:
🍞 Top Bread (Hook): You zoom in on the parts you care about and keep the rest small to save space.
🥬 Filling (The Actual Concept — Dynamic Tokenization, sizes by relevance):
- What it is: Convert memory videos to tokens with different spatio-temporal strides.
- How it works:
- User input video: 1×2×2 tokenizer (fine detail).
- Top-3 retrieved memory videos: 1×4×4 tokenizer (medium detail).
- Remaining retrieved videos: 1×8×8 tokenizer (coarse detail).
- Why it exists: Treating all memory equally would explode tokens and slow attention quadratically.
- Example with numbers: Suppose each video has 81 frames, 64×64 latent size.
- 1×2×2 makes 81×32×32 tokens.
- 1×4×4 makes 81×16×16 tokens.
- 1×8×8 makes 81×8×8 tokens.
- You keep detail where it matters and shrink where it doesn’t.
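The stride arithmetic can be checked with a toy patchifier. This is only a sketch: it groups latents by reshaping, whereas the model’s actual tokenizers are learned projection layers.

```python
import torch

def patchify(latents, stride=(1, 2, 2)):
    """Group latents into (st x sh x sw) patches and flatten each into a token."""
    F, H, W, C = latents.shape
    st, sh, sw = stride
    x = latents.reshape(F // st, st, H // sh, sh, W // sw, sw, C)
    x = x.permute(0, 2, 4, 1, 3, 5, 6)            # (F', H', W', st, sh, sw, C)
    return x.reshape(F // st, H // sh, W // sw, st * sh * sw * C)

latents = torch.randn(81, 64, 64, 16)             # one clip's latents (toy shape)
for stride in [(1, 2, 2), (1, 4, 4), (1, 8, 8)]:
    tokens = patchify(latents, stride)
    n = tokens.shape[0] * tokens.shape[1] * tokens.shape[2]
    print(stride, tuple(tokens.shape[:3]), f"{n} tokens")
# (1,2,2) -> 81x32x32 = 82,944 tokens   (user input: fine detail)
# (1,4,4) -> 81x16x16 = 20,736 tokens   (top-3 memory clips)
# (1,8,8) -> 81x8x8   =  5,184 tokens   (remaining memory clips)
```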
Step C — Attention-aware compression (Adaptive Token Merging):
- Motivation: Even after smart tokenization, attention cost can be high; also, not all memory frames help equally at every layer.
- Responsiveness score:
- Average key features per frame to get a frame key vector.
- Measure max attention response from target queries to each frame’s key.
- High score = important; low score = compressible.
- Merging strategy:
- Don’t delete low-responsive tokens (that caused artifacts in tests).
- Instead, learn a small convolutional compressor to fuse tokens from low-responsive frames into fewer tokens.
- Apply merging at stable blocks (e.g., transformer blocks ~10 and ~20), where importance stays consistent across layers.
- Why it exists: You want to shorten the token sequence intelligently where it won’t harm the result.
- Example: Out of 10 memory frames, 3 get high responsiveness → untouched; 7 get merged 2:1 → nearly halves their attention cost but keeps their essence.
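Here is a toy torch sketch of both parts of this step, with made-up shapes and a simple 2:1 convolutional merger; the paper’s learned compressor, scoring details, and thresholds will differ. Each memory frame is scored by the strongest attention response its averaged key receives from the target queries, the top frames are kept intact, and the rest are compressed.

```python
import torch
import torch.nn as nn

def frame_responsiveness(target_q, memory_k):
    """target_q: (Nq, d) target queries; memory_k: (F, Nt, d) memory keys."""
    frame_keys = memory_k.mean(dim=1)                      # one averaged key per frame
    attn = target_q @ frame_keys.T / frame_keys.shape[-1] ** 0.5
    return attn.max(dim=0).values                          # max response per frame (F,)

class TokenCompressor(nn.Module):
    """Toy learnable 2:1 merger over a frame's token dimension."""
    def __init__(self, d):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size=2, stride=2)

    def forward(self, tokens):                             # (Nt, d) -> (Nt // 2, d)
        return self.conv(tokens.T.unsqueeze(0)).squeeze(0).T

def merge_memory(target_q, memory_tokens, memory_k, compressor, keep=3):
    scores = frame_responsiveness(target_q, memory_k)
    keep_idx = set(torch.topk(scores, k=keep).indices.tolist())
    merged = [tok if f in keep_idx else compressor(tok)    # merge, never delete
              for f, tok in enumerate(memory_tokens)]
    return torch.cat(merged, dim=0)                        # shortened token sequence

# Toy usage: 10 memory frames, 64 tokens each, 128-d features, 256 target queries.
d, num_frames, tokens_per_frame = 128, 10, 64
mem = [torch.randn(tokens_per_frame, d) for _ in range(num_frames)]
out = merge_memory(torch.randn(256, d), mem, torch.stack(mem),   # keys ~ tokens here
                   compressor=TokenCompressor(d), keep=3)
print(out.shape)   # 3*64 kept + 7*32 merged = 416 tokens
```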
Step D — Feeding memory into the DiT correctly:
- Camera-aware conditioning per video: each memory video’s tokens use its own camera embedding (for NVS) so geometry stays well-grounded.
- Stable positional encoding (RoPE) across multiple videos (see the sketch after this list):
- Assign non-overlapping temporal index ranges to target, user input, and memory tokens.
- For long-video inference, flip memory-time indexing to match training’s order and avoid a train-test mismatch.
- Add light noise to memory tokens during training so the model prefers the clean user input while still using memory.
- Why it exists: If all tokens fight for the same time slots, the model gets confused; if memory is too “loud,” it might copy artifacts.
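A tiny, purely illustrative sketch of the index bookkeeping: give the target, the user input, and each memory video disjoint temporal index ranges so their rotary position embeddings never collide, and optionally flip the memory ordering at inference. The exact range layout and flipping rule here are assumptions, not the paper’s assignment.

```python
def temporal_index_ranges(num_frames, num_memory, flip_memory=False):
    """Give target, user input, and each memory video disjoint temporal indices."""
    ranges = {
        "target": range(0, num_frames),
        "user_input": range(num_frames, 2 * num_frames),
    }
    memory_order = list(range(num_memory))
    if flip_memory:                  # e.g., reverse at inference so the ordering
        memory_order.reverse()       # matches training (an assumed convention)
    for slot, m in enumerate(memory_order):
        start = (2 + slot) * num_frames
        ranges[f"memory_{m}"] = range(start, start + num_frames)
    return ranges

for name, idx in temporal_index_ranges(num_frames=81, num_memory=3,
                                       flip_memory=True).items():
    print(f"{name}: [{idx.start}, {idx.stop})")
```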
Step E — Extension to long video editing:
- Reformulate long videos as multi-turn editing over segments.
- Use LucyEdit as the base per-segment editor; at each segment i, retrieve edited segments whose source segments are visually similar (DINOv2) and include the most recent neighbor.
- Apply dynamic tokenization and adaptive merging just like in NVS.
- Stitch segments back into a full-length video at the end (a loop sketch follows below).
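Putting Step E together as a short loop, with hypothetical stubs (split_into_segments, retrieve_segments, edit_segment, concat_segments) standing in for the segmenter, the retriever, the per-segment editor (e.g., LucyEdit), and the stitcher:

```python
# Hypothetical stubs so the loop runs; swap in real components in practice.
def split_into_segments(video, size=81):
    return [video[i:i + size] for i in range(0, len(video), size)]

def retrieve_segments(current, past, k=3):
    # Crude stand-in: take the k most recent; the DINOv2 retriever sketched
    # earlier would rank past segments by visual similarity instead.
    return list(range(max(0, len(past) - k), len(past)))

def edit_segment(segment, prompt, memory):    # stand-in for the per-segment editor
    return [f"edited:{frame}" for frame in segment]

def concat_segments(segments):                # stitch segments back together
    return [frame for segment in segments for frame in segment]

def edit_long_video(video, prompt, segment_len=81, k=3):
    sources = split_into_segments(video, segment_len)
    edited, memory = [], []
    for i, src in enumerate(sources):
        picked = retrieve_segments(src, sources[:i], k=k)
        result = edit_segment(src, prompt, memory=[memory[j] for j in picked])
        edited.append(result)
        memory.append(result)                 # grow the memory cache each turn
    return concat_segments(edited)

frames = [f"frame_{i}" for i in range(200)]   # a 200-frame source video
print(len(edit_long_video(frames, prompt="add a blue hat")))   # 200 edited frames
```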
The secret sauce:
- Using video VAE latents as memory: enough detail to guide the diffusion transformer, unlike other tested encoders.
- Retrieval before compression: pick needles first, then pack them efficiently.
- Merge instead of drop: compresses compute without breaking motion/style.
- Layer-aware merging: do it where attention importance is stable to avoid losing late-emerging cues.
04 Experiments & Results
The test (what they measured and why):
- Multi-turn Video Novel View Synthesis (NVS): Can the model generate several new camera paths in sequence and keep all previously revealed regions consistent? They used MEt3R to score multi-view 3D consistency (lower is better), plus camera accuracy (rotation/translation errors) and VBench quality metrics.
- Long Video Editing: Can the model edit videos longer than the base time window by splitting into segments and still keep subject/background appearance and motion consistent? They used cross-frame DINO/CLIP similarity (higher is better) and VBench quality scores.
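As a rough illustration of how cross-frame consistency scores of this kind work (a generic sketch, not VBench’s exact protocol): embed each frame, then average the cosine similarity of every frame’s feature to a reference frame, such as the first.

```python
import numpy as np

def cross_frame_consistency(frame_features):
    """Mean cosine similarity of each frame's feature to the first frame's."""
    feats = np.asarray(frame_features, dtype=np.float64)
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return float((feats[1:] @ feats[0]).mean())

# Toy check: identical features score 1.0; random features score near 0.
same = np.ones((10, 384))
noisy = np.random.default_rng(0).normal(size=(10, 384))
print(cross_frame_consistency(same), cross_frame_consistency(noisy))
```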
The competition (baselines):
- ReCamMaster (single-turn NVS); two modes for comparison:
- Independent (Ind): each turn ignores past outputs.
- Autoregressive (AR): each new turn uses the previous output as input.
- TrajectoryCrafter (another NVS approach).
- LucyEdit (single-turn instruction-based video editor) for long video editing, plus a FIFO-like diagonal denoising variant to mimic autoregressive behavior.
The scoreboard with context:
- NVS multi-turn consistency (MEt3R, lower is better):
- Memory-V2V ≈ 0.1357 average across pairs (best).
- ReCam (Ind) ≈ 0.1892 (worse consistency).
- ReCam (AR) ≈ 0.1485 (keeps neighbors consistent but not all pairs).
- TrajectoryCrafter ≈ 0.1818.
- Translation/rotation errors improved with Memory-V2V, showing better camera adherence.
- Think of this like getting an A when others get B to C+ on “do all views agree?”
- Long video editing (higher is better across consistency and quality):
- Subject consistency: Memory-V2V ≈ 0.9326 vs. LucyEdit (Ind) ≈ 0.8683.
- Background consistency: Memory-V2V ≈ 0.9233 vs. 0.9042.
- Motion smoothness and temporal flicker also improve slightly.
- In school-speak, Memory-V2V is the student whose style stays the same across a 200+ page essay, while others start changing fonts mid-way.
Speed/efficiency and ablations:
- Dynamic tokenization + adaptive token merging cut FLOPs and latency substantially; merging alone delivered about 30% speedup over comparable setups, and dynamic tokenization avoids the massive blow-up you’d get by tokenizing everything finely.
- Discard vs. merge: Throwing away low-responsive tokens caused visual artifacts; merging preserved coherence.
- Retrieval matters: VideoFOV-based retrieval for NVS maintained long-range consistency even between the 1st and 5th generations; random/loose retrieval drifted.
- Memory encoder choice: Using video VAE latents outperformed CUT3R and LVSM encoders as memory representations. CUT3R/LVSM states lacked transferable fine detail for the DiT.
Surprising findings:
- Mid/late transformer blocks are the safe spots to compress; early compression risks dropping frames that later turn out to be important (responsiveness becomes more stable deeper in the network).
- Carefully adding light noise to memory tokens during training helps the model not overfit to imperfect history — it learns to rely primarily on the clean user input and use memory as guidance rather than a crutch.
- Autoregressive reuse (ReCam AR) helped with adjacent turns but couldn’t guarantee that the 1st and 3rd turns match — explicit memory closed that gap.
Qualitative highlights:
- NVS: Novel areas like “the back of an object” render consistently across multiple requested camera paths; colors and textures agree.
- Long edits: The same hat stays the same hat; the same door stays the same door across many segments, avoiding the classic “segment-by-segment drift.”
05 Discussion & Limitations
Limitations (specific):
- Multi-shot videos with big scene transitions: If the long video jumps from a living room to an outdoor scene, the memory may carry over objects or textures that shouldn’t persist (e.g., a hand reappears in a new shot). The retriever currently focuses on visual similarity and recent segments; it needs shot-awareness.
- Memory quality depends on stored outputs: If earlier edits have mild flicker or blur (e.g., from extended generated segments), those imperfections can accumulate across iterations.
- Retrieval signals: For long edits, relying on DINOv2 similarity works better than text, but might still pick visually similar yet context-mismatched segments (e.g., similar color walls, different scene semantics).
- Compute and hardware: While efficient for what it does, training/fine-tuning used many GPUs (e.g., 32 A100s) and multi-view synthetic data or curated edit datasets. Not plug-and-play for tiny rigs.
Required resources:
- A base video-to-video diffusion transformer (e.g., ReCamMaster, LucyEdit).
- Video VAE encoder/decoder for latents.
- Storage for the external cache of memory latents.
- Retriever features (VideoFOV camera paths or DINOv2 embeddings).
- Enough GPU memory to run multi-input attention with tokenizers and merging layers.
When NOT to use it:
- Projects with abrupt multi-shot storytelling unless you add shot detection and reset/segment memory by shot.
- Cases where perfect single-pass editing already suffices and you don’t need multi-turn or long-form coherence — the extra plumbing may add small overhead.
- Settings where earlier outputs are very noisy/low quality; memory will guide you toward those artifacts.
Open questions:
- Can we make retrieval shot-aware and scene-aware so memory resets smartly at boundaries?
- Can we improve or denoise stored memory over time (e.g., memory distillation) to prevent error accumulation?
- How far can compression go with stronger sparsity/attention learning without hurting quality?
- Can we generalize beyond camera/text conditions to richer controls (masks, depth, audio cues) while keeping memory efficient?
- How to integrate with faster one-step or causal video generators for real-time interactive editing?
06 Conclusion & Future Work
Three-sentence summary: Memory-V2V adds a practical, explicit visual memory to existing video-to-video diffusion models so multi-turn and long-form edits stay consistent. It retrieves only the most relevant past edits and represents them as compact video latents, then sizes and compresses their tokens based on attention-driven importance to keep compute in check. The result is better cross-iteration consistency, competitive or improved quality, and meaningful speedups over baselines in both novel view synthesis and long video editing.
Main achievement: Showing that a simple retrieval+compression memory pipeline — with the right representation (video VAE latents), task-specific retrieval (VideoFOV/DINOv2), dynamic tokenization, and adaptive token merging — can turn single-turn editors into consistent multi-turn editors.
Future directions: Make memory shot-aware and self-cleaning to avoid artifact buildup; combine with causal/AR video models and diffusion distillation for faster interactivity; explore richer controls and broader datasets with diverse motions and scene changes; push sparsity further without losing fidelity.
Why remember this: It reframes video editing as a conversation with history — each new edit listens to what came before. That simple shift unlocks stable, longer, and more controllable video workflows that creators, filmmakers, and simulation builders have been waiting for.
Practical Applications
- Keep characters’ outfits and colors consistent across multiple edited scenes in a short film.
- Edit long vlogs or lectures in pieces while preserving the same visual style and lighting throughout.
- Re-render a product demo from new camera angles while ensuring the product looks identical in every view.
- Maintain consistent scene appearance in gameplay highlight reels split into many segments.
- Create multi-angle social media clips from a single take with matching looks across all posts.
- Build training videos for robotics where 3D geometry stays stable across iterative re-renders.
- Speed up iterative client feedback loops in advertising by preserving previous approvals automatically.
- Produce educational animations in stages while locking in character design and background palettes.
- Run design A/B tests on motion or filters and then converge on a final version without losing identity.
- Assist virtual production by keeping props, textures, and lighting coherent across memory-augmented reshoots.