Pretraining Frame Preservation in Autoregressive Video Memory Compression
Key Summary
- The paper teaches a video model to squeeze long video history into a tiny memory while still keeping sharp details in single frames.
- It does this by first pretraining a special memory compressor to recover random frames from any time in the past with high quality.
- After pretraining, the same compressor becomes the memory for an autoregressive video generator, so the model can continue stories for 20+ seconds using only ~5k context tokens.
- The compressor focuses on preserving high-frequency details (like faces, textures, and text) that usually get lost when you compress too much.
- Compared to simpler compression tricks (like just bigger patches), this method keeps more structure and detail, measured by PSNR/SSIM and human studies.
- The architecture mixes 3D convolutions and attention, adds features after the DiT’s first projection (e.g., 3072 channels), and can be trained with LoRA to keep compute low.
- Randomly choosing which frames to reconstruct during pretraining prevents the model from cheating and forces it to remember the whole history.
- The system improves long-range consistency of characters, clothes, objects, and scenes, even on consumer GPUs like an RTX 4070 12GB.
- Optional add-ons (tiny sliding window, cross-attention, multiple compressors) can push consistency further, at the cost of more compute or context.
- This approach gives a clearer map of the trade-off between context length (shorter is cheaper) and quality (sharper frames), helping find a practical sweet spot.
Why This Research Matters
Long videos need memory that is both small and smart; otherwise, stories get fuzzy or computers run out of room. This paper shows how to pretrain that memory to keep sharp details at any time point, so characters, outfits, and props stay consistent. It means creators can make longer, more coherent videos on regular GPUs instead of giant clusters. For education, news, sports, or short films, consistent scenes feel more believable and professional. It also gives engineers a clearer way to measure and tune the quality-versus-cost balance. With optional add-ons, the system can be customized for hard tasks like keeping exact object layouts. Overall, it moves long-form video generation from fragile demos toward practical tools.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you’re telling a friend a long story, but you can only bring a tiny sticky note with reminders. You need to pick the right clues so you don’t forget key parts later.
🥬 Filling (The Actual Concept): The world of video generation is moving from short, flashy clips to real storytelling: scenes with characters, outfits, places, and camera moves that stay consistent over time.
- What it is: Autoregressive video models continue a video by looking back at its history (context) and predicting what comes next.
- How it works: 1) The model reads many past frames, 2) Packs that history into a context, 3) Uses the context plus a prompt to generate the next chunk, 4) Repeats.
- Why it matters: If the context is too short or low-quality, the model forgets details—faces change, clothes morph, props vanish, and stories don’t stay coherent.
🍞 Bottom Bread (Anchor): Think of a YouTube cooking video: if the model forgets that the chef wore a red apron three scenes ago, the apron may suddenly turn blue.
🍞 Top Bread (Hook): You know how a photo looks crisp when you can see hair strands and fabric threads? Those tiny bits make it feel real.
🥬 The Concept: High-frequency detail preservation means keeping the small, sharp parts of an image—edges, textures, letters on signs—that often get blurred by compression.
- What it is: A promise to protect fine details while squeezing the video into a smaller memory.
- How it works: 1) Use a smart compressor, 2) Train it to reconstruct single frames anywhere in the timeline, 3) Reward it for sharp, accurate reconstructions.
- Why it matters: Without those tiny details, characters and scenes look muddy; identity and object consistency suffer across shots.
🍞 Bottom Bread (Anchor): If a character’s sweater has a knitted pattern, preserving those stitches helps the person look like the same person later on.
The problem before this paper: Long videos need long context windows, but those are expensive to store and compute over. A naive sliding window keeps the context fixed by cutting off old frames, but then the model forgets long-range events. Other ideas—very compact VAEs, token merging, bigger patches, sparse or linear attention—do cut costs, but they often smear or drop high-frequency details. Everyone felt the same tug-of-war: shorter context is cheaper, but it usually means blurrier or less consistent results.
🍞 Top Bread (Hook): Picture packing a big suitcase. If you cram too much, your clothes wrinkle; if you take too little, you miss outfits you need.
🥬 The Concept: The context length–quality trade-off balances how small we make memory (length) against how good it looks (quality).
- What it is: A fundamental balancing act in long video generation.
- How it works: 1) Choose a compression rate, 2) Measure how frame quality changes, 3) Pick a sweet spot where memory is short enough and details are still great.
- Why it matters: Without balancing, either you blow up compute/memory or you lose the consistency that makes stories believable.
🍞 Bottom Bread (Anchor): If you compress a 20-second history into ~5k tokens and can still retrieve crisp frames anywhere, you’ve packed well.
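To make the packing intuition concrete, here is a back-of-the-envelope token count in Python. The fps, resolution, VAE strides, and patch sizes are illustrative assumptions, not values from the paper; only the ~20-second history and the compression rates echo the text above.

```python
# Rough token-budget arithmetic for the context length-quality trade-off.
# All strides below are assumptions chosen for illustration.

def context_tokens(seconds, fps=16, height=480, width=832,
                   vae_t=4, vae_s=8,       # assumed VAE temporal/spatial downsampling
                   patch_t=1, patch_s=2,   # assumed DiT patchifier strides
                   comp=(1, 1, 1)):        # extra memory compression (H, W, T) in latent space
    ch, cw, ct = comp
    t = seconds * fps // vae_t // patch_t // ct
    h = height // vae_s // patch_s // ch
    w = width // vae_s // patch_s // cw
    return t * h * w

for rate in [(1, 1, 1), (2, 2, 1), (2, 2, 2), (4, 4, 2)]:
    print(rate, context_tokens(20, comp=rate), "tokens for 20 s of history")
```

Under these assumed settings, the uncompressed history costs over 100k tokens, while a 4×4×2 rate lands in the same few-thousand-token ballpark as the ~5k budget above.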
What was missing: A direct, measurable way to train and test whether a memory compressor truly keeps details from any point in time—before plugging it into a full video generator. The paper fills this gap by pretraining a compressor on a very explicit job: retrieve random frames with high fidelity.
Real stakes: This affects how creators make shorts, documentaries, educational videos, and even movie-like multi-shot scenes. Better memory means fewer weird glitches: faces stay the same, clothes don’t flip styles, rooms keep their props, and camera motions remain steady—all with lower hardware demands.
02 Core Idea
🍞 Top Bread (Hook): Imagine practicing for a play by memorizing random lines from anywhere in the script so you can jump to any scene and still act perfectly.
🥬 The Concept: The key insight is to pretrain a memory compressor by forcing it to reconstruct randomly chosen frames from a long video history, then reuse this compressor as the memory for an autoregressive video generator.
- What it is: A two-stage plan—(1) learn to compress while preserving details by retrieving random frames, (2) fine-tune the generator that uses this compressed memory.
- How it works: 1) Compress a long video history into a short context (~5k tokens/20s), 2) Randomly pick target frames and heavily noise/mask the rest, 3) Train the system to reconstruct the targets accurately, 4) Plug the trained compressor into an AR diffusion model and fine-tune with LoRA, 5) Roll out long videos with consistent memory.
- Why it matters: This direct retrieval training prevents the compressor from ignoring parts of the history and gives the final AR model a high-fidelity, compact memory.
🍞 Bottom Bread (Anchor): If a 20-second video is squished into ~5k memory tokens, but the model can still recover a specific face at second 17 clearly, the compressor did its job.
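Here is a tiny runnable sketch of the two-stage plan, using stand-in modules instead of a real VAE/DiT; `phi`, the shapes, and the 4× compression factor are all illustrative assumptions.

```python
import torch
import torch.nn as nn

# Stage 1 trains the compressor on random-frame retrieval; Stage 2 reuses the same
# compressor as the AR generator's memory. Everything here is a toy stand-in.

phi = nn.Conv3d(3, 8, kernel_size=4, stride=4)    # "memory compressor": 4x smaller in T, H, W

def stage1_retrieval_target(history):
    """Compress the history, then pick a random frame the model must recover from memory."""
    memory = phi(history)                          # compact context stand-in
    t = int(torch.randint(0, history.shape[2], (1,)))
    return memory, history[:, :, t]                # (compressed memory, clean target frame)

def stage2_generation_inputs(history, prompt_tokens):
    """The AR generator only ever sees phi(history) plus the prompt, never the raw history."""
    with torch.no_grad():                          # phi kept frozen; the paper fine-tunes with LoRA
        memory = phi(history)
    return memory, prompt_tokens

history = torch.randn(1, 3, 16, 64, 64)            # toy 16-frame history
memory, target = stage1_retrieval_target(history)
print(history.numel(), "history values ->", memory.numel(), "memory values")
```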
Three analogies for the same idea:
- Library card catalog: You shrink a big library (the video) into a tiny card catalog (memory). During training you practice finding random books (frames). Later, the same catalog helps you continue stories without getting lost.
- Travel scrapbook: You condense a long trip into a small scrapbook. You practice flipping to random days and redrawing the photo. If you can redraw any day sharply, your scrapbook is great.
- Backpack packing: You choose just the essentials to carry (compressed context). You rehearse pulling out the right item for any moment. Later, hiking (generation) is smooth because your pack is light but useful.
Before vs After:
- Before: Compressors often blurred details or favored only some parts of the timeline. AR models either dropped long history or paid huge compute costs.
- After: The pretrained compressor preserves high-frequency details across the whole timeline and feeds a tiny but rich memory to the AR model, enabling long, coherent rollouts on modest GPUs.
Why it works (intuition):
- Training on random frame retrieval blocks cheating (like encoding only the end frames). The compressor must learn a balanced, global summary that supports any moment.
- Adding features after the DiT’s first projection (e.g., 3072 channels) avoids narrow bottlenecks (like 16-channel VAE latents), which helps keep fine detail.
- 3D convolutions reduce time first, then space, so the compressor respects motion patterns before shrinking spatial detail.
Building blocks (each introduced with a mini sandwich):
🍞 You know how you guess the next scene in a cartoon by remembering what just happened? 🥬 Autoregressive Video Model: Predicts the next part of a video by using past frames as context. It reads history, then generates the next chunk, and repeats. If it forgets history, stories fall apart. 🍞 When asked to continue a soccer match video, it knows who has the ball because it kept past plays in context.
🍞 Imagine the crisp edges of comic ink lines. 🥬 High-Frequency Detail Preservation: Keeping tiny, sharp bits like edges, textures, and small text. Train the memory so it can rebuild sharp frames. If lost, faces and logos turn mushy. 🍞 A knitted sweater still shows its weave after compression.
🍞 Picture folding a blanket to fit a small drawer. 🥬 Memory Compression Model: A neural net that shrinks long video history into a short context while keeping important details. It encodes history into compact tokens. If too lossy, identity and objects drift. 🍞 It fits 20s of video into ~5k tokens without losing a character’s freckles.
🍞 Think of pulling a book from a huge shelf quickly. 🥬 Frame Retrieval Quality: How faithfully a single frame can be reconstructed from the compressed memory. During pretraining, random frames are targets. Poor retrieval means memory missed key details. 🍞 Recover the exact close-up at 12.3s with the same facial expression.
🍞 Choosing between many notes or just the essentials. 🥬 Context Length–Quality Trade-off: Finding a sweet spot between tiny memory and sharp reconstructions. Test different compression rates, then pick what stays crisp enough. 🍞 ~5k tokens for 20s that still look great is a practical balance.
🍞 Building with LEGO blocks. 🥬 Neural Architecture Design: Using 3D convolutions (time then space) plus attention, and injecting features after the DiT’s first projection to keep detail. Bad design blurs motion or texture. 🍞 The model remembers hair motion across frames and the strands within a frame.
🍞 Checking with a scorecard. 🥬 PSNR/SSIM Metrics: Numbers that rate reconstruction quality and structure preservation. Higher is better. Without them, we’d only guess by eye. 🍞 A model scoring higher SSIM means it keeps patterns like window grids intact.
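Since PSNR/SSIM come up throughout the results, here is a minimal scoring example with scikit-image; the two frames are synthetic placeholders for a ground-truth frame and a frame retrieved from compressed memory.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
original = rng.random((256, 256, 3))                                   # stand-in ground-truth frame
reconstructed = np.clip(original + 0.05 * rng.standard_normal(original.shape), 0, 1)

psnr = peak_signal_noise_ratio(original, reconstructed, data_range=1.0)
ssim = structural_similarity(original, reconstructed, data_range=1.0, channel_axis=-1)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")                        # higher is better for both
```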
03 Methodology
At a high level: Long Video History → Memory Compressor φ(H) → (Pretraining: Random Frame Retrieval) → Finetune AR Diffusion with φ(H) → Autoregressive Generation.
Step-by-step recipe:
- Prepare history and targets (pretraining):
- What happens: Take a long video (e.g., 20s) and compress it with a learnable memory encoder φ(·). Randomly pick some frame indices as targets; the rest are heavily noised (noise-as-mask). Clone the clean targets for the model to reconstruct.
- Why this step exists: Random targets force the compressor to care about every moment, not just the start or end. Noise-as-mask makes the task hard enough to learn strong representations.
- Example: From a cooking video, choose frames at 3.5s, 9.2s, and 17.1s as targets; mask all others with strong noise; ask the model to recover the true targets using only φ(H).
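A minimal sketch of this preparation step; the number of targets and the noise level are illustrative choices, not the paper's exact recipe.

```python
import torch

def make_retrieval_batch(history, num_targets=3, mask_noise_std=5.0):
    """history: (B, C, T, H, W). Pick random target frames; hide every other frame behind noise."""
    b, c, t, h, w = history.shape
    target_idx = torch.randperm(t)[:num_targets]          # random moments anywhere in the timeline
    is_target = torch.zeros(t, dtype=torch.bool)
    is_target[target_idx] = True

    clean_targets = history[:, :, is_target].clone()       # what the retriever must reproduce
    masked = history.clone()                                # noise-as-mask for all non-target frames
    masked[:, :, ~is_target] += mask_noise_std * torch.randn_like(masked[:, :, ~is_target])
    return masked, clean_targets, target_idx

masked, targets, idx = make_retrieval_batch(torch.randn(2, 3, 40, 32, 32))
print(sorted(idx.tolist()), targets.shape)
```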
- Train with a diffusion transformer as the retriever:
- What happens: A Diffusion Transformer (DiT) tries to denoise the target frames using the compressed context φ(H) and the prompt. We keep the DiT lightweight to train via LoRA. The objective rewards accurate frame reconstructions.
- Why this step exists: Diffusion is great at image/video detail. If it can reconstruct random targets well, the memory must be rich.
- Example: The DiT, conditioned on φ(H), recovers the close-up of grandma’s face at second 12 with her cardigan texture intact.
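A simplified version of this retrieval training step: a toy compressor, a toy denoiser, a random diffusion time, and a plain noise-prediction loss. The real system uses a full DiT tuned with LoRA; every module and hyperparameter here is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Toy stand-in for the LoRA-tuned DiT: predicts the noise in a target frame."""
    def __init__(self, frame_ch=3, mem_ch=8):
        super().__init__()
        self.net = nn.Conv2d(frame_ch + mem_ch + 1, frame_ch, kernel_size=3, padding=1)

    def forward(self, noisy_frame, memory_map, t):
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *noisy_frame.shape[-2:])
        return self.net(torch.cat([noisy_frame, memory_map, t_map], dim=1))

phi = nn.Conv3d(3, 8, kernel_size=4, stride=4)        # toy memory compressor
denoiser = TinyDenoiser()
opt = torch.optim.AdamW(list(phi.parameters()) + list(denoiser.parameters()), lr=1e-4)

def retrieval_training_step(history, target_frame):
    memory = phi(history)                                              # compact context
    memory_map = F.interpolate(memory.mean(dim=2), size=tuple(target_frame.shape[-2:]))
    t = torch.rand(target_frame.shape[0])                              # random diffusion time in [0, 1)
    noise = torch.randn_like(target_frame)
    noisy = (1 - t.view(-1, 1, 1, 1)) * target_frame + t.view(-1, 1, 1, 1) * noise
    pred_noise = denoiser(noisy, memory_map, t)
    loss = F.mse_loss(pred_noise, noise)                               # reward accurate retrieval
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

history = torch.randn(2, 3, 16, 32, 32)
target = history[:, :, int(torch.randint(0, 16, (1,)))]                # a random target frame
print(retrieval_training_step(history, target))
```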
- Memory encoder architecture (φ):
- What happens: Use two branches—(a) low-res/low-fps video processed through the VAE and DiT’s patchifier+first projection; (b) high-res/high-fps branch encoded with 3D convolutions and attention. Add the high-res residual enhancement after the DiT’s first projection (e.g., 3072 channels) so we skip narrow VAE bottlenecks.
- Why this step exists: The LR branch gives global structure cheaply; the HR residual injects fine detail. Adding after the first projection keeps sharpness.
- Example: The model keeps the room layout from LR and the magazine’s printed letters from HR.
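A shape-level sketch of the two branches. The widths are shrunk (64 instead of the paper's 3072 channels), the "VAE latents" and "high-res frames" are random tensors, and the exact patchify/projection layout is assumed; the point is where the residual gets added.

```python
import torch
import torch.nn as nn

INNER = 64        # stand-in for the DiT inner width (the paper mentions e.g. 3072 channels)
LATENT_C = 16     # stand-in for the narrow VAE latent channels

first_projection = nn.Linear(LATENT_C * 2 * 2, INNER)         # DiT 2x2 patchify + first projection
hr_residual = nn.Sequential(                                   # high-res/high-fps residual branch
    nn.Conv3d(3, 32, kernel_size=(2, 4, 4), stride=(2, 4, 4)),
    nn.SiLU(),
    nn.Conv3d(32, INNER, kernel_size=(1, 4, 4), stride=(1, 4, 4)),   # finish reduction + project
)

def encode_memory(lr_latents, hr_frames):
    """lr_latents: (B, LATENT_C, T, H, W) from the VAE; hr_frames: matching high-res/high-fps clip."""
    b, c, t, h, w = lr_latents.shape
    patches = lr_latents.unfold(3, 2, 2).unfold(4, 2, 2)               # 2x2 spatial patches
    patches = patches.permute(0, 2, 3, 4, 1, 5, 6).reshape(b, -1, c * 4)
    tokens = first_projection(patches)                                 # (B, N, INNER) LR tokens
    res = hr_residual(hr_frames)                                       # high-res detail residual
    res = res.flatten(2).transpose(1, 2)                               # (B, N, INNER)
    return tokens + res          # residual added AFTER the projection, past the VAE bottleneck

lr = torch.randn(1, LATENT_C, 4, 8, 8)     # toy latent grid -> 4 * 4 * 4 = 64 tokens
hr = torch.randn(1, 3, 8, 64, 64)          # toy high-res/high-fps history
print(encode_memory(lr, hr).shape)         # -> (1, 64, 64)
```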
- Compression schedule (reduce time, then space):
- What happens: 3D convs first downsample temporally (e.g., factor 2 or 4), then spatially (e.g., 2×2). Hidden channels ramp 64→128→256→512, stay at 512, then a 1×1 conv projects to match the DiT inner channels (e.g., 3072 or 5120).
- Why this step exists: Motion patterns live across time; handling time first helps preserve dynamics before shrinking space, which protects temporal consistency.
- Example: A running dog’s stride stays smooth across frames, and its fur still shows texture.
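The schedule above, written as a small PyTorch stack. The kernel sizes, per-block strides, and toy input shape are assumptions; the time-before-space ordering, the 64→128→256→512 channel ramp, and the final 1×1 projection follow the description.

```python
import torch
import torch.nn as nn

def block(cin, cout, stride):
    return nn.Sequential(nn.Conv3d(cin, cout, kernel_size=3, stride=stride, padding=1), nn.SiLU())

dit_inner = 3072                       # e.g. 3072 for one DiT size; 5120 for a larger one
hr_encoder = nn.Sequential(
    block(3,   64,  (2, 1, 1)),        # temporal /2 first ...
    block(64,  128, (2, 1, 1)),        # ... temporal /2 again (total /4)
    block(128, 256, (1, 2, 2)),        # then spatial /2
    block(256, 512, (1, 2, 2)),        # spatial /2 again (total 4x4 spatially)
    block(512, 512, (1, 1, 1)),        # stay at 512 channels
    nn.Conv3d(512, dit_inner, kernel_size=1),   # 1x1x1 projection to match the DiT channels
)

x = torch.randn(1, 3, 16, 64, 64)      # toy high-res/high-fps history clip
print(hr_encoder(x).shape)             # -> (1, 3072, 4, 16, 16)
```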
- Preventing shortcuts with random frame sampling:
- What happens: Randomly choose target frames across the whole history every time.
- Why this step exists: Without randomness, the compressor could cheat by stuffing all capacity into a few frames (like only the ending), ignoring the rest.
- Example: If we always picked the last 2s, the compressor might memorize just that part. Randomization blocks that.
- Finetune the autoregressive (AR) generator with the pretrained φ:
- What happens: Freeze or lightly tune φ, attach it as the history encoder to the DiT, and fine-tune with LoRA so the generator learns to use compact memory for next-chunk prediction.
- Why this step exists: Now the generator doesn’t need the full, huge history—just φ(H). This saves context length and compute while keeping fidelity.
- Example: The model continues a news interview keeping the reporter’s mic, the guest’s red jacket, and the street background consistent.
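A hand-rolled LoRA layer, as a sketch of what "fine-tune with LoRA" means mechanically. The paper's actual adapter placement inside the DiT is not shown here; only the rank-128 setting is taken from the text.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze a pretrained Linear layer and learn a low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 128, alpha: int = 128):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                # the big model stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero-init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(3072, 3072), r=128)                   # rank 128, as in the paper
x = torch.randn(2, 10, 3072)
print(layer(x).shape, sum(p.numel() for p in layer.parameters() if p.requires_grad))
```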
- Autoregressive inference loop:
- What happens: Generate a few seconds, append them to history (or compress on-the-fly since φ is mostly convolutional), compress again, and repeat.
- Why this step exists: This turns short chunks into long videos while keeping memory small and consistent.
- Example: Build a 30s scene in 5s steps, still remembering a cat that appeared at 6s.
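A schematic of the rollout loop; `phi` and `generate_chunk` are toy stand-ins for the pretrained compressor and the LoRA-tuned AR diffusion sampler.

```python
import torch
import torch.nn as nn

phi = nn.Conv3d(3, 8, kernel_size=4, stride=4)         # stand-in memory compressor

def generate_chunk(memory, prompt, frames=8, size=32):
    """Stand-in for the AR diffusion sampler; returns a random next 'chunk' of video."""
    return torch.randn(memory.shape[0], 3, frames, size, size)

def rollout(first_chunk, prompt, total_frames=40):
    history = first_chunk
    while history.shape[2] < total_frames:
        memory = phi(history)                           # recompress everything generated so far
        chunk = generate_chunk(memory, prompt)          # next few seconds, conditioned on memory
        history = torch.cat([history, chunk], dim=2)    # append and repeat
    return history

video = rollout(torch.randn(1, 3, 8, 32, 32), prompt="a cat appears at second 6")
print(video.shape)    # the video grows chunk by chunk; the generator only ever reads phi(history)
```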
The secret sauce:
- Train the memory on the exact skill you need later: retrieving sharp frames anywhere in the timeline. This directly bakes in detail preservation.
- Place the HR residual after the DiT’s first projection to avoid thin bottlenecks and keep high-frequency features.
- Use time-first 3D downsampling to respect motion.
Mini sandwiches for key tools:
🍞 You know how graphic novels have panels that move through time? 🥬 VAE (Variational Autoencoder): A tool that turns images into compact latents and back again. It helps process frames in a smaller space. Without it, memory and compute explode. 🍞 A 480p frame becomes a small latent grid the model can handle.
🍞 Cutting a big picture into tiles. 🥬 Patchifying: Splitting latents into patches/tokens the transformer can read. Without patches, the transformer can’t scale to big images/videos. 🍞 A frame becomes a grid of tokens like LEGO tiles.
🍞 Watching a flipbook and noticing motion. 🥬 3D Convolution: A filter over height, width, and time. It summarizes motion patterns. Without it, the model might miss temporal rhythms. 🍞 Footsteps and fabric flutters stay coherent.
🍞 Whispering extra hints into each layer. 🥬 Cross-Attention (optional add-on): Connects the compressor’s features to each DiT block for refinement. It boosts consistency in tough cases but costs more compute. 🍞 A store shelf keeps its exact item order across shots.
🍞 Adding a small clip-on to a big backpack. 🥬 LoRA: A lightweight way to fine-tune big models by training small adapter matrices. Without it, fine-tuning would be too heavy. 🍞 You can adapt a 12.8B video model on a lab budget.
Practical settings from the paper:
- Compress ~20s of history to ~5k tokens; compression rates like 4×4×2 or 2×2×1 (in latent space) balance cost vs. detail.
- Trained on ~5M internet videos; LoRA rank 128; works with HunyuanVideo 12.8B and Wan 5B/14B.
- Optional tweaks: small sliding window (e.g., 3 latent frames) to keep single shots going; multiple compressors (e.g., 4×4×2 + 2×2×8) to capture different detail types; cross-attention for ultra-difficult detail ordering.
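The settings above, gathered into one illustrative config. Field names and the axis order of the compression rates are assumptions; the numeric values are the ones quoted in this list.

```python
from dataclasses import dataclass

@dataclass
class MemoryConfig:
    history_seconds: int = 20              # ~20 s of history kept in memory
    context_tokens: int = 5000             # ~5k compressed context tokens
    compression_rate: tuple = (4, 4, 2)    # written 4x4x2 in the text; axis order assumed
    lora_rank: int = 128
    sliding_window_latents: int = 3        # optional tiny sliding window of recent latent frames
    use_cross_attention: bool = False      # optional: inject memory into every DiT block
    extra_compressors: tuple = ()          # optional: e.g. ((2, 2, 8),) for a second compressor

print(MemoryConfig())
```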
04 Experiments & Results
The test: Measure how well the compressor reconstructs frames (PSNR/SSIM/LPIPS), how consistent long AR videos are (clothes, identity, objects), and overall quality (aesthetics, clarity, dynamics, semantic alignment). Also run user studies (ELO) for human preference.
The competition: Compare against simpler compression like a Large Patchifier (akin to FramePack), variants that drop the LR/HR branches, and external pipelines like Wan I2V + Qwen-Image-Edit with 1–3 image inputs as context.
Scoreboard with context:
- Reconstruction (Table 1):
- Large Patchifier (4×4×2): PSNR 12.93, SSIM 0.412, LPIPS 0.365 — like a blurry C grade; structure changes a lot.
- Only LR (4×4×2): PSNR 15.21, SSIM 0.472, LPIPS 0.212 — better, but still loses detail.
- Without LR (4×4×2): PSNR 15.73, SSIM 0.423, LPIPS 0.198 — similar story.
- Proposed (4×4×2): PSNR 17.41, SSIM 0.596, LPIPS 0.171 — clearly sharper; structure better preserved.
- Proposed (2×2×2): PSNR 19.12, SSIM 0.683, LPIPS 0.152 — more detail with longer context.
- Proposed (2×2×1): PSNR 20.19, SSIM 0.705, LPIPS 0.121 — the crispest reconstructions here (like moving from a B to an A).
- Consistency (Table 2): Cloth / Identity / Instance (higher is better) + Human ELO:
- Wan I2V + QwenEdit (2 images): Cloth 95.09, Identity 68.22, Instance 91.19, ELO 1198 — strong object stability but limited long memory.
- Proposed (4×4×2): Cloth 96.12, Identity 70.73, Instance 89.89, ELO 1216 — balanced, strong consistency with compact memory.
- Proposed (2×2×2): Cloth 96.71, Identity 72.12, Instance 90.27, ELO 1218 — nudges detail higher with more context.
- Base models (Table 3):
- HunyuanVideo 12.8B (4×4×2): Aesthetic 61.27, Clarity 67.49, Dynamics 71.22, Semantic 26.29, ELO 1189.
- Wan 2.2 14B (4×4×2): Aesthetic 67.22, Clarity 69.37, Dynamics 69.81, Semantic 27.12, ELO 1231 — a sweet spot for quality.
- Wan 2.2 5B (4×4×2): Aesthetic 66.25, Clarity 69.01, Dynamics 65.13, Semantic 25.99, ELO 1215 — solid with smaller size.
Surprising findings:
- Pretraining matters a lot. Without it, models ignored relevant history more often, causing identity and outfit hiccups; with it, temporal consistency improved visibly (faces, clothes, stylistic continuity, and camera motion).
- On short-style clips with frequent shot changes, error accumulation (drifting) largely faded; continuous long single shots benefited from a tiny sliding window to keep the same shot flowing across generations.
- Cross-attention from the compressor into every DiT block helps with extremely tricky order-sensitive details (like shelf item layout), but adds compute.
- Using multiple compressors together (e.g., one more temporal, one more spatial) recovers very fine details (like newspaper text) at the cost of more context length.
Efficiency notes:
- The system compresses 20s history into ~5k tokens and runs on consumer GPUs (e.g., RTX 4070 12GB) for inference. Training used 8×H100 for large-scale pretraining; LoRA fine-tunes ran on single H100/A100 nodes without special tricks.
Big picture: Compared to prior compression approaches, this method lands a better balance point—shorter context yet higher frame fidelity—validated by both metrics and humans.
05 Discussion & Limitations
Limitations:
- Perfect retrieval is impossible at very high compression; some fine patterns (tiny text, micro-textures) can still fade, especially with aggressive rates.
- Heavy pretraining data (≈5M videos) and good captions/storyboards help a lot; in niche domains with little data, memory may miss important cues.
- Continuous single-shot scenes can still drift without careful settings; a tiny sliding window or cross-attention helps but costs more.
- The method focuses on frame-preserving memory; if the downstream task needs heavy creative deviations, too-strong memory might resist changes.
- Optional enhancements (cross-attention, dual compressors) increase compute or context length, so they’re not free.
Required resources:
- For pretraining at scale: multi-GPU clusters (e.g., 8×H100). For fine-tuning with LoRA and inference: a single H100/A100 or even consumer GPUs for moderate settings.
- A dataset of diverse, quality videos and decent captions (VLM-generated storyboards worked well in the paper).
When not to use:
- Ultra-creative tasks that must ignore long history cues (e.g., jump-cut surreal edits) might not benefit from strong memory preservation.
- Extremely tiny devices where even ~5k tokens are too large, or where 3D conv + attention is too heavy.
- Cases where exact long-range consistency is not required and a simpler short-window model suffices.
Open questions:
- Adaptive compression: Can the model vary compression rate per scene difficulty (e.g., more tokens for faces and text, fewer for skies)?
- Task-aware memory: Could the compressor learn to preserve details important for a given storyboard goal (e.g., identity vs. object layout)?
- Better quality metrics: PSNR/SSIM help, but can we design perceptual, story-consistency metrics closer to human judgment?
- Unified retrieval and generation training: Can we co-train retrieval and AR generation end-to-end without losing stability or efficiency?
- Multimodal memory: How to blend audio, subtitles, and text prompts into one compact, detail-preserving memory for richer storytelling?
06 Conclusion & Future Work
Three-sentence summary: This paper pretrains a memory compressor that learns to reconstruct random frames from long video histories, forcing it to preserve high-frequency details while shrinking the context to roughly 5k tokens per 20 seconds. Plugged into an autoregressive diffusion model and fine-tuned with LoRA, it delivers long-range consistency of characters, clothes, objects, and scenes on modest hardware. Across metrics and user studies, it achieves a strong balance between context length and perceptual fidelity, beating simpler compression baselines.
Main achievement: Turning frame retrieval into the explicit pretraining objective for memory, then reusing that memory to power long, coherent autoregressive video generation with compact context.
Future directions: Develop adaptive, task-aware compression that changes rate by scene complexity; integrate cross-attention or multiple compressors intelligently; co-train retrieval and generation; extend to multimodal memory with audio/text; and design richer human-aligned consistency metrics.
Why remember this: It shows a practical path to long-form video storytelling that fits in small memory without throwing away the details that make scenes feel real—like a tiny, powerful scrapbook for an entire movie.
Practical Applications
- Produce multi-shot story videos (short films, vlogs) with stable characters and outfits on consumer GPUs.
- Create consistent brand ads where logos, colors, and product details stay sharp across scenes.
- Generate educational videos with recurring teachers, diagrams, or lab setups that remain clear over time.
- Previsualize movie scenes with steady props and camera styles to help directors plan shots.
- Extend sports highlights while preserving player identities and jersey numbers across plays.
- Make newsroom explainers where reporters, lower-thirds, and set designs stay consistent.
- Animate comics or storyboards into longer sequences without losing panel-specific details.
- Improve video editing workflows where past footage guides style and object consistency in new shots.
- Build game cutscenes that maintain character identity and environmental details as stories progress.
- Generate product demos that keep fine text (labels, specs) legible across different angles and times.