FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

Intermediate
FSVideo Team, Qingyu Chen, Zhiyuan Fang et al. · 2/2/2026
arXiv · PDF

Key Summary

  • FSVideo is a new image-to-video generator that runs about 42× faster than popular open-source models while keeping similar visual quality.
  • It gets this speed by compressing videos into a tiny, smart code (64×64×4 downsampling with 128 channels) so the big transformer has far fewer tokens to process.
  • A special 'Layer Memory' lets each transformer layer look back at earlier layers’ features, boosting reuse and stability with almost no extra cost.
  • A two-stage approach first makes a low-res video in the tiny space, then sharpens details with a fast latent upsampler and a lightweight high-res refiner.
  • The autoencoder uses a new Video VF loss to align its latent space with strong vision features (DINOv2), improving generation and lowering latent complexity.
  • Training mixes smart tricks: flow matching, Pseudo-Huber loss, dynamic masking, deviation-based latent estimation, and small-step distillation for speed.
  • On VBench (720×1280), FSVideo scores competitively (88.12% total) against larger or slower models.
  • In human tests, it beats or matches many open models, especially when speed and consistency from the first frame matter.
  • FSVideo is built for real apps like photo re-enactment and VFX, where users often supply a first image and want fast, good-looking motion.
  • The design shows a path forward: keep model capacity big, but cut token count and reuse information to get speed without sacrificing quality.

Why This Research Matters

Fast, good-looking video generation means creators can iterate ideas in seconds, not minutes, saving time and money. Apps like photo re-enactment, advertising mockups, and classroom demos become snappier, making AI video tools more accessible to everyday users. Studios and indie teams can preview shots quickly and reserve heavy compute budgets for only the best candidates. On mobile and edge devices, shrinking token counts brings real-time effects closer to reality. By focusing on token efficiency and information reuse, not just smaller models, FSVideo points toward sustainable scalability as demand for higher resolutions and longer clips grows.

Detailed Explanation

01 Background & Problem Definition

You know how waiting for a video to finish exporting can feel like forever when your computer has to process tons of pixels? That’s what happened with early AI video models: they were powerful, but slow and expensive to run, especially for long or high-resolution clips. People loved the impressive results from models like Sora, Veo, and Wan, but the long wait times and big GPU bills made them hard to scale to everyone.

The world before FSVideo looked like this: diffusion and flow-based models made great videos, but their sampling required many steps, and each step was heavy. Researchers tried training-free accelerations—like smarter solvers, caches, sparse attention, or sampling at lower resolution—but these helped only so much and sometimes hurt image quality. Others tried training-based accelerations: shrink the model, swap attention for lighter parts, or distill many steps into just a few. Those sped things up, but often lost crisp details, motion integrity, or consistency—especially if you pushed steps down to 1–2. In practice, many production systems still used 4–8 steps for quality, which dulled the promised speedups.

So what was the real problem? Each forward pass was simply too expensive. Even with fewer steps, the model was pushing around a mountain of tokens. Imagine trying to write a book summary by reading every page in full every time—you’ll go slow no matter how fast you skim. The missing piece was to reduce the amount of stuff the model must touch per step without wrecking quality.

One promising idea was to compress the video first, then run the heavy model in that tiny space. Some teams tried this with up to 32× spatial compression and got speed, but reconstruction artifacts and weak generation quality showed up, especially for longer or higher-res clips. A concurrent line of work explored very deep compression, even 64×64 spatial, but either evaluated quality at milder compression, or adapted a pretrained transformer instead of training one from scratch in the highly compressed space. The field lacked a full, from-scratch recipe that balanced extreme compression with strong generation—and then brought back the fine details.

Enter FSVideo. The key insight is simple: keep the model big enough to be smart, but shrink the number of tokens it sees. FSVideo builds a new video autoencoder (FSAE) that compresses by 64×64 spatially and 4× temporally, resulting in a 384× reduction in information, yet still reconstructs well and supports strong generation. On top of that, FSVideo adds a diffusion transformer with a new Layer Memory that reuses useful signals from earlier layers, improving depth-wise information flow at tiny extra cost. Finally, a two-stage generation upsamples and refines the video in the latent space so details come back fast without bloating runtime.

Why should anyone care? Faster, cheaper, and still-good video generation unlocks real applications. Think photo re-enactment where you upload a portrait and instantly get natural head motion, or social media where creators preview effects in seconds. Studios and pros can iterate more quickly on VFX, education, and advertising. With FSVideo, a 5-second 720p clip that used to be painfully slow now comes out much faster (42.3× speedup versus a strong open baseline), while keeping quality competitive on public benchmarks and human studies. The big story is not about squeezing the model capacity—it’s about squeezing the tokens and improving how the model uses what it sees. That shift turns high-quality video generation from a waiting game into something closer to real-time.

02 Core Idea

🍞 Top Bread (Hook): Imagine packing a huge suitcase into a tiny, tidy cube bag before a trip. You carry far less bulk, but your clothes still pop back nicely when you unpack.

🥬 The Concept: FSVideo’s key insight is this: don’t make the model smaller; make the video representation smaller and smarter, then reuse information cleverly. It compresses video into a highly compact latent space, runs a capable diffusion transformer with a new Layer Memory for better recall, and finishes with a quick latent upsampler/refiner to restore fine details.

How it works (big picture):

  1. Compress the video into a tiny latent space (64×64×4 downsampling with 128 channels) using a video autoencoder (FSAE).
  2. Generate motion and structure in this compact space using a diffusion transformer (DiT) upgraded with Layer Memory to reuse earlier-layer features.
  3. Upsample and refine the latent to higher resolution with a lightweight second-stage DiT and a convolutional latent upsampler.
  4. Decode the refined latent back to pixels with a strong decoder that also uses the input image’s features for extra sharpness.

Why it matters: Without shrinking tokens and reusing signals, even great transformers are stuck hauling too much data every step—slow and costly.

🍞 Bottom Bread (Anchor): A user uploads a portrait (first frame) and asks for a 5-second smile-and-wave video. FSVideo makes a small latent movie first (fast), then quickly sharpens it, and finally decodes it into a crisp clip—much quicker than older systems.

— Multiple analogies —

  1. Packing analogy: Vacuum-pack the video (autoencoder), sketch the motion in the packed form (DiT with memory), then fluff and iron it (upsampler/refiner) before wearing.
  2. City planner analogy: Plan roads on a mini-map (latent space) so designs are fast, then build details on the chosen plan (refiner), instead of planning full-size streets from the start.
  3. Cooking analogy: Make a concentrated sauce (compact latent), decide the flavors quickly (DiT with memory), then dilute and garnish (upsampler/refiner) for final taste.

Before vs After:

  • Before: Big models process many tokens per step, rely on many steps or heavy compute; attempts to cut steps or shrink models often lose quality.
  • After: FSVideo keeps model brains large but slashes tokens with 64×64×4 compression, improves feature reuse with Layer Memory, and restores details via a targeted, few-step latent refiner—achieving order-of-magnitude speedups with competitive quality.

Why it works (intuition):

  • Token efficiency: Fewer, better tokens mean every attention operation is cheaper.
  • Memory across layers: Layer Memory prevents each layer from “forgetting,” stabilizing deep reasoning and temporal coherence.
  • Two-stage generation: Coarse-to-fine splits the job: movement first, micro-details later, so each stage is simple and fast.

Building blocks (with Sandwich explanations):

🍞 You know how a phone camera makes a small preview before saving the full photo? It’s quick but still useful. 🥬 Video Autoencoder (FSAE): It’s a smart compressor that turns a video into a tiny code and can rebuild it later.

  • How it works: (1) The encoder shrinks H×W×T into a small latent with 64×64×4 downsampling and 128 channels; (2) Training uses pixel, perceptual, and GAN losses; (3) A special Video VF loss aligns latents with strong vision features to make them more generative; (4) The decoder reconstructs frames, boosted by first-frame features for sharper details.
  • Why it matters: Without a strong compressor, tokens explode and everything runs slowly; too-weak compression hurts quality. 🍞 Example: A 1024×1024×121 video becomes a tiny latent block the transformer can handle quickly, then decodes back into sharp video.

🍞 Imagine you’re doing homework and keep a notes booklet so you can quickly look up earlier steps. 🥬 Diffusion Transformer with Layer Memory: It’s a generator that, at every layer, can peek back at earlier layers’ features instead of only the most recent ones.

  • How it works: (1) Normal self-attention makes queries from the current layer; (2) Keys/values come from a learned mix of all past layers (a dynamic router picks which to use); (3) This creates a shallow “memory” across depth with tiny overhead; (4) Cross-attention also brings in text and first-frame cues.
  • Why it matters: Without this memory, deep transformers can collapse into repetitive features and waste capacity. 🍞 Example: To keep the character’s face consistent across frames, later layers can pull details from earlier clean representations.

🍞 Think of drawing a sketch first, then going over it with fine pens. 🥬 Multi-Resolution Generation Strategy (Latent Upsampler + Refiner): First draft in low-res latent, then a tiny refiner sharpens to high-res.

  • How it works: (1) A convolutional latent upsampler increases spatial size by 2×; (2) A few-step high-res DiT refiner fixes artifacts and restores details; (3) Dynamic masking and deviation-based conditioning help the refiner improve, not just copy.
  • Why it matters: If you try full detail from the start, it’s slow and brittle; two stages are faster and more reliable. 🍞 Example: The model first nails the motion of waving, then adds crisp hair strands and cloth texture in a few extra steps.

03 Methodology

At a high level: Input image and prompt → Encode image/video into compact latent (FSAE) → Base DiT generates low-res latent video → Latent upsampler (×2) → High-res DiT refiner (few steps) → FSAE decoder (with first-frame features) → Output video.
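
To make that data flow concrete, here is a minimal inference-pipeline sketch. All module names and method signatures (fsae, base_dit, latent_upsampler, refiner_dit, their sample/encode/decode calls) are hypothetical placeholders for the components described above, not a released API.

```python
import torch

def generate_video(first_frame, prompt, fsae, base_dit, latent_upsampler, refiner_dit,
                   base_steps=32, refine_steps=8):
    """Sketch of FSVideo's two-stage image-to-video inference, under assumed module names."""
    # 1) Encode the conditioning image into the highly compressed latent space (64x64x4, 128 ch).
    cond_latent = fsae.encode(first_frame)

    # 2) The base DiT generates a low-resolution latent video, conditioned on the
    #    text prompt and the first-frame latent.
    low_res_latent = base_dit.sample(cond=cond_latent, prompt=prompt, num_steps=base_steps)

    # 3) A convolutional latent upsampler doubles the spatial resolution of the latent.
    up_latent = latent_upsampler(low_res_latent)

    # 4) A lightweight high-res DiT refiner runs only a few steps to restore fine detail.
    refined_latent = refiner_dit.sample(init=up_latent, cond=cond_latent,
                                        prompt=prompt, num_steps=refine_steps)

    # 5) Decode back to pixels; the decoder also cross-attends to first-frame features.
    return fsae.decode(refined_latent, first_frame=first_frame)
```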

Step 0: Prerequisite concept 🍞 You know how when you read a paragraph, your eyes focus most on the important words? 🥬 Self-Attention: It’s a way for a model to decide which tokens (parts of data) to focus on more.

  • How it works: (1) Turn tokens into queries, keys, values; (2) Compare queries to keys to get importance scores; (3) Mix values using those scores; (4) Repeat in layers.
  • Why it matters: Without it, the model treats every token the same and misses key relationships. 🍞 Anchor: When answering “What color is the ball?”, attention boosts “color” and “ball,” not filler words.
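
To ground the query/key/value mechanics, here is a minimal single-head self-attention in PyTorch. It is a generic textbook sketch, not FSVideo-specific code.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over tokens x of shape (batch, tokens, dim)."""
    q = x @ w_q                                               # queries from every token
    k = x @ w_k                                               # keys
    v = x @ w_v                                               # values
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # how relevant each token is to each other
    weights = F.softmax(scores, dim=-1)                       # importance scores sum to 1
    return weights @ v                                        # mix values by importance

# Tiny example: 1 clip, 4 tokens, 8-dim features
x = torch.randn(1, 4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)   # shape (1, 4, 8)
```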

Step 1: Compact representation with FSAE (Video Autoencoder)

  • What happens: The FSAE encoder compresses frames by 64×64 spatially and 4× temporally into a latent with 128 channels; the decoder can reconstruct them. Training uses L1, LPIPS, GAN losses, and a new Video VF loss to align latents with DINOv2 features (lowering intrinsic dimension and improving generation).
  • Why this step exists: Cutting tokens early makes every later transformer step lighter; the VF loss makes the latent space easier to model, boosting generation quality.
  • Example data: A 512×512×61 clip becomes a tiny latent grid; the decoder, enhanced with first-frame cross-attention and non-causal/group-causal convs, reconstructs stable, low-flicker video.
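
For a back-of-the-envelope feel of what 64×64×4 compression does to the token count, here is a small calculation. It assumes the frame count divides evenly by 4; the model's exact handling of the first frame may differ, so the numbers are illustrative.

```python
# Rough token-count arithmetic for FSAE's 64x64 spatial / 4x temporal compression.
H, W, T, C_in = 512, 512, 60, 3        # pixel video (height, width, frames, RGB)
sp, tp, C_lat = 64, 4, 128             # spatial stride, temporal stride, latent channels

latent_h, latent_w, latent_t = H // sp, W // sp, T // tp     # 8 x 8 x 15
tokens = latent_h * latent_w * latent_t                      # 960 latent tokens

pixel_values  = H * W * T * C_in                             # ~47.2M numbers
latent_values = tokens * C_lat                               # ~123K numbers
print(tokens, pixel_values / latent_values)                  # 960 tokens, 384x fewer values
```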

🍞 You know how a teacher compares your summary to a textbook to check you captured the meaning? 🥬 Video VF Loss: It aligns the autoencoder’s latent features with strong vision model features (DINOv2) so the latent keeps meaningful semantics.

  • How it works: (1) Map latent channels to match DINOv2 feature channels; (2) Resize to match space/time; (3) Use cosine and distance-matrix losses to align; (4) Fine-tune AE with this extra loss.
  • Why it matters: Without semantic alignment, the latent may reconstruct pixels but be harder to generate from. 🍞 Anchor: After alignment, the model more reliably knows “this blob is a dog’s ear,” helping the generator draw ears correctly in new videos.
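
Below is a simplified sketch of the cosine-alignment part of such a feature-alignment loss. The projection layer, interpolation choices, and omission of the distance-matrix term are assumptions for illustration; this is not the paper's exact Video VF formulation or its DINOv2 plumbing.

```python
import torch
import torch.nn.functional as F

def vf_cosine_loss(latent, teacher_feats, proj):
    """Illustrative feature-alignment term: project AE latents to the teacher's channel
    dimension, resize to the teacher's grid, and maximize cosine similarity.

    latent:        (B, C_lat, T, H, W)   autoencoder latent
    teacher_feats: (B, C_t,   T, Ht, Wt) frozen vision features (e.g., DINOv2 per frame)
    proj:          nn.Conv3d mapping C_lat -> C_t with kernel size 1 (assumed)
    """
    z = proj(latent)                                          # match channel count
    z = F.interpolate(z, size=teacher_feats.shape[-3:],       # match time/space grid
                      mode="trilinear", align_corners=False)
    cos = F.cosine_similarity(z, teacher_feats, dim=1)        # per-location similarity
    return (1.0 - cos).mean()                                 # 0 when perfectly aligned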

Step 2: Base DiT with Layer Memory generates low-res latent video

  • What happens: Tokens come from the compact latent (no extra patchifying). The DiT is trained via flow matching with a logit-normal time schedule and Pseudo-Huber loss for stable gradients. The new Layer Memory builds keys/values from a learned mix of all previous layers using a dynamic router, while queries come from the current layer—enabling depth-wise reuse.
  • Why this step exists: It models motion and structure efficiently in a tiny space and uses memory to avoid representation collapse and improve temporal consistency.
  • Example data: A 121-frame sequence at 24 fps produces a latent video; later layers can “borrow” early clean face details while refining motion of eyes and mouth.

🍞 Imagine stacking notes from every class so you can quickly grab any page you need. 🥬 Layer Memory Self-Attention: Each layer can attend to a learned mix of all earlier layers, not just the previous one.

  • How it works: (1) A router computes weights over past layers conditioned on diffusion time; (2) Softmaxed weights blend past hidden states; (3) Keys/values come from this blend; (4) Attention runs as usual.
  • Why it matters: Without it, very deep models can repeat themselves and forget helpful signals from earlier layers. 🍞 Anchor: The model keeps a character’s identity stable by reusing earlier, cleaner features even in later, complex motion layers.
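
A minimal sketch of the router-plus-blend idea follows, assuming the router is a single linear layer over a diffusion-time embedding and that keys/values come from the softmax-weighted blend of past hidden states. The real block's exact parameterization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerMemoryAttention(nn.Module):
    """Sketch of layer-memory self-attention: queries come from the current layer;
    keys/values come from a blend of earlier layers' hidden states, with blend
    weights predicted from the diffusion-time embedding. Illustrative only."""

    def __init__(self, dim, max_layers, num_heads=8):
        super().__init__()
        self.router = nn.Linear(dim, max_layers)            # scores over past layers
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, past_states, t_emb):
        # x:           (B, N, D)  current layer's tokens
        # past_states: list of L earlier hidden states, each (B, N, D)
        # t_emb:       (B, D)     diffusion-time embedding conditioning the router
        mem = torch.stack(past_states, dim=1)               # (B, L, N, D)
        logits = self.router(t_emb)[:, : mem.shape[1]]      # (B, L) scores for layers so far
        w = F.softmax(logits, dim=-1)[:, :, None, None]     # (B, L, 1, 1)
        blended = (w * mem).sum(dim=1)                      # (B, N, D) depth-wise memory
        out, _ = self.attn(query=x, key=blended, value=blended)
        return out
```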

Step 3: Latent upsampler and high-res refiner (a sketch of the upsampler follows this list)

  • What happens: A convolutional latent upsampler (projection → pixel-shuffle → 16 residual blocks) increases spatial resolution by 2×. Then a lightweight high-res DiT refiner runs only a few NFEs, guided by special conditioning tricks so it fixes artifacts rather than copying them.
  • Why this step exists: The base DiT focuses on getting motion/structure right cheaply; the refiner restores fine details (textures, edges) quickly.
  • Example data: A 720p latent draft becomes sharper textures on jackets and hair after refiner passes.
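
Here is a sketch of that projection → pixel-shuffle → residual-block upsampler. The channel widths and per-frame (2D) treatment of the video latent are assumptions; only the overall structure follows the description above.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class LatentUpsampler(nn.Module):
    """Illustrative 2x convolutional latent upsampler: project, pixel-shuffle, refine."""
    def __init__(self, latent_ch=128, hidden_ch=512, num_blocks=16):
        super().__init__()
        self.proj = nn.Conv2d(latent_ch, hidden_ch * 4, 1)   # 4x channels feed the shuffle
        self.shuffle = nn.PixelShuffle(2)                    # trade channels for 2x spatial size
        self.blocks = nn.Sequential(*[ResBlock(hidden_ch) for _ in range(num_blocks)])
        self.out = nn.Conv2d(hidden_ch, latent_ch, 3, padding=1)

    def forward(self, z):                                    # z: (B*T, 128, h, w) per-frame latents
        x = self.shuffle(self.proj(z))                       # (B*T, hidden, 2h, 2w)
        return self.out(self.blocks(x))                      # (B*T, 128, 2h, 2w)
```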

🍞 Think of using tape to cover the shiny parts while repainting a bike—you protect what’s good and fix what’s not. 🥬 Dynamic Masking: The refiner gets confidence masks telling it where to trust the first frame and where to correct generated frames.

  • How it works: (1) Compare low-res latent and its upsampled version to estimate error; (2) Turn differences into soft mask scores (0–1); (3) Keep first-frame mask at 1; (4) Randomly replace/shuffle frames during training to strengthen robustness.
  • Why it matters: Without soft masks, the model may over-trust noisy frames or underuse the reliable first frame, causing inconsistency. 🍞 Anchor: The refiner keeps the exact shirt color from the first frame while gently fixing blurry sleeves in later frames.
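
One way such soft confidence masks could be computed is sketched below: measure disagreement between the low-res latent and its upsampled counterpart, then map small errors to high confidence. The error-to-mask mapping and the temperature are illustrative assumptions, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def dynamic_mask(low_res_latent, hi_res_latent, temperature=1.0):
    """Illustrative soft confidence mask for the refiner: large disagreement -> mask near 0,
    small disagreement -> mask near 1. The first-frame mask is later clamped to 1."""
    # Bring the hi-res latent back to the low-res grid and measure per-location error.
    down = F.interpolate(hi_res_latent, size=low_res_latent.shape[-2:],
                         mode="bilinear", align_corners=False)
    err = (down - low_res_latent).abs().mean(dim=1, keepdim=True)   # (B*T, 1, h, w)
    return torch.exp(-err / temperature)                            # soft scores in (0, 1]
```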

🍞 Picture checking your drawing by looking at how far it is from the template, then purposely nudging it so you learn to correct it. 🥬 Deviation-Based Latent Estimation: During training, the refiner sees a deliberately imperfect low-res condition so it learns to repair errors, not copy them.

  • How it works: (1) Use flow-matching predictions from the base model to estimate a clean latent; (2) Perturb it based on noise level σ to keep it imperfect; (3) Feed this as condition so the refiner learns restoration; (4) At inference, stop perturbing and use the real low-res latent.
  • Why it matters: If the condition is too perfect, the refiner won’t learn to fix artifacts. 🍞 Anchor: Slightly off facial contours in the condition force the refiner to learn to correct jawlines and eye corners.
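
A sketch of this idea follows, assuming the common flow-matching convention x_t = (1 − σ)·x₀ + σ·ε with velocity v = ε − x₀ (so the clean-latent estimate is x̂₀ = x_t − σ·v̂). The perturbation schedule is an illustrative guess, not the paper's exact recipe.

```python
import torch

def deviation_based_condition(x_t, v_pred, sigma, deviation_scale=1.0):
    """Training-time condition for the refiner: estimate the clean latent from the base
    model's flow-matching prediction, then deliberately re-perturb it in proportion to
    the noise level, so the refiner learns to repair errors rather than copy its input."""
    x0_hat = x_t - sigma * v_pred                        # estimated clean latent
    noise = torch.randn_like(x0_hat)
    return x0_hat + deviation_scale * sigma * noise      # keep the condition imperfect

# At inference time the perturbation is dropped and the real low-res latent is used directly.
```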

🍞 Imagine sliding down a smooth ramp instead of stepping down lots of stairs. 🥬 Flow Matching (training the DiT): It teaches the model the velocity to move from noise to data along a simple path.

  • How it works: (1) Mix data and noise by a factor σ; (2) Train the model to predict the velocity from the mixture toward the data; (3) Sample by following this learned flow; (4) Use Pseudo-Huber loss for stable training.
  • Why it matters: Simpler trajectories and stable gradients make training convergence better. 🍞 Anchor: The model learns a smooth way to morph pure noise into a waving person without zig-zagging.
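
The training step can be sketched as below, assuming the common linear-interpolation convention x_t = (1 − σ)·x₀ + σ·ε with velocity target v = ε − x₀ and a logit-normal draw of σ; the model signature and the Pseudo-Huber constant c are assumptions for illustration.

```python
import torch

def pseudo_huber(x, c=0.03):
    """Pseudo-Huber penalty: ~L2 for small errors, ~L1 for large ones (stable gradients)."""
    return (torch.sqrt(x.pow(2) + c ** 2) - c).mean()

def flow_matching_loss(model, x0, text_emb, c=0.03):
    """Illustrative flow-matching training step with a Pseudo-Huber loss."""
    b = x0.shape[0]
    # Logit-normal time schedule: sigma in (0, 1) via a sigmoid of a standard Gaussian.
    sigma = torch.sigmoid(torch.randn(b, device=x0.device)).view(b, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = (1.0 - sigma) * x0 + sigma * eps            # noisy sample on the straight path
    v_target = eps - x0                               # velocity along that path
    v_pred = model(x_t, sigma.flatten(), text_emb)    # model predicts the velocity (assumed signature)
    return pseudo_huber(v_pred - v_target, c=c)
```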

Step 4: Decoding with first-frame features

  • What happens: The FSAE decoder reconstructs pixels; cross-attention injects first-frame encoder features into decoder blocks to boost faithfulness and reduce flicker.
  • Why this step exists: The first frame is reliable ground truth in image-to-video; using it at decode time preserves identity, colors, and layout.
  • Example data: The subject’s unique freckles remain consistent across frames.
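
A minimal sketch of such first-frame cross-attention is shown below: decoder features act as queries, the first frame's encoder features supply keys/values, and the result is injected residually. The layout is an assumption; the actual FSAE decoder blocks may differ.

```python
import torch
import torch.nn as nn

class FirstFrameCrossAttention(nn.Module):
    """Illustrative decoder block that borrows detail from the reliable first frame."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, dec_tokens, first_frame_tokens):
        # dec_tokens:         (B, N, D) tokens of the frame currently being decoded
        # first_frame_tokens: (B, M, D) encoder features of the first frame
        attended, _ = self.attn(query=self.norm(dec_tokens),
                                key=first_frame_tokens, value=first_frame_tokens)
        return dec_tokens + attended     # residual injection of first-frame detail
```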

Finishing touch: Step distillation for the refiner 🍞 You know how a teacher first shows you the long way to solve a problem, then teaches a quick shortcut that reaches the same answer? 🥬 Step Distillation: Compress many sampling steps into fewer ones so inference runs faster with minimal quality loss.

  • How it works: (1) Distill classifier-free guidance (CFG); (2) Progressively distill to 32 steps; (3) Use SiDA to 8 steps; (4) Keep enough steps for later RL fine-tuning.
  • Why it matters: Fewer steps mean faster videos; too few without distillation hurts quality. 🍞 Anchor: The refiner needs just 8 NFEs to sharpen a 5-second clip while staying trainable in RL.

Secret sauce summary:

  • Extreme token reduction (64×64×4) without killing quality (Video VF loss, strong decoder).
  • Layer Memory for deeper, more stable DiT representations.
  • Two-stage latent sharpening with dynamic masks and deviation-based conditioning.
  • Efficient training: flow matching + Pseudo-Huber loss, and distilled few-step refiner.

04 Experiments & Results

The Test: The team evaluated two big things: (1) how well the autoencoder reconstructs videos after huge compression, and (2) how good and how fast the full image-to-video system generates results.

Autoencoder reconstruction: Using 256×256×17 clips from Inter-4K and WebVid-10M, they measured SSIM/PSNR (higher is better), LPIPS/FVD (lower is better). Despite 64×64×4 compression (384× info reduction), FSAE-Standard outperformed other high-compression AEs like LTX-Video and VidTok, and even beat some lower-compression AEs (e.g., Cosmos-CV) on several metrics. Visually, FSAE held textures steady across frames (less flicker), and FSAE-Lite offered similar quality with ~1.75–2× lower memory/time.

Full I2V quality: On VBench 2.0 (720×1280), FSVideo scored 88.12% total, 95.39% I2V score, and 80.85% quality score—competitive with strong open models. Against systems built on Wan 2.1 DiT, FSVideo achieved the best total score while using a much higher compression rate, showing that extreme token cutting did not doom quality. In human pairwise preference tests (GSB), FSVideo strongly beat HunyuanVideo and LTX-Video and performed on par with Wan 2.1 14B. It trailed Wan 2.2 14B—currently a top open I2V system—but given FSVideo’s limited compute/data budget, the gap seems addressable with more training.

Speed: On H100 GPUs with FlashAttention-3, generating a 5-second 720×1280 24 fps clip, FSVideo achieved dramatic speedups. In a two-GPU setup (so parameters fit without heavy offloading), FSVideo ran in 19.4 seconds versus 822.1 seconds for the Wan2.1-14B-720P baseline—about 42.3× faster for a similar number of function evaluations. In a single-GPU setup with offloading, FSVideo completed in 76.6 seconds where the baseline ran out of memory. With future FP8/quantization to avoid memory constraints, the team estimates ≥58.7× speedup. Because FSVideo reduces compute per step, its gains multiply nicely with training-free step-reduction tricks, promising even more speed.

What helped: The Layer Memory improved optimization: from scratch, the loss curve stayed consistently below the baseline and converged faster; when added to a pretrained model, it reached a stable ~4.7% loss improvement in just ~1,000 steps. For RL fine-tuning, domain adapting the reward model (VideoAlign) to FSVideo’s generations, and conditioning the reward on the first frame, were essential to prevent reward hacking and to enforce first-frame consistency.

Scoreboard with context:

  • Autoencoder: At extreme compression (64×64×4), FSAE-Standard got SSIM/PSNR close to or above lower-compression systems and much better FVD than other high-compression baselines. That’s like scoring an A when others at similar difficulty get B− or C.
  • VBench (720p): FSVideo’s 88.12% total is within the competitive cluster, while being far faster and more memory-friendly.
  • Speed: 42.3× faster is the headline—akin to finishing a 5-minute race in 7 seconds when others need 5+ minutes, with similar form.

Surprising findings:

  • Ultra-compressed latents can still support strong generation if you align semantics (Video VF loss) and help the decoder with first-frame features.
  • A tiny architectural tweak (Layer Memory) meaningfully stabilizes deep DiTs without big overhead.
  • The refiner learns better when fed imperfect low-res conditions (deviation-based estimation) and guided by soft, dynamic masks—preventing overfitting to noisy inputs.
  • RL for videos benefits from reward domain adaptation and explicitly seeing the first frame to reward temporal consistency.

05 Discussion & Limitations

Limitations:

  • Data and compute: FSVideo was trained under tighter resource constraints than some competitors (e.g., Wan 2.2), likely capping current quality. More, cleaner, and motion-rich data would help.
  • Extreme micro-details: At 64×64×4 compression, some ultra-fine textures or tiny text may still be challenging, especially in fast motion or complex lighting.
  • Scope: This work targets image-to-video. While it can plug into text-to-image-to-video, pure text-to-video and multi-scene storytelling remain open challenges.
  • Two-stage complexity: The upsampler + refiner pipeline adds components to manage, and training coordination (masks, deviation, dropout) is non-trivial.
  • Memory fit: Two 14B DiTs plus AE can strain single-GPU setups without offloading or quantization.

Required resources:

  • GPUs with high memory (e.g., H100s) for fastest inference and RL; otherwise use FSDP, context parallelism, and offloading.
  • The FSAE autoencoder (Standard or Lite) checkpoints; the base and refiner DiT weights; reward models (VideoAlign, MPS) for RL stages.
  • Training infra for multi-stage schedules (flow matching, SFT, RL) and data pipelines with captions and first-frame conditioning.

When not to use:

  • If you need pure text-to-video without a starting image and have no budget for a text-to-image front-end.
  • If your scenario demands ultra-long, multi-scene films today (beyond ~5–10 seconds at high fps/res) without stitching.
  • If you cannot accommodate two-stage generation (latent upsampler + refiner) due to system complexity.

Open questions:

  • Can we push compression even further while preserving micro-text and logos? Are hybrid tokenizers (semantic + texture codes) helpful?
  • How far does Layer Memory generalize—to MMDiT, multimodal video+audio, or 3D/4D generation?
  • Can we automate dynamic masking and deviation schedules per scene for even better restoration?
  • What are the best strategies for few-step global generation that keep motion integrity while supporting large movements?
  • How do we integrate audio and editing operations (inpainting, stylization) into the same latent pipeline without blowing up latency?

06 Conclusion & Future Work

FSVideo in three sentences: It keeps model capacity high but slashes the number of tokens using a new 64×64×4 video autoencoder, then adds a memory-savvy diffusion transformer and a fast, two-stage latent upsampler/refiner to rebuild details. This design yields competitive video quality with dramatic speedups—about 42× faster than a strong open baseline at 720p, with robust reconstruction and temporal consistency. The approach shows that the right place to optimize is token efficiency and information reuse, not just shrinking the model.

Main achievement: Demonstrating that extreme latent compression plus Layer Memory and targeted latent refinement can deliver order-of-magnitude faster image-to-video generation without a major quality sacrifice.

Future directions: Push to higher resolutions and longer durations by refining the autoencoder and token routing; extend Layer Memory to multimodal and MMDiT architectures; improve training for larger motion and prompt faithfulness; integrate audio; and distill more steps while preserving adaptability for RL.

Why remember this: FSVideo reframes the speed-vs-quality tradeoff. Instead of making the brain smaller, it feeds the brain fewer, better tokens and teaches it to remember and reuse what it already computed—turning high-quality video generation from a waiting game into a near-real-time experience.

Practical Applications

  • Instant photo re-enactment: Turn a single portrait into a short, natural motion video (blink, smile, head turn).
  • Social media effects: Fast previews of stylized motions or transitions before posting.
  • Education: Create short science or history explainer clips from a single image and a prompt.
  • Advertising: Rapidly prototype product motion shots for client review.
  • VFX previsualization: Generate quick motion drafts to explore camera moves and blocking.
  • Gaming: Animate NPC portraits or 2D characters into short expressive loops.
  • News/media: Produce B-roll style motion from still photos to enrich segments.
  • E-commerce: Animate product images to show fit, spin, or material behavior.
  • Accessibility: Generate sign-language snippets from reference frames for quick communication aids.
  • Research tooling: Use the compressed latent pipeline as a fast testbed for new training losses or RL rewards.
#FSVideo · #image-to-video · #video diffusion transformer · #highly-compressed latent space · #64×64×4 compression · #Layer Memory self-attention · #latent upsampler · #dynamic masking · #deviation-based latent estimation · #flow matching · #VBench · #step distillation · #video autoencoder (FSAE) · #video generation speedup · #reward fine-tuning (ReFL/GRPO)
Version: 1