
SemanticGen: Video Generation in Semantic Space

Intermediate
Jianhong Bai, Xiaoshi Wu, Xintao Wang et al. · 12/23/2025
arXiv · PDF

Key Summary

  • SemanticGen is a new way to make videos that starts by planning in a small, high-level 'idea space' (semantic space) and then adds the tiny visual details later.
  • This two-stage recipe speeds up training and makes it easier to create long videos that stay consistent over time.
  • A lightweight MLP squeezes big semantic features into a smaller, easier-to-learn space, which helps the model converge faster.
  • The method uses two diffusion models: one to generate compact semantic features and one to turn those features into video latents for decoding to pixels.
  • For long videos, SemanticGen does full attention only in the tiny semantic space and uses shifted-window attention in the bigger latent space to save compute.
  • On standard tests (VBench and VBench-Long), SemanticGen matches top short-video models and clearly wins on long-video consistency with lower drift.
  • A surprising result is that compressing semantic features to very low dimensions (like 8) can make training faster and outputs cleaner.
  • Compared to doing everything in the VAE latent space, generating in the semantic space learns faster and keeps global story structure better.
  • Limits remain: very fine textures can drift in long clips, and low-fps semantic encoders can miss super-fast motions.
  • This approach can make creative tools cheaper, faster, and more reliable for storytelling, education, ads, and more.

Why This Research Matters

SemanticGen makes video generation faster, cheaper, and more reliable by planning in a small, meaningful space before adding visual details. This helps creators produce longer, more coherent stories without massive compute budgets. It also reduces energy use, which is good for the planet. Teachers and students can turn lessons into steady, clear videos that don’t fall apart halfway through. Advertisers and filmmakers can keep characters and scenes consistent for a full minute, not just a few seconds. Overall, it’s a step toward AI tools that think about meaning first and pixels second, which makes them smarter and more useful.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how when you make a movie with friends, you first agree on the story and the scenes, and only later worry about costumes and tiny props? Planning first makes everything smoother.

🥬 Filling (The Actual Concept):

  • What it is: Video generation is when a computer makes a moving picture from instructions, like a text prompt.
  • How it works: Step 1) The computer learns patterns from lots of videos. Step 2) Given a prompt, it imagines a rough plan. Step 3) It fills in the details frame by frame. Step 4) It renders pixels you can watch.
  • Why it matters: Without good planning, the video can look confused: characters change clothes, backgrounds flicker, and long stories drift off-course.

🍞 Bottom Bread (Anchor): If you ask for “a kite flying over the beach at sunset,” good video generation keeps the same beach and the same kite across the whole clip, not a new kite every second.

🍞 Top Bread (Hook): Imagine cleaning a foggy window a little at a time until the view is clear.

🥬 Filling (The Actual Concept):

  • What it is: A diffusion model is a method that starts from noise (static) and gradually removes it to reveal a clear image or video.
  • How it works: 1) Add noise during training to learn how to undo it. 2) At generation time, start with noise. 3) Repeatedly denoise with a learned model. 4) End up with a clean result.
  • Why it matters: If the model can’t undo noise steadily, the video looks muddy or inconsistent.

🍞 Bottom Bread (Anchor): Starting from TV static, the model “unfogs” until you see a person riding a bike along a road, just like the prompt.
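For readers who want to see the "unfogging" loop in code, here is a minimal sketch of an Euler-style flow/diffusion sampler, assuming PyTorch; the `denoiser` network and the 50-step schedule are placeholders for illustration, not the paper's actual model or noise schedule.

```python
import torch

@torch.no_grad()
def sample(denoiser, shape, num_steps=50, device="cpu"):
    """Start from pure noise and step-by-step remove it (Euler-style flow sampling)."""
    x = torch.randn(shape, device=device)                  # pure "TV static"
    times = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = times[i], times[i + 1]
        velocity = denoiser(x, t)                          # model's guess of how to "unfog" at time t
        x = x + (t_next - t) * velocity                    # take one small denoising step
    return x                                               # clean result after the last step
```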

🍞 Top Bread (Hook): Think of a photocopier that makes a small, smart summary of a page so you can quickly store and share it.

🥬 Filling (The Actual Concept):

  • What it is: A Variational Autoencoder (VAE) turns big videos into smaller “latent” codes and back again.
  • How it works: 1) The encoder squashes pixels into compact latents. 2) A generator learns patterns in those latents. 3) The decoder turns latents back into video frames.
  • Why it matters: Working in the smaller latent space is faster than pushing around millions of raw pixels.

🍞 Bottom Bread (Anchor): Instead of editing a huge 4K film reel directly, you edit a neat summary that can still reconstruct the full movie.

🍞 Top Bread (Hook): Imagine a cartoon storyboard that shows the main characters, where they stand, and how they move—without tiny textures yet.

🥬 Filling (The Actual Concept):

  • What it is: Semantic representation is a compact description of what’s happening—who, where, how they move—without every pixel detail.
  • How it works: 1) A semantic encoder watches frames at low fps. 2) It makes tokens that capture layout, objects, and motion. 3) These tokens are small but meaningful.
  • Why it matters: If you try to plan with all pixels at once, you get overwhelmed and slow; semantic summaries let you plan globally first.

🍞 Bottom Bread (Anchor): “Two kids toss a red ball in a park” becomes a tidy set of tokens: two kids, grass field, ball’s path—no grass blades counted yet.

The world before this paper: Most top video generators learned how VAE latents are distributed and used heavy, full attention across many frames. That worked for short clips but struggled with long videos: too many tokens, too much compute, slow convergence, and often drift (the scene or style slowly changes). Some tried sparse attention or autoregressive tricks, but then motion got jumpy or visuals degraded.

The specific problem: How do we keep long videos consistent and speed up training without losing visual quality? Existing VAEs don’t compress enough for truly long sequences, and modeling everything with full attention is too expensive.

🍞 Top Bread (Hook): You know how making a plan first and decorating later is easier than decorating while you’re still inventing the plot?

🥬 Filling (The Actual Concept):

  • What it is: The missing piece was to start generation in an even smaller, smarter space—the semantic space—so the model can plan the whole story globally.
  • How it works: 1) First generate compact semantic tokens that define the global layout and motion. 2) Then map those tokens into VAE latents that add fine-grained detail.
  • Why it matters: Without this global planning step, long clips drift, and training takes far more compute.

🍞 Bottom Bread (Anchor): Instead of animating every pixel of a 1-minute snowboarding video from scratch, the model first builds a tiny plan of the slope, the rider’s path, and camera moves, then paints the details.

Real stakes: Faster convergence means fewer GPUs and less energy—greener AI. Better long-term consistency means more believable stories, ads, lessons, and documentaries generated from text. For creators and students, it’s like getting a reliable “co-director” that keeps characters, places, and actions steady across a whole minute, not just a few seconds.

02 Core Idea

🍞 Top Bread (Hook): Imagine writing a comic: first sketch the boxes and character poses (big ideas), then ink and color (details).

🥬 Filling (The Actual Concept):

  • What it is: The key insight is to generate videos first in a compact semantic space for global planning, then add high-frequency visual details in the VAE latent space.
  • How it works: 1) Train a diffusion model to generate compact semantic tokens from text. 2) Train another diffusion model that, given those tokens, produces VAE latents. 3) Decode latents into pixels. 4) For long videos, do full attention in the tiny semantic space and windowed attention in the larger latent space.
  • Why it matters: This two-stage path converges faster, uses less compute, and avoids long-video drift.

🍞 Bottom Bread (Anchor): For “a bee hovering above a flower,” the model first creates a simple plan for the bee’s path and the flower’s spot, then paints the wings, light, and textures.

Three analogies for the same idea:

  1. Storyboards then filming: Plan shots first (semantic), film and color later (details).
  2. GPS route then sightseeing: Choose the route (semantic), notice the trees and cafes while driving (details).
  3. Baking: Mix the batter (semantic), then do the frosting patterns (details).

Before vs After:

  • Before: Models worked in VAE space directly, which was big and heavy; long videos ballooned to hundreds of thousands of tokens, slowing everything and causing drift.
  • After: Models plan in a much smaller semantic space, so they see the whole story at once, then add details locally—training is quicker and long clips stay on track.

🍞 Top Bread (Hook): You know how a map that only shows major roads is easier to read than a map with every driveway?

🥬 Filling (The Actual Concept — Semantic Encoder):

  • What it is: A semantic encoder turns a video into compact, meaningful tokens that capture layout, objects, and motion.
  • How it works: 1) Sample frames at low fps. 2) Group image patches into tokens. 3) Compress across space and time so you’re left with a small set of big-idea tokens.
  • Why it matters: Planning is possible only if the plan is small enough to process with full attention.

🍞 Bottom Bread (Anchor): Qwen2.5-VL’s vision tower makes a handful of tokens that summarize a minute of action, like “man turns left,” “camera pulls back,” “ball rolls.”
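As a rough illustration only, a semantic encoder can be sketched as "sample frames sparsely, then tokenize"; `vision_tower` below stands in for a pretrained encoder such as a VLM's vision side, and its call signature is an assumption, not the paper's interface.

```python
import torch

def encode_semantics(video, vision_tower, src_fps=24, target_fps=1.6):
    """Turn a (T, C, H, W) clip into a small set of high-level tokens."""
    # 1) Sample frames at a low fps: a 24 fps clip keeps roughly 1 frame out of every 15.
    stride = max(1, round(src_fps / target_fps))
    frames = video[::stride]                               # (T', C, H, W) with T' << T
    # 2) The encoder turns each kept frame into patch tokens (assumed output: (T', N, D)).
    tokens = vision_tower(frames)
    # 3) Flatten time and space into one compact token set that summarizes the clip.
    return tokens.reshape(-1, tokens.shape[-1])            # (T' * N, D)
```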

🍞 Top Bread (Hook): Think of rolling clothes tightly to fit a suitcase.

🥬 Filling (The Actual Concept — Semantic Space Compression):

  • What it is: A lightweight MLP squeezes high-dimensional semantic features into a smaller, easier space that looks roughly Gaussian (bell-shaped) for sampling.
  • How it works: 1) Take big semantic vectors. 2) Use an MLP to output mean and variance. 3) Add a regularizer so the compressed space behaves nicely for diffusion sampling. 4) Use these compact embeddings to guide video generation.
  • Why it matters: Without compression, training is slower and outputs are worse at fixed steps; with compression, convergence speeds up.

🍞 Bottom Bread (Anchor): When they shrank from 2048 dimensions to 64 or even 8, the model trained faster and looked cleaner in tests.
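Here is a minimal sketch of such a compressor, assuming PyTorch; the hidden size and the use of a KL-style regularizer weight are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SemanticCompressor(nn.Module):
    """Squeeze high-dimensional semantic features (e.g., 2048-d) into a tiny,
    roughly Gaussian space (e.g., 64-d or even 8-d). Layer sizes are illustrative."""

    def __init__(self, in_dim=2048, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.SiLU(), nn.Linear(512, 2 * z_dim))

    def forward(self, feats):
        mean, logvar = self.net(feats).chunk(2, dim=-1)
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)    # reparameterized sample
        # KL-style regularizer nudges the compressed space toward a standard Gaussian,
        # which keeps it well-behaved for diffusion sampling later.
        kl = 0.5 * (mean.pow(2) + logvar.exp() - 1.0 - logvar).mean()
        return z, kl
```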

🍞 Top Bread (Hook): Imagine whispering a hint to a friend drawing a scene: “Put the girl near the window and make the camera pull back.”

🥬 Filling (The Actual Concept — In-Context Conditioning):

  • What it is: The model feeds the compact semantic tokens right alongside the noisy video latents so the generator learns to follow the plan.
  • How it works: 1) Concatenate semantic tokens with the latent tokens. 2) Let attention mix them. 3) The model denoises while respecting the plan.
  • Why it matters: Without this hint-sharing, the details could ignore the big-picture plan and drift.

🍞 Bottom Bread (Anchor): Injecting semantic tokens from a reference clip made the new video keep the same layout and motion, even if textures changed.
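Conceptually, in-context conditioning just means "put the plan tokens in the same sequence as the noisy latents." The toy transformer block below, assuming PyTorch, makes that concrete; it is a simplification for illustration, not the paper's actual DiT block.

```python
import torch
import torch.nn as nn

class InContextBlock(nn.Module):
    """One transformer block that mixes plan tokens with noisy video-latent tokens.
    Dimensions and the single-block setup are illustrative."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, semantic_tokens, noisy_latents):
        # Concatenate the plan in front of the latents so attention can read it freely.
        x = torch.cat([semantic_tokens, noisy_latents], dim=1)    # (B, S + L, dim)
        x = x + self.attn(self.norm(x), self.norm(x), self.norm(x))[0]
        # Only the latent positions get denoised; the plan tokens just provide context.
        return x[:, semantic_tokens.shape[1]:]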

🍞 Top Bread (Hook): Picture reading a whole paragraph for meaning, then zooming into sentences to fix typos.

🥬 Filling (The Actual Concept — Shifted Window Attention for Long Videos):

  • What it is: For long videos, do full attention only in the tiny semantic space; in the larger VAE latent space, use shifted-window attention that only looks at nearby frames but shifts so information spreads.
  • How it works: 1) Global full attention on semantic tokens keeps story coherence. 2) Local windows in latents keep compute low. 3) Shift the windows layer by layer so details propagate.
  • Why it matters: Without this, full attention on all latents explodes in cost; pure local windows lose global story.

🍞 Bottom Bread (Anchor): A one-minute video keeps characters and scenes consistent because the plan is global, and the details spread steadily via sliding windows.
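The windowing idea can be sketched in PyTorch as below, with shapes simplified to one token per frame; the real model works on spatiotemporal latent tokens and interleaves semantic tokens as well, so treat this as a sketch, not the implementation.

```python
import torch
import torch.nn.functional as F

def windowed_attention(latents, window=16, shift=False):
    """Attention restricted to temporal windows; alternating layers shift the windows
    by half so information can spread across window boundaries.
    Assumes `latents` is (B, T, D) and T is a multiple of `window`."""
    B, T, D = latents.shape
    if shift:
        latents = torch.roll(latents, shifts=window // 2, dims=1)   # shifted windows
    x = latents.reshape(B * (T // window), window, D)               # split time into windows
    x = F.scaled_dot_product_attention(x, x, x)                     # full attention, but only inside a window
    x = x.reshape(B, T, D)
    if shift:
        # (a full implementation would also mask the wrapped-around positions)
        x = torch.roll(x, shifts=-(window // 2), dims=1)            # undo the shift
    return x
```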

Why it works (intuition): Videos are super redundant—many frames repeat structure. It’s wasteful to model every low-level token globally. Planning in a compact semantic space captures the essentials (who/where/how) with far fewer tokens, so full attention there is cheap and powerful. Then, adding details locally is efficient and stable. That’s the “aha!”

03 Methodology

At a high level: Text Prompt → Semantic Representation Generation (the tiny plan) → VAE Latent Generation (the details) → VAE Decoder → Final Video. Note that training is described below in the reverse order of this pipeline: Stage A trains the semantically conditioned latent generator (the painter), and Stage B trains the semantic generator (the planner); at inference time the planner runs first.

Step 0. Prerequisites and base model

🍞 Top Bread (Hook): Imagine we already own a smart video painter that works in a smaller sketchbook (latent space) instead of a giant canvas.

🥬 Filling (The Actual Concept):

  • What it is: A pre-trained text-to-video latent diffusion model with a 3D-VAE and a Transformer (DiT) backbone.
  • How it works: 1) 3D-VAE encodes/decodes frames to/from latents. 2) DiT denoises latents from noise to clean video conditioned on text. 3) Training uses flow/diffusion steps that teach it to remove noise.
  • Why it matters: This gives us a strong “detail painter” ready to be guided by semantics.

🍞 Bottom Bread (Anchor): When prompted “a car drives through a neon-lit city,” the base model can already make short, sharp clips.

Stage A. Train the VAE latent generator with semantic conditioning (teach the painter to follow a plan)

🍞 Top Bread (Hook): Think of giving the painter a tiny storyboard to follow while painting.

🥬 Filling (The Actual Concept):

  • What it is: We fine-tune the base diffusion model to denoise VAE latents while being fed compact semantic tokens (the plan).
  • How it works: 1) Feed the ground-truth video into the semantic encoder (e.g., Qwen2.5-VL vision tower) to get high-level tokens. 2) Compress them via a small MLP that outputs mean/variance; sample a compact embedding (Gaussian-like). 3) Add noise to VAE latents (training-time diffusion step). 4) Concatenate semantic tokens with noisy latents (in-context conditioning). 5) Train the model to denoise the latents while obeying the semantics.
  • Why it matters: Without this stage, the detail painter doesn’t learn to listen to the plan and may drift.

🍞 Bottom Bread (Anchor): If the ground-truth shows “the man turns his head left,” the semantic tokens encode that motion, and the denoiser learns to paint frames that follow that turn.

Concrete example with data: Suppose we have a 5-second, 480p, 24fps clip (≈120 frames). The semantic encoder samples at about 1.6 fps (≈8 frames) and compresses space (14×14 patches and more), ending with a small grid of semantic tokens. The MLP squeezes 2048-d vectors down to e.g., 64-d, which we feed with the noisy VAE latents. The model learns to denoise to the original clip.
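Putting those pieces together, a simplified Stage A training step might look like the sketch below; every module name, call signature, and the regularizer weight are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def stage_a_step(latent_denoiser, compressor, vision_tower, vae, video, optimizer):
    """One simplified training step: teach the 'painter' to denoise VAE latents
    while following compact semantic tokens (the 'plan')."""
    with torch.no_grad():
        latents = vae.encode(video)                   # clean VAE latents, e.g. (B, L, D)
        feats = vision_tower(video)                   # frozen semantic encoder features
    sem, kl = compressor(feats)                       # trainable MLP -> compact plan tokens
    t = torch.rand(latents.shape[0], 1, 1, device=latents.device)   # random diffusion time
    noise = torch.randn_like(latents)
    noisy = (1 - t) * latents + t * noise             # interpolate clean latents toward noise
    target = noise - latents                          # velocity the denoiser should predict
    pred = latent_denoiser(noisy, t, context=sem)     # denoise while obeying the plan
    loss = F.mse_loss(pred, target) + 1e-4 * kl       # denoising loss + light regularizer (weight is illustrative)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```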

Stage B. Train the semantic generator (teach the planner to imagine plans from text)

🍞 Top Bread (Hook): Now we want the storyboard itself to be created from the script.

🥬 Filling (The Actual Concept):

  • What it is: A second diffusion model learns to generate the compact semantic tokens from text prompts.
  • How it works: 1) Freeze the semantic encoder and the MLP. 2) Use the same compact semantic tokens as targets. 3) Train a diffusion model to denoise in this tiny semantic space from text. 4) Because the space is compressed and Gaussian-like, training converges fast.
  • Why it matters: Without a good planner, stage A has nothing to follow at inference.

🍞 Bottom Bread (Anchor): For “a bee hovering above a flower,” Stage B makes a tiny plan: bee’s path + flower location + camera framing.

Inference (putting it together)

🍞 Top Bread (Hook): Time to make a new video from scratch, like shooting a film guided by your storyboard.

🥬 Filling (The Actual Concept):

  • What it is: Two-stage sampling—first plan, then paint.
  • How it works: 1) Stage B: Sample compact semantic tokens from noise using the text prompt. 2) Stage A: Feed these tokens to the latent denoiser, which generates VAE latents. 3) VAE decoder turns latents into pixels. 4) During training, a little noise is optionally added to the semantic tokens so the denoiser learns to tolerate the imperfect plans it will see at inference, narrowing the train-inference gap.
  • Why it matters: This sequence ensures global story first, details second.

🍞 Bottom Bread (Anchor): “A car drives through a neon-lit cyberpunk street” → small plan of layout and motion → detailed neon reflections, rain streaks, and smooth camera move.
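Tying the two stages together at inference time, assuming the generic `sample(...)` helper sketched earlier and hypothetical model interfaces and shapes:

```python
import torch

@torch.no_grad()
def generate_video(prompt, text_encoder, semantic_generator, latent_generator, vae,
                   plan_shape=(1, 256, 64), latent_shape=(1, 4096, 16)):
    """Two-stage sampling: plan first in the tiny semantic space, then paint the details."""
    text = text_encoder(prompt)
    # Stage B (planner): sample a compact semantic plan from noise, conditioned on the text.
    plan = sample(lambda x, t: semantic_generator(x, t, text=text), plan_shape)
    # Stage A (painter): sample VAE latents from noise, conditioned on text AND the plan.
    latents = sample(lambda x, t: latent_generator(x, t, text=text, plan=plan), latent_shape)
    return vae.decode(latents)                        # latents -> pixels you can watch
```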

Extension to long videos (efficient and coherent)

🍞 Top Bread (Hook): Imagine reading the whole script (global plan) but filming each scene in chunks (manageable detail).

🥬 Filling (The Actual Concept):

  • What it is: Do full attention in the tiny semantic space for global coherence; do shifted-window attention in the larger VAE latent space for efficiency.
  • How it works: 1) Global full-attn over semantic tokens keeps characters and scenes consistent across a minute. 2) In latents, use windows of T_w frames; shift by T_w/2 on alternating layers to let info flow long-range without quadratic cost. 3) Interleave semantic and latent tokens so the plan guides the details in each window.
  • Why it matters: Without this, long videos either cost too much (full attention on everything) or lose the big picture (local-only views).

🍞 Bottom Bread (Anchor): A 60-second conversation scene keeps the same room, outfits, and lighting, while local windows handle the subtle face changes and camera cuts economically.
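To see why this matters for compute, here is a back-of-the-envelope comparison; the token counts are purely illustrative (the text above only says long clips reach hundreds of thousands of latent tokens), not measured values from the paper.

```python
# Rough attention-cost comparison for a hypothetical one-minute clip.
latent_tokens   = 200_000      # VAE latent tokens for ~60 s of video (illustrative)
semantic_tokens = 2_000        # compact plan tokens for the same clip (illustrative)
window          = 10_000       # latent tokens visible inside one temporal window

full_latent = latent_tokens ** 2                                    # quadratic in all latents
semanticgen = semantic_tokens ** 2 + (latent_tokens // window) * window ** 2
print(f"full attention on latents : {full_latent:.1e} token pairs")
print(f"semantic full + windowed  : {semanticgen:.1e} token pairs "
      f"(~{full_latent / semanticgen:.0f}x fewer)")
```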

The secret sauce

🍞 Top Bread (Hook): Think of two superpowers working together: a smart planner and a skilled painter.

🥬 Filling (The Actual Concept):

  • What it is: Semantic space compression + in-context conditioning + two-stage diffusion + shifted-window attention.
  • How it works: 1) Compress semantics so sampling is easy and fast. 2) Feed plan tokens directly into the denoiser. 3) Plan globally, then add details. 4) Keep compute low on long videos by mixing global semantics with local latent windows.
  • Why it matters: Any missing piece breaks the balance—no compression slows training; no semantic conditioning risks drift; no windowing explodes compute for long clips.

🍞 Bottom Bread (Anchor): That’s why SemanticGen can deliver one-minute videos that feel consistent, while training faster than sticking only to VAE latents.

04 Experiments & Results

The test: Researchers measured whether videos stayed consistent (same subject/background), avoided flicker, moved smoothly, and looked good (imaging/aesthetics). For long videos, they also checked drift—does quality or style fade from beginning to end?

🍞 Top Bread (Hook): Imagine grading two movies: Are the actors the same across scenes? Do the lights flicker oddly? Does the style hold up until the credits?

🥬 Filling (The Actual Concept):

  • What it is: Evaluation on VBench (short videos) and VBench-Long (long videos) using standardized prompts and metrics.
  • How it works: 1) Generate videos for shared prompts. 2) Score subject and background consistency, temporal flicker, motion smoothness, imaging and aesthetic quality. 3) For long clips, compute ΔM_drift—the difference between early and late segments (lower is better).
  • Why it matters: Without fair tests, we can’t tell if a method really keeps stories steady and visuals clean over time.

🍞 Bottom Bread (Anchor): If a snowflake is supposed to melt, you check that it actually melts (text-following) and the scene quality doesn’t fall apart near the end.
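For intuition only, a drift score in the spirit of ΔM_drift can be sketched as the gap between a quality metric at the start and at the end of a clip; the paper's exact definition and aggregation may differ.

```python
def drift(scores):
    """Gap between a quality metric early vs. late in a long video (lower is better).
    `scores` is a per-segment list, e.g. aesthetic quality for consecutive 10 s chunks."""
    k = max(1, len(scores) // 5)                        # compare the first and last ~20%
    head = sum(scores[:k]) / k
    tail = sum(scores[-k:]) / k
    return abs(head - tail)

# Made-up example: quality slowly decays over a 60-second clip.
print(drift([0.66, 0.65, 0.66, 0.64, 0.62, 0.61]))      # = 0.05 -> noticeable drift
```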

The competition: SemanticGen was compared to strong models like Wan2.1-T2V-14B and HunyuanVideo for short clips, and SkyReels-V2, Self-Forcing, and LongLive for long clips. They also included “continue training the base model” baselines (Base-CT and Base-CT-Swin) using the same data and steps to make the comparison fair.

Scoreboard with context:

  • Short videos (VBench): SemanticGen’s subject consistency 97.79% and background consistency 97.68% are right up there with top models (e.g., Wan2.1’s 97.23%/98.28%). Motion smoothness is 99.17%—that’s like an A+ in steadiness. Imaging and aesthetic quality (≈65%) are comparable to leading baselines.
  • Long videos (VBench-Long): SemanticGen shines on coherence: subject consistency 95.07% and background consistency 96.70%. Temporal flicker 98.31% and motion smoothness 99.55% are excellent. Imaging 70.47% and aesthetics 64.09% are strong. Most striking is drift ΔM_drift = 3.58%, the lowest among compared methods (Base-CT-Swin 5.20%, LongLive 4.08%, etc.)—like keeping your tone and style steady from first act to finale.

Surprising findings:

  1. Smaller can be better: Compressing semantic vectors from 2048 dimensions down to 64 or even 8 helped the semantic generator converge faster and reduced artifacts. On a 47-prompt test, lower dimensions improved several VBench metrics—counterintuitive but powerful.
  2. Semantic vs compressed VAE latents: Training the same two-stage setup but modeling compressed VAE latents (instead of semantic features) converged much slower. After equal steps, VAE-space results were still blotchy, while semantic-space results were already reasonable videos. This supports the central claim: plan in semantic space first.
  3. Reference semantics preserve structure: Injecting semantic tokens from a reference video made the generated clip keep the same layout and motion but change textures and colors—evidence that the semantics really encode high-level structure.

🍞 Top Bread (Hook): Like testing two study plans: one that makes a clear outline first, and one that jumps into memorizing details. Which one aces the final?

🥬 Filling (The Actual Concept):

  • What it is: Semantic-first beats detail-first in convergence speed and long-form consistency.
  • How it works: 1) Smaller token sets enable global attention without exploding compute. 2) The plan steers details to prevent drift. 3) Windowed attention in latents scales efficiently.
  • Why it matters: If you need one-minute videos with steady scenes and characters, this difference is decisive.

🍞 Bottom Bread (Anchor): In prompts like “a group talks in a modern home with alternating close-ups,” SemanticGen kept faces and backgrounds consistent much better than baselines, which sometimes shifted colors or introduced artifacts over time.

05 Discussion & Limitations

Limitations

🍞 Top Bread (Hook): Even a great movie plan can miss tiny props or super-fast actions.

🥬 Filling (The Actual Concept):

  • What it is: Current limits include fine texture consistency over very long spans and missing super-fast motions if the semantic encoder samples at low fps.
  • How it works: 1) Semantic tokens deliberately skip tiny details; over a minute, those details can drift. 2) Low-fps inputs to the semantic encoder can’t capture events that change within 1/24th of a second (like lightning flicker).
  • Why it matters: For tasks needing micro-detail accuracy or high-speed motion fidelity, quality can drop.

🍞 Bottom Bread (Anchor): A shimmering fabric pattern might subtly change over a long scene; a quick flash of lightning may not be reproduced.

Required resources

  • Pretrained text-to-video latent diffusion model (with 3D-VAE and DiT).
  • A strong video semantic encoder (e.g., Qwen2.5-VL vision tower).
  • GPUs capable of training diffusion models; long-video training benefits from memory-efficient attention (shifted windows).
  • Captioned video data for training; long-video data helps with consistency.

When NOT to use

  • If you need pixel-perfect, frame-by-frame microtextures (e.g., forensic restoration).
  • If your scene depends on ultra-high-fps micro-motions (e.g., hummingbird wings at 240 fps) and your semantic encoder samples too slowly.
  • If compute is so limited you can’t train two diffusion stages (though inference remains reasonable).

Open questions

🍞 Top Bread (Hook): What if our storyboard got even smarter without getting bigger?

🥬 Filling (The Actual Concept):

  • What it is: Future directions include testing different semantic encoders, keeping high temporal resolution while staying compact, and tighter coupling between planner and painter.
  • How it works: 1) Explore self-supervised vs vision-language encoders. 2) Design tokenizers that capture high-frequency motion while remaining small. 3) Blend global and local attention even more smoothly across scenes and shots.
  • Why it matters: Better semantic tokens could lift both detail fidelity and motion realism without sacrificing speed.

🍞 Bottom Bread (Anchor): A tokenizer that samples at higher fps but stays compact could let the model reproduce lightning flicker and fabric shimmer while keeping minute-long coherence.

06 Conclusion & Future Work

Three-sentence summary: SemanticGen generates videos by first creating a compact semantic plan (who, where, how they move) and then adding visual details in the VAE latent space. This two-stage approach converges faster, scales to long videos, and reduces drift by using full attention only where it counts—on small semantic tokens—and shifted-window attention for details. Experiments show short-video quality on par with leading models and clear wins in long-video consistency and stability.

Main achievement: Proving that “plan in semantic space, then paint details” is both faster to train and better for long-form coherence than operating solely in VAE latents.

Future directions: Build even better video semantic encoders that keep high temporal fidelity while staying compact; study more encoder types (vision-language and self-supervised); and further refine the interaction between global semantics and local details for multi-shot narratives, audio alignment, and higher fps.

Why remember this: It reframes video generation from pixel-first to meaning-first. By elevating planning to the semantic level, SemanticGen turns long videos from a compute headache into a tractable, coherent storytelling process—like using a great outline before writing your novel.

Practical Applications

  • Create minute-long ads or trailers that keep brand colors, characters, and scenes consistent.
  • Generate educational videos that maintain steady layouts (e.g., same lab bench, same teacher) across explanations.
  • Produce storyboards that directly turn into high-quality videos with faithful camera moves and character positions.
  • Build narrative shorts with multiple shots where faces, outfits, and locations remain coherent.
  • Make product demos that keep the same environment while showing different angles and motions.
  • Rapidly prototype movie scenes: plan semantics, then iterate on details without redoing the whole clip.
  • Generate nature or travel videos with stable landscapes and lighting over longer durations.
  • Create video game cutscenes where character continuity and camera choreography persist across minutes.
  • Assist accessibility tools by converting text descriptions into consistent, easy-to-follow visual stories.
  • Support social media creators with longer, drift-free clips that match prompts more reliably.
#Video generation #Diffusion model #Semantic representation #VAE latents #Two-stage generation #Semantic compression #In-context conditioning #Shifted-window attention #Long video consistency #VBench #Temporal drift #Qwen2.5-VL #Rectified flow #DiT (Diffusion Transformer)