Scaling Zero-Shot Reference-to-Video Generation
Key Summary
- Saber is a new way to make videos that match a text description while keeping the look of people or objects from reference photos, without needing special triplet datasets.
- Instead of collecting expensive image-video-text triplets, Saber learns only from regular video-text pairs by pretending some video frames are "reference images."
- During training, Saber hides (masks) parts of randomly chosen frames and learns to use the visible parts as references, which teaches it to keep identities consistent.
- It adds smart "mask augmentations" (like rotations and scaling) so the model doesn't just copy and paste the reference into the video.
- A tailored attention system, steered by attention masks, helps the model focus on the true reference regions and ignore background noise.
- Saber scales naturally to one or many reference images and even understands multiple views of the same subject.
- On the OpenS2V-Eval benchmark, Saber beats several strong systems that were trained with explicit R2V data, especially on keeping the subject consistent (NexusScore).
- At inference, Saber segments the subject from the reference image (or uses the whole image for background references) and blends it into the generated video according to the text.
- The method is built on a modern video diffusion transformer (Wan2.1-14B), trained with flow matching, and uses standard guidance at sampling time.
- Saber shows that high-quality reference-to-video generation can be trained at scale without building costly bespoke datasets.
Why This Research Matters
Personalized video creation becomes easier and cheaper when you don't need costly, special datasets to teach identity consistency. Creators can turn a few photos and a short prompt into videos where the main character really looks like the person or object they care about. Small studios, teachers, marketers, and hobbyists can make high-quality customized videos at scale using widely available video-text data. Masked training and attention masks reduce common copy-paste artifacts, so results look more natural and professional. As zero-shot R2V improves, we can expect more inclusive content creation tools that work well across diverse subjects and multiple references.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're making a class movie. You have a script (text) and photos of your classmates (references). You want the movie to follow the script while the classmates look exactly like themselves in every scene.
Generative Models
- What it is: A generative model is a computer program that learns patterns from data and then creates new things that look real, like pictures or videos.
- How it works:
- Study lots of examples (photos, videos, text).
- Learn the hidden patterns (shapes, colors, movements, words).
- Use those patterns to create new images or videos.
- Why it matters: Without generative models, the computer can't invent new content; it can only copy. Anchor: Like a student who reads many stories and then writes a new story in the same style.
Variational Autoencoder (VAE)
- What it is: A VAE is a tool that shrinks big images/videos into small "codes" (latents) and can rebuild them later.
- How it works:
- Encoder compresses a video into a small, meaningful code.
- Decoder turns the code back into a video.
- The code keeps important details but uses fewer tokens, saving compute.
- Why it matters: Without a VAE, training on full-resolution videos would be too slow and heavy. Anchor: It's like zipping a big file so it's easier to send, then unzipping it when needed.
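To make the compress-then-rebuild idea concrete, here is a minimal toy autoencoder sketch (plain, not variational, and far smaller than the Wan2.1 VAE used by Saber); all layer sizes and names are illustrative assumptions, not anything from the paper.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Toy encoder/decoder: shrinks a 3x64x64 image to a small 8x8x8 code and rebuilds it."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(               # 3x64x64 -> 8x8x8 code
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 8, 4, stride=4, padding=0),
        )
        self.decoder = nn.Sequential(               # 8x8x8 code -> 3x64x64 image
            nn.ConvTranspose2d(8, 16, 4, stride=4, padding=0), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        code = self.encoder(x)       # the compact "zipped" representation
        return self.decoder(code)    # the "unzipped" reconstruction

x = torch.rand(1, 3, 64, 64)
recon = TinyAutoencoder()(x)         # same shape as x, rebuilt from the small code
```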
Diffusion Model
- What it is: A diffusion model starts from noisy static and learns to remove noise step by step to reveal a clear image or video.
- How it works:
- Add noise to a clean example to get many noisy versions.
- Train a model to predict how to go from noisy to clean.
- At generation time, start from noise and repeatedly denoise.
- Why it matters: Without diffusion, generated videos are often blurry or unstable. Anchor: Like cleaning a foggy window little by little until you can see the scene clearly.
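The step-by-step denoising loop can be sketched in a few lines. This is a toy illustration only: `toy_denoiser` stands in for a trained network, and the update rule and step count are simplified assumptions, not Saber's actual sampler.

```python
import torch

def toy_denoiser(x, t):
    # Stand-in for a trained network that predicts how to make x cleaner at time t.
    # A real system uses a large neural network; this toy just nudges x toward zero.
    return x * 0.1

def sample(shape, num_steps=50):
    """Start from pure noise and repeatedly remove a little noise per step."""
    x = torch.randn(shape)                  # pure static
    for i in range(num_steps):
        t = 1.0 - i / num_steps             # time runs from noisy (1) toward clean (0)
        step = toy_denoiser(x, t)           # predicted cleaning direction
        x = x - step                        # take one small denoising step
    return x

video_latents = sample((1, 16, 4, 32, 32))  # (batch, channels, frames, height, width)
```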
Reference-to-Video Generation (R2V)
- What it is: R2V means creating a video that follows a text prompt while keeping the subject's identity from one or more reference images.
- How it works:
- Read the prompt to know what should happen.
- Look at reference images to know exactly what the subject looks like.
- Generate a video where the subject moves and acts according to the text but looks like the references.
- Why it matters: Without R2V, characters would change faces, colors, or clothes between frames. Anchor: Like making a cartoon where the hero always looks the same, even in new adventures.
Attention Mechanism
- What it is: Attention helps the model focus on the most important parts of text and images when deciding what to draw next.
- How it works:
- Compare every part (tokens) with every other part.
- Give higher scores to helpful parts.
- Use high-score parts more when generating.
- Why it matters: Without attention, the model would treat "background wall" and "main character's face" as equally important. Anchor: When reading a recipe, you pay more attention to "bake for 20 minutes" than to the picture border.
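The comparing-and-scoring routine described above is standard scaled dot-product attention. Below is a minimal, generic sketch (not Saber-specific); the optional boolean mask foreshadows how attention masks later restrict which token pairs may interact.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, attn_mask=None):
    """Standard attention: score every query against every key, then mix the values.

    attn_mask (optional): boolean matrix, True = this query may look at this key.
    """
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)      # compare all token pairs
    if attn_mask is not None:
        scores = scores.masked_fill(~attn_mask, float("-inf"))   # forbid disallowed pairs
    weights = F.softmax(scores, dim=-1)                          # higher score -> more influence
    return weights @ v                                           # weighted mix of helpful parts

# Toy shapes: 8 query tokens attending over 12 key/value tokens, 64-dim features.
q, k, v = torch.randn(8, 64), torch.randn(12, 64), torch.randn(12, 64)
out = scaled_dot_product_attention(q, k, v)                      # shape (8, 64)
```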
Video-Text Pair Training
- What it is: Training using pairs of videos and their captions.
- How it works:
- Show the model a video and a matching caption.
- Teach it to connect what it sees (actions, objects) with words.
- Repeat across many examples so it generalizes.
- Why it matters: Without video-text pairs, the model wouldn't know how words relate to moving pictures. Anchor: Matching a nature video with the caption "a deer runs through a forest" teaches what "deer" and "runs" look like.
The World Before: Many R2V systems depended on special triplet datasets (reference images, videos, and text prompts tied together). Building these giant datasets is hard: you must find good videos, label them, match references to scenes, filter out low-quality clips, and often pay for extra tools. The datasets also skew toward a few categories (like people), so models struggle on unusual subjects.
The Problem: We needed a way to teach a model to keep subject identity consistent in new videos without building huge, expensive triplet datasets.
Failed Attempts: Models that simply paste reference faces into frames often create "sticker" artifacts (harsh edges, awkward poses) and don't blend with motion or lighting. Others overfit to the dataset's narrow subjects and fail on unseen categories.
The Gap: A scalable training recipe that uses the plentiful video-text data everyone already has, yet still teaches strong subject consistency like triplet-based R2V.
Real Stakes: This affects personalized storytelling, classroom projects, small businesses making ads, and creators who want custom characters, without massive budgets or special data pipelines.
Zero-shot Learning (as used here)
- What it is: Zero-shot here means training without any explicit R2V triplets, then performing R2V at test time.
- How it works:
- Train only on video-text pairs.
- Design training tricks that mimic having references.
- At inference, accept real reference images and generate videos.
- Why it matters: Without zero-shot, scaling R2V to lots of subjects and styles stays expensive and slow. Anchor: Learning to ride a bike on a quiet street (video-text), then confidently riding in a park you've never seen (R2V).
02 Core Idea
Hook: You know how teachers sometimes cover parts of a picture and ask you to guess the hidden parts? That game trains your brain to notice important details.
The Aha! Moment
- What it is: Saber pretends that some video frames are "reference images" by masking parts of them during training, so the model learns to keep subject identity without ever seeing real R2V triplets.
- How it works:
- Pick random frames from a training video.
- Create shapes (masks) that hide or reveal regions.
- Apply geometric tweaks (rotate, scale, shear, flip) so positions don't match exactly.
- Feed these masked frames as references alongside the noised target video.
- Use attention with an attention mask so the model reads only the valid reference parts.
- Why it matters: Without this, the model wouldn't learn to pull identity from references or would just copy-paste and look fake. Anchor: Like learning a friend's look from peekaboo pictures and then drawing them in a new scene that matches a story.
Multiple Analogies
- Puzzle Analogy
- What it is: Masked frames are puzzle pieces; the model learns how they fit into the bigger video.
- How it works: It studies the visible shapes/colors and completes the scene.
- Why it matters: No missing piece confusion; the model knows where each piece belongs.
- Anchor: Completing a jigsaw when some pieces show the main character's face.
- Costume Designer Analogy
- What it is: References are costume photos; the script is the prompt.
- How it works: The designer matches clothes, colors, and style to actors across all scenes.
- Why it matters: The hero never randomly changes outfits.
- Anchor: Keeping the exact blue jacket and logo across the whole movie.
- Tour Guide Analogy
- What it is: Attention is a guide pointing at what matters.
- How it works: The guide says, "Look here: the face and the hat," not "Look at the empty wall."
- Why it matters: You won't waste time staring at unhelpful spots.
- Anchor: A museum guide highlighting the painting's subject, not the frame.
Building Blocks
- Masked Training Strategy
- What it is: Hide parts of sampled frames and use the visible regions as the "reference condition."
- How it works: Random masks + augmentations teach generalization and prevent copy-paste.
- Why it matters: Gives R2V skills using only video-text data.
- Anchor: Learning a character from peekaboo windows.
- Mask Augmentation
- What it is: Rotate/scale/shear/flip masks and images so they don't align pixel-by-pixel.
- How it works: The model must learn identity, not exact coordinates.
- Why it matters: Stops sticker-like artifacts.
- Anchor: Rearranging puzzle pieces so you can't cheat by tracing outlines.
- Attention Mask
- What it is: A map that says which reference pixels are valid to look at.
- How it works: Self-attention allows video-reference interaction only where masks say "this is reference."
- Why it matters: Prevents reading background noise or gray padding.
- Anchor: Highlighter marks the key sentences, not the page margins.
- Scalable References
- What it is: Works with one or many references, including multiple views of the same subject.
- How it works: Concatenate references as extra time steps and let attention link them.
- Why it matters: More flexibility without changing training.
- Anchor: Adding more photos of your friend from front/side/back to draw them better.
Before vs After
- Before: Needed costly triplets; models overfit to common subjects; sticker artifacts were common.
- After: Train on easy-to-get video-text pairs; strong identity consistency; better generalization to new subjects.
Why It Works (Intuition)
- The model repeatedly practices pulling identity clues from partially revealed, transformed references. The attention mask ensures it attends to the right spots, and augmentations force it to learn the idea of "this is the same subject," not "this is the same location." Over time, the diffusion transformer becomes skilled at weaving reference identity into the generated video while following the text.
Anchor: In class, you learn to recognize your friend even if they wear a hat, stand in a different spot, or the photo is rotated, because you learned their key features, not just a single snapshot.
03 Methodology
Hook: Think of making a recipe video. You have the script (text), your friend's photos (references), and you need smooth steps from start to finish so the final dish looks and tastes right.
At a high level: Text + Video (training pairs) → Mask some sampled frames as faux references → Concatenate video tokens with reference tokens → Attention guided by masks + text → Diffusion denoising → Output video.
Step 1: Prepare Masked Frames as References
- What it is: During training, randomly chosen frames from the training video are turned into "reference images" by masking.
- How it works:
- Pick K frames from the video.
- Generate a binary mask shape (ellipse, Fourier blob, or polygon) with a target area ratio r.
- Apply the same geometric augmentations (rotation, scale, shear, translation, optional flip) to both the frame and its mask.
- Keep the masked region inside the frame; use bilinear/nearest-neighbor interpolation appropriately.
- The result is a masked reference image Ĩ_k = Î_k ⊙ M̂_k, i.e., the augmented frame multiplied element-wise by its augmented mask.
- Why it matters: Without masking and augmentation, the model might just copy pixels instead of learning identity. Anchor: Like cutting peek-holes of different shapes in a photo, then turning or resizing the photo so the holes don't line up perfectly.
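A minimal sketch of what this masking-and-augmenting step might look like in code. It is an illustration under assumptions: only the ellipse mask shape is shown (the text also mentions Fourier blobs and polygons), and the augmentation ranges and function names are made up for the example, not taken from Saber's implementation.

```python
import math
import random
import torch
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def make_ellipse_mask(h, w, area_ratio):
    """Binary mask containing an ellipse whose area is roughly area_ratio of the frame."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cy, cx = h * random.uniform(0.3, 0.7), w * random.uniform(0.3, 0.7)
    a = (area_ratio * h * w / math.pi) ** 0.5 * random.uniform(0.7, 1.3)  # semi-axis 1
    b = area_ratio * h * w / (math.pi * a)                                # semi-axis 2
    mask = (((ys - cy) / a) ** 2 + ((xs - cx) / b) ** 2 <= 1.0).float()
    return mask.unsqueeze(0)  # (1, H, W)

def make_faux_reference(frame, area_ratio=0.3):
    """Mask a sampled frame and jointly augment frame + mask so positions do not align."""
    _, h, w = frame.shape
    mask = make_ellipse_mask(h, w, area_ratio)
    angle = random.uniform(-30, 30)
    scale = random.uniform(0.8, 1.2)
    shear = [random.uniform(-10, 10)]
    translate = [random.randint(-w // 8, w // 8), random.randint(-h // 8, h // 8)]
    # Same geometric transform for both; bilinear for the image, nearest for the mask.
    frame_aug = TF.affine(frame, angle, translate, scale, shear,
                          interpolation=InterpolationMode.BILINEAR)
    mask_aug = TF.affine(mask, angle, translate, scale, shear,
                         interpolation=InterpolationMode.NEAREST)
    return frame_aug * mask_aug, mask_aug  # masked reference image and its mask

frame = torch.rand(3, 256, 256)            # one sampled video frame (C, H, W)
ref_img, ref_mask = make_faux_reference(frame)
```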
Step 2: Encode to Latent Space with a VAE
- What it is: Convert both the target video and the reference images into compact latent codes.
- How it works:
- Use the Wan2.1 VAE to encode video frames into latents with temporal compression (ratio 4).
- Encode each masked reference image into a latent of the same spatial size.
- Resize masks to match the latent grid (produce m_ref where 1 marks valid reference areas).
- Why it matters: Without compression, the transformer would be too slow and memory-hungry. Anchor: Shrinking big pictures into neat sticker-sized cards that are easy to organize.
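A shape-level sketch of this encoding step. The real Wan2.1 VAE is replaced by random stand-in latents because its API is not described here; the temporal ratio 4 comes from the text, while the spatial ratio, channel count, and resolution are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Stand-ins for VAE outputs: only shapes are tracked, no real encoding happens.
B, T, H, W = 1, 16, 480, 832                    # pixel-space video (sizes are assumptions)
t_lat, h_lat, w_lat = T // 4, H // 8, W // 8    # temporal ratio 4, assumed spatial ratio 8

z_video = torch.randn(B, 16, t_lat, h_lat, w_lat)  # video latents (16 channels assumed)
z_ref   = torch.randn(B, 16, 1, h_lat, w_lat)      # one masked reference, one latent step

# Resize the pixel-space reference mask to the latent grid; 1 marks valid reference areas.
ref_mask_px = (torch.rand(B, 1, 1, H, W) > 0.5).float()
m_ref = F.interpolate(ref_mask_px.squeeze(2), size=(h_lat, w_lat), mode="nearest")
m_ref = m_ref.unsqueeze(2)                      # (B, 1, 1, h_lat, w_lat), aligned with z_ref
```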
Step 3: Build the Input Sequence
- What it is: Concatenate video and reference latents along the time axis, plus companion channels for masks and placeholder zeros.
- How it works:
- Combine noisy video latents (z_t) with clean reference latents (z_ref) temporally.
- Concatenate masks (m_zero for video part, m_ref for reference part) as channels.
- Add z_zero (zeros matching the video part) concatenated with z_ref as an auxiliary channel.
- Why it matters: Without this structured input, the model couldn't easily control how video tokens talk to reference tokens. Anchor: Stacking the main story pages followed by a reference photo section, with sticky notes (masks) telling where to look.
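Continuing the shape-level sketch, here is one plausible way to assemble the structured input from this step. The names (z_t, z_ref, m_ref, m_zero, z_zero) follow the bullets above, but the exact tensor layout is an assumption, not the verified Saber implementation.

```python
import torch

B, C, Tv, Tr, Hl, Wl = 1, 16, 4, 1, 60, 104   # latent sizes from the previous sketch

z_t    = torch.randn(B, C, Tv, Hl, Wl)        # noisy video latents
z_ref  = torch.randn(B, C, Tr, Hl, Wl)        # clean reference latents
m_ref  = torch.ones(B, 1, Tr, Hl, Wl)         # 1 = valid reference region
m_zero = torch.zeros(B, 1, Tv, Hl, Wl)        # video part carries no reference mask
z_zero = torch.zeros(B, C, Tv, Hl, Wl)        # placeholder zeros matching the video part

# 1) Concatenate video and reference parts along the time axis.
latents = torch.cat([z_t, z_ref], dim=2)              # (B, C, Tv+Tr, Hl, Wl)
masks   = torch.cat([m_zero, m_ref], dim=2)           # (B, 1, Tv+Tr, Hl, Wl)
aux     = torch.cat([z_zero, z_ref], dim=2)           # (B, C, Tv+Tr, Hl, Wl)

# 2) Concatenate the companion channels so every token "knows" its role.
model_input = torch.cat([latents, masks, aux], dim=1) # (B, 2C+1, Tv+Tr, Hl, Wl)
```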
Step 4: Attention with an Attention Mask
- What it is: A transformer where self-attention allows video-reference interaction only in valid regions, followed by cross-attention to the text.
- How it works:
- Self-attention: Video tokens attend to each other bidirectionally. Video tokens attend to reference tokens only where m_ref=1, avoiding gray/background.
- Cross-attention: All tokens consult text features to stay on-script.
- Feed-forward network polishes features; time-step embedding controls denoising progress.
- Why it matters: Without the attention mask, gray padding and non-reference pixels pollute identity learning, causing halos and artifacts. Anchor: A teacher lets students ask questions only to the right expert in the room (reference areas), not to random passersby.
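A small sketch of how such an attention mask could be built as a boolean matrix over video and reference tokens. The video-to-reference rule follows the description above; how reference tokens attend back is not specified in the text, so those entries are marked as assumptions in the comments.

```python
import torch

def build_attention_mask(num_video_tokens, ref_valid):
    """Boolean (N, N) matrix where True means attention is allowed.

    ref_valid: (num_ref_tokens,) bool, True where a reference token lies inside m_ref.
    """
    num_ref = ref_valid.numel()
    n = num_video_tokens + num_ref
    allowed = torch.zeros(n, n, dtype=torch.bool)

    allowed[:num_video_tokens, :num_video_tokens] = True        # video <-> video, bidirectional
    allowed[:num_video_tokens, num_video_tokens:] = ref_valid   # video -> valid reference only
    allowed[num_video_tokens:, :num_video_tokens] = True        # reference -> video (assumption)
    allowed[num_video_tokens:, num_video_tokens:] = ref_valid   # among references, read valid tokens only (assumption)
    return allowed

mask = build_attention_mask(num_video_tokens=6,
                            ref_valid=torch.tensor([True, True, False]))
# Pass `mask` as attn_mask to the attention sketch shown earlier
# (scores are set to -inf wherever the mask is False).
```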
Step 5: Diffusion with Flow Matching
- What it is: Train the diffusion transformer (Wan2.1-14B backbone) to predict the velocity that moves noisy latents toward clean ones under conditions (text + references).
- How it works:
- Sample a time t in [0,1] and mix data with noise (linear interpolation).
- The model predicts the velocity to move toward clean data.
- Optimize with AdamW (lr 1e-5, batch 64) over large video-text datasets (e.g., Shutterstock Video with captions from Qwen2.5-VL-Instruct).
- Why it matters: Without a strong denoising objective, the model won't learn crisp, coherent videos. Anchor: Like following a GPS arrow that steadily guides you from foggy roads back to the clear highway.
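A compact sketch of one flow-matching training step matching the description above (sample a time, linearly interpolate data with noise, regress the velocity toward clean data). Conventions for the direction of t vary across papers; the model signature and this particular parameterization are assumptions.

```python
import torch

def flow_matching_loss(model, x1, cond):
    """One flow-matching training step.

    x1: clean latents; cond: conditioning (text features, reference latents, masks).
    `model` is a stand-in for the diffusion transformer; its call signature is assumed.
    """
    x0 = torch.randn_like(x1)                      # pure-noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # sample a time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast t over latent dimensions
    xt = (1 - t_) * x0 + t_ * x1                   # linear interpolation: noise at t=0, data at t=1
    target_velocity = x1 - x0                      # direction from noise toward clean data
    pred_velocity = model(xt, t, cond)             # model predicts that direction
    return torch.mean((pred_velocity - target_velocity) ** 2)

# Usage sketch (all names are placeholders):
# loss = flow_matching_loss(diffusion_transformer, clean_latents, cond)
# loss.backward(); optimizer.step()   # e.g., AdamW with lr 1e-5 as described above
```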
Step 6: Zero-shot Inference with Real References
- What it is: At test time, use real reference images (subjects or backgrounds) plus a text prompt to generate the video.
- How it works:
- If it's a subject: use a segmenter (e.g., BiRefNet) to get the foreground mask; fill the background with gray.
- If it's a background: skip segmentation; use an all-ones mask to treat the whole image as reference.
- Resize-and-pad references to match video size; encode and concatenate as in training.
- Run the Wan sampling pipeline, e.g., 50 denoising steps with CFG scale 5.0.
- Why it matters: Without smart preprocessing, references wouldn't align, and identity/background cues would be weak or messy. Anchor: Cropping your friend out of a photo to place them into a new scene, or using the whole photo if you actually want that scene as the backdrop.
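A sketch of the reference preprocessing described in this step: gray out everything outside the subject mask (or keep the whole image for backgrounds), then resize-and-pad to the video resolution. The segmenter call itself is omitted, and the gray value and padding policy are assumptions.

```python
import torch
import torch.nn.functional as F

def prepare_reference(image, fg_mask, target_h, target_w, is_background=False):
    """Preprocess one reference image before encoding.

    image: (3, H, W) in [0, 1]; fg_mask: (1, H, W) foreground mask from a segmenter
    such as BiRefNet (not shown here). Values outside the mask become flat gray.
    """
    if is_background:
        fg_mask = torch.ones_like(fg_mask)           # the whole image counts as reference
    gray = torch.full_like(image, 0.5)
    image = image * fg_mask + gray * (1 - fg_mask)   # gray-out everything outside the subject

    # Resize while keeping aspect ratio, then pad with gray to the video resolution.
    _, h, w = image.shape
    scale = min(target_h / h, target_w / w)
    new_h, new_w = int(h * scale), int(w * scale)
    image = F.interpolate(image.unsqueeze(0), size=(new_h, new_w),
                          mode="bilinear", align_corners=False).squeeze(0)
    mask = F.interpolate(fg_mask.unsqueeze(0), size=(new_h, new_w), mode="nearest").squeeze(0)

    pad_w, pad_h = target_w - new_w, target_h - new_h
    pad = (pad_w // 2, pad_w - pad_w // 2, pad_h // 2, pad_h - pad_h // 2)  # left, right, top, bottom
    image = F.pad(image, pad, value=0.5)             # padded areas are gray
    mask = F.pad(mask, pad, value=0.0)               # padded areas are never valid reference
    return image, mask
```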
The Secret Sauce
- What it is: The trio of masked training, mask augmentations, and attention masks, plus simple temporal concatenation.
- How it works: They force the model to learn identity features rather than positions, keep attention on valid areas, and scale to multiple references naturally.
- Why it matters: This turns abundant video-text data into a teacher that quietly simulates the R2V task. Anchor: Practicing with shuffled, windowed photos and a helpful highlighter so you learn who your friend is in any scene, not just one snapshot.
04 Experiments & Results
Hook: Think of a science fair where everyone brings their best volcano. You need fair rules, honest judges, and clear scores to see whose volcano works best.
The Test (What and Why)
- What it is: Evaluate Saber on OpenS2V-Eval, a public benchmark with 180 prompts covering single- and multi-reference scenarios (faces, humans, objects, combos).
- How it works:
- Feed the same references and prompts to all models.
- Measure visual quality (Aesthetics), smoothness (MotionSmoothness), how big movements are (MotionAmplitude), identity match (FaceSim), text alignment (GmeScore), subject consistency (NexusScore), and naturalness (NaturalScore).
- Compare totals and sub-scores.
- Why it matters: Without a shared test and clear metrics, claims would be hand-wavy and unfair. Anchor: Like grading all volcano projects with the same rubric: height, realism, safety, and eruption power.
The Competition (Baselines)
- What it is: Compare Saber to strong systems, including:
- Closed-source commercial: Kling1.6, Pika2.1, Vidu2.0.
- Explicit R2V-data models: Phantom-14B, VACE-14B, SkyReels-A2, MAGREF, BindWeave.
- How it works: Everyone runs on OpenS2V-Eval with the official protocol.
- Why it matters: Beating methods trained on expensive triplets shows zero-shot can scale. Anchor: Imagine winning a race using regular sneakers against competitors with custom, pricey gear.
The Scoreboard (With Context)
- What it is: Saber achieves the top Total Score (57.91%) among compared methods, even though it's zero-shot.
- How it works:
- Total Score: 57.91%, edging out Kling1.6 and slightly surpassing triplet-trained models like Phantom-14B and VACE-14B.
- NexusScore (subject consistency): Saber is best, outperforming Phantom by +9.79% and VACE by +3.14%, like getting an A+ in "keeping identity consistent" when others got Bs.
- GmeScore and NaturalScore: competitive, showing good text-video alignment and realistic feel.
- Why it matters: The hardest part of R2V is keeping the subject consistent; this is where Saber shines most. Anchor: It's like not only finishing first overall but also getting the blue ribbon for "most accurate look-alike character."
Surprising Findings and Qualitative Examples
- What it is:
- Single human/object: Others sometimes fail to embed the reference or paste it crudely; Saber blends identity smoothly.
- Multi-human/multi-object: Some baselines duplicate or miss subjects; Saber coherently includes all.
- Background references: By adjusting mask ratios during training, Saber adapts to treating references as backgrounds too.
- How it works: Mask diversity and attention masks reduce halos and "sticker" looks; augmentations encourage real integration.
- Why it matters: Shows generalization to varied setups without special retraining. Anchor: Instead of taping cutouts onto a set, Saber paints the scene so everything belongs.
Ablations (What Makes It Tick)
- What it is: Carefully remove or change parts to see what breaks.
- How it works:
- Remove masked training and train on OpenS2V-5M triplets: total score drops (−1.67%), so masked training helps even versus explicit triplets.
- Single mask type only: worse scores; mask diversity is crucial.
- Fix mask area (r=0.3): bigger drop; varied area ratios aid generalization.
- Remove mask augmentation: copy-paste artifacts appear.
- Remove attention mask: gray halos and boundary artifacts appear.
- Why it matters: Each piece (masked training, augmentation, attention mask) is necessary for best results. Anchor: Like testing a bike: remove the chain (won't move), take off training wheels too soon (wobbly), or deflate the tires (bumpy); each part matters.
Emergent Abilities
- What it is: Abilities that weren't hard-coded but show up anyway.
- How it works:
- Multi-view, single subject: Front/side/back references of a robot combine into one coherent subject.
- Cross-modal alignment: Swap text descriptions (shirt color, left/right positions); Saber updates the video accordingly.
- Why it matters: Shows robust understanding of references and prompts through self-attention + cross-attention interplay. Anchor: Recognizing your friend from any angle and following new stage directions without re-training.
05 Discussion & Limitations
Hook: Even great inventions have fine print, like a super-fast scooter that still needs a helmet and a smooth road.
Limitations
- What it is: Situations where Saber struggles.
- How it works:
- Too many references (e.g., ~12): compositions can fragment; the model may mix parts awkwardly.
- Complex motion control: Very detailed, choreographed actions remain challenging.
- Temporal consistency under tricky prompts: Long, intricate scenes can show minor drift.
- Why it matters: Knowing boundaries helps users plan around them. Anchor: Juggling is fun, but tossing a dozen balls at once is still hard.
Required Resources
- What it is: What you need to run or train Saber.
- How it works:
- Training: Large GPU clusters (Wan2.1-14B backbone), big video-text data (e.g., Shutterstock + captioning), hours to days of finetuning.
- Inference: A strong GPU, segmentation model (e.g., BiRefNet) for subjects, standard diffusion sampling steps (~50) and CFG ~5.
- Why it matters: Ensures realistic expectations for deployment. Anchor: Like needing a good kitchen, ingredients, and an oven to bake a big cake.
When NOT to Use
- What it is: Cases where other tools might fit better.
- How it works:
- Exact, frame-level motion control (e.g., precise choreography): consider motion-conditioned or keyframe-driven systems.
- Ultra-long, story-length videos requiring perfect continuity: consider methods specialized for long-range temporal memory.
- Extremely high reference counts or complex many-to-many subject mappings.
- Why it matters: The right tool for the right job saves time. Anchor: Don't bring a paint roller when you need a tiny brush for details.
Open Questions
- What it is: Future research directions.
- How it works:
- Scaling to many references without fragmentation: better grouping and attention routing.
- Fine-grained motion control: blend in motion priors or trajectory prompts.
- Stronger temporal modules: reduce drift across long scenes.
- Automatic reference-role assignment: decide which images are subjects vs. backgrounds.
- Why it matters: Unlocks broader creative control and reliability. Anchor: We've built a great camera; now we're exploring new lenses, tripods, and lighting to film even better movies.
06 Conclusion & Future Work
Hook: Picture learning to draw your best friend in any scene, from any angle, just by practicing with regular photos and stories, with no special lessons.
3-Sentence Summary
- Saber is a zero-shot reference-to-video framework that learns identity-consistent video generation using only video-text pairs, not costly triplets.
- It masks and augments sampled frames during training, uses attention masks to focus on true reference regions, and runs on a modern diffusion transformer backbone.
- The result is scalable, high-quality R2V that outperforms several triplet-trained systems, especially in keeping subjects consistent.
Main Achievement
- Turning plentiful video-text data into an effective teacher for R2V via masked training, mask augmentation, and attention masking, eliminating the need for explicit R2V datasets.
Future Directions
- Improve handling of many references, add precise motion controls, and strengthen long-horizon temporal consistency. Explore automatic role assignment for references (subject vs. background) and richer multi-view fusion.
Why Remember This
- Saber shows a path to scalable, personalized video generation that regular creators can benefit from, because it learns powerful reference skills without expensive custom data. It's a blueprint for using clever training design to replace rare data with abundant data, opening the door to more accessible, reliable, and creative video tools for everyone.
Practical Applications
- Create personalized ads where a product from a photo appears naturally in action scenes described by text.
- Generate family highlight videos where a child's look from a portrait stays consistent across new, imagined adventures.
- Prototype movie storyboards that keep character identities steady while testing different scripts and scenes.
- Make educational clips that place a known classroom mascot into science or history demonstrations guided by captions.
- Design virtual avatars for streamers that look exactly like their uploaded reference images, animated by text prompts.
- Build brand-consistent social videos where logos and mascots match reference images without sticker-like artifacts.
- Produce product demos that keep exact textures and colors from catalog photos while showing usage scenarios.
- Create multi-subject birthday videos that correctly include all friends from reference photos with coherent interactions.
- Generate travel reels that use a reference background photo (e.g., a landmark) and animate scenes described by text.
- Quickly localize content by swapping regional reference objects (e.g., packaging variants) while keeping the same script.