
OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

Intermediate
Pengze Zhang, Yanze Wu, Mengtian Li et al. · 1/20/2026
arXiv · PDF

Key Summary

  • OmniTransfer is a single system that learns from a whole reference video, not just one image, so it can copy how things look (identity and style) and how they move (motion, camera, effects).
  • It adds a tiny, smart "position nudge" so the model knows when to treat time like space and when to spread appearance across frames, keeping videos consistent and clear.
  • It keeps the reference video and the new video in separate lanes and lets information flow only from reference to target, which stops lazy copy-paste and speeds things up by about 20%.
  • It uses an MLLM (a vision-and-language model) to understand what task you want (ID, style, motion, camera, effects) and guides the video model with the right kind of meaning.
  • Across identity, style, camera movement, and effects, OmniTransfer beats previous methods, and it matches pose-based motion transfer even without using pose.
  • It can mix tasks (like ID + effect or style + motion) without being retrained for each combo, showing strong generalization.
  • The key insight is that video models naturally keep actions consistent when you line things up in space, so OmniTransfer treats some time problems as space problems.
  • This approach works in real-world, messy videos and cuts runtime while improving quality, offering flexible, high-fidelity video generation.
  • The paper also includes new evaluation sets and user studies to measure effect fidelity, camera movement, and motion quality where standard metrics are missing.

Why This Research Matters

OmniTransfer gives creators a single, reliable tool to reuse what’s great in any video—how someone looks, moves, and how the camera and effects behave—without stitching together fragile, task-specific hacks. This means faster, higher-quality content for filmmaking, advertising, education, and social media. It lowers the barrier for non-experts, who can now show a reference video and say “make it like this,” and get faithful results. Studios benefit from consistent characters and styles across scenes without re-rigging cameras or extracting poses. The framework is efficient and general, so it scales better to real-world, messy footage. By unifying appearance and temporal transfer, it sets a path toward more controllable, expressive video generation tools.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re making a flipbook. If you draw only one picture, you can’t show motion. But if you draw many pages, the story comes alive with how things move and change.

🥬 Filling (The Actual Concept): Spatio-temporal video transfer is teaching an AI to reuse both how things look (space) and how they move over time (time) from a reference video to make a new video that follows your instructions.

  • What it is: A way to copy appearance (identity, style) and timing (motion, camera, effects) from a reference video into a new video.
  • How it works: (1) Read the whole reference video for multi-view details; (2) Understand which parts are about looks vs motion; (3) Nudge the model’s sense of position to keep either time or appearance lined up; (4) Guide it with language and vision understanding; (5) Generate a new video that obeys the chosen task(s).
  • Why it matters: Without it, we get videos that look right in one frame but fall apart across frames, or move correctly but lose the person’s identity or style.

🍞 Bottom Bread (Anchor): Think of making a new dance video where a different person dances with the same moves and camera swoops as the reference. The system copies the motion and camera feel while keeping the new person’s look.

  1. The World Before: For a long time, video generation mostly leaned on two ingredients: a single image to say what things should look like, and text to say what should happen. This worked okay for simple cases: you could keep a face similar or paint a picture in a given art style. But videos are richer than images. They have multiple views (front, side, close, far) and time (actions, camera pans, and special effects). Trying to capture all that from just one image is like trying to guess a whole movie from a single screenshot.

  2. The Problem: Past methods didn’t fully use reference videos. For appearance, they often took only one reference image for identity or style, missing multi-view details like side profiles or how clothing folds when someone turns. For time, many methods depended on extra tools like body pose skeletons or camera parameters, needed test-time fine-tuning, or did tricky “inversion” steps—making them brittle and slow. Nothing tied it all together into one flexible, general system that can transfer identity, style, motion, camera movement, and effects—all from a video.

  3. Failed Attempts: Image-only references often broke identity when the person turned or the camera moved. Pose-controlled motion was limited: skeletons can distort clothing or fail in crowds. Inversion and test-time tuning were slow and task-specific. A recent attempt at camera cloning concatenated temporal context, but struggled to generalize and didn’t solve broader temporal tasks like effects or fine-grained motion.

  4. The Gap: We needed a unified, plug-and-play way to read the rich clues inside a reference video—both multi-view appearance and temporal cues—and reuse them for many tasks without special add-ons. We also needed a method that is efficient (not doubling compute) and avoids lazy “copy-paste.”

  5. Real Stakes: This matters outside the lab. Creators want to: keep the same character across scenes, restyle a clip in an artist’s look, mimic a tricky camera glide, transfer a cool visual effect, or make someone else perform the same choreography. Studios need speed and reliability across messy, real-world videos. OmniTransfer turns reference videos into a universal guidebook for new, high-fidelity, controllable generations.

To understand the pieces, let’s introduce a few core building blocks using the Sandwich pattern:

A. 🍞 Hook: You know how looking at a friend from different angles (front, side, above) helps you recognize them anywhere? 🥬 Multi-view video information:

  • What it is: Using many frames and angles from a reference video to capture full appearance details.
  • How it works: Gather face, hair, clothing, textures, and lighting across frames so side views match front views.
  • Why it matters: One image can miss details (like profiles), causing identity drift. 🍞 Anchor: From a video, you learn a singer’s profile nose shape and ear outline, so the AI keeps them correct when they turn.

B. 🍞 Hook: When following a dance routine, you keep steps in the right order. 🥬 Temporal coherence:

  • What it is: Keeping actions and looks consistent across time.
  • How it works: Track who is who and how they move every frame without jumps.
  • Why it matters: Without it, motions stutter and faces morph. 🍞 Anchor: The dancer’s spin starts, reaches peak, and finishes smoothly—no sudden teleporting.

C. 🍞 Hook: Using a travel map, you match landmarks to find your way. 🥬 Video reference cues:

  • What it is: Helpful hints from a reference video (appearance, motion, camera, effects).
  • How it works: Extract and reuse these hints to guide a new video.
  • Why it matters: Text alone misses fine-grained details. 🍞 Anchor: Copy the exact rhythm of waves and the same crane camera swoop from the reference beach clip.

D. 🍞 Hook: Repainting a scene without changing the shapes underneath. 🥬 Spatial appearance transfer:

  • What it is: Copying look (identity, style) while keeping scene structure.
  • How it works: Align faces, clothes, textures across positions.
  • Why it matters: Prevents faces or styles from melting or drifting. 🍞 Anchor: A jazz club scene restyled as watercolor, but the tables and sax remain in place.

E. 🍞 Hook: Learning new games by watching others. 🥬 Spatio-temporal video transfer (the overall task):

  • What it is: A unified way to transfer both look (space) and motion/camera/effects (time) from a reference video into a new one.
  • How it works: Read multi-view appearance and temporal cues; decide which to transfer; generate a new video that obeys those cues.
  • Why it matters: Makes video creation flexible, accurate, and fast. 🍞 Anchor: Make a new person perform the same breakdance and camera sweep as the reference, but in a different setting, in a comic-book style.

02Core Idea

🍞 Top Bread (Hook): You know how some puzzles are easier if you rotate them the right way? Suddenly the pattern pops out.

🥬 Filling (The Actual Concept): The key insight is to let the video model treat time like space when that helps, and space like time when that helps—by adding a tiny, task-aware positional nudge—and to pass information from the reference video to the target in one causal direction, guided by a smart multimodal brain.

  • What it is (one sentence): OmniTransfer unifies appearance and temporal video transfer by (1) task-aware positional bias, (2) reference-to-target causal attention, and (3) task-adaptive multimodal alignment.
  • How it works, high-level:
    1. Build separate representations for reference and target so sizes can differ and info stays clean.
    2. Add a task-aware positional bias (offset) so the model’s attention aligns either over time (for appearance) or across space (for temporal tasks).
    3. Use causal attention so reference influences target, but target can’t corrupt reference.
    4. Bring in an MLLM with task-specific queries to interpret what to transfer and how.
  • Why it matters: Without the correct positional nudge, the model muddles tasks; without causal flow, it copy-pastes; without semantic guidance, it misreads what you want.

🍞 Bottom Bread (Anchor): When cloning a camera pan from a reference, OmniTransfer shifts positions so the model aligns motions like side-by-side panels, getting consistent timing; when preserving identity, it shifts over time so the same face features propagate across frames.

Multiple analogies (3 ways):

  • Rubik’s Cube: The cube becomes easy once you know which face to rotate first (positional bias), follow a one-way recipe (causal flow), and read the color pattern meaningfully (multimodal alignment).
  • Cooking: Separate the sauce (reference) from the pasta (target), pour sauce onto pasta but not vice versa (causal), adjust spices differently for sweet vs spicy dishes (task-aware bias), and follow the written-and-illustrated recipe (MLLM guidance).
  • Classroom: The teacher re-seats students (positional bias) so the right kids collaborate, lets notes flow from the answer key to the students but not the other way (causal), and asks different guiding questions per subject (task-adaptive alignment).

Before vs After:

  • Before: Separate, fragile tools for identity, style, motion, or camera. Often image-only references. Extra pose/camera inputs. Slow and brittle.
  • After: One framework that reads a whole reference video, keeps identity and style consistent across views, transfers motion/camera/effects without extra priors, mixes tasks, and runs faster.

Why it Works (intuition, no equations):

  • Modern video diffusion models already connect frames using spatial context. OmniTransfer leans into this: for temporal tasks, it offsets positions sideways so time consistency acts like spatial consistency; for appearance tasks, it offsets along time so looks spread smoothly across frames.
  • Causal attention stops the shortcut of copying entire frames because the target can’t push back into the reference. This encourages true transfer, not duplication.
  • The MLLM adds meaning: it knows whether you asked for ID, style, camera, motion, or effects and extracts the right signals.

Building Blocks (with Sandwich explanations):

A. 🍞 Hook: When you watch a magic trick from the side, you catch secrets front-view watchers miss. 🥬 Reference Latent Construction:

  • What it is: Separate, clean feature packs for the reference video and the target video.
  • How it works: Encode each into its own latent space, keep the reference noise-free, and add simple task flags.
  • Why it matters: Prevents blending confusion, supports different sizes, and preserves details. 🍞 Anchor: You can feed a 720p reference and make a 1080p target while keeping identity sharp.
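
To make the separation concrete, here is a minimal sketch of how the two latent packs might be assembled; the tensor layout, the one-hot task flag, and the function name are illustrative assumptions rather than the paper's actual interface.

```python
import torch

# Illustrative sketch of Reference Latent Construction (RLC). The exact latent
# layout and task-flag encoding are assumptions, not the paper's implementation.
def build_latents(target_latent, ref_latent, task, noise_level=1.0):
    """Keep the target and reference in separate packs so sizes can differ."""
    task_ids = {"id": 0, "style": 1, "motion": 2, "camera": 3, "effect": 4}

    # The target latent is noised for diffusion; the reference stays noise-free.
    noisy_target = target_latent + noise_level * torch.randn_like(target_latent)

    # Tag the reference with a simple one-hot task flag (an assumed encoding).
    flag = torch.zeros(len(task_ids))
    flag[task_ids[task]] = 1.0

    return {
        "target": noisy_target,   # e.g. 81 frames at a 480p latent resolution
        "reference": ref_latent,  # e.g. 200 frames at a 720p latent resolution
        "task_flag": flag,
    }
```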

B. 🍞 Hook: Sliding a bookmark to the right line makes you read the right part. 🥬 Task-aware Positional Bias (TPB):

  • What it is: A small offset added to positional embeddings to align attention differently per task.
  • How it works: For temporal tasks, offset across width so the model uses spatial in-context strength to keep time consistent; for appearance tasks, offset along time so looks propagate across frames.
  • Why it matters: Without the right offset, the model mixes tasks and loses consistency. 🍞 Anchor: To copy a camera pan, treat time like two side-by-side panels; to keep a face the same, nudge along the timeline.
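
Here is a minimal sketch of what the offset could look like, reusing the paper's example numbers (a width offset of 512 for temporal tasks, a time offset of 81 frames for appearance tasks); the (t, h, w) coordinate layout and function name are assumptions, and in the real model the shift is applied to the reference tokens' RoPE coordinates.

```python
# Sketch of a Task-aware Positional Bias (TPB) on reference token coordinates.
# The (t, h, w) layout and function name are assumptions for illustration.
def apply_tpb(ref_positions, task, target_frames=81, target_width=512):
    """Shift reference positions so attention aligns per task."""
    temporal_tasks = {"motion", "camera", "effect"}
    biased = []
    for t, h, w in ref_positions:
        if task in temporal_tasks:
            # Offset along width: reference frames sit "beside" the target,
            # like panels in a comic strip, so time behaves like space.
            biased.append((t, h, w + target_width))
        else:
            # Appearance tasks (ID, style): offset along time, so the look
            # propagates across the target's frames.
            biased.append((t + target_frames, h, w))
    return biased

print(apply_tpb([(3, 10, 7)], task="camera"))  # [(3, 10, 519)]
print(apply_tpb([(3, 10, 7)], task="id"))      # [(84, 10, 7)]
```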

C. 🍞 Hook: One-way streets reduce traffic jams. 🥬 Reference-decoupled Causal Learning (RCL):

  • What it is: Separate reference and target branches with one-way information flow from reference to target.
  • How it works: Self-attend within reference; in target, attend to both its own tokens and the reference tokens; keep the reference branch at fixed time (t=0) so it’s reused efficiently.
  • Why it matters: Stops copy-paste and cuts compute by roughly 20%. 🍞 Anchor: The reference provides guidance notes; the target reads them, but the notes never get scribbled on by the target.
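
The one-way flow can be pictured as a block attention mask, sketched below as a toy illustration; the mask layout is an assumption, not the paper's implementation.

```python
import torch

# Toy illustration of the one-way attention in Reference-decoupled Causal
# Learning: reference tokens attend only to themselves, while target tokens
# attend to both target and reference tokens.
def rcl_attention_mask(n_ref: int, n_tgt: int) -> torch.Tensor:
    n = n_ref + n_tgt
    allowed = torch.zeros(n, n, dtype=torch.bool)  # rows = queries, cols = keys
    allowed[:n_ref, :n_ref] = True                 # reference -> reference only
    allowed[n_ref:, :] = True                      # target -> reference + target
    return allowed

mask = rcl_attention_mask(n_ref=4, n_tgt=6)
# The top-left 4x4 block is self-contained, so nothing from the target can
# flow back into the reference branch; the reference (kept at t=0) can be
# computed once and reused, which is where the compute savings come from.
```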

D. 🍞 Hook: A bilingual friend helps you understand a sign by translating both the words and the pictures. 🥬 Task-adaptive Multimodal Alignment (TMA):

  • What it is: Use an MLLM and task-specific MetaQueries to extract the right semantics for the requested task.
  • How it works: Feed prompt, template, first target frame, and reference frames; ask different MetaQueries for ID/style vs motion/camera/effects; inject the result only into the target.
  • Why it matters: Prevents task confusion and improves controllability. 🍞 Anchor: If you ask for “same dance,” it focuses on timing; if you ask for “same style,” it focuses on brushstrokes and color.
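
Below is a hedged sketch of how task-specific MetaQueries plus a small MLP connector might pool MLLM features for the target branch's cross-attention; the class name, query counts, and dimensions are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

# Illustrative sketch of Task-adaptive Multimodal Alignment (TMA). Only the
# overall shape of the idea follows the text above; details are assumed.
class TMAConnector(nn.Module):
    def __init__(self, num_queries=32, mllm_dim=3584, dit_dim=1536):
        super().__init__()
        # One learnable MetaQuery bank per task family.
        self.queries = nn.ParameterDict({
            "appearance": nn.Parameter(0.02 * torch.randn(num_queries, mllm_dim)),
            "temporal":   nn.Parameter(0.02 * torch.randn(num_queries, mllm_dim)),
        })
        # Small MLP connector adapting pooled MLLM features for cross-attention.
        self.mlp = nn.Sequential(
            nn.Linear(mllm_dim, dit_dim), nn.GELU(), nn.Linear(dit_dim, dit_dim)
        )

    def forward(self, mllm_features, task_family):
        # mllm_features: (N, mllm_dim) tokens from the MLLM (prompt + frames).
        q = self.queries[task_family]                              # (Q, D)
        attn = torch.softmax(q @ mllm_features.T / q.shape[-1] ** 0.5, dim=-1)
        pooled = attn @ mllm_features                              # (Q, D)
        return self.mlp(pooled)   # injected only into the target branch

# Usage with random stand-in features:
connector = TMAConnector()
semantics = connector(torch.randn(128, 3584), task_family="temporal")
```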

Together, these pieces make the puzzle snap into place: align positions the smart way, move info in one safe direction, and let a multimodal brain steer the meaning.

03Methodology

At a high level: Inputs (reference video + optional first frame + prompt) → Reference Latent Construction → Task-aware Positional Bias → Reference-decoupled Causal Learning (self-attend reference, cross-attend in target) + Task-adaptive Multimodal Alignment → Diffusion decoding → Output video.

Step-by-step with Sandwich explanations for each key step:

  1. 🍞 Hook: Packing suitcases separately keeps your clothes clean and your snacks uncrushed. 🥬 Reference Latent Construction (RLC):
  • What happens: The system builds one latent for the target video (noise + condition + mask) and a separate, noise-free latent for the reference video, tagging it with task flags (e.g., ID vs style vs temporal tasks).
  • Why this step exists: Mixing feature streams too early causes the model to blur appearances or overwrite details; keeping reference noise-free preserves cues.
  • Example: Target wants 81 frames at 480p; the reference is 200 frames at 720p. RLC handles the mismatch cleanly. 🍞 Anchor: Like having two backpacks—one for books (reference facts) and one for sports gear (target actions)—so they don’t get mixed up.
  2. 🍞 Hook: When solving a maze, a tiny arrow showing “this way” saves time. 🥬 Task-aware Positional Bias (TPB):
  • What happens: The model adds a small offset to the reference tokens’ positional encoding: sideways (width) for temporal tasks; forward in time for appearance tasks.
  • Why this step exists: Video diffusion models are naturally good at spatial consistency; the sideways offset makes time behave like side-by-side panels for better temporal alignment. The time offset spreads appearance across frames.
  • Example with data: For a 512-pixel-wide target, temporal tasks offset width by 512. For an 81-frame target, appearance tasks offset time by 81. 🍞 Anchor: To copy a camera dolly from a reference, TPB places the reference as if it’s the neighbor panel in a comic strip so motions line up.
  3. 🍞 Hook: A one-way hallway prevents people from bumping into each other. 🥬 Reference-decoupled Causal Learning (RCL):
  • What happens: The reference branch does self-attention with the TPB; the target branch performs attention over its own tokens plus the reference’s tokens. The reference time embedding is fixed at t=0, making it reusable and faster.
  • Why this step exists: Bidirectional mixing invites copy-paste and quadruples compute (because tokens double, attention cost scales quadratically). Causal flow prevents that and speeds generation.
  • Example: When transferring an effect (like film grain pulses), the target attends to reference patterns without stamping the exact frames. 🍞 Anchor: The reference is a library: you can read it (target attends to it), but you can’t rewrite it (no reverse flow).
  4. 🍞 Hook: Asking the right coach for the right sport. 🥬 Task-adaptive Multimodal Alignment (TMA):
  • What happens: An MLLM (e.g., Qwen2.5-VL) reads the prompt, a task template, the first target frame, and reference frames. Task-specific MetaQueries pull the right semantic features: identity/style cues for appearance tasks; temporal dynamics for motion/camera/effects. A small connector (MLP) adapts these features and feeds them to the diffusion model’s cross-attention in the target branch only.
  • Why this step exists: Raw visual matching misses intent. The MLLM injects meaning so the model knows whether to copy hairstyle or camera path.
  • Example: For “copy camera movement,” the MetaQuery emphasizes parallax and horizon drift; for “copy style,” it emphasizes brush texture and color palette. 🍞 Anchor: Like choosing the dance coach for choreography copying and the art tutor for painting style.
  5. 🍞 Hook: Following a recipe: prepare, combine, and bake. 🥬 Diffusion Transformer with TPB + RCL + TMA:
  • What happens: The DiT blocks run self-attention (with RoPE) and cross-attention, guided by TPB, RCL, and TMA. The system iteratively denoises the target latent into clean video frames.
  • Why this step exists: Diffusion needs multiple steps to move from noise to video while staying faithful to guidance.
  • Example with data: Start with noisy target latent; at each step, target tokens look to both their own history and the reference tokens; MLLM guidance keeps task semantics on track. 🍞 Anchor: Like sculpting a statue from a rough block, smoothing layer by layer while looking at a model statue next to you.
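
As a toy illustration of step 5 above, the loop below shows how the pieces could be wired together; `predict_noise` is a stand-in for the DiT blocks (RoPE self-attention plus cross-attention to the TMA semantics), and the Euler-style update and shapes are simplified assumptions rather than the actual sampler.

```python
import torch

# Toy denoising loop combining TPB + RCL + TMA. `predict_noise` stands in for
# the DiT; nothing here is the paper's sampler, only the overall flow.
def generate(target_shape, ref_tokens, ref_positions, semantics,
             predict_noise, steps=50):
    x = torch.randn(target_shape)                   # start from pure noise
    for t in torch.linspace(1.0, 1.0 / steps, steps):
        # Target tokens attend to their own tokens plus the (t=0) reference
        # tokens; TMA semantics enter via cross-attention in the target only.
        eps = predict_noise(x, t, ref_tokens, ref_positions, semantics)
        x = x - (1.0 / steps) * eps                 # move a little toward clean
    return x

# Dummy predictor, just to show the call signature:
dummy = lambda x, t, ref, pos, sem: torch.zeros_like(x)
video_latent = generate((81, 16, 60, 104), None, None, None, dummy)
```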

Secret Sauce (what’s clever):

  • Turning temporal alignment into a spatial problem for video models that are already great at spatial consistency (TPB).
  • Enforcing a causal, one-way street from reference to target (RCL) to avoid copy-paste and slash compute time by ~20%.
  • Letting a multimodal tutor (TMA) clarify task meaning, so the model emphasizes the right signals.

Extra concept Sandwiches used under the hood:

A. 🍞 Hook: Turning a globe reveals different continents. 🥬 Rotary Positional Embedding (RoPE):

  • What it is: A way for attention to understand where tokens are in space and time by rotating their features.
  • How it works: Applies a position-dependent rotation to queries and keys, so attention knows relative positions.
  • Why it matters: Without RoPE, the model loses track of who is next to whom or which frame follows which. 🍞 Anchor: The model can tell that frame 10 is right after frame 9 and that pixel row 100 is below row 99.
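
For intuition, here is a minimal 1-D rotary embedding sketch; real video DiTs apply RoPE per axis over (time, height, width), which is exactly where TPB's position offsets enter. The base frequency and shapes are conventional choices, not the paper's specifics.

```python
import torch

# Minimal 1-D RoPE sketch: rotate feature halves by a position-dependent angle
# so attention scores depend on relative position.
def rope(x, positions, base=10000.0):
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions[:, None].float() * freqs[None, :]       # (N, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(5, 8)                    # 5 tokens, 8-dim features
q_rot = rope(q, torch.arange(5))         # same content, now position-aware
```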

B. 🍞 Hook: Reading a story forward makes sense; reading backward can be confusing. 🥬 Causal attention:

  • What it is: Attention that allows information to flow in a chosen direction only.
  • How it works: The target reads from the reference, but not vice versa.
  • Why it matters: Prevents overwriting clean reference features and discourages copying. 🍞 Anchor: Students read the answer key to learn, but they don’t edit the answer key.

C. 🍞 Hook: Matching socks by both color and pattern. 🥬 Multimodal integration + contextual learning:

  • What it is: Combining vision and language, and learning from the given context (reference + prompt) rather than retraining per case.
  • How it works: MLLM features and MetaQueries focus on the right context for each task.
  • Why it matters: Avoids confusion—ID vs style vs motion have different needs. 🍞 Anchor: For “same dance,” it cares about rhythm; for “same style,” it cares about brushstrokes.

Put together, OmniTransfer is a recipe: keep reference and target tidy, nudge positions for the task, pass info one-way, and let a multimodal coach guide what matters. The result is flexible, high-fidelity, and fast.

04Experiments & Results

The Test (what they measured and why):

  • Identity transfer: Does the generated person match the reference across views and frames? They used video-level face similarity measures (VSim-Arc/Cur/Glint) and text-video alignment (CLIP-T).
  • Style transfer: Does the video keep the reference style while following prompts? They checked style consistency (VCSD), text-video alignment (CLIP-T), and Aesthetics Score.
  • Effects transfer: Can we recreate complex visual effects from a reference video? No standard metric exists, so they ran a user study (effect fidelity, first-frame consistency, quality).
  • Camera movement: Does the new video follow the reference camera trajectory? Another user study (camera fidelity, image consistency, quality).
  • Motion transfer: Does the subject’s motion match (e.g., dance) without losing appearance? User study (motion fidelity, image consistency, quality).

The Competition (baselines):

  • ID: ConsisID, Phantom, Stand-in (mostly image-referenced methods).
  • Style: StyleCrafter, StyleMaster.
  • Effects: Wan 2.1 I2V, Seedance 1.0 I2V.
  • Camera: MotionClone, CamCloneMaster.
  • Motion: MimicMotion, WanAnimate.

The Scoreboard with Context:

  • ID Transfer: OmniTransfer reached VSim-Arc 0.48 vs the next best 0.45 (Phantom), with consistent improvements on VSim-Cur and VSim-Glint too. Think of it like raising your report card from a solid A- to an A.
  • Style Transfer: VCSD 0.51 vs 0.29 (StyleMaster), with higher CLIP-T and Aesthetics. That’s like painting with the same style but more faithfully and more beautifully.
  • Effects Transfer (user study): Effect Fidelity 3.45 vs 1.95–1.81 for Seedance/Wan I2V; also better image consistency and overall quality. That’s like perfectly recreating the sparkly glitter effect when others only made vague sparkles.
  • Camera Movement (user study): Camera Fidelity 4.19 vs 1.79–1.75; image consistency and quality also much higher. Imagine copying a pro-level drone shot when others barely manage a pan.
  • Motion Transfer (user study): Motion Fidelity matched or nearly matched the pose-based baselines while image consistency was higher (3.88), despite using no pose input and a smaller base model.

Surprising Findings:

  • Spatial-in-context for time: They found video diffusion models already keep motion consistent when frames are arranged side-by-side (spatially), but not across separate shots. This supports the idea to treat some time problems as spatial ones via TPB.
  • No pose needed: Matching pose-based methods without any explicit pose guidance shows the power of learning directly from a reference video, reducing setup and failure modes.
  • Faster and better: By fixing the reference branch time (t=0) and decoupling branches, they cut runtime by about 20% while improving quality—rare to get both speed and fidelity gains together.

Ablations (what matters most):

  • Baseline (temporal concat + full attention): Prone to task confusion and weak subtle motion transfer.
  • +TPB: Big jump in fine-grained temporal control and reduced cross-task leakage.
  • +RCL: Removes copy-paste artifacts and speeds up inference (~20%).
  • +TMA: Strong boost in semantic understanding; better scene and detail control (e.g., leather jacket, beard, correct money props).

Real-world Generalization:

  • The model handled cinematic camera moves, professional tracking shots, complex effects, and multi-view identities better than prior work.
  • It also combined tasks (e.g., ID + effect, style + motion) without retraining—useful for creative workflows.

Takeaway: The numbers aren’t just small nudges; many are big leaps—like moving from a B- to a clear A+ in user preference—especially on temporal tasks where old metrics don’t exist and human judgment matters most.

05Discussion & Limitations

Limitations (be specific):

  • Dependence on reference quality: If the reference video is blurry, overexposed, or lacks multiple views, identity or style consistency can still suffer, especially under extreme angles.
  • Very complex motions or crowds: Without explicit pose priors, some chaotic multi-person scenes may be harder to track perfectly across long durations.
  • High-resolution compute: While RCL cuts cost by ~20%, generating long, high-res videos with strong temporal coherence still needs significant GPU memory and time.
  • Metrics gap: For effects and camera movement, field-standard automatic metrics are lacking; user studies help, but repeatable numeric metrics would be better.

Required Resources:

  • A capable video diffusion backbone (e.g., Wan 2.1 I2V 14B) and GPUs with enough memory for multi-frame attention.
  • An MLLM (e.g., Qwen2.5-VL) with LoRA tuning to inject task semantics.
  • Curated reference-target datasets for training stages and evaluation, with diverse identities, styles, motions, and effects.

When NOT to Use:

  • Single-frame quick edits: If you only need a one-frame change, simpler image editors are faster.
  • Strict, known-parameter camera control: If you already have exact camera paths and need engineering-precise replicas, a specialized camera-parameter pipeline might be preferable.
  • Privacy-sensitive content without consent: Reference-based identity transfer should respect permissions; the tech is powerful and must be used responsibly.

Open Questions:

  • Automatic temporal metrics: Can we design robust, standardized metrics for camera fidelity and effect transfer to complement user studies?
  • Longer videos and memory: How can we scale to minutes-long, 4K sequences while keeping identity and motion rock-solid?
  • Multi-person interactions: Can we further boost performance in crowd scenes or complex interactions without explicit pose priors?
  • Task discovery: Could the model automatically detect which aspects (ID, style, motion, camera, effects) to transfer, given a vague prompt?

Bottom line: OmniTransfer is a strong step toward unified, controllable video generation, but there’s room to improve measurement, efficiency at extreme scales, and handling of complicated real-world edge cases.

06Conclusion & Future Work

3-Sentence Summary:

  • OmniTransfer is an all-in-one framework that learns from entire reference videos to transfer both how things look (identity, style) and how they change over time (motion, camera, effects) into new videos.
  • It works by adding a task-aware positional nudge, separating reference and target with one-way causal flow, and using a multimodal model to understand what kind of transfer you want.
  • This makes videos more consistent, more controllable, faster to generate, and better than prior methods across many tasks, even matching pose-based motion transfer without using pose.

Main Achievement:

  • Unification: One method that handles identity, style, motion, camera movement, and effects—and their combinations—by cleverly aligning positions, information flow, and semantics.

Future Directions:

  • Build robust automatic metrics for temporal tasks like camera and effects.
  • Scale to longer, higher-resolution videos while keeping compute practical.
  • Improve multi-person interactions and extremely complex motions.
  • Explore automatic task detection and adaptive transfer strengths.

Why Remember This:

  • The paper shows a simple but powerful idea: if you shift positions the right way, time problems can become space problems that video models already handle well. Combine that with causal flow and a multimodal brain, and you get a flexible, high-fidelity, practical system. It’s a new paradigm for using videos to make better videos—useful for creators, studios, and everyday storytellers.

Practical Applications

  • Clone a cinematic camera move (like a crane or dolly) from a film clip onto a new scene.
  • Keep a character’s identity consistent across multiple shots while changing outfits or settings.
  • Restyle a whole video in a specific artist’s look with stable brush texture and color palette.
  • Recreate complex visual effects (e.g., glow pulses, film grain rhythms) from a reference video.
  • Transfer a dancer’s choreography to a new performer without pose extraction.
  • Mix tasks, such as applying both an identity and a visual effect from different references.
  • Rapidly prototype commercials by matching camera language from brand references.
  • Generate consistent multi-shot sequences with unified motion and appearance.
  • Localize content by preserving action timing while changing characters and environments.
  • Educational demos: show the same experiment recorded elsewhere but with your classroom’s look.
#spatio-temporal video transfer · #identity transfer · #style transfer · #motion transfer · #camera movement transfer · #effect transfer · #task-aware positional bias · #causal attention · #multimodal alignment · #MLLM · #RoPE · #diffusion transformer · #in-context learning · #reference video · #multi-view consistency