DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer
Key Summary
- DreamID-V is a new AI method that swaps faces in videos while keeping the body movements, expressions, lighting, and background steady and natural.
- It borrows the strengths of image face swapping (which is very good at identity matching) and brings them into video face swapping (which needs smooth motion across frames).
- A custom data pipeline called SyncID-Pipe builds special training pairs so the model learns exactly whose face to keep and which video details to preserve.
- An Identity-Anchored Video Synthesizer (IVS) helps create training videos that match poses and expressions while keeping identity separate, so learning is clear and stable.
- The core model is a Diffusion Transformer with a Modality-Aware Conditioning module that feeds in context, structure (pose), and identity in the right places so they don't get mixed up.
- Training uses a Synthetic-to-Real Curriculum: first learn from clean synthetic data, then fine-tune on real-looking scenes to boost realism without losing identity.
- An Identity-Coherence Reinforcement Learning step focuses extra learning on the hardest frames (like side views or fast motion) to cut flicker and keep identity steady over time.
- On the new IDBench-V benchmark of tough real-world cases, DreamID-V beats past methods in identity, attribute preservation, and video quality.
- The framework is versatile and can be adapted to swap not just faces, but also accessories, outfits, hairstyles, and more.
- The team also discusses safety and releases a benchmark to encourage fair, careful evaluation.
Why This Research Matters
High-quality, steady face swapping can power safer film stunts, dubbing, and creative edits without reshoots, saving time and cost. Newsrooms and educators can localize content ethically by re-speaking with clear consent, while keeping backgrounds and actions faithful. Accessibility improves when presenters can maintain consistent on-screen identity despite camera or lighting changes. Privacy can be protected by swapping in approved proxy identities for public sharing, when done transparently. The benchmark (IDBench-V) encourages fair comparisons so the field moves toward more reliable, less flickery results. The approach also generalizes to outfit or accessory swapping for virtual try-ons and creative design. Built-in discussions of ethics help steer the technology toward responsible, consent-based uses.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how swapping your friend’s face onto a photo can look great, but when you try to do it in a video, the face sometimes wiggles, flickers, or doesn’t move quite right? That’s because videos add the challenge of time—lots of frames that all need to agree.
🥬 The Concept (Video Face Swapping basics): Video Face Swapping (VFS) means putting one person’s identity (who they are) onto someone else’s video while keeping the video’s original pose, expressions, lighting, background, and motion.
- How it works (before this paper): Most systems took image face-swapping tricks and ran them frame-by-frame on videos. Each frame could look okay by itself, but together they often flickered or drifted because the identity and movements weren’t learned across time.
- Why it matters: Without solid time awareness, you get jittery eyebrows, mismatched lighting, and identity that comes and goes—making results feel fake.
🍞 Anchor: Imagine pasting sticker faces on each page of a flipbook. If the stickers aren’t placed consistently, the animation looks wobbly.
🍞 Hook: You know how taking a single photo is much easier than shooting a smooth movie? That’s like Image Face Swapping (IFS) versus Video Face Swapping (VFS).
🥬 The Concept (Image Face Swapping vs. Video): Image Face Swapping (IFS) swaps a face in one picture and is very good at matching identity and local details; VFS must also make that identity flow smoothly over many frames.
- How it works: IFS models learn strong identity features and can keep attributes (like lighting and background) in one image. But when naïvely applied frame-by-frame to a video, they don’t enforce consistency across time.
- Why it matters: IFS is great at identity fidelity, so if we could “teach” VFS to use IFS’s strengths while keeping time consistent, we’d get the best of both worlds.
🍞 Anchor: It’s like using a top-notch portrait artist (IFS) to plan each frame, then asking a film director (VFS) to keep the whole scene consistent across a sequence.
🍞 Hook: Think of a marching band: each marcher (frame) needs to step in rhythm, not just look sharp alone.
🥬 The Concept (The Problem): Existing VFS methods struggled to balance three things at once: (1) strong identity similarity, (2) faithful attribute preservation (pose, expression, lighting, background), and (3) temporal consistency (no flicker).
- How it works (failed attempts):
- Frame-by-frame IFS on videos → good identity per frame, but flicker over time.
- Video diffusion/inpainting methods → better smoothness but often weaker identity or attribute control.
- Using conditions like pose or masks → helps structure, but still mixes identity and motion signals.
- Why it matters: Without a clean way to teach which parts belong to identity and which to attributes/motion, the model confuses “who” with “how they move,” causing identity leaks or stiff faces.
🍞 Anchor: If the drummer keeps perfect beat (motion) but swaps instruments mid-parade (identity confusion), the whole show feels off.
🍞 Hook: Imagine learning to juggle by first practicing with soft beanbags (safe and predictable) before switching to real balls (harder but realistic).
🥬 The Concept (The Gap): Training data for VFS usually lacked explicit, paired supervision that clearly told the model: “Keep this identity, but follow that video’s motion and background.”
- How it works: We need videos where the model knows exactly which frames carry identity and which carry attributes—paired and aligned—so it can learn to separate and fuse them correctly.
- Why it matters: Without explicit supervision, models guess. With explicit, well-structured pairs, they learn cleanly and quickly.
🍞 Anchor: It’s like giving a student clear answer keys for both “who is in the picture” and “how they’re moving,” so studying is much more effective.
🍞 Hook: Think of movie magic: stunt doubles, rehearsals, and final shoots work together to make the hero look the same while performing wild moves.
🥬 The Concept (This Paper’s Role): DreamID-V builds a bridge from IFS to VFS by creating a special data pipeline and a specialized video model that learn identity, motion, and context the right way.
- How it works: A new pipeline (SyncID-Pipe) crafts paired training examples; a Diffusion Transformer model uses a Modality-Aware Conditioning system to feed in context, structure, and identity separately; a two-stage curriculum plus a reinforcement learning step improves realism and identity stability over time.
- Why it matters: With this combo, videos look realistic, identities are accurate, and motion is smooth—no more trade-offs.
🍞 Anchor: The final videos are like a well-edited movie: the actor’s face is always “them,” while the background, lighting, and actions match the original scene perfectly.
02 Core Idea
🍞 Hook: Imagine you want to build the ultimate LEGO figure: you borrow the best face from one set (image model), the coolest moving joints from another (video model), and a clear instruction guide that tells you which piece goes where.
🥬 The Concept (Aha! in one sentence): The key idea is to transfer the excellent identity skills of image face swapping into videos by building perfect paired training data and a video model that keeps identity, structure, and context cleanly separated and fused.
- How it works (the recipe):
- Build paired training examples (bidirectional ID quadruplets) so the model knows what to copy for identity and what to keep for motion/background.
- Use a Diffusion Transformer with Modality-Aware Conditioning (MC) so identity, pose/structure, and context each enter the model in the most helpful way.
- Train in two steps: first on clean synthetic data (fast, stable identity), then on real-augmented data (better realism).
- Add Identity-Coherence Reinforcement Learning to focus on hard frames (like side views) and reduce flicker.
- Why it matters: Without this, models either lose identity, break motion, or look fake. With it, they keep all three in balance.
🍞 Anchor: The result is a face swap that feels like watching the same person act through the whole scene—no slips, no jumps.
Three analogies:
- Orchestra: IFS is the soloist with a beautiful melody (identity), VFS needs the whole orchestra to stay in sync (motion/background). DreamID-V is the conductor that keeps them together.
- Sports team: One star player (IFS identity) can’t win alone; you need teammates (pose, masks, background) and a coach (MC + curriculum + IRL) to coordinate a victory (smooth high-fidelity videos).
- Baking: Use the best chocolate (IFS identity), the right oven settings (pose + context), and a two-stage bake (synthetic→real). A final glaze (IRL) smooths any imperfections.
Before vs After:
- Before: Frame-by-frame tricks gave good faces per frame but shaky videos; video-only models kept smoothness but often blurred identity or messed up attributes.
- After: DreamID-V keeps identity strong like top IFS, preserves attributes like pose/lighting/background, and maintains smooth motion across frames.
Why it works (intuition):
- Clean supervision: The bidirectional ID quadruplets act like answer keys that remove confusion between “who” and “how they move.”
- Smart conditioning: MC routes each kind of information (context, structure, identity) to the right place in the model so they don’t tangle.
- Stepwise learning: Synthetic first (safe practice), then real augmentation (authentic polish), so the model learns quickly and robustly.
- Focus on hard parts: IRL pushes extra attention to tricky frames, cutting flicker where it matters most.
Building blocks (each explained with sandwiches as first-time concepts appear):
- Image Face Swapping (IFS) → strong identity features to borrow.
- Video Face Swapping (VFS) → needs temporal consistency.
- Diffusion Transformer (DiT) → a powerful video generator backbone.
- SyncID-Pipe → data pipeline that constructs perfect paired supervision.
- Identity-Anchored Video Synthesizer (IVS) → creates identity-preserving, pose-guided videos.
- Modality-Aware Conditioning (MC) → splits and injects context/structure/identity cleanly.
- Synthetic-to-Real Curriculum → two-phase training for stability then realism.
- Identity-Coherence Reinforcement Learning (IRL) → focuses learning on the toughest frames.
03 Methodology
At a high level: Input (source video + target identity image) → SyncID-Pipe builds paired training data → DreamID-V (a Diffusion Transformer with Modality-Aware Conditioning) learns to swap → Training uses Synthetic-to-Real Curriculum → IRL refines temporal identity consistency → Output is a high-fidelity, temporally smooth swapped video.
Concept sandwiches for key components (first introductions):
- 🍞 Hook: Think of a movie rehearsal where actors first practice with stand-ins before the real shoot. 🥬 The Concept (SyncID-Pipe): SyncID-Pipe is a data pipeline that makes perfectly matched training pairs so the model clearly learns identity vs. motion/background.
- How it works:
- Pre-train an Identity-Anchored Video Synthesizer (IVS) that can rebuild videos from keyframes and pose.
- Use a strong image face-swapping model to create identity-consistent first/last frames for a video.
- Generate a synthetic video (same motion, new identity) and pair it with the real video to form bidirectional ID quadruplets.
- Add expression adaptation and enhanced background recomposition so pairs align in motion and environment.
- Why it matters: Without clean, paired guidance, the model confuses who the person is with how they move, causing identity leaks and shaky results. 🍞 Anchor: It’s like giving the student two scripts—one says “who the character is,” the other says “how to act”—so they never mix them up.
- 🍞 Hook: Imagine a dance choreographer telling a performer exactly when to turn, tilt, and nod. 🥬 The Concept (Identity-Anchored Video Synthesizer, IVS): IVS is a video generator that keeps identity while following a pose sequence between a first and last frame.
- How it works:
- Take the first and last frames of a portrait video and extract the pose sequence.
- Feed them into a First-Last-Frame video foundation model.
- Use Adaptive Pose Attention to inject motion precisely into the Diffusion Transformer blocks.
- Train with Flow Matching so it learns to reconstruct consistent videos.
- Why it matters: IVS can make synthetic training clips that align with the base video model’s distribution, speeding up learning and stabilizing identity. 🍞 Anchor: Like animating a character between two key photos by following a dance map of head movements.
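To make the Adaptive Pose Attention idea concrete, here is a minimal sketch, assuming a cross-attention layer in which video latent tokens query a per-frame pose token sequence through a zero-initialized gate. The class name, tensor shapes, and gating scheme are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of pose injection via cross-attention inside a DiT block.
# Names, shapes, and the zero-init gate are assumptions for illustration only.
import torch
import torch.nn as nn

class AdaptivePoseAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.pose_proj = nn.Linear(dim, dim)                  # map pose features into token space
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))              # zero-init: starts as a no-op

    def forward(self, video_tokens: torch.Tensor, pose_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N_video, dim); pose_tokens: (B, N_pose, dim), e.g. one token per frame
        pose = self.pose_proj(pose_tokens)
        guided, _ = self.attn(query=video_tokens, key=pose, value=pose)
        # gated residual: pose guidance is blended in without overwriting the video content
        return video_tokens + self.gate.tanh() * guided

# toy usage: 16 frames x 64 spatial tokens, one pose token per frame
vid = torch.randn(2, 16 * 64, 512)
pose = torch.randn(2, 16, 512)
print(AdaptivePoseAttention()(vid, pose).shape)  # torch.Size([2, 1024, 512])
```

Per the summary, a layer like this would be pretrained inside IVS and then reused by DreamID-V's Structural Guidance Module, so motion guidance is already well aligned with the backbone.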
- 🍞 Hook: Picture making a sandwich where bread, lettuce, and cheese each go in their own layer so flavors don’t get muddled. 🥬 The Concept (Modality-Aware Conditioning, MC): MC feeds three types of information—context, structure, identity—into the model in different, best-suited ways so they don’t tangle.
- How it works:
- Spatio-Temporal Context Module: Concatenate the target video and a face mask with the latent so background/lighting are preserved.
- Structural Guidance Module: Inject pose sequence via Pose Attention (initialized from IVS) to capture expressions and head motion.
- Identity Information Module: Encode identity into embeddings and concatenate as tokens so attention mixes identity across space and time.
- Why it matters: Without this separation, identity can override background or pose can distort identity, leading to fake-looking results. 🍞 Anchor: Like three clear lanes on a highway—cars (identity), buses (pose), and bikes (context) don’t crash because each has a proper lane.
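The routing can be pictured with a small shape-level sketch, assuming the context video and face mask are concatenated along the channel axis, pose enters as per-frame tokens, and identity enters as a handful of extra tokens. All names and dimensions below are illustrative, not the paper's interface.

```python
# Shape-level sketch of assembling the three condition streams (illustrative only).
import torch

B, T, C, H, W = 1, 8, 16, 32, 32                    # batch, frames, latent channels, height, width
noisy_latent  = torch.randn(B, T, C, H, W)          # the latent being denoised
target_latent = torch.randn(B, T, C, H, W)          # encoded target video (context to preserve)
face_mask     = torch.rand(B, T, 1, H, W)           # 1 where the face should be replaced

# 1) Context: channel-wise concatenation, so background and lighting are copied, not invented.
context_input = torch.cat([noisy_latent, target_latent, face_mask], dim=2)   # (B, T, 2C+1, H, W)

# 2) Structure: per-frame pose features, injected later through pose attention (see sketch above).
pose_tokens = torch.randn(B, T, 512)

# 3) Identity: embeddings of the target identity image, appended as extra tokens so
#    self-attention can spread identity across space and time.
id_tokens = torch.randn(B, 4, 512)

print(context_input.shape, pose_tokens.shape, id_tokens.shape)
# In a real DiT, context_input would be patchified and projected to the token width,
# then concatenated with id_tokens along the token axis before the transformer blocks.
```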
- 🍞 Hook: Think of practicing piano on a digital keyboard (forgives mistakes) before performing on a grand piano (less forgiving but richer sound). 🥬 The Concept (Synthetic-to-Real Curriculum Learning): Learn first from synthetic pairs (easy, clean) and then fine-tune on real-augmented pairs (harder, realistic) to boost photorealism.
- How it works:
- Stage 1: Train on forward-generated pairs from IVS so identity converges fast (clean supervision, matched distribution).
- Stage 2: Fine-tune on backward-real pairs enhanced with background recomposition to learn real-world textures and stability.
- Why it matters: Jumping straight to messy real data slows learning and blurs identity; this curriculum keeps identity strong then adds realism. 🍞 Anchor: Like learning to drive in a simulator, then on quiet streets, before handling busy roads.
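A schematic of the two-stage schedule, assuming hypothetical `synthetic_loader` and `real_loader` iterables and a `flow_matching_loss` helper (a generic version is sketched later in the Flow Matching note); the real step counts, schedules, and optimizer settings are not specified here.

```python
# Schematic two-stage curriculum; all helpers and step counts are placeholders.
def train_curriculum(model, optimizer, synthetic_loader, real_loader,
                     flow_matching_loss, steps_stage1=10_000, steps_stage2=5_000):
    # Stage 1: clean, distribution-matched IVS pairs -> identity converges fast.
    for _, batch in zip(range(steps_stage1), synthetic_loader):
        loss = flow_matching_loss(model, batch)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    # Stage 2: real videos with recomposed backgrounds -> photorealism and stability.
    for _, batch in zip(range(steps_stage2), real_loader):
        loss = flow_matching_loss(model, batch)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    return model
```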
- 🍞 Hook: When you study, you focus extra time on the toughest questions. 🥬 The Concept (Identity-Coherence Reinforcement Learning, IRL): IRL re-weights the training loss to emphasize frames where identity fidelity is low (e.g., side views, fast motion).
- How it works:
- Run a generation pass and compute an identity similarity score for each frame.
- Average scores per chunk and use them to weight the training loss (harder frames get more weight).
- Update the model so it pays more attention to challenging moments, reducing temporal flicker.
- Why it matters: Without IRL, identity quality dips in tough frames; IRL raises the floor where it’s weakest. 🍞 Anchor: Like a coach making the team practice the trickiest plays more often until they’re solid.
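A minimal sketch of the re-weighting idea: score each generated frame against the target identity with a face-recognition embedding, average per chunk, and turn low scores into high loss weights. The softmax weighting, chunk size, and temperature below are my own assumptions; the paper's exact reward and update rule may differ.

```python
# Sketch of identity-aware loss re-weighting (weighting scheme is illustrative).
import torch
import torch.nn.functional as F

def identity_weights(frame_embeds, ref_embed, chunk_size=4, temperature=0.1):
    """frame_embeds: (T, D) face embeddings of generated frames (e.g. ArcFace-style);
    ref_embed: (D,) embedding of the target identity image."""
    sims = F.cosine_similarity(frame_embeds, ref_embed.unsqueeze(0), dim=-1)  # (T,)
    chunk_scores = torch.stack([c.mean() for c in sims.split(chunk_size)])    # per-chunk identity
    # lower identity score -> larger weight; weights average to ~1 across chunks
    weights = torch.softmax(-chunk_scores / temperature, dim=0) * len(chunk_scores)
    return weights

def reweighted_loss(per_chunk_losses, weights):
    # hard chunks (side views, fast motion) contribute more to the gradient
    return (per_chunk_losses * weights.detach()).mean()

# toy run on random embeddings: 8 frames split into 4 chunks of 2
w = identity_weights(torch.randn(8, 512), torch.randn(512), chunk_size=2)
print(w, reweighted_loss(torch.rand(4), w))
```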
More sandwiches for important helpers:
- 🍞 Hook: Imagine swapping smiles without changing who’s smiling. 🥬 The Concept (Expression Adaptation): A module separates who the person is (identity) from how they emote (expression), then recombines identity from the target image with expression and pose from the source video.
- How it works: Reconstruct a 3D face, split identity vs. expression/pose coefficients, retarget landmarks, and guide the pose sequence.
- Why it matters: Prevents identity leakage and lets expressions transfer cleanly. 🍞 Anchor: Like putting your friend’s unique face on the same dance moves.
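A toy sketch of the recombination step, assuming a generic 3DMM-style parameterization where a face is described by separate identity, expression, and pose coefficient vectors; the vector sizes and the split are illustrative, not the paper's exact face model.

```python
# Generic 3DMM-style coefficient recombination (sizes are illustrative).
import numpy as np

def recombine_coefficients(target_id_coeffs, source_exp_coeffs, source_pose_coeffs):
    """Keep 'who' from the target image and 'how they emote/move' from the source video.
    target_id_coeffs: (D_id,); source_exp_coeffs: (T, D_exp); source_pose_coeffs: (T, D_pose)."""
    T = source_exp_coeffs.shape[0]
    id_track = np.repeat(target_id_coeffs[None, :], T, axis=0)   # identity held fixed per frame
    return np.concatenate([id_track, source_exp_coeffs, source_pose_coeffs], axis=1)

# toy example: 80 identity, 64 expression, 6 pose coefficients over 24 frames
frames = recombine_coefficients(np.zeros(80), np.random.randn(24, 64), np.random.randn(24, 6))
print(frames.shape)  # (24, 150) -> later rendered to landmarks that guide the pose sequence
```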
- 🍞 Hook: Think of green-screen compositing for movies. 🥬 The Concept (Enhanced Background Recomposition): Cleans the background from the real video and pastes the synthetic foreground into it so the model learns to preserve real backgrounds.
- How it works: Segment foreground with SAM2, remove it, keep a clean background clip, then paste the generated face/body back with feathered edges.
- Why it matters: Models learn real-world background stability without copying paste artifacts. 🍞 Anchor: Like filming an actor on a set and carefully blending them into a real scene.
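The pasting step can be illustrated with a simple feathered alpha composite. The summary mentions SAM2 for segmentation and "feathered edges"; the Gaussian-blur matte and blend below are my own illustrative choices, not the paper's exact recipe.

```python
# Feathered compositing of a generated foreground onto a cleaned real background.
# SAM2 would supply `mask`; the blur-based feathering here is an illustrative choice.
import numpy as np
import cv2

def feathered_composite(foreground, clean_background, mask, feather_px=15):
    """foreground, clean_background: (H, W, 3) float32 in [0, 1]; mask: (H, W) in {0, 1}."""
    soft = cv2.GaussianBlur(mask.astype(np.float32), (feather_px, feather_px), 0)
    alpha = soft[..., None]                              # soft matte with smooth edges
    return alpha * foreground + (1.0 - alpha) * clean_background

# toy usage on random frames
H, W = 64, 64
fg = np.random.rand(H, W, 3).astype(np.float32)
bg = np.random.rand(H, W, 3).astype(np.float32)
m = np.zeros((H, W), np.float32); m[16:48, 16:48] = 1.0
print(feathered_composite(fg, bg, m).shape)  # (64, 64, 3)
```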
- 🍞 Hook: A relay race only works when every baton handoff is clean. 🥬 The Concept (Bidirectional ID Quadruplets): Each training sample has four parts: an image and a real video of identity A, an image of identity B, and a synthesized video that follows A's motion and scene but shows identity B.
- How it works: Use IFS to create identity-correct keyframes, IVS to make the synthetic video, then pair forward (I_A, V_B^syn vs V_A^real) and backward to supervise both directions.
- Why it matters: Gives the model crystal-clear instructions on identity vs. motion/background. 🍞 Anchor: Like practicing both “A becomes B” and “B aligns with A’s scene” so the handoff is always clean.
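A data-structure sketch of one quadruplet, following the summary's notation (I_A, V_A^real, an image of B, V_B^syn); the field names and the exact pairing convention are illustrative assumptions rather than the paper's record format.

```python
# Illustrative record for one bidirectional ID quadruplet (field names assumed).
from dataclasses import dataclass
import numpy as np

@dataclass
class IDQuadruplet:
    image_A: np.ndarray        # identity image of person A (from the real video)
    video_A_real: np.ndarray   # real video of A: motion, expressions, background
    image_B: np.ndarray        # identity image of person B (the identity to inject)
    video_B_syn: np.ndarray    # IVS-synthesized video: A's motion and scene, B's identity

    def forward_pair(self):
        # condition on (I_A, V_B^syn) and supervise with the real video of A
        return (self.image_A, self.video_B_syn), self.video_A_real

    def backward_pair(self):
        # the reverse direction: inject B's identity into A's real footage
        return (self.image_B, self.video_A_real), self.video_B_syn
```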
- 🍞 Hook: Following a GPS line from point A to B. 🥬 The Concept (Flow Matching, training backbone): Flow Matching makes training stable by learning a direct “velocity” from noise to data along a straight path in latent space.
- How it works: Interpolate between noise and data, train the model to predict the velocity that moves you toward the data.
- Why it matters: Stable, efficient training for video diffusion. 🍞 Anchor: Like an arrow that always points you smoothly toward your destination.
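For readers who want the math in code form, here is a generic flow-matching training step: interpolate on a straight line between noise and data, then regress the constant velocity that points from noise toward data. This is the standard rectified-flow formulation, not necessarily the paper's exact schedule or parameterization.

```python
# Generic flow-matching / rectified-flow training objective (standard formulation).
import torch

def flow_matching_loss(model, x1, cond):
    """x1: clean data latent of shape (B, ...); cond: whatever conditioning the model takes."""
    x0 = torch.randn_like(x1)                                    # noise endpoint
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1.0 - t) * x0 + t * x1                                # straight-line interpolation
    target_velocity = x1 - x0                                    # constant velocity along the path
    pred_velocity = model(x_t, t.flatten(), cond)                # network predicts the velocity field
    return torch.mean((pred_velocity - target_velocity) ** 2)
```

At sampling time, integrating the predicted velocity from noise toward data recovers a clean latent, which is why the summary describes the objective as stable and efficient.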
Putting it all together (example walkthrough):
- Input: A source video (with motion, expressions, background) and a target face image (identity to swap in).
- Data prep: SyncID-Pipe builds paired examples with IVS, expression adaptation, and background recomposition.
- Model: DreamID-V (a DiT) gets three condition streams via MC: context (video + mask), structure (pose), and identity (ID embeddings).
- Training: Stage 1 on synthetic → strong identity; Stage 2 on real-augmented → strong realism; IRL → steady identity in tough frames.
- Output: A video where the target identity appears naturally, keeping the source video’s pose, expressions, lighting, and background.
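As a schematic, the inference path might be orchestrated like the sketch below; every helper is a hypothetical placeholder standing in for the components described above, not a real API from the paper or any library.

```python
# Hypothetical end-to-end inference orchestration (all helpers are placeholders).
def swap_identity(source_video, target_id_image, *, encode_video, face_masker,
                  pose_extractor, id_encoder, dreamid_v_sampler, decode_video):
    context  = encode_video(source_video)        # latents of the video to edit (context stream)
    mask     = face_masker(source_video)         # where the face will be replaced
    pose     = pose_extractor(source_video)      # per-frame pose/expression signal (structure stream)
    identity = id_encoder(target_id_image)       # identity embedding tokens (identity stream)
    latents  = dreamid_v_sampler(context=context, mask=mask, pose=pose, identity=identity)
    return decode_video(latents)                 # swapped video: new identity, original scene
```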
04 Experiments & Results
🍞 Hook: Think of a school field day with tough challenges—sprinting, balancing, and teamwork—and a scoreboard to see who really excels.
🥬 The Concept (The Test): The team built IDBench-V, a new benchmark of 200 challenging real-world video-and-image pairs to test identity accuracy, attribute preservation, and video quality.
- How it works:
- Identity Consistency: Compare the generated face to the target identity using strong face recognizers (like ArcFace) and check stability over frames (variance).
- Attribute Preservation: Measure how well pose and expressions match the source video, plus background and subject consistency and motion smoothness (from VBench).
- Video Quality: Use Fréchet Video Distance (FVD) to judge overall visual realism.
- Why it matters: These tests catch different failure modes: identity drift, pose/expression mismatch, shaky backgrounds, or low realism.
🍞 Anchor: It’s like grading a performance on “looks like the right actor,” “follows the choreography,” and “the film looks polished.”
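A minimal sketch of the identity-consistency part of this scoring, assuming ArcFace-style embeddings for the generated frames and the target identity image; the benchmark's exact protocol, crops, and aggregation may differ.

```python
# Sketch of identity consistency: mean cosine similarity to the target identity
# plus its variance over time (protocol details are assumptions).
import torch
import torch.nn.functional as F

def identity_consistency(frame_embeds: torch.Tensor, target_embed: torch.Tensor):
    """frame_embeds: (T, D) face embeddings of generated frames; target_embed: (D,)."""
    sims = F.cosine_similarity(frame_embeds, target_embed.unsqueeze(0), dim=-1)
    return sims.mean().item(), sims.var().item()   # higher mean, lower variance = better

mean_id, var_id = identity_consistency(torch.randn(30, 512), torch.randn(512))
print(f"ID similarity {mean_id:.3f}, temporal variance {var_id:.4f}")
```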
Competition (Baselines):
- Image-based: FSGAN, REFace, Face-Adapter, DreamID (applied frame-by-frame to videos).
- Video-based: Stand-In, CanonSwap.
Scoreboard with context:
- Identity: DreamID-V scores top identity similarity across metrics (e.g., ID-Arc ≈ 0.659), which is like getting an A+ when others hover around B to C. It even slightly surpasses the IFS model DreamID when used frame-by-frame, thanks to clean supervision and training strategy.
- Attributes: It’s near the best on pose/expression/background, close to CanonSwap’s high preservation—but unlike CanonSwap, DreamID-V achieves this while also keeping very strong identity.
- Quality: The FVD score (lower is better) beats image-only baselines, which can flicker, and is competitive with dedicated video models, indicating smooth, realistic videos.
Surprising findings:
- Clean synthetic-first training (from IVS) accelerates learning and boosts identity beyond what you’d expect, because synthetic data matches the model’s own distribution.
- The IRL step doesn’t just lift average identity—it reduces frame-to-frame variance, meaning fewer identity “hiccups” during tough angles or fast motion.
User study:
- Human raters scored DreamID-V best overall for identity, attributes, and quality (scores of roughly 4+ out of 5), confirming that the automatic metrics line up with what people actually perceive.
Ablations (what breaks without each piece):
- Without Quadruplets (just inpainting): Identity drops a lot—shows the data pipeline is key.
- Only Real Training: Realism is okay, but identity weakens—too messy, too soon.
- Only Synthetic Training: Identity is strong, but realism suffers—needs real augmentation.
- Without IRL: Average identity is good, but consistency suffers on side views—IRL fixes the “final mile.”
🍞 Anchor: Imagine a gymnast’s routine: without warm-up (synthetic), they wobble; without real practice, polish is missing; without a coach’s targeted drills (IRL), the hardest moves falter. DreamID-V brings all three.
05 Discussion & Limitations
🍞 Hook: Every superhero has a weakness, and every tool has a best-use zone.
🥬 The Concept (Honest assessment): DreamID-V is powerful, but there are trade-offs and practical needs.
- Limitations (what it can’t do easily):
- Extremely wild backgrounds or camera motions can still challenge perfect preservation.
- Very tiny faces or ultra-fast spins may briefly lower identity confidence (though IRL helps).
- Training depends on curated data and careful pairing; poor data reduces gains.
- Ethical risk: like any high-fidelity swapper, misuse is possible without consent or transparency.
- Required resources: A modern GPU setup, access to large-scale human-centric video data, and IFS-quality keyframes are needed to realize full performance.
- When not to use: If you need low-compute, on-device swaps, or if faces are too small/occluded to extract reliable pose/ID signals, simpler methods may suffice. Also, never use without clear consent and proper disclosure.
- Open questions:
- Can we further improve side-profile identity with multi-view identity encoders?
- Can background preservation be strengthened under extreme camera movement?
- Can we reduce compute via distillation or streaming inference while keeping quality?
- How to watermark or detect swaps robustly without harming quality?
🍞 Anchor: Think of it like a high-end camera rig: amazing results with the right setup and responsibility, but not the best fit for every quick snapshot.
06 Conclusion & Future Work
🍞 Hook: Picture a puzzle where one box has the perfect face pieces (images) and another has the motion pieces (videos). If you can fit them together just right, you complete a stunning moving portrait.
🥬 The Concept (3-sentence summary): DreamID-V bridges image and video face swapping by building explicit paired training data (via SyncID-Pipe and IVS) and by using a Diffusion Transformer with Modality-Aware Conditioning to keep identity, structure, and context cleanly separated. A Synthetic-to-Real Curriculum sets identity first, realism second, and an Identity-Coherence Reinforcement Learning step polishes the toughest frames. On the new IDBench-V benchmark, it achieves state-of-the-art identity fidelity, strong attribute preservation, and high video quality.
Main achievement: Showing that carefully crafted data pairs plus a modality-aware DiT—and a thoughtful training schedule—can deliver high-fidelity, temporally steady face swaps that outperform prior art.
Future directions: Stronger side-profile identity via multi-view cues, more robust background handling in dynamic scenes, lighter compute via distillation, and built-in watermarking/detection for safer deployment.
Why remember this: DreamID-V demonstrates a practical recipe for uniting image-level strengths with video-level smoothness, proving that the right supervision and conditioning can turn wobbly swaps into movie-ready results—while reminding us to use such power responsibly.
Practical Applications
- Film and TV post-production: Replace faces for stunt scenes while keeping the original camera work and lighting.
- Dubbing and localization: Match new speakers' identities or approved proxies to original performances with accurate lip and expression sync.
- Advertising and creative edits: Quickly generate brand-consistent faces across many shots without reshooting.
- Virtual try-on: Swap accessories, hairstyles, or outfits on moving people to preview styles in motion.
- Privacy-preserving sharing: Publish videos with a consented proxy identity while preserving the original scene and actions.
- Education and training: Demonstrate motion or expression changes while keeping scene context for clearer instruction.
- Game and VR content: Create consistent character identity across cutscenes with complex camera moves.
- Live event highlights: Post-process clips to maintain performer identity under tough angles and lighting.
- Forensics and media analysis research: Test detectors and watermarking methods on a strong, realistic swap baseline.
- Telepresence and avatars: Keep a user's chosen identity stable in video calls while matching natural expressions.