MoCha: End-to-End Video Character Replacement without Structural Guidance
Key Summary
- MoCha is a new AI that swaps a person in a video with a new character using only one mask on one frame and a few reference photos.
- Unlike older methods, it does not need a mask for every frame or extra guides like skeletons or depth maps, so it handles tricky scenes better.
- It works by learning in context: it looks at the source video, the single-frame mask, and the reference images all at once and figures out what to replace and how to move.
- A special “condition-aware RoPE” tells the model how all these pieces line up in space and time without locking it to a fixed video length.
- To keep the face looking like the references, MoCha adds a short reinforcement learning (RL) post-training that rewards accurate identity while avoiding copy-paste cheats.
- Because real paired training videos are rare, the authors built three datasets: UE5-rendered pairs, expression-driven portrait pairs, and augmented real video–mask pairs.
- Across many tests, MoCha beats strong baselines on image quality, temporal smoothness, identity consistency, and background preservation.
- It stays robust under occlusions, fast motion, complex lighting, and multi-character interactions.
- The technique generalizes beyond people and can be used for face swap, virtual try-on, and other subject-replacement edits.
- The team will release code, making it easier for others to build on this work.
Why This Research Matters
MoCha lowers the barrier to high-quality video edits by needing only a single-frame mask and a few reference images, making complex replacements practical for more people. It preserves lighting, motion, and backgrounds, so results look natural rather than pasted-on. This can speed up film post-production, empower content creators, and enable personalized ads or avatars without massive manual labeling. It also supports tricky situations like occlusions and fast motion, which are common in real videos. The approach generalizes to face swap, virtual try-on, and even non-human subject edits. Because the team will release code, researchers and developers can quickly build new tools on top of MoCha. In short, it brings studio-grade swaps closer to everyday creative work.
Detailed Explanation
01 Background & Problem Definition
You know how movie editors can replace stunt doubles or change costumes digitally, but it takes a lot of careful work frame-by-frame? People want that kind of magic to be faster and more reliable for everyday videos, too.
🍞 Top Bread (Hook): Imagine you’re flipping through a flipbook. If you color one character blue on every page, it takes forever and you might make mistakes when the pages get messy. 🥬 Filling (The Actual Concept): Video character replacement is the task of switching the person (or character) in a video while keeping the background, camera motion, and actions the same.
- How it works (before MoCha): Earlier tools needed a mask on every frame to show exactly where the character is, and extra guides like a stick-figure skeleton or a depth map to control how the character moves and fits in 3D space.
- Why it matters: Without the right character in the right place, videos look fake—faces slide, arms clip through objects, and lighting looks wrong. 🍞 Bottom Bread (Anchor): Think of making a soccer highlight where the player is replaced by a cartoon hero, but the field, ball, and kicks stay perfectly matched.
The World Before:
- Popular methods were “reconstruction-based.” They masked the character region on every frame, then re-drew the person with the help of structural guidance (skeletons, depth) and a reference image. This approach works okay on simple videos.
🍞 Top Bread (Hook): You know how tracing a dancing person is easy when they stand still, but hard when they spin behind a chair? 🥬 Filling (The Actual Concept): Per-frame segmentation masks are outlines that mark the character’s pixels on every single frame.
- How it works: A segmentation tool draws a region around the target person over time.
- Why it matters: If the outline is wrong even once—say, during fast motion or occlusion—the mistake can spread and break later frames. 🍞 Bottom Bread (Anchor): If your outline misses a hand for three frames, the new character’s hand might vanish or jitter there too.
🍞 Top Bread (Hook): Imagine using a stick figure to guide a puppet’s dance. 🥬 Filling (The Actual Concept): Structural guidance (like pose skeletons or depth maps) are extra signals that tell the model where limbs are and how far objects are.
- How it works: Algorithms extract pose or depth per frame, then the generator follows these blueprints.
- Why it matters: If the blueprint is wrong—like a bent arm detected as straight—the final video inherits that error. 🍞 Bottom Bread (Anchor): A backflip in dim light might trick the pose detector, causing a bizarre twist in the generated video.
The Problem:
- In real videos, people get blocked by objects or other people (occlusion), lighting changes, and motions get wild (acrobatics). Errors in masks or structure pile up, causing flicker, warping, and broken motion.
- This also costs a lot: computing masks and structures for every frame is slow and expensive.
Failed Attempts:
- Heavier structural controls: more maps, more tags. But more controls mean more things can go wrong and more compute to run.
- Better maskers: still break in difficult scenes and still require per-frame effort.
The Gap:
- What if the model could track the character by itself and only needed a tiny hint of where to start?
🍞 Top Bread (Hook): You know how once you find your friend in a crowd, your eyes can keep track of them even as they move? 🥬 Filling (The Actual Concept): Video diffusion models can learn temporal patterns and track subjects implicitly across frames.
- How it works: They attend to what changes smoothly over time and keep internal memory of who’s who.
- Why it matters: If the model can track, you don’t need a mask on every frame—just one to say “start here.” 🍞 Bottom Bread (Anchor): Tag your friend in one frame, and the model follows them through the whole dance.
Real Stakes (Why we care):
- Films: faster post-production, fewer artifacts.
- Ads and social content: personalized characters without studio pipelines.
- Virtual try-on: swap clothing on moving people naturally.
- Avatars and games: drive new identities with real motion.
- Everyday creators: powerful edits without expert tools.
Before we go deeper, let’s set up two key ideas we’ll need later.
🍞 Top Bread (Hook): Picture painting a scene by starting with fog and slowly wiping it away to reveal details. 🥬 Filling (The Actual Concept): A Video Diffusion Model generates videos by starting from noise and gradually refining frames until they look real.
- How it works: Step-by-step, it denoises, using learned patterns about how videos should look and move.
- Why it matters: This stepwise process naturally handles time and motion smoothly. 🍞 Bottom Bread (Anchor): Like sharpening a blurry sports replay until you see the exact kick and ball spin.
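To make the "fog clearing" idea concrete, here is a minimal denoising-loop sketch in Python. The `denoiser` callable, the latent shape, and the step count are placeholder assumptions, not MoCha's actual components.

```python
import torch

def sample_video(denoiser, shape=(16, 4, 32, 32), num_steps=50):
    """Minimal diffusion-style sampler sketch (hypothetical `denoiser`).

    shape = (frames, channels, height, width) in latent space.
    Starts from pure noise and refines it step by step.
    """
    x = torch.randn(shape)                      # start from "fog": pure Gaussian noise
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        v = denoiser(x, t)                      # model predicts how to nudge x toward a clean video
        x = x + (t_next - t) * v                # take a small step along that predicted direction
    return x                                    # refined video latent; a VAE decodes it to pixels afterwards
```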
🍞 Top Bread (Hook): Imagine recognizing your friend by their eyes, jawline, and smile shape. 🥬 Filling (The Actual Concept): Face Feature Extraction pulls out identity clues from faces so the model knows who someone looks like.
- How it works: A face network turns a face into a signature vector that stays stable across expressions and angles.
- Why it matters: Without stable identity cues, faces drift and stop looking like the person you wanted. 🍞 Bottom Bread (Anchor): If the signature says “this is Alice,” Alice’s face should stay Alice even when she laughs or turns.
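A hedged sketch of what "a signature vector" means in code: embed two aligned face crops with a face-recognition network and compare them with cosine similarity. The `face_encoder` here is a stand-in for a model like ArcFace, not code released with the paper.

```python
import torch
import torch.nn.functional as F

def identity_similarity(face_a, face_b, face_encoder):
    """Cosine similarity between identity embeddings of two aligned face crops.

    face_a, face_b: tensors of shape (3, 112, 112) (a typical face-recognition input size).
    face_encoder: any network mapping a face crop to an identity vector (assumed interface).
    """
    emb_a = F.normalize(face_encoder(face_a.unsqueeze(0)), dim=-1)
    emb_b = F.normalize(face_encoder(face_b.unsqueeze(0)), dim=-1)
    return (emb_a * emb_b).sum().item()   # ~1.0 = same person, ~0.0 = unrelated
```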
02 Core Idea
The “Aha!” in one sentence: Give the model the source video, a single-frame mask as a starting pointer, and a few reference photos, then let it learn in context how to swap the character—no full structural guides needed.
Analogy 1: 🍞 Top Bread (Hook): You know how a magician asks for one volunteer and then keeps track of them for the whole trick? 🥬 Filling (The Actual Concept): MoCha uses one mask on one frame to say “this is the person,” and then tracks them across time while copying their motion onto a new identity.
- How it works: The model sees everything together—video, mask, references—and figures out replacement directly.
- Why it matters: You avoid making thousands of labels and fragile blueprints. 🍞 Bottom Bread (Anchor): The magician never asks you to keep waving flags on every page of the flipbook.
Analogy 2: 🍞 Top Bread (Hook): Think of building a LEGO model while looking at a photo of the finished castle. 🥬 Filling (The Actual Concept): In-context learning means MoCha understands the task by how you arrange inputs (video + mask + references) rather than by rigid extra instructions.
- How it works: Tokens from all inputs are placed in one sequence so the model can compare, align, and act.
- Why it matters: The model flexibly adapts to different video lengths, numbers of references, and where the mask appears. 🍞 Bottom Bread (Anchor): Whether your castle has 3 or 10 towers, the same building logic still works.
Analogy 3: 🍞 Top Bread (Hook): Imagine sticky notes on a timeline that mark who is who and where images belong. 🥬 Filling (The Actual Concept): Condition-aware RoPE is a smarter way to label positions in space and time so the model fuses all inputs coherently.
- How it works: Source and target frames share matching time labels; reference images get a special tag; the single-frame mask’s tag matches its chosen frame.
- Why it matters: Without this, the model could get confused about which frame or image goes where, breaking motion or identity. 🍞 Bottom Bread (Anchor): It’s like color-coding your notes so you never mix Monday’s homework with Tuesday’s.
Before vs After:
- Before: Per-frame masks + skeletons/depth; more controls, more failures when scenes get complex; heavy compute.
- After (MoCha): One mask, no structural guidance; the model tracks and replaces end-to-end; better identity, motion faithfulness, and lighting.
Why It Works (intuition):
- Video diffusion models already learn smooth time patterns and who-where-when relationships.
- If you let them see all conditions at once (in context) and tag positions smartly (condition-aware RoPE), they can infer which pixels to replace and how to keep motion consistent.
- A short RL post-training nudges the face to match the references more tightly, while an added pixel-wise loss stops copy-paste cheating.
Building Blocks (introduced in kid-friendly order):
🍞 Top Bread (Hook): You’ve seen how a class learns by example setups. 🥬 Filling (The Actual Concept): In-Context Learning means MoCha understands the job based on how we arrange the inputs together.
- How it works: Concatenate tokens from the source video, target slot, mask, and references so the model can compare and transfer.
- Why it matters: Without this, each piece would be seen alone, and the model wouldn’t know how they relate. 🍞 Bottom Bread (Anchor): Like laying worksheets side-by-side so you can match answers across them.
🍞 Top Bread (Hook): Think of coordinate stickers on a map. 🥬 Filling (The Actual Concept): Condition-aware RoPE labels tokens with space-time coordinates tailored to their role.
- How it works: Source and target share matching time labels; references are tagged as timeless but spatially distinct; the mask gets a time label that matches its chosen frame.
- Why it matters: Without aligned labels, motion could desync, faces drift, or references get ignored. 🍞 Bottom Bread (Anchor): Your legend says red pins are today’s stops, blue pins are pictures to copy, and the green pin is where to start.
🍞 Top Bread (Hook): When you practice a song, a coach listens for the tricky notes. 🥬 Filling (The Actual Concept): RL Post-Training gives rewards for faces that match the reference identity and keeps the whole picture honest with a pixel-wise loss.
- How it works: A face encoder checks identity similarity; the pixel loss prevents pasting the reference image directly.
- Why it matters: Without this, faces might look “kind of right” but not truly like the person. 🍞 Bottom Bread (Anchor): You get points for sounding like the song, but you can’t just hit play on a recording.
🍞 Top Bread (Hook): Practice with easy examples, then harder ones. 🥬 Filling (The Actual Concept): A Three-Source Dataset trains the model: perfect UE5-rendered pairs, portrait animation pairs for expressions, and augmented real video–mask pairs for realism.
- How it works: Each source fills a gap—alignment, facial motion, and real-world textures.
- Why it matters: Without diverse training, the model would struggle with either realism or alignment. 🍞 Bottom Bread (Anchor): It’s like training wheels (rendered), singing lessons (expressions), and real concerts (real videos).
03 Methodology
At a high level: Source video + single-frame mask + reference images → Encode into tokens → Concatenate with a target “slot” to be generated → Condition-aware RoPE tags every token with the right space-time label → Video diffusion model denoises step-by-step → Output video with the character replaced.
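Here is that flow written as a hedged orchestration sketch in Python. The three callables passed in (`vae_encode`, `dit_denoise`, `vae_decode`) are stand-ins for the stages described below, not MoCha's released API.

```python
import torch

def replace_character(source_video, mask_frame, mask_frame_idx, reference_images,
                      vae_encode, dit_denoise, vae_decode):
    """Orchestration sketch of the pipeline above; every callable is an assumed stand-in."""
    src = vae_encode(source_video)                           # latents that carry motion, lighting, background
    msk = vae_encode(mask_frame)                             # latent of the one masked frame (the "pointer")
    refs = [vae_encode(img) for img in reference_images]     # latents of the new identity
    tgt = torch.randn_like(src)                              # empty slot the model will fill
    out = dit_denoise(target=tgt, source=src, mask=msk,
                      mask_frame_idx=mask_frame_idx, references=refs)
    return vae_decode(out)                                   # back to pixels: same scene, new character
```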
Step 1: Inputs and What They Mean
- What happens: You provide (a) the source video (the motions and lighting you want to keep), (b) one mask on any chosen frame marking the character to replace, and (c) one or more reference images of the new identity.
- Why this step exists: The model needs to know where to start tracking (mask) and what the new character should look like (references), while keeping the original motion and background (source video).
- Example: A skateboarder does a kickflip. You draw a mask on frame 37 around the skateboarder. Your references are two photos of a superhero’s face and outfit.
🍞 Top Bread (Hook): Think of packing your backpack with labeled folders. 🥬 Filling (The Actual Concept): A Variational Autoencoder (VAE) compresses videos and images into smaller codes (latents) so the model can process them efficiently.
- How it works: The video and images are turned into compact representations that still keep essential details.
- Why it matters: Without compression, handling long videos with high resolution would be too heavy and slow. 🍞 Bottom Bread (Anchor): Like shrinking a giant poster into a postcard that still shows the picture clearly.
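A quick back-of-envelope sketch of why this compression matters. The factors used here (4x in time, 8x in space, 16 latent channels) are typical of modern video VAEs and are assumptions, not MoCha's published numbers.

```python
def latent_shape(frames, height, width,
                 t_stride=4, s_stride=8, latent_channels=16):
    """Shape of a video after a typical video-VAE encoder (compression factors assumed)."""
    t = 1 + (frames - 1) // t_stride      # a common convention: keep the first frame, compress the rest in time
    return (t, latent_channels, height // s_stride, width // s_stride)

# An 81-frame 480x832 clip shrinks to a much smaller latent tensor:
print(latent_shape(81, 480, 832))   # -> (21, 16, 60, 104)
```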
Step 2: Turn Everything Into Tokens and Line Them Up
- What happens: The compressed latents are cut into patches (tokens). We create a single sequence by concatenating: target tokens (the empty slot to fill), source video tokens, the single-frame mask tokens, and the reference image tokens.
- Why this step exists: Putting all the pieces into one timeline lets the model “read” how they relate—who to copy, where to replace, and what the new face should be.
- Example: If your source has 81 frames, the sequence holds target slots and source tokens for all 81 frames, plus the tokens of the single masked frame and of one or two reference images.
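A minimal sketch of how that single sequence could be assembled from flattened latent patches. The shapes and patch size are illustrative assumptions.

```python
import torch

def build_sequence(tgt, src, mask, refs):
    """Concatenate all inputs into one token sequence (sketch, shapes assumed).

    tgt, src:  (frames, N, D) - latent frames, N patch tokens per frame, D channels
    mask:      (1, N, D)      - patches of the single-frame mask
    refs:      list of (1, N, D) reference-image patches
    Returns one (num_tokens, D) sequence the transformer attends over.
    """
    parts = [tgt, src, mask] + list(refs)
    return torch.cat([p.reshape(-1, p.shape[-1]) for p in parts], dim=0)

frames, N, D = 21, 1560, 64   # e.g. a 60x104 latent cut into 2x2 patches -> 1560 tokens per frame
seq = build_sequence(torch.zeros(frames, N, D), torch.randn(frames, N, D),
                     torch.randn(1, N, D), [torch.randn(1, N, D)])
print(seq.shape)               # torch.Size([68640, 64]): (21 + 21 + 1 + 1) * 1560 tokens
```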
🍞 Top Bread (Hook): Imagine putting colored tabs on a calendar and a map at the same time. 🥬 Filling (The Actual Concept): Condition-aware RoPE assigns specialized space-time labels to tokens so the model knows which tokens match which frames and which are references.
- How it works (recipe):
- Give the source and target tokens matching time labels (0 to F-1) so they align frame-by-frame.
- Give reference images a special “timeless” label and separate them from each other using small spatial offsets.
- Give the single-frame mask a time label that corresponds to its chosen frame (so the model knows where the pointer belongs).
- Why it matters: Without coordinated labels, the model might lose track of the subject or mismatch frames. 🍞 Bottom Bread (Anchor): Like saying “these two rows are Monday’s class (source and target), these sticky notes are examples (references), and this green sticker marks the student we’re following (mask).”
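Here is the labelling recipe as a hedged sketch: one time index per token, with source and target sharing 0 to F-1, the mask inheriting the index of its chosen frame, and references getting an out-of-range "timeless" index. The exact index values (for example, using F for references) and the token ordering are assumptions for illustration.

```python
import torch

def rope_time_ids(num_frames, tokens_per_frame, mask_frame_idx, num_refs):
    """Time index for every token in [target | source | mask | refs] order (sketch).

    - target & source tokens: frame index 0..num_frames-1, so they align frame-by-frame
    - mask tokens: the index of the frame the mask was drawn on
    - reference tokens: a single out-of-range "timeless" index (num_frames here, an assumption)
    The spatial (h, w) indices for the other RoPE axes would be assigned the same way.
    """
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    target_t = frame_ids
    source_t = frame_ids
    mask_t = torch.full((tokens_per_frame,), mask_frame_idx)
    refs_t = torch.full((num_refs * tokens_per_frame,), num_frames)
    return torch.cat([target_t, source_t, mask_t, refs_t])

t_ids = rope_time_ids(num_frames=21, tokens_per_frame=1560, mask_frame_idx=9, num_refs=2)
print(t_ids.shape)   # one time label per token in the concatenated sequence
```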
Step 3: In-Context Learning Inside the Transformer
- What happens: A Video Diffusion Transformer (DiT) sees the entire token sequence. With full self-attention, it lets target tokens look at source, mask, and references, learning the replacement task from the arrangement itself.
- Why this step exists: The model figures out “what to replace” and “how it should look and move” by comparing tokens directly, instead of relying on fragile external blueprints.
- Example: The target frame 37 token attends strongly to the source frame 37, the mask token, and the reference face—so it knows to paint the superhero right where the skateboarder is.
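To illustrate "everything attends to everything," here is a toy joint self-attention step over a (much smaller) concatenated sequence. The real model stacks many DiT blocks with timestep conditioning and 3D RoPE; the sizes below are arbitrary.

```python
import torch
import torch.nn as nn

D = 64                                    # toy token width (assumption)
attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)

seq = torch.randn(1, 440, D)              # toy-sized stand-in for the [target | source | mask | refs] tokens
out, _ = attn(seq, seq, seq, need_weights=False)
# Full self-attention: every target token can look at the matching source frame,
# the single-frame mask, and the reference faces in one pass.
print(out.shape)                           # torch.Size([1, 440, 64])
```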
🍞 Top Bread (Hook): Think of slowly clearing fog from a window until you see the scene perfectly. 🥬 Filling (The Actual Concept): Flow Matching (within diffusion) is the training recipe that teaches the model how to move from noisy guesses toward clean frames step-by-step.
- How it works: The model predicts the direction to nudge a noisy latent so it becomes the real video over time.
- Why it matters: Without a good step-by-step guide, the model could wander and make jittery or blurry results. 🍞 Bottom Bread (Anchor): Like learning the smooth path from scribble to sketch to final drawing.
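A minimal flow-matching training step, in one common rectified-flow convention: interpolate between clean latents and noise along a straight line, and train the model to predict the velocity of that path. The interface of `model` and the time-sampling choice are assumptions.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, clean_latents, cond):
    """One flow-matching training step (sketch; sign and time conventions vary between papers).

    model: callable(x_t, t, cond) -> predicted velocity, same shape as x_t (assumed interface).
    clean_latents: ground-truth video latents, shape (batch, ...).
    """
    noise = torch.randn_like(clean_latents)
    t = torch.rand(clean_latents.shape[0], *([1] * (clean_latents.dim() - 1)))  # random time in (0, 1)
    x_t = (1.0 - t) * clean_latents + t * noise     # straight-line path from data to noise
    v_target = noise - clean_latents                # velocity along that path
    v_pred = model(x_t, t.flatten(), cond)          # model predicts the direction to nudge x_t
    return F.mse_loss(v_pred, v_target)
```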
Step 4: Identity-Enhancing Post-Training (Short and Focused)
- What happens: After the main training, a brief RL fine-tuning improves how well faces match the references. It adds an identity reward (using a face encoder like ArcFace) and a pixel-wise loss to avoid cheating.
- Why this step exists: Even good generations can drift a bit in identity. Rewards nudge the face to look more like the references, while pixel loss stops the model from just pasting a reference image.
- Example: If the jawline is off during a smile, the reward encourages the right shape next time.
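A hedged sketch of what the post-training signal could look like: a cosine-similarity identity reward from a face encoder, plus a pixel-wise term that keeps the rest of the frame faithful so the model cannot simply paste the reference. The exact reward formulation, pixel target, and weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def identity_reward(gen_faces, ref_faces, face_encoder):
    """Reward = cosine similarity between generated and reference identity embeddings."""
    e_gen = F.normalize(face_encoder(gen_faces), dim=-1)
    e_ref = F.normalize(face_encoder(ref_faces), dim=-1)
    return (e_gen * e_ref).sum(dim=-1).mean()

def post_train_objective(gen_frames, expected_frames, gen_faces, ref_faces,
                         face_encoder, pixel_weight=1.0):
    """Encourage identity match while a pixel-wise term keeps the whole frame honest (sketch)."""
    reward = identity_reward(gen_faces, ref_faces, face_encoder)
    pixel_loss = F.l1_loss(gen_frames, expected_frames)   # penalizes pasting the reference over the scene
    return -reward + pixel_weight * pixel_loss            # minimized by the fine-tuning step
```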
🍞 Top Bread (Hook): Like adding a small, smart add-on to a big machine. 🥬 Filling (The Actual Concept): LoRA (Low-Rank Adaptation) updates only tiny adapters instead of the whole model during RL post-training.
- How it works: It saves memory and keeps the base model’s general skills while improving face fidelity.
- Why it matters: Without LoRA, the fine-tuning could be heavy, slow, and risk forgetting. 🍞 Bottom Bread (Anchor): It’s like clipping a tuner onto a guitar instead of rebuilding the guitar.
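Here is a standard LoRA adapter in a few lines: the big pretrained weight stays frozen and only two tiny matrices are trained. The rank and scaling are illustrative defaults, not MoCha's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a small trainable low-rank update (standard LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                  # pretrained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)    # A: project into a tiny space
        self.up = nn.Linear(rank, base.out_features, bias=False)     # B: project back out
        nn.init.zeros_(self.up.weight)                               # start as a no-op so behaviour is unchanged
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

layer = LoRALinear(nn.Linear(1024, 1024), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 16384 adapter weights vs ~1M frozen weights in the base layer
```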
Step 5: The Data Triangle That Makes It Learn
- What happens: The model trains on three kinds of paired data:
- UE5-rendered pairs: perfect alignment; same scene and motion, different characters.
- Expression-driven portrait pairs: swap and animate faces with the same driving expressions to learn facial dynamics.
- Augmented real video–mask pairs: add realism and textures from the real world.
- Why this step exists: Each dataset fills a different gap—alignment, expressions, and realism.
- Example: A clean, rendered parkour scene teaches motion alignment; a portrait singing dataset teaches smiles and lip shapes; real street videos teach gritty lighting.
Secret Sauce (what’s clever):
- One-frame mask as a tracking pointer instead of per-frame masks.
- Condition-aware RoPE to fuse multi-modal inputs without fixing video length.
- In-context learning so the model infers the whole task from arrangement.
- A tiny, surgical RL post-train with LoRA to lock in facial identity.
04 Experiments & Results
The Test: What did they measure and why?
- Synthetic benchmark (rendered): Pairs with perfect ground truth let us check if the output matches the target exactly. Metrics: SSIM (structural similarity), LPIPS (perceptual similarity), PSNR (reconstruction quality).
- Real-world benchmark: Diverse, challenging videos with occlusions, fast motion, complex lighting, and multi-person interactions. Metrics from VBench: subject consistency, background consistency, aesthetic quality, imaging quality, temporal flickering (scored so that higher means less flicker), and motion smoothness.
🍞 Top Bread (Hook): Imagine grading drawings for both accuracy (does it match the photo?) and style (does it look pleasing and smooth?). 🥬 Filling (The Actual Concept): Metrics translate these feelings into numbers.
- How it works: SSIM/PSNR score how close you are to the ground truth; LPIPS and VBench reflect how humans perceive sharpness, style, and stability.
- Why it matters: Without numbers, we can’t tell if a method truly improves. 🍞 Bottom Bread (Anchor): It’s like getting both a math score and an art score for your project.
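For readers who want to compute these scores themselves, here is a sketch for one frame pair using scikit-image for SSIM/PSNR and the `lpips` package for perceptual distance (assuming both libraries are installed; the paper's exact evaluation code may differ).

```python
import numpy as np
import torch
import lpips                                               # pip install lpips (assumed available)
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def frame_scores(pred: np.ndarray, gt: np.ndarray):
    """SSIM / PSNR / LPIPS for one frame pair; pred and gt are HxWx3 floats in [0, 1]."""
    ssim = structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)

    def to_lpips_input(x):
        # LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1]
        return torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1

    lpips_net = lpips.LPIPS(net="alex")                    # the perceptual "neural judge"
    lpips_dist = lpips_net(to_lpips_input(pred), to_lpips_input(gt)).item()
    return ssim, psnr, lpips_dist
```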
The Competition (Baselines):
- VACE: All-in-one editing with structure guidance.
- HunyuanCustom: Masks plus references; good identity but weaker motion integration.
- Wan-Animate: Strong clothing and detail handling; still needs dense guidance.
- Kling (commercial): Good identity; weaker motion/background integration; not suited for mass automated testing.
The Scoreboard (with context):
Synthetic benchmark results:
- VACE: SSIM 0.572, LPIPS 0.253, PSNR 17.10
- HunyuanCustom: SSIM 0.644, LPIPS 0.257, PSNR 17.70
- Wan-Animate: SSIM 0.692, LPIPS 0.213, PSNR 19.20
- MoCha: SSIM 0.746, LPIPS 0.152, PSNR 23.09
What it means: MoCha’s SSIM of 0.746 is like getting an A while the others earn B’s; its LPIPS of 0.152 (lower is better) means the result “looks” closer to real to a neural judge; and its PSNR of 23.09 shows cleaner reconstructions.
Real-world VBench highlights (higher is better on every listed metric):
- Subject Consistency: MoCha 92.25 vs Wan-Animate 91.25 vs HunyuanCustom 90.03 vs VACE 71.19
- Background Consistency: MoCha 94.40 vs Wan-Animate 93.42 vs HunyuanCustom 93.68 vs VACE 77.89
- Aesthetic Quality: MoCha 60.24 vs Wan-Animate 54.60 vs HunyuanCustom 56.77
- Imaging Quality: MoCha 59.58 vs Wan-Animate 58.48 vs HunyuanCustom 58.92
- Temporal Flickering: MoCha 97.98 (higher means less flicker), matching or beating the other methods
- Motion Smoothness: MoCha 98.79 (best)
What it means: MoCha keeps the identity steady, preserves the background, and stays smooth over time, which are the key ingredients for believable edits.
Surprising Findings:
- One-frame mask really is enough: Attention maps show the model naturally tracks the subject across frames using that single pointer.
- Lighting matters: Methods that throw away too much original information (due to reconstruction masking) often miss scene lighting. MoCha preserves it better.
- More references help, but MoCha stays robust even with a single good reference due to training with reference-dropout.
🍞 Top Bread (Hook): Think of training for a concert—practice in a studio, then perform on a real stage. 🥬 Filling (The Actual Concept): The mixed training data (rendered + portrait + real) pays off across tests.
- How it works: Rendered teaches alignment, portraits teach expressions, real videos teach realism and textures.
- Why it matters: If you skip any part, you lose either accuracy, facial life, or real-world feel. 🍞 Bottom Bread (Anchor): Models trained only in the studio often freeze on a noisy stage; MoCha doesn’t.
Ablations (what changed when parts were removed):
- Removing real-human data: Faces become less expressive and more synthetic; realism drops.
- Removing identity RL (no LoRA post-train): Face match to references weakens; identity drifts during expressions.
Complex Case Performance:
- Occlusions and object interactions stay coherent.
- Complex lighting is preserved, so characters fit the scene.
- Multi-character interactions remain plausible, with correct tracking of the chosen subject.
Bottom line: Across easy and hard tests, MoCha consistently replaces characters with better identity faithfulness, smoother motion, and more natural scene integration.
05 Discussion & Limitations
Limitations:
- Reference quality dependence: If the provided reference images are tiny, blurry, or wildly inconsistent, the face can drift or average out.
- Extreme out-of-distribution actions: Motions or interactions far beyond training (e.g., very unusual sports tricks in odd lighting) can still cause artifacts.
- Very long videos at very high resolutions: Though scalable, compute and memory costs can still be high for ultra-long, ultra-HD edits.
- Fine-grained body shape mismatch: If references differ a lot in body shape from the source, clothing and silhouette can sometimes look off.
Required Resources:
- A modern GPU setup for inference; multi-GPU training (the paper used 8×NVIDIA H20 GPUs) for fine-tuning.
- Pretrained video diffusion backbone (Wan-2.1-T2V-14B) and the provided MoCha components.
- Access to the mixed training data or similar datasets.
When NOT to Use:
- When you have no acceptable reference images (identity cannot be learned).
- When legal/ethical consent is missing (e.g., unauthorized face replacement).
- When perfectly exact pixel replication is required for forensic or scientific analysis (this is a generative method, not a physical simulator).
Open Questions:
- How to better control body shape and clothing fit while preserving motion?
- Can we use fewer or weaker references (e.g., one tiny photo) while keeping identity perfect?
- How far can tracking go with crowded scenes and similar-looking subjects?
- Can post-training move beyond faces to also reward pose/clothing match without encouraging copying?
- How to make the system more efficient for very long videos without losing quality?
06 Conclusion & Future Work
Three-Sentence Summary: MoCha replaces a character in a video using only one mask on one frame and a few reference images, no dense structural guidance required. It works by letting a video diffusion model learn in context, guided by a condition-aware positional scheme and a small RL post-training that sharpens facial identity. This design keeps motion, lighting, and background intact while swapping in a new, consistent face and body.
Main Achievement: The key contribution is an end-to-end, structure-free character replacement pipeline anchored by a single-frame mask and condition-aware RoPE, delivering state-of-the-art identity preservation and temporal consistency across challenging real-world scenes.
Future Directions:
- Add stronger controls for body shape and clothing fit while preserving realism.
- Improve efficiency and memory use for longer, higher-resolution videos.
- Explore broader subject types (animals, vehicles) and richer interaction edits.
- Expand post-training rewards beyond faces to include pose and garment fidelity.
Why Remember This: MoCha shows we don’t need heavy, fragile, per-frame guides to do high-quality video character replacement—one smart pointer plus in-context learning and a light identity tune-up can be enough. That makes powerful video edits more robust, faster, and more accessible for creators and studios alike.
Practical Applications
- Film and TV post-production: swap stunt doubles or late-cast actors while preserving lighting and action.
- Creator tools: easy character replacement for vlogs, shorts, and social videos without frame-by-frame masks.
- Advertising: personalize actors or outfits for different regions or audiences quickly.
- Virtual try-on: replace clothing on moving people while keeping realistic fabric motion and scene lighting.
- Game and avatar creation: drive new characters with captured motions and expressions from reference footage.
- Education and training: anonymize faces or replace identities in sensitive footage while retaining actions.
- Corporate communications: brand-consistent avatars replacing presenters without reshooting.
- Cultural localization: adapt characters to local preferences while maintaining original camera work and sets.
- Archival restoration: carefully replace damaged or missing character visuals while preserving historical context.
- Prototype VFX: previsualize character swaps early in production with minimal setup.