Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Key Summary
- This paper builds a real-time talking-listening head avatar that reacts naturally to your words, tone, nods, and smiles in about half a second.
- It uses a clever recipe called causal diffusion forcing so the avatar can respond right away without waiting for future audio or video frames.
- A Dual Motion Encoder fuses three live signals (your audio, your face motion, and the avatar's audio) into one clear instruction for the avatar.
- A blockwise look-ahead causal mask keeps motion smooth over time while still staying causal (no peeking into the future).
- KV caching stores what the model just figured out, so the next step is faster; this is key to real-time speed.
- To teach expressiveness without human labels, the model uses Direct Preference Optimization, comparing "good" reactions with "under-expressive" ones made by dropping user cues.
- On the RealTalk dataset, the system runs with ~0.5s latency (about 6.8× faster than a strong baseline), and people pick it as better more than 80% of the time.
- It stays competitive on lip-sync and visual quality while being much more reactive and expressive.
- Ablations show user motion is essential; removing it makes the avatar stiff and less responsive.
- This approach opens the door to more natural, two-way, on-camera AI helpers, educators, and companions.
Why This Research Matters
Real-time, two-way avatars can make online conversations feel more human by reacting to both your words and your expressions. This reduces the awkwardness of lag and the flatness of one-way lip-sync bots. In classrooms, an avatar teacher who nods and smiles at the right times keeps students engaged. In customer support, a friendly, responsive face can build trust and clarity. For accessibility, expressive avatars can translate tone and non-verbal cues in helpful ways. And because the method doesn't need new labeled data for expressiveness, it's practical to improve and deploy widely while still allowing for safety features like watermarks.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how a great conversation isn't just words, but nods, smiles, eye contact, and little reactions that show you're listening? That back-and-forth is what makes talking to a person feel alive.
🥬 Filling (The Actual Concept)
- What it is: Talking head generation animates a single photo so it can speak and move like a person, mostly focusing on lip-sync and simple head motion.
- How it works (before this paper):
- Take a portrait image.
- Feed in an audio track for what the avatar should say.
- Predict lip movements and some head motion that match the sound.
- Render frames that look like the person is talking.
- Why it matters: Without more than basic lip-sync, avatars feel like one-way radios; they don't nod, smile, or react to you in the moment, so conversations feel flat.
🍞 Bottom Bread (Anchor) Imagine FaceTime with a bot that talks at you like a video lecture. It moves its lips fine but doesn't make eye contact or nod when you emphasize a point. That's the old world.
🍞 Top Bread (Hook) You know how texting back late makes a chat feel awkward? Timing really matters.
🥬 Filling (Latency)
- What it is: Latency is the delay between your action and the avatar's reaction.
- How it works: If a model needs to see several seconds of future audio before it can move, it has to wait, causing visible lag.
- Why it matters: In a live chat, even a couple seconds feels like a hiccup; quick, causal (only past and present) processing is needed.
🍞 Bottom Bread (Anchor) If you nod and the avatar nods back three seconds later, it feels weird. Under half a second feels natural.
🍞 Top Bread (Hook) Imagine listening to a friend: you hear their voice, see their face, and feel their vibe, all at once.
🥬 Filling (Multimodal Signals)
- What it is: Multimodal means using more than one kind of input, like audio (what you say) and motion (how your face moves).
- How it works: Combine user audio, user facial motion, and avatar audio into one shared understanding.
- Why it matters: Only using audio misses silent signals (like smiles). Only using video misses tone and rhythm. Together, they make reactions natural.
🍞 Bottom Bread (Anchor) If you smile silently, a good listener smiles back. That's multimodal awareness.
🍞 Top Bread (Hook) You know how it's hard to say exactly what makes a reaction feel "expressive"? It's a feeling, not a simple score.
🥬 Filling (Expressiveness Challenge)
- What it is: Teaching a model to react warmly and dynamically is hard because there isn't one "correct" reaction and we don't have neat labels for good listening.
- How it works: Real data often shows stiff listeners; training copies that stiffness.
- Why it matters: Without a way to learn preferences (what looks better), avatars end up bland and robotic.
🍞 Bottom Bread (Anchor) Think of two listeners: one stares blankly; the other nods, smiles, and leans in a bit. Most people prefer the second, but labeling the exact "right" smile or nod is tough.
🍞 Top Bread (Hook) Picture a dance where both partners influence each other step by step.
🥬 Filling (The Gap)
- What it is: We lacked a system that both responds instantly (causal, low-latency) and learns to be expressive without more human labels.
- How it works: Past systems were either quick but one-way, or expressive but too slow because they looked into the future.
- Why it matters: Real conversation needs both timing and warmth.
🍞 Bottom Bread (Anchor) We want the avatar to go "mm-hmm" right after you smile and keep eye contact when you start talking, without a big pause.
🍞 Top Bread (Hook) Imagine a walkie-talkie chat that looks and feels like real life.
🥬 Filling (What this paper brings)
- What it is: Avatar Forcing is a framework that makes avatars react in real time to your voice and face, and it learns expressiveness without extra labels.
- How it works (big picture):
- Encode the user's audio and face motion plus the avatar's audio into one instruction.
- Generate the avatar's motion step by step using causal diffusion forcing.
- Speed it up with KV caching and smooth it with a look-ahead causal mask.
- Fine-tune expressiveness with preference learning (DPO) using synthetic "under-expressive" negatives.
- Why it matters: You get quick, natural, two-way interactions.
🍞 Bottom Bread (Anchor) In tests, people picked the new system over 80% of the time for overall quality, and it reacts in about 0.5 seconds, fast enough to feel human.
02 Core Idea
🍞 Top Bread (Hook) You know how a good friend mirrors your smile and nods while you talk, almost instantly? That mirroring builds trust.
🥬 Filling (Avatar Forcing)
- What it is: Avatar Forcing is a system that turns live user cues (your audio and facial motion) plus the avatar's audio into fast, natural head movements and expressions.
- How it works:
- A Dual Motion Encoder blends your signals and the avatar's audio into a single, unified condition.
- A causal diffusion-forcing motion generator predicts the next chunk of avatar motion using only past and present info.
- KV caching reuses recent computations for speed.
- A blockwise look-ahead causal mask keeps motions smooth across chunk boundaries without breaking causality.
- Direct Preference Optimization boosts expressiveness by preferring reactions that use user cues over under-expressive ones made without them.
- Why it matters: Without this, avatars either lag (waiting for future frames) or feel stiff. With it, they react quickly and warmly.
🍞 Bottom Bread (Anchor) When you grin, the avatar smiles right back and gives a tiny nod, in under half a second.
Three analogies for the same idea:
- Orchestra analogy: Your audio and face are the conductor; the avatar is the orchestra. The Dual Motion Encoder reads the baton and score, the diffusion model plays each bar on time, and DPO makes the music more expressive.
- Sports analogy: It's like a point guard reacting to a teammate's cut, with no waiting for a replay. KV caching is studying past plays to react faster next time; DPO is the coach encouraging bolder, better passes.
- Cooking analogy: The user's cues are ingredients, the encoder mixes them, diffusion cooks them step by step, the look-ahead mask ensures even baking between layers, and DPO tunes the flavor to what people like.
🍞 Top Bread (Hook) You know how reading a whole book before answering a single question would be slow?
🥬 Filling (Causal Diffusion Forcing)
- What it is: A way to generate motion step by step using only the past and present, not the future.
- How it works:
- Take the current noisy motion block.
- Use a learned vector field to nudge it toward a clean motion block.
- Repeat for a few steps (fast ODE solver) to get the next block.
- Move on without ever peeking at future frames.
- Why it matters: This keeps latency low, so reactions feel immediate.
🍞 Bottom Bread (Anchor) When you laugh, the very next block can brighten the avatar's face right away; no future audio needed.
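To make "only past and present" concrete, here is a minimal sketch of the outer block-by-block loop. It is not the paper's code: `generate_block` is a hypothetical stand-in for the actual sampler, and the tensor shapes are illustrative.

```python
import torch

def stream_motion(cond: torch.Tensor, block_size: int, generate_block):
    """cond: (T, C) fused per-frame conditions; returns (T, D) motion latents."""
    T = cond.shape[0]
    blocks = []
    for start in range(0, T, block_size):
        end = min(start + block_size, T)
        visible_cond = cond[:end]                  # only past + current conditions are visible
        past = torch.cat(blocks, dim=0) if blocks else None
        block = generate_block(visible_cond, past, end - start)
        blocks.append(block)                       # this block can be decoded and shown immediately
    return torch.cat(blocks, dim=0)
```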
🍞 Top Bread (Hook) Imagine you could improve your smile by comparing two photos, one lively and one dull, and always choosing the lively one.
🥬 Filling (Direct Preference Optimization, DPO)
- What it is: A method to make the model prefer expressive, reactive motions without human labels.
- How it works:
- Create a preferred sample from real expressive motion.
- Create a less-preferred sample by generating motion without user cues (under-expressive).
- Train the model to score the preferred higher than the less-preferred.
- Why it matters: It teaches "what feels better" directly, avoiding the need for a tricky reward model.
🍞 Bottom Bread (Anchor) Between "listener smiles back" and "listener stays blank," DPO learns to pick "smiles back."
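A rough sketch of how such a preference pair could be assembled. `model.sample` is a hypothetical interface over motion latents, and zeroing the user streams is just one way to "drop" the cues:

```python
import torch

def build_preference_pair(model, user_audio, user_motion, avatar_audio, gt_motion):
    # Preferred sample: the real, expressive reaction from the dataset.
    preferred = gt_motion
    # Less-preferred sample: generate with the user cues dropped (zeroed here),
    # which tends to produce stiff, under-expressive motion.
    with torch.no_grad():
        rejected = model.sample(
            user_audio=torch.zeros_like(user_audio),
            user_motion=torch.zeros_like(user_motion),
            avatar_audio=avatar_audio,
        )
    return preferred, rejected
```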
🍞 Top Bread (Hook) Think of a toolbox with a special drawer just for movements.
🥬 Filling (Motion Latent Space)
- What it is: A compact code that splits identity (who you are) from motion (how you move).
- How it works:
- Encode the face image into identity and motion latents.
- Keep identity fixed; generate only motion latents.
- Decode identity + motion into video frames.
- Why it matters: It's efficient and preserves who the avatar is while changing how it moves.
🍞 Bottom Bread (Anchor) Same avatar face, but now it can nod, blink, and smile differently depending on your cues.
🍞 Top Bread (Hook) Have you ever previewed just a tiny bit of the next scene to make a smoother cut?
🥬 Filling (Blockwise Look-ahead Causal Mask)
- What it is: An attention rule that stays causal but lets the model peek a few frames ahead within a block boundary to avoid jitter.
- How it works:
- Split frames into blocks.
- For each block, allow limited look-ahead attention (l frames) while keeping overall causality.
- Smooth transitions across blocks.
- Why it matters: Purely strict causal masks can create choppy motion; a tiny look-ahead reduces jitters.
🍞 Bottom Bread (Anchor) Like seeing one step of the dance ahead so the spin is smooth, not jerky.
🍞 Top Bread (Hook) Remembering what you just said helps me reply faster.
🥬 Filling (KV Caching)
- What it is: Storing recent key/value attention states so the next step reuses them instead of recomputing.
- How it works:
- Save attention keys/values for past frames and conditions.
- On the next block, reuse them for faster attention.
- Keep a rolling window to limit memory.
- Why it matters: Big speedups, enabling real-time interaction.
🍞 Bottom Bread (Anchor) Like keeping notes from the last sentence so you don't reread the whole page before answering.
03 Methodology
High-level recipe: Input (user audio + user motion + avatar audio) → Dual Motion Encoder (make one instruction) → Causal DFoT Motion Generator (predict next motion block with look-ahead mask and KV caching) → Decode with identity to frames → Real-time video output.
Step 1: Prepare the motion space 🍞 Hook: Think of separating a character's face (who) from how it moves (what it does). 🥬 Concept (Motion Latent Space)
- What it is: A compact code that splits identity z_S from motion m_S.
- How it works:
- Train an auto-encoder on videos so it learns: image → (identity + motion) and (identity + motion) → image.
- During generation, keep z_S fixed for the avatar's identity.
- Only generate motion latents m to animate.
- Why it matters: Lets us animate realistically without changing who the avatar looks like. 🍞 Anchor: Same face, but different nods, blinks, and smiles depending on the conversation.
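A minimal, illustrative sketch of this split, with toy linear layers standing in for the real auto-encoder; only the encode-once, reuse-identity pattern is the point here.

```python
import torch
import torch.nn as nn

class MotionAutoEncoder(nn.Module):
    """Toy stand-in for the motion latent auto-encoder: identity vs. motion."""
    def __init__(self, img_dim=3 * 64 * 64, id_dim=256, motion_dim=32):
        super().__init__()
        self.id_enc = nn.Linear(img_dim, id_dim)          # "who": identity latent z_S
        self.motion_enc = nn.Linear(img_dim, motion_dim)  # "how it moves": motion latent m
        self.dec = nn.Linear(id_dim + motion_dim, img_dim)

    def encode(self, img: torch.Tensor):
        x = img.flatten(1)
        return self.id_enc(x), self.motion_enc(x)

    def decode(self, z_id: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        return self.dec(torch.cat([z_id, m], dim=-1)).view(-1, 3, 64, 64)

# Encode the portrait once, keep z_S fixed, then generate only new motion latents.
ae = MotionAutoEncoder()
z_S, _ = ae.encode(torch.randn(1, 3, 64, 64))
frame = ae.decode(z_S, torch.randn(1, 32))   # same identity, different motion
```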
Step 2: Fuse the signals 🍞 Hook: Imagine a mixer that combines your voice, your face motion, and what the avatar will say. 🥬 Concept (Dual Motion Encoder)
- What it is: A module that blends user audio + user motion + avatar audio into one unified condition.
- How it works:
- Cross-attend user motion to user audio to align your non-verbal and verbal cues.
- Cross-attend that result with the avatar's audio to establish who speaks and when.
- Output a single condition vector per time step.
- Why it matters: Without this, the avatar might miss silent smiles or misread speaking turns. 🍞 Anchor: When you silently grin, the condition tells the avatar, "Mirror a friendly smile now."
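A sketch of the two cross-attention stages, assuming all three streams are already embedded to a shared feature size; the layer choices and dimensions are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DualMotionEncoder(nn.Module):
    """Toy two-stage fusion: (user motion x user audio), then (result x avatar audio)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.user_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.avatar_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, user_motion, user_audio, avatar_audio):
        # All inputs: (batch, frames, dim) per-frame features.
        user_cue, _ = self.user_attn(user_motion, user_audio, user_audio)   # align non-verbal and verbal cues
        fused, _ = self.avatar_attn(user_cue, avatar_audio, avatar_audio)   # add who-speaks-when context
        return fused                                                        # one condition vector per frame

enc = DualMotionEncoder()
cond = enc(torch.randn(1, 25, 256), torch.randn(1, 25, 256), torch.randn(1, 25, 256))
```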
Step 3: Generate motion, causally and fast 🍞 Hook: Replying in real time means never waiting for the future. 🥬 Concept (Causal Diffusion Forcing with DFoT)
- What it is: A step-by-step generator that cleans a noisy motion block into a clean one using only past/present info.
- How it works:
- Start from a noisy motion block (like a blurry guess).
- Use the learned vector field (DFoT transformer) to nudge it toward the true motion.
- Repeat for a few solver steps to finish the block.
- Move to the next block, never looking ahead in time.
- Why it matters: Keeps latency low and reactions timely. 🍞 Anchor: You laugh; the next block brightens the avatar's face right away.
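A sketch of denoising one block with a few Euler steps over a learned vector field. `vector_field` is a stand-in for the DFoT transformer; the real solver, step count, and schedule may differ.

```python
import torch

def denoise_block(vector_field, cond, past_kv, block_len, motion_dim, steps=4):
    x = torch.randn(block_len, motion_dim)        # start from pure noise (a blurry guess)
    ts = torch.linspace(0.0, 1.0, steps + 1)      # a few integration times
    for i in range(steps):
        t = ts[i].expand(block_len)
        v = vector_field(x, t, cond, past_kv)     # predicted direction toward clean motion
        x = x + (ts[i + 1] - ts[i]) * v           # one Euler step of the ODE
    return x                                      # clean motion block for this chunk
```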
Step 4: Keep transitions smooth 🍞 Hook: A tiny peek ahead can prevent a jerky cut. 🥬 Concept (Blockwise Look-ahead Causal Mask)
- What it is: Attention thatās still causal but looks a few frames ahead for smoother motion.
- How it works:
- Split time into blocks (e.g., 10 frames).
- Within attention, allow limited look-ahead (l frames) across block edges.
- Preserve overall causality while reducing jitter.
- Why it matters: Pure causal masks often cause frame-to-frame wobble. 🍞 Anchor: The avatar's head turn glides across blocks instead of stepping.
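One plausible construction of such a mask (the paper's exact rule may differ): every frame can attend to its own block and all earlier blocks, plus the first l frames past the block edge.

```python
import torch

def lookahead_block_mask(T: int, block: int, l: int) -> torch.Tensor:
    idx = torch.arange(T)
    q_blk = idx[:, None] // block            # block index of each query frame
    k = idx[None, :]                         # key frame indices
    block_end = (q_blk + 1) * block          # first frame after the query's block
    allowed = k < block_end + l              # own + earlier blocks, plus l frames of look-ahead
    return allowed                           # True = attention permitted

# Invert (~mask) if your attention API expects True to mean "masked out".
mask = lookahead_block_mask(T=30, block=10, l=2)
```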
Step 5: Speed it up with memory 🍞 Hook: Don't re-read the last paragraph to answer the next sentence. 🥬 Concept (KV Caching)
- What it is: Save recent attention keys/values for reuse.
- How it works:
- After generating a block, store its attention states (and condition states).
- Reuse them when generating the next block.
- Keep a rolling cache to fit memory.
- Why it matters: Gives the reported ~6.8× speedup and ~0.5s latency. 🍞 Anchor: The avatar reacts quickly in a live call because it isn't recomputing everything from scratch.
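A sketch of a rolling key/value cache for a single attention layer. This shows the generic pattern only; the paper's cache layout and window length are not reproduced here.

```python
import torch

class RollingKVCache:
    def __init__(self, max_frames: int):
        self.max_frames = max_frames
        self.k = None  # cached keys,   shape (T_cached, dim)
        self.v = None  # cached values, shape (T_cached, dim)

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=0)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=0)
        # Keep only a rolling window to bound memory.
        self.k, self.v = self.k[-self.max_frames:], self.v[-self.max_frames:]

    def get(self):
        return self.k, self.v

cache = RollingKVCache(max_frames=100)
cache.append(torch.randn(10, 256), torch.randn(10, 256))
past_k, past_v = cache.get()   # reused by the next block's attention instead of recomputing
```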
Step 6: Decode into frames 🍞 Hook: Add motion to identity to bring the avatar to life. 🥬 What it is: A decoder turns (z_S + m) into images.
- How it works:
- Combine the fixed identity latent z_S with the generated motion latent.
- Decode into B frames for the current block.
- Stream them as video.
- Why it matters: The identity stays consistent while motion changes naturally. 🍞 Anchor: You see the same person each frame, just with new expressions and movements.
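A small sketch of streaming one finished block, assuming a `decode(z_S, m)` function like the auto-encoder sketched in Step 1 (names are illustrative).

```python
import torch

def stream_block(decode, z_S: torch.Tensor, motion_block: torch.Tensor):
    """motion_block: (B, motion_dim). The identity latent z_S never changes."""
    for m in motion_block:
        frame = decode(z_S, m.unsqueeze(0))   # same person, new expression and pose
        yield frame                           # hand each frame to the live video stream as it is ready
```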
Step 7: Teach expressiveness without labels 🍞 Hook: Learn what people prefer by comparing lively vs. dull reactions. 🥬 Concept (Direct Preference Optimization, DPO)
- What it is: A training trick that pushes the model to prefer expressive, user-aware motion.
- How it works:
- Preferred sample: ground-truth expressive reaction.
- Less-preferred sample: a version generated without user cues (under-expressive).
- Optimize so the model assigns higher likelihood to preferred than to less-preferred.
- Why it matters: No extra human labels or a fragile reward model needed. 🍞 Anchor: Between "mirror the user's grin" vs. "stay blank," the model learns to pick "mirror the grin."
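A sketch of the DPO-style objective on one (preferred, less-preferred) pair. The log-likelihood terms are placeholders; how they are approximated for a diffusion/flow model is not shown, and `beta` is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_model_pref, logp_model_rej, logp_ref_pref, logp_ref_rej, beta=0.1):
    # Reward margins are log-probability ratios against a frozen reference model.
    pref_margin = logp_model_pref - logp_ref_pref
    rej_margin = logp_model_rej - logp_ref_rej
    # Push the preferred (expressive) sample above the under-expressive one.
    return -F.logsigmoid(beta * (pref_margin - rej_margin)).mean()

loss = dpo_loss(torch.tensor([-1.0]), torch.tensor([-3.0]),
                torch.tensor([-1.5]), torch.tensor([-2.5]))
```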
Secret sauce:
- The trio of the Dual Motion Encoder, causal DFoT with look-ahead masking, and DPO works together: the encoder understands, the DFoT reacts on time, and DPO adds heart.
04 Experiments & Results
🍞 Hook: If a new game loads faster, looks good, and feels more responsive than the old one, you notice right away.
🥬 The Tests (What they measured and why)
- Latency: How quickly the avatar reacts (critical for live chats).
- Reactiveness: How well the avatar's motion syncs with the user's motion (expressions and head pose).
- Motion Richness: How diverse and lively the avatar moves (variance and diversity index).
- Visual Quality: Image/video realism and identity consistency.
- Lip Sync: Accuracy of mouth movements with the avatar's audio. Why: Together, these tell us if the avatar is fast, natural, and believable.
🍞 Anchor: You want your AI tutor to nod when you do, stay on lip-sync, and look great, without lag.
🍞 Hook: It's only impressive if it beats tough opponents.
🥬 The Competition (Baselines)
- Interactive/dyadic models: INFP* (reproduced from paper).
- Talking head models: SadTalker, Hallo3, FLOAT, INFP*.
- Listening head models: RLHG, L2L, DIM, INFP*. Why: To show strength across talking, listening, and full back-and-forth conversation.
🍞 Anchor: It's like testing a new car against both city and highway competitors.
🍞 Hook: Scoreboards need context: what does a number feel like?
🥬 The Scoreboard (Highlights with context)
- Real-time: ~0.5s latency vs. INFP* ~3.4s (about 6.8× faster). That's like answering in a heartbeat instead of after a long pause.
- Human preference: Over 80% of participants preferred Avatar Forcing overall vs. INFP*. That's a landslide win.
- Reactiveness and Motion Richness: Lower rPCC errors (closer to ground truth user-avatar sync) and higher diversity metrics (SID, Var) than INFP*. This means livelier, better-aligned reactions.
- Visual Quality & Lip Sync: Competitive or better than strong talking-head baselines (e.g., best FID/FVD on HDTF while keeping solid lip-sync).
- Listening benchmarks (ViCo): Best or near-best in synchronization and diversity across expression and pose.
🍞 Anchor: Users described the avatar as more responsive and expressive, nodding and smiling at the right times, without losing lip-sync or image quality.
🍞 Hook: Surprises make science fun.
🥬 Surprising findings
- User motion is crucial: Removing it made the avatar stiff, especially during silent periods (missed smiles and nods).
- Tiny look-ahead pays off: A small peek smooths transitions without breaking causality.
- Preference tuning without labels works: DPO boosted reactiveness and motion richness notably, even though no human-labeled expressiveness scores were used.
🍞 Anchor: The version without user motion felt like a statue when you smiled; the DPO-tuned version smiled right back.
05 Discussion & Limitations
🍞 Hook: Every superpower has limits and care instructions.
🥬 Limitations
- Mostly head-focused: No hands or full-body gestures yet, which also matter in conversation.
- Control granularity: Doesn't expose fine dials for, say, exact gaze targets or emotion intensity per frame.
- Data realism: Listening data is often under-expressive, so the model still depends on DPO to overcome dataset stiffness.
- Edge cases: Very noisy audio/video inputs or extreme lighting/occlusions can reduce quality.
Resources needed
- A modern GPU (e.g., an H100 in the paper) for training/inference at the reported speed.
- Preprocessing: face tracking/cropping and audio feature extraction (e.g., Wav2Vec2.0).
- Motion latent auto-encoder and DFoT components.
When not to use
- If you need full-body performance with hands and posture today.
- If latency above ~0.5s is acceptable and you prefer a simpler pipeline (then a basic talking-head model might suffice).
- If you require strict, frame-by-frame scripted control rather than learned, human-like reactions.
Open questions
- How to include hands, torso, and eye-tracking for richer interactions without losing real-time speed?
- Can we learn personalized listening styles safely and ethically?
- How to ensure fairness and reduce bias across identities and cultures in expressions?
- Best watermarking and detection to mitigate deepfake misuse while enabling positive applications?
🍞 Anchor: Think of this as a strong head-and-face base that can grow into a full performer with hands, gaze, and style controls.
06 Conclusion & Future Work
Three-sentence summary: Avatar Forcing is a real-time, interactive head avatar system that reacts to your voice and facial cues in about half a second. It blends multimodal signals, generates motion causally with smoothing and caching for speed, and learns expressiveness via preference optimization without human labels. The result is an avatar that people prefer over 80% of the time for being more natural, responsive, and engaging.
Main achievement: Combining causal diffusion forcing, Dual Motion Encoding, blockwise look-ahead masking, and DPO into a single framework that finally makes two-way, expressive avatar interaction feel live and human.
Future directions: Add hands and body, finer controls for gaze and emotion, integrate eye-tracking and richer sensors, expand safety (watermarks, detection), and personalize styles responsibly. Also explore broader multilingual and multi-speaker scenarios with robust noise handling.
Why remember this: It marks the shift from one-way, lip-sync avatars to two-way, emotionally aware partners that can genuinely participate in a conversation, fast enough to feel real.
Practical Applications
- Live virtual presenters that respond instantly to audience reactions during webinars.
- Interactive tutors that nod, smile, and keep eye contact to boost student engagement.
- Customer service avatars that mirror user emotions to de-escalate tense conversations.
- Companion apps for language learning that react naturally to your pronunciation and expressions.
- On-device meeting assistants that acknowledge speakers and provide visual feedback in real time.
- Virtual interview practice tools that behave like attentive interviewers with realistic timing.
- Healthcare check-in kiosks with avatars that show empathy and clarify instructions.
- Content creators' digital doubles that can improvise reactions during live streams.
- Aids for neurodivergent users that reflect social cues more clearly in video chats.
- Gaming NPC faces that react to player speech and facial expressions on the fly.