Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Intermediate
Haoyu Zhang, Zhipeng Li, Yiwen Guo et al. · 2/6/2026
arXiv

Key Summary

  • Ex-Omni is a new open-source AI system that can understand text or speech and then talk back while moving a 3D face in sync with the voice.
  • It solves a big mismatch: language models think in slow, chunky words, but faces move fast and smoothly; Ex-Omni splits these jobs so each part does what it’s best at.
  • The system uses discrete speech units as a steady beat (temporal scaffolding) that guides both speech and facial motion frame by frame.
  • A special controller called Token-as-Query Gated Fusion (TQGF) decides how much meaning from the language model should flow into speech and face at each moment.
  • Faces are animated with ARKit-52 blendshape numbers, which is a simple, widely used way to drive lips, jaw, and other facial parts.
  • The team built a training set called InstructEx that mixes real and high-quality synthetic pairs of speech and face motion so the model learns to talk and move together.
  • On multiple tests, Ex-Omni beats or matches setups that glue separate speech and face models together, and people prefer its lip-sync in head-to-head comparisons.
  • It also does reasonably well on speech understanding and text-to-speech tasks, despite using far less data than many competitors.
  • An ablation study shows that both velocity smoothing and the TQGF controller are important for stable, expressive mouth motion.
  • Ex-Omni isn’t perfectly real-time yet and focuses mostly on mouth motion (not full emotion), but it’s a big step toward natural AI conversations with animated faces.

Why This Research Matters

Better lip-sync makes AI conversations feel natural and trustworthy, the way a friendly teacher or helper looks you in the eye and speaks clearly. Virtual agents in customer support, education, and healthcare can become easier to understand because their mouth movements match their words. Games and digital entertainment gain more life-like characters without massive manual animation. People who rely on assistive technology can read lips along with audio, improving accessibility in noisy places. And by training with a unified system, we reduce the janky handoffs between separate tools, opening the door to smoother, safer, real-time interactions.

Detailed Explanation

01 Background & Problem Definition

🍞 Top Bread (Hook): You know how talking with a friend on video feels more natural than just reading a text, because you hear their voice and also see their lips and expressions move in time? That mix of sound and face motion makes communication click.

🥬 Filling (The Actual Concept): What it is: Omni-modal large language models (OLLMs) are big AI systems that try to understand and generate many kinds of signals—like text, speech, and images—in one place. How it works: 1) They read inputs (text, sound, pictures), 2) think about what it means, 3) then output answers in one or more formats. Why it matters: Without one unified brain, you need separate tools that might not stay in sync, like sound and face drifting apart.

🍞 Bottom Bread (Anchor): Imagine asking your smart assistant a question by voice. It not only answers with a voice, but a 3D avatar moves its lips in sync. That’s the dream OLLMs aim to deliver.

🍞 Top Bread (Hook): Imagine a puppet show where the voice comes from backstage and the puppet’s mouth moves to match. If the timing is off, it feels weird right away.

🥬 Filling (The Actual Concept): What it is: 3D facial animation means moving a 3D face (jaw, lips, cheeks, etc.) over time so it looks like it’s really talking. How it works: 1) We describe face poses with numbers (like how open the mouth is), 2) we change those numbers over time, 3) we render the 3D head to see smooth motion. Why it matters: If the lips don’t match the sound, people notice immediately; natural lip-sync builds trust and clarity.

🍞 Bottom Bread (Anchor): Think of a cartoon character whose mouth opens wide on “ah” sounds and narrows on “ee” sounds—good animation follows the audio closely.

The World Before: LLMs became great at text, and then expanded to audio and images. But open-source omni models mostly spoke words (text or speech) or showed pictures—they didn’t drive a 3D face in sync with the speech. That left a gap for natural, face-to-face style interaction.

The Problem: There’s a core mismatch. LLMs reason in slow, discrete tokens (words/subwords). Faces move fast and smoothly (dozens of frames per second). If you try to jump straight from word-y thoughts to frame-by-frame mouth motion, the model struggles; it’s like asking a novelist to conduct a symphony beat-by-beat without a metronome.

Failed Attempts: A simple idea is “attach a face decoder to the LLM” and let it figure out lip motion from the LLM’s hidden states. But those hidden states are built for meaning, not micro-timing. The face decoder then must guess fine timing from coarse signals, which usually needs lots of data and a very big decoder—and still wobbles.

The Gap: What was missing was a steady, time-aligned guide between meaning and motion—a rhythmic scaffold that both speech and face could follow together—plus a smart valve to control when and how the LLM’s meaning flows into timing-sensitive generators.

Real Stakes: This matters for virtual teachers, customer support avatars, game characters, and assistive tools. When the mouth matches the words, people feel more comfortable, understand better, and trust the system more. It turns clunky voice responses into friendly, face-to-face conversations.

🍞 Top Bread (Hook): Imagine learning a dance. It’s much easier if there’s a steady beat to step to, and a coach who tells you when to add style.

🥬 Filling (The Actual Concept): What it is: Ex-Omni introduces two key helpers—discrete speech units that act like the beat, and a Token-as-Query Gated Fusion (TQGF) controller that adds meaning at just the right times. How it works: 1) The LLM thinks about what to say, 2) a speech unit generator creates a time-aligned sequence (the beat), 3) TQGF gently injects the LLM’s meaning where needed, 4) the face decoder uses the same beat to move lips in sync. Why it matters: With a beat and a coach, the system doesn’t have to guess timing; it can focus on speaking clearly and moving naturally.

🍞 Bottom Bread (Anchor): Like karaoke subtitles that highlight each word exactly when to sing, Ex-Omni uses speech units as timing marks so the mouth can follow along perfectly.

02 Core Idea

🍞 Top Bread (Hook): You know how building a bridge needs strong pillars and a clear pathway so cars don’t wobble or fall? AI talking faces need that same structure for timing and meaning.

🥬 Filling (The Actual Concept): What it is: The key insight is to split thinking (semantics) from moving-in-time (temporal generation) and connect them with a steady guide: discrete speech units plus a smart gate (TQGF). How it works: 1) Let the LLM focus on understanding and planning the response, 2) generate a time-aligned stream of speech units that act as a metronome, 3) use TQGF to control how much meaning reaches speech and face generators at each step, 4) produce speech audio and 3D face motion together. Why it matters: Without this split and control, the model tries to do brainwork and choreography at once and gets clumsy.

🍞 Bottom Bread (Anchor): It’s like a chef (LLM) writes the recipe, a baker (speech unit generator) sets the oven timer ticks, and a decorator (face decoder) follows those ticks to pipe frosting in rhythm—no mess, delicious results.

Multiple Analogies:

  • Music analogy: The LLM writes the lyrics; the speech units are the beat; TQGF is the sound engineer mixing vocals and instruments; the face decoder is the performer moving to the beat.
  • Sports analogy: The LLM is the coach planning plays; speech units are the whistle ticks; TQGF is the referee controlling flow; the face decoder is the player moving on cue.
  • Classroom analogy: The LLM is the teacher explaining; speech units are the bell schedule; TQGF is the hall monitor guiding when to move; the face decoder is students changing classes smoothly.

Before vs After:

  • Before: Directly mapping LLM features to facial frames was like turning chapter summaries into a frame-by-frame dance—timing was shaky and needed tons of data.
  • After: Ex-Omni adds a rhythmic backbone (speech units) and a regulator (TQGF), so speech and face lock together with less data and more stability.

Why It Works (intuition, no equations):

  • Discrete speech units compress audio into bite-sized, time-aligned pieces. This removes the guesswork about “when” things happen.
  • TQGF says, “Queries from the target timeline decide how much context to take.” That keeps timing in charge and meaning supportive.
  • Non-autoregressive face prediction means every frame can be computed together from the guided features, making motion smooth and avoiding error snowballs.
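The gating intuition above can be sketched in a few lines. This is a minimal numpy illustration of the idea, not the paper's architecture: single-head attention, the matrix shapes, and the sigmoid gate form are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(queries, context, Wq, Wk, Wv, Wg):
    """Token-as-query gated fusion, sketched: the time-aligned target
    sequence (T steps) queries the semantic states (S steps), and a
    sigmoid gate decides how much retrieved meaning to mix in per step."""
    Q, K, V = queries @ Wq, context @ Wk, context @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (T, S) attention weights
    pulled = attn @ V                                # meaning gathered per tick
    gate = 1.0 / (1.0 + np.exp(-(queries @ Wg)))     # learned valve in (0, 1)
    return queries + gate * pulled                   # timing stays in charge
```

Because the timeline provides the queries and the gate scales the injection, the output keeps the target sequence's length and rhythm no matter how long the semantic context is.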

Building Blocks (introduced with Sandwich explanations):

  1. 🍞 Top Bread (Hook): Imagine a Swiss Army knife with tools for words, sounds, and pictures. 🥬 The Concept: OLLMs: one model to understand and generate across text, speech, and vision. How it works: combine encoders/decoders and a central LLM brain to route and reason. Why it matters: keeps everything consistent and synchronized. 🍞 Anchor: Ask it by voice, get a voice reply, and see a face move—all in one system.

  2. 🍞 Top Bread (Hook): Think of a mouth diagram with sliders for jaw open, lip smile, and cheek puff. 🥬 The Concept: ARKit-52 blendshape coefficients are 52 sliders that control parts of a 3D face. How it works: each number changes one muscle-like shape; a list of numbers per frame makes motion. Why it matters: it’s a simple, standard, identity-agnostic control panel. 🍞 Anchor: Setting jawOpen high and lipPucker low makes a wide “ah.”

  3. 🍞 Top Bread (Hook): Like cutting a song into notes to learn the melody. 🥬 The Concept: Discrete speech units are tiny tokens that represent short chunks of sound over time. How it works: an encoder-tokenizer turns audio into unit tokens; a generator predicts them; a decoder turns them back into waveform. Why it matters: they give a solid timing grid both speech and face can follow. 🍞 Anchor: Twelve unit tokens per second are like twelve ticks per second on a metronome.

  4. 🍞 Top Bread (Hook): Picture a faucet you can open a little or a lot. 🥬 The Concept: TQGF (Token-as-Query Gated Fusion) lets the time-aligned target tokens ask for just the right amount of meaning from the LLM. How it works: the target timeline is the query; LLM features are context; a learned gate decides how much to mix at each moment. Why it matters: prevents flooding the timing stream with too much high-level info, keeping motion stable. 🍞 Anchor: When saying “really,” the gate may open a bit more to emphasize mouth movement.

  5. 🍞 Top Bread (Hook): Like practicing with a coach and also with a high-quality video you copy from. 🥬 The Concept: InstructEx Dataset mixes ASR, TTS, S2S, T2T, and synthetic speech-to-face pairs (from a strong teacher model). How it works: staged training teaches alignment, speech generation, speech–face pairing, and then joint skills. Why it matters: real face capture is scarce; high-quality synthetic annotations fill the gap safely and consistently. 🍞 Anchor: 10k TTS samples had their facial motions labeled by a strong Audio2Face-3D model to guide learning.
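The blendshape "sliders" in building block 2 are easy to picture in code. This toy loop animates two ARKit-style coefficients over one second; the coefficient names are standard ARKit identifiers, but the values and the 4-syllables-per-second rhythm are purely illustrative.

```python
import numpy as np

FPS = 25          # face frame rate used throughout this explainer
frames = []
for t in range(FPS):                          # one second of motion
    phase = np.sin(2 * np.pi * 4 * t / FPS)   # ~4 "syllables" per second
    frames.append({
        "jawOpen":     max(0.0, 0.6 * phase),   # jaw opens on each beat
        "mouthPucker": max(0.0, -0.3 * phase),  # lips purse between beats
    })
# In the real system, each frame would be a full 52-value vector; the face
# decoder's job is to output one such vector per frame, in sync with audio.
```

Each value stays in [0, 1], which is what makes the representation identity-agnostic: the same slider settings can drive any rigged head.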

03 Methodology

At a high level: Input (text and/or speech) → unify into one representation → LLM thinks up the reply → speech unit generator produces a time-aligned unit stream → TQGF injects meaning where needed → speech decoder makes audio and face decoder makes ARKit-52 curves → output synchronized voice plus 3D facial animation.

Step-by-step, with Sandwich explanations for each new piece:

  1. 🍞 Hook: Think of getting everyone into the same room before the meeting starts. 🥬 What it is: Unified speech–text representation maps both text and speech into the LLM’s token space. How it works: a frozen speech encoder (like Whisper-Large-V3) turns audio into features; a small projector reshapes them to look like LLM tokens; text tokens use the LLM’s regular embedding. Concatenate them so the LLM reads them together. Why it matters: if speech and text live in different rooms, the LLM can’t compare or combine them easily. 🍞 Anchor: A spoken question plus a written hint go in side-by-side so the LLM can reason over both.

  2. 🍞 Hook: Like having the planner write the script before actors start performing. 🥬 What it is: LLM-centered reasoning keeps the LLM focused on the answer’s content, not on micromovements. How it works: the LLM reads the unified input, generates the reply text, and we keep its hidden states (its “thoughts”) for guidance. Why it matters: separating planning from performing makes each job easier. 🍞 Anchor: The LLM decides to answer, “It’s going pretty good—thanks! How about you?”

  3. 🍞 Hook: Use a metronome before you dance. 🥬 What it is: Speech unit generation predicts a token stream that marks time precisely. How it works: embed the reply tokens, then fuse in LLM meaning with TQGF so the speech generator knows style and emphasis; autoregressively predict unit tokens over time; later, a frozen decoder turns units into waveform. Why it matters: the unit sequence becomes the tempo line that both audio and face can follow together. 🍞 Anchor: About 12 unit ticks per second create a steady timeline the mouth can match.

  4. 🍞 Hook: Follow the beat to move your lips. 🥬 What it is: Non-autoregressive face generation predicts all facial frames in parallel, guided by resampled unit embeddings and contextual signals. How it works: 1) turn unit tokens into embeddings and resample them to the video frame rate (e.g., 25 fps), 2) use TQGF so the per-frame queries pull just the needed context, 3) refine with a light transformer and a periodic positional encoding that favors rhythmic patterns, 4) output 52 blendshape values per frame. Why it matters: parallel prediction gives smooth, stable motion and avoids error chains. 🍞 Anchor: The jaw opens wider exactly when long “ah” units appear, across the whole sentence.

  5. 🍞 Hook: Like learning to drive in stages—first steering, then parking, then highway. 🥬 What it is: Four-stage training strategy. How it works: Stage I aligns speech to LLM space with ASR (freeze most parts). Stage II teaches speech unit autoregression on TTS. Stage III adds paired face labels (from a high-quality teacher) so speech and face learn together. Stage IV fine-tunes everything jointly on a mix (ASR, TTS, S2S, T2T) to keep reasoning strong. Why it matters: staged learning keeps the model stable and prevents forgetting. 🍞 Anchor: First make sure speech features match language space, then learn to speak, then learn to move the face with speech, then polish all skills together.

  6. 🍞 Hook: Smooth roads make for comfy rides. 🥬 What it is: Losses for stability and accuracy. How it works: text and unit tokens use standard next-token training; face uses a per-frame error plus a velocity consistency term that discourages sudden jumps. Why it matters: without velocity smoothing, lips can jitter or snap. 🍞 Anchor: The model gets small penalties when mouth speed changes too sharply, encouraging natural transitions.

  7. 🍞 Hook: A conductor raises and lowers the baton to control volume and timing. 🥬 What it is: TQGF’s secret sauce. How it works: the target timeline (tokens for speech; frames for face) is the query; LLM or generator states are context; a learned gate opens or closes how much semantic meaning to mix at each step. Why it matters: it prevents overwhelming the timing stream and balances languages and styles. 🍞 Anchor: On strong words like “really,” the gate might open more, nudging a wider mouth opening.

  8. 🍞 Hook: Practice makes perfect—especially with good examples. 🥬 What it is: InstructEx dataset curation. How it works: build large ASR and TTS corpora with balanced English–Chinese; synthesize high-quality face labels using a strong Audio2Face-3D teacher; add S2S and T2T so the LLM keeps its brainy skills. Why it matters: real 3D face capture is scarce; synthetic annotations provide scalable, consistent training. 🍞 Anchor: 59k dialogue samples get paired with synthetic facial motion so the model learns open-domain talking.
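Step 4's resampling of unit embeddings onto the video frame rate can be sketched with simple linear interpolation. The rates (12 units/s, 25 fps) come from this explainer's examples; the interpolation scheme and function name are assumptions, not the paper's exact method.

```python
import numpy as np

def resample_units(unit_emb, unit_rate=12, fps=25):
    """Linearly interpolate a (T_units, d) sequence of speech-unit
    embeddings onto the face frame rate, so every video frame gets a
    time-aligned timing feature to condition on."""
    T, d = unit_emb.shape
    duration = T / unit_rate                  # seconds of speech covered
    n_frames = int(round(duration * fps))     # face frames to produce
    src_t = np.arange(T) / unit_rate          # timestamp of each unit
    tgt_t = np.arange(n_frames) / fps         # timestamp of each frame
    return np.stack(
        [np.interp(tgt_t, src_t, unit_emb[:, j]) for j in range(d)],
        axis=1,
    )
```

For example, 24 unit tokens at 12 units/s cover two seconds of speech and resample to 50 face frames at 25 fps, so the metronome and the mouth share one clock.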

Example with data:

  • Input: Text “How’s it going?”
  • LLM: Plans reply “It’s going pretty good. Thanks. How about you?”
  • Units: Generates a stream of unit tokens (like tick-tick-tick) aligned to the words.
  • Face: Resamples unit embeddings to 25 fps; predicts jaw/lip shapes for each frame; mouth opens on “go-ing” and closes on pauses.

Secret Sauce Summary:

  • Discrete units give reliable timing.
  • TQGF keeps timing in charge and meaning supportive.
  • Non-autoregressive face prediction ensures smooth, global consistency.
  • Staged training stabilizes learning with limited real paired data.
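The velocity-consistency idea from the loss description (step 6 above) can be sketched as a per-frame error plus a penalty on frame-to-frame speed mismatch. This is a common recipe in facial animation; the exact weighting and loss form used in the paper are not specified here, so `lam` is a placeholder.

```python
import numpy as np

def face_loss(pred, target, lam=0.5):
    """Reconstruction + velocity consistency for (frames, 52) blendshape
    sequences. The velocity term discourages jittery or snapping motion."""
    recon = np.mean((pred - target) ** 2)          # per-frame error
    v_pred = np.diff(pred, axis=0)                 # frame-to-frame motion
    v_tgt = np.diff(target, axis=0)
    vel = np.mean((v_pred - v_tgt) ** 2)           # mismatch in speed
    return recon + lam * vel
```

A sequence that matches every frame but jitters between frames still pays a penalty through the velocity term, which is exactly the "smooth roads" intuition.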

04 Experiments & Results

🍞 Hook: You don’t judge a dancer only by copying one video—you also ask people which performance felt in-sync and natural.

🥬 The Test: What it is: The team tested speech-to-face (S2F) and text-to-face (T2F) lip-sync quality, speech understanding (S2T), and text-to-speech (TTS). How it works: For faces, they compared to a strong reference model (Audio2Face-3D) using Lip Vertex Error (LVE)—lower is better, like a smaller distance between predicted and reference lip positions. For speech understanding they used VoiceBench; for TTS they measured error rates (WER/CER) by transcribing the generated audio. Why it matters: We need both automatic scores and human judgments to trust the results.

🍞 Anchor: If one system scores 3.7 and another scores 7.3 in LVE, that’s like being twice as close to the reference lip motion on average.
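One common way LVE is computed in the talking-head literature is the worst-offending lip vertex per frame, averaged over frames; a sketch under that assumption (the paper's exact vertex set and reduction may differ):

```python
import numpy as np

def lip_vertex_error(pred_verts, ref_verts):
    """Per frame, take the largest L2 distance among lip vertices, then
    average over frames. Inputs: (frames, n_lip_vertices, 3) arrays."""
    dists = np.linalg.norm(pred_verts - ref_verts, axis=-1)  # (frames, verts)
    return dists.max(axis=1).mean()   # worst lip vertex per frame, averaged
```

Lower is better: a score of 3.7 versus 7.3 means the predicted lips track the reference roughly twice as closely on average.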

The Competition: Ex-Omni was tested two ways: natively (one model that does speech+face together) and as a cascaded pipeline (first generate speech, then feed it into separate face models like EmoTalk or UniTalker). It was compared against other open omni models (like Qwen2.5-Omni) in similar cascaded setups.

Scoreboard Highlights:

  • Native Ex-Omni achieved lower LVE than cascaded pipelines on several benchmarks (CommonEval, Ex-A2F-EN, A2F-Bench). In simple terms, it matched the reference lips more closely.
  • In human A/B tests, people preferred Ex-Omni’s lip–speech sync 55%–80% of the time versus cascaded baselines, with few ties. This is like winning by a clear margin, not just a photo finish.
  • For speech-to-text on VoiceBench, proprietary giants still lead (they use far more data), but Ex-Omni was competitive among open models, ranking second on SD-QA and doing reasonably on safety checks.
  • For TTS, Ex-Omni isn’t a specialized speech synthesizer, but still produced decent WER/CER—good enough for an omni model whose main goal is unity and synchronization.

Surprising Findings and Context:

  • Cascaded systems with different omni backbones but the same face model scored similarly. That means in two-stage pipelines, the downstream face model dominates the quality. In contrast, Ex-Omni’s native approach avoids information loss between stages, helping it pull ahead.
  • On an English set (Ex-A2F-EN), Ex-Omni sometimes made longer replies. Longer speech means longer, harder face sequences, nudging error up a bit—but still competitive.
  • Latency: Component-wise it responds fast (first speech unit in about 0.03s; extra face latency ~0.012s), but full end-to-end speed wasn’t yet real-time with the tested hardware and an 8B LLM backbone.

Ablations (what changes when parts are removed):

  • No velocity smoothing: lip motion got a bit jerkier (LVE worsened), showing the smoothing term helps stability.
  • Replacing speech-generator context with raw LLM states hurt face quality, confirming that generator-level features better match timing needs.
  • Removing all context performed worse than keeping the carefully gated context—context matters, but it must be injected gently.
  • Removing TQGF and using simple self-attention slightly helped some English scores but hurt Chinese and used more compute, suggesting TQGF balances languages and efficiency.

Human Study Reliability:

  • People strongly and consistently preferred Ex-Omni’s lip-sync; inter-rater agreement was solid, so it wasn’t just random taste.

Bottom line:

  • Native, unified generation beats stitch-together pipelines for synchronized talking faces.
  • With much less data than big proprietary models, Ex-Omni still delivers competitive speech understanding and respectable TTS, proving its data efficiency.
  • The special combo—speech units as timing and TQGF as controller—really does what it promised: better alignment and stability.

05 Discussion & Limitations

🍞 Hook: Even great dancers have parts to improve—maybe footwork is strong but facial expressions need work.

🥬 Limitations: What it is: Ex-Omni mainly masters mouth motion and timing; broader facial expressions (eyebrows, cheeks, emotions) aren’t deeply modeled yet. How it works today: ARKit-52 controls are there, but the training targets and losses focus on speech–lip synchronization more than full-body expressiveness. Why it matters: For storytelling or empathetic agents, richer emotion and head/eye dynamics matter. 🍞 Anchor: Think of a singer who nails lip-sync but doesn’t raise eyebrows or smile on happy lines.

Required Resources:

  • A reasonably large LLM backbone (e.g., Qwen3-8B) plus a speech generator and decoders; 8× high-memory GPUs were used for training.
  • A curated dataset with ASR, TTS, S2S/T2T tasks, and high-quality synthetic face labels (Audio2Face-3D as teacher).

When NOT to Use:

  • Tight real-time constraints on modest hardware; current full pipeline latency may be too high.
  • Scenarios demanding rich, controllable emotions or speaker timbre beyond what current training emphasizes.
  • Applications requiring strict ground-truth face replication rather than high-quality proxy alignment (since evaluation uses a strong proxy model).

Open Questions:

  • How to make it fully real-time on edge devices without shrinking reasoning skills?
  • How to expand from mouth motion to expressive, controllable emotion (valence/arousal), head pose, and eye gaze?
  • How to reduce reliance on synthetic face labels while maintaining breadth (e.g., semi-supervised or distillation from multiple teachers)?
  • How to guarantee long-form speech–text consistency so speech never truncates before the text ends?

Overall Assessment:

  • Ex-Omni takes a careful, engineering-smart route: split thinking from timing; give timing a scaffold; add a gentle controller. It doesn’t do everything yet, but it nails the core challenge of lip-sync within an omni model—a strong foundation for the rest.

06 Conclusion & Future Work

Three-Sentence Summary: Ex-Omni is an open-source omni model that talks and moves a 3D face in sync by separating high-level language reasoning from fine-grained timing. It uses discrete speech units as a time scaffold and a gated fusion controller (TQGF) so meaning flows into speech and face only when helpful. This design delivers smoother lip-sync than cascaded pipelines and holds its own on speech understanding and TTS with limited data.

Main Achievement: Showing that native, unified generation of speech and 3D facial animation inside an omni model is both feasible and superior to stitching separate systems together—thanks to the combo of temporal scaffolding and gated semantic injection.

Future Directions: Add emotion and richer facial cues (brows, eyes, head), strengthen speaker control, improve long-form consistency, and optimize for real-time on wider hardware. More diverse supervision (multi-teacher, semi-supervised) could reduce bias and increase realism.

Why Remember This: It’s the blueprint for making AI feel more human in conversation—clear voice plus matching lips—by letting the brain plan, the beat keep time, and the face dance to it. The simple but powerful idea—decouple semantics from timing and connect them with a smart gate—will likely guide future multimodal systems well beyond talking faces.

Practical Applications

  • Virtual customer service avatars that speak and lip-sync during help sessions.
  • Educational tutors that explain concepts with clear speech and synchronized facial motion.
  • Game NPCs that deliver dialogue more believably without hand-animating mouths.
  • Digital presenters for news, product demos, or training videos with natural lip-sync.
  • Speech therapy tools that show accurate mouth shapes to guide pronunciation practice.
  • Accessible communication aids that pair audio with readable lip movements.
  • Social robots and embodied agents that feel more present and engaging.
  • Live-streaming assistants that narrate and emote with matching facial cues.
  • Language-learning companions that demonstrate mouth shapes for new sounds.
#omni-modal LLM  #3D facial animation  #lip-sync  #ARKit-52 blendshapes  #discrete speech units  #temporal scaffolding  #gated fusion  #TQGF  #speech-to-face  #text-to-speech  #speech-to-text  #non-autoregressive generation  #multimodal generation  #InstructEx dataset