
Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Beginner
Youliang Zhang, Zhengguang Zhou, Zhentao Yu et al. · 2/2/2026
arXiv · PDF

Key Summary

  • This paper teaches talking avatars not just to speak, but to look around their scene and handle nearby objects exactly as a text instruction says.
  • It introduces InteractAvatar, a two-track system: one track plans the motions with scene awareness, and the other makes high-quality video synced to the audio.
  • A special Perception and Interaction Module (PIM) first understands the reference image and plans human-and-object motions that match the text.
  • An Audio-Interaction-aware Module (AIM) turns that plan into a realistic, lip-synced video while keeping the person and background consistent.
  • A Motion-to-Video (M2V) aligner keeps the video perfectly in step with the planned motion so the avatar’s hands and objects line up exactly.
  • By separating planning from video rendering, the method avoids the usual trade-off between control (following instructions) and visual quality.
  • The authors built a new test called GroundedInter with 600 cases to measure how well methods do grounded human-object interactions.
  • Compared with strong baselines, the method greatly improves hand and object interaction scores while keeping lip sync and identity quality strong.
  • It also supports flexible inputs: any mix of text, audio, and motion can drive the same model.
  • A key current limit is single-person only; multi-person interactions are future work.

Why This Research Matters

Digital humans become far more useful when they can actually use the objects you see, not just talk about them. This system turns simple text and voice commands into precise, realistic actions grounded in your own scene. That can power better shopping demos, clearer how-to guides, more engaging education, and richer accessibility tools. It also reduces the need for complex pose inputs, so regular users can control avatars directly with words and speech. By separating planning from rendering, it delivers both accuracy and beauty at once. A dedicated benchmark proves these gains, pushing the whole field toward practical, scene-aware assistants.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine asking a virtual character, “Please pick up the red apple on the table and then say, ‘Want a bite?’” You’d expect it to grab the real apple you can see and talk at the same time, right? Most systems today can talk, but they often ignore the apple sitting right there.

🥬 The Concept — Grounded Human-Object Interaction (GHOI): What it is: GHOI means an avatar understands its actual environment (the reference image), finds the right objects, and interacts with them while following your text instruction. How it works: 1) Look at the scene to spot people and objects; 2) Read the text to know what action to do; 3) Plan a motion that reaches the correct object in the given scene; 4) Render a video where the person actually performs that motion and speaks. Why it matters: Without grounding, the avatar might act in a made-up scene or miss the real object, breaking the illusion that it’s “there.” 🍞 Anchor: If you say, “Lift the cup on your left and say, ‘Cheers!’” GHOI makes the avatar grab the real cup in the picture on its left and speak.

🍞 Hook: You know how it’s hard to both draw a beautiful picture and also perfectly trace a complex maze at the same time? Doing both at once makes each task worse.

🥬 The Concept — The Control-Quality Dilemma: What it is: When making HOI videos, models struggle to both follow precise instructions (control) and stay high-quality (beauty). How it works: If you push the model to obey instructions tightly, visuals can get messy; if you push for pretty visuals, it may ignore instructions. Why it matters: Without solving this, you either get fuzzy hands and objects or you get beautiful but disobedient videos. 🍞 Anchor: It’s like trying to paint a gorgeous portrait while someone keeps moving the canvas—you’ll either lose the likeness or the neatness.

🍞 Hook: Picture giving a friend directions: first you plan the route, then you walk it. If you mash both together, you’ll probably get lost.

🥬 The Concept — Decoupling Planning from Rendering: What it is: The paper’s key idea is to split the job into two parts: 1) plan a motion that fits the scene and text; 2) render a high-quality video that follows the plan and the audio. How it works: A planning brain (PIM) understands the scene and writes a motion plan; a rendering engine (AIM) speaks and moves the avatar beautifully; a connector (M2V) keeps them in sync. Why it matters: This avoids the control-quality dilemma by letting each part specialize and then harmonize. 🍞 Anchor: First decide how to pick up the apple (reach, grasp, lift), then perform it on camera while talking—no confusion.

🍞 Hook: Think about how we got here. First, talking heads nailed lip sync. Then half-body and full-body animations added simple motions. But touching and using real objects in the actual scene? That’s the next leap.

🥬 The Concept — The World Before: What it is: Older systems did great lips and faces, or simple body moves, or created whole new scenes—but they didn’t truly interact with real objects in your given image. How it works: Audio-driven methods map voice to pixels but don’t model objects; pose-driven ones need you to supply exact skeletons; subject-consistent ones rebuild new scenes instead of using your scene. Why it matters: Without grounded interaction, digital humans can’t perform practical, context-aware tasks. 🍞 Anchor: If you say “Raise the camera to your eyes and take a photo,” older methods might keep the mouth moving but never lift the camera in the given scene.

🍞 Hook: Suppose you show one photo and one sentence, and you expect a whole mini-movie that makes sense. That’s tough!

🥬 The Concept — Scene-Action Grounding: What it is: The model must read the scene layout and attach the action to the correct object at the correct place and time. How it works: 1) Detect objects and their positions; 2) Interpret the action words in text; 3) Plan steps to reach, grasp, and move; 4) Keep consistent with the initial image. Why it matters: Without grounding, the hand might reach into empty space or grab the wrong thing. 🍞 Anchor: With “Pick up the cup and drink,” the hand actually goes to the visible cup, not to a made-up spot.

🍞 Hook: Why should anyone care? Because helpful avatars need to use the real stuff around them—like showing a product, demonstrating a tool, or acting out instructions.

🥬 The Concept — Real Stakes: What it is: Grounded, controllable digital humans can teach, assist, and demonstrate in your real context. How it works: They follow text and voice, notice your environment, and act accordingly. Why it matters: This bridges the gap between chatty avatars and practical helpers. 🍞 Anchor: A shopping assistant avatar can literally pick up the sneaker in your photo and point out its features while speaking.

02Core Idea

🍞 Hook: Imagine a two-person team: one plans the play, the other performs it on stage. Each is great at their own job, and a stage manager keeps them in sync.

🥬 The Concept — InteractAvatar’s Dual-Stream Design: What it is: A two-track system where PIM plans text-and-scene-aligned motions and AIM renders a high-fidelity, lip-synced video, kept in step by M2V. How it works: 1) PIM perceives the reference image and plans motion of body and objects; 2) AIM renders video from text, audio, and motion hints; 3) M2V aligns layers so what’s planned is what you see. Why it matters: Splitting thinking (planning) and drawing (rendering) defeats the control-quality dilemma. 🍞 Anchor: For “Pick up the apple and show it,” PIM decides where the apple is and how to grasp; AIM shows realistic hands and face while speaking; M2V keeps the hand and apple perfectly aligned.

🍞 Hook: You know how you first look around a room before deciding how to move? That’s what the planning brain does.

🥬 The Concept — Perception and Interaction Module (PIM): What it is: A planner that understands the scene and generates an interaction motion: human skeleton poses plus object box paths. How it works: 1) Read the reference image to spot objects; 2) Read the text; 3) Create a timeline of poses and object interactions; 4) Train with both detection-like and motion tasks to be scene-aware. Why it matters: Without a good plan, hands won’t meet objects, and actions won’t match text. 🍞 Anchor: “Take off the glasses”: PIM figures out where the glasses are and plans the two-hand motion sequence to lift and hold them.
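To make the plan's format concrete, here is a minimal Python sketch of the kind of motion PIM outputs: per-frame skeleton keypoints plus bounding boxes for the objects being handled. The class names and the assumed frame rate are illustrative, not the authors' actual code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical container for the kind of plan PIM produces:
# per-frame 2D skeleton keypoints plus a bounding box for each tracked object.
@dataclass
class MotionFrame:
    keypoints: List[Tuple[float, float]]                     # (x, y) joints in pixel coordinates
    object_boxes: List[Tuple[float, float, float, float]]    # (x1, y1, x2, y2) per object

@dataclass
class MotionPlan:
    frames: List[MotionFrame]                                # one entry per planned video frame

def plan_length_seconds(plan: MotionPlan, fps: float = 25.0) -> float:
    """Duration implied by the plan; the fps value is an assumption, not from the paper."""
    return len(plan.frames) / fps
```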

🍞 Hook: Now imagine a performer who must look great, talk clearly, and move exactly on cue.

🥬 The Concept — Audio-Interaction-aware Module (AIM): What it is: A renderer that makes the final video, matching lips to audio while following the PIM’s motion plan. How it works: 1) Ingest audio features (Wav2Vec) for lip sync; 2) Take motion features from PIM; 3) Render crisp video frames; 4) Use a face mask to focus audio effects mainly on the mouth area early in training. Why it matters: Without AIM, you’d have a plan but no realistic performance. 🍞 Anchor: When the line is “Do you want to take a bite?”, AIM makes the lips match the sound and the head and arms move naturally.

🍞 Hook: Think of a dance coach whispering beat-by-beat tips into the performer’s ear while on stage.

🥬 The Concept — Motion-to-Video (M2V) Aligner: What it is: A layer-by-layer bridge that injects motion cues into the video generator so motion and pixels match everywhere. How it works: 1) Compute residuals (the change layer-to-layer) from PIM; 2) Upsample and gently project them; 3) Add them to matching layers in AIM; 4) Zero-initialize to start stable and avoid sketchy artifacts. Why it matters: Without M2V, hands can drift from objects or motion plans can get ignored. 🍞 Anchor: M2V keeps the avatar’s fingers precisely on the cup handle as the cup is lifted.

🍞 Hook: Picture a super-talented art robot that learns to add details step by step until a full video appears.

🥬 The Concept — Diffusion Transformer (DiT): What it is: A powerful model that gradually denoises from fuzz to crisp video while using attention to follow conditions like text, audio, and motion. How it works: 1) Start with a noisy latent; 2) At each step, use learned patterns to remove noise; 3) Attend to conditions to guide the result; 4) Repeat until clean. Why it matters: DiT is the engine behind both PIM and AIM, making planning and rendering strong and consistent. 🍞 Anchor: It’s like sharpening a blurry sketch into a clear clip while listening to instructions and music.
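For intuition, here is a toy PyTorch sketch of the denoise-step-by-step loop described above. The `denoiser` stand-in, the step count, and the simple update rule are illustrative assumptions; a real Diffusion Transformer is a large attention model conditioned on text, audio, and motion.

```python
import torch

# Toy sketch of diffusion-style sampling: start from noise and repeatedly refine
# the latent while (in a real model) attending to the conditioning signals.
def sample(denoiser, cond, shape=(1, 4, 16, 32, 32), steps=30, device="cpu"):
    x = torch.randn(shape, device=device)                              # 1) start from pure noise
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i / steps, device=device)    # current "noisiness" level
        velocity = denoiser(x, t, cond)                                 # 2-3) predict an update given conditions
        x = x + velocity / steps                                        # 4) take a small step toward clean data
    return x                                                            # clean latent (decoded to video by a VAE)

# Minimal stand-in "model" so the sketch runs end to end.
dummy_denoiser = lambda x, t, cond: -x          # pretend velocity that shrinks the latent toward zero
latent = sample(dummy_denoiser, cond=None)
```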

Before vs After:

  • Before: Models either made pretty videos that ignored instructions or obeyed text but looked clumsy and didn’t respect the given scene.
  • After: InteractAvatar plans with perception (PIM), renders with fidelity and audio sync (AIM), and locks them together (M2V), so videos both obey and look great.

Why It Works (intuition):

  • Specialization: Let planners plan and renderers render; each gets simpler.
  • Alignment: Feed the plan into rendering at every layer, so the plan can’t be lost.
  • Curriculum: Train perception first, then audio sync, then everything together, so signals don’t fight.

Building Blocks:

  • Scene-aware motion (PIM)
  • Audio-synced rendering (AIM)
  • Residual alignment (M2V)
  • Shared DiT backbone
  • Unified training (flow matching) that treats detection and generation consistently

03Methodology

High-level recipe: Input (reference image + text + optional audio/motion) → PIM (plan motions with scene grounding) → M2V aligns PIM-to-AIM features → AIM (render lip-synced, high-quality video) → Output (grounded HOI video)
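As a hedged illustration of that recipe, the sketch below stubs out the three stages as plain Python functions. The function names are placeholders for PIM, M2V, and AIM rather than the authors' API; the point is only the order of operations and the optional externally supplied motion.

```python
# Hypothetical pipeline sketch; each stage is a stub so the control flow runs end to end.
def plan_motion(reference_image, text):
    """PIM stand-in: ground the instruction in the scene and return a motion plan."""
    return {"poses": [], "object_boxes": [], "text": text}

def align_residuals(plan):
    """M2V stand-in: turn planner features into layer-wise hints for the renderer."""
    return {"layer_hints": plan}

def render_video(reference_image, text, audio, hints):
    """AIM stand-in: render a lip-synced video that follows the plan and the audio."""
    return {"frames": [], "followed": hints, "audio": audio}

def generate_interaction_video(reference_image, text, audio=None, motion=None):
    # External driving: if the user supplies a motion sequence, skip planning from text.
    plan = motion if motion is not None else plan_motion(reference_image, text)
    hints = align_residuals(plan)
    return render_video(reference_image, text, audio, hints)

video = generate_interaction_video(reference_image=None,
                                   text="Pick up the cup and say 'Cheers!'")
```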

🍞 Hook: Imagine making a short movie: you storyboard first, then you film, and a script supervisor checks every scene matches the plan.

🥬 The Concept — Motion as a Bridge: What it is: Motion is defined as a sequence of human skeleton keypoints plus object bounding boxes, rendered as compact RGB motion maps. How it works: 1) Convert joints and boxes into image-like frames; 2) Keep short side at 256 px for efficiency; 3) Use as intermediate guidance for the video. Why it matters: A visual motion bridge aligns naturally with video generators, improving control and generalization. 🍞 Anchor: The motion map shows hands reaching the mug and the mug’s box moving upward—easy for the video model to follow.
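Below is a minimal sketch, under assumed drawing conventions, of how joints and boxes could be rasterized into an RGB motion map: green dots for joints and red outlines for object boxes. The colors, marker sizes, and fixed resolution are illustrative, not the paper's exact rendering.

```python
import numpy as np

# Draw keypoints and object boxes into an image-like motion frame
# (short side 256 px, as described in the text).
def render_motion_map(keypoints, boxes, height=256, width=448):
    frame = np.zeros((height, width, 3), dtype=np.uint8)
    for x, y in keypoints:                               # joints in pixel coordinates
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < width and 0 <= yi < height:
            frame[max(yi - 2, 0):yi + 3, max(xi - 2, 0):xi + 3] = (0, 255, 0)   # green joint dot
    for x1, y1, x2, y2 in boxes:                         # object boxes as red outlines
        x1, y1, x2, y2 = map(int, (x1, y1, x2, y2))
        frame[y1:y2, x1] = (255, 0, 0)                   # left edge
        frame[y1:y2, x2 - 1] = (255, 0, 0)               # right edge
        frame[y1, x1:x2] = (255, 0, 0)                   # top edge
        frame[y2 - 1, x1:x2] = (255, 0, 0)               # bottom edge
    return frame

motion_frame = render_motion_map(keypoints=[(100, 120), (130, 140)],
                                 boxes=[(200, 90, 260, 150)])
```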

Step A — PIM: Perception and Motion Planning

  • What happens: PIM reads the reference image and the instruction, then outputs a motion sequence (poses + object boxes). It’s trained with a clever mix: continuation (finish the motion given the first frame) and perception-as-generation (recreate even the first frame from just the image + text), sometimes reducing to a detection-like single-frame task.
  • Why it exists: Without this, the model guesses motion blindly and might miss objects or misalign hands.
  • Example: Text: “Pick up the cup and drink water.” PIM: detects cup, plans reach–grasp–lift–tilt.

🍞 Hook: You know how we sometimes learn by both spotting objects and acting them out? That’s what the training does here.

🥬 The Concept — Environment Perception Training: What it is: A training curriculum that alternates between detection-like perception tasks and full motion generation. How it works: 1) Sometimes hide the first motion frame to force perception from the image; 2) Sometimes predict only the first frame like detection; 3) Sometimes continue a motion sequence. Why it matters: This builds strong scene understanding and smooth motion planning. 🍞 Anchor: The model practices “find the apple” and “act out grab-the-apple,” making it good at both.
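A hedged sketch of that training mix: each example is randomly turned into a detection-like task, a full generation task, or a continuation task. The sampling weights below are made up for illustration, not the paper's reported values.

```python
import random

def make_training_example(image, text, motion_frames):
    mode = random.choices(
        ["detect_first_frame", "generate_from_scratch", "continue_motion"],
        weights=[0.2, 0.4, 0.4])[0]
    if mode == "detect_first_frame":
        # Detection-like: predict only the first motion frame from image + text.
        return {"cond": (image, text), "target": motion_frames[:1]}
    if mode == "generate_from_scratch":
        # Perception-as-generation: recreate the whole sequence, first frame included.
        return {"cond": (image, text), "target": motion_frames}
    # Continuation: the first frame is given, the rest must be completed.
    return {"cond": (image, text, motion_frames[:1]), "target": motion_frames[1:]}
```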

🍞 Hook: Think of sticky notes placed at exact layers saying, “Don’t forget: the hand goes here now!”

🥬 The Concept — M2V Residual Injection: What it is: A layer-wise method that sends PIM’s changes (residuals) into AIM so the rendered video tracks the plan at every depth. How it works: 1) Compute PIM residuals per layer; 2) Upsample to video size; 3) Project via a zero-initialized linear layer; 4) Add into the matching AIM layer. Why it matters: Prevents drift and keeps control stable without harming image quality. 🍞 Anchor: Each layer reminder keeps the fingertips locked on the cup handle during the whole lift.
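The PyTorch sketch below shows the mechanical idea: compute the layer-to-layer residual of planner features, resize it to the renderer's spatial grid, pass it through a zero-initialized projection, and add it to the matching renderer layer. The tensor layouts and the bilinear upsampling choice are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class M2VInjector(nn.Module):
    def __init__(self, motion_dim: int, video_dim: int):
        super().__init__()
        self.proj = nn.Linear(motion_dim, video_dim)
        nn.init.zeros_(self.proj.weight)     # zero-init: injection starts as a no-op,
        nn.init.zeros_(self.proj.bias)       # so early training cannot hurt rendering quality

    def forward(self, video_feat, motion_feat_prev, motion_feat_curr):
        residual = motion_feat_curr - motion_feat_prev           # change from one planner layer to the next
        B, T, h, w, C = residual.shape                           # (batch, time, height, width, channels)
        H, W = video_feat.shape[2], video_feat.shape[3]
        r = residual.permute(0, 1, 4, 2, 3).reshape(B * T, C, h, w)
        r = F.interpolate(r, size=(H, W), mode="bilinear", align_corners=False)   # upsample to video size
        r = r.reshape(B, T, C, H, W).permute(0, 1, 3, 4, 2)
        return video_feat + self.proj(r)                         # inject into the matching renderer layer

# Tiny smoke test with made-up shapes.
inj = M2VInjector(motion_dim=8, video_dim=16)
video = torch.randn(1, 4, 32, 32, 16)
m_prev, m_curr = torch.randn(1, 4, 16, 16, 8), torch.randn(1, 4, 16, 16, 8)
out = inj(video, m_prev, m_curr)     # same shape as `video`
```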

Step B — AIM: Audio-Interaction-aware Rendering

  • What happens: AIM takes text, the reference image, audio features, and the aligned motion hints from PIM; it produces realistic frames with correct lips and body-object contact.
  • Why it exists: Without AIM, you can’t get photorealism, stable lips, or smooth motion.
  • Example: The avatar smiles, says “Cheers!”, and raises the cup naturally.

🍞 Hook: When you talk, your mouth moves smoothly across sounds, not letter by letter.

🥬 The Concept — Audio Injection with Face Mask: What it is: Inject audio features (from Wav2Vec) through cross-attention and focus their effect mainly on the mouth and face early in training. How it works: 1) Extract audio with a context window; 2) Cross-attend to visual features; 3) Multiply by a face mask so lips get the strongest influence; 4) Later, the model learns broader coordination. Why it matters: Keeps training stable and makes lip sync clean and precise. 🍞 Anchor: On “bite,” the lips close at the right moment, while the hand lifts the apple.
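Here is a simplified, single-head sketch of that idea: visual tokens cross-attend to Wav2Vec-style audio features, and the update is scaled by a face mask so the mouth region is influenced most. The shapes and the bare-bones attention are assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class MaskedAudioCrossAttention(nn.Module):
    def __init__(self, dim: int, audio_dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(audio_dim, dim)
        self.to_v = nn.Linear(audio_dim, dim)

    def forward(self, visual_tokens, audio_tokens, face_mask):
        # visual_tokens: (B, N, dim); audio_tokens: (B, M, audio_dim); face_mask: (B, N, 1) in [0, 1]
        q, k, v = self.to_q(visual_tokens), self.to_k(audio_tokens), self.to_v(audio_tokens)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        audio_update = attn @ v
        return visual_tokens + face_mask * audio_update   # the masked (mouth) tokens feel the audio most

layer = MaskedAudioCrossAttention(dim=32, audio_dim=24)
vis = torch.randn(1, 64, 32)                       # e.g. an 8x8 grid of visual tokens
aud = torch.randn(1, 10, 24)                       # audio features for a short context window
mask = torch.zeros(1, 64, 1); mask[:, :8] = 1.0    # pretend the first 8 tokens cover the mouth
out = layer(vis, aud, mask)
```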

🍞 Hook: Learning tough stuff is easier when you start with the gentler skill first.

🥬 The Concept — Multimodal Curriculum (Three Stages): What it is: A training schedule that builds skills in the right order. How it works: 1) Pre-train PIM for scene-aware planning (detection + continuation + generation from image+text); 2) Pre-train AIM for audio-driven lip sync and image-to-video quality; 3) Jointly fine-tune both so motion and video co-evolve. Why it matters: If you add strong motion signals too early, they can drown out audio; this order prevents that. 🍞 Anchor: First learn to see and plan, then to speak clearly, then to act and speak together.
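One hedged way to picture the schedule is as a three-entry config; which modules are updated and which objectives apply at each stage are assumptions for illustration.

```python
# Illustrative three-stage schedule, not the authors' actual training configuration.
TRAINING_STAGES = [
    {"name": "pim_pretrain",   "train": ["PIM"],                "objectives": ["detection", "continuation", "generation"]},
    {"name": "aim_pretrain",   "train": ["AIM"],                "objectives": ["lip_sync", "image_to_video"]},
    {"name": "joint_finetune", "train": ["PIM", "AIM", "M2V"],  "objectives": ["grounded_hoi_video"]},
]

for stage in TRAINING_STAGES:
    print(f"Stage {stage['name']}: update {stage['train']} on {stage['objectives']}")
```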

🍞 Hook: Sometimes you want to drive the dance exactly, not just let the dancer improvise.

🥬 The Concept — External Driving: What it is: The system can also take a clean motion sequence as a direct driver, not just generate one from text. How it works: 1) Feed the skeleton/HOI motion into PIM at diffusion step t=0 (encoder mode); 2) Extract residuals; 3) Inject via M2V; 4) Render in AIM. Why it matters: Lets users precisely control motions when needed. 🍞 Anchor: If you already have a recorded pose track, the avatar can exactly follow it while still talking.

🍞 Hook: Many pieces speak a common language so they can work together.

🥬 The Concept — Shared Backbones and Flow Matching: What it is: Both PIM and AIM use the same style of DiT backbone and train with flow matching, a unified way to learn from noisy-to-clean latents. How it works: 1) Encode visuals with a VAE into latents; 2) Condition with text (T5) and audio (Wav2Vec); 3) Learn a vector field to steer from noise to data; 4) Apply to both perception and video. Why it matters: A common language (latents, steps, attention) keeps planning and rendering compatible. 🍞 Anchor: Like having both the planner and the painter use the same sketchbook and markers.
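For intuition, here is a minimal flow-matching training step: sample a point on a straight path between noise and a clean latent, then regress the model's predicted velocity toward (data minus noise). The linear path is a common choice and an assumption here, not necessarily the paper's exact formulation.

```python
import torch

def flow_matching_loss(model, clean_latent, cond):
    noise = torch.randn_like(clean_latent)
    t = torch.rand(clean_latent.shape[0], *([1] * (clean_latent.dim() - 1)),
                   device=clean_latent.device)          # per-sample time in [0, 1]
    x_t = (1 - t) * noise + t * clean_latent            # point along the noise-to-data path
    target_velocity = clean_latent - noise              # direction of that path
    pred_velocity = model(x_t, t.flatten(), cond)
    return torch.mean((pred_velocity - target_velocity) ** 2)

# Smoke test with a throwaway stand-in "model".
toy_model = lambda x, t, cond: torch.zeros_like(x)
loss = flow_matching_loss(toy_model, clean_latent=torch.randn(2, 4, 8, 8), cond=None)
```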

Practical notes:

  • Reference images are given a special temporal position (like a frame at time -1) using a RoPE tweak, so they guide the whole sequence without confusing time order (a minimal sketch of this follows the list below).
  • Motion resolution is smaller for speed; video is larger for detail; M2V bridges the gap.
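To make the "frame at time -1" note concrete, here is a small sketch that assigns temporal positions for rotary embeddings: reference-image tokens sit at -1, generated frames at 0, 1, 2, and so on. The flat 1-D positions and the concatenation order are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def temporal_positions(num_ref_tokens: int, num_frames: int, tokens_per_frame: int):
    ref_pos = torch.full((num_ref_tokens,), -1.0)                  # reference image lives at t = -1
    frame_pos = torch.arange(num_frames, dtype=torch.float32)      # generated frames at t = 0, 1, 2, ...
    video_pos = frame_pos.repeat_interleave(tokens_per_frame)
    return torch.cat([ref_pos, video_pos])                         # positions fed into the RoPE computation

pos = temporal_positions(num_ref_tokens=64, num_frames=4, tokens_per_frame=64)
```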

04Experiments & Results

🍞 Hook: If you invent a new kind of test, you need a fair game to play and a scoreboard everyone understands.

🥬 The Concept — GroundedInter Benchmark: What it is: A 600-case test set with reference images, text instructions, and speech audio designed to judge grounded human-object interaction. How it works: 1) Images contain 1–3 real or plausible objects; 2) Text describes actions and timing; 3) Audio matches the dialogue; 4) Extra labels (masks, detections, keypoints) help measure quality. Why it matters: Without a focused benchmark, it’s hard to prove true grounded interaction. 🍞 Anchor: “Pick up the apple and show it” appears with a scene image and a voice line; metrics check if the hand actually touches the apple and lips sync.

The Test: What did they measure and why?

  • VLM-QA: A vision-language model answers structured questions (about objects, human, and interaction) to judge grounded correctness.
  • HQ (Hand Quality): Combines hand motion dynamics with sharpness to reflect usable, expressive hands.
  • OQ (Object Quality): Checks object motion and visual consistency over time.
  • PI (Pixel-level Interaction): Verifies contact between hands and target objects (a toy version of this check is sketched after this list).
  • Plus standards: CLIP for text-video match, DINO for appearance consistency, Sync_conf for lip sync, Temp-C for temporal smoothness, DD for how dynamic the motion is.
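As a hedged illustration of what a contact check can look like, the toy function below counts a frame as an interaction when the hand box overlaps the target object box. The paper's actual PI metric works at the pixel level; this box-overlap version is only for intuition.

```python
def boxes_touch(hand_box, object_box):
    """True if the two (x1, y1, x2, y2) boxes overlap."""
    hx1, hy1, hx2, hy2 = hand_box
    ox1, oy1, ox2, oy2 = object_box
    overlap_w = min(hx2, ox2) - max(hx1, ox1)
    overlap_h = min(hy2, oy2) - max(hy1, oy1)
    return overlap_w > 0 and overlap_h > 0

def interaction_rate(hand_boxes, object_boxes):
    """Fraction of frames in which the hand and the target object are in contact."""
    hits = sum(boxes_touch(h, o) for h, o in zip(hand_boxes, object_boxes))
    return hits / max(len(hand_boxes), 1)

rate = interaction_rate(hand_boxes=[(100, 100, 140, 140), (120, 110, 160, 150)],
                        object_boxes=[(150, 100, 200, 160), (150, 100, 200, 160)])
```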

🍞 Hook: It’s no fun winning if you only play easy teams; they compared against many strong players.

🥬 The Concept — Baselines: What it is: Strong systems from related areas—audio-driven talking avatars, subject-consistent video, and pose-driven methods. How it works: Each baseline tries the same tasks with its usual strengths: great lip sync, strong identity, or precise pose following. Why it matters: Beating them in interaction while staying good at identity and lips shows broad strength. 🍞 Anchor: They compare to Wan-S2V, Hunyuan-Avatar, OmniAvatar, FantasyTalking, HuMo, UniAnimate-DiT, and VACE.

The Scoreboard (with context):

  • Interaction wins: Compared to a strong audio-driven baseline (Wan-S2V), InteractAvatar’s audio-driven mode boosts Hand Quality by about 180% and Object Quality by about 111%, like going from average practice to top-of-class performance while keeping the same stage and costume.
  • Grounding: VLM-QA and PI scores are higher, meaning the model actually touches and uses the right objects in the given scene.
  • Still looks great: DINO_subject and DINO_ref stay competitive, so identity and background are preserved. Lip sync (Sync_conf) is strong—even leading in the HOI scenario.
  • Pose-driven comparison: Even when fed good poses, the pose baseline lags in object realism; InteractAvatar’s joint planning+rendering handles object shapes and hand-object contact better.
  • Subject-consistent comparison: Those models often make a fresh scene. InteractAvatar keeps the original environment while interacting smoothly.

Surprises and insights:

  • Training order matters: Pre-training audio sync before adding strong motion signals preserves lip quality.
  • Layer-wise residual injection beats last-layer or simple addition, improving stability and avoiding artifacts.
  • Using RGB-style motion maps (instead of raw (x,y) coordinates) better matches video priors and improves generalization and alignment.
  • Real-scene subset: Gains hold up on real images, not just synthetic ones.

🍞 Hook: Imagine judges scoring a dance: they check footwork (hands), partner hold (object), timing (sync), storytelling (text alignment), and costume consistency (identity/background). InteractAvatar scores high across the board.

05Discussion & Limitations

Limitations:

  • Single-person only: The current system can’t yet handle multi-person interactions like two people handing an object to each other.
  • Complex occlusions: Extremely cluttered scenes or tiny, partially hidden objects can still trip up perception and hand-object alignment.
  • Very long sequences: Long, multi-step tasks may require stronger long-horizon planning.

Required resources:

  • A modern GPU setup for training and inference, since diffusion transformers and video VAEs are compute-intensive.
  • Labeled or pseudo-labeled data for detection, pose, and object tracks during training.
  • Clean audio for best lip sync and a clear reference image for stable grounding.

When not to use:

  • If you need spontaneous, no-scene, purely stylized fantasy videos (brand-new worlds), subject-consistent text-to-video may be simpler.
  • If you require extremely precise 3D manipulation with exact forces and physics (e.g., robot-grade control), this method isn’t a physics simulator.
  • If you must coordinate multiple people tightly, this version won’t suffice yet.

Open questions:

  • Multi-person GHOI: How to plan and coordinate several agents’ motions and contacts simultaneously?
  • 3D awareness: Can explicit 3D scene and hand-object geometry further improve grasp fidelity and contact realism?
  • Longer tasks: How to scale planning to multi-action scripts with memory and subgoals?
  • Robustness: How to handle tiny objects, heavy occlusions, and rapid camera motions?
  • Safety and ethics: How to watermark outputs and prevent misuse of realistic avatars?

06Conclusion & Future Work

Three-sentence summary:

  • InteractAvatar makes talking avatars that truly interact with real objects in your given scene by splitting planning (PIM) from rendering (AIM) and tightly aligning them (M2V).
  • This decoupling solves the usual control-versus-quality trade-off, delivering accurate, grounded actions with strong lip sync and visual fidelity.
  • A new benchmark, GroundedInter, shows large gains over strong baselines in hand and object interaction quality while preserving identity and sync.

Main achievement:

  • The paper’s top contribution is the dual-stream architecture with layer-wise motion-to-video alignment that enables reliable, text-driven, grounded human-object interactions without demanding complex user-supplied poses.

Future directions:

  • Extend to multi-person interactions, richer 3D reasoning, longer multi-step tasks, and stronger robustness to clutter and occlusion.
  • Explore better hand-grasp modeling, tactile cues, and physics priors for more dexterous manipulation.

Why remember this:

  • It’s a blueprint for practical digital humans: look, understand, plan, and act in your scene while speaking—controlled by simple text and voice. By letting a planner plan and a renderer render, the system finally makes grounded interaction both faithful and beautiful.

Practical Applications

  • E-commerce demos: An avatar picks up and showcases the exact product in a user-supplied photo while describing features.
  • Education: Science or art lessons where the avatar manipulates visible tools or models on screen while explaining steps.
  • Customer support: Guided troubleshooting with an avatar that points to and operates visible device parts.
  • How-to tutorials: Cooking, crafting, or DIY videos where the avatar grabs the right utensil and demonstrates actions.
  • Accessibility: Voice-driven assistants that can act on objects in a scene for people with motor difficulties.
  • Virtual presenters: News or sports highlights where the presenter gestures and interacts with on-screen props.
  • Training simulations: Safety or repair procedures where the avatar operates tools in a realistic setting.
  • Marketing and social media: Engaging product teases with natural hand-object interactions and synchronized speech.
  • Content creation: Storytelling or skits where characters handle scene objects exactly as the script says.
  • Telepresence: Personalized avatars that reference and manipulate items in shared scene images during calls.
#grounded human-object interaction#talking avatars#diffusion transformer#motion-to-video alignment#audio-driven animation#perception and planning#scene grounding#lip synchronization#flow matching#variational autoencoder#text-to-video#multimodal conditioning#pose and object trajectories#residual feature injection#benchmark evaluation