This paper teaches talking avatars not just to speak, but to look around their scene and handle nearby objects exactly as a text instruction says.
ShapeR builds clean, correctly sized 3D objects from messy, casual phone or glasses videos by using images, camera poses, sparse SLAM points, and short text captions together.
LiveTalk turns slow, many-step video diffusion into a fast, 4-step, real-time system for talking avatars that listen, think, and respond with synchronized video.