This paper introduces Nexus Adapters, tiny helper networks that let a diffusion model follow both a text prompt and a structure map (like edges or depth) at the same time.
This paper teaches talking avatars not just to speak, but to look around their scene and handle nearby objects exactly as a text instruction says.
ShapeR builds clean, correctly sized 3D objects from messy, casual phone or glasses videos by using images, camera poses, sparse SLAM points, and short text captions together.
LiveTalk turns slow, many-step video diffusion into a fast, 4-step, real-time system for talking avatars that listen, think, and respond with synchronized video.