This paper teaches talking avatars not just to speak, but also to look around their scene and handle nearby objects exactly as a text instruction describes.
The paper shows how to make an AI reason faster and better by planning in a hidden (latent) space instead of writing out long, step-by-step sentences.
LTX-2 is an open-source model that makes video and sound together from a text prompt, so the picture and audio match in time and meaning.
The paper turns one flat picture into a neat stack of see‑through layers, so you can edit one thing without messing up the rest.
Saber is a new way to make videos that match a text description while keeping the look of people or objects from reference photos, without needing special triplet datasets.