This paper shows how to make text-to-video models create clearer, steadier, and more on-topic videos without using any human-labeled ratings.
This paper trains talking avatars not just to speak, but also to look around their scene and interact with nearby objects exactly as a text instruction describes.
CoDance is a new method for animating multiple characters in a single image using just one pose video, even when the image and the video are not perfectly aligned.
Video diffusion models create great videos but are slow, because generating each clip requires hundreds of small denoising (clean-up) steps.
APOLLO is a single unified model that can generate video and audio together or separately while keeping them tightly in sync.
LTX-2 is an open-source model that makes video and sound together from a text prompt, so the picture and audio match in time and meaning.
SVBench is the first benchmark that checks whether video generation models can show realistic social behavior, not just pretty pictures.
This paper is about making image and video generators follow the words you type more reliably.
This paper introduces BiCo, a one-shot method for combining visual concepts from images and videos by tightly binding each concept to the exact words in a prompt.