OmniSIFT is a new way to shrink (compress) audio and video tokens so omni-modal language models can think faster without forgetting important details.
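To make "shrinking tokens" concrete, here is a minimal generic sketch of token compression by average-pooling neighboring video tokens before they reach the language model. It illustrates the general idea only, not OmniSIFT's actual method; the function name, pooling window, and dimensions are assumptions.

```python
# Generic illustration of token compression (NOT OmniSIFT's method):
# average-pool groups of adjacent modality tokens so the language model
# attends over a much shorter sequence.
import torch

def compress_tokens(tokens: torch.Tensor, window: int = 4) -> torch.Tensor:
    """tokens: (seq_len, dim) audio/video tokens -> (seq_len // window, dim)."""
    seq_len, dim = tokens.shape
    usable = (seq_len // window) * window          # drop any ragged tail
    grouped = tokens[:usable].reshape(-1, window, dim)
    return grouped.mean(dim=1)                     # one pooled token per group

video_tokens = torch.randn(1024, 768)              # e.g. 1024 patch tokens
compressed = compress_tokens(video_tokens, window=4)
print(compressed.shape)                            # torch.Size([256, 768]) -> 4x fewer tokens
```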
This paper introduces 3DiMo, a new way to control how people move in generated videos while still letting text flexibly control the camera.
This paper shows how to make text-to-video models create clearer, steadier, and more on-topic videos without using any human-labeled ratings.
Motion 3-to-4 turns a single regular video into a moving 3D object over time (a 4D asset) by first getting the object’s shape and then figuring out how every part moves.
This paper builds a pair of models that work together, Qwen3-VL-Embedding and Qwen3-VL-Reranker, which understand text, images, visual documents, and videos in one shared space so search works across all of them.
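As a rough illustration of how one shared embedding space lets a single text query search across modalities, the sketch below ranks mixed candidates by cosine similarity. The embed_text/embed_image functions are hypothetical placeholders, not the real Qwen3-VL-Embedding API.

```python
# Generic sketch of retrieval in a shared embedding space (the embed_* functions
# below are hypothetical placeholders, not the real Qwen3-VL-Embedding API).
import numpy as np

def embed_text(text: str) -> np.ndarray:
    # Placeholder: a real multimodal embedder would return a learned vector.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(512)

def embed_image(image_path: str) -> np.ndarray:
    # Placeholder: maps into the same 512-d space as the text embedder.
    rng = np.random.default_rng(abs(hash(image_path)) % (2**32))
    return rng.standard_normal(512)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank mixed-modality candidates against one text query.
query_vec = embed_text("a dog catching a frisbee")
candidates = {
    "photo_001.jpg": embed_image("photo_001.jpg"),
    "slide_deck_page3.png": embed_image("slide_deck_page3.png"),
    "caption: a dog leaps for a frisbee": embed_text("a dog leaps for a frisbee"),
}
ranked = sorted(candidates.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
for name, vec in ranked:
    print(f"{cosine(query_vec, vec):+.3f}  {name}")
```

In a pipeline like this, the embedding model typically handles fast first-stage retrieval, and a reranker such as Qwen3-VL-Reranker would then rescore the top hits together with the query for a finer ordering.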
LTX-2 is an open-source model that makes video and sound together from a text prompt, so the picture and audio match in time and meaning.
The paper teaches a video model to squeeze long video history into a tiny memory while still keeping sharp details in single frames.
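For intuition about squeezing long history into a tiny memory, here is a generic sketch of a fixed-size recurrent state that absorbs each frame's features. This is an illustration of the general principle only, not this paper's architecture; the dimensions, projection W, and update rule are assumptions.

```python
# Generic sketch of a fixed-size video memory (illustration of the idea only):
# each new frame's features are folded into a constant-size state, so memory
# cost does not grow with video length.
import torch

state_dim, frame_dim = 256, 768
W = torch.randn(state_dim, state_dim + frame_dim) * 0.01   # assumed learned projection

def update_memory(memory: torch.Tensor, frame_feat: torch.Tensor) -> torch.Tensor:
    """memory: (state_dim,), frame_feat: (frame_dim,) -> new (state_dim,) memory."""
    return torch.tanh(W @ torch.cat([memory, frame_feat]))

memory = torch.zeros(state_dim)
for _ in range(300):                      # 300 frames of "history"
    frame_feat = torch.randn(frame_dim)   # stand-in for per-frame encoder output
    memory = update_memory(memory, frame_feat)
print(memory.shape)                       # torch.Size([256]) -- fixed size regardless of length
```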
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.
CASA is a new way to mix images and text inside a language model while keeping speed and memory costs low and accuracy high.
This paper shows how to protect your photos from being misused by new AI image editors that can copy your face or style from just one picture.
Robots often see the world as flat pictures but must move in a 3D world, which makes accurate actions hard.
This paper introduces BiCo, a one-shot way to mix ideas from images and videos by tightly tying each visual idea to the exact words in a prompt.