Motion 3-to-4 turns a single ordinary video into a moving 3D object over time (a 4D asset) by first recovering the object's shape and then estimating how every part of it moves.
This paper introduces two companion models, Qwen3-VL-Embedding and Qwen3-VL-Reranker, that understand text, images, visual documents, and videos in one shared space so search works across all of them.
LTX-2 is an open-source model that makes video and sound together from a text prompt, so the picture and audio match in time and meaning.
The paper teaches a video model to squeeze long video history into a tiny memory while still keeping sharp details in single frames.
Robots often get confused on long, multi-step tasks when they only see the final goal image and try to guess the next move directly.
CASA is a new way to mix images and text inside a language model that keeps compute and memory costs low while keeping accuracy high.
This paper protects your photos from being misused by new AI image editors that can copy your face or style from just one picture.
Robots often see the world as flat pictures but must move in a 3D world, which makes accurate actions hard.
This paper introduces BiCo, a one-shot way to combine concepts from images and videos by tightly binding each visual concept to the exact words in a prompt.
D4RT is a new AI model that turns regular videos into moving 3D scenes (4D) quickly and accurately.