Qwen3-TTS is a family of text-to-speech models that can talk in 10+ languages, clone a new voice from just 3 seconds of audio, and follow detailed style instructions in real time.
OpenVision 3 is a single vision encoder that learns one set of image tokens that works well for both understanding images (like answering questions) and generating images (like making new pictures).
This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'
Benign fine-tuning meant to make language models more helpful can accidentally make them overshare private information.
Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the instructions you give them.
Render-of-Thought (RoT) turns the model’s step-by-step thinking from long text into slim images so the model can think faster with fewer tokens.
HERMES is a training-free way to make video-language models understand live, streaming video quickly and accurately.
GutenOCR turns a general vision-language model into a single, smart OCR front-end that can read, find, and point to text on a page using simple prompts.
The paper shows how to control accents in text-to-speech (TTS) by mixing simple, linguistics-based sound-change rules with speaker embeddings.
This paper introduces HUVR, a single vision model that can both recognize what’s in an image and reconstruct or generate images from tiny codes.
VideoMaMa is a model that turns simple black-and-white object masks into soft, precise cutouts (alpha mattes) for every frame of a video.
Motion 3-to-4 turns a single regular video into a moving 3D object over time (a 4D asset) by first getting the object’s shape and then figuring out how every part moves.