Small AI models often stumble when a tool call fails, then get stuck repeating the same bad call instead of fixing the mistake.
Qwen3-TTS is a family of text-to-speech models that can talk in 10+ languages, clone a new voice from just 3 seconds of audio, and follow detailed style instructions in real time.
Robots need videos that not only look pretty but also obey real-world physics and actually complete the task they were asked to show.
OpenVision 3 is a single vision encoder that learns one set of image tokens useful for both understanding images (like answering questions) and generating images (like making new pictures).
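To show the shared-token idea in miniature, here is a hypothetical toy, not OpenVision 3's actual architecture; the encoder, heads, and sizes below are invented for the sketch:

```python
import torch
import torch.nn as nn

class SharedTokenEncoder(nn.Module):
    """Toy encoder: one image becomes one set of tokens reused by both tasks."""
    def __init__(self, dim: int = 64):
        super().__init__()
        # 8x8 patches of a 32x32 image -> a 4x4 grid of patch embeddings
        self.patchify = nn.Conv2d(3, dim, kernel_size=8, stride=8)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (B, dim, 4, 4) -> (B, 16 tokens, dim)
        return self.patchify(images).flatten(2).transpose(1, 2)

encoder = SharedTokenEncoder()
understand_head = nn.Linear(64, 1000)      # e.g., answer/classification logits
generate_head = nn.Linear(64, 8 * 8 * 3)   # e.g., an 8x8 RGB patch per token

images = torch.randn(2, 3, 32, 32)
tokens = encoder(images)                    # one set of image tokens...
answers = understand_head(tokens.mean(1))   # ...pooled for understanding
patches = generate_head(tokens)             # ...and reused for generation
```

The point of the sketch is only that both heads consume the same tokens; in a unified encoder, training signals from both tasks shape that one representation.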
This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'
Benign fine-tuning meant to make language models more helpful can accidentally make them overshare private information.
Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.
Diffusion language models can write tokens in any order, but that freedom can accidentally hurt their ability to reason well.
Render-of-Thought (RoT) turns the model’s step-by-step thinking from long text into compact images so the model can think faster with fewer tokens.
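As a rough sketch of the rendering step (the font, layout, and image size here are assumptions, not the paper's actual pipeline):

```python
from PIL import Image, ImageDraw

def render_cot(text: str, width: int = 448, line_height: int = 14,
               chars_per_line: int = 70) -> Image.Image:
    """Render a reasoning trace into a compact grayscale image."""
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    img = Image.new("L", (width, line_height * max(len(lines), 1)), color=255)
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((4, row * line_height), line, fill=0)  # default bitmap font
    return img

cot = "Step 1: restate the problem. Step 2: try small cases. Step 3: generalize."
render_cot(cot).save("cot.png")
```

The intuition: a vision encoder compresses an image into a fixed, small number of visual tokens, so a long reasoning trace rendered this way can cost far fewer tokens than the same trace kept as raw text.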
HERMES is a training-free way to make video-language models understand live, streaming video quickly and accurately.
Typhoon OCR is an open, lightweight vision-language model that reads Thai and English documents and returns clean, structured text.
Robots used to explore by following simple rules or short-term rewards, which often made them waste time and backtrack a lot.