DiffCoT treats a model’s step-by-step thinking (Chain-of-Thought) like a messy draft that can be cleaned up over time, not something fixed forever.
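To make the "editable draft" idea concrete, here is a toy Python sketch, not DiffCoT's actual method: the chain of thought lives as a mutable list of steps, and a stand-in reviser may rewrite any step on any pass, unlike left-to-right decoding where earlier steps are frozen once emitted.

```python
import random

def revise_step(step: str, rng: random.Random) -> str:
    """Stand-in for a learned editor/denoiser; here it just tags a revision."""
    return step if rng.random() < 0.5 else step + " [revised]"

def refine_draft(draft: list[str], passes: int = 3, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    for _ in range(passes):
        # Every step is re-examined on every pass; nothing is fixed forever.
        draft = [revise_step(s, rng) for s in draft]
    return draft

if __name__ == "__main__":
    cot = ["Parse the question.", "Recall relevant facts.", "Combine into an answer."]
    for step in refine_draft(cot):
        print(step)
```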
Real users often ask vague, underspecified questions about images, and today’s vision-language models (VLMs) struggle to handle them.
This paper teaches a computer agent to grow a toolbox of skills that are real, runnable programs, not just ideas written down as text.
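A minimal sketch of what such a toolbox could look like; the `SkillLibrary` class and its methods are illustrative assumptions, not the paper's actual API. The point is that skills are stored as executable functions the agent can run and reuse, not as free-text notes.

```python
from typing import Callable, Dict

class SkillLibrary:
    """Toy skill store: skills are runnable Python functions (illustrative)."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable] = {}

    def add(self, name: str, fn: Callable) -> None:
        """Register a new skill once it has been verified to run."""
        self._skills[name] = fn

    def run(self, name: str, *args, **kwargs):
        """Execute a stored skill directly, no re-interpretation of text needed."""
        return self._skills[name](*args, **kwargs)

library = SkillLibrary()
library.add("count_words", lambda text: len(text.split()))
print(library.run("count_words", "a toolbox of runnable skills"))  # -> 5
```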
EpiQAL is a new benchmark that tests how well AI models answer population-level disease questions using real research papers.
ThinkRL-Edit teaches an image editor to think first and draw second, which makes tricky, reasoning-heavy edits much more accurate.
This paper trains language models with extra "language homework" built from the same raw text, so they learn grammar and meaning, not just next-word guessing.
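A toy sketch of the general recipe, with an auxiliary task (capitalization) and a loss weight (0.3) that are purely illustrative, not the paper's actual objectives: derive extra labels from the raw text itself, then mix their loss with the usual next-word loss.

```python
# The auxiliary labels come "for free" from the raw training text itself.
text = "The model learns grammar . It also learns meaning ."
tokens = text.split()

# Illustrative auxiliary task: does each token start with a capital letter?
aux_labels = [int(t[0].isupper()) for t in tokens]

def combined_loss(lm_loss: float, aux_loss: float, weight: float = 0.3) -> float:
    """Total objective: standard next-word loss plus a weighted auxiliary loss."""
    return lm_loss + weight * aux_loss

print(aux_labels)                 # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(combined_loss(2.10, 0.45))  # 2.235
```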
Mixture-of-Experts (MoE) language models don’t split cleanly into domain specialists; instead, a small, stable group of experts gets chosen again and again across many subjects.
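The finding suggests a simple measurement, sketched below with made-up routing counts (the domains, expert indices, and `k` are all illustrative): tally which experts each domain routes to, then intersect the per-domain top sets. A real analysis would log the router's choices over a large corpus per domain.

```python
from collections import Counter

# Fabricated routing logs: expert indices chosen per token, per domain.
routing_log = {
    "biology": [0, 3, 3, 7, 3, 0, 3, 7],
    "law":     [3, 3, 0, 7, 3, 1, 3, 0],
    "code":    [7, 3, 3, 3, 0, 3, 7, 2],
}

def top_experts(choices: list[int], k: int = 3) -> set[int]:
    """The k most frequently routed-to experts within one domain."""
    return {e for e, _ in Counter(choices).most_common(k)}

tops = [top_experts(v) for v in routing_log.values()]
print(set.intersection(*tops))  # {0, 3, 7}: a small core reused everywhere
```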
InfiniDepth is a new way to predict depth that treats depth as a smooth, continuous function you can query at any image location, not just at the fixed pixels of a grid.
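A toy sketch of the "query depth anywhere" interface; bilinear interpolation over a tiny grid stands in for whatever learned continuous representation InfiniDepth actually uses. The practical upshot is the same: you can evaluate depth at sub-pixel coordinates, so output resolution is no longer tied to the grid.

```python
coarse = [  # 3x3 stand-in for a predicted depth field
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, 9.0],
]

def depth_at(x: float, y: float) -> float:
    """Query depth at continuous coordinates x, y in [0, 1]."""
    gx, gy = x * 2, y * 2                      # map to grid coordinates
    x0, y0 = int(gx), int(gy)
    x1, y1 = min(x0 + 1, 2), min(y0 + 1, 2)    # clamp at the border
    fx, fy = gx - x0, gy - y0
    top = coarse[y0][x0] * (1 - fx) + coarse[y0][x1] * fx
    bot = coarse[y1][x0] * (1 - fx) + coarse[y1][x1] * fx
    return top * (1 - fy) + bot * fy

print(depth_at(0.25, 0.25))  # 2.25: a real sub-pixel query between grid points
print(depth_at(1.0, 1.0))    # 9.0: the image corner
```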
LTX-2 is an open-source model that makes video and sound together from a text prompt, so the picture and audio match in time and meaning.
This paper fixes a common problem in multimodal AI: models can understand pictures and words well but stumble when asked to create matching images.
Unified Thinker separates “thinking” (planning) from “drawing” (image generation) so complex instructions get turned into clear, doable steps before any pixels are painted.
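A toy sketch of that split, with stand-in functions rather than the paper's actual planner and generator: `think` decomposes the instruction into explicit steps, and `draw` only runs afterwards, one step at a time.

```python
def think(instruction: str) -> list[str]:
    """Stand-in planner: decompose one complex instruction into atomic steps."""
    return [s.strip() for s in instruction.split(", then ")]

def draw(canvas: list[str], step: str) -> list[str]:
    """Stand-in generator: apply one concrete step to the 'image'."""
    return canvas + [f"applied: {step}"]

canvas: list[str] = []
for step in think("add a red hat, then remove the background, then brighten the sky"):
    canvas = draw(canvas, step)   # pixels are only 'painted' after planning
print(canvas)
```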
This paper shows that training a language model with reinforcement learning on just one carefully designed example can boost reasoning across many school subjects, not just math.
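A toy sketch of the single-example loop, assuming nothing about the paper's model or algorithm details: a two-action policy is trained with a REINFORCE-style update against one fixed prompt and a verifiable exact-match reward.

```python
import math, random

PROMPT, GOLD = "What is 7 * 8?", "56"
ACTIONS = ["56", "54"]            # candidate answers the policy can emit
logits = [0.0, 0.0]               # one policy parameter per action
rng = random.Random(0)

for _ in range(200):              # many rollouts of the same single example
    probs = [math.exp(l) for l in logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    a = rng.choices(range(len(ACTIONS)), probs)[0]
    reward = 1.0 if ACTIONS[a] == GOLD else 0.0   # verifiable reward
    for i in range(len(ACTIONS)):
        # REINFORCE: d/d logit_i of log pi(a) = 1[i == a] - probs[i]
        logits[i] += 0.1 * reward * ((1.0 if i == a else 0.0) - probs[i])

best = max(range(len(ACTIONS)), key=lambda i: logits[i])
print(PROMPT, "->", ACTIONS[best])   # the policy converges on "56"
```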