The paper shows that video AIs do not need long, human-like chains of thought to reason well.
Diffusion language models write by gradually unmasking hidden words, so deciding which blanks to reveal next is a big deal for both speed and accuracy.
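One common way to make that decision is confidence-based unmasking: at every step, reveal the masked positions the model is most sure about. Below is a minimal sketch of that rule, not the paper's actual policy; `logits`, `is_masked`, and `k` are illustrative names.

```python
import numpy as np

def unmask_step(logits, is_masked, k=2):
    """One decoding step for a masked-diffusion LM (illustrative sketch).

    logits: (seq_len, vocab_size) model scores for every position.
    is_masked: boolean (seq_len,) array, True where a token is still hidden.
    k: how many blanks to reveal this step.
    Returns the chosen positions and the tokens committed there.
    """
    # Stable softmax over the vocabulary at each position.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    confidence = probs.max(axis=-1)             # probability of the best guess
    confidence[~is_masked] = -np.inf            # never re-reveal known tokens
    k = min(k, int(is_masked.sum()))            # cannot reveal more than remain
    reveal = np.argsort(confidence)[::-1][:k]   # k most confident blanks
    tokens = probs[reveal].argmax(axis=-1)      # commit the top guess at each
    return reveal, tokens
```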
This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.
TreeGRPO teaches image generators using a smart branching tree so each training run produces many useful learning signals instead of just one.
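In sketch form, the idea is to score the leaves of a sampling tree and give every branch point a GRPO-style advantage relative to its siblings, so one tree yields many signals instead of one. The node layout and reward values below are our own stand-ins, not the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    children: list = field(default_factory=list)
    reward: float = 0.0      # set on leaves, e.g. an image-quality score
    value: float = 0.0       # filled in by backup()
    advantage: float = 0.0   # filled in by assign_advantages()

def backup(node):
    """A node's value is the mean reward of the leaves beneath it."""
    if not node.children:
        node.value = node.reward
        return [node.reward]
    leaf_rewards = [r for c in node.children for r in backup(c)]
    node.value = sum(leaf_rewards) / len(leaf_rewards)
    return leaf_rewards

def assign_advantages(node):
    """GRPO-style: score each child relative to its sibling group."""
    if not node.children:
        return
    vals = [c.value for c in node.children]
    mean = sum(vals) / len(vals)
    std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
    for c in node.children:
        c.advantage = (c.value - mean) / std   # one signal per branch point
        assign_advantages(c)

# Toy tree: the root branches twice, each branch splits into two leaves.
root = Node(children=[
    Node(children=[Node(reward=0.9), Node(reward=0.7)]),
    Node(children=[Node(reward=0.2), Node(reward=0.4)]),
])
backup(root)
assign_advantages(root)
print([c.advantage for c in root.children])  # approx. [1.0, -1.0]
```

The contrast with flat GRPO: a flat group of four samples yields signals only at the root, while the tree also scores the intermediate branch choices.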
The paper asks when reinforcement learning (RL) really makes language models better at reasoning beyond what they learned in pre-training.
The paper shows that making a model write a number as a sequence of digits and then grading the whole number at the end works better than grading each digit separately.
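A toy contrast makes the distinction concrete (the function names are ours; the paper's setup is more involved): per-digit grading hands out partial credit even when the assembled number is wrong, while whole-number grading only rewards getting the final value right.

```python
def per_digit_reward(pred_digits, true_digits):
    """Grade each digit separately: partial credit even if the number is wrong."""
    hits = sum(p == t for p, t in zip(pred_digits, true_digits))
    return hits / len(true_digits)

def whole_number_reward(pred_digits, true_digits):
    """Grade the assembled number once, after every digit is written."""
    return 1.0 if "".join(pred_digits) == "".join(true_digits) else 0.0

pred, true = list("1327"), list("1347")
print(per_digit_reward(pred, true))     # 0.75: credit despite a wrong answer
print(whole_number_reward(pred, true))  # 0.0: the number itself is wrong
```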
Large language models forget or misuse new facts if you only poke their weights once; EtCon fixes this with a two-step plan.
COOPER is a single AI model that both “looks better” (perceives depth and object boundaries) and “thinks smarter” (reasons step by step) to answer spatial questions about images.
SPARK teaches AI to grade its own steps without needing the right answers written down anywhere.
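The paper's exact reward isn't reproduced here, but a familiar label-free signal in this family is self-consistency: sample several solutions to the same question, take the majority final answer as a pseudo-label, and reward agreement with it. A minimal sketch of that idea, with made-up data:

```python
from collections import Counter

def consistency_rewards(samples):
    """Label-free grading: reward solutions that agree with the majority answer.

    samples: (steps, final_answer) pairs drawn from the same model for one
    question. No ground-truth answer is needed anywhere.
    """
    majority, _ = Counter(ans for _, ans in samples).most_common(1)[0]
    return [1.0 if ans == majority else 0.0 for _, ans in samples]

samples = [
    (["12 * 7 = 84", "84 + 16 = 100"], "100"),
    (["12 * 7 = 84", "84 + 16 = 100"], "100"),
    (["12 * 8 = 96", "96 + 16 = 112"], "112"),
]
print(consistency_rewards(samples))  # [1.0, 1.0, 0.0]
```

To grade individual steps rather than whole solutions, the same trick can be applied to prefixes, scoring a step by how often completions sampled from it still reach the majority answer.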
ReVSeg teaches an AI to segment objects in videos by thinking step by step instead of guessing everything at once.
This paper teaches image models to keep things consistent across multiple pictures—like the same character, art style, and story logic—using reinforcement learning (RL).
This paper teaches AI models to reason better by first copying only good examples and later learning from mistakes too.
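A schematic of that two-phase schedule, with the sampler and correctness check left as stubs (all names here are ours): phase one imitates only verified-correct samples, and phase two switches to a policy-gradient-style update where wrong attempts carry a negative sign.

```python
def phase1_imitate_good(model, prompts, sample, is_correct, sft_step):
    """Phase 1: copy only good examples; mistakes are simply discarded."""
    for p in prompts:
        y = sample(model, p)
        if is_correct(p, y):
            sft_step(model, p, y)   # maximize log-prob of the verified sample

def phase2_learn_from_mistakes(model, prompts, sample, is_correct, pg_step):
    """Phase 2: keep every attempt; the advantage's sign carries the lesson."""
    for p in prompts:
        y = sample(model, p)
        advantage = 1.0 if is_correct(p, y) else -1.0
        pg_step(model, p, y, advantage)

# Toy run: the "model" is ignored, answers are canned, updates are logged.
log = []
phase1_imitate_good(
    model=None,
    prompts=["2+2", "3+3"],
    sample=lambda m, p: {"2+2": "4", "3+3": "7"}[p],
    is_correct=lambda p, y: str(eval(p)) == y,
    sft_step=lambda m, p, y: log.append(("sft", p, y)),
)
print(log)  # [('sft', '2+2', '4')]: the wrong 3+3 -> 7 sample was dropped
```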