Loop-ViT is a vision model that thinks in loops, so it can take more steps on hard puzzles and stop early on easy ones.
AdaReasoner teaches an AI model to pick the right visual tools, use them in the right order, and stop using them when they aren’t helping.
This paper turns a video model into a step-by-step visual thinker that produces one final, high-quality picture from a text prompt.
BabyVision is a new test that checks if AI can handle the same basic picture puzzles that young children can do, without leaning on language tricks.
Large Multimodal Models (LMMs) are great at reading text and looking at pictures, but they usually do most of their thinking in words, which limits how deeply they can reason about what they see.
This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.
This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.