SpatiaLab is a new test that checks if vision-language models (VLMs) can understand real-world spatial puzzles, like what’s in front, behind, bigger, or reachable.
The paper asks a simple question: when an AI sees a picture and some text that disagree, and the instructions say 'only trust the picture,' how does it decide which one to follow?
This paper shows that comics (multi-panel pictures with words) can help AI think through problems step by step, just like a student explains their work.
Kimi K2.5 is a new open-source AI that can read both text and visuals (images and videos) and act like a team of helpers to finish big tasks faster.
This paper shows a new way to help AI think through long problems faster by turning earlier text steps into small pictures the AI can reread.
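To picture the trick, here is a tiny sketch of the general idea, not the paper's actual code: it uses Python's Pillow library to render a few earlier reasoning steps onto one small image that a vision-language model could then look back at instead of rereading all the text.

```python
# Minimal sketch (our own illustration, assuming Pillow): turn a list of earlier
# text reasoning steps into one compact image the model can "reread" later.
from PIL import Image, ImageDraw, ImageFont
import textwrap

def render_steps_as_image(steps, width=512, line_height=14):
    """Render earlier reasoning steps onto a single small image."""
    lines = []
    for i, step in enumerate(steps, 1):
        lines.extend(textwrap.wrap(f"{i}. {step}", width=80))
    img = Image.new("RGB", (width, line_height * len(lines) + 16), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    for row, line in enumerate(lines):
        draw.text((8, 8 + row * line_height), line, fill="black", font=font)
    return img

# Example: several lines of earlier work become one small picture.
history = [
    "Counted 3 red blocks in the left pile.",
    "The blue block sits behind the tallest tower.",
]
render_steps_as_image(history).save("earlier_steps.png")
```

The point of the sketch is only the shape of the idea: one small image can stand in for many lines of text, which is where the saving comes from.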
MMFineReason is a huge, open dataset (1.8 million examples, 5.1 billion solution tokens) that teaches AIs to think step by step about pictures and text together.
This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'
XR is a new, training-free team of AI helpers that finds images using both a reference picture and a short text edit (like “same jacket but red”).
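For flavor, here is a rough, training-free sketch of how this kind of "picture plus text edit" search is often done with off-the-shelf CLIP embeddings. It is our own illustration, not XR's pipeline, and the simple vector fusion below is just one common baseline.

```python
# Hedged sketch of training-free composed image retrieval: embed the reference
# picture and the text edit with CLIP, fuse the two vectors, rank the gallery.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def composed_query(reference_path: str, edit_text: str) -> torch.Tensor:
    """Build one query vector from a reference image plus a short text edit."""
    image = preprocess(Image.open(reference_path)).unsqueeze(0)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(tokenizer([edit_text]))
    query = img_feat / img_feat.norm() + txt_feat / txt_feat.norm()  # simple fusion
    return query / query.norm()

def rank_gallery(query: torch.Tensor, gallery_paths: list[str]) -> list[str]:
    """Sort gallery images by cosine similarity to the composed query."""
    feats = []
    with torch.no_grad():
        for p in gallery_paths:
            f = model.encode_image(preprocess(Image.open(p)).unsqueeze(0))
            feats.append(f / f.norm())
    sims = torch.cat(feats) @ query.T
    order = sims.squeeze(1).argsort(descending=True)
    return [gallery_paths[i] for i in order]

# best = rank_gallery(composed_query("jacket.jpg", "same jacket but red"), ["a.jpg", "b.jpg"])
```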
Computer-using agents keep forgetting important visual details over long tasks and cannot reliably find up-to-date, step-by-step help for unfamiliar apps.
AgentOCR turns an agent’s long text history into pictures so it can remember more using fewer tokens.
FOCUSUI makes computer-using AI faster while staying accurate by looking only at the important parts of a screen.
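As a rough illustration of the "look only where it matters" idea (not FOCUSUI's actual method), the sketch below keeps a cheap low-resolution overview of the whole screen plus a full-resolution crop of one region of interest; the box coordinates are hypothetical, e.g. from a UI element detector or the agent's last click target.

```python
# Hedged sketch: send the model a small overview of the whole screen and a
# native-resolution crop of the part that matters for the current step.
from PIL import Image

def focus_screenshot(path: str, box: tuple[int, int, int, int],
                     overview_width: int = 480):
    """Return (low-res overview, full-res crop of the important region)."""
    screen = Image.open(path)
    scale = overview_width / screen.width
    overview = screen.resize((overview_width, int(screen.height * scale)))
    crop = screen.crop(box)  # the "important part" keeps its native resolution
    return overview, crop

# overview, crop = focus_screenshot("screen.png", box=(900, 60, 1280, 220))
```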
Real people often ask vague questions with pictures, and today’s vision-language models (VLMs) struggle with them.