SpatiaLab is a new test that checks whether vision-language models (VLMs) can handle real-world spatial questions, like what is in front of or behind something, which object is bigger, or whether something is within reach.
The paper asks a simple question: when an AI sees a picture and some text that disagree, and the instructions say 'trust only the picture,' how does it decide which one to follow?
This paper shows that comics (multi-panel pictures with words) can help AI think through problems step by step, just as a student explains their work.
This paper shows a new way to help AI think through long problems faster by turning earlier text steps into small pictures the AI can reread.
MMFineReason is a huge, open dataset (1.8 million examples, 5.1 billion solution tokens) that teaches AIs to think step by step about pictures and text together.
This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'
Computer-using agents tend to forget important visual details over long tasks and cannot reliably find up-to-date, step-by-step help for unfamiliar apps.
AgentOCR turns an agent’s long text history into pictures so the agent can remember more while using fewer tokens.
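To make the idea concrete, here is a minimal sketch of the general trick of "render text history as an image," not AgentOCR's actual pipeline: a long list of past steps is drawn onto a single picture with Pillow, which a vision-language model could then reread as one visual input instead of thousands of history tokens. The function name, image size, and font choice are illustrative assumptions.

```python
# Illustrative sketch only: compress a long text history into one image
# that a vision-language model can "reread" as a single visual input.
# This is NOT AgentOCR's actual method; names and parameters are assumptions.
import textwrap
from PIL import Image, ImageDraw

def render_history(history: list[str], width: int = 800, line_height: int = 14) -> Image.Image:
    # Wrap each history entry so long lines fit inside the image width.
    lines = []
    for step in history:
        lines.extend(textwrap.wrap(step, width=100) or [""])
    img = Image.new("RGB", (width, line_height * len(lines) + 20), "white")
    draw = ImageDraw.Draw(img)
    y = 10
    for line in lines:
        draw.text((10, y), line, fill="black")  # default bitmap font is enough for a sketch
        y += line_height
    return img

if __name__ == "__main__":
    past_steps = [f"step {i}: clicked a button and observed the result" for i in range(50)]
    render_history(past_steps).save("history.png")  # one image instead of many text tokens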
FOCUSUI makes computer-using AI faster without losing accuracy by looking only at the important parts of the screen.
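As a rough illustration of "look only at the important parts," the hypothetical sketch below crops a screenshot to a predicted region of interest so the model processes a small patch instead of the full screen; this is not FOCUSUI's actual method, and the bounding box here is a placeholder.

```python
# Illustrative sketch only: crop a screenshot to a region of interest
# so the model sees a small patch instead of the whole screen.
# This is NOT FOCUSUI's actual method; the bounding box is a placeholder.
from PIL import Image

def crop_to_region(screenshot: Image.Image, box: tuple[int, int, int, int], margin: int = 20) -> Image.Image:
    left, top, right, bottom = box
    w, h = screenshot.size
    # Pad the box a little so nearby context is kept, then clamp to the screen edges.
    return screenshot.crop((
        max(0, left - margin),
        max(0, top - margin),
        min(w, right + margin),
        min(h, bottom + margin),
    ))

if __name__ == "__main__":
    screen = Image.open("screenshot.png")           # assumed full-screen capture on disk
    roi = (400, 300, 900, 520)                      # placeholder box, e.g. a dialog window
    crop_to_region(screen, roi).save("focused.png") # much smaller input for the model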
WebGym is a giant practice world (almost 300,000 tasks) that lets AI web agents learn on real, ever-changing websites instead of tiny, fake ones.
ProGuard is a safety guard for text and images that doesn’t just spot known problems: it can also recognize and name new, never-seen-before risks.
This paper teaches vision-language models (AIs that look and read) to pay attention to the right parts of a picture without needing extra tools when they answer.