AdaptMMBench is a new test that checks if AI models know when to just look and think, and when to use extra visual tools like zooming or brightening an image.
A2Eval is a two-agent system that automatically builds and runs fair tests for robot-style vision-language models, cutting wasted work while keeping results trustworthy.
Youtu-VL is a new kind of vision-language model that learns to predict both words and tiny image pieces, not just words.
Render-of-Thought (RoT) turns the model’s step-by-step thinking from long text into slim images so the model can think faster with fewer tokens.
ChartVerse is a new way to make lots of tricky, realistic charts, along with questions whose answers have been checked, so AI can learn to read charts better.
This paper teaches AI to look around a 3D place step by step, instead of staring at a fixed set of pictures, so it can answer tricky spatial questions better.
This paper teaches a model to turn a question about a table into both a short answer and a clear, correct chart.
CPPO is a new way to fine-tune vision-language models so they see pictures more accurately before they start to reason.
QuantiPhy is a new test that checks if AI models can measure real-world physics from videos using numbers, not guesses.
This paper teaches robots to move their camera to a better spot before answering a question about what they see.
ReVSeg teaches an AI to segment objects in videos by thinking step-by-step instead of guessing everything at once.
Before this work, big vision-language models (VLMs) were great at understanding pictures and words together but not at making new pictures.