AdaptMMBench is a new test that checks if AI models know when to just look and think, and when to use extra visual tools like zooming or brightening an image.
A2Eval is a two-agent system that automatically builds and runs fair tests for robot-style vision-language models, cutting wasted work while keeping results trustworthy.
Youtu-VL is a new kind of vision-language model that learns to predict both words and tiny image pieces, not just words.
Render-of-Thought (RoT) turns the model’s step-by-step thinking from long text into slim images so the model can think faster with fewer tokens.
ChartVerse is a new way to generate large numbers of tricky, realistic charts, along with carefully verified questions, so AI can learn to read charts better.
This paper teaches AI to look around a 3D place step by step, instead of staring at a fixed set of pictures, so it can answer tricky spatial questions better.
This paper teaches a model to turn a question about a table into both a short answer and a clear, correct chart.
CPPO is a new way to fine-tune vision-language models so they see pictures more accurately before they start to reason.
Robots such as self-driving cars and drones see the world through many different sensors (cameras, LiDAR, radar, and even event cameras), and this paper lays out a clear roadmap for teaching them to understand space by learning from all of these together.
QuantiPhy is a new test that checks if AI models can measure real-world physics from videos using numbers, not guesses.
This paper teaches robots to move their camera to a better spot before answering a question about what they see.
ReVSeg teaches an AI to segment objects in videos by thinking step-by-step instead of guessing everything at once.