RIVER Bench is a new test that checks how well AI can watch a video stream and talk with you in real time.
This paper teaches AI models to learn how-to steps on the fly from demonstrations, the way people do.
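Taken at face value, "learning from demonstrations on the fly" suggests in-context learning: pack worked step-by-step demonstrations into the prompt and let the model imitate their structure on a new task. The sketch below is a generic illustration of that idea, not the paper's actual method; the demonstration format and prompt layout are assumptions.

```python
# Generic sketch: assemble worked demonstrations into a few-shot prompt
# so the model can copy the step structure at inference time.
def build_prompt(demos: list[list[str]], task: str) -> str:
    """Pack numbered demonstration steps into one prompt string."""
    parts = []
    for i, steps in enumerate(demos, 1):
        numbered = "\n".join(f"  {j}. {s}" for j, s in enumerate(steps, 1))
        parts.append(f"Demonstration {i}:\n{numbered}")
    parts.append(f"New task: {task}\nSteps:")
    return "\n\n".join(parts)

demos = [
    ["open the lid", "insert the filter", "add grounds", "pour water"],
    ["plug in the kettle", "fill with water", "press the switch"],
]
print(build_prompt(demos, "make a cup of tea"))
```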
VidVec shows that video-capable multimodal language models already hide strong matching signals between videos and sentences inside their middle layers.
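As a toy illustration of that probing idea (not VidVec's actual code), the sketch below plants a video-text alignment signal that peaks in the middle of a synthetic layer stack, then checks layer by layer how often cosine similarity matches each video to its own caption. All shapes and the fake embeddings are assumptions; a real probe would mean-pool hidden states from a video-capable multimodal model instead.

```python
# Toy layer probe: at which layer does cosine similarity best match
# each "video" embedding to its paired "caption" embedding?
import numpy as np

rng = np.random.default_rng(0)
n_pairs, n_layers, dim = 100, 24, 64

# Shared content vectors; the video-text alignment signal is made
# strongest around the middle layers, mimicking the paper's finding.
content = rng.normal(size=(n_pairs, dim))
video_feats, text_feats = [], []
for layer in range(n_layers):
    signal = np.exp(-((layer - n_layers / 2) ** 2) / 20.0)
    video_feats.append(signal * content + rng.normal(size=(n_pairs, dim)))
    text_feats.append(signal * content + rng.normal(size=(n_pairs, dim)))

def recall_at_1(v: np.ndarray, t: np.ndarray) -> float:
    """Fraction of videos whose nearest caption (by cosine) is the true one."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sims = v @ t.T
    return float((sims.argmax(axis=1) == np.arange(len(v))).mean())

for layer in range(n_layers):
    score = recall_at_1(video_feats[layer], text_feats[layer])
    print(f"layer {layer:2d}: recall@1 = {score:.2f}")
```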
Multimodal AI models can mix up what they see and what they hear, making things up across senses; this is called cross-modal hallucination.
WorldVQA is a new test that checks if multimodal AI models can correctly name what they see in pictures without doing extra reasoning.
AdaReasoner teaches AI to pick the right visual tools, use them in the right order, and stop using them when they aren’t helping.
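The select-order-stop behavior that summary describes could look like the hypothetical loop below. The tool names (`zoom`, `denoise`), the confidence score, and the greedy stopping rule are all invented for illustration; AdaReasoner trains this behavior rather than hard-coding it.

```python
# Hypothetical tool-use loop: try tools, keep whichever helps most,
# and stop once no tool improves confidence enough.
from typing import Callable

def zoom(state: dict) -> dict:     # pretend tool: enlarge a region
    return {**state, "detail": state["detail"] + 0.3}

def denoise(state: dict) -> dict:  # pretend tool: clean up the image
    return {**state, "clarity": state["clarity"] + 0.2}

TOOLS: dict[str, Callable[[dict], dict]] = {"zoom": zoom, "denoise": denoise}

def confidence(state: dict) -> float:
    """Stand-in for the model's answer confidence on the current view."""
    return min(1.0, 0.3 + 0.5 * state["detail"] + 0.4 * state["clarity"])

def run_tools(state: dict, min_gain: float = 0.05, max_steps: int = 5) -> dict:
    """Greedily apply whichever tool helps most; stop when none helps."""
    for _ in range(max_steps):
        base = confidence(state)
        best_name, best_state, best_gain = None, None, min_gain
        for name, tool in TOOLS.items():
            cand = tool(state)
            gain = confidence(cand) - base
            if gain > best_gain:
                best_name, best_state, best_gain = name, cand, gain
        if best_name is None:  # no tool is helping anymore: stop
            break
        print(f"applied {best_name} (gain {best_gain:.2f})")
        state = best_state
    return state

final = run_tools({"detail": 0.0, "clarity": 0.0})
print(f"final confidence: {confidence(final):.2f}")
```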
AVMeme Exam is a new test made by humans that checks if AI can understand famous internet audio and video clips the way people do.
The paper introduces SIN-Bench, a new way to test AI models that read long scientific papers by forcing them to show exactly where their answers come from.
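A minimal sketch of what "show exactly where the answer comes from" could mean in code: accept an answer only if the model's quoted evidence appears verbatim in the paper. The response fields and the exact-match rule below are assumptions, not SIN-Bench's real protocol.

```python
# Evidence-grounded grading: the cited quote must actually exist in the
# source text, otherwise the answer is rejected even if it looks right.
def grade(paper_text: str, response: dict, gold_answer: str) -> bool:
    """Accept only answers whose cited evidence is really in the source."""
    quote = response.get("evidence", "")
    grounded = bool(quote) and quote in paper_text
    correct = response.get("answer", "").strip().lower() == gold_answer.lower()
    return grounded and correct

paper = "We train on 2M video clips. Evaluation uses a held-out set of 10k clips."
good = {"answer": "2M video clips", "evidence": "We train on 2M video clips."}
bad  = {"answer": "2M video clips", "evidence": "We train on 3M video clips."}
print(grade(paper, good, "2M video clips"))  # True: quote found in source
print(grade(paper, bad, "2M video clips"))   # False: fabricated evidence
```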
BabyVision is a new test that checks if AI can handle the same basic picture puzzles that young children can do, without leaning on language tricks.
This paper teaches AI to solve diagram-based math problems by copying how people think: first see (perception), then make sense of what you saw (internalization), and finally reason (solve the problem).
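That see-then-internalize-then-reason recipe maps naturally onto a staged prompting pipeline, one model call per stage, each conditioned on the previous stage's output. The sketch below is a hedged illustration with a stub in place of the real vision-language model call; the stage prompts are invented, not the paper's.

```python
# Three-stage pipeline sketch: perception -> internalization -> reasoning.
def call_model(prompt: str) -> str:
    """Stub standing in for a VLM call; a real pipeline would attach the
    diagram image to every request."""
    return f"<output of: {prompt.splitlines()[0]}>"

def solve_diagram_problem(question: str) -> str:
    # Stage 1 (perception): list what is literally in the diagram.
    percept = call_model(
        "Stage 1: describe every labeled point, line, and angle.\n" + question
    )
    # Stage 2 (internalization): turn the description into relations.
    relations = call_model(
        "Stage 2: state the geometric relations these elements imply.\n"
        + percept
    )
    # Stage 3 (reasoning): solve using only the extracted relations.
    return call_model(
        "Stage 3: solve the question step by step from these relations.\n"
        + relations + "\n" + question
    )

print(solve_diagram_problem("In triangle ABC, find angle C."))
```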
This paper teaches AI to notice not just what is in a picture, but how the picture looks and feels to people.
Robust-R1 teaches vision-language models to notice how a picture is damaged, think through what that damage hides, and then answer as if the picture were clear.
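One way to read that recipe is as a data question: to learn to name the damage, the model needs training pairs where the corruption type is known. The sketch below builds such pairs with two assumed corruptions (noise, occlusion); the corruption set and label format are illustrative, not Robust-R1's actual pipeline.

```python
# Build (corrupted image, degradation label) pairs so a model can be
# trained to identify the damage before answering about the image.
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img: np.ndarray, sigma: float = 25.0) -> np.ndarray:
    """Add pixel noise, then clip back into valid 8-bit range."""
    noisy = img + rng.normal(scale=sigma, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def add_occlusion(img: np.ndarray, frac: float = 0.3) -> np.ndarray:
    """Black out a corner to hide part of the scene."""
    out = img.copy()
    h, w = img.shape[:2]
    out[: int(h * frac), : int(w * frac)] = 0
    return out

CORRUPTIONS = {"gaussian_noise": add_gaussian_noise, "occlusion": add_occlusion}

def make_pair(clean: np.ndarray) -> dict:
    """Return a training example: corrupted image plus its damage label."""
    name = rng.choice(list(CORRUPTIONS))
    return {
        "image": CORRUPTIONS[name](clean),
        "degradation": name,  # the model learns to identify this first
    }

clean = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
example = make_pair(clean)
print(example["degradation"], example["image"].shape)
```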