This paper finds that popular RLVR methods for training language and vision-language models carry hidden biases toward certain answer lengths, which can hurt learning.
This paper shows that comics (multi-panel pictures with words) can help AI think through problems step by step, just like a student explains their work.
RANKVIDEO is a video-native reasoning reranker that helps search engines find the right videos for a text query by directly looking at the video’s visuals and audio, not just text captions.
Mind-Brush turns image generation from a one-step 'read the prompt and draw' routine into a multi-step 'think, research, and create' process.
This paper argues that building true world models is not about sprinkling facts into single tasks, but about creating a unified system that can see, think, remember, act, and generate across many situations.
MMFineReason is a huge, open dataset (1.8 million examples, 5.1 billion solution tokens) that teaches AIs to think step by step about pictures and text together.
The paper studies how to make and judge scientific images that are not just pretty but scientifically correct.
LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.
Omni-R1 teaches AI to think with pictures and words at the same time by drawing helpful mini-images while reasoning.
VideoDR is a new benchmark that tests if AI can watch a video, pull out key visual clues, search the open web, and chain the clues together to find one verifiable answer.
ATLAS is a system that picks the best mix of AI models and helper tools for each question, instead of using just one model or a fixed tool plan.
Real people often ask vague questions alongside pictures, and today's vision-language models (VLMs) struggle with them.