The paper finds that popular RLVR methods for training language and vision-language models quietly favor certain answer lengths, which can hurt learning.
This paper shows that comics (multi-panel pictures with words) can help AI think through problems step by step, much like a student explaining their work.
RANKVIDEO is a video-native reasoning reranker that helps search engines find the right videos for a text query by directly looking at the video’s visuals and audio, not just text captions.
Mind-Brush turns image generation from a one-step 'read the prompt and draw' task into a multi-step 'think, research, and create' process.
This paper argues that building a true world model is not about sprinkling facts into single tasks, but about creating a unified system that can see, think, remember, act, and generate across many situations.
MMFineReason is a huge, open dataset (1.8 million examples, 5.1 billion solution tokens) that teaches AIs to think step by step about pictures and text together.
LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.
Omni-R1 teaches AI to think with pictures and words at the same time by drawing helpful mini-images while reasoning.
VideoDR is a new benchmark that tests if AI can watch a video, pull out key visual clues, search the open web, and chain the clues together to find one verifiable answer.
The paper teaches vision-language models (AIs that look and read) to pay attention to the right parts of a picture without needing extra tools while answering.
LongVideoAgent is a team of three AIs that work together to answer questions about hour‑long TV episodes without missing small details.
This paper builds a tough new test called O3-BENCH to check if AI can truly think with images, not just spot objects.