This paper argues that true world models come not from sprinkling abilities into single tasks, but from building a unified system that can see, think, remember, act, and generate across many situations.
MMFineReason is a huge, open dataset (1.8 million examples, 5.1 billion solution tokens) that teaches AIs to think step by step about pictures and text together.
The paper studies how to make and judge scientific images that are not just pretty but scientifically correct.
LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.
Omni-R1 teaches AI to think with pictures and words at the same time by drawing helpful mini-images while reasoning.
VideoDR is a new benchmark that tests if AI can watch a video, pull out key visual clues, search the open web, and chain the clues together to find one verifiable answer.
ATLAS is a system that picks the best mix of AI models and helper tools for each question, instead of using just one model or a fixed tool plan (a toy illustration of this routing idea follows this list).
Real people often pair pictures with vague questions, and today's vision-language models (VLMs) struggle to answer them.
The paper teaches vision-language models (AIs that look and read) to pay attention to the right parts of a picture without needing extra tools when they answer.
LongVideoAgent is a team of three AIs that work together to answer questions about hour‑long TV episodes without missing small details.
This paper builds a tough new test called O3-BENCH to check if AI can truly think with images, not just spot objects.
This paper builds a new test, LongShOTBench, to check if AI can truly understand long videos by using sight, speech, and sounds together.
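For a concrete feel of the per-question routing idea behind ATLAS (item 7 above), here is a minimal, purely illustrative sketch: score each candidate model-plus-tools combination against the question and pick the highest scorer. The candidate pool, trigger words, and keyword-based scoring rule are assumptions invented for this example, not the paper's actual method.

```python
# Illustrative sketch of per-question model/tool routing in the spirit of
# ATLAS. All names, candidates, and the scoring rule are hypothetical.
from dataclasses import dataclass


@dataclass
class Candidate:
    model: str               # e.g. a general VLM or a reasoning-focused LLM
    tools: tuple[str, ...]   # helper tools to attach (OCR, web search, ...)


def keyword_score(question: str, cand: Candidate) -> float:
    """Toy scoring rule: prefer tools whose trigger words appear in the question."""
    triggers = {
        "ocr": ["read", "text", "sign"],
        "search": ["who", "when", "latest"],
        "calculator": ["how many", "sum", "percent"],
    }
    q = question.lower()
    return float(sum(
        phrase in q
        for tool in cand.tools
        for phrase in triggers.get(tool, [])
    ))


def route(question: str, candidates: list[Candidate]) -> Candidate:
    """Pick the model/tool mix with the highest score for this question."""
    return max(candidates, key=lambda c: keyword_score(question, c))


if __name__ == "__main__":
    pool = [
        Candidate("general-vlm", ()),
        Candidate("general-vlm", ("ocr",)),
        Candidate("reasoning-llm", ("search", "calculator")),
    ]
    # Routes to the search + calculator candidate for this fact-heavy question.
    print(route("Who won the latest match and how many points?", pool))
```

A real router would learn its scoring function rather than matching keywords, but the shape is the same: evaluate candidate combinations per question, then dispatch to the winner.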