The paper teaches multimodal large language models (MLLMs) to stop guessing from text alone or images alone and instead check both together before answering.
This paper teaches video-making AIs to follow real-world physics, so rolling balls roll right and collisions look believable.
Medical SAM3 is a text-prompted medical image segmentation model, fully fine-tuned on 33 diverse datasets so it works across many imaging types, including ultrasound, X-ray, endoscopy, and pathology.
Visual grounding is an AI's ability to find the exact thing in a picture that a sentence is talking about, and this paper shows today’s big vision-language AIs are not as good at it as we thought.
ReVSeg teaches an AI to segment objects in videos by thinking step-by-step instead of guessing everything at once.