This paper teaches a computer to find the same object when seen from two very different cameras, like a body camera (first-person) and a room camera (third-person).
AI models that make CAD designs used to learn mostly from simple “draw-then-extrude” examples, so they struggled with real, complex parts.
This paper teaches AI to write movie-like scripts for videos by adding exact timestamps and rich details about what you see and hear.
This paper teaches multimodal large language models (MLLMs) to stop guessing from just text or just images and instead check both together before answering.
This paper teaches video-making AIs to follow real-world physics, so rolling balls roll right and collisions look believable.
Medical SAM3 is a text-prompted medical image segmentation model that was fully fine-tuned on 33 diverse datasets to work across many imaging types like ultrasound, X-ray, endoscopy, and pathology.
Visual grounding is an AI's ability to find the exact thing in a picture that a sentence is talking about, and this paper shows today's big vision-language AIs are not as good at it as we thought.
ReVSeg teaches an AI to segment objects in videos by reasoning step by step instead of guessing everything at once.