Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows today’s big vision-language AIs are not as good at it as we thought.
This paper teaches a video-understanding AI to think in 3D plus time (4D) so it can answer questions about specific objects moving in videos.
AuditDM is a friendly 'auditor' model that hunts for where vision-language models get things wrong and then creates the right practice to fix them.
This paper builds Step-GUI, a pair of small-but-strong GUI agent models (4B/8B) that can use phones and computers by looking at screenshots and following instructions.
TimeLens studies how to teach AI not just what happens in a video, but exactly when it happens, which is called video temporal grounding (VTG).
This paper introduces MMSI-Video-Bench, a big, carefully hand-made test to check how well AI understands space and motion in videos.
The paper shows that video AIs do not need long, human-like chains of thought to reason well.
OmniSafeBench-MM is a one-stop, open-source test bench that fairly compares how multimodal AI models get tricked (jailbroken) and how well different defenses stop that.