Multimodal Large Language Models (MLLMs) often hallucinate on videos because they trust language cues and common sense more than what the frames actually show.
Visual grounding means an AI pinpoints the exact thing in a picture that a sentence refers to, and this paper shows today's big vision-language models are not as good at it as we thought.
Robots that follow pictures and words (vision-language-action, or VLA, models) can do many tasks, but they still bump into things because nothing in the model guarantees safety.