WorldVQA is a new benchmark that checks whether multimodal AI models can correctly name what they see in pictures, without requiring any extra reasoning.
Multimodal Large Language Models (MLLMs) often hallucinate on videos because they trust words and common sense more than what the frames actually show.