Multimodal Large Language Models (MLLMs) often hallucinate on videos because they trust language cues and common sense more than what the frames actually show.
Visual grounding means an AI pinpoints the exact thing in a picture that a sentence refers to, and this paper shows today's big vision-language models are not as good at it as we thought.
Robots that follow pictures and words (vision-language-action, or VLA, models) can do many tasks, but they still bump into things because nothing in the model guarantees safety.