Papers (3)

#GroundingDINO

Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show.

#multimodal large language model #video understanding #visual hallucination

GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Intermediate

Rang Li, Lei Li et al., Dec 19, arXiv

Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows today’s big vision-language AIs are not as good at it as we thought.

#visual grounding #multimodal large language models #benchmark

VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

Intermediate

Songqiao Hu, Zeyi Liu et al., Dec 9, arXiv

Robots that follow pictures and words (VLA models) can do many tasks, but they often bump into things because safety isn’t guaranteed.

#Vision-Language-Action #Safety Constraint Layer #Control Barrier Function
