Papers (2)

#video-language models

MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models

Sangyun Chung, Se Yeon Kim et al. · Jan 29 · arXiv

Multimodal AI models can mix up what they see and what they hear, making things up across senses; this is called cross-modal hallucination.

#multimodal large language models #cross-modal hallucination #contrastive decoding


Factorized Learning for Temporally Grounded Video-Language Models

Intermediate

Wenzheng Zeng, Difei Gao et al. · Dec 30 · arXiv

This paper teaches video-language models to first locate when the supporting evidence occurs in a video and then answer based on that evidence, instead of mixing both steps together.

#temporal grounding #video-language models #evidence tokens
