Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Intermediate · Zhe Huang, Hao Wen et al. · Dec 30 · arXiv
Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show.
#multimodal large language model #video understanding #visual hallucination