LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
Intermediate · Linquan Wu, Tianxiang Jiang et al. · Jan 15 · arXiv
LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.
#multimodal reasoning #visual attention #knowledge distillation