LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.
Large Multimodal Models (LMMs) are great at reading text and looking at pictures, but they usually do most of their thinking in words, which limits deep visual reasoning.