Latent Implicit Visual Reasoning
IntermediateKelvin Li, Chuyi Shang et al.Dec 24arXiv
Large Multimodal Models (LMMs) are great at reading text and looking at pictures, but they usually do most of their thinking in words, which limits deep visual reasoning.
#Latent Implicit Visual Reasoning#latent tokens#bottleneck attention masking