CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion
Intermediate · Moritz Böhle, Amélie Royer et al. · Dec 22 · arXiv
CASA is a new way to fuse images and text inside a language model that keeps compute cost and memory use low while preserving accuracy.
#CASA #cross-attention #self-attention