This paper finds a precise way to describe and fix the Modality Gap, which is when image and text features that mean the same thing still sit in different places in the AI’s memory space.
Youtu-VL is a new kind of vision-language model that learns to predict both words and tiny image pieces, not just words.
This paper introduces YaPO, a way to gently nudge a language model’s hidden thoughts so it behaves better without retraining it.