The paper asks a simple question: do the model’s invisible “imagination tokens” actually help it reason about images?
SwimBird is a multimodal AI that can switch between three ways of thinking: in text only, in vision only (using hidden, picture-like thoughts), or in a mix of both.