The paper asks a simple question: do the model’s invisible “imagination tokens” actually help it reason about images?
Searching through videos, images, and long documents is powerful but gets very expensive when every tiny piece is stored separately.
LatentLens is a simple, training-free way to translate what a model “sees” in image patches into clear words and phrases.
IVRA is a simple, training-free add-on that helps robot brains keep the 2D shape of images while following language instructions.