The paper asks a simple question: do the model’s invisible “imagination tokens” actually help it reason about images?
This paper builds a new test, called MURGAT, to check whether AI models can back up each small fact they say with the right part of a video, audio, or figure.
This paper teaches talking avatars not just to speak, but to look around their scene and handle nearby objects exactly as a text instruction says.