This paper teaches multimodal AI models not just to read pictures, but also to imagine and think with pictures in their heads.
This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'