Thinking with Images via Self-Calling Agent
Intermediate · Wenxi Yang, Yuzhong Zhao et al. · Dec 9 · arXiv
This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.
#Self-Calling Chain-of-Thought #sCoT #interleaved multimodal chain-of-thought