The paper teaches multimodal large language models (MLLMs) to stop guessing from just text or just images and instead check both together before answering.
Chain-of-Thought (CoT) makes AI think step by step, but it is slow because it writes many tokens one by one.
Render-of-Thought (RoT) turns the model's step-by-step thinking from long text into slim images so the model can think faster with fewer tokens.