The paper teaches multimodal large language models (MLLMs) to stop guessing from just text or just images and instead check both together before answering.
Youtu-VL is a new kind of vision-language model that learns to predict both words and tiny image pieces, not just words.
VisGym is a playground of 17 very different visual tasks for testing and training vision-language models (AI models that see and talk) to act over many steps.
CPPO is a new way to fine-tune vision-language models so they see pictures more accurately before they start to reason.