This paper teaches AI to name things in pictures very specifically (like “golden retriever” instead of just “dog”) without making more mistakes.
This paper introduces HACRL, a way for different kinds of AI agents to learn together during training but still work alone during use.
DeepVision-103K is a new 103,000-example picture-and-text math dataset designed to help AI think better, using rewards that can be checked automatically.
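For context, a reward "that can be checked automatically" usually means a program compares the model's answer to the gold answer. Below is a minimal sketch of such a checker using sympy's symbolic equality; the function name and behavior are illustrative, not DeepVision-103K's actual grader.

```python
from sympy import sympify

def verifiable_reward(model_answer: str, gold_answer: str) -> float:
    """Return 1.0 if the two answers are mathematically equal, else 0.0."""
    try:
        equal = sympify(model_answer).equals(sympify(gold_answer))
    except Exception:
        return 0.0  # an unparseable answer earns no reward
    return 1.0 if equal else 0.0  # .equals may return True, False, or None

print(verifiable_reward("1/2", "0.5"))    # 1.0: symbolically equal
print(verifiable_reward("sqrt(4)", "2"))  # 1.0
print(verifiable_reward("0.49", "0.5"))   # 0.0
```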
This paper teaches a computer to find buttons, text, and icons on screens so it can click and type in the right places, a skill called GUI grounding.
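Concretely, GUI grounding maps an instruction plus a screenshot to a screen location the agent can act on. Here is a minimal sketch of that input/output contract; the normalized-coordinate convention and the `ground` stub are assumptions for illustration, not this paper's API.

```python
from dataclasses import dataclass

@dataclass
class ClickTarget:
    label: str  # which element was found, e.g. "Submit button"
    x: float    # horizontal position, normalized to [0, 1]
    y: float    # vertical position, normalized to [0, 1]

def ground(instruction: str, screenshot) -> ClickTarget:
    # Stand-in for the trained grounding model.
    return ClickTarget(label="Submit button", x=0.82, y=0.91)

target = ground("Click the blue Submit button", screenshot=None)
print(f"agent clicks at ({target.x:.2f}, {target.y:.2f})")  # then types, scrolls, etc.
```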
TRIT is a new training method that teaches AI to translate and think at the same time so it can solve hard problems in many languages without extra helper models.
The paper discovers that popular RLVR methods for training language and vision-language models secretly prefer certain answer lengths, which can hurt learning.
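One way such a length preference can arise: popular objectives average each response's token losses over its length, so the per-token update scales with advantage divided by length. The toy numbers below illustrate that effect; whether this is the exact bias the paper identifies is an assumption.

```python
def per_token_update(advantage: float, length: int) -> float:
    # With a loss of the form (1/length) * sum over tokens of advantage * logprob,
    # every token's gradient is scaled by advantage / length.
    return advantage / length

# Correct answers (advantage +1): short responses get the stronger push.
print(per_token_update(+1.0, 10), per_token_update(+1.0, 100))  # 0.1 vs 0.01
# Wrong answers (advantage -1): long responses are penalized less per token.
print(per_token_update(-1.0, 10), per_token_update(-1.0, 100))  # -0.1 vs -0.01
```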
When training smart language models with RL that uses right-or-wrong rewards, learning can stall on 'saturated' problems that the model almost always solves.
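To see why learning stalls, consider GRPO-style training, where each problem's rewards are normalized within a group of sampled rollouts: if the model solves (or fails) every rollout, every advantage is zero and the gradient vanishes. A minimal sketch, assuming a standard group-normalized advantage; the paper's exact algorithm may differ.

```python
import numpy as np

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-normalized advantages, as in GRPO-style RL with 0/1 rewards."""
    std = rewards.std()
    if std == 0:
        # Saturated (all-correct) or all-wrong problem: no learning signal at all.
        return np.zeros_like(rewards, dtype=float)
    return (rewards - rewards.mean()) / std

print(group_advantages(np.array([1, 1, 1, 1, 1, 1, 1, 1])))  # all zeros: stalled
print(group_advantages(np.array([1, 0, 0, 1, 0, 0, 0, 0])))  # mixed: useful signal
```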
JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.
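Read operationally, "judge before generate" means a verifier scores candidate answers and only the survivors inform generation. The rejection-style sketch below is one assumed reading of that idea; the judge here is a trivial stand-in for JudgeRLVR's trained model.

```python
def strict_judge(question: str, candidate: str) -> bool:
    # Stand-in for the trained judge; here, an exact-match check.
    return candidate.strip() == "4"

candidates = ["3", "4", "four, I think", "4"]
survivors = [c for c in candidates if strict_judge("What is 2 + 2?", c)]
print(survivors)  # ['4', '4'] -- bad ideas trimmed before the generator sees them
```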
Real instructions often contain logic like “and,” “first-then,” and “if-else,” and this paper teaches models to notice and obey that logic.
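The three kinds of logic named above can be made precise as checkable predicates over a response. A minimal sketch follows; these combinators are illustrative, not the paper's actual constraint taxonomy.

```python
from typing import Callable

Constraint = Callable[[str], bool]  # a predicate over the model's response

def both(c1: Constraint, c2: Constraint) -> Constraint:  # "do X and Y"
    return lambda resp: c1(resp) and c2(resp)

def first_then(a: str, b: str) -> Constraint:  # "first A, then B"
    def check(resp: str) -> bool:
        i, j = resp.find(a), resp.find(b)
        return i != -1 and j != -1 and i < j  # both present, in order
    return check

def if_else(cond: Constraint, then_c: Constraint, else_c: Constraint) -> Constraint:
    return lambda resp: then_c(resp) if cond(resp) else else_c(resp)

rule = both(first_then("ingredients", "steps"),
            if_else(lambda r: "vegan" in r,
                    lambda r: "butter" not in r,  # vegan recipes skip butter
                    lambda r: True))
print(rule("vegan recipe: ingredients ... steps ... olive oil"))  # True
print(rule("vegan recipe: steps ... ingredients ... butter"))     # False
```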
The paper asks a simple question: do video AIs really need to “think out loud” every time, or can they answer quickly most of the time and think deeply only when needed?
Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show.
AdaTooler-V teaches an image-and-video AI to first ask, “Do I really need a tool?” before using one, which saves time and boosts accuracy.
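The decide-then-act loop can be stated in a few lines. In this sketch the gate is a keyword heuristic standing in for AdaTooler-V's trained decision, and the tool is a placeholder; none of these names come from the paper.

```python
def needs_tool(question: str) -> bool:
    # Stand-in for the learned "Do I really need a tool?" decision.
    return any(k in question.lower() for k in ("read the text", "count", "zoom in"))

def run_tool(image_or_video) -> str:
    return "OCR: 'EXIT 12'"  # placeholder for, e.g., crop-and-zoom or frame search

def answer(question: str, image_or_video) -> str:
    if needs_tool(question):                 # gate first...
        evidence = run_tool(image_or_video)  # ...pay the tool cost only when useful
        return f"answer grounded in [{evidence}]"
    return "direct answer (no tool call)"    # fast path for easy questions

print(answer("What color is the car?", None))
print(answer("Read the text on the sign.", None))
```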