This paper finds that popular RLVR methods for training language and vision-language models carry a hidden bias toward certain answer lengths, which can hurt learning.
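One plausible route for such a bias (an illustrative assumption here, not the paper's specific finding): if the token-level policy-gradient loss is averaged over each response's own length, the per-token update strength depends on how long the answer is.

```python
# A minimal sketch of length bias under per-response length normalization.
# This mechanism is assumed for illustration, not taken from the paper.

def per_token_signal(advantage, length):
    """If the token loss is averaged over a response's own length,
    each token is pushed with strength advantage / length."""
    return advantage / length

# Two incorrect answers (advantage -1): the longer one feels a much
# weaker per-token penalty, so being wrong at length is "cheaper".
print(per_token_signal(-1.0, 20))   # -0.05
print(per_token_signal(-1.0, 500))  # -0.002

# Two correct answers (advantage +1): the shorter one gets the stronger
# per-token reinforcement, so correct answers drift shorter.
print(per_token_signal(+1.0, 20))   # 0.05
print(per_token_signal(+1.0, 500))  # 0.002
```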
This paper introduces Foundation-Sec-8B-Reasoning, a small (8 billion parameters) AI model trained to “think out loud” before answering cybersecurity questions.
When training language models with RL that uses right-or-wrong rewards, learning can stall on “saturated” problems the model almost always solves: every attempt earns the same reward, so there is nothing left to learn from.
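To see why saturation stalls learning, consider a GRPO-style group-normalized advantage (assumed here for illustration; the paper may formalize this differently): when every rollout in a group earns the same reward, all advantages are zero and the policy gradient vanishes.

```python
# A minimal sketch of why saturated problems give zero learning signal
# under group-normalized, right-or-wrong rewards (GRPO-style; assumed).
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Center each rollout's reward on the group mean and scale by the
    group standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Mixed outcomes: successes and failures produce nonzero advantages,
# so gradients flow.
print(group_advantages([1, 0, 1, 0]))  # ≈ [ 1., -1.,  1., -1.]

# Saturated problem: every rollout succeeds, all rewards are identical,
# every advantage is zero, and the update is zero.
print(group_advantages([1, 1, 1, 1]))  # [0., 0., 0., 0.]
```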
Diffusion language models can write tokens in any order, but that freedom can accidentally hurt their ability to reason well.
STEP3-VL-10B is a small (10 billion parameters) open multimodal model that sees images and reads text, yet scores on par with much larger models.
JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.
Real instructions often contain logic such as “and,” “first-then,” and “if-else,” and this paper teaches models to notice and obey that logic.
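To make the idea concrete, here is a hypothetical checker (constraints invented for illustration, not drawn from the paper) for the three kinds of instruction logic named above:

```python
# Hypothetical constraints illustrating "and", "first-then", and
# "if-else" logic inside a single instruction.

def satisfies(response: str) -> bool:
    # "and": both required topics must appear.
    has_both = "budget" in response and "timeline" in response
    # "first-then": the summary must come before the recommendation.
    i, j = response.find("Summary:"), response.find("Recommendation:")
    ordered = i != -1 and j != -1 and i < j
    # "if-else": mentioning a risk obliges a mitigation; otherwise free.
    conditional = ("risk" not in response) or ("mitigation" in response)
    return has_both and ordered and conditional

good = "Summary: budget and timeline are tight. Recommendation: delay. risk noted; mitigation: add buffer."
bad = "Recommendation: delay now. Summary: budget fine."
print(satisfies(good), satisfies(bad))  # True False
```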
The paper asks a simple question: do video AIs really need to “think out loud” every time, or can they answer quickly most of the time and think deeply only when needed?
Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show.
AdaTooler-V teaches an image-and-video AI to first ask, “Do I really need a tool?” before using one, which saves time and boosts accuracy.
This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.