The paper discovers that popular RLVR methods for training language and vision-language models secretly prefer certain answer lengths, which can hurt learning.
Large language models learn better when we spend more practice time on the right questions at the right moments.
This paper teaches a model to make its own helpful hints (sub-questions) and then use those hints to learn better with reinforcement learning that checks answers automatically.
The paper asks a simple question: if a language model becomes better at step-by-step reasoning (using RLVR), do its text embeddings also get better? The short answer is no.
When training smart language models with RL that uses right-or-wrong rewards, learning can stall on 'saturated' problems that the model almost always solves.
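To see why saturation stalls learning, here is a minimal sketch (my illustration, not code from the paper) assuming a GRPO-style group-normalized advantage over binary rewards: if every rollout for a prompt gets the same reward, the advantages are all zero and the prompt contributes no gradient.

```python
# Minimal sketch (assumption: GRPO-style group-normalized advantages,
# binary right/wrong rewards). A "saturated" prompt whose rollouts are
# all correct yields zero advantage, hence no learning signal.
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

saturated = [1, 1, 1, 1]      # model always solves this prompt
informative = [1, 0, 1, 0]    # mixed outcomes still carry signal

print(group_advantages(saturated))    # ~[0, 0, 0, 0] -> learning stalls
print(group_advantages(informative))  # nonzero advantages -> useful updates
```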
This paper says that to make math-solving AIs smarter, we should train them more on the hardest questions they can almost solve.
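One simple way to operationalize "hardest questions they can almost solve" is to estimate each question's pass rate from a few sampled attempts and keep the ones with a low but nonzero rate. The sketch below is an illustration of that idea, not the paper's algorithm; the function names and the pass-rate band are my assumptions.

```python
# Illustrative sketch (not the paper's method): keep "frontier" questions
# that the model solves sometimes but not reliably.
from typing import Callable, Dict, List

def estimate_pass_rate(question: str,
                       solve: Callable[[str], bool],
                       n_rollouts: int = 8) -> float:
    """Fraction of sampled attempts that an automatic checker marks correct."""
    return sum(solve(question) for _ in range(n_rollouts)) / n_rollouts

def select_frontier(questions: List[str],
                    solve: Callable[[str], bool],
                    low: float = 0.05, high: float = 0.4) -> List[str]:
    """Keep questions with low but nonzero pass rates; drop the rest."""
    rates: Dict[str, float] = {q: estimate_pass_rate(q, solve) for q in questions}
    return [q for q, p in rates.items() if low <= p <= high]

# `solve` would wrap one policy sample plus a verifier; any callable
# returning True/False per attempt works for this sketch.
```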
This paper teaches a model to be its own teacher so it can climb out of a learning plateau on very hard math problems.
This paper teaches a language-model agent to look up facts in millions of scientific paper summaries and answer clear, single-answer questions.
The paper shows that when an LLM is trained with spurious (misleading) rewards in RLVR, it can score higher by memorizing answers instead of reasoning.
The paper introduces Multiplex Thinking, a new way for AI to think by sampling several likely next words at once and blending them into a single super-token.
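A rough sketch of the blending idea as described, assuming the "super-token" is a probability-weighted mixture of the top-k candidate tokens' embeddings; the actual method in the paper may differ.

```python
# Minimal sketch (assumption: top-k candidate tokens are mixed by their
# renormalized probabilities in embedding space). Uses PyTorch.
import torch
import torch.nn.functional as F

def multiplex_step(logits: torch.Tensor,
                   embedding: torch.nn.Embedding,
                   k: int = 4) -> torch.Tensor:
    """Blend the top-k candidate next tokens into one 'super-token' embedding."""
    probs = F.softmax(logits, dim=-1)                 # (vocab,)
    top_p, top_ids = probs.topk(k)                    # k most likely candidates
    weights = top_p / top_p.sum()                     # renormalize over the top-k
    top_emb = embedding(top_ids)                      # (k, d_model)
    return (weights.unsqueeze(-1) * top_emb).sum(0)   # (d_model,) mixed embedding

# The mixed vector would be fed back as the next-step input in place of a
# single sampled token's embedding.
```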
JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.
Real instructions often have logic like 'and', 'first-then', and 'if-else', and this paper teaches models to notice and obey that logic.