This paper shows that code-writing AI agents can take an existing math problem and automatically turn it into a new, harder one while keeping it solvable.
VESPO is a new, stable way to train language models with reinforcement learning even when training data comes from older or mismatched policies.
The paper introduces LT-Tuning, a way for AI models to “think silently” using special hidden tokens instead of writing every step out loud.
Big language models can get stuck after fine-tuning because they become too sure of themselves, so normal training stops helping.
The paper shows how to make AI think faster and smarter by planning in a hidden space instead of writing long step-by-step sentences.
Nemotron-Math is a giant math dataset with 7.5 million step-by-step solutions, generated in three thinking styles, both with and without Python help.
Reinforcement learning (RL) can make big language models smarter, but off-policy training often pushes updates too far from the “safe zone,” causing unstable learning.
ARBITRAGE makes AI solve step-by-step problems faster by using the big, slow model only when it is predicted to truly help.