The paper shows how to make AI think faster and smarter by planning in a hidden space instead of writing long step-by-step sentences.
Nemotron-Math is a giant math dataset with 7.5 million step-by-step solutions created in three thinking styles and with or without Python help.
Reinforcement learning (RL) can make big language models smarter, but off-policy training often pushes updates too far from the "safe zone," causing unstable learning.
ARBITRAGE makes AI solve step-by-step problems faster by only using the big, slow model when it is predicted to truly help.