The paper shows that when training reasoning AIs with reinforcement learning, treating every wrong answer the same makes the AI overconfident in some bad paths and less diverse overall.
The paper shows that the popular PPO method for training language models is unfair to rare words and too gentle with very common words, which makes learning slow and unstable.
Chain-of-Thought (CoT) makes AI think step by step, but it is slow because it writes many tokens one by one.