The paper shows that when reasoning models are trained with reinforcement learning, giving every wrong answer the same penalty makes the model overconfident in certain bad reasoning paths and reduces the overall diversity of its outputs.
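To make the uniform-penalty idea concrete, here is a minimal illustrative sketch (not the paper's code; the reward functions and the string-matching "partial credit" rule are invented for illustration). With a binary reward, a near-miss and a nonsensical answer receive identical penalties, so the training signal cannot tell them apart; a graded reward would distinguish them.

```python
def binary_reward(answer: str, target: str) -> float:
    # Uniform scheme: every wrong answer gets the same penalty.
    return 1.0 if answer == target else -1.0

def graded_reward(answer: str, target: str) -> float:
    # Hypothetical alternative: partial credit based on matching
    # characters, so "close" mistakes are penalized less harshly.
    if answer == target:
        return 1.0
    shared = sum(1 for a, b in zip(answer, target) if a == b)
    return -1.0 + shared / max(len(target), 1)

answers = ["42", "41", "banana"]
print([binary_reward(a, "42") for a in answers])   # near-miss and nonsense tied
print([graded_reward(a, "42") for a in answers])   # near-miss penalized less
```

Under the binary scheme the model gets no signal about which wrong paths were almost right, which is one way the uniform penalty can flatten useful distinctions during training.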
Large language models are usually trained to excel at one kind of reasoning, but real-world use requires them to be good at many skills at once.
Chain-of-Thought (CoT) prompting makes a model reason step by step, but it is slow because the model must generate its many intermediate tokens one at a time.
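The token-by-token cost can be sketched with a toy decoding loop (purely illustrative; `fake_forward_pass` is a stand-in for a real model call). Because autoregressive decoding makes one model call per generated token, a long chain of thought costs proportionally more forward passes than a short direct answer.

```python
def fake_forward_pass(context: list[str]) -> str:
    # Stand-in for an expensive model call; returns a dummy next token.
    return f"step{len(context)}"

def decode(prompt: list[str], num_tokens: int) -> tuple[list[str], int]:
    tokens = list(prompt)
    passes = 0
    for _ in range(num_tokens):  # one forward pass per new token
        tokens.append(fake_forward_pass(tokens))
        passes += 1
    return tokens, passes

_, short_cost = decode(["q"], 5)    # short answer: 5 forward passes
_, long_cost = decode(["q"], 200)   # long chain of thought: 200 passes
print(short_cost, long_cost)
```

This is why CoT trades latency for accuracy: the extra reasoning tokens each require their own sequential model call and cannot be generated in parallel.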