Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Intermediate | Yuanda Xu, Hejian Sang et al. | Feb 24 | arXiv
The paper shows that when training reasoning AIs with reinforcement learning, treating every wrong answer the same makes the AI overconfident in some bad paths and less diverse overall.
#ACE #Reinforcement Learning with Verifiable Rewards #GRPO