Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
Intermediate · Minwu Kim, Safal Shrestha et al. · Jan 28 · arXiv
When reasoning language models are trained with reinforcement learning on right-or-wrong (verifiable) rewards, learning can stall on 'saturated' problems, meaning problems the model already solves almost every time.
#failure-prefix conditioning #RLVR #GRPO
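
To make the stall on saturated problems concrete, here is a minimal sketch of the group-relative advantage commonly used in GRPO, where each sampled answer's reward is normalized against the other answers for the same problem. It is an illustration under assumptions, not the authors' training code: the function name `group_relative_advantages`, the group size of 8, and the `eps` value are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: (reward - group mean) / (group std + eps).

    `rewards` holds the 0/1 correctness rewards for one group of sampled
    answers to the same problem. `eps` is an assumed stabilizer that avoids
    division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Saturated problem: all 8 sampled answers are correct.
saturated = group_relative_advantages([1.0] * 8)

# Nearly saturated problem: 7 correct answers and 1 failure.
mixed = group_relative_advantages([1.0] * 7 + [0.0])

print(saturated)  # every advantage is 0.0, so there is no gradient signal
print(mixed)      # the single failure gets a negative advantage
```

When every answer in a group is correct, each reward equals the group mean, so all advantages are zero and that problem contributes nothing to the policy update. A group only becomes informative once it contains at least one failure, which is why rare failures on saturated problems matter for training.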