DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
Intermediate · Zhongwei Wan, Yun Shen et al. · Feb 23 · arXiv
LLMs trained with simple rewards often latch onto just a few ways of solving problems and stop exploring, which hurts their ability to find other correct answers.
#DSDR #dual-scale diversity #RLVR