Reinforcement Learning via Self-Distillation
Intermediate · Jonas Hübotter, Frederike Lübeck et al. · Jan 28 · arXiv
The paper teaches large language models to learn from detailed feedback (like error messages) instead of only a simple pass/fail score.
#Self-Distillation #Reinforcement Learning with Rich Feedback #SDPO