Likelihood-Based Reward Designs for General LLM Reasoning
Beginner · Ariel Kwiatkowski, Natasha Butt et al. · Feb 3 · arXiv
Binary right/wrong rewards for training reasoning in large language models are hard to design and often too sparse to learn from.
#log-likelihood reward #chain-of-thought (CoT) #reinforcement learning for LLMs