PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary
Intermediate · Jiarui Yao, Ruida Wang et al. · Jan 15 · arXiv
Large language models usually get only a final thumbs-up or thumbs-down at the end of their answer, which is too late to fix mistakes made in the middle.
#Process Reward Learning #PRL #Reasoning LLMs