Papers2

#Clipping

PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary

Jiarui Yao, Ruida Wang et al.Jan 15arXiv

Large language models usually get only a final thumbs-up or thumbs-down at the end of their answer, which is too late to fix mistakes made in the middle.

#Process Reward Learning#PRL#Reasoning LLMs

Not triaged yet

AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search

Intermediate

Zefang Zong, Dingwei Chen et al.Jan 8arXiv

AT2PO is a new way to train AI agents that work in several turns, like asking the web a question, reading the result, and trying again.

#Agentic Reinforcement Learning#Turn-level Optimization#Tree Search

Not triaged yet