Papers (4)

#Reasoning LLMs

Efficient RLVR Training via Weighted Mutual Information Data Selection

Intermediate

Xinyu Zhou, Boyu Zhu et al. · Mar 2 · arXiv

Reinforcement learning (RL) trains language models by letting them try answers and learn from rewards, but training is slow if we pick the wrong practice questions.

#Reinforcement Learning #RLVR #Data Selection

CoBA-RL: Capability-Oriented Budget Allocation for Reinforcement Learning in LLMs

Intermediate

Zhiyuan Yao, Yi-Kai Zhang et al. · Feb 3 · arXiv

Large language models learn better when we spend more practice time on the right questions at the right moments.

#Reinforcement Learning #RLVR #GRPO

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Intermediate

Dylan Zhang, Yufeng Xu et al. · Feb 1 · arXiv

The paper shows that a model that looks great after supervised fine-tuning (SFT) can end up worse after the same reinforcement learning (RL) training than a model that looked weaker after SFT.

#Supervised Fine-Tuning #Reinforcement Learning #Distribution Mismatch

PRL: Process Reward Learning Improves LLMs' Reasoning Ability and Broadens the Reasoning Boundary

Intermediate

Jiarui Yao, Ruida Wang et al. · Jan 15 · arXiv

Large language models usually get only a final thumbs-up or thumbs-down at the end of their answer, which is too late to fix mistakes made in the middle.

#Process Reward Learning #PRL #Reasoning LLMs