Papers (2)

#LLM Post-Training

Revisiting Parameter Server in LLM Post-Training

Large language model (LLM) post-training has uneven work per GPU because some text sequences are much longer than others.

#On-Demand Communication #Fully Sharded Data Parallel #Parameter Server


Your Group-Relative Advantage Is Biased

Intermediate

Fengkai Yang, Zherui Chen et al. · Jan 13 · arXiv

Group-based reinforcement learning for reasoning (like GRPO) uses the group's average reward as a baseline, but that makes its 'advantage' estimates biased.

#Reinforcement Learning from Verifier Rewards #GRPO #GSPO

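For readers unfamiliar with the group-mean baseline mentioned in the summary above, here is a minimal sketch of how a group-relative advantage is typically computed. It is an illustration only, not the paper's code: the function name `group_relative_advantages`, the `normalize` flag, and the sample rewards are all hypothetical. The optional division by the group's standard deviation reflects the commonly described GRPO formulation and is an assumption here, since the summary mentions only the mean baseline.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=False, eps=1e-8):
    """Center each reward on the mean reward of its own group.

    rewards:   rewards for several sampled responses to the same prompt.
    normalize: if True, also divide by the group's standard deviation
               (an assumption; the summary only mentions the mean baseline).
    """
    baseline = mean(rewards)  # the group's average reward is the baseline
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    scale = pstdev(rewards) + eps  # eps guards against a zero-variance group
    return [c / scale for c in centered]


# Hypothetical group of four responses scored 1 (correct) or 0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
normalized = group_relative_advantages(rewards, normalize=True)
```

Because the baseline is computed from the same samples whose advantages it defines, each advantage depends on the other rewards in its group. The sketch shows only the computation and does not reproduce the paper's bias analysis.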