Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training
Intermediate · Zhenghao Xu, Qin Lu et al. · Feb 5 · arXiv
The paper studies a simple approach to post-training large language models with reinforcement learning: replacing a hard-to-compute term (the log-partition function) with an easy one, the mean reward.
#Policy Mirror Descent #KL regularization #chi-squared regularization
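
To make the substitution described in the summary concrete, here is a minimal sketch comparing a sampled estimate of the log-partition function with the mean-reward surrogate. It assumes the standard KL-regularized form log Z = log E[exp(r / tau)] over responses sampled from the current policy; the temperature `tau`, the function names, and the toy rewards are illustrative assumptions, not taken from the paper, whose exact formulation may differ.

```python
import numpy as np


def log_partition_estimate(rewards, tau):
    """Monte Carlo estimate of log E[exp(r / tau)] from sampled rewards.

    Subtracts the max before exponentiating for numerical stability.
    """
    z = np.asarray(rewards, dtype=float) / tau
    m = z.max()
    return float(m + np.log(np.mean(np.exp(z - m))))


def mean_reward_surrogate(rewards, tau):
    """The cheap stand-in: mean reward divided by tau (assumed form)."""
    return float(np.mean(rewards)) / tau


# Hypothetical rewards for four sampled responses to a single prompt.
rewards = np.array([0.0, 1.0, 0.0, 1.0])
tau = 1.0

exact = log_partition_estimate(rewards, tau)
approx = mean_reward_surrogate(rewards, tau)
gap = exact - approx  # non-negative by Jensen's inequality

print(f"log-partition estimate: {exact:.4f}")
print(f"mean-reward surrogate:  {approx:.4f}")
print(f"gap:                    {gap:.4f}")
```

By Jensen's inequality the surrogate never exceeds the log-partition value, and the gap grows as the rewards spread out relative to `tau`. This sketch only illustrates that gap; how it translates into the implicit regularization named in the title is the paper's contribution and is not reproduced here.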