JustRL: Scaling a 1.5B LLM with a Simple RL Recipe
Intermediate | Bingxiang He, Zekai Qu et al. | Dec 18 | arXiv
JustRL shows that a simple, stable reinforcement learning (RL) recipe can make a 1.5B-parameter language model much better at math without elaborate tricks.
#Reinforcement Learning #GRPO #Policy Entropy