The paper studies a simple way to train large language models with reinforcement learning: it replaces a hard-to-compute term, the log-partition function, with an easy-to-compute one, the mean reward.
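To give a feel for why the mean reward can stand in for a log-partition term, here is a minimal numerical sketch. It assumes the standard KL-regularized setup, in which the log-partition function is the log of the average of exp(reward / beta) over sampled responses; whether the paper uses exactly this form is an assumption. The reward values, the temperature name `beta`, and both function names are hypothetical and chosen only for illustration.

```python
import math

def scaled_log_partition(rewards, beta):
    """Return beta * log(mean(exp(r / beta))) over the sampled rewards.

    The maximum is subtracted before exponentiating to avoid overflow.
    """
    scaled = [r / beta for r in rewards]
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return beta * log_mean_exp

def mean_reward(rewards):
    """Plain average of the sampled rewards: the cheap replacement."""
    return sum(rewards) / len(rewards)

# Made-up rewards for four sampled responses to one prompt (illustrative only).
rewards = [0.2, 0.9, 0.4, 0.7]

for beta in (0.1, 1.0, 10.0):
    print(beta, scaled_log_partition(rewards, beta), mean_reward(rewards))
```

In this setup the scaled log-partition value is never below the mean reward (by Jensen's inequality), and the gap shrinks as `beta` grows, so the substitution is closest when the regularization is strong.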
SLIME is a new way to train chatbots so they follow human preferences without forgetting how to write well.