The paper asks a simple question: what must a vision modelβs internal pictures (embeddings) look like if it can recognize new mixes of things it already knows?
Training big language models with reinforcement learning can wobble because the per-token importance-sampling (IS) ratios swing wildly.
The paper studies a simple way to train giant language models with reinforcement learning by replacing a hard-to-compute term (the log-partition function) with something easy: the mean reward.
SLIME is a new way to train chatbots so they follow human preferences without forgetting how to write well.