The paper studies a simple way to train giant language models with reinforcement learning by replacing a hard-to-compute term (the log-partition function) with something easy: the mean reward.
This paper teaches AI teams to get better by scoring every move they make, not just the final answer.
Diffusion language models (dLLMs) can write text in any order, but common decoding methods still prefer left-to-right, which wastes their superpower.
This paper shows a new way to help AI think through long problems faster by turning earlier text steps into small pictures the AI can reread.
JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.