Large language models are usually trained to become good at one kind of reasoning, but real-world use needs them to handle many kinds of tasks at once.
Group-based reinforcement learning for reasoning (like GRPO) uses the group's average reward as a baseline, but that makes its 'advantage' estimates biased.
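To make the baseline idea concrete, here is a minimal sketch of a GRPO-style advantage computation (an illustration of the general recipe, not this paper's exact formulation; the function name and normalization step are assumptions). Each answer's reward is compared to the average reward of the group of answers sampled for the same prompt; one common source of the bias the summary mentions is that each answer's own reward is included in the group average it is compared against.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Sketch of a GRPO-style advantage: compare each sampled answer's
    reward to the mean (and std) of the rewards in its own group,
    i.e. all answers sampled for the same prompt."""
    rewards = np.asarray(rewards, dtype=np.float64)
    baseline = rewards.mean()                   # group average used as the baseline
    advantages = rewards - baseline             # each reward is part of its own baseline
    return advantages / (rewards.std() + eps)   # common normalization step

# Example: four answers to one prompt, scored 1 (correct) or 0 (wrong).
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
```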
JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.
RubricHub is a large collection of about 110,000 detailed grading guides (rubrics) covering many kinds of questions, such as health, science, writing, and chat.
This paper asks which small, add-on training methods (PEFT) work best when we teach language models with yes/no rewards that can be checked automatically (RLVR).
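To show what a "small, add-on training method" looks like in practice, here is a minimal sketch of one popular PEFT method (LoRA) set up with the Hugging Face peft library; the base model, rank, and target modules below are illustrative assumptions, not the configurations the paper actually compares.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

lora_cfg = LoraConfig(
    r=8,                        # rank of the small add-on matrices
    lora_alpha=16,              # scaling factor for the adapters
    target_modules=["c_attn"],  # which layers get adapters (model-specific)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapters are trainable
```

During RLVR training, only these added matrices would be updated while the base model stays frozen, which is what makes such methods cheap to run and easy to compare.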
This paper asks how best to use expert step-by-step solutions (expert trajectories) when teaching big AI models to reason after pretraining.
This paper asks whether reinforcement learning (RL) can improve text-to-3D generation (making 3D models from text) and shows that the answer is yes, provided the training procedure and rewards are designed carefully.
Reinforcement learning (RL) can make big language models smarter, but off-policy training often pushes updates too far from the “safe zone,” causing unstable learning.
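One widely used way to keep updates inside that safe zone is to clip the importance ratio between the new policy and the policy that collected the data, as in PPO. The sketch below is a generic illustration of that clipping idea, not the specific method this paper proposes; the function name and example tensors are made up for the example.

```python
import torch

def clipped_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective: the importance ratio between the new
    policy and the old (data-collecting) policy is kept within
    [1 - eps, 1 + eps], so a single update cannot drift too far off-policy."""
    ratio = torch.exp(logp_new - logp_old)          # off-policy importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()  # maximize the pessimistic bound

# Example: per-token log-probs from the old and new policy, with advantages.
logp_old = torch.tensor([-1.2, -0.7, -2.0])
logp_new = torch.tensor([-0.9, -0.8, -1.5])
adv = torch.tensor([0.5, -0.3, 1.0])
print(clipped_policy_loss(logp_new, logp_old, adv))
```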