The paper shows that when a model compares two of its own answers head-to-head, it picks the right one more often than when it judges each answer alone.
The paper fixes a big flaw in test-time reinforcement learning (TTRL): when many wrong answers agree, the model rewards the mistake and gets stuck.
Multi-agent systems are like teams of smart helpers, but one bad message can mislead the whole team.
SkillOrchestra is a new way to make teams of AI models and tools work together by thinking in terms of skills, not just picking one big model for everything.
Training big language models with reinforcement learning can become unstable because the per-token importance-sampling (IS) ratios swing wildly.
The paper studies a simple way to train giant language models with reinforcement learning by replacing a hard-to-compute term (the log-partition function) with something easy: the mean reward.
When you tune the learning rate carefully, plain old LoRA fine-tuning works about as well as the fancy new LoRA variants.
This paper teaches teams of AI agents to get better by scoring every move they make, not just the final answer.
Diffusion language models (dLLMs) can write text in any order, but common decoding methods still prefer left-to-right, which wastes their superpower.
This paper shows a new way to help AI think through long problems faster by turning earlier text steps into small pictures the AI can reread.
JudgeRLVR teaches a model to be a strict judge of answers before it learns to generate them, which trims bad ideas early.
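
To make the TTRL failure mode mentioned above concrete (many wrong answers agreeing), here is a minimal sketch. It assumes the reward comes from a simple majority vote over the model's own sampled answers. The function name `majority_vote_rewards` and the sample answers are hypothetical, and the sketch illustrates only the flaw, not the paper's fix.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Reward each sampled answer with 1.0 if it matches the most common answer.

    Assumption: the pseudo-label is just the plurality answer among the samples;
    no ground-truth label is available at test time.
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Hypothetical samples for a question whose true answer is "12".
samples = ["15", "15", "12", "15", "9"]
label, rewards = majority_vote_rewards(samples)
print(label, rewards)  # 15 [1.0, 1.0, 0.0, 1.0, 0.0]
```

In this toy case the three matching wrong answers win the vote, so the mistake gets the reward and the one correct sample ("12") gets none.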