The paper introduces Multiplex Thinking, a new way for AI to think by sampling several likely next words at once and blending them into a single super-token.
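The summary suggests a simple sketch of that blending step: take the top-k candidate next tokens, weight their embeddings by their probabilities, and feed the mix forward as one "super-token". The function name, the use of an embedding table, and the choice of k below are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of the "super-token" idea, assuming the blend is a
# probability-weighted average of the top-k candidate token embeddings.
import numpy as np

def multiplex_step(logits: np.ndarray, embedding_table: np.ndarray, k: int = 4) -> np.ndarray:
    """Blend the k likeliest next tokens into one mixed embedding."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-k:]                 # indices of the k most probable tokens
    weights = probs[top] / probs[top].sum()      # renormalize over those k candidates
    return weights @ embedding_table[top]        # probability-weighted blend = "super-token"
```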
The paper fixes a common problem in training AI reasoners: models get stuck using the same favorite solution style and stop exploring new ways to solve problems.
Group-based reinforcement learning for reasoning (like GRPO) uses the group's average reward as a baseline, but that makes its 'advantage' estimates biased.
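A tiny worked example shows where the bias comes from: because each rollout's own reward sits inside the group-mean baseline, every advantage is shrunk by a factor of (1 - 1/G). The numbers are made up, and the leave-one-out comparison is a standard diagnostic rather than necessarily the paper's proposed fix.

```python
# Toy illustration of the group-mean baseline bias (illustrative numbers).
import numpy as np

rewards = np.array([1.0, 0.0, 0.0, 0.0])               # one correct rollout in a group of G=4
naive_adv = rewards - rewards.mean()                   # baseline includes each rollout's own reward
loo_mean = (rewards.sum() - rewards) / (len(rewards) - 1)
loo_adv = rewards - loo_mean                           # leave-one-out baseline excludes it

print(naive_adv)  # [ 0.75 -0.25 -0.25 -0.25]
print(loo_adv)    # [ 1.   -0.333... ]  -> naive advantages are uniformly shrunk by (1 - 1/G)
```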
Dr. Zero is a pair of AI agents (a Proposer and a Solver) that teach each other to do web-search-based reasoning without any human-written training data.
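One common self-play recipe for such a pair looks like the loop below: the Proposer invents a question, the Solver attempts it with a search tool, and the Proposer is rewarded for questions that are hard but not impossible. This reward shaping and all the method names are assumptions for illustration, not a claim about Dr. Zero's exact objective.

```python
# Highly simplified proposer/solver self-play round (hypothetical interfaces).
def self_play_round(proposer, solver, search_tool, n_attempts: int = 4):
    question, reference = proposer.generate()                  # no human-written data involved
    successes = sum(
        solver.answer(question, search_tool) == reference
        for _ in range(n_attempts)
    )
    solver_reward = successes / n_attempts                     # solver learns from its hit rate
    proposer_reward = 1.0 if 0 < successes < n_attempts else 0.0   # prefer "just hard enough" questions
    return solver_reward, proposer_reward
```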
X-Coder shows that models can learn expert-level competitive programming using data that is 100% synthetic—no real contest problems needed.
Real instructions often have logic in them, like AND, first-then, and if-else, and this paper teaches models to notice and obey that logic.
The paper fixes a big problem in training web-searching AI: rewarding only the final answer makes agents cut corners and sometimes hallucinate.
Preference tuning teaches language models to act the way people like, but those habits can fall apart when the topic or style changes (domain shift).
When a model learns from many rewards at once, a popular method called GRPO can accidentally squash different reward mixes into the same learning signal, which confuses training.
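To see the squashing concretely: if per-component rewards are summed into one scalar before the group comparison, two rollouts with opposite strengths can end up with identical advantages, so the policy gets no signal about which mix to prefer. The two-component setup below is a made-up illustration, not the paper's reward design.

```python
# Toy example of different reward mixes collapsing to the same learning signal.
import numpy as np

rollout_rewards = np.array([
    [1.0, 0.0],   # rollout A: correct answer, bad format
    [0.0, 1.0],   # rollout B: wrong answer, good format
])
scalar = rollout_rewards.sum(axis=1)           # both collapse to 1.0
advantage = scalar - scalar.mean()             # [0., 0.]: no preference between A and B
print(advantage)
```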
RelayLLM lets a small model do the talking and only asks a big model for help on a few truly hard tokens.
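One plausible reading of that relay is token-level routing on the small model's confidence, as in the sketch below; the threshold rule and the `small_step`/`big_step` callables are assumptions for illustration, not the paper's interface.

```python
# Sketch of confidence-based token relay between a small and a big model.
from typing import Callable, List, Tuple

def relay_decode(prompt: List[int],
                 small_step: Callable[[List[int]], Tuple[int, float]],
                 big_step: Callable[[List[int]], int],
                 threshold: float = 0.5,
                 max_new_tokens: int = 128) -> List[int]:
    """Let the small model decode, escalating single tokens it is unsure about."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token, confidence = small_step(tokens)   # cheap model proposes the next token
        if confidence < threshold:               # only the genuinely hard tokens escalate
            token = big_step(tokens)             # expensive model decides this one token
        tokens.append(token)
    return tokens
```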
SmartSearch teaches search agents to fix their own bad search queries while they are thinking, not just their final answers.
AgentOCR turns an agent’s long text history into pictures so it can remember more using fewer tokens.
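A minimal sketch of that rendering step, assuming the history is simply drawn as plain text onto an image that a vision-capable model later reads back; the layout constants and the Pillow-based drawing are my choices, not the paper's pipeline.

```python
# Render a long text history onto a single image (rough sketch, assumes Pillow).
from PIL import Image, ImageDraw
import textwrap

def render_history(history: str, width: int = 1024, line_chars: int = 120) -> Image.Image:
    lines = []
    for paragraph in history.splitlines():
        lines.extend(textwrap.wrap(paragraph, line_chars) or [""])   # keep blank lines
    img = Image.new("RGB", (width, 14 * len(lines) + 20), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, 10 + 14 * i), line, fill="black")             # default bitmap font
    return img
```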