Dr. Zero is a pair of AI agents (a Proposer and a Solver) that teach each other to do web-search-based reasoning without any human-written training data.
X-Coder shows that models can learn expert-level competitive programming using data that is 100% synthetic, with no real contest problems needed.
Real instructions often contain logic such as "and", "first-then", and "if-else", and this paper teaches models to notice and obey that logic.
The paper fixes a big problem in training web-searching AI: rewarding only the final answer makes agents cut corners and sometimes hallucinate.
Preference tuning teaches language models to act the way people like, but those habits can fall apart when the topic or style changes (domain shift).
The paper teaches an AI to act like a careful traveler: it looks at a photo, forms guesses about where it might be, and uses real map tools to check each guess.
When a model learns from many rewards at once, a popular method called GRPO can accidentally squash different reward mixes into the same learning signal, which confuses training.
RelayLLM lets a small model do the talking and asks a big model for help only on a few truly hard tokens.
Re-Align is a new way for AI to make and edit pictures by thinking in clear steps before drawing.
Long-term AI helpers remember past chats, but using all memories can trap them in old ideas (Memory Anchoring).
SmartSearch teaches search agents to fix their own bad search queries while they are thinking, not just their final answers.
AgentOCR turns an agent’s long text history into pictures so it can remember more using fewer tokens.
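
The GRPO point above (different reward mixes squashed into the same learning signal) can be shown with a little arithmetic. The sketch below is an illustration only, not the paper's code: it assumes each rollout's rewards are summed and then standardized within the group, and the function name and reward values are made up for the example.

```python
import math
import statistics


def group_normalized_advantages(reward_vectors):
    """Sum each rollout's rewards, then standardize the sums within the group.

    Hypothetical helper: assumes sum-then-normalize, which may differ from
    the exact setup the paper analyzes.
    """
    totals = [sum(rewards) for rewards in reward_vectors]
    mean = statistics.fmean(totals)
    std = statistics.pstdev(totals)
    if std == 0:
        # Every rollout scored the same, so there is no signal to learn from.
        return [0.0] * len(totals)
    return [(total - mean) / std for total in totals]


# Case 1: (1, 0) and (0, 1) satisfy different rewards but share one advantage.
same_group = group_normalized_advantages([(1, 0), (0, 1), (1, 1), (0, 0)])
print(same_group[0] == same_group[1])

# Case 2: two groups with different reward mixes yield matching advantages.
group_a = group_normalized_advantages([(0, 0), (1, 0), (1, 1)])
group_b = group_normalized_advantages([(0, 0), (0, 2), (2, 2)])
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(group_a, group_b)))
```

In both cases the model receives the same number for rollouts that did different things, so it cannot tell which reward it actually earned.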