X-Coder shows that models can learn expert-level competitive programming using data that is 100% synthetic—no real contest problems needed.
ArenaRL teaches AI agents by comparing their answers against each other, like a sports tournament, instead of giving each answer a single noisy score.
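
To picture the tournament idea, here is a minimal sketch (not ArenaRL's actual interface; the judge and function names are hypothetical stand-ins): every pair of answers plays a head-to-head match, and each answer's reward is simply its win rate.

```python
import itertools
from typing import Callable, List

def tournament_scores(
    answers: List[str],
    judge: Callable[[str, str], int],  # returns 0 if the first answer wins, 1 if the second wins
) -> List[float]:
    """Score each answer by its win rate in a round-robin tournament.

    Instead of asking a noisy scorer to grade each answer in isolation,
    every pair of answers is compared head-to-head and each answer's
    reward is the fraction of matchups it wins.
    """
    wins = [0] * len(answers)
    for i, j in itertools.combinations(range(len(answers)), 2):
        winner = judge(answers[i], answers[j])
        wins[i if winner == 0 else j] += 1
    matches = len(answers) - 1
    return [w / matches for w in wins]

# Toy judge: prefer the longer answer (a real judge would be an LLM or a verifier).
toy_judge = lambda a, b: 0 if len(a) >= len(b) else 1
print(tournament_scores(["short", "a bit longer", "the longest answer"], toy_judge))
```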
Real instructions often carry logic such as 'and', 'first-then', and 'if-else', and this paper teaches models to notice and obey that logic.
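
To make those kinds of logic concrete, here is a small illustrative checker (invented for this summary, not taken from the paper): 'and' means every part must hold, 'first-then' means order matters, and 'if-else' means which rule applies depends on a condition.

```python
def satisfies_and(text: str, parts) -> bool:
    """'and' logic: every constraint must be satisfied."""
    return all(p(text) for p in parts)

def satisfies_first_then(text: str, first: str, then: str) -> bool:
    """'first-then' logic: both pieces appear, and in the required order."""
    i, j = text.find(first), text.find(then)
    return i != -1 and j != -1 and i < j

def satisfies_if_else(condition: bool, text: str, if_rule, else_rule) -> bool:
    """'if-else' logic: which rule applies depends on the condition."""
    return if_rule(text) if condition else else_rule(text)

reply = "Summary: the results. Sources: [1], [2]."
print(satisfies_and(reply, [lambda t: "Summary" in t, lambda t: "Sources" in t]))  # True
print(satisfies_first_then(reply, "Summary", "Sources"))                           # True
```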
The paper fixes a big problem in training web-searching AI: rewarding only the final answer makes agents cut corners and sometimes hallucinate.
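
A rough sketch of the contrast, with made-up checks rather than the paper's actual reward: an outcome-only reward looks at the final answer alone, while a process-aware reward also asks whether the answer is backed by the evidence the agent actually retrieved.

```python
def outcome_reward(final_answer: str, gold: str) -> float:
    """Outcome-only: 1 if the final answer matches, 0 otherwise.
    An agent can earn this while skipping searches or inventing sources."""
    return 1.0 if final_answer.strip().lower() == gold.strip().lower() else 0.0

def process_aware_reward(final_answer: str, gold: str, cited_snippets) -> float:
    """Also require the answer to be supported by retrieved evidence,
    so shortcuts and hallucinated sources stop paying off."""
    correct = outcome_reward(final_answer, gold)
    grounded = any(final_answer.strip().lower() in s.lower() for s in cited_snippets)
    return 0.5 * correct + 0.5 * (1.0 if grounded else 0.0)

print(process_aware_reward("Paris", "Paris", ["The capital of France is Paris."]))  # 1.0
print(process_aware_reward("Paris", "Paris", []))                                   # 0.5 (right answer, no evidence)
```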
This paper says long chain-of-thought (Long CoT) works best when it follows a 'molecular' pattern with three kinds of thinking bonds: Deep-Reasoning, Self-Reflection, and Self-Exploration.
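
Purely to make the three 'bonds' concrete, here is a toy tagger (invented for this summary) that labels chain-of-thought sentences by cue phrases and reports the mix; the cue lists are assumptions, not the paper's definitions.

```python
import re

# Hypothetical cue phrases for each kind of "thinking bond".
CUES = {
    "Deep-Reasoning":   ("therefore", "because", "it follows"),
    "Self-Reflection":  ("wait", "let me check", "actually"),
    "Self-Exploration": ("alternatively", "another way", "what if"),
}

def tag_bonds(cot: str) -> dict:
    """Count which bond type each sentence of a chain of thought resembles."""
    counts = {bond: 0 for bond in CUES}
    for sentence in re.split(r"[.!?]\s+", cot.lower()):
        for bond, cues in CUES.items():
            if any(c in sentence for c in cues):
                counts[bond] += 1
    return counts

cot = "Because x is even, x = 2k. Wait, let me check that step. Alternatively, what if x is odd?"
print(tag_bonds(cot))  # one sentence of each bond type
```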
EnvScaler is an automatic factory that builds many safe, rule-following practice worlds where AI agents can talk to users and call tools, just like real apps.
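
As a loose illustration of the 'practice world' idea (the class and rules below are invented, not EnvScaler's API), each generated environment pairs tools the agent may call with rules it must not break, and records any violation.

```python
class ToyBookingEnv:
    """A tiny synthetic practice world: the agent can check and book slots,
    but the rules forbid booking a slot that is not available."""

    def __init__(self, slots):
        self.slots = dict(slots)          # slot name -> available?
        self.violations = []              # rule breaks get logged here

    def call_tool(self, name: str, arg: str) -> str:
        if name == "check":
            return "available" if self.slots.get(arg, False) else "unavailable"
        if name == "book":
            if not self.slots.get(arg, False):
                self.violations.append(f"booked unavailable slot {arg}")
                return "error"
            self.slots[arg] = False
            return "booked"
        return "unknown tool"

env = ToyBookingEnv({"mon-9am": True, "mon-10am": False})
print(env.call_tool("check", "mon-10am"))  # unavailable
print(env.call_tool("book", "mon-9am"))    # booked
print(env.violations)                      # []
```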
The paper teaches an AI to act like a careful traveler: it looks at a photo, forms guesses about where it might be, and uses real map tools to check each guess.
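
A schematic of that guess-and-check loop, with an invented geocoding stub standing in for the real map tools the agent would call: guesses that contradict the map facts are thrown out.

```python
def geocode_stub(place: str) -> dict:
    """Stand-in for a real map/geocoding tool the agent would query."""
    fake_db = {
        "Lisbon": {"tram_lines": True, "coastline": True},
        "Vienna": {"tram_lines": True, "coastline": False},
    }
    return fake_db.get(place, {})

def verify_guesses(observed_clues: dict, guesses: list) -> list:
    """Keep only guesses whose map facts match the clues seen in the photo."""
    survivors = []
    for place in guesses:
        facts = geocode_stub(place)
        if facts and all(facts.get(k) == v for k, v in observed_clues.items()):
            survivors.append(place)
    return survivors

# The photo shows a tram and the sea, so only coastal tram cities survive.
print(verify_guesses({"tram_lines": True, "coastline": True}, ["Lisbon", "Vienna"]))  # ['Lisbon']
```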
This paper teaches a camera to fix nighttime colors by combining a smart rule-based color trick (SGP-LRD) with a learning-by-trying helper (reinforcement learning).
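
As a generic illustration only (this is a standard gray-world correction, not the paper's SGP-LRD), the rule-based step removes the colour cast and exposes a per-channel knob that a reinforcement learner could then tune by trial and error.

```python
import numpy as np

def gray_world_correct(img: np.ndarray, learned_gain: np.ndarray) -> np.ndarray:
    """Rule-based step: scale R, G, B so the scene averages to gray,
    then apply a per-channel gain an RL policy could adjust."""
    means = img.reshape(-1, 3).mean(axis=0)
    rule_gains = means.mean() / np.clip(means, 1e-6, None)
    return np.clip(img * rule_gains * learned_gain, 0.0, 1.0)

# A dim, orange-tinted night image (values in [0, 1]); start the learned gain at 1.
night = np.random.rand(8, 8, 3) * np.array([0.4, 0.25, 0.15])
fixed = gray_world_correct(night, learned_gain=np.ones(3))
print(night.reshape(-1, 3).mean(axis=0), "->", fixed.reshape(-1, 3).mean(axis=0))
```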
The paper asks a simple question: do video AIs really need to “think out loud” every time, or can they answer quickly most of the time and think deeply only when needed?
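
One way to picture that fast-or-deliberate choice, using stub model calls and a made-up confidence threshold: answer directly when the quick pass is confident, and fall back to full step-by-step reasoning only when it is not.

```python
def answer_adaptively(question: str, quick_model, thinking_model, threshold: float = 0.8) -> str:
    """Fast path first; spend the expensive chain-of-thought pass only when the
    quick answer's confidence falls below the (hypothetical) threshold."""
    answer, confidence = quick_model(question)
    if confidence >= threshold:
        return answer
    return thinking_model(question)  # slow, step-by-step pass

# Stub models for illustration.
quick = lambda q: ("two people", 0.95) if "how many" in q else ("unsure", 0.3)
deep  = lambda q: "after reasoning over the frames: the red car caused the collision"

print(answer_adaptively("how many people are in the video?", quick, deep))  # fast path
print(answer_adaptively("why did the collision happen?", quick, deep))      # deep path
```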
RelayLLM lets a small model do the talking and only asks a big model for help on a few, truly hard tokens.
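
A toy rendering of the relay idea, assuming the small model exposes a probability for its proposed next token and defers to the big model whenever that probability is low; all names and thresholds here are placeholders, not RelayLLM's actual mechanism.

```python
def relay_generate(prompt: str, small_step, large_step, max_tokens: int = 20,
                   confidence_floor: float = 0.6) -> str:
    """The small model proposes each next token; only low-confidence tokens
    are handed to the large model, so most of the work stays cheap."""
    text = prompt
    for _ in range(max_tokens):
        token, prob = small_step(text)    # small model's best next token + its probability
        if prob < confidence_floor:
            token = large_step(text)      # expensive call, used only for the hard tokens
        if token == "<eos>":
            break
        text += token
    return text

def small(t):
    # Tiny stand-in: confident about filler words, unsure about the factual token.
    if t.endswith("Paris"):
        return "<eos>", 0.9
    if t.endswith("is"):
        return " Paris", 0.2              # low confidence -> relay to the big model
    return " is", 0.9

large = lambda t: " Paris"
print(relay_generate("The capital of France", small, large))  # "The capital of France is Paris"
```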
This survey explains how AI judges are changing from single smart readers (LLM-as-a-Judge) into full-on agents that can plan, use tools, remember, and work in teams (Agent-as-a-Judge).
Long-term AI helpers remember past chats, but using all memories can trap them in old ideas (Memory Anchoring).
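
To make the failure mode concrete (the memory store and filter below are invented for this summary), contrast dumping every past note into the prompt with keeping only the notes relevant to the new request: the dump-everything version keeps resurfacing a stale preference.

```python
memories = [
    "2023-01: user said they love long, formal reports",
    "2024-06: user now prefers short bullet summaries",
    "2024-07: user asked about travel budgets",
]

def build_prompt_all(question: str) -> str:
    """Anchoring-prone: every memory, however old or stale, rides along with the question."""
    return "\n".join(memories) + "\n\nQuestion: " + question

def build_prompt_filtered(question: str, keep: int = 2) -> str:
    """One simple mitigation: keep only recent memories that share words with the question."""
    relevant = [m for m in memories if set(m.lower().split()) & set(question.lower().split())]
    return "\n".join(relevant[-keep:]) + "\n\nQuestion: " + question

q = "write a short summary of the travel budget"
print(build_prompt_all(q))       # still carries the stale 2023 'long, formal reports' note
print(build_prompt_filtered(q))  # keeps only the notes that match the new request
```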