Gaia2 is a new test that measures how well AI agents handle real-life messiness like changing events, deadlines, and team coordination.
MiniCPM-SALA is a 9B-parameter language model that mixes two kinds of attention, sparse and linear, to read very long texts quickly and accurately.
The paper teaches language models to explore more ideas while thinking, so they can solve harder problems.
This paper tackles a simple but serious question: can AI agents use paid tools to finish multi-step tasks without blowing the budget?
LoopFormer is a Transformer that thinks in loops and can flex its thinking time up or down based on the compute you give it.
The paper shows that, when teaching a reasoning AI with step-by-step examples, repeating a small set many times can beat using a huge set only once.
The paper introduces GENIUS, a new test that checks whether image-generating AIs can think on the fly, not just recall facts.
PhyCritic is a judge model that checks other AI models’ answers about the physical world, like cooking steps, robot actions, or driving choices.
GameDevBench is a new test that checks if AI agents can actually make parts of video games, not just write code in one file.
DataChef teaches a large language model to be a smart data chef: it plans and codes full data pipelines that turn messy datasets into great training meals for other models.
RISE lets a robot learn safely and cheaply by practicing in its imagination instead of always in the real world.
ROCKET is a fast, training-free way to shrink big AI models while keeping most of their smarts.