This paper teaches long-horizon AI agents to remember everything exactly without loading their whole memory at once.
SkillOrchestra is a new way to make teams of AI models and tools work together by thinking in terms of skills, not just picking one big model for everything.
This paper teaches AI agents to make smart choices about when to explore for more information and when to act right away.
LOCA-bench is a test that challenges AI agents to work correctly as their to-do list and background information grow very, very long.
This paper says we should measure an AI agent’s uncertainty across its whole conversation, not just on one final answer.
Kimi K2.5 is a new open-source AI that can read both text and visuals (images and videos) and act like a team of helpers to finish big tasks faster.
This paper fixes a common problem in reasoning AIs called Lazy Reasoning, where the model rambles instead of making a good plan.
Long tasks trip up most AIs because they lose track of goals and make small mistakes that snowball over many steps.
CAR-bench is a new 'driving test' for AI assistants that checks if they can stay careful, honest, and consistent during real back-and-forth conversations in a car.
Small AI models often stumble when a tool call fails and then get stuck repeating bad calls instead of fixing the mistake.
This paper builds a new way to create realistic, long conversations between people and AI agents that use tools like databases.
ArenaRL teaches AI agents by comparing their answers against each other, like a sports tournament, instead of giving each answer a single noisy score.
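
To picture the tournament idea in the ArenaRL summary, here is a minimal sketch of a round-robin comparison that turns pairwise wins into relative scores. It is only an illustration under stated assumptions, not ArenaRL's actual algorithm: the names `tournament_scores` and `prefer_longer` are hypothetical, and the length-based judge is a toy stand-in for whatever comparison a real system would use.

```python
from itertools import combinations


def tournament_scores(answers, prefer):
    """Round-robin tournament: score each answer by the fraction of matches it wins.

    `prefer(a, b)` is a hypothetical judge that returns 1.0 if `a` wins,
    0.0 if `b` wins, and 0.5 for a tie.
    """
    wins = [0.0] * len(answers)
    for i, j in combinations(range(len(answers)), 2):
        outcome = prefer(answers[i], answers[j])
        wins[i] += outcome
        wins[j] += 1.0 - outcome
    matches = len(answers) - 1  # each answer plays every other answer once
    if matches <= 0:
        return [0.5] * len(answers)
    return [w / matches for w in wins]


def prefer_longer(a, b):
    # Toy judge (assumption for illustration only): the longer answer wins.
    if len(a) == len(b):
        return 0.5
    return 1.0 if len(a) > len(b) else 0.0


answers = ["ok", "a fuller answer", "mid one"]
scores = tournament_scores(answers, prefer_longer)
print(scores)
```

Each score depends only on how an answer fares against the others in the same group, so the sketch never needs an absolute rating for any single answer.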