This paper says we should measure an AI agent's uncertainty across its whole conversation, not just on its final answer.
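A minimal sketch of what conversation-level uncertainty could look like, assuming it is aggregated from per-turn token entropies; the paper's actual metric may combine signals differently, and `turn_entropy` / `conversation_uncertainty` are illustrative names, not the authors' code.

```python
import math

def turn_entropy(token_probs):
    """Average per-token entropy (in nats) for one assistant turn.

    token_probs: one probability distribution per generated token,
    each a dict mapping candidate tokens to probabilities.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist.values() if p > 0)
        for dist in token_probs
    ]
    return sum(entropies) / len(entropies) if entropies else 0.0

def conversation_uncertainty(turns):
    """Aggregate uncertainty over a whole multi-turn conversation,
    rather than scoring only the final answer."""
    per_turn = [turn_entropy(t) for t in turns]
    return {
        "per_turn": per_turn,
        "mean": sum(per_turn) / len(per_turn),
        "max": max(per_turn),  # one very uncertain turn can flag the whole run
    }
```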
Kimi K2.5 is a new open-source AI that can read both text and visuals (images and videos) and act like a team of helpers to finish big tasks faster.
This paper tackles a common problem in reasoning AIs called Lazy Reasoning, where the model rambles instead of making a good plan.
Long tasks trip up most AIs because they lose track of goals and make small mistakes that snowball over many steps.
CAR-bench is a new 'driving test' for AI assistants that checks if they can stay careful, honest, and consistent during real back-and-forth conversations in a car.
Small AI models often stumble when a tool call fails and then get stuck repeating bad calls instead of fixing the mistake.
This paper introduces a new way to create realistic, long conversations between people and AI assistants that use tools such as databases.
ArenaRL teaches AI agents by comparing their answers against each other, like a sports tournament, instead of giving each answer a single noisy score.
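A rough sketch of the tournament idea, assuming a pairwise judge that says which of two answers is better; the toy `judge` and win-rate scoring below are illustrative, not ArenaRL's actual reward pipeline.

```python
import itertools
import random

def judge(answer_a, answer_b):
    """Toy stand-in for the pairwise judge; a real system would ask a
    reward model or LLM judge which of the two answers is better."""
    return random.random() < 0.5  # True means answer_a wins

def tournament_scores(answers):
    """Score each answer by its win rate against every other answer in the
    group, instead of giving each one an independent, noisy absolute score."""
    wins = [0] * len(answers)
    for i, j in itertools.combinations(range(len(answers)), 2):
        if judge(answers[i], answers[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    n_opponents = max(len(answers) - 1, 1)
    return [w / n_opponents for w in wins]

# The win rates can then serve as relative advantages in a policy-gradient update.
```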
The paper teaches an AI to act like a careful traveler: it looks at a photo, forms guesses about where it might be, and uses real map tools to check each guess.
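A hedged sketch of that hypothesize-and-verify loop, where `extract_location_hypotheses` and `map_search` are hypothetical placeholders for the model's guesses and the real map tools, not the paper's interface.

```python
def geolocate(photo, extract_location_hypotheses, map_search, max_hypotheses=3):
    """Propose candidate places from the image, then check each one against a
    map tool and keep the best-supported guess.

    extract_location_hypotheses(photo) -> list of (place_name, clues) pairs
    map_search(place_name) -> dict of facts about the place, or None
    Both callables are placeholders for the model and the real map API.
    """
    best_guess, best_score = None, float("-inf")
    for place, clues in extract_location_hypotheses(photo)[:max_hypotheses]:
        evidence = map_search(place)
        if evidence is None:
            continue  # the place can't be found on the map; discard this guess
        # Score = how many visual clues the map evidence actually supports.
        score = sum(1 for clue in clues if clue in evidence.get("features", []))
        if score > best_score:
            best_guess, best_score = place, score
    return best_guess
```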
Youtu-LLM is a small (1.96B-parameter) language model that was trained from scratch to think, plan, and act like an agent instead of just copying bigger models.
SenseNova-MARS is a vision-language model that can think step-by-step and use three tools—text search, image search, and image cropping—during its reasoning.
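One plausible shape for such a tool-interleaved reasoning loop; the tool names and the `model.step` interface below are assumptions for illustration, not SenseNova-MARS's released API.

```python
def reasoning_loop(model, image, question, tools, max_steps=8):
    """Interleave the model's reasoning steps with calls to its tools.

    tools is expected to map names like "text_search", "image_search", and
    "crop" to callables; model.step returning (action, argument) is an
    assumed interface for this sketch.
    """
    context = {"image": image, "question": question, "observations": []}
    for _ in range(max_steps):
        action, argument = model.step(context)
        if action == "answer":
            return argument
        if action in tools:  # e.g. crop a region, then reason over the result
            observation = tools[action](argument)
            context["observations"].append((action, observation))
    return None  # no answer within the step budget
```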
GenEnv is a training system where a student AI and a teacher simulator grow together by exchanging tasks and feedback.
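A schematic of that co-training loop, with `propose_tasks`, `attempt`, `evaluate`, and the two `update` calls as assumed method names rather than GenEnv's real interface.

```python
def co_evolve(student, teacher, rounds=10, tasks_per_round=32):
    """Alternating loop: the teacher simulator proposes tasks, the student
    practices on them, and both sides update from the results.
    """
    for _ in range(rounds):
        tasks = teacher.propose_tasks(n=tasks_per_round)
        results = []
        for task in tasks:
            trajectory = student.attempt(task)
            feedback = teacher.evaluate(task, trajectory)
            results.append((task, trajectory, feedback))
        student.update(results)   # learn from feedback on the proposed tasks
        teacher.update(results)   # shift task difficulty toward what the student needs
```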