This paper teaches AI agents to make smart choices about when to explore for more information and when to act right away.
LOCA-bench is a test that challenges AI agents to work correctly as their to-do list and background information grow very, very long.
This paper says we should measure an AI agent’s uncertainty across its whole conversation, not just on one final answer.
This paper fixes a common problem in reasoning AIs called Lazy Reasoning, where the model rambles instead of making a good plan.
Long tasks trip up most AIs because they lose track of goals and make small mistakes that snowball over many steps.
CAR-bench is a new 'driving test' for AI assistants that checks whether they can stay careful, honest, and consistent during real back-and-forth conversations in a car.
The paper builds a new way to create realistic, long conversations between people and AI that use tools like databases.
Youtu-LLM is a small (1.96B-parameter) language model that was trained from scratch to think, plan, and act like an agent instead of just copying bigger models.
SenseNova-MARS is a vision-language model that can think step by step and use three tools (text search, image search, and image cropping) during its reasoning.
GenEnv is a training system where a student AI and a teacher simulator grow together by exchanging tasks and feedback.
This paper organizes how AI agents learn and improve into one simple map with four roads: A1, A2, T1, and T2.
Clinical conversations are special because they mix caring feelings with precise medical facts, and old AI systems struggled to do both at once.