This paper says we should measure an AI agent’s uncertainty across its whole conversation, not just on one final answer.
This paper fixes a common problem in reasoning AIs called Lazy Reasoning, where the model rambles instead of making a good plan.
Long tasks trip up most AIs because they lose track of goals and make small mistakes that snowball over many steps.
CAR-bench is a new “driving test” for AI assistants that checks if they can stay careful, honest, and consistent during real back-and-forth conversations in a car.
The paper builds a new way to create realistic, long conversations between people and AI in which the AI uses tools like databases.
Youtu-LLM is a small (1.96B-parameter) language model that was trained from scratch to think, plan, and act like an agent instead of just copying bigger models.
SenseNova-MARS is a vision-language model that can think step-by-step and use three tools—text search, image search, and image cropping—during its reasoning.
GenEnv is a training system where a student AI and a teacher simulator grow together by exchanging tasks and feedback.
This paper organizes how AI agents learn and improve into one simple map with four roads: A1, A2, T1, and T2.
Clinical conversations are special because they mix caring feelings with precise medical facts, and old AI systems struggled to do both at once.