This paper says we should measure an AI agent’s uncertainty across its whole conversation, not just on one final answer.
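A rough sketch of the idea, not the paper's actual metric: score every turn of the conversation (here by average token surprisal) and aggregate, instead of scoring only the final answer. The input format and the mean/max aggregation are my assumptions.

```python
def turn_uncertainty(token_logprobs):
    """Average surprisal (negative log-probability) of one turn's tokens."""
    return -sum(token_logprobs) / max(len(token_logprobs), 1)

def conversation_uncertainty(turns, agg="mean"):
    """Score a whole agent conversation instead of only its final answer.

    `turns` is a list of per-turn token log-prob lists (an assumed format).
    """
    scores = [turn_uncertainty(t) for t in turns]
    return max(scores) if agg == "max" else sum(scores) / len(scores)

# Toy example: three turns; the middle turn is the most uncertain one.
turns = [[-0.1, -0.2], [-2.3, -1.9, -2.7], [-0.3]]
print(conversation_uncertainty(turns, agg="mean"))  # whole-conversation view
print(conversation_uncertainty(turns, agg="max"))   # worst-turn view
```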
The paper shows that even if a model is great at predicting when an AI agent will fail, jumping in to “fix” the agent mid-task can still make things worse.
The paper studies how to teach a smaller language model using a bigger one by focusing only on the most useful bits of the teacher's output instead of everything.
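As a rough sketch of what "focusing only on the most useful bits" can look like in code (one common selective-distillation recipe, not necessarily this paper's), the loss below keeps only the tokens where the student diverges most from the teacher; `keep_frac` and the KL-based selection rule are assumptions.

```python
import torch
import torch.nn.functional as F

def selective_kd_loss(student_logits, teacher_logits, keep_frac=0.25):
    """Distill only on the tokens where the student disagrees most with the teacher.

    Shapes: (seq_len, vocab). `keep_frac` and picking tokens by per-token KL
    are illustrative assumptions, not the paper's recipe.
    """
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    s_logp = F.log_softmax(student_logits, dim=-1)
    # Per-token KL(teacher || student).
    per_token_kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)
    k = max(1, int(keep_frac * per_token_kl.numel()))
    top_kl, _ = per_token_kl.topk(k)  # the "most useful" tokens for this step
    return top_kl.mean()

# Toy usage with random logits.
seq_len, vocab = 16, 100
loss = selective_kd_loss(torch.randn(seq_len, vocab), torch.randn(seq_len, vocab))
print(loss.item())
```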
Deep search agents can plan and browse the web in many steps, but they often fail because they don’t notice when their own thinking drifts off-track.
WorldVQA is a new test that checks if multimodal AI models can correctly name what they see in pictures without doing extra reasoning.
This paper asks a new question for vision-language models: not just “What do you see?” but “How far along is the task right now?”
The paper introduces Entropy Sentinel, a simple way to track how accurate an AI’s output is by reading its “uncertainty heartbeat” during generation.
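Since the method's name points at entropy, here is a minimal sketch of the general pattern: compute per-token entropy as the model generates and flag steps where a short running average spikes. The `window` and `threshold` knobs are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def entropy_monitor(step_logits, window=8, threshold=2.5):
    """Flag generation steps where recent average entropy exceeds a threshold.

    `window` and `threshold` are assumed knobs for this sketch.
    """
    flags, history = [], []
    for logits in step_logits:
        history.append(token_entropy(logits).item())
        recent = history[-window:]
        flags.append(sum(recent) / len(recent) > threshold)
    return flags

# Toy usage: 20 generation steps over a 50-token vocabulary.
steps = [torch.randn(50) for _ in range(20)]
print(entropy_monitor(steps))
```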
This paper studies how AI agents that use tools talk about how sure they are and finds a split: some tools make them too sure, others help them be honest.
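To make "too sure" concrete, a standard way to quantify it is expected calibration error (ECE) over the agent's stated confidences; the sketch below is that generic metric, not necessarily the evaluation this paper runs.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: the gap between stated confidence and actual accuracy,
    averaged over confidence bins (weighted by bin size)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy usage: an agent that says 0.9 but is right only half the time.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # 0.4
```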
This paper teaches AI models not just how to solve problems but also how to tell when their own answers might be wrong.
Coding agents that fix software rely on feedback, but unit tests give only pass/fail signals that are often noisy or missing.
Large language models often sound confident even when they are wrong, and existing ways to catch mistakes are slow or not very accurate.
LLM judges are cheap but biased; without calibration they can completely flip which model looks best.
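One well-known judge bias is position bias, and a common mitigation is to ask the judge twice with the answer order swapped and keep only consistent verdicts. The sketch below shows that pattern with a hypothetical `judge` callable; it is a standard heuristic, not the calibration method this paper proposes.

```python
def debiased_verdict(judge, prompt, answer_a, answer_b):
    """Query an LLM judge in both answer orders to cancel position bias.

    `judge(prompt, first, second)` is a hypothetical callable returning
    "first", "second", or "tie"; the consistency rule here is a common
    heuristic, not this paper's calibration procedure.
    """
    v1 = judge(prompt, answer_a, answer_b)  # A shown first
    v2 = judge(prompt, answer_b, answer_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"  # inconsistent or tied verdicts count as a tie

# Toy usage: a judge that always prefers whichever answer is shown first.
position_biased_judge = lambda prompt, first, second: "first"
print(debiased_verdict(position_biased_judge, "Which answer is better?", "A text", "B text"))
# -> "tie": the order swap exposes the bias instead of flipping the ranking.
```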