The paper treats the last layer of a Large Language Model (the softmax over tokens) as an Energy-Based Model, which lets us measure a new signal called spilled energy.
Accuracy alone can make AI agents look good on paper while still failing in real life; this paper shows how to measure reliability properly.
This paper says we should measure an AI agent’s uncertainty across its whole conversation, not just on one final answer.
The paper shows that even if a model is great at predicting when an AI agent will fail, jumping in to “fix” the agent mid-task can still make things worse.
The paper studies how to teach a smaller language model using a bigger one by focusing only on the most useful bits instead of everything.
Deep search agents can plan and browse the web in many steps, but they often fail because they don’t notice when their own thinking drifts off-track.
WorldVQA is a new test that checks if multimodal AI models can correctly name what they see in pictures without doing extra reasoning.
This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'
This paper teaches AI models not just how to solve problems but also how to tell when their own answers might be wrong.
Coding agents that fix software rely on feedback, but unit tests give only pass/fail signals that are often noisy or missing.
Large language models often sound confident even when they are wrong, and existing ways to catch mistakes are slow or not very accurate.
LLM judges are cheap but biased; without calibration they can completely flip which model looks best.
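
To make the ranking flip concrete, here is a minimal sketch of one standard way to calibrate a biased judge. It is not taken from the paper. It assumes a small human-labeled set has given us, for each model, the judge's true-positive rate (how often it approves answers that really are correct) and false-positive rate (how often it approves answers that are wrong). All model names and numbers are hypothetical, and the correction shown is the classic misclassification adjustment (sometimes called the Rogan-Gladen estimator), used here as a stand-in for whatever calibration the paper proposes.

```python
def corrected_rate(observed: float, tpr: float, fpr: float) -> float:
    """Solve observed = p * tpr + (1 - p) * fpr for the true rate p.

    The result is clipped to [0, 1] because sampling noise can push
    the raw estimate slightly outside that range.
    """
    if tpr <= fpr:
        raise ValueError("judge must beat chance (tpr must exceed fpr)")
    p = (observed - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))


# Hypothetical share of answers the LLM judge marked correct, per model.
judge_scores = {"model_a": 0.66, "model_b": 0.69}

# Hypothetical (tpr, fpr) of the judge on each model's answers, estimated
# from human labels. The judge is more lenient toward model_b's mistakes.
judge_rates = {"model_a": (0.90, 0.10), "model_b": (0.95, 0.30)}

# Ranking by raw judge scores versus ranking after calibration.
raw_best = max(judge_scores, key=judge_scores.get)
calibrated = {
    name: corrected_rate(score, *judge_rates[name])
    for name, score in judge_scores.items()
}
calibrated_best = max(calibrated, key=calibrated.get)

print("best by raw judge score:", raw_best)
print("best after calibration:", calibrated_best)
```

In this made-up example the raw judge scores put model_b ahead, while the calibrated estimates (about 0.70 for model_a and 0.60 for model_b) put model_a ahead, which is the kind of flip the summary describes.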