The paper shows that even if a model is great at predicting when an AI agent will fail, jumping in to “fix” the agent mid-task can still make things worse.
In long AI tasks, an early mistake can snowball into ever-worse errors, a pattern called the Spiral of Hallucination.
The paper studies why large language models (LLMs) sound too sure of themselves when using retrieval-augmented generation (RAG) and how to fix that overconfidence.
This paper teaches AI models not just how to solve problems but also how to tell when their own answers might be wrong.
The authors built a simple six-agent system to see if today’s AI models could plan, run, and write a research paper mostly on their own.
This paper asks a simple question with big impact: Can AI tell which test questions are hard for humans?