Papers (3)

#Brier score

Towards a Science of AI Agent Reliability

Stephan Rabanser, Sayash Kapoor et al. · Feb 18 · arXiv

Accuracy alone can make AI agents look good on paper while still failing in real life; this paper shows how to measure reliability properly.

#AI agent reliability #consistency #robustness


The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents

Beginner

Weihao Xuan, Qingcheng Zeng et al. · Jan 12 · arXiv

This paper studies how tool-using AI agents express their confidence and finds a split: some tools make them overconfident, while others help them stay calibrated.

#LLM agents #calibration #overconfidence


Scaling Open-Ended Reasoning to Predict the Future

Intermediate

Nikhil Chandak, Shashwat Goel et al. · Dec 31 · arXiv

The paper teaches small language models to predict open-ended future events by turning daily news into thousands of safe, graded practice questions.

#open-ended forecasting #calibrated prediction #Brier score
