Accuracy alone can make AI agents look good on paper even when they fail in real life; this paper shows how to measure reliability properly.
The paper teaches small language models to predict open-ended future events by turning daily news into thousands of safe, graded practice questions.