Tests (benchmarks) are the standard way to judge how smart AI models are, but not all tests are good tests.
This paper studies how sure (confident) large language models are during multi-turn chats where clues arrive step by step.