This paper teaches AI agents to decide when to explore for more information and when to act immediately.
LLM judges are cheap but biased; without calibration they can completely flip which model looks best.
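
To make the ranking flip concrete, here is a minimal sketch of how a biased judge can reverse the order of two models and how a small human-labeled calibration set could undo it. Everything in it is an assumption for illustration: the models `A` and `B`, their true accuracies, and the judge's per-model sensitivity and specificity are invented numbers. The correction shown is the Rogan-Gladen prevalence estimator, a standard technique that is not necessarily the calibration method the summarized work uses.

```python
# Hypothetical setup: two models with known true accuracy, scored by an LLM
# judge whose error rates differ per model (e.g. it is lenient toward B's style).
# All numbers below are invented for illustration.
MODELS = {
    # name: (true_accuracy, judge_sensitivity, judge_specificity)
    "A": (0.70, 0.80, 0.90),
    "B": (0.60, 0.95, 0.60),
}


def judge_pass_rate(true_acc: float, sensitivity: float, specificity: float) -> float:
    """Expected fraction of answers the judge marks correct.

    sensitivity = P(judge says correct | answer is correct)
    specificity = P(judge says wrong   | answer is wrong)
    """
    return true_acc * sensitivity + (1.0 - true_acc) * (1.0 - specificity)


def calibrated_accuracy(raw_rate: float, sensitivity: float, specificity: float) -> float:
    """Rogan-Gladen correction: invert the judge's error rates.

    In practice sensitivity and specificity would be estimated from a small
    human-labeled sample; here they are taken as known.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0.0:
        raise ValueError("judge is no better than chance; cannot calibrate")
    estimate = (raw_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))  # clamp to a valid probability


raw = {name: judge_pass_rate(*params) for name, params in MODELS.items()}
calibrated = {
    name: calibrated_accuracy(raw[name], sens, spec)
    for name, (_, sens, spec) in MODELS.items()
}

raw_best = max(raw, key=raw.get)
calibrated_best = max(calibrated, key=calibrated.get)

print("raw judge scores:  ", raw, "-> best:", raw_best)
print("calibrated scores: ", calibrated, "-> best:", calibrated_best)
```

Under these assumed error rates the raw judge scores favor the weaker model, while the corrected scores recover the true ordering. With real data the error rates are themselves estimates, so the corrected scores would carry additional uncertainty.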