Large language models often sound confident even when they are wrong, and existing ways to catch their mistakes are either slow or not very accurate.
LLM judges are cheap but biased; without calibration they can completely flip which model looks best.
This paper teaches video-making AI models to say how sure they are about each tiny part of every frame they create.