RubricBench is a new benchmark that checks whether AI judges can use clear, checklist-style rules (rubrics) the way humans do.
Diffusion models produce high-quality images and videos but are slow, because generating a sample usually takes many small denoising steps.
Large language models often sound confident even when they are wrong, and existing ways to catch mistakes are slow or not very accurate.
The paper shows that many AI image generators are trained to prefer one popular idea of beauty, even when a user clearly asks for something messy, dark, blurry, or emotionally heavy.