Diffusion models make pictures from noise but often miss what people actually want in the prompt or what looks good to humans.
Coding agents used to fix software rely on feedback; unit tests give only pass/fail signals that are often noisy or missing.
The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.