This paper builds MemoryRewardBench, a big test that checks if reward models (AI judges) can fairly grade how other AIs manage long-term memory, not just whether their final answers are right.
Large language models often sound confident even when they are wrong, and existing ways to catch mistakes are slow or not very accurate.
The paper shows that many AI image generators are trained to prefer one popular idea of beauty, even when a user clearly asks for something messy, dark, blurry, or emotionally heavy.