Diffusion language models (dLLMs) can write text in any order, but common decoding methods still prefer left-to-right, which wastes their superpower.
This paper builds MemoryRewardBench, a big test that checks if reward models (AI judges) can fairly grade how other AIs manage long-term memory, not just whether their final answers are right.
The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.