This paper builds MemoryRewardBench, a big test that checks if reward models (AI judges) can fairly grade how other AIs manage long-term memory, not just whether their final answers are right.
Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.