This paper builds MemoryRewardBench, a big test that checks if reward models (AI judges) can fairly grade how other AIs manage long-term memory, not just whether their final answers are right.
RealMem is a new benchmark that tests how well AI assistants remember and manage long, ongoing projects across many conversations.
This paper introduces the Confucius Code Agent (CCA), a coding helper built to handle huge real-world codebases with long tasks and many tools.