NanoKnow is a new benchmark that checks whether a language model's answers come from what it saw during training or from extra text we give it at question time.
SAGE is a new test for how well AI research agents find scientific papers when questions require multi-step reasoning.
AACR-Bench is a new test set that checks how well AI can do code reviews using the whole project, not just one file.
This paper teaches a language-model agent to look up facts in millions of scientific paper summaries and answer clear, single-answer questions.
SimpleMem is a new memory system that helps AI remember long conversations without wasting space or tokens.