SAGE is a new test for how well AI research agents find scientific papers when questions require multi-step reasoning.
AT2PO is a new way to train AI agents that work over several turns, such as searching the web, reading the result, and trying again.
EpiQAL is a new benchmark that tests how well AI models answer population-level disease questions using real research papers.
The paper defines Scientific General Intelligence (SGI) as an AI that can do science like a human scientist across the full loop: study, imagine, test, and understand.