Gaia2 is a new test that measures how well AI agents handle real-life messiness like changing events, deadlines, and team coordination.
This paper builds a safe science "playground" called DeR that fairly tests how AI finds facts (retrieval) and how it thinks with those facts (reasoning) without mixing them up.