Gaia2 is a new test that measures how well AI agents handle real-life messiness like changing events, deadlines, and team coordination.
This paper builds a safe science “playground” called DeR that fairly tests how AI finds facts (retrieval) and how it thinks with those facts (reasoning), measuring each skill separately so the two are not confused.
This paper builds a big, fair test called Hearing to Translate to check how well different speech translation systems work in the real world.