Gaia2 is a new benchmark that measures how well AI agents handle real-life messiness such as changing events, deadlines, and team coordination.
This survey explains how AI judges are evolving from single smart readers (LLM-as-a-Judge) into full agents that can plan, use tools, remember, and work in teams (Agent-as-a-Judge).