A2Eval is a two-agent system that automatically builds and runs fair benchmarks for embodied (robot-oriented) vision-language models, cutting redundant evaluation work while keeping results trustworthy.
This paper asks a simple question with big impact: Can AI tell which test questions are hard for humans?