SAGE is a two-agent system that automatically writes tough, multi-step search questions and checks them by actually trying to solve them.
SVBench is the first benchmark that checks whether video generation models can depict realistic social behavior, not just produce good-looking footage.