This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.
Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.
This paper introduces EDIR, a new and much more detailed test for Composed Image Retrieval (CIR), where you search for a target image using a starting image plus a short text describing the change you want.
This survey explains how AI judges are changing from single smart readers (LLM-as-a-Judge) into full-on agents that can plan, use tools, remember, and work in teams (Agent-as-a-Judge).
T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.
This paper builds a new test called Video Reality Test to see if AI-made ASMR videos can fool both people and AI models that watch video (VLMs).
The FACTS Leaderboard is a four-part test that checks how truthful AI models are across images, memory, web search, and document grounding.