MobilityBench is a big, carefully built test that checks how well AI helpers can plan real-world routes using natural language and map tools.
Modern image editors can now follow visual prompts like arrows and scribbles, which opens a new way for attackers to hide harmful instructions inside images.
AgenticPay is a safe playground where AI agents practice buying and selling by talking, not just by typing numbers.
PaperBanana is a team of AI helpers that turns a paper’s method text and caption into a clean, accurate, publication-ready figure.
DeepSearchQA is a new test with 900 real-world-style questions that checks whether AI agents can find complete lists of answers, not just one fact.
MemoryRewardBench is a big test that checks whether reward models (AI judges) can fairly grade how other AIs manage long-term memory, not just whether their final answers are right.
This paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos rather than text.
T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.
The FACTS Leaderboard is a four-part test that checks how truthful AI models are across images, memory, web search, and document grounding.