RubricBench is a new benchmark that checks whether AI judges can use clear, checklist-style rules (rubrics) the way humans do.
MobilityBench is a big, carefully built test that checks how well AI helpers can plan real-world routes using natural language and map tools.
Deep research agents write long reports, but older tests often judge only how fluent they sound and whether they include links, not whether the facts are true today or the logic really holds.
Modern image editors can now follow visual prompts like arrows and scribbles, which opens a new way for attackers to hide harmful instructions inside images.
AIRS-Bench is a new test suite that checks whether AI research agents can do real machine learning research from start to finish, not just answer questions.
AgenticPay is a safe playground where AI agents practice buying and selling by negotiating in conversation, not just by exchanging numbers.
This paper builds an AI team that can make real full‑stack websites (frontend, backend, and database) from plain English instructions.
PaperBanana is a team of AI helpers that turns a paper’s method text and caption into a clean, accurate, publication-ready figure.
This paper builds a smart team of AI helpers, called MEnvAgent, that automatically sets up the right computer environments for code projects in many languages.
CAR-bench is a new 'driving test' for AI assistants that checks if they can stay careful, honest, and consistent during real back-and-forth conversations in a car.
DeepSearchQA is a new test with 900 real-world style questions that checks if AI agents can find complete lists of answers, not just one fact.
LingBot-VLA is a robot brain that listens to language, looks at the world, and plans smooth actions to get tasks done.