PaperBanana is a team of AI helpers that turns a paper’s method text and caption into a clean, accurate, publication-ready figure.
DeepSearchQA is a new test with 900 real-world style questions that checks if AI agents can find complete lists of answers, not just one fact.
This paper builds MemoryRewardBench, a big test that checks if reward models (AI judges) can fairly grade how other AIs manage long-term memory, not just whether their final answers are right.
The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.
T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.
The FACTS Leaderboard is a four-part test that checks how truthful AI models are across images, memory, web search, and document grounding.