This paper builds a big, fair playground (a benchmark) to test many EEG foundation models side by side under the same rules.
This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.
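For context, here is a minimal PyTorch sketch of standard FSQ (Mentzer et al., 2023), simplified to odd level counts; the `bound` activation is the spot where a one-line fix of the kind described would go. The paper's actual change is not reproduced here.

```python
import torch

def bound(z: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    # Bounded activation: squash each channel into (-(L-1)/2, (L-1)/2).
    # This is the activation the summary says the paper changes by one line.
    # (Simplified to odd level counts; even counts need an extra half-step
    # offset in the full FSQ formulation.)
    return torch.tanh(z) * (levels - 1) / 2

def fsq_quantize(z: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    # Round each channel to its nearest integer level; the straight-through
    # estimator (STE) lets gradients skip the non-differentiable round().
    b = bound(z, levels)
    return b + (torch.round(b) - b).detach()

# Usage: 4 latent channels with 7, 5, 5, 5 levels -> 7*5*5*5 = 875 codes.
levels = torch.tensor([7.0, 5.0, 5.0, 5.0])
z = torch.randn(2, 16, 4)          # (batch, tokens, channels)
codes = fsq_quantize(z, levels)    # integer-valued, differentiable in z
```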
FutureOmni is the first benchmark that tests if multimodal AI models can predict what happens next from both sound and video, not just explain what already happened.
ToolPRMBench is a new benchmark that checks, step by step, whether an AI agent using tools picks the right next action.
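To make "step by step" concrete, here is a hypothetical Python sketch of how such a benchmark might grade a process reward model (PRM) against gold per-step labels; the trajectory schema and `prm_judge` interface are illustrative assumptions, not ToolPRMBench's actual API.

```python
# A PRM labels each tool call in a trajectory as correct or not; the
# benchmark scores the PRM by agreement with gold per-step labels.
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (tool_name, tool_arguments) -- assumed schema

def step_accuracy(
    trajectory: List[Step],
    gold_labels: List[bool],                        # which steps were right
    prm_judge: Callable[[List[Step], int], bool],   # PRM's verdict on step i
) -> float:
    # Fraction of steps where the PRM's verdict matches the gold label.
    hits = sum(
        prm_judge(trajectory, i) == gold
        for i, gold in enumerate(gold_labels)
    )
    return hits / len(gold_labels)

# Toy usage with a trivial judge that approves every step:
traj = [("search", "q=weather"), ("calculator", "2+2")]
print(step_accuracy(traj, [True, False], lambda t, i: True))  # 0.5
```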
This paper builds MemoryRewardBench, a big test that checks if reward models (AI judges) can fairly grade how other AIs manage long-term memory, not just whether their final answers are right.
The paper shows that language models with a search tool often look up far more than they need, which wastes compute and can even make answers worse, especially on unanswerable questions.
COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.
The paper tackles how AI agents can do genuine research on the open web when the answers are hidden inside long, messy videos rather than plain text.
T2AV-Compass is a new, unified test to fairly grade AI systems that turn text into matching video and audio.
This paper builds a tough new test called O3-BENCH to check if AI can truly think with images, not just spot objects.
SWE-EVO is a new test (benchmark) that checks if AI coding agents can upgrade real software projects over many steps, not just fix one small bug.
This paper teaches a video-understanding AI to think in 3D plus time (4D) so it can answer questions about specific objects moving in videos.