This paper builds an AI team that can make real full-stack websites (frontend, backend, and database) from plain English instructions.
PaperBanana is a team of AI helpers that turns a paper’s method text and caption into a clean, accurate, publication-ready figure.
This paper builds a smart team of AI helpers, called MEnvAgent, that automatically sets up the right computer environments for code projects in many languages.
CAR-bench is a new 'driving test' for AI assistants that checks whether they can stay careful, honest, and consistent during real back-and-forth conversations in a car.
DeepSearchQA is a new test with 900 real-world-style questions that checks whether AI agents can find complete lists of answers, not just one fact.
LingBot-VLA is a robot brain that listens to language, looks at the world, and decides on smooth actions to get tasks done.
This paper builds a big, fair playground (a benchmark) to test many EEG foundation models side by side under the same rules.
This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.
FutureOmni is the first benchmark that tests whether multimodal AI models can predict what happens next from both sound and video, not just explain what already happened.
ToolPRMBench is a new benchmark that checks, step by step, whether an AI agent using tools picks the right next action.
This paper builds MemoryRewardBench, a big test that checks if reward models (AI judges) can fairly grade how other AIs manage long-term memory, not just whether their final answers are right.
The paper shows that language models with a search tool often look up too much information, which wastes compute and can make answers worse on unanswerable questions.