This paper builds a new test called Ref-Adv to check whether AI can truly match tricky sentences to the right object in a picture.
SkillsBench is a big test playground that measures whether giving AI agents step-by-step 'Skills' actually helps them finish real tasks.
This paper shows a simple, repeatable way to teach general Vision-Language Models (VLMs) to understand e-commerce items much better without forgetting their general skills.
The paper introduces GENIUS, a new test that checks whether image-generating AIs can think on the fly, not just recall facts.
PhyCritic is a judge model that checks other AI models’ answers about the physical world, like cooking steps, robot actions, or driving choices.
ViDoRe V3 is a big, carefully built test that checks how well AI systems find and use information from both text and pictures (like tables and charts) in real documents.
MDAgent2 is a special helper built from large language models (LLMs) that can both answer questions about molecular dynamics and write runnable LAMMPS simulation code.
ModelTables is a giant, organized collection of tables that describe AI models, gathered from Hugging Face model cards, GitHub READMEs, and research papers.
HERBench is a new test that checks whether video AI models can combine several clues spread across time, rather than guessing from a single frame or from language priors.
This paper argues that the fastest and safest path to super-smart AI is for humans and AIs to improve together, not for AI to improve alone.