SpatiaLab is a new test that checks if vision-language models (VLMs) can understand real-world spatial puzzles, like what’s in front, behind, bigger, or reachable.
This paper builds ID-MoCQA, a new quiz set about Indonesian culture with two-step (multi-hop) questions that make AI connect clues before answering.
The paper introduces Rubric-ARM, a system that teaches two AI helpers—a rubric maker and a judge—to learn together using reinforcement learning so they can better decide which answers people would prefer.
The paper introduces SIN-Bench, a new way to test AI models that read long scientific papers by forcing them to show exactly where their answers come from.
RubricHub is a huge collection of about 110,000 detailed grading guides (rubrics) covering many kinds of questions, from health and science to writing and chat.
Preference tuning teaches language models to act the way people like, but those habits can fall apart when the topic or style changes (domain shift).
SenseNova-MARS is a vision-language model that can think step-by-step and use three tools—text search, image search, and image cropping—during its reasoning.
SciEvalKit is a new open-source toolkit that tests AI on real scientific skills, not just trivia or simple Q&A.
Reward models are like scorekeepers that tell AI which answers people like more, and this paper builds the first big test for scorekeepers that judge both pictures and words together.
FINERWEB is a new, carefully built dataset and pipeline that teach computers to spot names of people, places, and more across 91 languages and 25 writing systems.
SAGE is a smart video-watching agent that decides when to answer quickly and when to take multiple steps, just like how people skim or rewind long videos.