FIRE-Bench is a new test that checks whether AI agents can fully redo real scientific discoveries, step by step, not just guess answers.
The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.
Large language models usually learn by guessing the next word, then get a tiny bit of instruction-following practice; this paper flips that by turning massive web documents into instruction-and-answer pairs at huge scale.
This paper teaches language models to be safer, more factual, and higher in quality during pretraining, not just after, by using reinforcement learning with a stronger model as a helper.
This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.
This paper turns rebuttal writing from ‘just write some text’ into ‘make a plan with proof, then write.’
AgencyBench is a giant test that checks how well AI agents can handle real, long, multi-step jobs, not just short puzzles.
The paper shows that top reasoning AIs don’t just think longer: they act like a tiny team inside their heads, with different voices that ask, disagree, and then agree.
VideoDR is a new benchmark that tests if AI can watch a video, pull out key visual clues, search the open web, and chain the clues together to find one verifiable answer.
The paper shows that friendly, people-pleasing language can trick even advanced language models into agreeing with wrong answers.
ArenaRL teaches AI agents by comparing their answers against each other, like a sports tournament, instead of giving each answer a single noisy score.
The paper shows that language models with a search tool often look up too much information, which wastes compute and can make answers worse on unanswerable questions.