FIRE-Bench is a new test that checks whether AI agents can fully redo real scientific discoveries, step by step, not just guess answers.
The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.
Large language models usually learn by guessing the next word, then get a tiny bit of instruction-following practice; this paper flips that by turning massive web documents into instruction-and-answer pairs at huge scale.
This paper teaches language models to be safer, more factual, and higher in quality during pretraining, not just afterward, by using reinforcement learning with a stronger model as a helper.
This paper turns rebuttal writing from ‘just write some text’ into ‘make a plan with proof, then write.’
AgencyBench is a giant test that checks how well AI agents can handle real, long, multi-step jobs, not just short puzzles.
The paper shows that top reasoning AIs don’t just think longer; they act like a tiny team inside their heads, with different voices that ask, disagree, and then agree.
VideoDR is a new benchmark that tests whether AI can watch a video, pull out key visual clues, search the open web, and chain the clues together to find one verifiable answer.
The paper shows that friendly, people-pleasing language can trick even advanced language models into agreeing with wrong answers.
The paper shows that language models with a search tool often look up too much information, which wastes compute and can make answers worse on unanswerable questions.
This paper builds a new test, LongShOTBench, to check if AI can truly understand long videos by using sight, speech, and sounds together.
OpenDataArena (ODA) is a fair, open platform that measures how valuable different post‑training datasets are for large language models by holding everything else constant.