Legal RAG Bench is a new, end-to-end test that checks how well legal AI systems find information and use it to answer tough, real-world legal questions.
SLATE is a new way to teach AI to think step by step while using a search engine, giving feedback at each step instead of only at the end.
MediX-R1 teaches medical AI models to give clear, free-form answers (not just A, B, C, or D) and to explain their thinking.
MemGUI-Bench is a new test that checks how well phone-controlling AI agents can remember important information both during a task and across different tries.
FIRE-Bench is a new test that checks whether AI agents can fully redo real scientific discoveries, step by step, not just guess answers.
The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.
Large language models usually learn by guessing the next word, then get a tiny bit of instruction-following practice; this paper flips that by turning massive web documents into instruction-and-answer pairs at huge scale.
This paper teaches language models to be safer, more factual, and higher quality during pretraining, not just after, by using reinforcement learning with a stronger model as a helper.
This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.
This paper turns rebuttal writing from ‘just write some text’ into ‘make a plan with proof, then write.’
AgencyBench is a giant test that checks how well AI agents can handle real, long, multi-step jobs, not just short puzzles.
The paper shows that top reasoning AIs don’t just think longer—they act like a tiny team inside their heads, with different voices that ask, disagree, and then agree.