VideoDR is a new benchmark that tests whether an AI can watch a video, pull out key visual clues, search the open web, and chain those clues together to reach one verifiable answer.
The paper shows that friendly, people-pleasing language can trick even advanced language models into agreeing with wrong answers.
ArenaRL teaches AI agents by comparing their answers against each other, like a sports tournament, instead of giving each answer a single noisy score.
The paper shows that language models with a search tool often look up too much information, which wastes compute and can make answers worse on unanswerable questions.
This paper builds a new test, LongShOTBench, to check whether AI can truly understand long videos by combining sight, speech, and sound.
OpenDataArena (ODA) is a fair, open platform that measures how valuable different post‑training datasets are for large language models by holding everything else constant.
FINCH is a new test that checks whether AI can handle real finance and accounting work using messy, real spreadsheets, emails, PDFs, charts, and more.
VOYAGER is a training-free way to make large language models (LLMs) create data that is truly different, not just slightly reworded.
LLM judges are cheap but biased; without calibration they can completely flip which model looks best.
Role-playing agents need to juggle several goals at once, like staying in character, following instructions, and using the right tone.
Large language models forget or misuse new facts if you only edit their weights once; EtCon fixes this with a two-step plan.