This paper builds a new test, LongShOTBench, to check whether AI can truly understand long videos by combining sight, speech, and other sounds.
OpenDataArena (ODA) is a fair, open platform that measures how valuable different post‑training datasets are for large language models by holding everything else constant.
FINCH is a new test that checks whether AI can handle real finance and accounting work using messy, real-world spreadsheets, emails, PDFs, charts, and more.
VOYAGER is a training-free way to make large language models (LLMs) create data that is truly different, not just slightly reworded.
LLM judges are cheap but biased; without calibration they can completely flip which model looks best.
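One well-known source of this bias in pairwise judging is position: many judge models favor whichever answer is shown first, which alone can flip a leaderboard. Below is a minimal, generic sketch of a common mitigation, querying the judge twice with the answer order swapped and only counting verdicts that agree; the `judge_prefers_first` callable and the toy biased judge are hypothetical placeholders, and this illustrates the general idea rather than the calibration method from the paper.

```python
import random
from typing import Callable, Optional


def debiased_verdict(
    judge_prefers_first: Callable[[str, str, str], bool],
    prompt: str,
    answer_a: str,
    answer_b: str,
) -> Optional[str]:
    """Query a pairwise LLM judge twice with the answer order swapped.

    Returns "A" or "B" only when both orderings agree; returns None
    (treat as a tie) when the verdict flips with position, which signals
    position bias rather than a real quality difference.
    """
    # Pass 1: answer A shown first.
    a_wins_pass1 = judge_prefers_first(prompt, answer_a, answer_b)
    # Pass 2: answer B shown first; invert so both passes are "does A win?".
    a_wins_pass2 = not judge_prefers_first(prompt, answer_b, answer_a)

    if a_wins_pass1 == a_wins_pass2:
        return "A" if a_wins_pass1 else "B"
    return None  # inconsistent across orderings -> abstain instead of trusting either pass


# Toy usage: a deliberately position-biased "judge" that prefers whichever
# answer appears first 70% of the time, regardless of content.
def biased_fake_judge(prompt: str, first: str, second: str) -> bool:
    return random.random() < 0.7


if __name__ == "__main__":
    random.seed(0)
    verdicts = [
        debiased_verdict(biased_fake_judge, "Q?", "answer from model A", "answer from model B")
        for _ in range(1000)
    ]
    # With order-swapping, the fake judge's position bias mostly cancels out:
    # A and B win at roughly equal rates, with the rest counted as ties.
    print(f"A wins: {verdicts.count('A')}, B wins: {verdicts.count('B')}, ties: {verdicts.count(None)}")
```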
Role-playing agents need to juggle several goals at once, like staying in character, following instructions, and using the right tone.
Large language models forget or misuse new facts if you only edit their weights once; EtCon fixes this with a two-step plan.