MDAgent2 is a special helper built from large language models (LLMs) that can both answer questions about molecular dynamics and write runnable LAMMPS simulation code.
WebGym is a giant practice world (almost 300,000 tasks) that lets AI web agents learn on real, ever-changing websites instead of tiny, fake ones.
This paper teaches AI to solve diagram-based math problems by copying how people think: first see (perception), then make sense of what you saw (internalization), and finally reason (solve the problem).
COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.
K-EXAONE is a super-sized language model that speaks six languages and can read very long documents (up to 256,000 tokens) without forgetting important details.
This paper makes video editing easier by teaching an AI to spread changes from the first frame across the whole video smoothly and accurately.
OpenRT is a big, open-source test bench that safely stress-tests AI models that handle both text and images.
NitroGen is a vision-to-action AI that learns to play many video games by watching 40,000 hours of gameplay videos from over 1,000 titles with on-screen controller overlays.
OpenNovelty is a four-phase, AI-powered helper that checks how new a research paper’s ideas are by comparing them to real, retrieved papers.
This paper introduces MOSS Transcribe Diarize, a single model that writes down what people say in a conversation, tells who said each part, and marks the exact times—all in one go.
DrivingGen is a new, all-in-one test that fairly checks how well AI can imagine future driving videos and motions.
SWE-Lego shows that a simple training method called supervised fine-tuning (SFT), when done carefully, can teach AI to fix real software bugs very well.