This paper teaches a model to be its own teacher so it can climb out of a learning plateau on very hard math problems.
TSRBench is a giant benchmark that checks whether AI models can understand and reason about data that changes over time, like heartbeats, stock prices, and weather.
Large language models often learn one-size-fits-all preferences, but people are different, so we need personalization.
The paper finds almost 300 accepted NLP papers (mostly in 2025) that include at least one fake or non-existent reference, which the authors call a HalluCitation.
LingBot-VLA is a robot brain that listens to language, looks at the world, and decides smooth actions to get tasks done.
AdaReasoner teaches AI to pick the right visual tools, use them in the right order, and stop using them when they aren’t helping.
This paper shows how a video generator can improve its own videos during sampling, without extra training or outside checkers.
AgentDoG is a new ‘diagnostic guardrail’ that watches AI agents step-by-step and explains exactly why a risky action happened.
This paper teaches code AIs to work more like real software engineers by adding a mid-training stage built on real development workflows.
TriPlay-RL is a three-role self-play training loop (attacker, defender, evaluator) that teaches AI models to be safer with almost no manual labels.
This paper introduces TAM-Eval, a new way to test how well AI models can create, fix, and update unit tests for real software projects.
This paper builds an AI agent that learns new skills while working, like a kid who learns new tricks during recess without a teacher telling them what to do.