AgentDoG is a new ‘diagnostic guardrail’ that watches AI agents step-by-step and explains exactly why a risky action happened.
This paper teaches code AIs to work more like real software engineers by adding a mid-training stage built on real development workflows.
TriPlay-RL is a three-role self-play training loop (attacker, defender, evaluator) that teaches AI models to be safer with almost no manual labels.
This paper introduces TAM-Eval, a new way to test how well AI models can create, fix, and update unit tests for real software projects.
This paper builds an AI agent that learns new skills while working, like a kid who learns new tricks during recess without a teacher telling them what to do.
This paper teaches a language-model agent to look up facts in millions of scientific paper summaries and answer clear, single-answer questions.
SAGE is a two-agent system that automatically writes tough, multi-step search questions and checks them by actually trying to solve them.
This paper tackles the problem of understanding very long first-person videos (spanning days to a week) by giving an AI a smarter memory and better tools.
This paper shows how to speed up reinforcement learning (RL) for large language models (LLMs) by using lower-precision 8-bit floating-point numbers (FP8) without breaking training.
DeepPlanning is a new benchmark that tests whether AI can make long, realistic plans that fit time and money limits.
FABLE is a new retrieval system that helps AI find and combine facts from many documents by letting the AI both organize the library and choose the right shelves to read.
DRPG is a four-step AI helper that writes strong academic rebuttals by first breaking a review into parts, then fetching evidence, planning a strategy, and finally writing the response.