This paper teaches long-horizon AI agents to recall details exactly without loading their entire memory into context at once.
This paper teaches a language-model agent to explore smarter by combining two ways of learning (on-policy and off-policy) with a simple, self-written memory.
Tool-R0 teaches a language model to use software tools (like APIs) with zero human-made training data.
This paper teaches AI agents to make smart choices about when to explore for more information and when to act right away.
AI helpers often don’t know new users’ tastes and can’t keep up when those tastes change.
SkillsBench is a big test playground that measures whether giving AI agents step-by-step 'Skills' actually helps them finish real tasks.
EcoGym is a new open test playground where AI agents run small businesses over many days to see if they can plan well for the long term.
This paper asks a simple question: do tests written by AI coding agents actually help them fix real software bugs, or do they just look helpful?
AIRS-Bench is a new test suite that checks whether AI research agents can do real machine learning research from start to finish, not just answer questions.
AgenticPay is a safe playground where AI agents practice buying and selling by negotiating in conversation, not just by exchanging numbers.
Large language models are great at words, but they struggle to predict what will happen after they act in a changing world.
Before this work, AI agents often stopped to run safety checks at every single step, which made them slow and still easy to trick in sneaky ways.