This paper teaches a language-model agent to explore smarter by combining two ways of learning (on-policy and off-policy) with a simple, self-written memory.
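The summary doesn't spell out the training loop, so the sketch below is an assumption about the general recipe, not the paper's method: learn from fresh rollouts (on-policy), re-learn from stored past experience (off-policy), and keep a plain-text memory the agent writes for itself. The toy two-armed bandit and every name in it are illustrative.

```python
import random

ARM_PAYOFF = {0: 0.3, 1: 0.7}   # hidden reward probabilities of a toy bandit
q = {0: 0.0, 1: 0.0}            # the agent's action-value estimates
replay = []                     # off-policy buffer of past (arm, reward) pairs
memory = []                     # self-written text lessons carried across episodes

def pull(arm):
    return 1.0 if random.random() < ARM_PAYOFF[arm] else 0.0

for episode in range(200):
    # On-policy: act with the current epsilon-greedy policy, learn from the result.
    arm = random.choice([0, 1]) if random.random() < 0.1 else max(q, key=q.get)
    reward = pull(arm)
    q[arm] += 0.1 * (reward - q[arm])
    replay.append((arm, reward))

    # Off-policy: also learn from a randomly replayed past transition.
    old_arm, old_reward = random.choice(replay)
    q[old_arm] += 0.05 * (old_reward - q[old_arm])

    # Self-written memory: periodically record a lesson in plain text.
    if episode % 50 == 49:
        memory.append(f"after {episode + 1} pulls, arm {max(q, key=q.get)} looks best")

print(q)
print(memory)
```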
This paper teaches AI agents to make smart choices about when to explore for more information and when to act right away.
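The summary doesn't give the paper's actual decision rule, so the sketch below shows one common framing as an assumption: keep exploring while the top two candidate actions are too close to call, and commit once the gap exceeds a threshold. The function name, the min_gap value, and the observed rewards are all made up for illustration.

```python
import statistics

def should_explore(samples_by_option, min_gap=0.1):
    """Explore if the top two options' mean rewards are too close to call."""
    means = sorted(
        (statistics.mean(rewards) for rewards in samples_by_option.values()),
        reverse=True,
    )
    return len(means) < 2 or (means[0] - means[1]) < min_gap

# Fake reward samples the agent has gathered so far for two candidate tools.
observed = {"tool_a": [0.8, 0.7, 0.9], "tool_b": [0.75, 0.85]}
print("explore" if should_explore(observed) else "act now")
```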
SkillsBench is a big test playground that measures whether giving AI agents step-by-step 'Skills' actually helps them finish real tasks.
EcoGym is a new open test playground where AI agents run small businesses over many days to see if they can plan well for the long term.
This paper asks a simple question: do tests written by AI coding agents actually help them fix real software bugs, or do they just look helpful?
AIRS-Bench is a new test suite that checks whether AI research agents can do real machine learning research from start to finish, not just answer questions.
Large language models are great at words, but they struggle to predict what will happen after they act in a changing world.
Before this work, AI agents often stopped to run safety checks at every single step, which made them slow and still easy to trick in sneaky ways.
This paper says we should measure an AI agent’s uncertainty across its whole conversation, not just on one final answer.
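As a rough illustration of the idea (the aggregation rule here is assumed, not quoted from the paper), one can score each assistant turn by its mean token surprisal and then summarize over the whole dialogue, e.g. by taking the worst turn, rather than looking only at the final answer. The per-token probabilities below are fabricated stand-ins.

```python
import math

def turn_surprisal(token_probs):
    """Mean per-token surprisal (nats) for one assistant turn."""
    return sum(-math.log(p) for p in token_probs) / len(token_probs)

# Fake per-token probabilities for three assistant turns (illustrative only).
turns = [[0.9, 0.8, 0.95], [0.4, 0.5, 0.3], [0.85, 0.9, 0.88]]
per_turn = [turn_surprisal(t) for t in turns]

print("final-answer-only score:", round(per_turn[-1], 3))
print("whole-dialogue score (worst turn):", round(max(per_turn), 3))
```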
Agent-Omit teaches AI agents to skip unneeded thinking steps and stale observations, cutting token use while keeping accuracy high.
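The paper reportedly learns what to omit; the sketch below substitutes a much simpler fixed rule just to show the shape of the idea: before each step, observations outside a recent window are collapsed to a one-line stub so the prompt shrinks. The window size, message schema, and stub text are assumptions.

```python
KEEP_LAST = 3  # assumed window size, purely illustrative

def prune_context(messages):
    """Collapse all but the last KEEP_LAST observations to a short stub."""
    obs_indices = [i for i, m in enumerate(messages) if m["role"] == "observation"]
    stale = set(obs_indices[:-KEEP_LAST])
    return [
        {**m, "content": "[omitted stale observation]"} if i in stale else m
        for i, m in enumerate(messages)
    ]

history = [{"role": "observation", "content": f"step {i} output"} for i in range(6)]
for m in prune_context(history):
    print(m["content"])
```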
MemSkill turns memory operations for AI agents into learnable skills instead of fixed, hand-made rules.
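To make that framing concrete under assumed interfaces (none of these names come from the paper), memory operations can be exposed as callable skills in a registry, so a learned controller decides when to write, read, or delete instead of following hand-made rules.

```python
memory = {}  # the agent's key-value memory store

def skill_mem_write(key, value):
    memory[key] = value
    return f"stored '{key}'"

def skill_mem_read(key):
    return memory.get(key, "(no entry)")

def skill_mem_delete(key):
    return f"deleted '{key}'" if memory.pop(key, None) is not None else "(no entry)"

# A learned policy would pick skills by name; here we invoke them directly.
SKILLS = {"write": skill_mem_write, "read": skill_mem_read, "delete": skill_mem_delete}
print(SKILLS["write"]("user_goal", "book a flight"))
print(SKILLS["read"]("user_goal"))
print(SKILLS["delete"]("user_goal"))
```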
This paper studies how AI agents get better while they are working, not just whether they finish the job.