Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.
PACEvolve is a new recipe that helps AI agents improve their ideas step by step over long periods without getting stuck.
This paper builds an AI agent, ML-Master 2.0, that can work on machine learning projects for a very long time without forgetting what matters.
ToolSafe is a new way to keep AI agents safe when they use external tools, by checking each action before it runs.
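ToolSafe's actual design is not described here; as a generic illustration of the "check each action before it runs" pattern, here is a minimal sketch in which every name, the deny-list, and the policy are hypothetical:

```python
# Hypothetical sketch of a pre-execution guard for agent tool calls.
# This is NOT ToolSafe's actual mechanism; it only illustrates the
# general idea of vetting each proposed action against a policy.

BLOCKED_COMMANDS = {"rm", "shutdown", "mkfs"}  # toy deny-list

def check_action(tool: str, args: dict) -> bool:
    """Return True if the proposed action is allowed to run."""
    if tool == "shell":
        parts = args.get("command", "").split()
        return not (parts and parts[0] in BLOCKED_COMMANDS)
    return True  # other tools allowed by default in this toy policy

def run_tool(tool: str, args: dict) -> dict:
    """Gate every tool call through the policy before dispatching."""
    if not check_action(tool, args):
        return {"status": "blocked", "reason": "policy violation"}
    # ... dispatch to the real tool implementation here ...
    return {"status": "ok"}
```

The key design point is that the check happens on the proposed action itself, before any side effect occurs, rather than on the agent's output after the fact.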
MAXS is a new way for AI agents to think a few steps ahead while using tools like search and code, so they make smarter choices.
Agents often act like tourists without a map: they react to what they see now and miss long-term consequences.
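How MAXS actually performs its lookahead is not detailed here; the general idea of "think a few steps ahead" can be sketched as a toy rollout search, with the world model, the scoring function, and the numbers all made up for illustration:

```python
# Toy sketch of k-step lookahead over agent actions (not MAXS itself).
# The agent simulates short futures with a cheap model of the world and
# picks the first action whose best rollout scores highest.

def lookahead(state, actions, simulate, score, depth=2):
    """Return the action whose best `depth`-step rollout scores highest.

    simulate(state, action) -> next_state  (cheap model of the world)
    score(state) -> float                  (higher is better)
    """
    def best_value(s, d):
        if d == 0:
            return score(s)
        return max(best_value(simulate(s, a), d - 1) for a in actions)
    return max(actions, key=lambda a: best_value(simulate(state, a), depth - 1))

# Toy usage: reach a target sum of 10 starting from 0, steps of 1, 3, or 5.
simulate = lambda s, a: s + a
score = lambda s: -abs(10 - s)
best_first_move = lookahead(0, [1, 3, 5], simulate, score, depth=2)
```

With two steps of lookahead, the action 5 wins because the rollout 5 + 5 lands exactly on the target, while a purely greedy agent would only compare single steps.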
This paper teaches AI to build and improve its own small helper programs (tools) while solving science problems, instead of relying only on a fixed toolbox made in advance.
OpenTinker is an open-source system that makes training AI agents with reinforcement learning simple, modular, and reusable.
This paper studies how AI agents that use tools talk about how sure they are and finds a split: some tools make them too sure, others help them be honest.
EnvScaler is an automatic factory that builds many safe, rule-following practice worlds where AI agents can talk to users and call tools, just like real apps.
This paper turns an AI agent’s memory from a flat list of notes into a logic map of events connected by cause-and-time links.
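The paper's actual memory structure is not spelled out here; a minimal generic sketch of the idea — events as nodes linked by cause-and-time edges instead of a flat list — might look like this, with all names hypothetical:

```python
# Toy sketch of agent memory as an event graph (not the paper's design).
# Each event carries a timestamp; "causes" edges link it to the events
# it triggered, so the agent can trace consequences, not just recall notes.

from dataclasses import dataclass, field

@dataclass
class Event:
    name: str
    time: int
    causes: list = field(default_factory=list)  # events this one caused

class EventMemory:
    def __init__(self):
        self.events = {}

    def add(self, name, time, caused_by=None):
        e = Event(name, time)
        self.events[name] = e
        if caused_by:
            self.events[caused_by].causes.append(e)
        return e

    def effects_of(self, name):
        """All downstream events reachable through cause links."""
        out, stack = [], list(self.events[name].causes)
        while stack:
            e = stack.pop()
            out.append(e.name)
            stack.extend(e.causes)
        return out

# Toy usage: a config edit triggers a restart, which clears an error.
m = EventMemory()
m.add("edit_config", 1)
m.add("service_restart", 2, caused_by="edit_config")
m.add("error_cleared", 3, caused_by="service_restart")
```

A flat list could only answer "what happened?"; the graph can also answer "what did this action lead to?", which is the point of the cause-and-time links.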
This paper teaches a computer agent to grow a toolbox of skills that are real, runnable programs, not just text ideas.
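The paper's system is not detailed here; the contrast it draws — skills stored as runnable programs rather than text descriptions — can be sketched with a toy registry in which every name is hypothetical:

```python
# Toy sketch of a skill library whose entries are executable code,
# not the paper's actual system. Each skill pairs a human-readable
# description with a callable the agent can actually run.

class SkillLibrary:
    def __init__(self):
        self.skills = {}

    def register(self, name, fn, description=""):
        self.skills[name] = {"fn": fn, "description": description}

    def run(self, name, *args, **kwargs):
        return self.skills[name]["fn"](*args, **kwargs)

lib = SkillLibrary()
lib.register("word_count", lambda text: len(text.split()),
             "Count whitespace-separated words in a string.")
result = lib.run("word_count", "grow a toolbox of skills")
```

Because each entry is a real program, a skill can be executed, tested, and improved directly, whereas a text-only "idea" would still need to be re-implemented every time.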