Multi-agent LLM systems often use LoRA adapters to give each agent a specialized role, but every agent then rebuilds almost the same KV cache, wasting memory and time.
CAR-bench is a new 'driving test' for AI assistants that checks if they can stay careful, honest, and consistent during real back-and-forth conversations in a car.
Idea2Story is a two-stage system that first studies many accepted research papers offline and then uses that knowledge online to turn a vague idea into a full scientific plan.
Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.
PACEvolve is a new recipe that helps AI agents improve their ideas step by step over long periods without getting stuck.
This paper builds an AI agent, ML-Master 2.0, that can work on machine learning projects for a very long time without forgetting what matters.
ToolSafe is a new way to keep AI agents safe when they use external tools, by checking each action before it runs.
MAXS is a new way for AI agents to think a few steps ahead while using tools like search and code, so they make smarter choices.
Agents often act like tourists without a map: they react to what they see now and miss long-term consequences.
This paper teaches AI to build and improve its own small computer helpers (tools) while solving science problems, instead of relying only on a fixed toolbox prepared in advance.
OpenTinker is an open-source system that makes training AI agents with reinforcement learning simple, modular, and reusable.
EnvScaler is an automatic factory that builds many safe, rule-following practice worlds where AI agents can talk to users and call tools, just like real apps.