Large language models are great with words, but they struggle to predict the consequences of their own actions in a changing world.
Before this work, AI agents often stopped to run safety checks at every single step, which made them slow yet still vulnerable to subtle attacks.
This paper argues that we should measure an AI agent’s uncertainty across its whole conversation, not just its final answer.
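The paper's exact metric isn't reproduced here, but a minimal sketch of the general idea, scoring confidence over every step of a dialogue instead of only the final answer, might look like this (the `step_logprobs` input format and the mean negative log-likelihood aggregation are illustrative assumptions):

```python
def trajectory_uncertainty(step_logprobs: list[list[float]]) -> float:
    """Aggregate uncertainty over a whole agent conversation.

    step_logprobs: one list of token log-probabilities per step, an assumed
    input format for illustration. Returns the mean per-token negative
    log-likelihood across ALL steps, so a confident final answer cannot
    hide shaky reasoning in the middle of the conversation.
    """
    total_nll = sum(-lp for step in step_logprobs for lp in step)
    total_tokens = sum(len(step) for step in step_logprobs)
    return total_nll / max(total_tokens, 1)

# Three steps of a conversation; the middle step is far less confident,
# which a final-answer-only score would miss entirely.
steps = [
    [-0.1, -0.2, -0.1],  # confident step
    [-2.3, -1.9, -2.7],  # uncertain step
    [-0.2, -0.1, -0.3],  # confident final answer
]
print(round(trajectory_uncertainty(steps), 3))
```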
Agent-Omit teaches AI agents to skip unneeded thinking and old observations, cutting tokens while keeping accuracy high.
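Agent-Omit's omission policy is learned, so the rule below is only a stand-in, but the basic mechanism, dropping stale observations from the prompt before the next model call, can be sketched like this (the message format and the fixed keep-last-k rule are assumptions):

```python
def prune_context(messages: list[dict], keep_last_k_obs: int = 2) -> list[dict]:
    """Drop all but the k most recent tool observations from an agent's
    message history before the next LLM call. The fixed keep-last-k rule
    is an illustrative stand-in for Agent-Omit's learned omission policy.
    """
    obs_idx = [i for i, m in enumerate(messages) if m["role"] == "observation"]
    stale = set(obs_idx[:-keep_last_k_obs]) if keep_last_k_obs else set(obs_idx)
    return [m for i, m in enumerate(messages) if i not in stale]

history = [
    {"role": "user", "content": "Book me a flight."},
    {"role": "observation", "content": "search results page 1 ..."},
    {"role": "assistant", "content": "Checking more options."},
    {"role": "observation", "content": "search results page 2 ..."},
    {"role": "observation", "content": "seat map ..."},
]
for m in prune_context(history, keep_last_k_obs=2):
    print(m["role"], "->", m["content"][:22])  # page-1 results are omitted
```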
MemSkill turns memory operations for AI agents into learnable skills instead of fixed, hand-written rules.
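MemSkill's training procedure isn't shown here, but the structural shift it describes, exposing memory operations as named skills a policy can choose among rather than firing fixed rules, might be sketched like this (the skill names and dispatch interface are assumptions):

```python
class MemoryStore:
    """Each method is a named 'skill' an agent's learned policy can invoke,
    rather than a hard-coded rule like 'summarize every 10 turns'."""

    def __init__(self) -> None:
        self.entries: list[str] = []

    def write(self, text: str) -> str:
        self.entries.append(text)
        return f"stored: {text}"

    def retrieve(self, query: str) -> str:
        hits = [e for e in self.entries if query.lower() in e.lower()]
        return "; ".join(hits) or "no matches"

    def forget(self, query: str) -> str:
        before = len(self.entries)
        self.entries = [e for e in self.entries if query.lower() not in e.lower()]
        return f"forgot {before - len(self.entries)} entries"

mem = MemoryStore()
skills = {"write": mem.write, "retrieve": mem.retrieve, "forget": mem.forget}

# A trained policy would emit (skill, argument) pairs; we call the skills
# directly here just to show the interface.
print(skills["write"]("User prefers window seats."))
print(skills["retrieve"]("window"))
```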
This paper studies how AI agents improve while they work, not just whether they finish the job.
Multi-agent LLM systems often use LoRA adapters so each agent has a special role, but they all rebuild almost the same KV cache, wasting memory and time.
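The paper's actual serving system isn't reproduced here, but the waste it points at is easy to see with a toy prefix cache: if cache entries are keyed per adapter, every agent prefills the same shared prompt again, while sharing the near-identical base-model KV builds it once (the cache key and word-count cost model below are illustrative assumptions):

```python
def prefill_cost(prompts_per_agent: dict[str, str], share_base_kv: bool) -> int:
    """Toy model of prefill work: count the tokens each agent's prompt adds
    to the cache. Keying the cache per adapter forces every agent to rebuild
    a near-identical KV cache; sharing the base-model KV builds it once.
    (The cache key and word-count cost model are illustrative assumptions.)
    """
    cache = set()
    tokens_prefilled = 0
    for agent, prompt in prompts_per_agent.items():
        key = prompt if share_base_kv else (agent, prompt)
        if key not in cache:
            cache.add(key)
            tokens_prefilled += len(prompt.split())  # crude token count
    return tokens_prefilled

# Four role-specialized agents that all start from the same long preamble.
agents = {f"agent_{i}": "You are one agent on a team solving tasks. " * 50
          for i in range(4)}
print(prefill_cost(agents, share_base_kv=False))  # rebuilt per adapter: 4x
print(prefill_cost(agents, share_base_kv=True))   # shared base KV: 1x
```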
CAR-bench is a new 'driving test' for AI assistants that checks if they can stay careful, honest, and consistent during real back-and-forth conversations in a car.
Idea2Story is a two-stage system that first studies many accepted research papers offline and then uses that knowledge online to turn a vague idea into a full scientific plan.
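Idea2Story's pipeline isn't reproduced here, but the offline/online split it describes can be sketched as a two-function skeleton (the keyword index, the corpus format, and the plan template are all assumptions):

```python
def build_knowledge(papers: list[str]) -> dict[str, list[str]]:
    """Offline stage: index a corpus of accepted papers by keyword."""
    index: dict[str, list[str]] = {}
    for paper in papers:
        for word in set(paper.lower().split()):
            index.setdefault(word, []).append(paper)
    return index

def draft_plan(idea: str, index: dict[str, list[str]]) -> list[str]:
    """Online stage: pull prior work related to a vague idea, then outline."""
    related = {p for w in idea.lower().split() for p in index.get(w, [])}
    return [f"Motivation: {idea}",
            f"Related work: {sorted(related)}",
            "Method: ...", "Experiments: ...", "Expected results: ..."]

index = build_knowledge(["Retrieval helps agents plan", "Agents forget context"])
for line in draft_plan("help agents plan better", index):
    print(line)
```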
Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.
PACEvolve is a new recipe that helps AI agents improve their ideas step by step over long periods without getting stuck.
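PACEvolve's actual algorithm isn't shown here, but the failure it targets, iterative search that stalls, can be illustrated with a generic hill-climbing loop that jumps away from a plateau after too many steps without progress (the restart rule and toy objective are assumptions):

```python
import random

def evolve(seed, score, mutate, steps=200, patience=20):
    """Generic improve-without-stalling loop, an illustrative stand-in for
    PACEvolve (whose actual recipe is not reproduced here): hill-climb, but
    after `patience` steps with no gain, take a big random jump from the
    best candidate found so far instead of grinding in place."""
    best, best_s = seed, score(seed)
    cur, cur_s, stale = best, best_s, 0
    for _ in range(steps):
        nxt = mutate(cur)
        nxt_s = score(nxt)
        if nxt_s > cur_s:
            cur, cur_s, stale = nxt, nxt_s, 0
            if cur_s > best_s:
                best, best_s = cur, cur_s
        else:
            stale += 1
            if stale >= patience:  # stuck: escape via a large perturbation
                cur = mutate(mutate(mutate(best)))
                cur_s = score(cur)
                stale = 0
    return best, best_s

# Toy usage: maximize -(x - 3)^2 with random nudges.
random.seed(0)
best, s = evolve(0.0, lambda x: -(x - 3) ** 2,
                 lambda x: x + random.uniform(-0.5, 0.5))
print(round(best, 2), round(s, 4))
```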
This paper builds an AI agent, ML-Master 2.0, that can work on machine learning projects for a very long time without forgetting what matters.