This paper argues that we should measure an AI agent’s uncertainty across its whole conversation, not just on its one final answer.
Agent-Omit teaches AI agents to skip unneeded thinking and old observations, cutting tokens while keeping accuracy high.
RLAnything is a new reinforcement learning (RL) framework that trains three things together at once: the policy (the agent), the reward model (the judge), and the environment (the tasks).
MemSkill turns memory operations for AI agents into learnable skills instead of fixed, hand-made rules.
This paper studies how AI agents get better while they are working, not just whether they finish the job.
Multi-agent LLM systems often use LoRA adapters so each agent has a special role, but they all rebuild almost the same KV cache, wasting memory and time.
CAR-bench is a new “driving test” for AI assistants that checks whether they can stay careful, honest, and consistent during real back-and-forth conversations in a car.
Idea2Story is a two-stage system that first studies many accepted research papers offline and then uses that knowledge online to turn a vague idea into a full scientific plan.
LLM agents are usually trained in a few environments but asked to work in many different, unseen ones, which often hurts their performance.
AI agents often act very sure of themselves even when they are wrong, especially on long, multi-step tasks.
Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.
PACEvolve is a new recipe that helps AI agents improve their ideas step by step over long periods without getting stuck.