Large language models are great at words, but they struggle to predict what will happen after they act in a changing world.
Before this work, AI agents often stopped to run safety checks at every single step, which made them slow and still easy to trick in sneaky ways.
This paper says we should measure an AI agent’s uncertainty across its whole conversation, not just on one final answer.
Agent-Omit teaches AI agents to skip unneeded thinking and old observations, cutting tokens while keeping accuracy high.
RLAnything is a new reinforcement learning (RL) framework that trains three things at once: the policy (the agent), the reward model (the judge), and the environment (the tasks).
MemSkill turns memory operations for AI agents into learnable skills instead of fixed, hand-made rules.
This paper studies how AI agents get better while they are working, not just whether they finish the job.
Multi-agent LLM systems often use LoRA adapters so each agent has a special role, but every agent rebuilds almost the same KV cache, wasting memory and time.
CAR-bench is a new 'driving test' for AI assistants that checks if they can stay careful, honest, and consistent during real back-and-forth conversations in a car.
Idea2Story is a two-stage system that first studies many accepted research papers offline and then uses that knowledge online to turn a vague idea into a full scientific plan.
LLM agents are usually trained in a few worlds but asked to work in many different, unseen worlds, which often hurts their performance.
AI agents often act very sure of themselves even when they are wrong, especially on long, multi-step tasks.