Papers40

#LLM agents

Towards Reducible Uncertainty Modeling for Reliable Large Language Model Agents

Changdae Oh, Seongheon Park et al.Feb 4arXiv

This paper says we should measure an AI agent’s uncertainty across its whole conversation, not just on one final answer.

#uncertainty quantification#LLM agents#interactive AI

Agent-Omit: Training Efficient LLM Agents for Adaptive Thought and Observation Omission via Agentic Reinforcement Learning

Intermediate

Yansong Ning, Jun Fang et al.Feb 4arXiv

Agent-Omit teaches AI agents to skip unneeded thinking and old observations, cutting tokens while keeping accuracy high.

#LLM agents#reinforcement learning#agentic RL

RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Beginner

Yinjie Wang, Tianbao Xie et al.Feb 2arXiv

RLAnything is a new reinforcement learning (RL) framework that trains three things together at once: the policy (the agent), the reward model (the judge), and the environment (the tasks).

#reinforcement learning#closed-loop optimization#reward modeling

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Intermediate

Haozhen Zhang, Quanyu Long et al.Feb 2arXiv

MemSkill turns memory operations for AI agents into learnable skills instead of fixed, hand-made rules.

#memory skills#LLM agents#skill bank

TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

Intermediate

Hang Yan, Xinyu Che et al.Feb 2arXiv

This paper studies how AI agents get better while they are working, not just whether they finish the job.

#Test-Time Improvement#LLM agents#trajectory analysis

LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

Intermediate

Hyesung Jeon, Hyeongju Ha et al.Feb 1arXiv

Multi-agent LLM systems often use LoRA adapters so each agent has a special role, but they all rebuild almost the same KV cache, wasting memory and time.

#LoRA#Multi-LoRA#KV cache

CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Intermediate

Johannes Kirmayr, Lukas Stappen et al.Jan 29arXiv

CAR-bench is a new 'driving test' for AI assistants that checks if they can stay careful, honest, and consistent during real back-and-forth conversations in a car.

#LLM agents#benchmarking#consistency

Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives

Intermediate

Tengyue Xu, Zhuoyang Qian et al.Jan 28arXiv

Idea2Story is a two-stage system that first studies many accepted research papers offline and then uses that knowledge online to turn a vague idea into a full scientific plan.

#autonomous scientific discovery#knowledge graph#method unit extraction

Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents

Beginner

Zhihan Liu, Lin Guan et al.Jan 26arXiv

LLM agents are usually trained in a few worlds but asked to work in many different, unseen worlds, which often hurts their performance.

#cross-domain generalization#state information richness#planning complexity

Agentic Confidence Calibration

Beginner

Jiaxin Zhang, Caiming Xiong et al.Jan 22arXiv

AI agents often act very sure of themselves even when they are wrong, especially on long, multi-step tasks.

#agentic confidence calibration#holistic trajectory calibration#general agent calibrator

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Intermediate

Mike A. Merrill, Alexander G. Shaw et al.Jan 17arXiv

Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.

#Terminal-Bench#command line interface#Docker containers

PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution

Intermediate

Minghao Yan, Bo Peng et al.Jan 15arXiv

PACEvolve is a new recipe that helps AI agents improve their ideas step by step over long periods without getting stuck.

#evolutionary search#LLM agents#context management

1 2 3 4