Papers (40)


#LLM agents

Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Beginner

Zhenting Wang, Huancheng Chen et al. · Mar 4 · arXiv

This paper teaches long-horizon AI agents to remember everything exactly without stuffing their whole memory into the context window at once.

#indexed memory #LLM agents #long-horizon tasks


Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Intermediate

Zeyuan Liu, Jeonghye Kim et al. · Feb 26 · arXiv

This paper teaches a language-model agent to explore smarter by combining two ways of learning (on-policy and off-policy) with a simple, self-written memory.

#EMPO #memory-augmented agents #on-policy learning


Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data

Beginner

Emre Can Acikgoz, Cheng Qian et al. · Feb 24 · arXiv

Tool-R0 teaches a language model to use software tools (like APIs) with zero human-made training data.

#self-play reinforcement learning #tool calling #function calling


Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

Intermediate

Wenxuan Ding, Nicholas Tomlin et al. · Feb 18 · arXiv

This paper teaches AI agents to make smart choices about when to explore for more information and when to act right away.

#Calibrate-Then-Act #cost-aware exploration #LLM agents


Learning Personalized Agents from Human Feedback

Beginner

Kaiqu Liang, Julia Kruk et al. · Feb 18 · arXiv

AI helpers often don’t know new users’ tastes and can’t keep up when those tastes change.

#personalization #human feedback #pre-action clarification


SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Intermediate

Xiangyi Li, Wenbo Chen et al. · Feb 13 · arXiv

SkillsBench is a big test playground that measures whether giving AI agents step-by-step 'Skills' actually helps them finish real tasks.

#Agent Skills #LLM agents #Benchmarking


EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

Intermediate

Xavier Hu, Jinxiang Xia et al. · Feb 10 · arXiv

EcoGym is a new open test playground where AI agents run small businesses over many days to see if they can plan well for the long term.

#EcoGym #long-horizon planning #LLM agents


Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

Intermediate

Zhi Chen, Zhensu Sun et al. · Feb 8 · arXiv

This paper asks a simple question: do tests written by AI coding agents actually help them fix real software bugs, or do they just look helpful?

#LLM agents #agent-written tests #software engineering agents


AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents

Intermediate

Alisia Lupidi, Bhavul Gauri et al. · Feb 6 · arXiv

AIRS-Bench is a new test suite that checks whether AI research agents can do real machine learning research from start to finish, not just answer questions.

#AIRS-Bench #AI research agents #LLM agents


AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions

Beginner

Xianyang Liu, Shangding Gu et al. · Feb 5 · arXiv

AgenticPay is a safe playground where AI agents practice buying and selling by talking, not just by typing numbers.

#multi-agent negotiation #language-mediated bargaining #LLM agents


Reinforcement World Model Learning for LLM-based Agents

Intermediate

Xiao Yu, Baolin Peng et al. · Feb 5 · arXiv

Large language models are great at words, but they struggle to predict what will happen after they act in a changing world.

#Reinforcement World Model Learning #world modeling #LLM agents


Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

Intermediate

Zhenxiong Yu, Zhi Yang et al. · Feb 5 · arXiv

Before this work, AI agents often stopped to run safety checks at every single step, which made them slow and still easy to trick in sneaky ways.

#Intrinsic Risk Sensing #Event-driven defense #Hierarchical Adaptive Screening

