This paper teaches a language-model agent to explore more effectively by combining two ways of learning (on-policy and off-policy) with a simple memory that the agent writes itself.
This paper organizes how AI agents learn and improve into one simple map with four roads: A1, A2, T1, and T2.