This paper shows how to fairly evaluate "general-purpose" AI agents, which are meant to work across many environments without environment-specific tuning.
This paper trains AI agents to acquire new reusable skills and improve over time through reinforcement learning rather than prompting alone.