AgentLongBench is a new benchmark that tests how well AI agents reason over very long histories made up of their own actions and the environment's replies, rather than over static documents alone.
The paper also introduces a method for generating realistic, long interactions between people and AI agents that use tools such as databases.