This paper shows how to fairly test "general-purpose" AI agents, meaning agents that should work across many different environments without special tweaks for each one.
Computer-using agents kept forgetting important visual details over long tasks and could not reliably find up-to-date, step-by-step help for unfamiliar apps.