This paper argues we should test AI the way real life works: by letting it ask questions, gather clues, and make smart moves step by step under a limited budget.
This paper shows how to fairly test "general-purpose" AI agents that are meant to work in many settings without special tweaks for each one.
The paper argues that the fairest way to check how generally smart an AI is would be to see how quickly and how well it learns many different human-made games, given the same time and practice a person would get.