This paper shows a simple way to turn many 'too-easy' questions into harder, still-checkable ones so that AI keeps learning instead of stalling.
Researchers widely use tests (benchmarks) to judge how capable AI models are, but not all tests are good tests.