Benchmark^2: Systematic Evaluation of LLM Benchmarks
Intermediate · Qi Qian, Chengsong Huang et al. · Jan 7 · arXiv
Everyone uses tests (benchmarks) to judge how smart AI models are, but not all tests are good tests.
#LLM evaluation #benchmark quality #ranking consistency