Papers (2)

#dynamic evaluation

Interactive Benchmarks

This paper argues that AI should be tested the way it is used in real life: interactively, by letting it ask questions, gather clues, and act step by step under a limited budget.

#interactive benchmarks, #information acquisition, #budgeted reasoning

GISA: A Benchmark for General Information-Seeking Assistant

Intermediate

Yutao Zhu, Xingshuo Zhang et al. · Feb 9 · arXiv

GISA is a new benchmark that measures how well AI assistants can search the web the way real people do.

#GISA, #information-seeking agents, #web search benchmark
