
GISA: A Benchmark for General Information-Seeking Assistant

Intermediate
Yutao Zhu, Xingshuo Zhang, Maosen Zhang et al. · 2/9/2026
arXiv

Key Summary

  • GISA is a new test (benchmark) that checks how well AI assistants can search the web like real people do.
  • Unlike many past tests, GISA’s questions are written by humans from real curiosity, not built backward from answers.
  • It measures both deep reasoning (digging into one topic) and wide aggregation (collecting from many places) in the same tasks.
  • Answers must fit one of four structured formats—item, set, list, or table—so scoring can be exact and fair.
  • GISA includes a live subset with regularly updated answers to prevent models from just memorizing.
  • Every question has a full human search trajectory, so we can see and learn the exact steps people took.
  • Top models still scored only about 19.30% exact match, showing these tasks are genuinely hard.
  • Thinking mode (letting a model reason step by step) helps, but too many tool calls can hurt due to noise.
  • Commercial deep research products did not beat well-set-up LLM agents on GISA.
  • GISA highlights clear gaps—planning, careful browsing, conflict checking, and strict formatting—that the next generation of search agents must fix.

Why This Research Matters

Real people need AI that can actually search the web carefully, not just sound confident. GISA pushes AI assistants to plan better, browse deeper, and verify facts before answering. Its strict formats mean answers arrive in clean lists and tables you can use right away. The live subset keeps the test fresh, so models can’t just coast on old memory. Human search trajectories teach agents practical steps that mirror how skilled people investigate. As these agents improve on GISA, your homework help, news checks, and research summaries become faster and more reliable.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re doing a school project about famous festivals. With a regular search engine, you type a question, click lots of links, read long pages, and try to stitch together an answer yourself. That’s a lot of work!

🥬 The Concept (Search agents): An information-seeking agent is an AI helper that searches, clicks, reads, and pieces together answers for you. How it works:

  1. Understand your question.
  2. Plan what to search first.
  3. Run multiple web searches.
  4. Open pages, read, and extract key facts.
  5. Check across sources and combine everything into a clean answer.

Why it matters: Without agents, you do all the clicking and summarizing yourself—it’s slow and easy to miss things.

🍞 Anchor: You ask, “List all winners of the 1990s music award X with years.” An agent searches, browses different sites, verifies disagreements, and returns a tidy table.
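The five-step loop above can be sketched as a tiny agent. This is a minimal illustration, not the paper's implementation: `search` and `browse` are hypothetical stand-ins for real web tools, stubbed with canned data so the flow is runnable.

```python
# Minimal sketch of an information-seeking agent loop.
# search() and browse() are hypothetical stubs for real web tools.

def search(query):
    # Stub: a real tool would return live search results.
    return [{"title": "Festival X winners", "url": "https://example.org/winners"}]

def browse(url):
    # Stub: a real tool would fetch and summarize the page.
    return "1991: Band A\n1992: Band B"

def answer_question(question):
    plan = [question]                      # steps 1-2: understand + plan searches
    facts = []
    for query in plan:                     # step 3: run searches
        for result in search(query):
            page = browse(result["url"])   # step 4: open pages, extract facts
            facts.extend(page.splitlines())
    return sorted(set(facts))              # step 5: combine into a clean answer

print(answer_question("List all winners of 1990s music award X"))
```

A real agent would generate new queries from what it reads and verify conflicts across pages; the skeleton only shows where those steps slot in.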

🍞 Hook: You know how some tests don’t really feel like real life? Like practicing only trick questions for a science quiz.

🥬 The Concept (Benchmarks): A benchmark is a standardized test used to measure how good AIs are at a job. How it works:

  1. Collect questions and correct answers.
  2. Make clear scoring rules.
  3. Run different AIs and compare scores.

Why it matters: Without solid benchmarks, we can’t tell which AI truly helps in real-life tasks.

🍞 Anchor: A reading test that fairly checks main idea, details, and vocabulary tells you who’s really good at reading.

🍞 Hook: Imagine making a riddle after peeking at the answer first. It might be clever—but not like real questions you’d naturally ask.

🥬 The Concept (Reverse-engineered queries): These are questions built backward from a known answer to be hard. How it works:

  1. Pick an answer.
  2. Design a tricky path to get there.
  3. Turn it into a question.

Why it matters: Results can feel unrealistic—good scores here may not mean better help in daily life.

🍞 Anchor: Asking “Which 1997 album by the band that headlined Festival Y starts with the letter C?” is less natural than “What albums did that band release in the 1990s?”

🍞 Hook: Think of exploring a cave (deep) versus exploring a whole neighborhood (wide). Both are important depending on the goal.

🥬 The Concept (Deep vs. Wide Search): Deep search means following many steps on one thread; wide search means gathering from many sources and combining. How it works:

  • Deep: click into linked pages, track dates, tie events together.
  • Wide: find many trustworthy sources and merge facts.

Why it matters: Real questions often need both—digging into details and comparing across places.

🍞 Anchor: To fill a table of all South Korean presidents and whether each declared martial law, you must collect every president (wide) and inspect each tenure’s history (deep).

🍞 Hook: If a friend already knows the answer by heart, you’re not testing their searching—you’re testing their memory.

🥬 The Concept (Data contamination/memorization): When a model has seen the answers during its training, it may answer from memory instead of searching. How it works:

  1. Models are trained on lots of web text.
  2. If a benchmark’s answers are static and public, they might be memorized.
  3. Scores then stop reflecting true search ability.

Why it matters: We want to measure real-time finding, not recall from training.

🍞 Anchor: If a model correctly answers a question whose answer changed last month, it must have actually searched—otherwise we can’t trust the skill being measured.

🍞 Hook: Grading is easiest when everyone follows the same format—like filling the same worksheet with the same columns.

🥬 The Concept (Deterministic evaluation): This means having strict formats and rules so scoring is exact and repeatable. How it works:

  1. Fix answer shapes (like item, set, list, table).
  2. Specify sorting orders and exact column names.
  3. Compare answers cell-by-cell.

Why it matters: Without fixed formats, grading becomes fuzzy and subjective.

🍞 Anchor: If you and your friend both return the same table schema sorted by date, a computer can grade you fairly.

What was missing before: Many older tests used reverse-engineered, static questions and vague grading. They didn’t fairly check both deep digging and wide gathering, and they risked rewarding memorization.

The gap GISA fills: GISA builds a realistic test with human-made questions, structured answer formats, live updates, and human search step-by-step paths for process learning. It lets us finally measure the skills real users need.

Why you should care: Better testing leads to better helpers. Think: faster homework research, more accurate news summaries, and fewer copy-paste mistakes—all from AI that actually knows how to search.

02Core Idea

🍞 Hook: Imagine a science fair where the judges ask real kids’ questions, require neat lab reports with the same sections, and sometimes change the questions to make sure no one just memorizes last year’s answers.

🥬 The Concept (GISA): GISA is a new benchmark (test) that checks if AI assistants can do real, multi-step web searching and neatly report answers. How it works (the "Aha!"):

  1. Start with authentic, human-written questions that reflect real curiosity.
  2. Require structured answer formats—item, set, list, or table—with explicit sorting and headers.
  3. Mix deep reasoning and wide aggregation in single tasks.
  4. Include a live subset with regularly updated answers to fight memorization.
  5. Provide full human search trajectories as gold references for learning and analysis.

Why it matters: Without this structure, models can look smart without truly searching, and grading can be unfair or fuzzy.

🍞 Anchor: A question like “List all CMU Statistics & Data Science faculty alphabetically by last name” demands wide gathering, exact formatting, and a clear sort rule—perfect for precise grading.

Three analogies for the same idea:

  • School analogy: A fair test with clear rubrics and answer boxes (item, set, list, table) that checks both how well you dig into details and how well you collect facts from many sources.
  • Cooking analogy: A recipe contest that scores both your ability to perfect one dish (deep) and create a sampler from many cuisines (wide), with plating rules so judges can compare evenly.
  • Detective analogy: A case file that needs both following one suspect’s timeline (deep) and cross-checking multiple witnesses (wide), all reported in a standard form so the chief can verify everything.

🍞 Hook: You know how being organized makes you faster and clearer?

🥬 The Concept (Structured answer formats): GISA uses four shapes—item, set, list, table—to make answers unambiguous and easy to grade. How it works:

  • Item: One value (e.g., a single name or number).
  • Set: A bag of unique items; order doesn’t matter.
  • List: A specific order matters (and the rule must be stated).
  • Table: Multiple columns with exact headers and sorting keys.

Why it matters: Without structure, even correct facts can be marked wrong due to formatting confusion.

🍞 Anchor: Returning band names as a Set for a festival (order-free) vs. returning faculty in an alphabetized List (order required) are graded differently but both precisely.
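The four shapes map naturally onto plain data types. This is an illustrative mapping, not the benchmark's actual schema; the values are made up.

```python
# Illustrative mapping of the four GISA answer shapes to Python types.
item = "Seoul"                              # Item: a single value
bands = {"Band A", "Band B"}                # Set: unique items, order ignored
faculty = ["Adams", "Baker", "Chen"]        # List: order matters (alphabetical here)
table = [                                   # Table: rows with fixed columns
    {"Name": "Kim", "Start Date": "1993-02-25", "Martial Law": "No"},
    {"Name": "Lee", "Start Date": "1998-02-25", "Martial Law": "No"},
]

# Order-free vs. order-sensitive comparison, the key grading difference:
assert bands == {"Band B", "Band A"}          # sets match regardless of order
assert faculty != ["Baker", "Adams", "Chen"]  # lists do not
```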

🍞 Hook: Think of a coach watching not just your final score but also your footwork.

🥬 The Concept (Human search trajectories): These are recorded step-by-step paths real people took—queries, clicks, and pages—to find answers. How it works:

  1. Log search queries and results.
  2. Track every clicked page.
  3. Time-stamp actions for pacing.

Why it matters: They prove the task is solvable by normal browsing and let models learn better strategies.

🍞 Anchor: If humans solved “Who won the Archibald Prize most often in the 20th century?” by checking an official gallery list and cross-verifying biographies, AI can imitate that route.

🍞 Hook: News changes; so should the test.

🥬 The Concept (Dynamic evaluation/live subset): Some GISA questions are kept fresh and re-labeled periodically so models must search, not memorize. How it works:

  1. Mark questions that can change (live).
  2. Update their answers regularly.
  3. Score models on the updated ground truth.

Why it matters: Prevents easy wins from training data and keeps the benchmark relevant over time.

🍞 Anchor: A list of current department faculty or a 2023 festival lineup must reflect new updates—not last year’s snapshot.

Before vs. After:

  • Before: Many tests rewarded clever memory or narrow tricks, and grading depended on subjective LLM judgments.
  • After: GISA sets real tasks, demands structured outputs, and uses strict, reproducible metrics to measure true search ability.

Why it works (intuition): Clear formats reduce grading noise; real questions ensure real skills; live updates fight memorization; and human trajectories show the solvable path. All together, they push agents to plan better, browse smarter, and verify conflicts.

Building blocks:

  • Human-written queries across everyday domains.
  • Four answer types with explicit sorting and schemas.
  • Deep + Wide blended into single tasks.
  • Live vs. Stable subsets.
  • Process supervision via human trajectories.
  • Deterministic metrics (EM, F1, order, row-level F1) for fair scoring.

03Methodology

At a high level: Real-world question → (Brainstorm) → (Refine into structured format) → (Human search + logging) → (Quality checks + anti-memorization filter) → (Evaluation with exact metrics) → Score.

Step 1: Brainstorming real questions

  • What happens: Annotators freely browse domain sites (news, encyclopedias, arts, sports) and jot down natural questions triggered by what they read.
  • Why this step exists: To make questions feel like what real people would actually ask, not puzzles made backward from answers.
  • Example: Seeing news about martial law in South Korea sparks: “Which presidents declared it?” and “What is the history of such declarations?”

Step 2: Query refinement into structured formats

  • What happens: Turn seed questions into precise prompts with answer type (item, set, list, table), exact columns, and explicit sorting rules.
  • Why this step exists: So the final answers can be graded deterministically and push agents to both gather widely and reason deeply.
  • Example: Instead of “How many presidents?” (trivial), require: “Provide a table listing all presidents with Start Date, End Date, and whether Martial Law was declared; sort by Start Date.” This forces wide collection (all presidents) plus deep verification (each tenure’s declarations).

🍞 Hook: You know how checklists make group projects run smoothly?

🥬 The Concept (Deterministic rules for outputs): Refinement includes strict schema and order to ensure consistent grading. How it works:

  1. Fix headers exactly.
  2. Set primary/secondary sort keys to break ties.
  3. Disallow ambiguous formats.

Why it matters: Without this, even correct content can be ungradeable.

🍞 Anchor: “Name, Start Date, End Date, Martial Law (Yes/No), sorted by Start Date” leaves no doubt.
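Primary and secondary sort keys (step 2) translate directly into a key tuple when sorting rows. A small sketch with made-up rows, showing how the secondary key breaks ties deterministically:

```python
# Sort table rows by a primary key (Start Date) with a secondary
# tie-breaker (Name), as a deterministic output rule might require.
rows = [
    {"Name": "Lee", "Start Date": "1998-02-25"},
    {"Name": "Kim", "Start Date": "1993-02-25"},
    {"Name": "Ahn", "Start Date": "1993-02-25"},  # ties with Kim on date
]
rows.sort(key=lambda r: (r["Start Date"], r["Name"]))
print([r["Name"] for r in rows])  # Ahn and Kim share a date; Name breaks the tie
```

Without the tie-breaker, two correct answers could differ in row order and one would be graded wrong; the key tuple removes that ambiguity.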

Step 3: Human annotation with a logging tool

  • What happens: Annotators use Google Search only (no LLM help), and a browser extension silently records queries, SERP results, clicks, and timestamps. They build final answers in CSV.
  • Why this step exists: To capture real human search trajectories that prove solvability and later teach models process skills.
  • Example: Logs show queries like “list of presidents of South Korea,” clicks into Wikipedia, and then follow-up searches for each leader’s martial law decisions.

🍞 Hook: Like a coach reviewing a game tape to learn the plays.

🥬 The Concept (Human search trajectories): The recorded paths are stored as JSON with search terms, SERP snapshots, and click chains. How it works:

  1. Start/stop per task.
  2. Store each query and its results.
  3. Record navigation between pages.

Why it matters: Lets us analyze good strategies and where models diverge from humans.

🍞 Anchor: Seeing humans browse more pages but issue fewer queries suggests agents should explore deeper before reformulating.
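A logged trajectory might look like the record below. The field names and URLs are illustrative assumptions, not GISA's actual JSON schema; the helper computes pages browsed per query, the human-vs-model signal the anchor describes.

```python
# Illustrative trajectory record; field names are assumptions, not GISA's schema.
trajectory = {
    "task_id": "gisa-0042",
    "steps": [
        {"type": "search", "query": "list of presidents of South Korea", "ts": 0},
        {"type": "click", "url": "https://en.wikipedia.org/wiki/President_of_South_Korea", "ts": 12},
        {"type": "click", "url": "https://example.org/martial-law-history", "ts": 95},
        {"type": "search", "query": "Kim Young-sam martial law", "ts": 140},
        {"type": "click", "url": "https://example.org/kim-young-sam", "ts": 150},
    ],
}

def pages_per_query(traj):
    """Pages browsed per search issued; humans tend to score high, models low."""
    searches = sum(s["type"] == "search" for s in traj["steps"])
    clicks = sum(s["type"] == "click" for s in traj["steps"])
    return clicks / searches

print(pages_per_query(trajectory))  # 3 clicks / 2 searches = 1.5
```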

Step 4: Quality checking and anti-memorization filter

  • What happens: A verifier checks if logs are clean (no missing starts or noise), if answers match facts and formats, and whether everything is derivable from the logged pages. Finally, a strong LLM is tested with web access off; if it answers perfectly from memory, the question is removed.
  • Why this step exists: Ensures every sample is valid, solvable via normal browsing, and not trivially memorized.
  • Example: If the final table misses a row or uses wrong headers, it’s fixed only if the needed info exists in the recorded pages; otherwise, the sample is reworked or discarded.

Step 5: Evaluation-time parsing and normalization

  • What happens: Agents must output inside <answer> tags and a TSV code block with a header. The evaluator extracts the block, normalizes text (lowercasing, stripping symbols, canonicalizing numbers), and compares to ground truth.
  • Why this step exists: To avoid grading differences due to harmless formatting (like commas in numbers or casing).
  • Example: “1,000” becomes “1000,” percentages convert to decimals, and headers are standardized before comparison.
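A parser and normalizer along these lines (a sketch of the idea, not the paper's actual evaluation script) might look like:

```python
import re

def extract_tsv(output):
    """Pull the TSV code block out of the <answer> tags (sketch)."""
    m = re.search(r"<answer>.*?```(?:tsv)?\n(.*?)```.*?</answer>", output, re.S)
    return m.group(1).strip() if m else None

def normalize(cell):
    """Canonicalize one cell: casing, thousands separators, percentages."""
    cell = cell.strip().lower()
    cell = cell.replace(",", "")            # "1,000" -> "1000"
    if cell.endswith("%"):                  # "50%" -> "0.5"
        cell = str(float(cell[:-1]) / 100)
    return cell

sample = "<answer>\n```tsv\nName\tYear\nBand A\t1991\n```\n</answer>"
print(extract_tsv(sample))   # Name\tYear\nBand A\t1991
print(normalize("1,000"))    # 1000
print(normalize("50%"))      # 0.5
```

Normalizing both prediction and ground truth before comparison is what keeps "1,000" and "1000" from being counted as different answers.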

Step 6: Metrics tailored to answer types

  • Universal metric: Exact Match (EM) gives 1 only if the whole output exactly matches; else 0.
  • Set-type: F1 measures overlap between predicted and true sets.
  • List-type: Two scores—F1 for content overlap and an order score (SequenceMatcher) for how well the order aligns.
  • Table-type: Row-level F1 (entire rows right) and item-level F1 (correct individual cells).
  • Why this step exists: Different answer shapes need different fairness rules.
  • Example: A list of faculty must be alphabetized as asked; a set of band names ignores order; a table checks rows and cells separately.
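The per-type metrics above can be sketched as follows. The paper names SequenceMatcher for the order score; the F1 variants here follow the standard definition and are illustrative of the row-level and set-level ideas, not the exact evaluation code.

```python
from difflib import SequenceMatcher

def set_f1(pred, gold):
    """Standard F1 over unordered unique items."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def order_score(pred, gold):
    """List order alignment via SequenceMatcher, as the paper describes."""
    return SequenceMatcher(None, pred, gold).ratio()

def row_f1(pred_rows, gold_rows):
    """Row-level F1 for tables: a row counts only if it matches entirely."""
    return set_f1({tuple(r) for r in pred_rows}, {tuple(r) for r in gold_rows})

print(set_f1({"a", "b"}, {"b", "c"}))                 # 0.5
print(order_score(["x", "y", "z"], ["x", "z", "y"]))  # < 1.0: order differs
```

Item-level F1 for tables would apply the same `set_f1` logic to individual cells instead of whole rows, giving partial credit when a row is mostly right.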

🍞 Hook: Think of a pop quiz where even tiny mistakes reduce your score—so neatness and accuracy both count.

🥬 The Concept (Exact Match Score): EM is a strict pass/fail for matching the ground truth exactly. How it works:

  1. Normalize both outputs.
  2. Compare all cells or items.
  3. Score 1 for perfect match; else 0.

Why it matters: It rewards full correctness and strong instruction-following.

🍞 Anchor: If the model prints the exact table with the right headers and sorting, EM = 1; a single wrong header or missing row gives EM = 0.

The Secret Sauce:

  • Unite deep + wide demands in a single realistic task.
  • Use structured formats for clean grading.
  • Keep a live subset fresh so models must actually search.
  • Include human trajectories so we can train and analyze the process, not just the final answer.

04Experiments & Results

The Test: Researchers ran many top language models as web-search agents on all 373 GISA questions. Each model got the same instructions, the same tools (Search and Browse), and the same rule to output TSV inside <answer> tags, so the comparison was fair.

The Competition: They compared strong open and commercial models (like Claude 4.5 Sonnet, Gemini 3 Pro, GPT-5.2, DeepSeek-V3.2, Qwen models, Kimi K2.5) and also commercial deep research/search systems (OpenAI o4-mini Deep Research, Perplexity Sonar Pro Search, Google Search AI Mode, and GPT-4o Search Preview).

The Scoreboard with context:

  • Overall difficulty: Even the best model (Claude 4.5 Sonnet with thinking) only reached about 19.30% EM. That’s like a test where full-credit exact answers are so strict that getting even one-fifth perfect is already hard.
  • Answer-type trends: Models did much better on single items and sets than on tables. Tables require both wide gathering and neat multi-column formatting with sorting—many places to slip.
  • Thinking helps: Turning on “thinking mode” (longer step-by-step reasoning) improved performance for most model families, but it also costs more tokens.
  • Tool calls: More is not always better. Some models spammed searches and browsing but did worse—extra noise and longer contexts can confuse reasoning.
  • Commercial performance: Surprisingly, commercial deep research systems did not beat the best ReAct-style LLM agents. A big reason: instruction-following and formatting mistakes (like wrong headers) cost them exact matches.

Surprising Findings:

  • Live vs. Stable: A very recent model (Kimi K2.5) did clearly worse on live questions compared to stable ones, suggesting it might have memorized stable facts from training but struggled when answers changed. This supports why GISA needs a live subset.
  • Human vs. Model behavior: Humans searched fewer times but browsed many more pages per search; models searched more often but skimmed fewer pages. Tasks where model behavior looked more human-like (deeper browsing, smarter query refinements) scored better.
  • Inference-time scaling: Running multiple independent attempts per question (k runs) boosted the chance of getting one perfect answer a lot (Best@k climbed from about 8.9% to 22.22% at k=16). However, picking the right answer among many tries is still tricky (Majority@k lagged), so better self-checking is needed.
  • Error patterns: The biggest chunk of mistakes came from the search process itself (poor query planning, not following links deeper, not launching verification queries when sources disagreed). Another large slice: output formatting errors—like wrong table headers or sort orders.
  • Cost matters: Some top performers used tokens efficiently; others spent a lot but didn’t score higher. This means smart strategy beats raw compute in this setting.
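The Best@k and Majority@k findings above can be made concrete with a small sketch (toy run outputs, not real model data): Best@k asks whether any attempt was perfect, Majority@k whether voting picks the right one.

```python
from collections import Counter

def best_at_k(runs, gold):
    """1 if any of the k independent attempts matches exactly."""
    return int(any(r == gold for r in runs))

def majority_at_k(runs, gold):
    """1 if the most common attempt matches the ground truth."""
    most_common, _ = Counter(runs).most_common(1)[0]
    return int(most_common == gold)

runs = ["A", "B", "A", "C"]          # toy attempts for one question
print(best_at_k(runs, "C"))          # 1: at least one run found "C"
print(majority_at_k(runs, "C"))      # 0: the majority vote picked "A"
```

The gap between the two scores is exactly the verification problem the paper flags: a correct answer often exists among the attempts, but picking it out reliably is the hard part.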

🍞 Hook: You know how a spelling mistake can turn a correct answer into a wrong one on a strict quiz?

🥬 The Concept (Instruction following): Models must obey exact formatting—correct headers, TSV structure, sorting, and no extra notes in the code block. How it works:

  1. Follow the schema exactly.
  2. Put only TSV inside the code block.
  3. Respect sorting rules.

Why it matters: GISA’s grading is strict; a tiny formatting slip can drop EM to 0 even if facts are right.

🍞 Anchor: Returning a table with “Name” instead of the specified “President Name” can ruin an otherwise correct answer.
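A self-check along these lines catches the header mistake in the anchor before submission. The column names come from the running martial-law example; the function is a sketch, not part of the benchmark.

```python
def check_schema(tsv_text, expected_headers):
    """Verify the TSV header row matches the required schema exactly."""
    header = tsv_text.splitlines()[0].split("\t")
    return header == expected_headers

required = ["President Name", "Start Date", "End Date", "Martial Law"]
good = "President Name\tStart Date\tEnd Date\tMartial Law\nKim\t1993\t1998\tNo"
bad = "Name\tStart Date\tEnd Date\tMartial Law\nKim\t1993\t1998\tNo"

print(check_schema(good, required))  # True
print(check_schema(bad, required))   # False: "Name" != "President Name"
```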

Bottom line: GISA is hard in the right way—it reveals where today’s agents fall short: planning multi-step searches, browsing deeply, verifying conflicts, and packaging results precisely. Those are exactly the skills people need help with in real life.

05Discussion & Limitations

Limitations:

  • Scale: 373 questions is plenty for evaluation but not enough for big training runs. More queries would widen coverage further.
  • Modality: GISA is text-only. Real web tasks can need images, charts, or videos; future versions should consider multimodal content.
  • Tool budget: Experiments capped tool calls at 30 per task for cost fairness. A few cases might need more; dynamic budgets could help.

Required resources:

  • Access to web search and a page reader/summarizer.
  • Models that can plan multi-step tool use and follow strict output formats.
  • Evaluation scripts to parse TSV, normalize values, and compute EM/F1/order metrics.

When NOT to use:

  • If you’re testing purely offline knowledge with no web access needed.
  • If you only care about free-form storytelling instead of structured answers.
  • If your application is heavily visual (e.g., reading infographics) until a multimodal GISA exists.

Open questions:

  • How to teach agents to browse like humans—fewer query rewrites, more systematic page exploration?
  • What’s the best way to resolve conflicts across sources automatically?
  • Can agents self-check tables for missing rows, wrong headers, or sort errors before submitting?
  • How to pick the best attempt among many runs reliably (strong verification)?
  • How to scale the benchmark while keeping human trajectories and quality high?

🍞 Hook: Imagine a teacher who not only grades your final essay but also helps you improve your outline and sources.

🥬 The Concept (Process-level supervision): Using human trajectories to guide how models plan, search, and verify—not just what final answer they give. How it works:

  1. Compare model steps to human steps.
  2. Reward good planning and thorough browsing.
  3. Penalize skipping verification.

Why it matters: Better process leads to better, more reliable answers.

🍞 Anchor: If humans checked three sources for conflicting dates, models can be trained to do the same before finalizing a table.

Overall assessment: GISA is a strong, realistic yardstick that shines a light on practical weaknesses—and that’s exactly what the field needs to make reliable research assistants.

06Conclusion & Future Work

Three-sentence summary:

  • GISA is a realistic benchmark that tests whether AI assistants can truly search the web like people do, combining deep reasoning and broad aggregation in the same tasks.
  • It enforces structured answers (item, set, list, table) with deterministic scoring, includes a live subset to avoid memorization, and ships with complete human search trajectories.
  • Current systems struggle (best ~19.30% EM), revealing urgent needs in planning, deep browsing, conflict resolution, and strict formatting.

Main achievement:

  • Turning vague, static, and sometimes unrealistic agent tests into a fair, dynamic, process-aware evaluation that mirrors how real users seek information.

Future directions:

  • Expand to multimodal web pages (images, charts, videos), scale the number of queries, refine process-level training from human trajectories, and develop stronger self-verification for multi-run selection.

Why remember this:

  • GISA resets the bar for what it means to be a great web-searching assistant: not just smart in theory, but careful, organized, up-to-date, and aligned with how people actually look for answers. It’s a blueprint for building AI that truly lightens our research load.

Practical Applications

  • Build classroom research helpers that return neat, graded tables instead of messy paragraphs.
  • Create newsroom tools that verify facts across multiple sources before publishing.
  • Power academic assistants that compile conference awards, faculty lists, or citations with strict sorting.
  • Design enterprise dashboards that aggregate up-to-date market data into clean TSV outputs.
  • Train agents with human trajectories to improve query planning and deep browsing habits.
  • Automate compliance checks by collecting documents from many sources and extracting key fields reliably.
  • Use live-subset style questions to audit whether internal assistants still search rather than memorize outdated info.
  • Add self-checkers that confirm headers, sorting, and missing rows before finalizing answers.
  • Adopt inference-time scaling (multiple runs) plus verification to boost exact-match rates on tough tasks.
  • Benchmark new agent architectures (e.g., multi-agent or MCTS-guided) on a realistic, dynamic suite of tasks.
Tags: GISA · information-seeking agents · web search benchmark · deep and wide search · structured answer evaluation · exact match scoring · human search trajectories · dynamic evaluation · live subset · deterministic grading · process-level supervision · list and table metrics · agent planning · tool use efficiency · conflict resolution