Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
Key Summary
- The paper asks AI to hunt for insights in big databases without being told exact questions, like a curious scientist instead of a test-taker.
- It introduces Deep Data Research (DDR) and a large benchmark (DDR-Bench) that lets models freely explore data using tools like SQL and Python.
- A new checklist system fairly checks whether each claim in the AI's report is actually supported by the data, avoiding vague or subjective grading.
- Frontier models show early signs of agency, but long, multi-step investigations and knowing when to stop remain hard.
- Claude 4.5 Sonnet tops the scoreboard with an average accuracy in the high 40s (percent), while several open-source models come close in certain domains.
- Models improve by exploring broadly first and then making a few sharp, late-stage queries, like planning quietly and then striking precisely.
- Bigger models or longer context windows alone don't guarantee better investigation; training that targets agent behavior matters more.
- Agent scaffolding (like memory notes) changes how models act but doesn't reliably boost final insight quality.
- Novel insights judged pairwise line up well with checklist accuracy, showing the benchmark captures the main signal of useful discovery.
- Hallucinations are rare and have little impact, and every reported insight links back to exact tool calls and data for traceability.
Why This Research Matters
In real life, people often start from messy data, not neat questions, so we need AIs that can explore wisely on their own. DDR-Bench measures this ability fairly by checking each claim against the same data the AI used, reducing guesswork and bias. This helps build trustworthy assistants in healthcare, finance, and wellbeing analysis that show their work and earn confidence. By studying exploration patterns and costs, we can design AIs that discover more while wasting less time and money. The benchmark also shows what training really helps, steering the field beyond "just make it bigger." In short, it pushes AI toward being a careful, curious partner that can find what matters and prove it.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how a detective doesn't start with a list of answers, but with a messy room full of clues and the freedom to look anywhere? Real science and data work feel like that too.
The Concept: Large Language Models (LLMs) are powerful text tools, but most tests used to judge them ask them to answer specific, already-written questions. How it works (before this paper): 1) Humans write a clear question. 2) The AI replies. 3) We check if the reply matches the answer. Why it matters: This misses a different kind of smarts: choosing good questions and exploring wisely when nothing is handed to you. Anchor: It's like grading a treasure hunter only on whether they can open a chest you point to, not on whether they know where to search in a huge cave.
Hook: Imagine two kinds of smart. One kind follows instructions perfectly; the other decides what to do when no one's around to tell them. The Concept: Executional intelligence is doing a given task well; investigatory intelligence is deciding what to explore and when to stop. How it works: 1) Executional: read the goal, perform steps, finish. 2) Investigatory: look around, form hunches, test them, change course, and declare, "I've got enough." Why it matters: If we only test execution, we can't tell which AIs can be independent researchers rather than polite assistants. Anchor: Cooking from a recipe (executional) vs. inventing a new dish from a mystery box of ingredients (investigatory).
Hook: Picture a data scientist opening a giant hospital database with no questions yet, just curiosity. The Concept: Real-world data science often starts from raw, structured data without prewritten questions. How it works: 1) Scan what data exists. 2) Spot weird patterns. 3) Form a hypothesis. 4) Query and compute. 5) Keep or toss the hunch. Why it matters: Current AI tests rarely score this open-ended behavior, so we don't know how good AIs are at the real starting point of the job. Anchor: Like a bird-watcher going to a new forest: first you look, then you decide where to sit and which calls to follow.
Hook: People tried to test AI "research" before, but it was like grading essays with vibes. The Concept: Earlier benchmarks often gave hidden goals in the prompt, used short interactions, or used subjective LLM-as-a-judge scoring. How it works: 1) Provide hints that steer the model. 2) Limit steps. 3) Let another LLM rate the report's quality. Why it matters: Hidden hints reduce autonomy, short runs miss long-term planning, and subjective judging can be unfair or inconsistent. Anchor: It's like telling a kid where to look for Easter eggs and then praising them for finding eggs quickly while a cousin had to search a whole park.
Hook: So what was missing? A fair way to score open-ended discovery. The Concept: This paper fills the gap with DDR (Deep Data Research) and DDR-Bench, which let AIs freely explore big, real databases and then check each claim with a fact checklist grounded in the data. How it works: 1) Give only a tiny start prompt (like a patient ID). 2) Let the AI explore with tools. 3) The AI writes insights. 4) A checklist verifies each insight is truly supported by the data. Why it matters: Now we can test the "hunt" side of AI, not just the "wait for a question" side. Anchor: It's like letting treasure hunters roam the island and then checking each claimed find against a GPS log and a map legend.
Hook: Why should regular people care? Because we all live in a data forest: health records, school dashboards, company reports. The Concept: Measuring investigatory intelligence helps AIs become better partners in health, finance, sports science, and more. How it works: 1) Start from raw data. 2) Discover what truly matters. 3) Explain it clearly and verifiably. Why it matters: Better discovery can save time, reduce mistakes, and uncover hidden problems and opportunities. Anchor: Like having a smart assistant who doesn't just answer when asked but also says, "Hey, you might want to look at this trend; it could matter."
02 Core Idea
Hook: Imagine sending a curious robot into a library with no questions, just a library card and a notebook. What would it bring back? The Concept: The paper's key insight is that to test AI-as-researchers, you must let them freely explore structured data and then verify their discovered claims with objective, data-grounded checklists. How it works: 1) Minimal start prompt (e.g., "Analyze patient X"). 2) The AI explores with tools (SQL/Python) across unlimited steps. 3) It writes per-step and final insights. 4) Each claim is checked against a curated fact checklist extracted from the same data. Why it matters: This cleanly tests investigatory intelligence (goal-setting, hypothesis testing, and choosing when to stop) without hiding answers in the prompt or using fuzzy grading. Anchor: Like a science fair where judges don't grade style; they check if every claim is backed by lab notes and measurements.
Multiple analogies for the same idea:
- Map analogy: Before, AIs followed a highlighted route (questions). Now, they choose their own path across the map and must mark proofs on each landmark they claim.
- Kitchen analogy: Before, the AI cooked from a recipe. Now, it has a pantry and must design a meal, and every flavor note must tie back to actual ingredients used.
- Sports analogy: Before, we timed a single sprint. Now, we run an obstacle course where athletes decide the order of obstacles and must tag proof at each station.
Hook: You know how a good explorer doesn't just wander forever; they plan quietly, test, and decide when to stop. The Concept: DDR captures long-horizon exploration and self-termination: models must plan implicitly, explore broadly, then zoom in, and finally decide, "I've learned enough." How it works: 1) Broad early scans. 2) Hypothesis formation. 3) Targeted deep queries. 4) Confident stop and final synthesis. Why it matters: Without this, models either stop too early (miss insights) or never stop (waste time and money). Anchor: Like a treasure hunter who first surveys the island shoreline, only then digs at one promising X, and knows when to put the shovel down.
Hook: Grading fairly isn't easy when answers can be written in many ways. The Concept: Checklist-based verification checks whether each insight is truly supported by the database, replacing subjective rubrics. How it works: 1) Extract factual items from text linked to the data. 2) Have humans screen them. 3) Use an LLM-as-a-Checker only to judge "supported/contradicted/insufficient" with the ground-truth answer visible. Why it matters: It avoids opinion-based grading and focuses on evidence. Anchor: Like grading a science report by asking, "Does the data table show what you claimed?" rather than "Do I like your writing?"
Hook: A single sentence per step can be helpful, but a grand finale matters too. The Concept: Message-wise insights capture what each step learned; trajectory-wise insights are the final, all-in-one report after the full journey. How it works: 1) After each turn, write a short insight (or say "NO INSIGHT" if nothing useful). 2) At the end, write a comprehensive summary. Why it matters: This shows both process skill (per step) and synthesis skill (at the end). Anchor: Notes you jot during a hike (per-step) versus your big trail journal entry at the end (final synthesis).
Hook: Bigger is not always better; think of tall shoes versus good balance. The Concept: Training for agency beats simply adding parameters or longer context windows. How it works: 1) Compare model sizes. 2) Compare long-context versions. 3) Compare newer generations trained with reasoning/agent skills. Why it matters: Agency needs practice and strategy, not just size. Anchor: A taller ladder doesn't help if you still don't know which wall to lean it against.
03 Methodology
High-level pipeline: Input (Database + minimal start prompt) → ReAct exploration loop (Reason → Act with SQL/Python → Observe results) repeated many times → Per-turn "message-wise" insights → Decide to stop → Final "trajectory-wise" report → Checklist evaluation (LLM-as-a-Checker) → Scores
Hook: Imagine you get a locker number, two tools (a key and a flashlight), and a huge building to search. The Concept: Minimal start prompt. What it is: A tiny instruction like "Start analyzing patient 12345," without any detailed questions. How it works: 1) Provide entity ID and basic table descriptions. 2) No goals or hints. 3) The model must choose where to begin. Why it matters: It preserves true autonomy; there are no hidden objectives. Anchor: Being told only "Row 7, seat 12" in a theater and figuring out what matters from there.
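To make the setup concrete, here is a minimal sketch of how such a start prompt could be assembled. The entity ID, table names, and wording are hypothetical illustrations, not the benchmark's actual prompt.

```python
# Hypothetical sketch: a minimal start prompt is just an entity ID plus brief table
# descriptions, with no goals or hints. Names and wording are illustrative.

def build_start_prompt(entity_id: str, table_descriptions: dict[str, str]) -> str:
    schema = "\n".join(f"- {name}: {desc}" for name, desc in table_descriptions.items())
    return (
        f"Start analyzing entity {entity_id}.\n"
        f"Available tables:\n{schema}\n"
        "You may call SQL or Python tools. Decide for yourself what to investigate."
    )

print(build_start_prompt(
    "patient_12345",
    {"admissions": "hospital admissions per patient", "labevents": "lab test results"},
))
```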
Hook: You know how people think, do, then look at what happened? The Concept: ReAct exploration loop. What it is: A cycle of Reason → Act (tool call) → Observe (returned data) repeated across turns. How it works: 1) Reason: write thoughts and next plan. 2) Act: call SQL or Python once per turn. 3) Observe: see rows, stats, or errors. 4) Repeat, building on history. Why it matters: It mimics how analysts work and logs the trace for verification. Anchor: A scientist writes a plan, runs an experiment, checks the results, then revises the plan.
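A rough sketch of the loop in Python: `llm_step` (the model call) and `run_tool` (the SQL/Python sandbox) are hypothetical stand-ins, stubbed here so the skeleton runs end to end; neither reflects the paper's actual implementation.

```python
# ReAct skeleton: Reason -> Act (one tool call) -> Observe, repeated until the model stops.

def llm_step(history: list[dict]) -> dict:
    # Stub model: after seeing one tool result, it decides to stop.
    if any(m["role"] == "tool" for m in history):
        return {"text": "FINISH: summary of findings.", "action": None}
    return {"text": "Check the patient's admissions.",
            "action": "SELECT * FROM admissions LIMIT 5"}

def run_tool(action: str) -> str:
    return f"(observation for: {action})"  # stub tool environment

def react_loop(start_prompt: str, max_turns: int = 50) -> list[dict]:
    history = [{"role": "user", "content": start_prompt}]
    trace = []
    for turn in range(max_turns):
        step = llm_step(history)                # Reason: thoughts plus a proposed action
        if step["action"] is None:              # model self-terminates with a final report
            trace.append({"turn": turn, "final_report": step["text"]})
            break
        observation = run_tool(step["action"])  # Act: one SQL/Python call this turn
        history += [{"role": "assistant", "content": step["text"]},
                    {"role": "tool", "content": observation}]  # Observe: result joins history
        trace.append({"turn": turn, "action": step["action"], "observation": observation})
    return trace

print(react_loop("Start analyzing patient 12345."))
```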
Hook: Tools matter, like using a magnifying glass or a calculator at the right time. The Concept: Tool-use with SQL and Python. What it is: The AI can query tables (SQL) and analyze signals or compute (Python). How it works: 1) SQL to select, join, aggregate. 2) Python to process time series, compute trends, or visualize data. 3) Strictly keep reasoning outside the code; code is for pulling/analyzing data. Why it matters: Rich, multi-step analysis needs the right tools. Anchor: Checking a patient's medications (SQL) and then plotting weekly sleep patterns (Python).
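Here is a hedged example of the two tool styles, using an in-memory SQLite table and pandas. The table, columns, and values are toy stand-ins, not the benchmark's real schema.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the structured database (illustrative schema, not MIMIC-IV's).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prescriptions (subject_id INTEGER, drug TEXT);
    INSERT INTO prescriptions VALUES (12345, 'warfarin'), (12345, 'warfarin'), (12345, 'heparin');
""")

# SQL tool: select and aggregate structured records.
meds = pd.read_sql_query(
    "SELECT drug, COUNT(*) AS n_orders FROM prescriptions "
    "WHERE subject_id = 12345 GROUP BY drug ORDER BY n_orders DESC",
    conn,
)

# Python tool: compute a simple rolling trend from a daily wearable-style signal.
daily_steps = pd.Series([6200, 7100, 5800, 9000, 4300, 8800, 7600])
rolling_mean = daily_steps.rolling(window=7, min_periods=1).mean()

print(meds)
print(round(rolling_mean.iloc[-1], 1))
```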
Hook: Sometimes a step gives gold; sometimes, nothing. The Concept: Message-wise insights (per turn). What it is: A short sentence or two explaining what the latest action discovered, or "NO INSIGHT" if it didn't help. How it works: 1) After each observe, write a concise insight tied to that step's reason. 2) Skip if only listing metadata or if the call failed. Why it matters: Separates useful steps from filler and shows progress. Anchor: "This query confirms the patient took warfarin during dialysis weeks."
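A tiny sketch of how per-turn outputs might be filtered; the exact handling of the "NO INSIGHT" marker is an assumption for illustration.

```python
# Keep substantive per-turn insights; drop turns marked "NO INSIGHT" (or empty ones).

def collect_insights(turn_outputs: list[str]) -> list[str]:
    insights = []
    for text in turn_outputs:
        cleaned = text.strip()
        if not cleaned or cleaned.upper().startswith("NO INSIGHT"):
            continue  # metadata-only or failed step: nothing worth recording
        insights.append(cleaned)
    return insights

print(collect_insights([
    "This query confirms the patient took warfarin during dialysis weeks.",
    "NO INSIGHT",
]))
```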
Hook: Endings matter; you have to know when the story is complete. The Concept: Self-termination + trajectory-wise report. What it is: The model decides when to stop and writes a comprehensive final report. How it works: 1) Monitor coverage and confidence. 2) If more digging won't add much, emit a message beginning with FINISH:. 3) Summarize key findings across all steps. Why it matters: Avoids endless wandering and ensures a coherent conclusion. Anchor: "FINISH: Based on 30 turns, here is the patient's full care timeline, procedures, and medications."
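A small sketch of the stop convention, assuming the final message simply begins with the FINISH: prefix; the parsing details are illustrative.

```python
# Detect self-termination: a message starting with "FINISH:" ends the run, and the
# rest of that message is treated as the trajectory-wise report.

def check_termination(message: str) -> tuple[bool, str]:
    text = message.strip()
    if text.upper().startswith("FINISH:"):
        return True, text[len("FINISH:"):].strip()  # final report body
    return False, ""

done, report = check_termination(
    "FINISH: Based on 30 turns, here is the patient's full care timeline."
)
print(done, report[:30])
```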
Hook: Fair grading for free-form answers is tricky, so use a checklist tied to the same data. The Concept: Checklist evaluation with LLM-as-a-Checker. What it is: A list of verifiable facts derived from the database's text parts or surveys to test against the model's insights. How it works: 1) Build fact checklists per entity (human-screened). 2) Provide insights, the question, and the ground-truth answer to a checker LLM. 3) The checker judges if the insight supports, contradicts, or is insufficient. Why it matters: It's objective, repeatable, and per-claim rather than vibe-based. Anchor: "Does the insight support: 'Was a right-sided facial droop noted?'" → Checker: "Supported by [Message 7]."
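Here is a hedged sketch of one checker call. `ask_checker` is a hypothetical callable wrapping whatever LLM you use, and the prompt wording is illustrative rather than the paper's.

```python
from typing import Callable

LABELS = {"supported", "contradicted", "insufficient"}

def check_item(insights: list[str], question: str, ground_truth: str,
               ask_checker: Callable[[str], str]) -> str:
    # Build a verification prompt: insights + checklist question + visible ground truth.
    prompt = (
        "Decide whether the insights below SUPPORT, CONTRADICT, or are INSUFFICIENT "
        "to establish the answer to the checklist question.\n\n"
        "Insights:\n" + "\n".join(f"- {i}" for i in insights) + "\n\n"
        f"Question: {question}\nGround-truth answer: {ground_truth}\n"
        "Reply with one word: supported, contradicted, or insufficient."
    )
    label = ask_checker(prompt).strip().lower()
    return label if label in LABELS else "insufficient"  # conservative fallback

# Toy usage with a stubbed checker.
print(check_item(["A right-sided facial droop was documented in the notes."],
                 "Was a right-sided facial droop noted?", "Yes",
                 ask_checker=lambda p: "supported"))
```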
Concrete data examples used:
- MIMIC-IV (healthcare): Structured hospital tables + clinical notes → Checklists about demographics, diagnoses, surgeries, meds.
- GLOBEM (behavior + wellbeing): Wearable daily signals + surveys → Checklists ask if wellbeing improved, worsened, or stayed the same.
- 10-K (finance): XBRL financials + text sections → Checklists about risks, margins, and drivers of profitability.
The secret sauce: Hook: You can't call something "discovery" if answers were smuggled in. The Concept: Radical openness with verifiable grounding. What it is: Unlimited turns, minimal prompting, tool-based analysis, and claim-by-claim checks. How it works: 1) No predefined questions. 2) Long horizons allowed. 3) Every claim must trace to data via tool logs. Why it matters: It isolates real investigatory intelligence. Anchor: Like a lab where notebooks, experiments, and results are fully auditable for each conclusion.
04 Experiments & Results
Hook: If we send different explorers into the same cave with just a lantern and a rope, who brings back the most real treasure? The Concept: DDR-Bench compares many LLMs on three big, real databases using the same open-ended rules. What it is: A large, checklist-graded test of deep data research. How it works: 1) Let each model freely explore. 2) Collect per-turn and final insights. 3) Check each claim with the checklist. 4) Score accuracy as the share of supported items. Why it matters: It reveals who can truly hunt, not just follow directions. Anchor: Like timing and scoring climbers on both pathfinding and proof they reached marked ledges.
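As a worked example of the headline metric as described here (the share of checklist items judged supported), a few lines suffice:

```python
# Checklist accuracy: fraction of checklist items the checker labeled "supported".

def checklist_accuracy(labels: list[str]) -> float:
    return sum(1 for lab in labels if lab == "supported") / len(labels) if labels else 0.0

print(checklist_accuracy(["supported", "insufficient", "supported", "contradicted"]))  # 0.5
```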
The competition: Proprietary (Claude, GPT, Gemini) and open-source (DeepSeek, GLM, Kimi, Qwen, MiniMax, Llama). Tasks: MIMIC-IV (100 patients), GLOBEM (91 users), 10-K (100 companies). Metrics: checklist accuracy, novelty of extra insights, number of turns, efficiency.
Scoreboard (with context):
- Overall, Claude 4.5 Sonnet leads with an average accuracy in the high 40s (percent), notably ahead in several domains. Think of this as scoring an A- while many others hover around C+ to B-.
- GPT-5 variants and Gemini perform strongly but generally lower than Claude on this particular benchmark.
- Top open-source models (DeepSeek V3.2, GLM-4.6, Kimi K2) sometimes approach proprietary performance, especially on certain datasets like 10-K and GLOBEM.
- Message-wise vs. trajectory-wise: Some models are better at step-by-step insights (process clarity), others at final synthesis (big-picture reasoning). This gap shows different strengths.
Hook: Do extra discoveries that aren't on the checklist matter? The Concept: Novelty (pairwise) analysis. What it is: Compare which model's off-checklist insights are more useful, head-to-head. How it works: 1) Extract insights not used by the checklist. 2) Blindly compare model A vs. model B per entity. 3) Aggregate with Bradley-Terry ranking. Why it matters: Ensures we don't punish creative, valid finds that the checklist didn't list. Anchor: Two explorers bring bonus artifacts; judges pick which bonus pile is more useful without knowing who found them. Finding: Novelty rankings correlate strongly with checklist rankings across domains, so the checklist isn't missing the main story.
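For readers curious about the aggregation step, here is a minimal Bradley-Terry fit over pairwise wins (a standard MM-style update). The win counts are toy numbers, not the paper's results, and the paper's exact fitting procedure may differ.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Estimate model strengths from wins[i, j] = times model i beat model j."""
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T  # total comparisons per pair
    for _ in range(iters):
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            if denom > 0:
                p[i] = wins[i].sum() / denom
        p /= p.sum()  # fix the scale; only ratios are identifiable
    return p

toy_wins = np.array([[0, 7, 9], [3, 0, 6], [1, 4, 0]])
strengths = bradley_terry(toy_wins)
print(np.argsort(-strengths))  # indices of models from strongest to weakest novelty
```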
Scaling and dynamics (what we learned about behavior):
- Interaction scaling: Many models follow a slow-then-sharp improvement curve. Better models often delay committing and then improve quickly, like quiet planning before a precise strike.
- Token scaling: The most valuable tokens often appear late; a few deep, targeted queries deliver big gains.
- Cost scaling: Performance rises with spending, but cost-effective standouts (e.g., DeepSeek) achieve strong results at lower cost.
Hook: How do explorers spread their attention: everywhere a little, or deeply on a few spots? The Concept: Exploration patterns via coverage and entropy. What it is: Coverage = breadth of fields touched; entropy = how evenly attention is spread. How it works: 1) Track distinct fields touched. 2) Measure distribution uniformity. 3) Find the sweet spot: balanced breadth with focused depth. Why it matters: Too narrow misses facts; too diffuse misses depth. Anchor: Good bird-watchers scan many trees but settle in the most promising canopy when they hear the right call. Observation: Stronger models cluster in the balanced region and show stable strategies.
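A small sketch of how these two descriptors could be computed from a log of which fields each query touched; the field names and totals are toy values, and the paper's exact definitions may differ.

```python
import math
from collections import Counter

def coverage_and_entropy(fields_touched: list[str], total_fields: int) -> tuple[float, float]:
    counts = Counter(fields_touched)
    coverage = len(counts) / total_fields            # breadth: fraction of fields ever visited
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)  # evenness of attention across fields
    return coverage, entropy

print(coverage_and_entropy(["labs", "meds", "labs", "notes", "labs"], total_fields=20))
```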
Surprising findings:
- Bigger or longer-context models don't automatically do better at deep research. Training that targets agent skills matters more.
- Memory modules can change interaction style (more aggressive reads, earlier stops) but don't guarantee higher accuracy.
- Reactive mode (turning checklists into explicit questions) boosts accuracy a lot, showing tasks are solvable, but it weakens the test of autonomy.
Reliability and safety:
- LLM-as-a-Checker shows low variation across repeats and roughly 90% macro-F1 agreement with human checks in sampled cases (see the sketch after this list).
- Hallucinations (facts not grounded in observed data) are rare (under ~5% in most settings) and weakly related to accuracy. The full trace links each claim to exact tool outputs.
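The agreement figure above is a macro-F1, presumably over the three verification labels; here is a quick sketch of how such a score is computed, using toy labels and assuming scikit-learn is available.

```python
from sklearn.metrics import f1_score

# Toy example: checker labels vs. human labels on sampled checklist items.
human   = ["supported", "insufficient", "contradicted", "supported", "insufficient", "contradicted"]
checker = ["supported", "insufficient", "contradicted", "supported", "supported", "contradicted"]

# Macro-F1 averages the per-label F1 scores, so rare labels count as much as common ones.
print(round(f1_score(human, checker, average="macro"), 3))
```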
05 Discussion & Limitations
Hook: No tool is perfect; even a good compass can wobble near magnets. The Concept: Limitations. What it is: Places where DDR/DDR-Bench can't tell the whole story. How it works: 1) Checklists can't list every valid insight, so some good discoveries won't be scored. 2) Strong models still struggle with very long, uncertain hunts. 3) In specialized areas (e.g., healthcare), small ungrounded guesses can be risky even if rare. Why it matters: Users should treat DDR as a rigorous but not all-seeing lens. Anchor: A spelling test can't measure poetry.
Resources needed:
- Access to large structured databases (MIMIC-IV, GLOBEM, SEC 10-K).
- Tool environment for SQL/Python and logging.
- Enough inference budget (tokens/time) to allow long, multi-turn exploration.
When not to use:
- If you need quick, exact answers to known questions (traditional QA may be cheaper and simpler).
- If the domain is safety-critical and you can't human-review the trace (e.g., unsupervised clinical decisions).
- If subjective creativity or style is the main goal (DDR prizes data-grounded facts over prose flair).
Open questions:
- How to expand checklists so they cover more creative, multi-hop insights without becoming subjective.
- How to train for stronger investigatory behaviors (e.g., uncertainty-handling, hypothesis revision, stop decisions) beyond size/context.
- How to design agent scaffolds that help consistently, not just reshape behavior unpredictably.
- How to blend structured and unstructured environments (e.g., web + database) while keeping objective scoring.
- How to measure strategic qualities like implicit planning and balanced exploration directly.
06 Conclusion & Future Work
Three-sentence summary: The paper introduces Deep Data Research (DDR) and DDR-Bench to test whether AI models can explore large, structured databases on their own, uncovering insights without being handed questions. It verifies each claim using objective, per-entity checklists tied to the same data and analyzes long-horizon behavior like exploration balance, implicit planning, and stopping decisions. Results show emerging agency but also clear room to grow: training for agent skills matters more than just making models bigger.
Main achievement: Cleanly isolating investigatory intelligence with a minimal, open-ended setup and a rigorous, verifiable checklist evaluation that scales across healthcare, behavior science, and finance.
Future directions: Develop training pipelines that reward curiosity, uncertainty resolution, and stable exploration policies; design scaffolds that truly aid discovery; expand checklists to capture richer multi-hop insights; and connect structured and web-scale environments under the same objective scoring.
Why remember this: DDR-Bench shifts the question from "Can AI answer?" to "Can AI discover?" and gives the field a fair, scalable way to measure that leap.
Practical Applications
- Clinical chart reviewers that reconstruct patient timelines and highlight key risks with SQL-backed evidence links.
- Financial analysts that scan 10-K filings and XBRL data to surface drivers of margins and regulatory risks with proof.
- Behavior and wellbeing monitors that connect daily activity patterns to survey changes with reproducible Python analysis.
- Internal data auditors that explore warehouses to detect anomalies, missing joins, or policy violations and show the queries used.
- Operations dashboards that proactively discover bottlenecks and verify claims against logs and KPI tables.
- Compliance assistants that comb structured records and flag evidence-supported concerns (e.g., cost recovery risks).
- Data onboarding guides that map new databases, summarize useful fields, and suggest next investigative steps.
- Hypothesis generators for scientists that propose and test simple analyses before humans invest in deeper studies.
- Education tools that teach students to form data-backed insights and compare against checklists for feedback.
- Product analytics bots that uncover user behavior shifts and retention drivers with traceable SQL/Python notebooks.