FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights
Key Summary
- FIRE-Bench is a new test that checks whether AI agents can fully redo real scientific discoveries, step by step, not just guess answers.
- Instead of asking AIs to invent brand-new ideas (hard to verify), FIRE-Bench asks them to rediscover proven findings from recent ML papers using only a high-level question.
- Agents must plan studies, write code, run experiments, and draw conclusions backed by data; their final claims are scored against the paper's true findings.
- Across 30 tasks, even the strongest agents scored under 50 F1 on average and showed big ups and downs between runs, meaning results are fragile.
- Most failures came from weak research planning and shaky evidence-to-conclusion reasoning, not just coding mistakes.
- Agents did better on straightforward, step-by-step tasks and struggled when careful control design or causal thinking was needed.
- The evaluation uses claim-level precision/recall with an LLM judge plus human spot-checks to keep it fair and consistent.
- No strong signs of data contamination were found after adjusting for task difficulty and model knowledge cutoffs.
- Better models usually cost more to run, but efficient execution can improve the performance-cost balance.
- FIRE-Bench gives a clear, diagnostic way to measure real scientific reasoning so we can build more reliable AI research helpers.
Why This Research Matters
If we want AI to help scientists, we must be sure it can plan solid experiments, run them correctly, and draw careful, evidence-backed conclusions. FIRE-Bench checks exactly that by asking agents to rediscover known results, so we can fairly grade their scientific reasoning. This reduces the risk of trusting agents that sound confident but skip critical controls or overstate what the data shows. It also gives developers a clear map of where agents fail (planning, execution, or conclusion) so improvements target the right skills. The benchmark's claim-level scoring makes evaluations objective and scalable. In the long run, this helps build research AIs that are reliable partners in areas like medicine, safety, and policy, where accuracy and honesty matter most.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how in science class you don't just give an answer: you make a plan, try things out, gather data, and then explain what it all means? Real science is a full journey, not a single step.
🥬 Filling (The Actual Concept)
- What it is: This paper asks a big question: can AI agents do that full science journey and end up with true, checkable discoveries?
- How it works (story of the world before):
- Before, AI agents were good at parts of research, like reading papers, writing code, or making plots, but not necessarily the whole process end to end.
- People tried two main kinds of tests:
- Full paper generation: Agents wrote entire papers from a broad prompt, but judging their correctness needed other AIs to be the judges, which can be subjective.
- Single-metric leaderboards: Agents just had to boost one number (like accuracy) on a benchmark. That's easy to score, but it doesn't tell us if the agent truly understood the science.
- The problem: We lacked a reliable way to check if an AI could run complete research, from question to data-backed conclusion, without leaning on opinionated grading.
- Why it matters: Without a rigorous, checkable test, we can't safely trust AI to help in real science where mistakes are costly and confidence must come from evidence.
🍞 Bottom Bread (Anchor): Imagine testing a cooking robot. One test says, "Make a meal and a panel of robots will taste it." Another says, "Make the soup 2% saltier." Neither proves the robot can run a full dinner from planning to serving. We need a test where the robot plans, cooks, tastes, adjusts, and serves a dish that matches a known recipe, so we can verify it got the core idea right.
🍞 Top Bread (Hook): You know how a teacher prefers math problems with answers in the back of the book? That's because the teacher can verify your work without guessing.
🥬 Filling (LLM-as-judge)
- What it is: LLM-as-judge means using a language model to grade other models' work.
- How it works:
- An AI produces an answer.
- Another AI reads it and says if it's good.
- We accept the second AI's opinion.
- Why it matters: If the judge AI is wrong or biased, the grades are unreliable.
🍞 Bottom Bread (Anchor): It's like asking a robot teacher to grade your essay. If the robot misreads a paragraph, your score might be unfair.
🍞 Top Bread (Hook): Imagine your science fair: a poster without real experiments won't win, and a perfect graph without an explanation won't, either.
🥬 Filling (The Gap)
- What it is: We needed a benchmark that checks the whole scientific cycle, not just bits and pieces.
- How it works:
- Give agents a solid, high-level research question from a real, trusted paper.
- Hide the original methods and results.
- See if agents can independently design tests, run them, and reach the true insight.
- Why it matters: This tests real understanding, not memorization or guessing.
🍞 Bottom Bread (Anchor): It's like giving a student the big question from a chapter test but not the worked examples, then checking if they can figure out the same key lesson the book teaches by doing their own work.
🍞 Top Bread (Hook): Picture two roads: one is a straight sidewalk; the other is a winding hiking trail with rivers to cross. Which is harder?
🥬 Filling (What failed before and why)
- What it is: Past benchmarks either felt too subjective or too narrow.
- How it works:
- Full-paper tasks: expressive but hard to verify at scale.
- Single-metric tasks: easy to verify but don't test scientific reasoning.
- Why it matters: Without a middle path, we couldn't both allow exploration and keep evaluation objective.
🍞 Bottom Bread (Anchor): We needed a game where players are free to try strategies but score points only when they rediscover the true trick that wins, just like the original champions did.
🍞 Top Bread (Hook): Think of everyday stakes, like medicine or climate, where wrong conclusions can hurt people.
🥬 Filling (Real Stakes)
- What it is: Reliable evaluation prevents overconfident deployment of shaky AI scientists.
- How it works:
- Use only verifiable questions with known right answers.
- Check claims precisely, not just overall vibes.
- Diagnose where the process breaks: planning, running, or concluding.
- Why it matters: This keeps AI research helpers honest, safer, and genuinely useful.
🍞 Bottom Bread (Anchor): If an AI suggests a medical policy, we must know it can plan fair tests, run them correctly, and conclude with evidence; FIRE-Bench moves us closer to that standard.
02 Core Idea
🍞 Top Bread (Hook): Imagine a treasure hunt where you only get the map's title, not the route. Can you still find the treasure by making your own plan and checking clues carefully?
🥬 Filling (The "Aha!" Moment)
- What it is: FIRE-Bench tests if AI agents can rediscover known scientific insights from real papers, starting with just a high-level question, by planning, experimenting, and concluding with evidence.
- How it works:
- Pick a recent, peer-reviewed ML paper with a clear, verifiable finding.
- Give the agent only the high-level research question and allowed tools/datasets.
- The agent must plan experiments, implement code, run tests, and write conclusions.
- We split both agent and paper conclusions into atomic claims and match them.
- Why it matters: This reveals whether agents truly understand and can reproduce science-like reasoning, not just output nice-looking text.
🍞 Bottom Bread (Anchor): It's like handing a student the question, "Do plants grow better with more sunlight?" and seeing if they design the right tests and reach the classic correct pattern, without peeking.
🍞 Top Bread (Hook): You know how teachers explain the same math idea three ways so everyone gets it?
🥬 Filling (Multiple Analogies)
- What it is: Three views of the same idea.
- How it works:
- Detective analogy: The agent is a detective; the paper's insight is the solved case; with only the case question, can the agent trace clues and solve it again?
- Cooking analogy: The agent gets the dish name (not the recipe). Can it invent a reasonable recipe, cook it, and serve a dish that matches the original's core taste?
- Science fair analogy: The agent gets the big question on the poster header; can it run the right experiments and end with the same takeaways on the poster's bottom line?
- Why it matters: If the agent can pass all three stories, it's doing real reasoning.
🍞 Bottom Bread (Anchor): For "Lost in the Middle," agents test accuracy when key info is early, middle, or late in context, and the correct rediscovery is: middle is worst.
🍞 Top Bread (Hook): Before and after pictures help you see change at a glance.
🥬 Filling (Before vs After)
- What it is: What changes with FIRE-Bench.
- How it works:
- Before: Evaluations leaned on opinionated judging or narrow metrics.
- After: Evaluations check end-to-end scientific behavior, claim by claim.
- Before: Limited insight into why agents fail.
- After: Fine-grained error taxonomy shows whether planning, execution, or conclusion is the weak link.
- Why it matters: We can now measure real scientific competence and debug it.
🍞 Bottom Bread (Anchor): Instead of just saying "Agent got 70% accuracy," we can say "Agent missed the control group, so its conclusion about bias isn't supported."
🍞 Top Bread (Hook): Picture the gears inside a clock; seeing how they mesh explains why timekeeping works.
🥬 Filling (Why It Works, no equations)
- What it is: FIRE-Bench anchors freedom to explore with a solid target.
- How it works:
- Agents are free to choose methods, but targets are fixed, verifiable insights.
- Claim-level scoring turns fuzzy essays into checkable bites.
- Tasks are compute-light and public, so results are reproducible.
- Why it matters: This balance encourages genuine reasoning, not guesswork.
🍞 Bottom Bread (Anchor): On "LLMs Lack Self-Correction," an agent can try different reflection loops, but it must conclude the true pattern: self-correction doesn't reliably help most tasks.
🍞 Top Bread (Hook): Building a LEGO set is easier when you know the pieces.
🥬 Filling (Building Blocks)
- 🍞/🥬 FIRE-Bench
- What: The full-cycle rediscovery benchmark.
- How: Tasks from 30 recent ML analyses; agents get questions, not methods.
- Why: To test research planning, execution, and conclusion together.
- Example: "Does info position in long context change accuracy?"
- 🍞/🥬 Research-Problem Tree
- What: A map of a paper's big question → sub-questions → concrete experiments.
- How: An LLM parser extracts a hierarchical tree aligned to figures/tables.
- Why: Ensures tasks are grounded in verifiable results.
- Example: Root: "Do LLMs show bias?" Leaf: "Cost predictions by race on dataset X."
- 🍞/🥬 Constrained Rediscovery
- What: Agent gets the mid-level question and scope, but not the exact setup.
- How: The leaf's core result is ground truth; the parent node becomes the prompt.
- Why: Keeps freedom to explore while anchoring evaluation.
- Example: "Design a fair test for racial bias in cost prediction with given data."
- 🍞/🥬 Claim-Level Evaluation
- What: Break conclusions into atomic claims; match against ground truth.
- How: LLM-based extractor and entailment judge with human spot checks.
- Why: Objective, scalable, fine-grained scoring (precision/recall/F1).
- Example: "Middle is worst" matches; "Late is best" contradicts.
- 🍞/🥬 Error Taxonomy
- What: Labels where and how agents fail (planning, implementation, execution, conclusion).
- How: LLM-assisted, human-verified trace analysis.
- Why: Turns "it failed" into "it missed the control design."
- Example: Method Deviation vs Analysis Failure (a small data-structure sketch follows this list).
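To make the taxonomy concrete, here is a minimal sketch of how failure labels could be represented in code. The stage and error names echo the categories described above, but this is an illustrative data structure, not the benchmark's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

# Stages of the research cycle where a failure can occur (per the taxonomy above).
class Stage(Enum):
    PLANNING = "research_planning"
    IMPLEMENTATION = "implementation"
    EXECUTION = "execution"
    CONCLUSION = "conclusion_formation"

# Illustrative failure modes mentioned in this article; the benchmark's full label
# set may be larger or named differently.
class ErrorType(Enum):
    METHOD_DEVIATION = "method_deviation"      # wrong or missing experimental design
    ANALYSIS_FAILURE = "analysis_failure"      # missed or misread a trend in the data
    OVERGENERALIZATION = "overgeneralization"  # claim goes beyond the evidence

@dataclass
class ErrorLabel:
    """One labeled failure observed in an agent trace."""
    stage: Stage
    error_type: ErrorType
    note: str  # free-text explanation, e.g. "skipped the race-neutral baseline"

def summarize(labels: list[ErrorLabel]) -> Counter:
    """Count failures per (stage, error type) pair for a quick diagnostic view."""
    return Counter((l.stage.value, l.error_type.value) for l in labels)

if __name__ == "__main__":
    labels = [
        ErrorLabel(Stage.PLANNING, ErrorType.METHOD_DEVIATION,
                   "injected race labels without a neutral baseline"),
        ErrorLabel(Stage.CONCLUSION, ErrorType.OVERGENERALIZATION,
                   "claimed 'late is best' from a slight recovery"),
    ]
    print(summarize(labels))
```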
03 Methodology
🍞 Top Bread (Hook): Imagine a science cooking show: contestants get a dish name, a pantry, and a time limit. They must plan, cook, and present a dish that matches a classic version. Judges compare bite by bite.
🥬 Filling (High-Level Overview)
- What it is: FIRE-Bench turns real ML papers into rediscovery tasks and scores agents on claim-level correctness.
- How it works (pipeline, sketched in code below): Input (paper) → [Step A: Select verifiable insights] → [Step B: Extract research-problem tree] → [Step C: Create constrained rediscovery task] → [Step D: Run agents end-to-end] → [Step E: Claim-level evaluation] → Output (precision/recall/F1 + diagnostics).
- Why it matters: This recipe checks complete scientific behavior, not just one number.
🍞 Bottom Bread (Anchor): For "Lost in the Middle," the input paper's figure shows accuracy dips in the middle. The task gives the question about position effects; the agent must design the test and rediscover the dip.
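Here is a high-level orchestration sketch of that pipeline. Every function is a placeholder standing in for an entire stage, and all names and return values are illustrative; the actual FIRE-Bench tooling is not shown in this article.

```python
# A minimal, illustrative chaining of Steps A-E. Each stub returns toy data so the
# loop runs end to end; real implementations replace the bodies, not the structure.

def select_papers() -> list[dict]:                      # Step A
    """Return metadata for verifiable, compute-light analysis papers."""
    return [{"title": "Lost in the Middle"}]

def extract_tree(paper: dict) -> dict:                  # Step B
    """Parse the paper into a research-problem tree (root -> sub-questions -> leaves)."""
    return {"question": "How does answer position affect accuracy?",
            "leaves": [{"evidence": "accuracy-by-position figure",
                        "claims": ["Accuracy is lowest for middle positions."]}]}

def make_task(tree: dict) -> dict:                      # Step C
    """Expose only the parent question and scope; keep the leaf's claims as hidden ground truth."""
    return {"prompt": tree["question"], "ground_truth": tree["leaves"][0]["claims"]}

def run_agent(task: dict) -> str:                       # Step D
    """Let the agent plan, code, execute, and write a conclusion (stubbed here)."""
    return "Accuracy is lowest when the answer sits in the middle of the context."

def score(report: str, ground_truth: list[str]) -> dict:  # Step E
    """Split both sides into atomic claims and compute precision/recall/F1 (stubbed)."""
    return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

for paper in select_papers():
    task = make_task(extract_tree(paper))
    print(score(run_agent(task), task["ground_truth"]))
```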
Step A: Source Paper Selection 🍞 Hook: You know how a fair game needs clear rules and a scoreboard? 🥬 The Concept
- What it is: Choose 30 recent, peer-reviewed ML analysis papers with public data/tools, compute-light experiments, and non-trivial, verifiable insights.
- How it works:
- Search top venues with LLM-behavior keywords.
- Filter for empirical analysis (not new models or theory).
- Ensure open inputs, 24-hour compute, and figure/table-grounded claims.
- Why it matters: Keeps tasks reproducible and conclusions checkable. 🍞 Anchor: Picking "LLMs Lack Self-Correction" works because it reports clear patterns across datasets that can be re-run quickly.
Step B: Research-Problem Tree Extraction 🍞 Hook: Imagine turning a chapter into a mind map: big idea at the top, details as branches. 🥬 The Concept
- What it is: An LLM parser builds a tree: root question → sub-questions → leaves (specific experiments tied to figures/tables).
- How it works:
- Use a fixed prompt and greedy decoding to avoid drift.
- Output a JSON tree with node types, links, and evidence (a hypothetical example follows below).
- Humans spot-check for groundedness and coherence.
- Why it matters: This maps each rediscovery to concrete, verifiable results. 🍞 Anchor: For "Premise Order Effects," a leaf might tie to "Figure 2: Accuracy by premise permutation."
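To show what such a tree might hold, here is a hand-written, hypothetical example for a "Lost in the Middle"-style paper. The field names are invented for illustration; the real parser's JSON schema may differ.

```python
import json

# A hypothetical research-problem tree: root question -> sub-question -> leaf with
# the figure it is grounded in and the atomic claims that figure supports.
tree = {
    "root": {
        "question": "How does the position of key information in a long context affect LLM accuracy?",
        "children": [
            {
                "question": "Does accuracy change when the gold document is placed early, middle, or late?",
                "leaves": [
                    {
                        "evidence": "Figure: accuracy vs. gold-document position",
                        "claims": [
                            "Accuracy is highest when the answer appears early in the context.",
                            "Accuracy is lowest when the answer appears in the middle.",
                            "Accuracy partially recovers when the answer appears at the end.",
                        ],
                    }
                ],
            }
        ],
    }
}

print(json.dumps(tree, indent=2))
```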
Step C: Constrained Rediscovery Task Instantiation 🍞 Hook: Getting the riddle's question but not the solution path is both scary and exciting. 🥬 The Concept
- What it is: Choose a central leaf (ground-truth finding) and prompt the agent with its parent node's higher-level question plus allowed scope (datasets/metrics), but hide methods and conclusions.
- How it works:
- Identify the main figure/table (leaf l*).
- Use its parent (v*) as the agent's research question.
- Provide datasets, models, and evaluation criteria only (see the sketch after this step).
- Why it matters: Encourages genuine planning and exploration while keeping evaluation objective. 🍞 Anchor: "Study racial bias in medical cost prediction using dataset B; evaluate cost and length-of-stay. Design the experiment."
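Continuing the hypothetical tree above, a task could be instantiated roughly like this: take a leaf's claims as hidden ground truth and hand the agent only the parent question plus the allowed scope. The prompt wording, class, and field names are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class RediscoveryTask:
    """One constrained rediscovery task: what the agent sees vs. what stays hidden."""
    prompt: str                                   # parent-level research question (visible)
    scope: dict                                   # allowed datasets, models, metrics (visible)
    ground_truth: list[str] = field(repr=False)   # leaf claims (hidden from the agent)

def instantiate(parent_question: str, leaf_claims: list[str], scope: dict) -> RediscoveryTask:
    # The agent is told what to study and with which resources, but never how the
    # original authors ran the experiment or what they concluded.
    prompt = (
        f"Research question: {parent_question}\n"
        f"Allowed datasets: {', '.join(scope['datasets'])}\n"
        f"Allowed models: {', '.join(scope['models'])}\n"
        "Design and run experiments, then report evidence-backed conclusions."
    )
    return RediscoveryTask(prompt=prompt, scope=scope, ground_truth=leaf_claims)

task = instantiate(
    "Does the position of key information in a long context affect accuracy?",
    ["Accuracy is lowest when the answer appears in the middle of the context."],
    {"datasets": ["multi-document QA"], "models": ["an instruction-tuned LLM"]},
)
print(task.prompt)
```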
Step D: Agent End-to-End Research 🍞 Hook: Like planning a treasure hunt, then actually digging, then reporting the loot. 🥬 The Concepts (three mini-steps)
- Research Planning
- What it is: Decide hypotheses, controls, and procedures.
- How it works: List variables, define baselines, design controlled comparisons.
- Why it matters: Bad plans create misleading results.
- Anchor: In bias detection, first remove race cues, then selectively add labels as a control.
- Experimental Execution
- What it is: Implement code, run models, collect data.
- How it works: Set up environments, write scripts, run batches, log outputs.
- Why it matters: Flaky code or tiny samples can break conclusions.
- Anchor: For "Lost in the Middle," generate contexts with the answer early/middle/late; run n samples per position (see the sketch after this list).
- Conclusion Formation
- What it is: Turn numbers into careful claims.
- How it works: Aggregate metrics, compare conditions, check significance, avoid overgeneralization.
- Why it matters: Great data can be ruined by careless summaries.
- Anchor: "Middle positions perform worst; early best; late slightly recovers."
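As referenced in the execution anchor above, here is a minimal sketch of how an agent might build position-controlled contexts and measure accuracy per position. The documents, the question, the `ask_model` callable, and the sample count are all placeholders, not the original experiment.

```python
import random

def build_context(gold_doc: str, distractors: list[str], position: str) -> str:
    """Insert the gold document early, in the middle, or late among distractor passages."""
    docs = distractors[:]                      # copy so we do not mutate the input
    slot = {"early": 0, "middle": len(docs) // 2, "late": len(docs)}[position]
    docs.insert(slot, gold_doc)
    return "\n\n".join(docs)

def run_condition(position: str, n_samples: int, ask_model) -> float:
    """Query a model (ask_model is a placeholder callable) n times; return accuracy."""
    correct = 0
    for _ in range(n_samples):
        distractors = [f"Distractor passage {i}." for i in range(9)]
        random.shuffle(distractors)
        context = build_context("The capital of the example country is Plovdiv.",
                                distractors, position)
        answer = ask_model(context, "What is the capital of the example country?")
        correct += int("plovdiv" in answer.lower())
    return correct / n_samples

# Usage: accuracy per position with a stubbed model call standing in for a real LLM.
fake_model = lambda context, question: "Plovdiv"
for pos in ("early", "middle", "late"):
    print(pos, run_condition(pos, n_samples=5, ask_model=fake_model))
```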
Step E: Claim-Level Evaluation 🍞 Hook: Grading sentence by sentence keeps things fair. 🥬 The Concept
- What it is: Split both the agent's write-up and the paper's text into atomic, checkable claims; match them with an LLM judge.
- How it works:
- Extract claims with a fixed-prompt LLM for both sides.
- Use entailment checking to mark true positives, false positives, and false negatives.
- Compute precision, recall, and F1; validate a subset with humans (≈0.89 F1). A scoring sketch follows this step.
- Why it matters: Makes evaluation scalable and fine-grained, not just opinion-based. 🍞 Anchor: "Middle is worst" → true positive; "Late is best" → contradictory false positive.
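A minimal sketch of the claim-level scoring arithmetic, assuming claim extraction has already happened. The entailment judge is stubbed here as trivial string matching rather than an LLM, so this only illustrates how precision, recall, and F1 fall out of the matches.

```python
def judge(agent_claim: str, paper_claim: str) -> bool:
    """Stand-in for the LLM entailment judge; here, trivial case-insensitive matching."""
    return agent_claim.strip().lower() == paper_claim.strip().lower()

def claim_scores(agent_claims: list[str], paper_claims: list[str]) -> dict:
    """Precision/recall/F1 over atomic claims, mirroring the scoring described above."""
    matched_agent = {a for a in agent_claims if any(judge(a, p) for p in paper_claims)}
    matched_paper = {p for p in paper_claims if any(judge(a, p) for a in agent_claims)}
    precision = len(matched_agent) / len(agent_claims) if agent_claims else 0.0
    recall = len(matched_paper) / len(paper_claims) if paper_claims else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

paper = ["Middle positions perform worst.", "Early positions perform best."]
agent = ["Middle positions perform worst.", "Late positions perform best."]  # 2nd is a false positive
print(claim_scores(agent, paper))  # precision 0.5, recall 0.5, f1 0.5
```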
Secret Sauce 🍞 Hook: Sometimes a tiny trick makes the whole recipe work. 🥬 The Concept
- What it is: Constrained rediscovery paired with claim-level scoring.
- How it works:
- Give freedom at the method level but fix the target insight.
- Score conclusions at the claim level, not the whole essay.
- Diagnose process failures with a structured taxonomy.
- Why it matters: Balances exploration with verifiability and reveals where agents stumble. 🍞 Anchor: Two agents may use different scripts, but only the one that rediscovers "middle is worst" earns high recall.
Concrete Data Examples
- Example 1 (Performance): Claude Code averaged ~46.7 F1; Codex ~41.9; OpenHands (gpt-5) ~37.9; OpenHands (o4-mini) ~31.9.
- Example 2 (Task): "Persona with Catch" saw a top F1 ≈ 88.6 (procedurally direct). "LLM Racial Bias in Medicine" saw many failures due to missing controls (planning errors).
- Example 3 (Cost): Higher-capability backbones tended to cost more; Codex showed a good cost-performance balance via shorter action sequences.
Why Each Step Exists (What breaks without it)
- No careful paper selection → non-reproducible or non-verifiable tasks.
- No problem tree → tasks drift away from concrete, scorable findings.
- No constrained prompt → agents copy methods or chase trivialities.
- No end-to-end run → we miss real research weaknesses.
- No claim-level scoring → vague grading, hard to trust or compare.
04 Experiments & Results
🍞 Top Bread (Hook): If you race four teams on the same obstacle course three times each, you learn who's fast, who's consistent, and where they trip.
🥬 Filling (The Test)
- What it is: FIRE-Bench measures whether agents can rediscover the true findings across 30 ML analysis tasks.
- How it works:
- Each agent runs each task three times to check consistency (an aggregation sketch follows below).
- Final write-ups are split into atomic claims and matched to the paper's claims.
- We report precision, recall, and F1.
- Why it matters: This directly tests end-to-end scientific reasoning, not just code running or answer guessing.
🍞 Bottom Bread (Anchor): On "Lost in the Middle," the correct rediscovery is that the middle position is worst; agents that miss the slight late-position recovery lose some recall.
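Since each agent repeats each task three times, per-task results are naturally summarized as a mean plus a spread. A small aggregation sketch with invented run scores (not the paper's raw data):

```python
from statistics import mean, stdev

# Hypothetical F1 scores for one agent: task name -> three independent runs.
runs = {
    "lost_in_the_middle": [91.7, 62.0, 71.3],
    "racial_bias_in_medicine": [12.0, 0.0, 25.4],
}

for task, f1s in runs.items():
    # A high standard deviation across runs signals fragile, run-dependent behavior.
    print(f"{task}: mean F1 = {mean(f1s):.1f}, std = {stdev(f1s):.1f}")
```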
🍞 Top Bread (Hook): Scoreboards make numbers meaningful when you add context.
🥬 Filling (The Competition & Scoreboard)
- What it is: Four agent setups: OpenHands (o4-mini), OpenHands (gpt-5), Codex (gpt-5-medium), and Claude Code (Claude-4-Sonnet).
- How it works:
- Average F1 (higher is better): Claude Code ≈ 46.7; Codex ≈ 41.9; OpenHands (gpt-5) ≈ 37.9; OpenHands (o4-mini) ≈ 31.9.
- Variance is high across many tasks and runs.
- Frontier backbones help but do not solve core weaknesses.
- Why it matters: Even top agents are below 50 F1 on average, like getting a high C when an A is needed for reliable science.
🍞 Bottom Bread (Anchor): Codex sometimes achieves strong task scores with lower cost due to efficient trajectories, while Claude Code leads in average F1 but at higher expense.
🍞 Top Bread (Hook): Some puzzles are a single straight line; others require careful control experiments.
🥬 Filling (Task Structure Matters)
- What it is: Agents excel at procedurally direct tasks but struggle on control-based or causal tasks.
- How it works:
- High scores on direct pipelines: "Lost in the Middle" (best ≈ 91.7), "Persona with Catch" (≈ 88.6), "CoT Without Prompting" (≈ 82.6), "Hallucination Snowballing" (≈ 80.9).
- Low scores on control-heavy design: e.g., "LLM Racial Bias in Medicine" requires building proper counterfactuals; agents often skipped key controls.
- Why it matters: Planning and causal thinking are the main bottlenecks.
🍞 Bottom Bread (Anchor): Many bias tests failed because agents injected race labels without first establishing a race-neutral baseline.
🍞 Top Bread (Hook): False alarms vs. misses tell you where the detector goes wrong.
🥬 Filling (Error Patterns)
- What it is: Failures concentrate in Research Planning and Conclusion Formation.
- How it works:
- False positives are mostly Contradictory or Unrelated claims; few are valid "Alternative" insights.
- Frequent errors: Method Deviation (wrong design), Overgeneralization, and Analysis Failures (missing trends).
- Why it matters: Agents need better study design and careful, evidence-grounded summaries.
🍞 Bottom Bread (Anchor): Saying "late is best" when the paper shows only a slight late recovery is a Contradictory false positive.
🍞 Top Bread (Hook): Wallet check: performance often costs tokens.
🥬 Filling (Cost-Performance)
- What it is: Stronger models tend to cost more but can score higher; efficient agents can do more with less.
- How it works:
- Claude Code had the highest average F1 and highest total estimated cost.
- Codex reached a solid F1 with notably lower estimated spend via shorter action sequences (a cost-per-F1 sketch follows below).
- Why it matters: Practical deployments must balance accuracy with budget.
🍞 Bottom Bread (Anchor): Tasks with long reasoning chains (e.g., self-correction studies) consumed more tokens across agents.
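One simple way to compare agents on this trade-off is cost per F1 point. In the sketch below, the F1 averages echo the scoreboard above, but the dollar costs are placeholder values, not figures reported by the benchmark.

```python
# Illustrative cost-performance comparison. The dollar costs are invented placeholders.
agents = {
    "Claude Code": {"f1": 46.7, "cost_usd": 100.0},
    "Codex":       {"f1": 41.9, "cost_usd": 60.0},
}

for name, stats in agents.items():
    # Lower cost per F1 point means a better performance-cost balance.
    print(f"{name}: {stats['cost_usd'] / stats['f1']:.2f} USD per F1 point")
```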
🍞 Top Bread (Hook): Did they just memorize the answers?
🥬 Filling (Data Contamination Check)
- What it is: Compare tasks before vs. after model knowledge cutoffs, controlled for difficulty.
- How it works:
- Stratify tasks into Easy/Medium/Hard with a rubric.
- Check F1 for pre- vs. post-cutoff tasks within each band (a comparison sketch follows below).
- Observe no consistent pre-cutoff advantage.
- Why it matters: Reduces (but doesn't eliminate) the worry that agents just memorized papers.
🍞 Bottom Bread (Anchor): For Hard tasks, some agents even did better post-cutoff, suggesting difficulty, not memorization, drives performance gaps.
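A sketch of that stratified comparison, assuming each task record carries a difficulty band, a flag for whether its source paper predates the model's knowledge cutoff, and an F1 score. All numbers here are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Invented task records: (difficulty band, published before the model's cutoff?, F1).
tasks = [
    ("easy", True, 72.0), ("easy", False, 70.5),
    ("hard", True, 28.0), ("hard", False, 33.0),
]

by_band = defaultdict(lambda: {"pre": [], "post": []})
for band, pre_cutoff, f1 in tasks:
    by_band[band]["pre" if pre_cutoff else "post"].append(f1)

for band, groups in by_band.items():
    # A consistent pre-cutoff advantage within a band would hint at memorization;
    # comparable (or post-favoring) means suggest difficulty drives the gaps instead.
    print(band, "pre:", mean(groups["pre"]), "post:", mean(groups["post"]))
```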
05 Discussion & Limitations
🍞 Top Bread (Hook): Even great athletes have weak spots; naming them helps training.
🥬 Filling (Honest Assessment)
- Limitations (what this can't do):
- Rediscovery may punish genuinely new but valid findings that differ from the source paper.
- Claim extraction and matching use an LLM judge; though human checks show high agreement, it's not perfect.
- The 30-task set is ML-focused; results may not generalize to biology or physics without adaptation.
- Proprietary agent details are opaque; differences might stem from hidden tools or settings.
- Some tasks still carry stochastic variance; three runs help but don't fully stabilize outcomes.
- Required Resources: GPU/CPU sandboxing, API access to LLMs, public datasets, and modest compute (most runs under 24 hours on an 80GB A100).
- When NOT to Use:
- If your goal is to reward novel discoveries over agreement with past papers.
- If your domain demands wet-lab or long training cycles outside compute-light bounds.
- If you need human-only judgment of narrative quality over claim-level correctness.
- Open Questions:
- How to reduce run-to-run variance and make planning more reliable?
- Can we design automatic checks for missing controls and causal pitfalls during planning time?
- How to blend human and LLM judging for even stronger validity at scale?
- Can rediscovery tasks extend beyond ML into multi-modal or lab-in-the-loop sciences?
- How to fairly detect and discount training-data contamination with limited visibility?
🍞 Bottom Bread (Anchor): Think of FIRE-Bench like a coach that not only posts your score but circles, "Fix your warm-up (planning) and your cool-down (conclusion)."
06 Conclusion & Future Work
🍞 Top Bread (Hook): Imagine testing a scientist robot by asking it to re-prove a known fact without seeing the original steps: can it plan, test, and conclude correctly?
🥬 Filling (Takeaway)
- 3-Sentence Summary: FIRE-Bench evaluates AI agents on full-cycle scientific rediscovery: given only a high-level question from recent ML papers, agents must plan, implement, run, and conclude with evidence. Scoring at the claim level against ground-truth findings shows that today's agents average below 50 F1 with high variance, struggling most with study design and evidence-to-claim reasoning. This benchmark delivers both rigorous measurement and a diagnostic lens to guide future agent improvements.
- Main Achievement: Turning complex, end-to-end scientific reasoning into objective, verifiable, claim-level evaluation, without relying solely on subjective paper judging.
- Future Directions: Stronger planning with explicit control design, execution-grounded inference checks, process-level audits, broader scientific domains, and better contamination defenses.
- Why Remember This: FIRE-Bench marks a shift from "nice-looking papers" or "one-number gains" to "did the agent truly rediscover the science?", a standard that will shape how we trust AI in real research.
🍞 Bottom Bread (Anchor): It's the difference between a student who copies steps and one who can re-derive the result on their own; FIRE-Bench measures the latter, claim by claim.
Practical Applications
- Evaluate your in-house research agent on end-to-end tasks before deploying it in critical workflows.
- Use the error taxonomy to debug agent failures in planning vs. analysis and prioritize fixes.
- Create training curricula that teach agents to design proper controls and avoid overgeneralized conclusions.
- Benchmark different agent stacks and backbones to choose the best performance-cost mix for your team.
- Adopt claim-level scoring to grade internal research summaries and reports more objectively.
- Run ablation studies on tool use (e.g., retrieval, plotting, statistics) to see what boosts rediscovery success.
- Gate access to high-stakes domains (e.g., healthcare) by requiring a minimum F1 and low variance on relevant tasks (see the sketch at the end of this list).
- Monitor for data contamination signals by stratifying tasks by difficulty and knowledge cutoffs.
- Automate regression testing of agent updates to ensure improvements don't break planning or conclusions.
- Use constrained rediscovery tasks in education to teach students the scientific method, controls, and fair comparisons.
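For the gating idea above, here is a minimal sketch of a release check that requires both a minimum mean F1 and a maximum run-to-run spread on the relevant tasks. The thresholds and scores are placeholders to be tuned for your own deployment.

```python
from statistics import mean, stdev

def passes_gate(f1_runs: list[float], min_mean: float = 60.0, max_std: float = 10.0) -> bool:
    """Require both adequate average performance and low run-to-run variance."""
    return mean(f1_runs) >= min_mean and stdev(f1_runs) <= max_std

# Placeholder scores from three benchmark runs of a candidate agent on one task suite.
print(passes_gate([62.0, 65.5, 61.0]))   # True: strong and stable
print(passes_gate([70.0, 30.0, 55.0]))   # False: too much variance between runs
```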