FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights
Key Summary
- FIRE-Bench is a new test that checks whether AI agents can fully redo real scientific discoveries, step by step, not just guess answers.
- Instead of asking AIs to invent brand-new ideas (hard to verify), FIRE-Bench asks them to rediscover proven findings from recent ML papers using only a high-level question.
- Agents must plan studies, write code, run experiments, and draw conclusions backed by data; their final claims are scored against the paper's true findings.
- Across 30 tasks, even the strongest agents scored under 50 F1 on average and showed big ups and downs between runs, meaning results are fragile.
- Most failures came from weak research planning and shaky evidence-to-conclusion reasoning, not just coding mistakes.
- Agents did better on straightforward, step-by-step tasks and struggled when careful control design or causal thinking was needed.
- The evaluation uses claim-level precision/recall with an LLM judge plus human spot-checks to keep it fair and consistent.
- No strong signs of data contamination were found after adjusting for task difficulty and model knowledge cutoffs.
- Better models usually cost more to run, but efficient execution can improve the performance-cost balance.
- FIRE-Bench gives a clear, diagnostic way to measure real scientific reasoning so we can build more reliable AI research helpers.
Why This Research Matters
If we want AI to help scientists, we must be sure it can plan solid experiments, run them correctly, and draw careful, evidence-backed conclusions. FIRE-Bench checks exactly that by asking agents to rediscover known results, so we can fairly grade their scientific reasoning. This reduces the risk of trusting agents that sound confident but skip critical controls or overstate what the data shows. It also gives developers a clear map of where agents fail (planning, execution, or conclusion) so improvements target the right skills. The benchmark's claim-level scoring makes evaluations objective and scalable. In the long run, this helps build research AIs that are reliable partners in areas like medicine, safety, and policy, where accuracy and honesty matter most.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how in science class you don't just give an answer: you make a plan, try things out, gather data, and then explain what it all means? Real science is a full journey, not a single step.
🥬 Filling (The Actual Concept)
- What it is: This paper asks a big question: can AI agents do that full science journey and end up with true, checkable discoveries?
- How it works (story of the world before):
- Before, AI agents were good at parts of research, like reading papers, writing code, or making plots, but not necessarily the whole process end to end.
- People tried two main kinds of tests:
- Full paper generation: Agents wrote entire papers from a broad prompt, but judging their correctness needed other AIs to be the judges, which can be subjective.
- Single-metric leaderboards: Agents just had to boost one number (like accuracy) on a benchmark. That's easy to score, but it doesn't tell us if the agent truly understood the science.
- The problem: We lacked a reliable way to check if an AI could run complete research, from question to data-backed conclusion, without leaning on opinionated grading.
- Why it matters: Without a rigorous, checkable test, we can't safely trust AI to help in real science where mistakes are costly and confidence must come from evidence.
🍞 Bottom Bread (Anchor): Imagine testing a cooking robot. One test says, "Make a meal and a panel of robots will taste it." Another says, "Make the soup 2% saltier." Neither proves the robot can run a full dinner from planning to serving. We need a test where the robot plans, cooks, tastes, adjusts, and serves a dish that matches a known recipe, so we can verify it got the core idea right.
🍞 Top Bread (Hook): You know how a teacher prefers math problems with answers in the back of the book? That's because the teacher can verify your work without guessing.
🥬 Filling (LLM-as-judge)
- What it is: LLM-as-judge means using a language model to grade other models' work.
- How it works:
- An AI produces an answer.
- Another AI reads it and says if it's good.
- We accept the second AI's opinion.
- Why it matters: If the judge AI is wrong or biased, the grades are unreliable.
🍞 Bottom Bread (Anchor): It's like asking a robot teacher to grade your essay. If the robot misreads a paragraph, your score might be unfair.
🍞 Top Bread (Hook): Imagine your science fair: a poster without real experiments won't win, and a perfect graph without an explanation won't, either.
🥬 Filling (The Gap)
- What it is: We needed a benchmark that checks the whole scientific cycle, not just bits and pieces.
- How it works:
- Give agents a solid, high-level research question from a real, trusted paper.
- Hide the original methods and results.
- See if agents can independently design tests, run them, and reach the true insight.
- Why it matters: This tests real understanding, not memorization or guessing.
🍞 Bottom Bread (Anchor): It's like giving a student the big question from a chapter test but not the worked examples, then checking if they can figure out the same key lesson the book teaches by doing their own work.
🍞 Top Bread (Hook): Picture two roads: one is a straight sidewalk; the other is a winding hiking trail with rivers to cross. Which is harder?
🥬 Filling (What failed before and why)
- What it is: Past benchmarks either felt too subjective or too narrow.
- How it works:
- Full-paper tasks: expressive but hard to verify at scale.
- Single-metric tasks: easy to verify but don't test scientific reasoning.
- Why it matters: Without a middle path, we couldn't both allow exploration and keep evaluation objective.
🍞 Bottom Bread (Anchor): We needed a game where players are free to try strategies but score points only when they rediscover the true trick that wins, just like the original champions did.
🍞 Top Bread (Hook): Think of everyday stakes, like medicine or climate, where wrong conclusions can hurt people.
🥬 Filling (Real Stakes)
- What it is: Reliable evaluation prevents overconfident deployment of shaky AI scientists.
- How it works:
- Use only verifiable questions with known right answers.
- Check claims precisely, not just overall vibes.
- Diagnose where the process breaks: planning, running, or concluding.
- Why it matters: This keeps AI research helpers honest, safer, and genuinely useful.
🍞 Bottom Bread (Anchor): If an AI suggests a medical policy, we must know it can plan fair tests, run them correctly, and conclude with evidence; FIRE-Bench moves us closer to that standard.
02 Core Idea
🍞 Top Bread (Hook): Imagine a treasure hunt where you only get the map's title, not the route. Can you still find the treasure by making your own plan and checking clues carefully?
🥬 Filling (The "Aha!" Moment)
- What it is: FIRE-Bench tests if AI agents can rediscover known scientific insights from real papers, starting with just a high-level question, by planning, experimenting, and concluding with evidence.
- How it works:
- Pick a recent, peer-reviewed ML paper with a clear, verifiable finding.
- Give the agent only the high-level research question and allowed tools/datasets.
- The agent must plan experiments, implement code, run tests, and write conclusions.
- We split both agent and paper conclusions into atomic claims and match them.
- Why it matters: This reveals whether agents truly understand and can reproduce science-like reasoning, not just output nice-looking text.
🍞 Bottom Bread (Anchor): It's like handing a student the question, "Do plants grow better with more sunlight?" and seeing if they design the right tests and reach the classic correct pattern, without peeking.
🍞 Top Bread (Hook): You know how teachers explain the same math idea three ways so everyone gets it?
🥬 Filling (Multiple Analogies)
- What it is: Three views of the same idea.
- How it works:
- Detective analogy: The agent is a detective; the paper's insight is the solved case; with only the case question, can the agent trace clues and solve it again?
- Cooking analogy: The agent gets the dish name (not the recipe). Can it invent a reasonable recipe, cook it, and serve a dish that matches the original's core taste?
- Science fair analogy: The agent gets the big question on the poster header; can it run the right experiments and end with the same takeaways on the poster's bottom line?
- Why it matters: If the agent can pass all three stories, it's doing real reasoning.
🍞 Bottom Bread (Anchor): For "Lost in the Middle," agents test accuracy when key info is early, middle, or late in context, and the correct rediscovery is: middle is worst.
🍞 Top Bread (Hook): Before and after pictures help you see change at a glance.
🥬 Filling (Before vs After)
- What it is: What changes with FIRE-Bench.
- How it works:
- Before: Evaluations leaned on opinionated judging or narrow metrics.
- After: Evaluations check end-to-end scientific behavior, claim by claim.
- Before: Limited insight into why agents fail.
- After: Fine-grained error taxonomy shows whether planning, execution, or conclusion is the weak link.
- Why it matters: We can now measure real scientific competence and debug it.
🍞 Bottom Bread (Anchor): Instead of just saying "Agent got 70% accuracy," we can say "Agent missed the control group, so its conclusion about bias isn't supported."
🍞 Top Bread (Hook): Picture the gears inside a clock; seeing how they mesh explains why timekeeping works.
🥬 Filling (Why It Works, no equations)
- What it is: FIRE-Bench anchors freedom to explore with a solid target.
- How it works:
- Agents are free to choose methods, but targets are fixed, verifiable insights.
- Claim-level scoring turns fuzzy essays into checkable bites.
- Tasks are compute-light and public, so results are reproducible.
- Why it matters: This balance encourages genuine reasoning, not guesswork.
🍞 Bottom Bread (Anchor): On "LLMs Lack Self-Correction," an agent can try different reflection loops, but it must conclude the true pattern: self-correction doesn't reliably help most tasks.
🍞 Top Bread (Hook): Building a LEGO set is easier when you know the pieces.
🥬 Filling (Building Blocks)
- 🍞/🥬 FIRE-Bench
- What: The full-cycle rediscovery benchmark.
- How: Tasks from 30 recent ML analyses; agents get questions, not methods.
- Why: To test research planning, execution, and conclusion together.
- Example: "Does info position in long context change accuracy?"
- 🍞/🥬 Research-Problem Tree
- What: A map of a paper's big question → sub-questions → concrete experiments.
- How: An LLM parser extracts a hierarchical tree aligned to figures/tables.
- Why: Ensures tasks are grounded in verifiable results.
- Example: Root: "Do LLMs show bias?" Leaf: "Cost predictions by race on dataset X."
- 🍞/🥬 Constrained Rediscovery
- What: Agent gets the mid-level question and scope, but not the exact setup.
- How: The leaf's core result is ground truth; the parent node becomes the prompt.
- Why: Keeps freedom to explore while anchoring evaluation.
- Example: "Design a fair test for racial bias in cost prediction with given data."
- 🍞/🥬 Claim-Level Evaluation
- What: Break conclusions into atomic claims; match against ground truth.
- How: LLM-based extractor and entailment judge with human spot checks.
- Why: Objective, scalable, fine-grained scoring (precision/recall/F1).
- Example: "Middle is worst" matches; "Late is best" contradicts.
- 🍞/🥬 Error Taxonomy
- What: Labels where and how agents fail (planning, implementation, execution, conclusion).
- How: LLM-assisted, human-verified trace analysis.
- Why: Turns "it failed" into "it missed the control design."
- Example: Method Deviation vs Analysis Failure (a small data-structure sketch follows this list).
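To make the taxonomy concrete, here is a minimal sketch of how failure labels could be represented in code. The stage and error names echo the categories described above, but this is an illustrative data structure, not the benchmark's actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

# Stages of the research cycle where a failure can occur (per the taxonomy above).
class Stage(Enum):
    PLANNING = "research_planning"
    IMPLEMENTATION = "implementation"
    EXECUTION = "execution"
    CONCLUSION = "conclusion_formation"

# Illustrative failure modes mentioned in this article; the benchmark's full label
# set may be larger or named differently.
class ErrorType(Enum):
    METHOD_DEVIATION = "method_deviation"      # wrong or missing experimental design
    ANALYSIS_FAILURE = "analysis_failure"      # missed or misread a trend in the data
    OVERGENERALIZATION = "overgeneralization"  # claim goes beyond the evidence

@dataclass
class ErrorLabel:
    """One labeled failure observed in an agent trace."""
    stage: Stage
    error_type: ErrorType
    note: str  # free-text explanation, e.g. "skipped the race-neutral baseline"

def summarize(labels: list[ErrorLabel]) -> Counter:
    """Count failures per (stage, error type) pair for a quick diagnostic view."""
    return Counter((l.stage.value, l.error_type.value) for l in labels)

if __name__ == "__main__":
    labels = [
        ErrorLabel(Stage.PLANNING, ErrorType.METHOD_DEVIATION,
                   "injected race labels without a neutral baseline"),
        ErrorLabel(Stage.CONCLUSION, ErrorType.OVERGENERALIZATION,
                   "claimed 'late is best' from a slight recovery"),
    ]
    print(summarize(labels))
```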
03 Methodology
🍞 Top Bread (Hook): Imagine a science cooking show: contestants get a dish name, a pantry, and a time limit. They must plan, cook, and present a dish that matches a classic version. Judges compare bite by bite.
🥬 Filling (High-Level Overview)
- What it is: FIRE-Bench turns real ML papers into rediscovery tasks and scores agents on claim-level correctness.
- How it works (pipeline, sketched in code below): Input (paper) → [Step A: Select verifiable insights] → [Step B: Extract research-problem tree] → [Step C: Create constrained rediscovery task] → [Step D: Run agents end-to-end] → [Step E: Claim-level evaluation] → Output (precision/recall/F1 + diagnostics).
- Why it matters: This recipe checks complete scientific behavior, not just one number.
🍞 Bottom Bread (Anchor): For "Lost in the Middle," the input paper's figure shows accuracy dips in the middle. The task gives the question about position effects; the agent must design the test and rediscover the dip.
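Here is a high-level orchestration sketch of that pipeline. Every function is a placeholder standing in for an entire stage, and all names and return values are illustrative; the actual FIRE-Bench tooling is not shown in this article.

```python
# A minimal, illustrative chaining of Steps A-E. Each stub returns toy data so the
# loop runs end to end; real implementations replace the bodies, not the structure.

def select_papers() -> list[dict]:                      # Step A
    """Return metadata for verifiable, compute-light analysis papers."""
    return [{"title": "Lost in the Middle"}]

def extract_tree(paper: dict) -> dict:                  # Step B
    """Parse the paper into a research-problem tree (root -> sub-questions -> leaves)."""
    return {"question": "How does answer position affect accuracy?",
            "leaves": [{"evidence": "accuracy-by-position figure",
                        "claims": ["Accuracy is lowest for middle positions."]}]}

def make_task(tree: dict) -> dict:                      # Step C
    """Expose only the parent question and scope; keep the leaf's claims as hidden ground truth."""
    return {"prompt": tree["question"], "ground_truth": tree["leaves"][0]["claims"]}

def run_agent(task: dict) -> str:                       # Step D
    """Let the agent plan, code, execute, and write a conclusion (stubbed here)."""
    return "Accuracy is lowest when the answer sits in the middle of the context."

def score(report: str, ground_truth: list[str]) -> dict:  # Step E
    """Split both sides into atomic claims and compute precision/recall/F1 (stubbed)."""
    return {"precision": 0.0, "recall": 0.0, "f1": 0.0}

for paper in select_papers():
    task = make_task(extract_tree(paper))
    print(score(run_agent(task), task["ground_truth"]))
```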
Step A: Source Paper Selection 🍞 Hook: You know how a fair game needs clear rules and a scoreboard? 🥬 The Concept
- What it is: Choose 30 recent, peer-reviewed ML analysis papers with public data/tools, compute-light experiments, and non-trivial, verifiable insights.
- How it works:
- Search top venues with LLM-behavior keywords.
- Filter for empirical analysis (not new models or theory).
- Ensure open inputs, 24-hour compute, and figure/table-grounded claims.
- Why it matters: Keeps tasks reproducible and conclusions checkable. 🍞 Anchor: Picking "LLMs Lack Self-Correction" works because it reports clear patterns across datasets that can be re-run quickly.
Step B: Research-Problem Tree Extraction 🍞 Hook: Imagine turning a chapter into a mind map: big idea at the top, details as branches. 🥬 The Concept
- What it is: An LLM parser builds a tree: root question → sub-questions → leaves (specific experiments tied to figures/tables).
- How it works:
- Use a fixed prompt and greedy decoding to avoid drift.
- Output a JSON tree with node types, links, and evidence (a hypothetical example follows below).
- Humans spot-check for groundedness and coherence.
- Why it matters: This maps each rediscovery to concrete, verifiable results. 🍞 Anchor: For "Premise Order Effects," a leaf might tie to "Figure 2: Accuracy by premise permutation."
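To show what such a tree might hold, here is a hand-written, hypothetical example for a "Lost in the Middle"-style paper. The field names are invented for illustration; the real parser's JSON schema may differ.

```python
import json

# A hypothetical research-problem tree: root question -> sub-question -> leaf with
# the figure it is grounded in and the atomic claims that figure supports.
tree = {
    "root": {
        "question": "How does the position of key information in a long context affect LLM accuracy?",
        "children": [
            {
                "question": "Does accuracy change when the gold document is placed early, middle, or late?",
                "leaves": [
                    {
                        "evidence": "Figure: accuracy vs. gold-document position",
                        "claims": [
                            "Accuracy is highest when the answer appears early in the context.",
                            "Accuracy is lowest when the answer appears in the middle.",
                            "Accuracy partially recovers when the answer appears at the end.",
                        ],
                    }
                ],
            }
        ],
    }
}

print(json.dumps(tree, indent=2))
```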
Step C: Constrained Rediscovery Task Instantiation 🍞 Hook: Getting the riddle's question but not the solution path is both scary and exciting. 🥬 The Concept
- What it is: Choose a central leaf (ground-truth finding) and prompt the agent with its parent node's higher-level question plus allowed scope (datasets/metrics), but hide methods and conclusions.
- How it works:
- Identify the main figure/table (leaf l*).
- Use its parent (v*) as the agent's research question.
- Provide datasets, models, and evaluation criteria only (see the sketch after this step).
- Why it matters: Encourages genuine planning and exploration while keeping evaluation objective. 🍞 Anchor: "Study racial bias in medical cost prediction using dataset B; evaluate cost and length-of-stay. Design the experiment."
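Continuing the hypothetical tree above, a task could be instantiated roughly like this: take a leaf's claims as hidden ground truth and hand the agent only the parent question plus the allowed scope. The prompt wording, class, and field names are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class RediscoveryTask:
    """One constrained rediscovery task: what the agent sees vs. what stays hidden."""
    prompt: str                                   # parent-level research question (visible)
    scope: dict                                   # allowed datasets, models, metrics (visible)
    ground_truth: list[str] = field(repr=False)   # leaf claims (hidden from the agent)

def instantiate(parent_question: str, leaf_claims: list[str], scope: dict) -> RediscoveryTask:
    # The agent is told what to study and with which resources, but never how the
    # original authors ran the experiment or what they concluded.
    prompt = (
        f"Research question: {parent_question}\n"
        f"Allowed datasets: {', '.join(scope['datasets'])}\n"
        f"Allowed models: {', '.join(scope['models'])}\n"
        "Design and run experiments, then report evidence-backed conclusions."
    )
    return RediscoveryTask(prompt=prompt, scope=scope, ground_truth=leaf_claims)

task = instantiate(
    "Does the position of key information in a long context affect accuracy?",
    ["Accuracy is lowest when the answer appears in the middle of the context."],
    {"datasets": ["multi-document QA"], "models": ["an instruction-tuned LLM"]},
)
print(task.prompt)
```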
Step D: Agent End-to-End Research 🍞 Hook: Like planning a treasure hunt, then actually digging, then reporting the loot. 🥬 The Concepts (three mini-steps)
- Research Planning
- What it is: Decide hypotheses, controls, and procedures.
- How it works: List variables, define baselines, design controlled comparisons.
- Why it matters: Bad plans create misleading results.
- Anchor: In bias detection, first remove race cues, then selectively add labels as a control.
- Experimental Execution
- What it is: Implement code, run models, collect data.
- How it works: Set up environments, write scripts, run batches, log outputs.
- Why it matters: Flaky code or tiny samples can break conclusions.
- Anchor: For "Lost in the Middle," generate contexts with the answer early/middle/late; run n samples per position (see the sketch after this list).
- Conclusion Formation
- What it is: Turn numbers into careful claims.
- How it works: Aggregate metrics, compare conditions, check significance, avoid overgeneralization.
- Why it matters: Great data can be ruined by careless summaries.
- Anchor: "Middle positions perform worst; early best; late slightly recovers."
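As referenced in the execution anchor above, here is a minimal sketch of how an agent might build position-controlled contexts and measure accuracy per position. The documents, the question, the `ask_model` callable, and the sample count are all placeholders, not the original experiment.

```python
import random

def build_context(gold_doc: str, distractors: list[str], position: str) -> str:
    """Insert the gold document early, in the middle, or late among distractor passages."""
    docs = distractors[:]                      # copy so we do not mutate the input
    slot = {"early": 0, "middle": len(docs) // 2, "late": len(docs)}[position]
    docs.insert(slot, gold_doc)
    return "\n\n".join(docs)

def run_condition(position: str, n_samples: int, ask_model) -> float:
    """Query a model (ask_model is a placeholder callable) n times; return accuracy."""
    correct = 0
    for _ in range(n_samples):
        distractors = [f"Distractor passage {i}." for i in range(9)]
        random.shuffle(distractors)
        context = build_context("The capital of the example country is Plovdiv.",
                                distractors, position)
        answer = ask_model(context, "What is the capital of the example country?")
        correct += int("plovdiv" in answer.lower())
    return correct / n_samples

# Usage: accuracy per position with a stubbed model call standing in for a real LLM.
fake_model = lambda context, question: "Plovdiv"
for pos in ("early", "middle", "late"):
    print(pos, run_condition(pos, n_samples=5, ask_model=fake_model))
```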
Step E: Claim-Level Evaluation 🍞 Hook: Grading sentence by sentence keeps things fair. 🥬 The Concept
- What it is: Split both the agent's write-up and the paper's text into atomic, checkable claims; match them with an LLM judge.
- How it works:
- Extract claims with a fixed-prompt LLM for both sides.
- Use entailment checking to mark true positives, false positives, and false negatives.
- Compute precision, recall, and F1; validate a subset with humans (≈0.89 F1). A scoring sketch follows this step.
- Why it matters: Makes evaluation scalable and fine-grained, not just opinion-based. 🍞 Anchor: "Middle is worst" → true positive; "Late is best" → contradictory false positive.
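A minimal sketch of the claim-level scoring arithmetic, assuming claim extraction has already happened. The entailment judge is stubbed here as trivial string matching rather than an LLM, so this only illustrates how precision, recall, and F1 fall out of the matches.

```python
def judge(agent_claim: str, paper_claim: str) -> bool:
    """Stand-in for the LLM entailment judge; here, trivial case-insensitive matching."""
    return agent_claim.strip().lower() == paper_claim.strip().lower()

def claim_scores(agent_claims: list[str], paper_claims: list[str]) -> dict:
    """Precision/recall/F1 over atomic claims, mirroring the scoring described above."""
    matched_agent = {a for a in agent_claims if any(judge(a, p) for p in paper_claims)}
    matched_paper = {p for p in paper_claims if any(judge(a, p) for a in agent_claims)}
    precision = len(matched_agent) / len(agent_claims) if agent_claims else 0.0
    recall = len(matched_paper) / len(paper_claims) if paper_claims else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

paper = ["Middle positions perform worst.", "Early positions perform best."]
agent = ["Middle positions perform worst.", "Late positions perform best."]  # 2nd is a false positive
print(claim_scores(agent, paper))  # precision 0.5, recall 0.5, f1 0.5
```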
Secret Sauce 🍞 Hook: Sometimes a tiny trick makes the whole recipe work. 🥬 The Concept
- What it is: Constrained rediscovery paired with claim-level scoring.
- How it works:
- Give freedom at the method level but fix the target insight.
- Score conclusions at the claim level, not the whole essay.
- Diagnose process failures with a structured taxonomy.
- Why it matters: Balances exploration with verifiability and reveals where agents stumble. 🍞 Anchor: Two agents may use different scripts, but only the one that rediscovers "middle is worst" earns high recall.
Concrete Data Examples
- Example 1 (Performance): Claude Code averaged ~46.7 F1; Codex ~41.9; OpenHands (gpt-5) ~37.9; OpenHands (o4-mini) ~31.9.
- Example 2 (Task): "Persona with Catch" saw a top F1 ≈ 88.6 (procedurally direct). "LLM Racial Bias in Medicine" saw many failures due to missing controls (planning errors).
- Example 3 (Cost): Higher-capability backbones tended to cost more; Codex showed a good cost-performance balance via shorter action sequences.
Why Each Step Exists (What breaks without it)
- No careful paper selection → non-reproducible or non-verifiable tasks.
- No problem tree → tasks drift away from concrete, scorable findings.
- No constrained prompt → agents copy methods or chase trivialities.
- No end-to-end run → we miss real research weaknesses.
- No claim-level scoring → vague grading, hard to trust or compare.
04 Experiments & Results
🍞 Top Bread (Hook): If you race four teams on the same obstacle course three times each, you learn who's fast, who's consistent, and where they trip.
🥬 Filling (The Test)
- What it is: FIRE-Bench measures whether agents can rediscover the true findings across 30 ML analysis tasks.
- How it works:
- Each agent runs each task three times to check consistency (an aggregation sketch follows below).
- Final write-ups are split into atomic claims and matched to the paper's claims.
- We report precision, recall, and F1.
- Why it matters: This directly tests end-to-end scientific reasoning, not just code running or answer guessing.
🍞 Bottom Bread (Anchor): On "Lost in the Middle," the correct rediscovery is that the middle position is worst; agents that miss the slight late-position recovery lose some recall.
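Since each agent repeats each task three times, per-task results are naturally summarized as a mean plus a spread. A small aggregation sketch with invented run scores (not the paper's raw data):

```python
from statistics import mean, stdev

# Hypothetical F1 scores for one agent: task name -> three independent runs.
runs = {
    "lost_in_the_middle": [91.7, 62.0, 71.3],
    "racial_bias_in_medicine": [12.0, 0.0, 25.4],
}

for task, f1s in runs.items():
    # A high standard deviation across runs signals fragile, run-dependent behavior.
    print(f"{task}: mean F1 = {mean(f1s):.1f}, std = {stdev(f1s):.1f}")
```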
🍞 Top Bread (Hook): Scoreboards make numbers meaningful when you add context.
🥬 Filling (The Competition & Scoreboard)
- What it is: Four agent setups: OpenHands (o4-mini), OpenHands (gpt-5), Codex (gpt-5-medium), and Claude Code (Claude-4-Sonnet).
- How it works:
- Average F1 (higher is better): Claude Code ≈ 46.7; Codex ≈ 41.9; OpenHands (gpt-5) ≈ 37.9; OpenHands (o4-mini) ≈ 31.9.
- Variance is high across many tasks and runs.
- Frontier backbones help but do not solve core weaknesses.
- Why it matters: Even top agents are below 50 F1 on average, like getting a high C when an A is needed for reliable science.
🍞 Bottom Bread (Anchor): Codex sometimes achieves strong task scores with lower cost due to efficient trajectories, while Claude Code leads in average F1 but at higher expense.
🍞 Top Bread (Hook): Some puzzles are a single straight line; others require careful control experiments.
🥬 Filling (Task Structure Matters)
- What it is: Agents excel at procedurally direct tasks but struggle on control-based or causal tasks.
- How it works:
- High scores on direct pipelines: "Lost in the Middle" (best ≈ 91.7), "Persona with Catch" (≈ 88.6), "CoT Without Prompting" (≈ 82.6), "Hallucination Snowballing" (≈ 80.9).
- Low scores on control-heavy design: e.g., "LLM Racial Bias in Medicine" requires building proper counterfactuals; agents often skipped key controls.
- Why it matters: Planning and causal thinking are the main bottlenecks.
🍞 Bottom Bread (Anchor): Many bias tests failed because agents injected race labels without first establishing a race-neutral baseline.
🍞 Top Bread (Hook): False alarms vs. misses tell you where the detector goes wrong.
🥬 Filling (Error Patterns)
- What it is: Failures concentrate in Research Planning and Conclusion Formation.
- How it works:
- False positives are mostly Contradictory or Unrelated claims; few are valid "Alternative" insights.
- Frequent errors: Method Deviation (wrong design), Overgeneralization, and Analysis Failures (missing trends).
- Why it matters: Agents need better study design and careful, evidence-grounded summaries.
🍞 Bottom Bread (Anchor): Saying "late is best" when the paper shows only a slight late recovery is a Contradictory false positive.
🍞 Top Bread (Hook): Wallet check: performance often costs tokens.
🥬 Filling (Cost-Performance)
- What it is: Stronger models tend to cost more but can score higher; efficient agents can do more with less.
- How it works:
- Claude Code had the highest average F1 and highest total estimated cost.
- Codex reached a solid F1 with notably lower estimated spend via shorter action sequences (a cost-per-F1 sketch follows below).
- Why it matters: Practical deployments must balance accuracy with budget.
🍞 Bottom Bread (Anchor): Tasks with long reasoning chains (e.g., self-correction studies) consumed more tokens across agents.
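One simple way to compare agents on this trade-off is cost per F1 point. In the sketch below, the F1 averages echo the scoreboard above, but the dollar costs are placeholder values, not figures reported by the benchmark.

```python
# Illustrative cost-performance comparison. The dollar costs are invented placeholders.
agents = {
    "Claude Code": {"f1": 46.7, "cost_usd": 100.0},
    "Codex":       {"f1": 41.9, "cost_usd": 60.0},
}

for name, stats in agents.items():
    # Lower cost per F1 point means a better performance-cost balance.
    print(f"{name}: {stats['cost_usd'] / stats['f1']:.2f} USD per F1 point")
```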
🍞 Top Bread (Hook): Did they just memorize the answers?
🥬 Filling (Data Contamination Check)
- What it is: Compare tasks before vs. after model knowledge cutoffs, controlled for difficulty.
- How it works:
- Stratify tasks into Easy/Medium/Hard with a rubric.
- Check F1 for pre- vs. post-cutoff tasks within each band (a comparison sketch follows below).
- Observe no consistent pre-cutoff advantage.
- Why it matters: Reduces (but doesn't eliminate) the worry that agents just memorized papers.
🍞 Bottom Bread (Anchor): For Hard tasks, some agents even did better post-cutoff, suggesting difficulty, not memorization, drives performance gaps.
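A sketch of that stratified comparison, assuming each task record carries a difficulty band, a flag for whether its source paper predates the model's knowledge cutoff, and an F1 score. All numbers here are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Invented task records: (difficulty band, published before the model's cutoff?, F1).
tasks = [
    ("easy", True, 72.0), ("easy", False, 70.5),
    ("hard", True, 28.0), ("hard", False, 33.0),
]

by_band = defaultdict(lambda: {"pre": [], "post": []})
for band, pre_cutoff, f1 in tasks:
    by_band[band]["pre" if pre_cutoff else "post"].append(f1)

for band, groups in by_band.items():
    # A consistent pre-cutoff advantage within a band would hint at memorization;
    # comparable (or post-favoring) means suggest difficulty drives the gaps instead.
    print(band, "pre:", mean(groups["pre"]), "post:", mean(groups["post"]))
```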
05 Discussion & Limitations
🍞 Top Bread (Hook): Even great athletes have weak spots; naming them helps training.
🥬 Filling (Honest Assessment)
- Limitations (what this can't do):
- Rediscovery may punish genuinely new but valid findings that differ from the source paper.
- Claim extraction and matching use an LLM judge; though human checks show high agreement, it's not perfect.
- The 30-task set is ML-focused; results may not generalize to biology or physics without adaptation.
- Proprietary agent details are opaque; differences might stem from hidden tools or settings.
- Some tasks still carry stochastic variance; three runs help but don't fully stabilize outcomes.
- Required Resources: GPU/CPU sandboxing, API access to LLMs, public datasets, and modest compute (most runs under 24 hours on an 80GB A100).
- When NOT to Use:
- If your goal is to reward novel discoveries over agreement with past papers.
- If your domain demands wet-lab or long training cycles outside compute-light bounds.
- If you need human-only judgment of narrative quality over claim-level correctness.
- Open Questions:
- How to reduce run-to-run variance and make planning more reliable?
- Can we design automatic checks for missing controls and causal pitfalls during planning time?
- How to blend human and LLM judging for even stronger validity at scale?
- Can rediscovery tasks extend beyond ML into multi-modal or lab-in-the-loop sciences?
- How to fairly detect and discount training-data contamination with limited visibility?
🍞 Bottom Bread (Anchor): Think of FIRE-Bench like a coach that not only posts your score but circles, "Fix your warm-up (planning) and your cool-down (conclusion)."
06 Conclusion & Future Work
🍞 Top Bread (Hook): Imagine testing a scientist robot by asking it to re-prove a known fact without seeing the original steps: can it plan, test, and conclude correctly?
🥬 Filling (Takeaway)
- 3-Sentence Summary: FIRE-Bench evaluates AI agents on full-cycle scientific rediscovery: given only a high-level question from recent ML papers, agents must plan, implement, run, and conclude with evidence. Scoring at the claim level against ground-truth findings shows that today's agents average below 50 F1 with high variance, struggling most with study design and evidence-to-claim reasoning. This benchmark delivers both rigorous measurement and a diagnostic lens to guide future agent improvements.
- Main Achievement: Turning complex, end-to-end scientific reasoning into objective, verifiable, claim-level evaluation, without relying solely on subjective paper judging.
- Future Directions: Stronger planning with explicit control design, execution-grounded inference checks, process-level audits, broader scientific domains, and better contamination defenses.
- Why Remember This: FIRE-Bench marks a shift from "nice-looking papers" or "one-number gains" to "did the agent truly rediscover the science?", a standard that will shape how we trust AI in real research.
🍞 Bottom Bread (Anchor): It's the difference between a student who copies steps and one who can re-derive the result on their own; FIRE-Bench measures the latter, claim by claim.
Practical Applications
- Evaluate your in-house research agent on end-to-end tasks before deploying it in critical workflows.
- Use the error taxonomy to debug agent failures in planning vs. analysis and prioritize fixes.
- Create training curricula that teach agents to design proper controls and avoid overgeneralized conclusions.
- Benchmark different agent stacks and backbones to choose the best performance-cost mix for your team.
- Adopt claim-level scoring to grade internal research summaries and reports more objectively.
- Run ablation studies on tool use (e.g., retrieval, plotting, statistics) to see what boosts rediscovery success.
- Gate access to high-stakes domains (e.g., healthcare) by requiring a minimum F1 and low variance on relevant tasks (see the sketch at the end of this list).
- Monitor for data contamination signals by stratifying tasks by difficulty and knowledge cutoffs.
- Automate regression testing of agent updates to ensure improvements don't break planning or conclusions.
- Use constrained rediscovery tasks in education to teach students the scientific method, controls, and fair comparisons.
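For the gating idea above, here is a minimal sketch of a release check that requires both a minimum mean F1 and a maximum run-to-run spread on the relevant tasks. The thresholds and scores are placeholders to be tuned for your own deployment.

```python
from statistics import mean, stdev

def passes_gate(f1_runs: list[float], min_mean: float = 60.0, max_std: float = 10.0) -> bool:
    """Require both adequate average performance and low run-to-run variance."""
    return mean(f1_runs) >= min_mean and stdev(f1_runs) <= max_std

# Placeholder scores from three benchmark runs of a candidate agent on one task suite.
print(passes_gate([62.0, 65.5, 61.0]))   # True: strong and stable
print(passes_gate([70.0, 30.0, 55.0]))   # False: too much variance between runs
```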