
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Intermediate
Wanghan Xu, Yuhao Zhou, Yifan Zhou et al. · 12/18/2025
arXiv · PDF

Key Summary

  • The paper defines Scientific General Intelligence (SGI) as an AI that can do science like a human scientist across the full loop: study, imagine, test, and understand.
  • It uses a trusted learning recipe called the Practical Inquiry Model (Deliberation, Conception, Action, Perception) to structure what SGI should do.
  • The authors build SGI-Bench, a large, scientist-aligned test with four task families: deep research, idea generation, dry/wet experiments, and experimental reasoning.
  • They add an agent-based evaluator with tools (search, Python, PDF parsing) and multi-dimensional metrics to grade models fairly and precisely.
  • Across more than 1,000 expert-curated cases, current leading models score low on integrated scientific workflows, even if they do okay on parts.
  • Models retrieve facts but fail to get exact numeric answers in deep research (often 10–20% accuracy) and struggle to turn creative ideas into feasible plans.
  • In code-based dry experiments, programs often run but produce wrong scientific results; in wet lab planning, step orders and parameters are frequently off.
  • Multimodal experimental reasoning is better but still unreliable, especially for comparisons across images or domains like materials and earth systems.
  • The paper also tries Test-Time Reinforcement Learning (TTRL), which nudges models toward more novel hypotheses during inference without needing labeled answers.
  • Overall, the work offers a clear, measurable path toward AI that can truly help discover new science, not just answer trivia.

Why This Research Matters

Science moves society forward, from medicines and clean energy to climate resilience. A benchmark that truly reflects how scientists work helps us build AI that does more than parrot facts—it helps discover new ones. By checking feasibility, procedure order, parameters, and numeric exactness, SGI-Bench pushes models toward real-world usefulness, not just pretty prose. This can shorten research cycles, reduce costly lab errors, and surface creative yet testable ideas. Over time, such AI teammates could help small labs do big science, make data analysis more reliable, and speed safer innovation across fields.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how a science fair project starts with reading about a topic, then dreaming up a clever idea, trying an experiment, and finally explaining what the results mean? Real scientists do that too—again and again, in a loop.

🥬 Filling (The Actual Concept): What it is: This paper is about Scientific General Intelligence (SGI)—the idea that an AI could do the whole science fair loop like a human scientist, not just one piece of it. How it works (step by step):

  1. Study what’s known (deliberation).
  2. Imagine new ideas (conception).
  3. Run experiments (action).
  4. Make sense of results (perception). Why it matters: If AI can only do one step (like memorizing facts) but not plan experiments or interpret data, it won’t help discover new medicine, materials, or climate solutions.

🍞 Bottom Bread (Anchor): Imagine asking an AI, “How do we make cheaper, safer batteries?” It should read papers, propose a new electrolyte, plan lab steps, simulate performance, and explain results—not just recite a Wikipedia paragraph.

The World Before: Before this paper, AI models were great at answering quiz-like questions and writing code or essays. Benchmarks like MMLU and SuperGPQA tested fact knowledge, while others like GAIA measured how well a model used tools. But real science isn’t a stack of trivia or a single to-do list—it’s a cycle that blends reading, creating, testing, and understanding. That full cycle was not measured.

🍞 Top Bread (Hook): Imagine grading a baking contest by only tasting the frosting. You’d miss whether the cake stands up straight or is baked through.

🥬 Filling: What it is: The problem was fragmented evaluation—each benchmark tested one slice of science, not the whole cake. How it works: Prior tests mainly checked

  1. recall and short reasoning (deliberation-only),
  2. tool use steps (action-only), or
  3. puzzle-style logic (isolated reasoning). Why it matters: Without measuring the full workflow, models could look smart but fail at real lab or simulation tasks.

🍞 Bottom Bread: A model might ace a biology quiz (frosting) but still design a lab protocol with steps in the wrong order (cake collapse).

Failed Attempts: People tried harder questions and more tools, but two issues stuck. First, evaluations often ignored feasibility—could an idea actually be done? Second, multi-step scientific tasks need both correctness and process checks (e.g., were the steps sound?), which simple right/wrong scoring can’t capture.

🍞 Top Bread (Hook): Picture assembling furniture. If you only check the final look, you might miss that crucial screws were skipped.

🥬 Filling: What it is: The missing piece was a scientist-aligned, step-aware, cross-domain framework. How it works: You need tests that mirror real research stages, plus metrics for steps, sequences, parameters, and final answers. Why it matters: Without this, we can’t tell whether AI genuinely “does science” or just imitates style.

🍞 Bottom Bread: Two desks can both look like a desk, but one wobbles because the builder ignored the order of steps. Science works the same way.

The Gap: There wasn’t a shared, practical definition of SGI or a benchmark covering the full inquiry loop with metrics that scientists actually care about. That made progress fuzzy, comparisons tricky, and hype easy.

Real Stakes: This isn’t academic nitpicking. Strong SGI could speed drug discovery, improve climate forecasts, and optimize clean energy. Weak SGI could suggest untestable ideas, run wrong simulations, or misread evidence—wasting time or money.

🍞 Top Bread (Hook): Imagine a GPS that knows city names but can’t plan a full route with traffic.

🥬 Filling: What it is: The paper offers a route map—define SGI by the Practical Inquiry Model and test it with scientist-aligned tasks. How it works: Build a benchmark that spans literature research, idea design, dry/wet experiments, and data interpretation, each with multi-part scoring. Why it matters: Now we can measure whether an AI can drive the whole scientific trip, not just name street signs.

🍞 Bottom Bread (Anchor): With this, if you ask “How do we confirm a new superconductor?”, the AI is graded on reading the literature, proposing a test, running or coding a simulation, and explaining the measurements—not merely recalling a famous paper.

02 Core Idea

🍞 Top Bread (Hook): Imagine a four-part science compass: Read, Dream, Do, and See. If an AI can steer with all four, it can truly explore.

🥬 Filling (The Actual Concept): What it is (one sentence): The key idea is to define SGI using the Practical Inquiry Model (Deliberation, Conception, Action, Perception) and build SGI-Bench, a scientist-aligned workflow test that measures each part with tailored metrics and tools. How it works (like a recipe):

  1. Map science to four stages (the PIM compass).
  2. Create four matching tasks: deep research, idea generation, dry/wet experiments, experimental reasoning.
  3. Score both process (steps, sequences) and outcomes (final answers, feasibility).
  4. Use an agent-judge with tools (search, Python, PDF) for fair, reproducible grading.
  5. Explore test-time learning (TTRL) to encourage novelty during inference. Why it matters: It replaces scattered, trivia-like tests with a principled, scientist-approved way to see if models can actually do science.

🍞 Bottom Bread (Anchor): It’s like switching from a spelling bee to a full writing workshop where students research, outline, draft, edit, and present—with rubrics for each step.

Multiple Analogies:

  • Sports team: You don’t measure a soccer player by juggling alone. You test passing (deliberation), playmaking (conception), shooting (action), and reading the field (perception).
  • Cooking show: Chefs must understand ingredients (deliberation), invent recipes (conception), cook dishes (action), and judge taste/texture (perception).
  • Space mission: Mission control studies trajectories (deliberation), proposes maneuvers (conception), fires thrusters (action), and reads telemetry (perception).

Before vs After:

  • Before: Benchmarks were islands—fact quizzes here, tool fiddling there—with no shared definition of scientific ability.
  • After: We have a map (PIM), a full-course exam (SGI-Bench), and clear scorecards that track both thinking and doing.
  • Result: We can see exactly where models stumble: missing numbers, weak feasibility, messy lab steps, or fuzzy comparisons across images.

Why It Works (intuition):

  • Science is a loop; measuring only one part hides true ability. PIM captures the loop, so coverage is complete.
  • Scientists care about both process and product; multi-dimensional metrics do the same (e.g., steps right? sequence right? parameters right? answer right?).
  • Tools reduce guesswork. Let evaluators search, compute, and parse to verify claims.
  • TTRL adds a gentle push toward originality at run time, mirroring how scientists iterate ideas.

Building Blocks (each introduced with the Sandwich pattern):

🍞 Top Bread: You know how you first read a guide before building LEGO? 🥬 Filling: Scientific Deep Research is the “read and integrate” step—models gather evidence across sources and compute exact, often numeric answers. How it works:

  1. Read background and constraints.
  2. Retrieve and stitch facts.
  3. Do unit-checked math.
  4. Give a precise, formatted answer. Why it matters: Without it, later ideas or experiments rest on shaky understanding. 🍞 Bottom Bread: Example: Combining multiple climate datasets to calculate a decade’s ocean heat change to two decimal places.

🍞 Top Bread: Imagine brainstorming cool science fair ideas after reading past winners. 🥬 Filling: Idea Generation turns knowledge gaps into structured, testable plans. Steps:

  1. Identify limits of prior work.
  2. Propose a core idea.
  3. Lay out steps, data, metrics, and expected outcomes. Why it matters: Vague ideas waste time; structured ones can be built. 🍞 Bottom Bread: Example: A new antenna phase-retrieval method with a step-by-step training plan, datasets, and error metrics.

🍞 Top Bread: Think of running a video game physics sim vs mixing real slime. 🥬 Filling: Dry/Wet Experiments split into code-based simulations (dry) and lab protocols (wet). Steps (dry): Fill missing code functions; run tests; check correctness/time. Steps (wet): Choose action order; set parameters; follow lab-safe sequences. Why it matters: This is where ideas meet reality—virtual or physical. 🍞 Bottom Bread: Example: Complete a climate model function, or order PCR steps with exact temperatures and volumes.

🍞 Top Bread: After you bake, you taste and compare cupcakes. 🥬 Filling: Experimental Reasoning interprets images/plots from observations, simulations, or setups; often big multi-choice with reasoning traces. Steps:

  1. Read visuals.
  2. Compare conditions.
  3. Infer causes. Why it matters: Discovery needs correct conclusions, not just data. 🍞 Bottom Bread: Example: Choose which catalyst lowers a reaction barrier by comparing two microscopy images and an energy plot.

Finally, 🍞 Top Bread: Imagine a fair judge who can look things up, run calculators, and explain scores. 🥬 Filling: The Agent-based Evaluation Framework is an “agent-as-a-judge” with tools for selection, customized metrics, inference, and reports. Why it matters: Fairness and reproducibility. 🍞 Bottom Bread: Example: The judge opens PDFs, runs Python unit tests, and writes a clear scorecard for each task.

03 Methodology

At a high level: Input (scientist-aligned questions) → Stage 1: Question Selection → Stage 2: Metric Customization → Stage 3: Prediction & Evaluation → Stage 4: Report Generation.
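
To make this pipeline easier to picture, here is a minimal Python sketch of how the four stages might be chained. The function names and data shapes are illustrative assumptions, not the paper's actual SGI-Bench implementation.

```python
# Illustrative sketch of the four-stage evaluation pipeline described above.
# Function names and data shapes are assumptions, not the paper's code.

def select_questions(bench, user_intent=None):
    """Stage 1: filter benchmark items by domain, task family, and difficulty."""
    if user_intent is None:
        return list(range(len(bench)))
    return [i for i, q in enumerate(bench)
            if q["domain"] in user_intent.get("domains", [q["domain"]])
            and q["task"] in user_intent.get("tasks", [q["task"]])]

def customize_metrics(question, user_intent=None):
    """Stage 2: start from the task's predefined metrics, add user priorities."""
    metrics = dict(question["default_metrics"])  # e.g. {"EM": 1.0, "SLA": 1.0}
    for name, weight in (user_intent or {}).get("extra_metrics", {}).items():
        metrics[name] = weight
    return metrics

def predict_and_evaluate(model, question, metrics, tools):
    """Stage 3: run the model (with allowed tools), then score each metric."""
    prediction = model(question["prompt"], tools=tools)
    return {name: tools["judge"](name, prediction, question["reference"])
            for name in metrics}

def generate_report(scores_per_question):
    """Stage 4: aggregate per-metric scores into an overall scorecard."""
    report = {}
    for scores in scores_per_question:
        for name, value in scores.items():
            report.setdefault(name, []).append(value)
    return {name: sum(vals) / len(vals) for name, vals in report.items()}
```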

Stage 1. Question Selection 🍞 Top Bread (Hook): Imagine a librarian who finds the right science books for your project. 🥬 Filling: What it is: A questioning agent matches user intent to a subset of SGI-Bench problems across domains and task types. How it works:

  1. Read the user’s goals or pick all tasks by default.
  2. Filter by domain, task family, and difficulty.
  3. Return indices of the chosen set. Why it matters: Right questions = meaningful evaluation; wrong questions = noisy scores. 🍞 Bottom Bread (Anchor): If you want battery science tasks, it won’t hand you astronomy plots.

Stage 2. Metric Customization 🍞 Top Bread: Think of a teacher who tweaks the rubric for a lab vs a worksheet. 🥬 Filling: What it is: A customization agent blends standard scientist-aligned metrics with user-specified priorities. How it works:

  1. Parse user intent (e.g., emphasize feasibility).
  2. Pull predefined metrics for each task.
  3. Optionally invent new metrics using tools (search/PDF) and filter them. Why it matters: Different tasks need different rulers. 🍞 Bottom Bread: For wet labs, sequence order and parameter accuracy matter more than flowery writing.

Stage 3. Prediction & Evaluation 🍞 Top Bread: Picture a fair judge who can check the math, read the paper, and run the code. 🥬 Filling: What it is: Tool-augmented inference plus an evaluation agent that computes scores and rationales. How it works:

  1. Run the target model/agent on each selected question (with allowed tools).
  2. Score outcomes using the metrics.
  3. Produce a rationale for each score, citing evidence. Why it matters: Transparency and repeatability make results trustworthy. 🍞 Bottom Bread: The judge might run unit tests on your code function, verify your final number’s units, and note any missing steps.

Stage 4. Report Generation 🍞 Top Bread: Like a season scoreboard summarizing every game. 🥬 Filling: What it is: A reporting agent compiles per-task and overall scores with narratives. How it works:

  1. Aggregate metric scores.
  2. Visualize patterns across tasks and domains.
  3. Explain strengths/weaknesses. Why it matters: Clear summaries guide research and improvements. 🍞 Bottom Bread: A lab leader can see, at a glance, that their model is strong at image perception but weak at numerical synthesis.

Now, the core task formulations (with Sandwich explanations):

Scientific Deep Research 🍞 Hook: You know how detectives cross-check clues from different witnesses and compute exact timelines? 🥬 Concept: What it is: Literature-centered, multi-step reasoning that often ends with a precise numeric/string answer. How it works:

  1. Use background and constraints.
  2. Retrieve relevant snippets.
  3. Verify units and perform calculations.
  4. Output steps and a final exact answer. Why it matters: If this step is wrong, the entire project tilts. 🍞 Anchor: Compute an RC time constant and identify a frequency threshold by integrating facts across a paper.

Idea Generation 🍞 Hook: Brainstorming is cool, but a buildable plan is cooler. 🥬 Concept: What it is: Structured methodology design—core idea, steps, order, data, metrics, expected outcomes. How it works:

  1. Spot gaps in related work.
  2. Propose a core idea.
  3. Detail steps (with order), data, and metrics.
  4. State expected results. Why it matters: Vague ideas stall; structured ones move forward. 🍞 Anchor: A differentiable spherical near-field pipeline with exact loss metrics and dataset specs.

Dry Experiment (code completion) 🍞 Hook: Finishing a puzzle by filling the missing piece. 🥬 Concept: What it is: Complete masked scientific functions so the code runs correctly and efficiently. How it works:

  1. Read background and data code.
  2. Fill missing functions.
  3. Pass all unit tests; minimize runtime errors. Why it matters: Running code ≠ correct science; both must align. 🍞 Anchor: Implement a stable numerical integrator that reproduces reference outputs.

Wet Experiment (protocol planning) 🍞 Hook: Following a recipe step-by-step with the right oven temperature. 🥬 Concept: What it is: Choose and order atomic lab actions with correct parameters from a defined action pool. How it works:

  1. Understand the procedure.
  2. Select correct action order.
  3. Set parameters (e.g., temperature, volumes). Why it matters: Wrong order or parameters can ruin the experiment. 🍞 Anchor: PCR setup with precise temperatures and cycle counts in the right sequence.

Experimental Reasoning (multimodal) 🍞 Hook: Comparing two X-ray images to decide which bone is fractured. 🥬 Concept: What it is: Read images/plots of different kinds, compare conditions, and pick the right answer with valid reasoning. How it works:

  1. Perceive signals.
  2. Understand attributes.
  3. Compare across images.
  4. Infer causes. Why it matters: Seeing isn’t enough; you must conclude correctly. 🍞 Anchor: Decide which catalyst lowers reaction barriers by analyzing a plot and microscopy images.

Key Metrics (each with Sandwich):

  • Exact Match (EM) 🍞 Hook: Lock-and-key scoring: either it fits perfectly or not. 🥬 Concept: What it is: Final answer must match exactly. How it works: Compare model answer to gold; 1 for exact, 0 otherwise. Why it matters: Demands numerical and unit precision. 🍞 Anchor: “2.23, 3.2, 10” with the right decimals. (A minimal scoring sketch for a few of these metrics follows this list.)

  • Step-Level Accuracy (SLA) 🍞 Hook: Checking each step of your math, not just the final number. 🥬 Concept: What it is: Judge correctness of each reasoning step. How it works: LLM-judge compares steps to reference; compute proportion correct. Why it matters: Finds where the chain breaks. 🍞 Anchor: Steps 1–5 align, step 6 has a unit slip.

  • PassAll@k (dry) 🍞 Hook: Not just one lucky try—consistent success. 🥬 Concept: What it is: Proportion of problems where k or more unit tests pass. How it works: Run tests; require passing multiple, not just one. Why it matters: Rewards robust, not brittle, code. 🍞 Anchor: Your function passes all 5 tests for 36% of problems.

  • Sequence Similarity (wet) 🍞 Hook: The right dance moves in the right order. 🥬 Concept: What it is: How close your action order is to the reference. How it works: Count inversions; score 1.0 means identical order. Why it matters: Wrong order breaks protocols. 🍞 Anchor: Swapping “add reagent” before “cool” lowers your score.

  • Parameter Accuracy (wet) 🍞 Hook: Baking at 180°C vs 280°C matters. 🥬 Concept: What it is: Fraction of parameters (volumes, temps) correct. How it works: Compare provided parameters to gold. Why it matters: Right steps with wrong numbers still fail. 🍞 Anchor: Using 50 µL instead of 500 µL loses points.

  • Multi-choice Accuracy (MCA, multimodal) 🍞 Hook: Picking the single correct answer out of a crowd. 🥬 Concept: What it is: 1 if correct option chosen among 10+; else 0. How it works: Average over all questions. Why it matters: Tough discrimination under many choices. 🍞 Anchor: Choose the correct materials-phase diagram.

  • Reasoning Validity (RV) 🍞 Hook: Show your work. 🥬 Concept: What it is: A 0–10 judge score for the logic behind the choice. How it works: LLM-judge checks alignment with reference reasoning. Why it matters: Ensures answers are justified, not guessed. 🍞 Anchor: A clear causal chain about why sample B outperforms A.

  • Feasibility (idea gen) 🍞 Hook: Can you actually build it? 🥬 Concept: What it is: Similarity of your implementation graph to an expert template. How it works: Extract your steps; compare to expert workflow. Why it matters: Novelty without buildability stalls progress. 🍞 Anchor: Missing data prep or evaluation loop reduces feasibility.
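
Here is a toy Python sketch of three of these rulers: Exact Match, Sequence Similarity, and Parameter Accuracy. The exact normalizations and matching rules are assumptions for illustration, not the paper's scoring code.

```python
# Toy scoring sketch for three of the metrics above. The normalizations
# (e.g., how inversion counts map to [0, 1]) are assumptions.

def exact_match(pred: str, gold: str) -> int:
    """EM: 1 only if the final answer string matches the reference exactly."""
    return int(pred.strip() == gold.strip())

def sequence_similarity(pred_steps, gold_steps) -> float:
    """Wet-lab order score: 1.0 for identical order, lower as inversions grow."""
    rank = {step: i for i, step in enumerate(gold_steps)}
    order = [rank[s] for s in pred_steps if s in rank]
    inversions = sum(1 for i in range(len(order))
                     for j in range(i + 1, len(order)) if order[i] > order[j])
    max_inv = len(order) * (len(order) - 1) / 2 or 1
    return 1.0 - inversions / max_inv

def parameter_accuracy(pred_params: dict, gold_params: dict, tol=0.0) -> float:
    """Fraction of protocol parameters (temps, volumes, cycles) that match."""
    hits = sum(1 for k, v in gold_params.items()
               if k in pred_params and abs(pred_params[k] - v) <= tol)
    return hits / len(gold_params)

# Example: a PCR-style plan with one swapped step and one wrong temperature.
gold_order = ["denature", "anneal", "extend"]
pred_order = ["denature", "extend", "anneal"]
print(sequence_similarity(pred_order, gold_order))  # ≈ 0.67 (one inversion)
print(parameter_accuracy({"anneal_C": 60, "extend_C": 68},
                         {"anneal_C": 60, "extend_C": 72}))  # 0.5 (one of two correct)
```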

Test-Time Reinforcement Learning (TTRL) 🍞 Hook: Getting hints while solving a puzzle helps you try fresher ideas. 🥬 Concept: What it is: At inference time, nudge the model toward novel, well-retrieved hypotheses using a reward. How it works:

  1. Retrieve related knowledge.
  2. Propose variations.
  3. Reward higher novelty (and alignment) without needing labels. Why it matters: Encourages creative, grounded hypotheses on the fly. 🍞 Anchor: The model rephrases and refines a catalyst mechanism to be measurably more original yet plausible.

Secret Sauce: Scientist alignment plus multi-dimensional, tool-checked scoring. It evaluates what scientists actually do (and care about), across the full loop, with fine-grained rubrics that catch both wobbly steps and wobbly answers.

04 Experiments & Results

The Test: The team evaluated many top models (open and closed) and some agents across all four task families—deep research, idea generation, dry/wet experiments, and multimodal experimental reasoning—using strict, scientist-aligned metrics. The dataset spans 1,000+ expert-curated samples inspired by Science’s 125 Big Questions, across 10 disciplines. Temperatures were set to 0 for consistency, and standardized prompts were used.

The Competition: Prior famous benchmarks (MMLU, SuperGPQA, GAIA, HLE) each cover only part of the scientific loop. SGI-Bench directly compares models on the integrated tasks that mirror how science is actually done.

Scoreboard (with context):

  • Deep Research (Exact Match): Best models are often below 20% exact answers; many sit around 8–16%. That’s like getting 1–2 questions right out of 10 when exact numbers matter. Step-Level Accuracy is much higher (some ~65%), meaning models do parts of the reasoning right but drop the ball before the finish—like showing good work but writing the wrong final number.
  • Idea Generation: Highest averages reach the mid-50s, with especially strong novelty and detailedness from leading closed-source models. That’s like crafting creative blueprints with nice diagrams, but many still forget the screws: feasibility scores lag in the teens to low 20s.
  • Dry Experiments (PassAll@5): The top system reaches about 36.6%. Many code snippets execute, but outputs are often wrong or numerically unstable. Think of programs that run smoothly but calculate the wrong orbit for a satellite.
  • Wet Experiments: Sequence similarity is low and parameter accuracy only moderate. Models often omit, reorder, or mis-parameterize steps—like mixing chemicals before cooling, or using the wrong volumes—undermining the procedure.
  • Experimental Reasoning (MCA): Best accuracies hover around low 40s with 10+ options—better than guessing but not reliable. Comparative reasoning (weighing multiple images/conditions) remains a pain point, especially in materials and earth science tasks.
  • Overall SGI-Score: Even the best aggregate scores are around the low-to-mid 30s out of 100, and closed-source models hold only a small edge over strong open-source ones. Bigger doesn’t automatically mean better science chops.

Surprising Findings:

  • Newer isn’t always better: Some updated models underperform predecessors on deep research, hinting at regressions or loss of niche knowledge.
  • Agents don’t guarantee wins: Tool-augmented agents sometimes do better on step accuracy but not always on exact answers; several underperform top LLMs.
  • Partial alignment vs final truth: The large gap between step-level alignment and exact final answers shows brittle chains—one mistaken unit or misread sentence can sink an otherwise good reasoning path.

Concrete Examples:

  • In a Chua’s circuit case, a system retrieves the right components and computes an RC time constant with unit conversions, but may round incorrectly or misreport frequency thresholds, losing EM despite decent steps.
  • In idea generation for antenna phase retrieval, models produce innovative, structured pipelines but miss concrete data acquisition or hyperparameter details, lowering feasibility despite strong novelty.
  • In dry climate modeling code, a model completes a masked numerical solver that runs but yields slightly off error metrics across all tests, failing PassAll@5.
  • In wet lab PCR planning, a model confuses annealing and extension temperatures or cycle counts, reducing both sequence similarity and parameter accuracy.
  • In multimodal reasoning, a model identifies a trend in a plot but miscompares two microscopy images, picking the wrong catalyst as best performer.

Big Picture: Today’s models can imitate parts of the scientist’s workflow—retrieve, write, and even code—but they struggle to integrate those parts into a sturdy, end-to-end chain that survives strict, measurable checks. The test-time RL pilot (TTRL) nudges novelty during inference without needing labeled answers, showing small but meaningful gains in hypothesis originality—an encouraging sign that adaptive, reflective loops at inference time could help bridge gaps.

Takeaway: On SGI-Bench, current AI is like a strong intern: helpful, fast, creative, and decent at parts—but not yet a reliable lead scientist who can own the whole study from start to finish.

05 Discussion & Limitations

Limitations:

  • Fragmented mastery: Models do pieces well (retrieval, surface novelty, code fluency) but struggle to keep chains intact—especially where unit rigor, multi-source synthesis, and exact numerics converge.
  • Procedure brittleness: Wet lab planning often breaks on step order or parameter choices; dry experiments run but compute the wrong truths.
  • Comparative reasoning: Weighing multiple images/conditions across domains remains weak, limiting robust scientific discrimination.
  • Metric dependence on judges/tools: While the agent-judge improves rigor, any LLM-based judging can inherit biases; cross-checking with human experts is still valuable, especially in edge cases.

Required Resources:

  • Expert-curated data and templates (implementation graphs, reference protocols) across many domains.
  • Tooling infrastructure (search, PDF parsing, Python sandboxes, unit-test harnesses).
  • Compute to run multi-model, multi-metric evaluations and to support test-time procedures like TTRL.

When NOT to Use:

  • If you only need fact recall or short Q&A, SGI-Bench is overkill.
  • For domains with high bio/chemical hazard where protocol errors could cause harm, don’t rely on current model outputs without expert review.
  • If you cannot provide tool access or have no tolerance for longer, multi-step runs, the framework’s strengths won’t shine.

Open Questions:

  • How to fuse retrieval, symbolic math, and uncertainty modeling so numeric chains don’t snap near the end?
  • What training signals best reward feasibility and constraint-following, not just stylistic novelty—especially for lab protocols and engineering pipelines?
  • Can multimodal encoders be taught stronger comparative reasoning across heterogeneous images (plots + microscopy + simulations) with grounded explanations?
  • How to robustly combine test-time learning (like TTRL), planning, and tool use into a self-correcting loop that improves mid-run without human labels?
  • How to standardize safety guardrails so wet-lab suggestions remain safe, conservative, and compliant while still being innovative?

Honest Assessment: SGI-Bench shows that today’s frontier models are promising collaborators but not principal investigators. The path forward likely needs tighter planning constraints, numerics-aware reasoning, richer multimodal grounding, and training/evaluation signals that reward feasibility and correctness as much as eloquence. Early TTRL results suggest that small, well-aimed nudges during inference can increase novelty; the next step is pairing those nudges with checks that keep ideas buildable and safe.

06 Conclusion & Future Work

3-Sentence Summary: This paper defines Scientific General Intelligence (SGI) using a trusted four-part science loop—Deliberation, Conception, Action, Perception—and operationalizes it with SGI-Bench, a scientist-aligned, tool-augmented benchmark spanning deep research, idea generation, dry/wet experiments, and experimental reasoning. Across 1,000+ expert-curated tasks, leading models perform unevenly and score low on integrated workflows: they can retrieve, draft, and code, but struggle with exact numerics, feasible plans, correct lab sequences, and reliable multimodal comparisons. A pilot of test-time reinforcement learning (TTRL) boosts hypothesis novelty without labels, hinting that adaptive, reflective inference could help models become more genuinely scientific.

Main Achievement: A principled, measurable, and comprehensive framework—rooted in the Practical Inquiry Model—that lets the community evaluate (and improve) AI’s ability to actually do science, not just talk about it.

Future Directions:

  • Numerics-first reasoning: combine verified retrieval, unit-aware math, and uncertainty handling to reduce last-mile errors.
  • Feasibility training: incorporate procedural constraints, resource assumptions, and simulator checks into learning signals.
  • Multimodal comparison: strengthen cross-image/domain reasoning with grounded, checkable explanations.
  • Agentic loops: integrate planning tools, simulators, and TTRL-like feedback for mid-run self-correction.
  • Safety and standards: build domain-specific guardrails, especially for wet labs, so innovation stays responsible.

Why Remember This: It turns the fuzzy dream of “AI that does science” into a testable target with shared rules, datasets, and scoreboards—so progress becomes visible, comparable, and directed toward real discovery.

Practical Applications

  • Evaluate new AI models for lab use by testing their full research workflow capability, not just Q&A.
  • Prioritize model improvements (e.g., numerics or feasibility) based on detailed per-metric weakness reports.
  • Use idea-generation tasks to brainstorm novel, structured research plans with explicit data and metrics.
  • Pre-screen wet-lab protocols for step order and parameter quality before human review.
  • Validate and harden code for simulations via dry-experiment unit-test performance.
  • Train models with additional signals that reward feasibility and parameter specificity, not just novelty.
  • Adopt TTRL-like inference strategies to increase hypothesis originality during literature reviews.
  • Build internal R&D dashboards that track model performance across domains (materials, bio, climate).
  • Support grant or proposal reviews by checking whether plans align with expert implementation graphs.
  • Create educational modules that teach students the full scientific loop with step-aware rubrics.
#Scientific General Intelligence · #Practical Inquiry Model · #Scientist-aligned benchmark · #Agent-as-a-judge · #Deep research · #Idea generation · #Dry experiment · #Wet experiment · #Experimental reasoning · #Exact Match · #Step-Level Accuracy · #PassAll@k · #Multimodal evaluation · #Test-Time Reinforcement Learning · #Retrieval-augmented novelty
Version: 1