SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
Key Summary
- SciEvalKit is a new open-source toolkit that tests AI on real scientific skills, not just trivia or simple Q&A.
- It measures seven core abilities scientists actually use, including seeing scientific images, reasoning with symbols, writing code, and forming hypotheses.
- The toolkit spans six big fields (physics, chemistry, life science, earth science, astronomy, and materials science) using expert-verified datasets.
- It runs models through a unified pipeline that builds prompts, generates answers, executes code when needed, and scores fairly and reproducibly.
- Results show most models are good at science facts but much weaker at writing working code and doing exact symbolic math.
- Gemini-3 Pro is the most balanced overall, while Qwen3-Max shines in code generation and some open models approach closed-source leaders.
- Strong image perception does not guarantee understanding or reasoning on scientific figures; deeper multimodal skills are still hard.
- SciEvalKit reports capability scores, not just one big average, so you can see a model's strengths and weaknesses clearly.
- The toolkit encourages community contributions and will add more benchmarks, agent tool-use tracks, and new data types over time.
Why This Research Matters
If we want AI to help in real science, like understanding diseases, predicting weather, or designing safer materials, we must test the right skills, not just memory. SciEvalKit checks whether AI can read scientific figures, do exact math, and write code that actually runs, which are the heart of modern research. This makes results more trustworthy for doctors, scientists, and engineers who rely on precise, explainable steps. It also shows developers exactly where to improve models, speeding up progress. Because the toolkit is open and reproducible, the community can compare fairly and build better benchmarks together.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine grading a science fair. If you only check whether the poster looks neat, you miss whether the experiment makes sense, the math is right, and the graphs are read correctly. We need to test all the important parts, not just one.
The Concept (Scientific Intelligence Evaluation, before SciEvalKit): It's the way we measure if an AI can do science, not just chit-chat. How it worked before:
- Many tests asked short fact questions or simple multiple-choice.
- Vision tests focused on basic captions, not scientific figures like spectra, protein structures, or climate maps.
- Code tests often checked syntax, not whether code actually runs and gives the right scientific result. Why it matters: If we don't test the right skills (like symbol math, executable code, reading lab-style figures), we can't trust AI for real science tasks.
Anchor: It's like giving a soccer player a spelling test to decide if they can play on the team. They might ace spelling but still not know how to pass or shoot.
Hook: You know how a real experiment means planning, measuring, calculating, and explaining? Science isn't just one step; it's a whole workflow.
The Problem: Existing AI evaluations checked single skills in isolation but didn't cover the full science workflow. How that shows up:
- Surface-level correctness: right answer without checking the reasoning process or units.
- Narrow tasks: one dataset, one domain, little symbol work or code execution.
- One big score: leaderboards that blur where a model is strong or weak. Why it matters: A model might memorize facts yet fail at writing a working data analysis script or at aligning a plot with the right explanation.
Anchor: It's like judging a cooking contest only by how the dish looks, not how it tastes, smells, or if the recipe even works.
Hook: Think of a pilot's preflight checklist: everything must be checked (fuel, engines, instruments). Science AIs need similar multi-part checks.
The Gap: We needed a unified, expert-aligned, multimodal, and capability-oriented toolkit. How to fill it:
- Use real scientific datasets from multiple fields.
- Score by capability (like code, symbols, perception), not just a single average.
- Include code execution and LLM-as-a-judge for nuanced answers. Why it matters: With a faithful test, we can compare models fairly, spot weaknesses, and improve them.
Anchor: It's like switching from a mystery "overall grade" to a clear report card with math, reading, science labs, and PE, so you know exactly what to practice.
Hook: Imagine doctors, climate scientists, and engineers asking AI for help. If the AI can't handle equations, code, or special diagrams, bad advice could slip through.
Real Stakes: Better evaluation means safer tools and faster discoveries. How it helps:
- Medicine: reading scans and linking them to symptoms accurately.
- Climate: interpreting satellite maps and running analysis code correctly.
- Materials: using equations to predict properties, not just guessing. Why it matters: Without trustworthy tests, we might deploy models that look smart but fail at critical steps.
Anchor: Like testing a bridge model for both weight and wind. If you skip the wind test, the bridge could sway and fail, even if it looked fine on paper.
02 Core Idea
Hook: You know how a great science student isn't just good at memorizing facts? They can also read graphs, do equations, code simulations, and propose new ideas.
The Aha Moment (in one sentence): Evaluate science AIs by the exact skills scientists use, across text, images, symbols, and code, so we see true scientific intelligence, not just trivia recall. How it works:
- Define seven core science abilities (perception, reasoning, understanding, symbolic math, code, hypotheses, knowledge).
- Gather expert-grade benchmarks across six disciplines.
- Run all models through one unified pipeline that builds prompts, executes code, and scores fairly. Why it matters: Without matching how science actually works, scores can be misleading, hiding serious weaknesses.
Anchor: It's like testing a musician by scales, sight-reading, ear training, and performance, not just asking them to name notes.
Hook: Picture three ways to explain it: sports, school, and airplane checks.
Multiple Analogies:
- Sports Tryout: We don't only time a sprint; we test passing, strategy, and endurance. SciEvalKit tests all the "positions" of science.
- School Report Card: Not one grade, but math, reading, science lab, and writing. Models get capability scores for code, symbols, images, and more.
- Airplane Preflight: Many systems need to work together. If code (the engine) fails or symbolic math (the instruments) is off, the whole flight is risky. Why it matters: Single-number leaderboards can hide the weak spots that matter most in real labs.
Anchor: A model that knows facts (A) but can't run analysis code (F) shouldn't be trusted to process satellite data, even if its average is a C.
Hook: Imagine we flip from "Did it answer?" to "Can it really do science?"
Before vs After: Before: Benchmarks favored short answers, generic images, and memorized facts. After: Benchmarks cover symbol manipulation, executable code, scientific figures, and hypothesis formation. Why it matters: The new view reveals why models stumble on real research tasks, even if they chat well.
Anchor: It's the difference between reciting the recipe and actually baking a cake that rises.
Hook: Think of science reasoning as a chain; break any link and the result fails.
Why It Works (intuition):
- Capabilities are modular (like links): perception → reasoning → understanding → knowledge → hypotheses; symbols → code.
- Execution-aware scoring checks if ideas turn into working code and correct outputs.
- Expert-aligned datasets ensure tasks look like what scientists truly do. Why it matters: These checks prevent "shortcut" answers that look right but don't hold up.
Anchor: If the AI claims a plot will show a rising trend, but the code it wrote draws the wrong axes, the execution check will catch it.
Hook: Imagine building blocks you can rearrange to test any science skill.
Building Blocks:
- Taxonomy of Seven Skills: The science abilities scientists use daily.
- Expert-Aligned Benchmarks: Curated datasets like ChemBench, SciCode, SFE, MSEarth.
- Unified Interface: One way to build prompts and call models, for text and images alike.
- Capability-Oriented Scoring: Separate scores per skill so weaknesses are visible.
- Hybrid Judging: Rules for clear answers, LLM-as-a-judge for nuanced ones, and code execution for programs. Why it matters: This modular design makes the toolkit fair, extensible, and reproducible.
Anchor: Like LEGOs for evaluation: you can snap on a new dataset or a new model without rebuilding the whole castle.
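To make the capability-oriented scoring block concrete, here is a minimal Python sketch. The benchmark scores and the benchmark-to-capability mapping below are made-up illustrations, not SciEvalKit's real numbers or internal code; the point is only that results are averaged per capability instead of into one global mean.

```python
# Minimal sketch of capability-oriented scoring (illustrative numbers,
# hypothetical benchmark-to-capability mapping; not SciEvalKit's actual code).
from collections import defaultdict

# Hypothetical per-benchmark accuracies for one model.
benchmark_scores = {
    "ChemBench": 71.2,   # knowledge
    "SciCode": 38.5,     # code generation
    "PHYSICS": 44.0,     # symbolic reasoning
    "SFE": 52.3,         # multimodal understanding
}

# Hypothetical mapping from benchmark to capability.
capability_of = {
    "ChemBench": "Scientific Knowledge Understanding",
    "SciCode": "Scientific Code Generation",
    "PHYSICS": "Scientific Symbolic Reasoning",
    "SFE": "Scientific Multimodal Understanding",
}

def capability_report(scores, mapping):
    """Average benchmark scores within each capability instead of one global mean."""
    buckets = defaultdict(list)
    for bench, score in scores.items():
        buckets[mapping[bench]].append(score)
    return {cap: sum(vals) / len(vals) for cap, vals in buckets.items()}

for capability, score in capability_report(benchmark_scores, capability_of).items():
    print(f"{capability}: {score:.1f}")
```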
03 Methodology
Hook: Think of baking: ingredients → mix → bake → taste test → write the recipe card. A good pipeline does the steps in order and keeps notes.
High-Level Recipe: Input → Dataset Layer (build prompt) → Model Inference (generate answers) → Evaluation & Testing (score) → Report & Storage (save and compare). Why it matters: Clear steps mean fair, repeatable tests you can trust.
Anchor: Like a lab protocol: prepare samples, run the machine, analyze results, log everything in your notebook.
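Here is a rough, self-contained Python sketch of how those four stages chain together. Every name in it (EchoModel, build_prompt, exact_match, evaluate_benchmark) is a hypothetical placeholder for illustration, not SciEvalKit's actual API.

```python
# A rough end-to-end sketch of the four pipeline stages chained together.
# All names here are hypothetical placeholders, not SciEvalKit's real API.
import json

class EchoModel:
    """Stand-in model: returns a canned answer so the sketch runs offline."""
    def generate(self, segments):
        return "174"

def build_prompt(sample):
    """Dataset Layer: wrap the raw question as an ordered list of segments."""
    return [{"type": "text", "value": sample["question"]}]

def exact_match(prediction, sample):
    """Evaluation & Testing: simplest possible scorer (real grading is hybrid)."""
    return float(prediction.strip() == sample["answer"])

def evaluate_benchmark(model, dataset, out_path):
    records = []
    for sample in dataset:
        segments = build_prompt(sample)          # Dataset Layer: build prompt
        prediction = model.generate(segments)    # Model Inference: unified call
        records.append({"question": sample["question"],
                        "prediction": prediction,
                        "score": exact_match(prediction, sample)})
    with open(out_path, "w") as f:               # Report & Storage: save raw results
        json.dump(records, f, indent=2)
    return sum(r["score"] for r in records) / len(records)

dataset = [{"question": "How many vibrational modes does C60 have?", "answer": "174"}]
print(evaluate_benchmark(EchoModel(), dataset, "demo_results.json"))
```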
New Concept
Hook: You know how a teacher turns a textbook chapter into a clear assignment?
Dataset Layer (Prompt Builder): It turns raw samples (questions, images, code stubs, options) into a neat multi-modal message the model can read. How it works:
- Load the sample (e.g., a ChemBench question or an SFE figure).
- Pack it into an ordered list of segments, like {type: 'text', value: '...'} and {type: 'image', value: 'path'}.
- Apply the dataset's instruction style (MCQ, open-ended, numeric). Why it matters: Without consistent prompts, two models might be asked different questions, making scores unfair.
Anchor: Example: a ChemBench MCQ becomes text: "Which option is correct?" plus the options, exactly standardized.
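A minimal sketch of what such a prompt builder might look like in Python; the helper name and the instruction wording are assumptions, not the toolkit's exact template.

```python
# Minimal sketch of an MCQ prompt builder; the helper name and instruction
# wording are illustrative assumptions, not the toolkit's exact template.
def build_mcq_segments(question, options, image_path=None):
    """Pack a multiple-choice sample into the ordered text/image segment format."""
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    prompt = ("Which option is correct? Answer with a single letter.\n"
              f"{question}\n{lettered}")
    segments = [{"type": "text", "value": prompt}]
    if image_path is not None:
        segments.append({"type": "image", "value": image_path})
    return segments

# Example: a ChemBench-style question, standardized the same way for every model.
print(build_mcq_segments(
    "How many vibrational modes does a non-linear molecule with 60 atoms have?",
    ["120", "174", "180", "60"],
))
```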
New Concept
Hook: Imagine sending the same letter to many pen pals, even if some prefer email and others like postcards.
Unified Interface (Model.generate): All models, local or API, get called the same way with the same message format. How it works:
- infer_data() batches and sends messages.
- Retries and caching handle errors and partial progress.
- Models can customize formatting per dataset while staying in the same interface. Why it matters: You can swap models freely without rewriting the pipeline.
Anchor: One script can test GPT-5, Gemini-3 Pro, and Qwen3 with the same prompts and logging.
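A minimal sketch of such a unified interface, assuming an abstract generate() method that every backend implements; the class names and retry helper below are hypothetical, and a real backend would call its own SDK inside generate().

```python
# Minimal sketch of a unified model interface. BaseModel, DummyAPIModel, and
# generate_with_retry are hypothetical names; real backends would wrap their
# own SDK calls inside generate().
from abc import ABC, abstractmethod
import time

class BaseModel(ABC):
    @abstractmethod
    def generate(self, segments):
        """Take ordered text/image segments, return the model's answer text."""

    def generate_with_retry(self, segments, retries=3, delay=2.0):
        """Simple retry loop so transient API errors don't abort a whole run."""
        for attempt in range(retries):
            try:
                return self.generate(segments)
            except Exception:
                if attempt == retries - 1:
                    raise
                time.sleep(delay)

class DummyAPIModel(BaseModel):
    """Stand-in backend; the evaluation loop only sees the shared interface."""
    def generate(self, segments):
        text = " ".join(s["value"] for s in segments if s["type"] == "text")
        return f"[stub answer to: {text[:40]}]"

print(DummyAPIModel().generate_with_retry(
    [{"type": "text", "value": "Name one vibrational mode of CO2."}]))
```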
New Concept
Hook: When grading, sometimes a ruler is enough (exact match), sometimes you need a referee for style (semantic sense), and sometimes you must run the program to see if it works.
Evaluation & Testing (Hybrid Scoring): It combines exact rules, semantic LLM-as-a-judge, and code execution. How it works:
- Natural-language matching: normalize units, extract choices, compare strings or numbers.
- Code-execution: stitch predicted code into a script, install dependencies, run official tests, count passed cases.
- Judge-assisted: use a strong model to check semantic equivalence for open-ended answers. Why it matters: Different tasks need different graders; using the right one prevents unfair penalties or easy loopholes.
Anchor: SciCode checks whether the AI's Python actually passes unit tests; SFE may use a judge to see if the explanation truly matches the figure.
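A sketch of how the three graders could be dispatched by task type; the task-type labels, the judge prompt, and the helper names are illustrative assumptions rather than SciEvalKit's actual scoring code.

```python
# Sketch of hybrid scoring: pick a grader based on the task type.
# Task-type names, judge prompt, and helpers are illustrative assumptions.
import math, subprocess, sys, tempfile, textwrap

def score_exact(pred, ref, tol=1e-3):
    """Rule-based grading: numeric comparison with tolerance, else string match."""
    try:
        return float(math.isclose(float(pred), float(ref), rel_tol=tol))
    except ValueError:
        return float(pred.strip().lower() == ref.strip().lower())

def score_code(candidate_code, test_code, timeout=30):
    """Execution-based grading: run the candidate plus its tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + textwrap.dedent(test_code))
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
    return float(result.returncode == 0)

def score_with_judge(judge_model, question, pred, ref):
    """Judge-assisted grading: ask a strong model whether pred matches ref in meaning."""
    verdict = judge_model.generate([{"type": "text", "value":
        f"Question: {question}\nReference: {ref}\nAnswer: {pred}\nSame meaning? yes/no"}])
    return float(verdict.strip().lower().startswith("yes"))

def score(task_type, **kw):
    dispatch = {"exact": score_exact, "code": score_code, "judge": score_with_judge}
    return dispatch[task_type](**kw)

# Example: execution-based check that generated code passes a unit test.
print(score("code",
            candidate_code="def add(a, b):\n    return a + b",
            test_code="assert add(2, 3) == 5"))
```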
New Concept
Hook: Keeping score without a scoreboard is like playing a game with no final tally.
Report & Storage (Reproducibility): Saves predictions, logs, reasoning chains, and scores in consistent files (CSV/JSON/XLSX). How it works:
- Store raw answers and metadata per benchmark and model.
- Keep reusable cache and intermediate files for restartable runs.
- Produce capability-level summaries for leaderboards. Why it matters: Clear records let others reproduce results and compare fairly.
Anchor: A folder named Output/{Benchmark}/{Model} keeps everything tidy for audits and reruns.
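A minimal sketch of that storage convention, assuming the Output/{Benchmark}/{Model} layout mentioned above; the file names are illustrative, not the toolkit's exact output schema.

```python
# Sketch of the report/storage layout, assuming the Output/{Benchmark}/{Model}
# folder convention mentioned above; file names here are illustrative.
import json
from pathlib import Path

def save_run(benchmark, model_name, records, summary, root="Output"):
    """Write per-sample predictions and a capability-level summary to disk."""
    out_dir = Path(root) / benchmark / model_name
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "predictions.json").write_text(json.dumps(records, indent=2))
    (out_dir / "summary.json").write_text(json.dumps(summary, indent=2))
    return out_dir

# Example: store one record plus its aggregate score for later audits and reruns.
path = save_run(
    "ChemBench", "my-model",
    records=[{"id": 0, "prediction": "174", "score": 1.0}],
    summary={"Scientific Knowledge Understanding": 100.0},
)
print(f"Saved results under {path}")
```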
Seven Core Skills (in dependency order)
- Hook: You know how a biologist looks at an MRI and spots organs? Scientific Multimodal Perception: The AI finds and labels key scientific things in images (e.g., organs, cells, features in a satellite map). How it works: (a) See the image, (b) detect entities, (c) localize them. Why it matters: If you can't find the right parts, later reasoning will fail. Anchor: On SLAKE, the model identifies lungs in a chest X-ray before answering a clinical question.
- Hook: Like a detective matching footprints (image) with witness stories (text). Scientific Multimodal Reasoning: The AI connects image clues with text clues to infer answers step-by-step. How it works: (a) Ground terms to visuals, (b) chain-of-thought across modalities, (c) conclude. Why it matters: Without linking text and visuals, answers become guesses. Anchor: On MSEarth, the model reads a climate map and the caption to infer cause and effect.
- Hook: A teacher uses pictures plus captions so students truly get the idea, not just notice colors. Scientific Multimodal Understanding: The AI interprets scientific diagrams and encodings (axes, symbols, legends) and aligns them with domain meaning. How it works: (a) Parse structure and symbols, (b) align with scientific concepts, (c) extract precise info. Why it matters: Without understanding encodings, it may misread graphs. Anchor: On SFE, the AI matches plot markers to correct variables and units before answering.
- Hook: Think of algebra where letters stand for real quantities. Scientific Symbolic Reasoning: The AI manipulates equations, units, and formal expressions to solve science problems. How it works: (a) Translate text to symbols, (b) apply laws, (c) keep units consistent. Why it matters: Without exact math, scientific answers can be confidently wrong. Anchor: On PHYSICS, it derives a formula for the minimum speed to skip a stone.
- Hook: Writing a recipe that a kitchen robot can follow. Scientific Code Generation: The AI turns scientific intent into working code that passes tests. How it works: (a) Plan algorithm steps, (b) use correct libraries, (c) produce valid outputs. Why it matters: Science runs on code; pretty text isn't enough. Anchor: On SciCode, a conjugate gradient solver must actually converge on unit tests.
- Hook: When something weird happens in an experiment, a scientist proposes explanations to test next. Science Hypothesis Generation: The AI suggests plausible, testable ideas grounded in evidence. How it works: (a) Gather clues, (b) propose mechanisms, (c) outline tests. Why it matters: Without good hypotheses, research stalls. Anchor: On ResearchBench, it drafts a research idea and a plan for how to validate it.
- Hook: A good library helps you find the right facts fast. Scientific Knowledge Understanding: The AI knows domain concepts and relationships and applies them correctly. How it works: (a) Recall facts, (b) connect them to context, (c) answer precisely. Why it matters: Facts are the base layer; without them, higher reasoning wobbles. Anchor: On ChemBench, it computes the 174 vibrational modes for C60 using 3N - 6 for non-linear molecules (a quick numeric check follows this list).
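As a quick sanity check of that last example, the 3N - 6 rule can be verified in a couple of lines of Python (illustrative helper, not part of the toolkit):

```python
# Quick check of the 3N - 6 rule used in the ChemBench example above.
def vibrational_modes(n_atoms, linear=False):
    """Non-linear molecules have 3N - 6 vibrational modes; linear ones have 3N - 5."""
    return 3 * n_atoms - (5 if linear else 6)

print(vibrational_modes(60))  # C60 (non-linear) -> 3*60 - 6 = 174
```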
Secret Sauce
Hook: Great teamwork happens when each person knows their job and the handoffs are smooth.
Secret Sauce: Strict modularity + unified prompts + capability bins + execution-aware scoring + expert curation. Why it matters: This combo makes the system fair, extensible, and hard to game.
Anchor: You can plug in a new astronomy code benchmark without touching model APIs or scoring scripts, and it just works.
04 Experiments & Results
Hook: Imagine a science decathlon: running (perception), puzzles (reasoning), lab reports (understanding), exact math (symbols), coding sprints (code), and creative proposals (hypotheses). Different events reveal different strengths.
The Test: SciEvalKit measures seven core abilities across text and multimodal tasks using expert-grade datasets (e.g., ChemBench, MaScQA, ProteinLMBench; SLAKE, SFE, MSEarth; SciCode, AstroVisBench; PHYSICS, CMPhysBench; ResearchBench). Why it matters: Real science isn't one skill, so one score can't tell the full story.
Anchor: It's a report card with subjects instead of one final grade.
Hook: Think of a friendly tournament between teams with different play styles (closed-weight vs open-weight models).
The Competition: Models include Gemini-3 Pro, GPT-5/5.1, GPT-o3, GPT-4o, Claude 4.5/4.1, Seed1.6-Vision, Qwen3 families, Llama 4 Maverick, GLM-4.5V, Kimi-k2, DeepSeek-R1, Ling-flash-2.0, Grok-2-vision-1212. Why it matters: Broad coverage shows whether strengths generalize.
Anchor: Same tests, different players, so we can compare fairly.
Hook: Scores are like school grades; a 60 is a D if others get 90, but an A if the class average is 50.
The Scoreboard (with context):
- Most models score highest on Scientific Knowledge Understanding but drop sharply on Code Generation and Symbolic Reasoning. That's like getting A's in reading but C's in math and programming.
- Gemini-3 Pro is the most balanced top performer overall in multimodal and text capabilities (e.g., Sci.MM-Overall ≈ 62.88), leading in hypothesis generation and strong in symbolic reasoning and code compared to peers.
- Qwen3-Max stands out in Scientific Code Generation (≈ 43.97, best among tested) and is competitive with top proprietary models in several text abilities.
- Qwen3-VL-235B-A22B achieves very high multimodal perception (≈ 72.29) but notably lower understanding/reasoning, showing perception ≠ comprehension.
- Compared to general tasks where leaders approach roughly 90, the same models often fall below roughly 60 on rigorous scientific tasks, like dropping from an A to a D when the test switches to lab-grade science. Why it matters: Real scientific skills remain the bottleneck.
Anchor: A model that aces trivia might still write code that fails unit tests; SciEvalKit exposes that gap.
Hook: Sometimes the plot twists are in the details.
Surprising Findings:
- Perception saturation: Many models recognize objects/regions in scientific images reasonably well, but stumble when asked to interpret axes, units, or domain meaning.
- Code-symbolic correlation: Stronger code generators tend to have better symbolic math, suggesting shared foundations in formal step-by-step thinking.
- Version shifts: GPT-5 to GPT-5.1 shows small regressions on several axes, hinting that alignment tweaks don't automatically improve scientific competence.
- Open-source progress: Some open-weight models approach or rival closed models in specific skills (e.g., code gen), narrowing the gap. Why it matters: Investing in symbolic rigor and execution-aware training could yield broad gains.
Anchor: It's like seeing players who practice footwork (symbols) also pass better (code); good fundamentals carry over.
05 Discussion & Limitations
Hook: No toolkit is magic; even the best microscope can't see everything at once.
Limitations:
- Coverage: Science is huge; current benchmarks can't capture every subfield, data type, or lab workflow yet.
- Judge Dependence: LLM-as-a-judge can introduce bias; cross-judging and calibration help but aren't perfect.
- Execution Environment: Code scoring depends on sandbox setup, libraries, and timeouts; environment drift can affect results.
- Hypothesis Grading: Open-ended creativity is hard to score purely; human-in-the-loop remains valuable.
- Data & Licensing: Some domain datasets are hard to include due to access or privacy constraints.
Anchor: Like a science fair that can't include particle accelerators; some tests must wait.
Hook: Tools need fuel and a workshop to run smoothly.
Required Resources:
- API access or GPUs/TPUs for local inference;
- Python environments with scientific libraries for execution tasks;
- Storage for logs, predictions, and artifacts;
- Optional access to strong judge models.
Anchor: Think of it as a lab with instruments (models), reagents (datasets), and notebooks (reports).
Hook: Hammers aren't for every job.
When NOT to Use:
- If you need safety, ethics, or clinical deployment approvals; this evaluates scientific ability, not safety policy.
- If your model lacks needed modalities (e.g., no vision for image tasks).
- For training-time leaderboard chasing without analyzing capability breakdowns.
Anchor: Don't use a thermometer to weigh something; pick the right tool.
Hook: Good science sparks more questions.
Open Questions:
- How to robustly grade long, tool-using research agents end-to-end?
- How to better evaluate unit handling, dimensional analysis, and error propagation?
- How to incorporate new scientific modalities (e.g., spectra, 3D volumes, molecular graphs) at scale?
- How to reduce judge bias and improve explainable grading for open-ended tasks?
- How to align training so code and symbol skills improve together?
Anchor: Next steps are like adding new lab stations (spectroscopy, 3D imaging, robotics) to complete the picture.
06 Conclusion & Future Work
Hook: Imagine switching from a pop quiz to a full science lab practical; the grade tells you who can really do science.
3-Sentence Summary: SciEvalKit is an open-source toolkit that evaluates AI on seven core scientific abilities across six disciplines using expert-aligned, multimodal, and execution-aware benchmarks. It provides a unified pipeline and capability-oriented scores that reveal where models are strong (facts, perception) and where they struggle (symbolic math, working code, deep multimodal understanding). The results show that true scientific intelligence remains challenging, guiding the community toward better training and testing.
Anchor: It's a clear report card for science AIs, not just a single mystery grade.
Main Achievement: Turning scientific evaluation into a faithful, modular, and reproducible system that measures the skills scientists actually use (especially symbols, code, and figure understanding) rather than just surface answers.
Future Directions: Add agent tracks with tool-use and verification loops; expand modalities (spectra, volumetric data, molecular graphs); broaden domain coverage; improve judge calibration; and foster community-driven benchmark growth via regular releases.
Why Remember This: It reframes how we judge scientific AIs, from "Can it answer?" to "Can it really do science?", and gives everyone a shared, trustworthy yardstick to make progress faster and safer.
Practical Applications
- Select the best model for a lab by comparing capability scores (e.g., pick a code-strong model for data pipelines).
- Run regression tests after fine-tuning to ensure symbolic reasoning or code generation didn't degrade.
- Diagnose weaknesses (e.g., poor unit handling) and design targeted training or prompts to fix them.
- Benchmark agent workflows that combine retrieval, coding, and figure interpretation in sequence.
- Evaluate custom domain datasets (e.g., spectroscopy plots) by plugging them into the unified pipeline.
- Publish reproducible model cards with capability breakdowns for scientific use cases.
- Screen open-source models to find competitive options for budget-constrained research teams.
- Compare multimodal models to see if perception gains translate into understanding and reasoning.
- Use code-execution scoring to validate that generated analysis scripts meet lab standards.
- Guide curriculum-style training by tracking progress across the seven capabilities over time.