DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
Key Summary
- •DSGym is a unified 'gym' where AI data science agents are tested and trained by actually running code on real datasets, not just chatting about them.
- •The framework standardizes tasks, tools, and scoring in safe, isolated containers so results are fair, comparable, and reproducible.
- •A big problem the authors found is shortcut solvability: many past benchmarks can be answered without even opening the data files.
- •DSGym-Tasks fixes this by auditing old tasks, filtering out shortcut-solvable ones, and adding new suites like DSBio (bioinformatics) and DSPredict (Kaggle-style modeling).
- •Across models, performance drops sharply on true scientific workflows; most failures come from domain-grounding mistakes rather than syntax or coding errors.
- •On tough Kaggle-style modeling, agents often submit valid but weak solutions, showing a simplicity bias and lack of verification or iterative improvement.
- •DSGym also doubles as a training factory, generating execution-verified synthetic questions and step-by-step traces for finetuning small models.
- •A 4B model fine-tuned with DSGym’s synthetic data beats larger models like GPT-4o on standardized analysis tasks, showing data-efficient gains.
- •The framework’s manager–worker containers, read-only datasets, and task schemas make it easy to add new domains, tools, and agent scaffolds over time.
- •DSGym aims to be a living testbed that keeps measurement honest and helps build better, more grounded data science agents.
Why This Research Matters
Real science advances when results come from careful experiments, not guesses. DSGym forces AI agents to run real code on real files, so we can trust that correct answers come from proper analysis. By cleaning up old benchmarks, removing shortcut-solvable tasks, and adding domain-heavy suites, it shows where agents are strong and where they still stumble. It also doubles as a training factory, producing execution-verified examples that teach agents to plan, run, debug, and verify. This means safer automation in labs, hospitals, and businesses, and faster, more reliable discoveries. As DSGym grows, it can keep the field honest and steadily push agents toward truly expert behavior.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine your school science fair. You have a big question, some messy data from your experiment, and you need to write code to test ideas and see what’s true. Wouldn’t it be great if a helpful robot could roll up its sleeves and do the heavy data lifting—correctly and fairly?
🥬 The Concept (Data-Driven Investigation): What it is: Data-driven investigation is using real data and code to test a question so your answer is backed by evidence, not guesses. How it works (like a recipe):
- Start with a clear question (a hypothesis).
- Load the actual dataset (tables, images, gene matrices, etc.).
- Try analyses step by step (filters, stats, models), checking results each time.
- Decide the answer based on what the numbers show. Why it matters: Without this, answers can come from hunches or patterns in wording, not facts in data. 🍞 Anchor: Like measuring how fertilizer affects plant growth by actually weighing plants, not just reading the labels.
The world before: Data science agents—AI helpers that write and run code—looked promising for science: they can clean data, run statistics, and build predictive models. But the way we tested them was messy. Different benchmarks used different rules, file formats, and grading styles. Even worse, many “file-based” tasks could be solved without opening the file at all. Agents often guessed from priors or pattern matching in the prompt. That inflated scores and made it unclear if agents truly worked with data.
The problem: If an agent can answer “Which variable is most correlated with happiness?” without loading the data, we aren’t measuring data skills—we’re measuring trivia or lucky priors. Scientists need agents that can plan, code, run, check, and only believe what the data supports. We also need broad coverage: not just simple averages on CSVs, but also domain-heavy science like bioinformatics, time series, computer vision, and more.
🥬 The Concept (Shortcut Solvability): What it is: Shortcut solvability is when a task can be answered correctly without touching the actual data files. How it works:
- The agent reads the question prompt.
- It applies priors or patterns (e.g., “smoke” is likely categorical).
- It outputs the right answer by luck or common sense, not by analysis. Why it matters: It makes benchmarks look easy, hides real weaknesses, and misleads progress. 🍞 Anchor: Guessing the winner of a race from their outfit instead of timing the race.
Failed attempts: Prior benchmarks tried to be realistic by attaching files, but many didn’t strictly enforce data interaction. There were inconsistent formats, ambiguous instructions, or mistakes in answer keys. Some tasks were so generic that models solved them from world knowledge alone.
The gap: We needed a single, consistent arena where tasks, tools, and grading rules are the same across domains; where every solution must run in an isolated environment; and where we remove tasks solvable without the data. We needed a way to add new tasks easily and train agents with verified, executable examples.
🥬 The Concept (Modular Architecture): What it is: A modular architecture is a system built from swappable parts that fit together cleanly. How it works:
- Define clear pieces: tasks, agents, environments.
- Make each piece plug-and-play.
- Let people add new parts (new tasks, tools) without breaking others. Why it matters: It keeps the system future-proof and easy to extend. 🍞 Anchor: Like LEGO bricks that can build a spaceship today and a castle tomorrow.
Real stakes: In real science—like finding gene markers or forecasting hospital needs—mistakes are costly. If agents “pass” by guessing instead of analyzing, scientists could waste time, trust wrong conclusions, or miss discoveries. With careful, execution-grounded evaluation, we can build agents that genuinely help: they plan, code, run, and verify, turning data into reliable answers. That’s why DSGym exists: to make sure our data science robots do the work for real, not just look smart on paper.
02 Core Idea
Aha! Moment in one sentence: If we evaluate and train data science agents inside a standardized, containerized environment that forces real code to run over real files, we get honest, cross-domain measurements—and better agents.
Analogy 1: Sports Arena 🍞 Hook: You know how a fair race happens on the same track with the same rules for everyone? 🥬 The Concept (DSGym): What it is: DSGym is a shared arena where data science agents compete by running code on real data under the same rules. How it works:
- Tasks are standardized.
- Agents submit reasoning and code.
- Code runs in safe containers with read-only datasets.
- Answers are graded consistently. Why it matters: Fair comparisons and real skills—not lucky guesses. 🍞 Anchor: Like timing every runner with the same stopwatch on the same course.
Analogy 2: Science Kitchen 🍞 Hook: Imagine a kitchen where every chef gets the same ingredients, a clean stove, and a recipe card. 🥬 The Concept (Task Object): What it is: A Task Object is a neat package that includes data files, a prompt, a metric, and metadata. How it works:
- D: the files.
- P: the question/instructions.
- M: how we score.
- Z: metadata (domain, tags). Why it matters: No confusion about what to cook or how to taste-test. 🍞 Anchor: A labeled meal kit with ingredients, steps, and the judging rules.
Analogy 3: School Lab 🍞 Hook: In a lab, you use your own bench, tools are labeled, and results must be recorded and checked. 🥬 The Concept (Containerized Environment): What it is: Each agent runs in its own container with a Jupyter kernel, tools, and resource limits. How it works:
- Manager launches a fresh worker container per task.
- Datasets are mounted read-only; agent has a separate writable workspace.
- Code executes statefully across turns.
- Final answers evaluated separately for fairness. Why it matters: Keeps experiments clean, safe, and reproducible. 🍞 Anchor: Your own lab station—no swapping chemicals or mixing notebooks.
Before vs. After:
- Before: Benchmarks were fragmented, sometimes solvable without data, and hard to compare.
- After: One API, consistent scoring, enforced data grounding, and broad coverage—including domain-heavy science and real modeling challenges.
Why it works (intuition without equations): When an agent must run code in a sandbox that can only read the true files, its answer must come from computation, not guesswork. Standardized prompts and metrics remove ambiguity. Auditing and filtering remove easy shortcuts. Adding domain-specific suites (like DSBio) exposes weak spots (e.g., tool usage and biological context), while modeling suites (DSPredict) test full pipelines and iterative improvement.
Building Blocks: 🍞 Hook: Think of assembling a bike from wheels, frame, and chain. 🥬 The Concept (Agent Interface with CodeAct-style tags): What it is: Agents talk in a structured loop with reasoning, code, and final answer sections. How it works:
- <reasoning> plan your steps.
- <code> run Python to analyze.
- <answer> give the final result. Why it matters: Separates thinking, doing, and concluding for clear evaluation. 🍞 Anchor: Writing a plan, doing the experiment, then writing your conclusion.
🍞 Hook: When you add new subjects to school, you also add fitting textbooks and tools. 🥬 The Concept (DSGym-Tasks): What it is: A curated suite that standardizes old datasets, filters shortcuts, and adds new domains. How it works:
- Audit and fix old tasks.
- Remove tasks solvable without data.
- Add DSBio and DSPredict. Why it matters: Measures real data skills across many fields. 🍞 Anchor: A better test that checks you can actually do the work, not just memorize facts.
🍞 Hook: Like checking your homework by running it again to see if you get the same answer. 🥬 The Concept (Execution-Verified Data Synthesis): What it is: A pipeline to generate synthetic training questions and step-by-step solutions that are validated by running code. How it works:
- Agent explores data and proposes a question and answer.
- It solves its own question via code.
- A judge checks clarity, execution, and alignment.
- Keep only high-quality pairs. Why it matters: Creates trustworthy training data that teaches real execution behavior. 🍞 Anchor: Baking cupcakes, tasting them, and only sharing the ones that turn out right.
Together, these pieces form DSGym’s key idea: a live, extensible, execution-grounded testbed that both measures and improves data science agents honestly.
03 Methodology
High-level overview: Input (Task Object) → Agent loop (reasoning + code) → Container execution (stateful Jupyter) → Clean evaluation (metrics) → Output (score, logs)
Step 1: Tasks organized for real data interaction 🍞 Hook: You know how a good worksheet tells you what materials to use, what to do, and how the teacher will grade it? 🥬 The Concept (Task Object): What it is: A standardized bundle of D (data), P (prompt), M (metric), and Z (metadata) used for each problem. How it works:
- D: Mounts the real files (CSV, H5AD, images) in read-only mode.
- P: Gives clear instructions and answer format.
- M: Defines exact scoring logic (e.g., accuracy, RMSE).
- Z: Labels domain and tags for analysis. Why it matters: Ensures every task is unambiguous, scorable, and reproducible. 🍞 Anchor: A sealed lab kit with clear steps and a grading rubric.
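To make the D/P/M/Z bundle concrete, here is a minimal Python sketch of what a Task Object could look like. The class name, field names, and `exact_match` metric are illustrative assumptions, not DSGym's actual schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class TaskObject:
    """Illustrative D/P/M/Z bundle; field names are assumptions, not DSGym's schema."""
    data_files: List[str]                    # D: read-only dataset paths (CSV, H5AD, images, ...)
    prompt: str                              # P: question plus required answer format
    metric: Callable[[str, str], float]      # M: scoring function (exact match, RMSE, ...)
    metadata: Dict[str, str] = field(default_factory=dict)  # Z: domain, tags, difficulty

def exact_match(prediction: str, reference: str) -> float:
    """A simple metric M: 1.0 if the normalized answers agree, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

task = TaskObject(
    data_files=["/data/happiness_survey.csv"],
    prompt="Which variable is most correlated with 'happiness'? Answer with the column name.",
    metric=exact_match,
    metadata={"domain": "tabular-analysis", "difficulty": "easy"},
)
```

Packaging every problem this way is what lets one evaluation loop score tasks from very different domains with the same code.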
Task categories: 🍞 Hook: Predict tomorrow’s weather vs. explain why it rained. 🥬 The Concept (Data Prediction Tasks): What it is: Build a model on training data and predict labels for test data. How it works:
- Read D_train and D_test.
- Train a model on D_train.
- Generate predictions on D_test.
- Score using a metric (e.g., RMSE, accuracy). Why it matters: Tests full ML pipelines and engineering choices. 🍞 Anchor: Forecasting web traffic for Wikipedia pages.
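As a concrete illustration of that recipe, here is a minimal pandas/scikit-learn sketch of a prediction pipeline. The file paths, the `id` and `sales` columns, and the assumption that all features are numeric are hypothetical; a real task defines these in D and P.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical paths and columns; datasets are mounted read-only under /data.
train = pd.read_csv("/data/train.csv")   # D_train
test = pd.read_csv("/data/test.csv")     # D_test
target = "sales"

X = train.drop(columns=[target, "id"])   # assume the remaining columns are numeric features
y = train[target]

# Hold out part of D_train to estimate the metric M (here RMSE) before submitting.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
print(f"validation RMSE: {rmse:.3f}")

# Predictions for D_test go into the agent's writable workspace, not the data mount.
submission = pd.DataFrame({"id": test["id"], target: model.predict(test.drop(columns=["id"]))})
submission.to_csv("/workspace/submission.csv", index=False)
```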
🍞 Hook: Solving a riddle by crunching the numbers you’re given. 🥬 The Concept (Data Analysis Tasks): What it is: Answer a question by programmatically analyzing datasets. How it works:
- Load files and inspect columns.
- Transform and compute statistics.
- Apply tests/models as needed.
- Output a precise answer with the required format. Why it matters: Proves the agent can reason with data, not just predict. 🍞 Anchor: Testing whether two groups have different averages using the real table.
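Here is what such a task might look like as code, in the spirit of the two-group comparison above. The file name, column names, significance level, and yes/no answer format are assumptions for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical file and columns; the real task states them in the prompt P.
df = pd.read_csv("/data/plants.csv")   # columns: group ('fertilized' or 'control'), weight

fertilized = df.loc[df["group"] == "fertilized", "weight"]
control = df.loc[df["group"] == "control", "weight"]

# Welch's t-test: do the two groups have different mean weights?
t_stat, p_value = stats.ttest_ind(fertilized, control, equal_var=False)

# Answer in the exact format the metric expects (here: "yes"/"no" at alpha = 0.05).
answer = "yes" if p_value < 0.05 else "no"
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, answer = {answer}")
```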
Step 2: Agent loop and interface 🍞 Hook: In class, you write your plan, do the experiment, and then write the conclusion. 🥬 The Concept (Structured Agent Interface): What it is: Agents communicate in turns using tags for planning, code, and final answers. How it works:
- <reasoning> for the plan and reflections.
- <code> for Python that runs inside the container.
- <answer> for the final response. Why it matters: Clean separation makes logs clear and grading reliable. 🍞 Anchor: A lab report with Methods, Results, and Conclusion sections.
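A small sketch of how an environment could pull these sections out of one agent turn. The regex-based helper is an assumption about the parsing, not DSGym's actual implementation.

```python
import re
from typing import Optional

def extract_tag(turn: str, tag: str) -> Optional[str]:
    """Return the content of <tag>...</tag> in an agent turn, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", turn, flags=re.DOTALL)
    return match.group(1).strip() if match else None

turn = """
<reasoning>Load the CSV, then find the column most correlated with 'happiness'.</reasoning>
<code>
import pandas as pd
df = pd.read_csv('/data/happiness_survey.csv')
print(df.corr(numeric_only=True)['happiness'].drop('happiness').idxmax())
</code>
"""

plan = extract_tag(turn, "reasoning")   # logged for the trace
code = extract_tag(turn, "code")        # sent to the container's Jupyter kernel
answer = extract_tag(turn, "answer")    # None here: the agent has not committed to an answer yet
print(plan, bool(code), answer)
```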
Step 3: Environment: manager–worker containers with state 🍞 Hook: Everyone gets their own clean lab bench with tools and a notebook. 🥬 The Concept (Isolated, Stateful Containers): What it is: Each task runs in its own Docker container with a Jupyter kernel; datasets are read-only. How it works:
- Manager spins up a worker per task.
- Worker has domain libraries preinstalled.
- State persists across turns (variables, models, temp files).
- Resource limits avoid runaway jobs; final grading runs cleanly outside. Why it matters: Guarantees isolation, safety, and reproducibility. 🍞 Anchor: Your own station where your chemicals (data) can’t be altered and your notes persist.
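A sketch of how a manager process might launch a worker with these guarantees, shelling out to the standard docker CLI. The image name, host paths, and exact limits are assumptions; DSGym's real manager may use a different mechanism.

```python
import subprocess

# Hypothetical image and paths; a real manager would pick a domain-specific image per task.
cmd = [
    "docker", "run", "--rm", "--detach",
    "--memory", "8g", "--cpus", "2",           # resource limits against runaway jobs
    "-v", "/datasets/task_001:/data:ro",       # D mounted read-only
    "-v", "/scratch/task_001:/workspace:rw",   # separate writable workspace for the agent
    "dsgym-worker:latest",                     # hypothetical worker image with a Jupyter kernel
]
container_id = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()
print(f"worker started: {container_id}")
```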
Support features: 🍞 Hook: Borrowing a calculator that’s checked to work. 🥬 The Concept (Tool Integration): What it is: Safe code-callable tools (e.g., lightweight web search) inside the kernel. How it works:
- Tools are registered functions.
- Agents call them via code.
- Results return as information blocks. Why it matters: Adds power without breaking safety or reproducibility. 🍞 Anchor: Using a certified scale in a lab instead of guessing weights.
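One simple way to implement the registered-function pattern is sketched below; the registry decorator and the `web_search` stub are hypothetical, not DSGym's actual tool API.

```python
from typing import Callable, Dict, List

TOOLS: Dict[str, Callable] = {}

def register_tool(fn: Callable) -> Callable:
    """Expose a function to the agent's kernel as a named, code-callable tool."""
    TOOLS[fn.__name__] = fn
    return fn

@register_tool
def web_search(query: str, top_k: int = 3) -> List[str]:
    """Hypothetical stub; a real tool would call a vetted, rate-limited backend."""
    return [f"[stub result {i + 1} for: {query}]" for i in range(top_k)]

# Inside the kernel, the agent simply calls the tool and gets results back as data.
print(TOOLS["web_search"]("H5AD file format"))
```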
🍞 Hook: You can look, but you can’t secretly change the answer key. 🥬 The Concept (Filesystem Protection): What it is: Datasets are mounted read-only; agent writes in a separate workspace. How it works:
- Strict permissions.
- Clear separation of read-only data and writable area.
- Prevents tampering or hidden shortcuts. Why it matters: Keeps results honest and repeatable. 🍞 Anchor: Transparent lockers for materials; you can’t swap ingredients.
Step 4: Dataset curation: cleaning and filtering 🍞 Hook: Before a test, teachers fix broken questions and remove trick items that don’t test real skills. 🥬 The Concept (Shortcut Filtering): What it is: Remove tasks often solvable without data access. How it works:
- Audit tasks for quality and formatting.
- Run multiple top models with data access disabled.
- If most can still answer correctly, drop the task. Why it matters: Forces data-dependent reasoning. 🍞 Anchor: Cross out the quiz question you can answer by guessing the pattern.
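In rough Python, the filter boils down to something like the sketch below. The 50% threshold and the shape of the `models` callables are assumptions standing in for the paper's exact procedure.

```python
from typing import Callable, List

def is_shortcut_solvable(
    prompt: str,
    reference: str,
    models: List[Callable[[str], str]],   # each sees only the prompt: data access disabled
    threshold: float = 0.5,               # assumed cutoff for "most models"
) -> bool:
    """True if most models answer correctly without ever touching the files."""
    correct = sum(m(prompt).strip().lower() == reference.strip().lower() for m in models)
    return correct / len(models) >= threshold

# Hypothetical usage: keep only tasks that genuinely require the data.
# kept_tasks = [t for t in tasks if not is_shortcut_solvable(t.prompt, t.reference, judges)]
```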
New task suites: 🍞 Hook: Adding a science unit with real lab techniques. 🥬 The Concept (DSBio): What it is: 90 expert-built bioinformatics tasks from peer-reviewed studies. How it works:
- Pick public datasets that fit compute limits.
- Turn paper findings and expert analyses into executable, deterministic questions.
- Double-review with gold notebooks for correctness. Why it matters: Tests domain understanding, special file types (e.g., H5AD), and library usage. 🍞 Anchor: Finding which cell type is most correlated with a spatial cluster, using real omics data.
🍞 Hook: Practicing with real sports tournaments to prepare for the championship. 🥬 The Concept (DSPredict): What it is: A collection of Kaggle-like modeling challenges (easy and hard) standardized for DSGym. How it works:
- Crawl competitions via API.
- Filter by rules (CSV submissions, size limits, clear leaderboards).
- Prepare metadata and difficulty splits. Why it matters: Tests end-to-end modeling and iterative improvements. 🍞 Anchor: Beating a leaderboard with a pipeline that loads, trains, tunes, and submits.
Step 5: Training via execution-verified synthesis 🍞 Hook: Making your own practice quizzes and solving them to check they work. 🥬 The Concept (Execution-Verified Data Synthesis): What it is: Generate synthetic questions and solutions that are checked by running code. How it works:
- Agent explores data and proposes distinct, solvable queries with answers.
- Sample diverse solution trajectories.
- An LLM judge checks clarity, execution robustness, alignment, and answer plausibility.
- Keep diverse, high-quality pairs for supervised finetuning. Why it matters: Produces training data that teaches agents to plan, run, debug, and verify. 🍞 Anchor: Writing math problems, solving them, and only keeping the ones that compute cleanly.
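The propose-solve-judge loop can be summarized in a short sketch. The `propose`, `solve`, and `judge` callables are hypothetical stand-ins for the exploring agent, the sandboxed solver, and the LLM judge.

```python
from typing import Callable, Dict, List, Tuple

def synthesize_verified_examples(
    dataset_path: str,
    propose: Callable[[str], Tuple[str, str]],            # agent: (question, reference answer)
    solve: Callable[[str, str], Tuple[str, str, bool]],    # agent: (code trace, answer, ran_ok)
    judge: Callable[[str, str, str, str], bool],           # LLM judge: accept or reject the pair
    n_candidates: int = 20,
) -> List[Dict[str, str]]:
    """Keep only question/solution pairs whose code actually ran and whose answer
    the judge accepts; the survivors become supervised finetuning data."""
    kept = []
    for _ in range(n_candidates):
        question, reference = propose(dataset_path)              # 1. explore data, propose Q&A
        trace, answer, ran_ok = solve(dataset_path, question)    # 2. solve it with executed code
        if ran_ok and judge(question, trace, answer, reference):  # 3. verify before keeping
            kept.append({"question": question, "trace": trace, "answer": answer})
    return kept
```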
Secret sauce: DSGym’s honesty comes from enforcing execution over data in safe, consistent containers, combined with audited tasks that remove shortcuts and add domain depth. Its extensible design lets the community keep growing the gym without breaking fairness or reproducibility.
04 Experiments & Results
The test: The authors checked if agents can truly analyze data, handle specialized science workflows, and build end-to-end models. They measured exact-match accuracy for analysis tasks and leaderboard-style metrics for modeling: Valid Submission Rate (did it run and submit?), Above Median Rate (better than the typical team?), and Medal Rate (did it reach a top-tier threshold?).
🍞 Hook: It’s like grading a race by: did you finish, did you beat the average runner, and did you get on the podium? 🥬 The Concept (Leaderboard-style Metrics): What it is: Three modeling metrics—Valid Submission Rate, Above Median Rate, Any Medal Rate—plus Percentile rank for saturated easy sets. How it works:
- Valid: Does the pipeline run and produce a properly formatted submission?
- Above Median: Is your score better than the median competitor?
- Medal: Did you reach a bronze/silver/gold-like threshold? Why it matters: Separates “it runs” from “it wins.” 🍞 Anchor: Finishing the race vs. beating the average vs. earning a medal.
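Computed over a set of modeling tasks, the three rates reduce to simple averages of per-task booleans, as in the sketch below; the result fields are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ModelingResult:
    valid_submission: bool   # pipeline ran and produced a well-formed submission
    above_median: bool       # beat the median leaderboard score
    medal: bool              # reached a bronze/silver/gold-style threshold

def leaderboard_metrics(results: List[ModelingResult]) -> Dict[str, float]:
    n = len(results)
    return {
        "valid_submission_rate": sum(r.valid_submission for r in results) / n,
        "above_median_rate": sum(r.above_median for r in results) / n,
        "medal_rate": sum(r.medal for r in results) / n,
    }

# Hypothetical example: 3 of 4 runs valid, 1 above median, no medals.
demo = [
    ModelingResult(True, False, False),
    ModelingResult(True, True, False),
    ModelingResult(True, False, False),
    ModelingResult(False, False, False),
]
print(leaderboard_metrics(demo))  # {'valid_submission_rate': 0.75, 'above_median_rate': 0.25, 'medal_rate': 0.0}
```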
The competition: They compared frontier proprietary models (like GPT-5.1, GPT-4o, Claude Sonnet 4.5/4) and strong open-weight models (Qwen3-Coder 480B, Qwen3 235B, DeepSeek-v3.1, Kimi K2, etc.). Every model used the same default CodeAct-style agent with no external tools.
Scoreboard with context:
- •General analysis (QRData-Verified, DABStep-easy, DAEval-Verified): Strong models did well, but scores on the multi-step hard split (DABStep-hard) were much lower. Think of getting an A on worksheets but a C on the long, multi-part project.
- DSBio (bioinformatics): Accuracy was much lower across the board. Kimi K2 Instruct led at about 43%, with Claude Sonnet 4.5 next. This shows that specialized scientific workflows are genuinely hard.
- •DSPredict-Easy (Kaggle playground): Many models had very high Valid Submission Rates (80–100%). That’s like most runners crossing the finish line.
- •DSPredict-Hard: Valid Submission Rates were often under 70%, and Medal Rates were near zero; Above Median peaked near 14.3%. This means agents often got a runnable baseline but didn’t push to competitive performance—like finishing the marathon but far from the podium.
- GPT-5.1 with higher reasoning effort performed best among the evaluated models, especially on harder prediction tasks.
Surprising findings: 🍞 Hook: Imagine baking cookies that look perfect but taste plain. 🥬 The Concept (Simplicity Bias): What it is: Agents prefer quick, safe answers (like median baselines) over stronger but more complex solutions. How it works:
- Tool or API hiccup happens.
- Agent switches to a minimal-effort approach.
- Produces a valid but underperforming submission. Why it matters: High valid rates can hide weak modeling and lack of iteration. 🍞 Anchor: Turning in a neat worksheet with only the easiest problems solved.
🍞 Hook: Misunderstanding a biology word and then using the wrong tool. 🥬 The Concept (Domain-Grounding Errors): What it is: Mistakes from misreading scientific context or misusing domain libraries. How it works:
- Misinterpret dataset metadata or tasks.
- Pick wrong preprocessing or analysis methods.
- Get the right-looking numbers for the wrong question. Why it matters: Leads to confident but incorrect science. 🍞 Anchor: Using a kitchen thermometer to measure wind speed.
Numbers made meaningful:
- On DSBio, failure analysis showed 85–96% of sampled errors were domain-grounding related. That’s like most wrong answers coming from misunderstanding the question, not from typos.
- Shortcut filtering caused consistent accuracy drops across models on the same dataset, proving that earlier scores benefited from prompt-only shortcuts.
- A 4B model fine-tuned on 2,000 execution-verified synthetic examples (DSGym-SFT) beat its own base version and even surpassed GPT-4o on some standardized analysis benchmarks. That’s like a junior runner training with carefully checked drills and then outperforming bigger athletes on certain courses.
Takeaway: When evaluation requires code to run over real files, we see the true picture: agents can often “finish” but rarely “medal” on hard modeling; they stumble on specialized science; and they lean on simple baselines unless trained to plan, verify, and iterate.
05 Discussion & Limitations
Limitations:
- Coverage: DSGym focuses on deterministic, file-grounded tasks. Real science also includes open-ended exploration, visualization, and multiple valid answers. Those are not yet covered.
- Domain breadth: The current flagship scientific suite is bioinformatics; other domains (chemistry, materials, geoscience) are promising but require similar expert curation.
- Compute and environment: Containerized, stateful execution is powerful but comes with resource limits (memory, timeouts) and occasional API/library incompatibilities that can bottleneck progress.
- Training signal: For RL-style training, credit assignment over long, multi-step workflows remains hard; designing good rewards and verification is an open problem.
Required resources:
- Docker-capable hardware to run isolated containers, ideally with enough CPU/RAM for parallel trajectories.
- Curated datasets and domain-specific containers (e.g., bioinformatics libraries for H5AD) when evaluating specialized workflows.
- Stable internet is not required for core tasks (tools like web search are off by default), which supports reproducibility.
When not to use:
- If you need visualization-heavy tasks or subjective reporting (e.g., “write a narrative insight”), DSGym’s deterministic scoring may not fit yet.
- If your models can’t run code or you can’t support containerized execution, you won’t get the main benefit of execution-grounded evaluation.
- If you want a text-only reasoning benchmark, DSGym is overkill.
Open questions:
- How to scale domain coverage without sacrificing expert quality and determinism?
- Can tool abstractions (well-scoped, robust functions) reduce domain errors while preserving problem-solving depth?
- What are the best curricula or RL rewards to push agents beyond simplicity bias toward iterative improvement and verification?
- How to evaluate open-ended scientific discovery fairly—perhaps using execution-trace judges or controlled human-in-the-loop protocols?
Overall, DSGym sets a solid foundation for honest, execution-grounded measurement, but the community still needs to broaden domains, handle open-ended science, and teach agents to persist, verify, and refine like real data scientists.
06 Conclusion & Future Work
Three-sentence summary: DSGym is a standardized, containerized framework that evaluates and trains data science agents by forcing real code to run over real datasets, ensuring honest, reproducible results. It fixes common benchmark flaws by auditing tasks, filtering out shortcut-solvable items, and adding challenging suites like DSBio (bioinformatics) and DSPredict (Kaggle-style modeling). Beyond evaluation, DSGym synthesizes execution-verified training data that makes even small models measurably better.
Main achievement: Turning evaluation into an execution-grounded, extensible “gym” that both reveals true agent capabilities and produces trustworthy training data—closing the loop between measuring and improving.
Future directions: Expand domain coverage (chemistry, geoscience, economics), add robust tool packs to reduce domain errors, study RL and curricula that reward verification and iteration, and design reliable judges for open-ended scientific discovery. Continue operating as a live testbed so the community can add tasks, tools, and agents while preserving auditable, apples-to-apples comparisons.
Why remember this: DSGym shows that when you make agents actually run code on locked-down data, the fog lifts—you see real strengths, real weaknesses, and a clear path to teaching better habits. It’s a practical blueprint for moving AI data science from “sounds smart” to “proves it with execution,” which is exactly what science needs.
Practical Applications
- •Evaluate your in-house data assistant fairly across many domains using a single, reproducible API.
- •Train a small open-weight model with execution-verified traces to improve planning and debugging skills cost-effectively.
- •Benchmark candidate LLMs for regulated settings (finance, healthcare) where data-grounded analysis is critical.
- •Stand up a classroom lab where students learn by writing agents that truly run code on datasets.
- •Prototype domain-specific agent packs (e.g., bioinformatics) with preloaded libraries and read-only datasets.
- •Run ablation studies on agent scaffolds (ReAct/CodeAct/tree search) in a controlled, containerized environment.
- •Continuously add new tasks or tools without breaking old results, enabling living leaderboards for progress tracking.
- •Use DSPredict to stress-test your team’s AutoML or feature engineering strategies against real leaderboards.
- •Generate high-quality synthetic instruction data that’s execution-verified to fine-tune agents safely.
- •Audit and remove shortcut-solvable tasks from your internal benchmarks to get truer capability signals.