AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents
Key Summary
- AIRS-Bench is a new test suite that checks whether AI research agents can do real machine learning research from start to finish, not just answer questions.
- Each task gives a clear problem, a real dataset, and a scoring rule, and the agent must write and run its own code to solve it; no starter code is provided.
- The benchmark covers 20 hard, recent tasks across NLP, math, code, molecules/proteins, and time series, and works with multiple agent frameworks (harnesses).
- AIRS-Bench measures three things: how often an agent submits a valid answer (VSR), how well it performs across different tasks (normalized score), and how it ranks versus others (Elo).
- Agents beat human state-of-the-art on 4 tasks but fall short on the other 16, showing big room to improve and that the benchmark isn't close to solved.
- Tree-search (greedy) scaffolds often help agents more than single-pass or linear scaffolds, especially with reasoning-focused models.
- Even the top agents are far from the theoretical ceilings of the tasks, so better scaffolds and smarter search can still move the needle.
- The benchmark tries to reduce unfair advantages by standardizing environments and using a normalization that makes different metrics comparable.
- All task definitions and evaluation code are open-sourced, inviting the community to build stronger, more reliable AI research agents.
Why This Research Matters
AIRS-Bench moves AI from talking about research to actually doing it under fair, repeatable conditions. This helps teams choose reliable agent designs that save time and compute, speeding up discoveries in areas like healthcare, materials, and energy forecasting. By normalizing scores across very different tasks, labs can spot genuine improvements instead of being fooled by easy wins. Open tasks and code make it easier for everyone, from startups to universities, to compare methods and reproduce results. In the long run, this kind of rigorous evaluation builds trust, reduces hype, and focuses investment on strategies that truly advance autonomous scientific research.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine your class is doing a big science fair. Some kids write ideas, some build things, some test and fix mistakes, and some explain what happened. Now imagine a robot helper that could do all of that. How do we grade this robot fairly?
The Concept (Benchmarks for AI research agents):
- What it is: A benchmark is a fair race track with clear rules to judge how well AI research agents do real scientific work from idea to results.
- How it works:
- Pick important, real research problems and datasets.
- Tell the agent exactly what counts as success (the metric) but don't give it any starter code.
- Make the agent write and run its own code to solve the task.
- Score the result in a way that lets us compare across many different tasks.
- Repeat across tasks and agents to see who's truly better.
- Why it matters: Without a fair, shared test, we can't tell if a new agent is actually better or just lucky or using hidden shortcuts.
Anchor: Like a school sports day where every event has rules (distance, time, points), and the same stopwatch times everyone.
Hook: You know how a very smart friend can explain things and write essays? That's like an LLM.
The Concept (Large Language Models, LLMs):
- What it is: LLMs are AI systems that learn patterns in language so they can read, write, and reason with text.
- How it works:
- Read tons of text to learn how words relate.
- Predict the next word step by step to generate useful responses.
- Use this skill to plan, write code, and explain ideas.
- Why it matters: LLMs are the "brains" behind research agents: they think through instructions and produce plans and code.
Anchor: When you ask a chatbot to help with homework, that chatbot is powered by an LLM.
Hook: Building a treehouse is easier with scaffolding that holds you up while you work.
The Concept (Scaffolds):
- What it is: A scaffold is a strategy that guides the LLM's steps: how to think, try, get feedback, and improve.
- How it works:
- Pick a search style (e.g., one step, step-by-step, or tree search with many branches).
- Use operators like Draft (make a first version), Debug (fix errors), and Improve (try a better idea).
- Look at feedback (scores, errors) and decide the next move.
- Why it matters: Without scaffolds, even smart models can get stuck, fail to fix bugs, or miss better solutions.
Anchor: Like following a recipe with taste tests at each step to adjust seasoning.
Hook: A coach organizes practice, keeps time, and tracks scores so players can focus on playing.
The Concept (Harnesses):
- What it is: A harness is the software that runs the agent and its scaffold, manages tools, executes code, and keeps everything fair.
- How it works:
- Load the task and its files (problem, data, metric).
- Let the agent draft, run, and evaluate code in a controlled environment.
- Save results, handle crashes, and enforce time/compute limits.
- Why it matters: Without a solid harness, differences in setup, not the agent's skill, could decide who wins.
Anchor: Like a science lab manager who sets up the benches, safety rules, and timers for every group.
The World Before: AI models got very good at single-shot answers, but real research is a long journey: form a hypothesis, choose a method, write code, run experiments, analyze, and iterate. Many earlier tests judged only small pieces (like code fixes or a single training run), used starter code, or had inconsistent environments. Also, LLMs might have seen benchmark answers online (data contamination), and agent setups differed so much that results were hard to compare.
The Problem: We lacked a standardized, end-to-end way to test if an AI agent could be a real research helper: inventing, coding, debugging, experimenting, and refiningāunder fair, repeatable conditions.
Failed Attempts: Prior benchmarks often (1) gave starter solutions, lowering the bar; (2) emphasized short tasks that didn't need long planning; or (3) mixed environments so much that it was unclear whether wins came from the agent or the setup. On top of that, performance metrics varied wildly across tasks, making cross-task comparisons shaky.
Hook: Comparing apples and oranges is hard unless you convert them to the same unit.
The Concept (Normalized score):
- What it is: A way to rescale different task metrics so we can compare them fairly on a 0-to-1+ scale.
- How it works:
- Find the worst valid score seen and label it 0.
- Set the human state-of-the-art (SOTA) as 1.
- Transform each agent's raw score to this shared scale using a special curve (the "march of nines") that values closing tiny gaps near optimal.
- Why it matters: Without normalization, a 1% gain could be huge for one task but tiny for another, and we'd misjudge progress.
Anchor: Like curving grades so a 95% in one hard test and a 70% in a super-hard test can be compared fairly.
The Gap: We needed a carefully curated set of modern, unsolved tasks; a standard task format; fair, harness-agnostic execution; and robust, cross-task metrics. And we needed agents to produce code from scratch, proving true research ability, not just pattern matching.
Real Stakes: Better AI research agents could accelerate discoveries in medicine, energy, and education by exploring ideas faster and more broadly. A shared, open benchmark keeps progress honest and reproducible, reduces hype, and gives the community a clear map of what works, what doesnāt, and where to improve next.
02 Core Idea
Hook: Think of a cooking show where contestants must cook from a mystery basket, write their own recipe, and the judges taste the final dish. No pre-written recipes allowed.
The Concept (AIRS-Bench's main insight):
- What it is: Make AI agents write and run their own research code on real, recent ML tasks, then score them with fair, comparable metrics so we can truly see who can "do science."
- How it works:
- Each task gives a clear problem, dataset, and metric, plus a project description and evaluation scripts.
- The agent, powered by an LLM and a scaffold, drafts, debugs, and improves code; harnesses execute everything under the same rules.
- Results are scored and normalized; rankings across agents use Elo.
- Why it matters: This turns vague promises into measurable proof of research ability, across diverse domains, without starter code.
Anchor: Like a science fair where every team builds from scratch, uses the same lab rules, and is graded with a shared rubric.
The "Aha!" in one sentence: Evaluate agents as end-to-end researchers by forcing them to generate, run, and refine their own code on unsaturated, standardized tasks, and compare them fairly across tasks and setups.
Multiple Analogies:
- Sports League: Each task is a different sport; normalized scores put all sports on one leaderboard; Elo ranks teams after every match-up.
- Treasure Hunt: Agents plan routes (scaffolds), try paths (draft/debug/improve), and use a map (harness) that's the same for everyone; normalized scores measure how close each gets to the treasure.
- Kitchen Lab: Tasks are recipes to invent; harness is the kitchen; agents combine ingredients (models, training loops), taste, adjust, and plate; judges use the same scoring guide.
Before vs After:
- Before: Many tests judged narrow skills, used inconsistent environments, and let agents lean on baseline code.
- After: AIRS-Bench demands full-cycle research, no starter code, unified task format, controlled runs, and cross-task comparison.
Hook: You know how a good measuring tool makes tiny improvements near perfection still count as big wins?
The Concept (Why it works: intuition behind the math):
- What it is: The normalization uses a curve (the "march of nines") so shaving errors near the best-known scores is still recognized as meaningful.
- How it works:
- Define the taskās true theoretical best (like 100% accuracy or 0 error).
- Map raw scores so that going from 0.99 to 0.999 counts similarly to going from 0.9 to 0.99 (both are 10× closer to perfect).
- Aggregate these normalized scores across tasks.
- Why it matters: Research often struggles to improve near the top; this curve respects that effort and compares apples to apples.
Anchor: Like stopwatch precision in sprinting: cutting from 10.00s to 9.90s is a huge deal, and the scoring should reflect it.
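A minimal sketch of this normalization, assuming a log-of-the-remaining-gap transform (the paper's exact formula may differ): it satisfies the stated properties, with worst-seen mapping to 0, human SOTA to 1, and each extra "nine" of progress earning equal credit.

```python
import math

def march_of_nines(raw: float, worst: float, sota: float, ceiling: float = 1.0) -> float:
    """Map a raw task score onto a shared scale: worst-seen -> 0, human SOTA -> 1.

    Assumes higher raw scores are better and `ceiling` is the theoretical best
    (e.g. 1.0 for accuracy). Taking the log of the remaining gap makes each
    extra "nine" (0.9 -> 0.99 -> 0.999) count equally, as the text describes.
    This exact formula is an illustrative assumption, not the paper's code.
    """
    def nines(score: float) -> float:
        gap = max(ceiling - score, 1e-12)  # distance to the theoretical best
        return -math.log10(gap)
    return (nines(raw) - nines(worst)) / (nines(sota) - nines(worst))
```

With `worst=0.5` and `sota=0.99`, scores above SOTA come out greater than 1, and moving from 0.99 to 0.999 earns exactly as much normalized credit as moving from 0.9 to 0.99.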
Building Blocks:
- Hook: Think of a tidy toolbox where each tool has a clear label and place.
- The Concept (Task configuration standard):
- What it is: A neat bundle: metadata.yaml (task facts), project_description.md (instructions), prepare/evaluate scripts, and data folders.
- How it works: Agents read the description, load the data, train models, write submission.csv, and get scored by evaluate.py.
- Why it matters: Consistent, portable tasks remove confusion and allow fair comparisons and easy onboarding.
- Anchor: Like a LEGO set with instructions and numbered bags.
- Harness-agnostic design: The same task bundle works with different harnesses (AIRA-dojo tree-search; MLGym ReAct style), so results reflect the agent, not the setup.
- Metrics trio: Valid Submission Rate (can you even submit right?), Normalized Score (how well across tasks?), Elo (who wins head-to-head?).
- Diverse, recent tasks: 20 modern, unsaturated challenges across NLP, code, molecules/proteins, math, and time series.
Together, these pieces make a sturdy bridge from "cool demo" to "proven research skill," with open-source code that invites the community to iterate and improve.
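The task bundle described above can be checked mechanically. This sketch assumes the file names listed in the text (metadata.yaml, project_description.md, prepare.py, evaluate.py, data/train, data/test); the validation logic is illustrative, not the benchmark's own tooling.

```python
from pathlib import Path

# Entries named in the AIRS-Bench task standard (evaluate_prepare.py and other
# helpers omitted for brevity); the check itself is an illustrative sketch.
REQUIRED = [
    "metadata.yaml",           # task facts: dataset, splits, metric, SOTA reference
    "project_description.md",  # plain-English instructions and scoring rules
    "prepare.py",              # data prep seen by the agent
    "evaluate.py",             # official scorer
    "data/train",              # training data visible to the agent
    "data/test",               # test view with labels hidden
]

def missing_parts(task_dir: str) -> list[str]:
    """Return the required bundle entries that are absent from task_dir."""
    root = Path(task_dir)
    return [entry for entry in REQUIRED if not (root / entry).exists()]
```

A harness could run this before launching an agent, so a malformed bundle fails fast instead of wasting a 24-hour run.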
03 Methodology
High-level recipe: Input (Task files: problem, dataset, metric; plus project_description.md, metadata.yaml, prepare/evaluate scripts) → Agent (LLM + scaffold) inside a harness (AIRA-dojo or MLGym) → Draft/Run/Evaluate/Iterate for up to 24 hours on one H-200 GPU → Output submission.csv → Score, Normalize, and Rank.
Step-by-step:
- Task Packaging
- What happens: Each task is a tidy folder with: • metadata.yaml (task name, dataset source/splits, metric, SOTA reference), • project_description.md (plain-English instructions + scoring rules), • prepare.py and evaluate_prepare.py (data prep for agent vs evaluator), • evaluate.py (official scorer), • data/train and data/test folders (labels hidden from the agent in the test view).
- Why it exists: It standardizes tasks across harnesses and prevents leaks (agents can't peek at test labels).
- Example: SVAMP math QA asks agents to predict numerical answers; evaluate.py checks accuracy over 300 test rows.
- Environment Setup (Harness)
- What happens: The harness loads the task, sets time/compute limits (24h, one H-200 GPU), gives the agent access to tools (bash, Python, cached HF models), and logs everything.
- Why it exists: To make runs consistent and fair across agents, and to capture crashes or invalid outputs.
- Example: Both AIRA-dojo and MLGym allow internet access and a cache of older HF checkpoints (latest ~2021) to avoid rate limits.
- Agent Definition: LLM + Scaffold
- What happens: An LLM (e.g., reasoning-focused or code-focused) is paired with a scaffold strategy: • One-Shot: single draft only (AIRA-dojo Draft operator). • Greedy (AIRA-dojo): tree search over many candidate solutions with Draft/Debug/Improve operators. • ReAct (MLGym): sequential think-act-reflect loops.
- Why it exists: Different search styles explore differently (breadth via trees vs depth via sequences), which affects solution quality.
- Example: Greedy search may spin up 10+ code variants, fix errors, and keep the best-performing branch.
- Draft → Debug → Improve Loop
- What happens: • Draft: The agent writes initial training/inference code using the task files and instructions. • Run & Evaluate: The harness executes the code and gets metric scores from evaluate.py. • Debug: If the code crashes or misformats submission.csv, the agent repairs it. • Improve: The agent tweaks models (architecture, hyperparameters, ensembling) to raise scores.
- Why it exists: Real research is iterative; this loop codifies the scientific method (hypothesize, test, revise).
- Example: For SICK textual classification, the agent fine-tunes two transformers, collects logits via 5-fold CV, and trains a meta-learner to combine them.
- Submission and Scoring
- What happens: When ready, the agent outputs submission.csv in the right shape; evaluate.py computes the raw metric (e.g., accuracy, MAE); results are stored.
- Why it exists: A strict format ensures objective, automated scoring and easier comparison.
- Example: For SVAMP, the header must be exactly "Answer" and the length must match the test set rows.
- Aggregation Across Seeds and Tasks
- What happens: Each agent-task pair is run multiple times (seeds). Three aggregate metrics are computed: • Valid Submission Rate (VSR): fraction of seeds that produced a valid, scorable submission. • Normalized Score (NS): raw scores mapped via a transform so 0 is the worst seen, 1 is human SOTA, and >1 means above SOTA. • Elo Rating: Fit a Bradley-Terry model to head-to-head results across tasks to produce an order-invariant ranking.
- Why it exists: Single runs are noisy; different tasks have different scales; Elo captures relative skill robustly.
- Example: If two agents tie or both fail on a task, that counts as a draw in Elo.
- The Secret Sauce
- Versatile task standard: A single task bundle that ports cleanly across harnesses, minimizing environment effects.
- No baseline code: Forces genuine ideation and engineering, exposing true agentic ability.
- March-of-nines normalization: Rewards progress near the top, where it's hardest but most meaningful.
- Broad, unsaturated tasks: Keeps the benchmark challenging and future-proof, so it wonāt saturate quickly.
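The Draft/Debug/Improve loop described above can be sketched as a toy greedy search. The operator and scoring functions here are stand-ins supplied by the caller, not AIRA-dojo's actual interfaces; a score of None models a crashed run or invalid submission.

```python
def greedy_search(draft, debug, improve, score, budget: int = 10):
    """Toy greedy tree search over candidate solutions.

    draft() proposes a fresh solution; debug(s) repairs a failing one;
    improve(s) mutates a working one. score(s) returns a float, or None
    if the solution crashed / produced an invalid submission. This mirrors
    the Draft/Debug/Improve operators in spirit only.
    """
    best, best_score = None, float("-inf")
    node = draft()
    for _ in range(budget):
        s = score(node)
        if s is None:            # invalid run: try to repair it
            node = debug(node)
            continue
        if s > best_score:       # keep the best-performing branch
            best, best_score = node, s
        node = improve(best)     # branch off the current best
    return best, best_score
```

For example, with a numeric "solution" scored by distance to a hidden optimum at 3, the loop climbs to the optimum within the budget and returns it along with its score.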
Key mini-concepts explained in-action:
- Hook: Grading fairly across many subjects. Normalized Score (what/how/why):
- What: A shared scale for comparing different raw metrics.
- How: Map worst-seen to 0; human SOTA to 1; use a curve that values tightening tiny gaps near the optimum.
- Why: Lets us average scores and tell real improvements from noise. Anchor: Curving grades so math and art can be part of one GPA.
- Hook: Do your homework first, then turn it in correctly. Valid Submission Rate (what/how/why):
- What: How often an agent submits a valid, scorable file.
- How: Count valid seeds / total seeds per task, then average.
- Why: Great ideas don't matter if you never submit correctly. Anchor: A+ work that's never turned in still gets a zero.
- Hook: Chess ratings compare players who faced different opponents. Elo via Bradley-Terry (what/how/why):
- What: A skill rating based on head-to-head outcomes.
- How: Treat each pairwise task comparison like a game; fit a model to infer skill; convert to the Elo scale.
- Why: Order-invariant and robust across mixed matchups. Anchor: A league table that fairly ranks teams even if schedules differ.
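The VSR and Elo mini-concepts can be sketched together. The Bradley-Terry fit below uses a standard minorization-maximization update and the conventional 400/ln(10) Elo scaling; both are textbook choices assumed for illustration, not taken from the paper. Draws (both agents tying or both failing) would be entered as half a win for each side.

```python
import math

def valid_submission_rate(valid_flags: list[bool]) -> float:
    """VSR: fraction of seeds that produced a valid, scorable submission."""
    return sum(valid_flags) / len(valid_flags)

def bradley_terry_elo(wins, players, iters=200, base=1000.0):
    """Fit Bradley-Terry strengths from a wins[i][j] count matrix and map
    them onto an Elo-like scale. wins[i][j] is how often player i beat
    player j across task comparisons; enter a draw as 0.5 for each side.
    """
    n = len(players)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom > 0 else p[i])
        s = sum(new)
        p = [x * n / s for x in new]       # normalize to keep the scale fixed
    scale = 400.0 / math.log(10.0)         # conventional Elo scaling
    return {players[i]: base + scale * math.log(p[i]) for i in range(n)}
```

An agent that wins 3 of 4 comparisons against another ends up about 191 Elo points ahead, matching the Bradley-Terry odds ratio of 3:1.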
Resource choices: Each run has 24 hours and one H-200 GPU to keep comparisons fair and budgets sane. Agents may use cached pretrained models but not the latest foundation checkpoints. The same constraints apply to all, highlighting scaffold and reasoning strengths rather than cloud budget.
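The strict submission format from the Submission and Scoring step (for SVAMP: a header of exactly "Answer" and a row count matching the test set) can be checked with a sketch like this; a real evaluate.py would also validate value types, which is omitted here.

```python
import csv
import io

def validate_submission(csv_text: str, expected_header: str, expected_rows: int) -> list[str]:
    """Return a list of format problems; an empty list means the file is valid.

    Mirrors the checks described for SVAMP: the header must match exactly and
    the number of data rows must equal the test-set size. Illustrative only.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["empty file"]
    problems = []
    if rows[0] != [expected_header]:
        problems.append(f"header must be exactly {expected_header!r}")
    if len(rows) - 1 != expected_rows:
        problems.append(f"expected {expected_rows} data rows, got {len(rows) - 1}")
    return problems
```

Running such a check inside the agent loop before final submission is one cheap way to raise Valid Submission Rate.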
04 Experiments & Results
The Test: AIRS-Bench evaluates whether agents can complete full research cycles and produce competitive scores on 20 modern tasks, not just pass a quiz. It measures:
- Valid Submission Rate (VSR): Can you reliably submit a correct-format, scorable result?
- Average Normalized Score: After fair rescaling, how close are you to human SOTA across tasks?
- Elo Rating: In head-to-head comparisons, who wins more often?
The Competition: 14 agent configurations (LLM + scaffold) were tested across two harness styles:
- Scaffolds: One-Shot (single attempt), Greedy (tree search via AIRA-dojo), ReAct (sequential via MLGym).
- Models spanned reasoning and code-centric LLMs (e.g., gpt-oss-20b/120b, o3-mini, GPT-4o, CWM, Devstral).
- All agents used the same time/compute limits and had access to the same HF cache list.
Scoreboard with context:
- Valid Submission Rate (overall average): 59.3%. • This is like a class where only about 6 out of 10 attempts hand in a correctly formatted assignment; just submitting valid work is non-trivial. • The best agents (Greedy with larger reasoning models) achieved VSRs around 94-97% on average, showing the reliability boost from tree search.
- Average Normalized Score (overall average): 24.1%. • Think of this as getting about a C when compared to SOTA's "A+"; clear headroom remains. • Greedy scaffolds dominated, with the top Greedy agent reaching ~0.52 normalized score, much higher than One-Shot configs.
- Elo Ratings: • Human SOTA scores were included as another "player," and there remained a sizable gap between SOTA and the best agent. • Greedy scaffolds typically outranked ReAct and One-Shot, reflecting the power of branch-and-improve search.
Surprising and notable findings:
- Tree-search helps broadly: Greedy AIRA-dojo agents often outperformed their One-Shot counterparts by a large margin, for both open and closed models.
- Reasoning models shine: Larger or more reasoning-tuned LLMs tended to fare better, especially under Greedy scaffolds.
- Agent "personalities": Some models (e.g., o3-mini) submitted more often but with lower selectivity, leading to many valid-but-weaker runs; others (e.g., CWM) submitted less often but more confidently.
- Above-SOTA wins exist but are rare (~1.55% of agent-task seeds): • TextualClassificationSickAccuracy: the agent (Greedy gpt-oss-120b) built a stacked ensemble (RoBERTa-large + DeBERTa-v3-large with a logistic regression meta-learner), beating a vanilla RoBERTa SOTA (~93.1% vs 90.5%). • TextualSimilaritySickSpearmanCorrelation: the agent averaged finetuned RoBERTas with Sentence-BERT cosine similarities via CV weighting, surpassing a strong RoBERTa-large + CoSENT loss baseline (~0.89 vs 0.85). • Winogrande Coreference: simple DeBERTa-v3-large fine-tuning beat a T5-3B style SOTA (~0.88 vs 0.85). • Rideshare Time-Series MAE: a bidirectional GRU trained by the agent beat a transformer-based time-series foundation model not finetuned on this dataset (~1.153 vs 1.185 MAE; lower is better).
What this means: Even when agents win, they still don't touch the theoretical ceiling (like perfect accuracy or zero error). The normalization's march-of-nines curve shows how hard, but meaningful, it is to close the last gaps. The broad takeaway is that AIRS-Bench is far from saturated: smarter scaffolds, better iteration strategies, and more robust coding/execution could yield significant gains.
Category performance patterns (qualitative):
- On easier tasks, scores vary widely, indicating that scaffolds and model choices matter a lot.
- On the hardest tasks, all agents cluster at low normalized scores, showing consistent difficulty and room to innovate.
Bottom line: Greedy tree-search with strong reasoning models rises to the top; valid submission reliability is itself a major hurdle; and rare but real above-SOTA moments prove that autonomous research agents can discover competitive, sometimes novel, solutions.
05 Discussion & Limitations
Limitations:
- Compute heavy: Each run gets 24 hours on an H-200 GPU, and there are many seeds and tasks; this limits rapid iteration for smaller labs.
- Only 20 tasks: Carefully curated and diverse, but still a subset; some domains (e.g., vision-heavy or multimodal lab tasks) are not included yet.
- Data contamination risk: LLM pretraining on internet data means a nonzero chance of prior exposure; task construction tries to reduce this, but it can't be eliminated entirely.
- Fragile agent pipelines: Long reasoning traces can cause context overflows, file formatting mistakes, or lost intermediate results, which hurt both VSR and peak performance.
- Cached models are older: The HF cache avoids rate limits but misses the latest checkpoints, potentially capping achievable scores.
- Human bottlenecks: Task sourcing/review/verification involve people; scaling to hundreds of tasks will need more automation.
Required Resources:
- Time and GPU: 24h per run per task on one H-200 GPU (or equivalent), multiplied by seeds; plan budgets accordingly.
- Harness + infra: AIRA-dojo or MLGym setup, dataset preparation scripts, caching infrastructure, and robust logging/monitoring.
- Model access: Either API-based or self-hosted LLMs with sufficient context and reasoning ability.
When NOT to Use:
- Real-time or low-compute scenarios where 24h runs are impractical.
- Domains that require proprietary, sensitive, or multimodal lab data not reflected in the current task set.
- Quick leaderboard toggles: If you just want a light sanity check, the full AIRS-Bench protocol may be overkill.
Open Questions:
- Better scaffolds: Which search policies (e.g., smarter tree policies, hybrid evolutionary methods, learned controllers) most improve reliability and scores per GPU hour?
- Tool use: How should agents decide between ensembling, cross-validation, or foundation-model adapters under strict budgets?
- Robustness: How to reduce formatting/debugging failures automatically (auto-repair tools, typed pipelines, or sandboxed I/O schemas)?
- Contamination guards: Can we detect/prevent pretraining leakage more reliably (e.g., retro-holdouts, provenance tracking)?
- Scaling tasks: What semi-automated pipelines and validations can expand to 100+ tasks without losing quality?
Overall assessment: AIRS-Bench credibly stresses agents on the full scientific loop. The main wins come from scaffold design and reliable execution, not just bigger models. The benchmark is far from solved, offering clear headroom for the community to innovate.
06 Conclusion & Future Work
Three-sentence summary: AIRS-Bench is an open, standardized benchmark that makes AI agents do end-to-end machine learning research (ideate, code, run, and refine) on 20 challenging, real tasks. It scores agents fairly across tasks using valid submission rate, a cross-task normalized score with a march-of-nines transform, and Elo ratings, all inside controlled harnesses. Results show big progress but also big gaps: agents beat SOTA on 4 tasks yet remain far from human SOTA and theoretical ceilings overall.
Main Achievement: Turning "AI agents can do research" from a slogan into a measurable, reproducible claim by forcing agents to generate and execute their own code on modern, unsaturated tasks with a rigorous, harness-agnostic evaluation protocol.
Future Directions: Expand task coverage and modalities; develop smarter, compute-efficient scaffolds; improve auto-repair and formatting reliability; explore stronger contamination defenses; and automate task onboarding to scale the suite. Combining richer tool use (ensembles, CV, adapters) with learned search policies could significantly boost both VSR and normalized scores.
Why Remember This: AIRS-Bench sets a high bar for what it means to be a true AI research agent: not just talking about science, but doing it end to end, under fair rules, and with scores that mean something across very different problems. It offers the community a shared proving ground that encourages real progress and honest comparisons, accelerating the path toward reliable, autonomous scientific discovery.
Practical Applications
- Benchmark your in-house research agent across diverse tasks to identify weak links (e.g., formatting reliability vs modeling).
- Choose the right scaffold by A/B testing One-Shot, ReAct, and Greedy under the same constraints.
- Use normalized scores to prioritize engineering time on tasks where near-SOTA gaps are most meaningful.
- Adopt the task configuration standard to create new, fair, harness-agnostic research tasks in your org.
- Integrate auto-repair steps to boost Valid Submission Rate (e.g., schema checks, format validators).
- Prototype ensemble and cross-validation operators in Greedy search to consistently raise accuracy.
- Track progress over time with Elo to see if new agents truly outperform previous generations.
- Design ablations (time limits, cache contents, operator prompts) to find the best compute-performance tradeoffs.
- Onboard new domains (e.g., finance or bio) by wrapping tasks with project_description.md and evaluate.py.
- Use the HF cache pattern to stabilize experiments and reduce dependency on external rate limits.