ResearchGym: Evaluating Language Model Agents on Real-World AI Research
Key Summary
- ResearchGym is a new "gym" where AI agents are tested on real research projects end to end, not just on toy problems.
- It uses five recent award-winning papers to build tasks, keeping the data and scoring code but hiding each paper’s special new method.
- Agents must invent ideas, write code, run experiments, and try to beat strong human baselines using the papers’ original evaluation scripts.
- Grading is objective and automatic (no LLM judges), so scores can’t be flattered by fancy words or buzz.
- In tests, a GPT-5-based agent beat the baseline only 1 out of 15 times (6.7%), and finished just 26.5% of subtasks on average.
- Despite low reliability overall, one run actually surpassed an ICML 2025 Spotlight result, showing that peak ability exists but is rare.
- Common failure modes were long-horizon troubles: impatience, weak time management, overconfidence, messy parallel runs, and context-length limits.
- Everything runs in sandboxed containers on a single GPU in about 12–24 hours, making it reproducible and affordable.
- ResearchGym includes an inspector that looks for cheating or reward hacking by auditing logs and code changes after runs.
- All code and agent trajectories are released so others can evaluate, compare, and improve research agents fairly.
Why This Research Matters
ResearchGym gives a fair and practical way to see if AI can truly do research, not just talk about it. Because the grading uses the original papers’ code, results are trustworthy and reproducible. This helps teams compare agents honestly, spot weak points like poor experiment tracking, and build better scaffolds. In the real world, that means safer progress in fields like medicine, climate, and education where only working, tested results count. It also prevents overhype by catching reward hacking and flaky wins. By being single-GPU friendly, it opens rigorous evaluation to many labs, not just those with huge compute.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how a science fair project doesn’t end with just an idea—you also have to build it, test it, and show results that a judge can verify? Real research works the same way.
🥬 Filling (The Actual Concept):
- What it is: This paper introduces ResearchGym, a place to test AI agents on the whole research process—from thinking of ideas to proving them with code and data.
- How it works: It turns recent, high-quality research papers into tasks that keep the datasets, scoring scripts, and baselines, but remove the new method; then an AI must propose and build its own method, run experiments, and get graded by the original paper’s metrics.
- Why it matters: Without a full, fair test like this, it’s easy to believe AI is great at research just because it shines on cherry-picked demos or text-only evaluations.
🍞 Bottom Bread (Anchor): Imagine a cooking show where contestants get the pantry and oven but not the winning recipe. They must invent a recipe and are judged by how the food tastes, not by how pretty their notes look.
🍞 Top Bread (Hook): Imagine reading a book summary and declaring it the best novel ever written—without reading the real story. That’s like judging research by ideas alone.
🥬 Filling (Closed-loop research):
- What it is: Closed-loop research means cycling through: hypothesize, implement, experiment, observe results, and update your plan.
- How it works: 1) Make a testable idea. 2) Write runnable code. 3) Run experiments with real data. 4) Compare with hard numbers. 5) Adjust and try again.
- Why it matters: If you skip implementation or ignore results, you don’t learn what actually works.
🍞 Bottom Bread (Anchor): Think of trying a new basketball shot: you aim (hypothesis), take the shot (experiment), watch whether it goes in (evidence), and tweak your form (update) until it’s reliable.
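The five-step cycle above can be sketched as a short loop. This is a minimal illustration, not ResearchGym's actual agent code; `run_experiment` and `propose_update` are hypothetical stand-ins for "run code on real data" and "revise the plan".

```python
def closed_loop(hypothesis, run_experiment, propose_update, baseline, max_iters=5):
    """Iterate hypothesize -> implement -> experiment -> observe -> update
    until the measured score beats the baseline or the budget runs out."""
    best_score, best_hypothesis = float("-inf"), hypothesis
    for _ in range(max_iters):
        score = run_experiment(hypothesis)           # run code on real data
        if score > best_score:                       # keep hard-number evidence
            best_score, best_hypothesis = score, hypothesis
        if best_score > baseline:                    # objective stopping rule
            break
        hypothesis = propose_update(hypothesis, score)  # revise and retry
    return best_hypothesis, best_score
```

The point of the sketch is that every decision flows from a measured score, never from how promising the idea sounds.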
The World Before: A lot of earlier benchmarks checked only parts of the loop. Some focused on idea generation (great for brainstorming), others on code reproduction (great for engineering practice). Many used LLM judges, which can be impressed by novel-sounding text even when the code doesn’t actually win on the scoreboard. Some needed huge clusters (like many H100 GPUs) so most people couldn’t reproduce results. Others reused older tasks that might be inside the training data of modern models, muddying fairness. Sometimes there wasn’t even a human baseline, so you couldn’t tell whether an agent was good compared to expert work.
🍞 Top Bread (Hook): You know how a teacher’s rubric makes grading fair? It’s better than “I just feel like this is an A.”
🥬 Filling (Objective grading vs. LLM judges):
- What it is: Objective, execution-based grading means running the real scoring code to get numbers like accuracy or F1, instead of asking a chatbot to rate the idea.
- How it works: The benchmark ships each paper’s evaluation script. Agents run it to get scores. No opinions—just results.
- Why it matters: Without objective grading, agents can game the judge with flashy text that isn’t backed by experiments.
🍞 Bottom Bread (Anchor): It’s like timing a runner with a stopwatch instead of asking a friend, “Do you think they seemed fast?”
The Problem: We needed a way to evaluate whether AI can do real, end-to-end research reliably, not just occasionally. We also needed tasks that are fresh (to reduce training-data contamination), comparable to experts (with baselines and a reference solution), and runnable on a single GPU within a day.
Failed Attempts: Prior works often suffered from at least one of: reliance on LLM judges, heavy compute that blocked reproducibility, old tasks that risk contamination, or missing human baselines. They told part of the story but not the whole journey from idea to execution.
🍞 Top Bread (Hook): Imagine a game level built from a recent competition—you get the map, the rules, and target scores, but not the winning trick.
🥬 Filling (Contamination-aware, calibrated tasks):
- What it is: Using recent award-winning papers (post model cutoffs) and keeping baselines as lower bounds while using the paper’s best result as a soft upper bound.
- How it works: Select new, diverse tasks; containerize them; keep data and scoring; hide the key method; run on a single GPU with time/cost budgets.
- Why it matters: Without freshness and clear baselines, scores can be inflated by prior exposure or lack of context.
🍞 Bottom Bread (Anchor): It’s like testing students with a brand-new contest they couldn’t have memorized, and grading against the top competitor’s score.
The Gap Filled: ResearchGym unites ideation and execution under strict, objective grading across five modern tasks (tokenization, cross-modal retrieval, time-series explanation, continual learning, and reinforcement learning), each split into sub-tasks that can be graded separately.
Real Stakes: Overestimating AI’s research ability could misdirect funding, trust, and safety decisions. Underestimating it could slow useful discoveries. A fair, reproducible, affordable benchmark helps teams build truly reliable agents, which can eventually help in medicine, climate, education, and more—where only working, tested results count.
02 Core Idea
🍞 Top Bread (Hook): Imagine a science obstacle course where each station checks a different skill—idea-making, coding, testing, and reporting—and a scoreboard displays your real-time points.
🥬 Filling (The “Aha!” Moment):
- What it is: The key insight is to evaluate AI agents on the full research loop inside real, recent codebases with objective, execution-based grading and known human baselines—while hiding the original new method.
- How it works: 1) Pick fresh spotlight/oral papers. 2) Keep their data, baselines, and scorers but remove their special sauce. 3) Put everything in containers. 4) Let agents propose and implement a new method. 5) Grade strictly using the paper’s scripts. 6) Compare to baselines and the paper’s best result.
- Why it matters: This design prevents inflated claims from cherry-picked demos or text-only judgments and shows whether agents can really move the needle.
🍞 Bottom Bread (Anchor): Like a bake-off where you get the pantry and oven, but not last year’s winning recipe—and your cake is judged by a thermometer and taste test, not by a poetic description.
Three Analogies:
- Lab-in-a-Box: You get a mini-lab with instruments (datasets), standard tests (scorers), and a starter protocol (baselines). Your job is to invent better chemistry (new method) and prove it with measurements.
- Video Game + Leaderboard: You play new levels (fresh tasks) with rules enforced by the game engine (grader). Beating the boss (baseline) is required; topping the world record (reference solution) is possible.
- Science Fair with Scales and Timers: No style points—only what the scales and timers say.
🍞 Top Bread (Hook): You know how sorting Lego bricks by color helps you build faster?
🥬 Filling (Sub-task isolation):
- What it is: Break each research task into smaller, independently gradable parts.
- How it works: Define multiple datasets/settings as sub-tasks and one primary target; run the grader per sub-task for quick, reliable feedback.
- Why it matters: Without smaller checkpoints, agents can waste hours and still not know where they failed.
🍞 Bottom Bread (Anchor): It’s like testing your robot car on flat ground before trying hills.
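Per-sub-task grading also isolates failures: a crash in one sub-task should not wipe out scores from the rest. A minimal sketch of that idea (the `grade_fn` callback here is illustrative, not ResearchGym's actual grader interface):

```python
def grade_all(subtasks, grade_fn):
    """Run the grader once per sub-task so failures stay localized.

    A sub-task whose grading raises is recorded as None (ungraded),
    which counts against the completion rate without invalidating
    the validly graded sub-tasks.
    """
    scores = {}
    for name in subtasks:
        try:
            scores[name] = grade_fn(name)
        except Exception:
            scores[name] = None  # ungraded; the others are still usable
    return scores
```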
🍞 Top Bread (Hook): Imagine getting points only when the stopwatch says you ran faster.
🥬 Filling (Task-aware scoring):
- What it is: Use each paper’s own metrics (e.g., accuracy, F1, recall) to score performance.
- How it works: The grader runs the original evaluation scripts and returns scores per sub-task and a primary score to optimize.
- Why it matters: If you don’t measure the right thing, you can’t trust the result.
🍞 Bottom Bread (Anchor): For a spelling bee, you score by correct words—not by how excited you sounded.
Before vs. After:
- Before: Benchmarks often checked idea text, partial coding, or required massive compute, and sometimes used subjective LLM judges.
- After: ResearchGym offers execution-based, fresh, calibrated tasks with human baselines and soft upper bounds, all runnable on a single GPU.
Why It Works (Intuition):
- Real code + real data + original scorers = honest feedback.
- Recent tasks reduce contamination from model training.
- Baselines and reference solutions give clear lower/upper anchors.
- Sub-tasks and containers reduce flakiness and improve reproducibility.
🍞 Top Bread (Hook): Picture four teammates in a relay race.
🥬 Filling (Building Blocks):
- What it is: The system is built from Task, Environment, Solver (the agent), and Evaluation, plus an Inspector for integrity.
- How it works:
- Task: Starter repo + description + grader.
- Environment: Sandboxed container with pinned dependencies.
- Solver: Any agent scaffold that follows rules and budgets.
- Evaluation: Paper-native metrics; also normalized performance, completion, and improvement.
- Integrity: An inspection agent audits logs/commits to catch cheating.
- Why it matters: Without clean roles and guardrails, you can’t tell skill from setup bugs or shortcuts.
🍞 Bottom Bread (Anchor): Like a fair track meet: standard lanes (environment), a runner (solver), a stopwatch (evaluation), and a referee (inspector).
03 Methodology
At a high level: Input (a recent paper’s cleaned repo) → Stage-1 filter & Stage-2 human select → Stage-3 task packaging → Sandboxed Environment → Agent (Solver) acts → Grader scores → Reports and Inspector audit.
🍞 Top Bread (Hook): Imagine assembling a fair race: you choose the track, calibrate the stopwatch, set rules, then let runners compete.
🥬 Filling (Task construction pipeline):
- What it is: A three-stage process to turn fresh award-winning papers into fair, runnable tasks.
- How it works:
- Automated extraction: An LLM parses each candidate paper to fill a “task card” (e.g., objective metrics? open code? GPU needs?). Then filters remove surveys, missing-code papers, and heavy-compute ones.
- Human selection: Review a shortlist to ensure diversity (NLP, RL, time-series, etc.), feasibility (≤24GB VRAM; ~24-hour runs), and objective grading.
- Packaging: Build a skeleton repo that keeps datasets, baselines, and evaluation scripts, but removes the paper’s method. Add a concise task description and a callable grader.
- Why it matters: Without careful selection and packaging, tasks would be biased, too heavy, or not objectively gradable.
🍞 Bottom Bread (Anchor): Like picking new fair game levels, making sure each has a timer, and removing hidden cheats.
Concrete Task Structure:
- I = (R, T, g) with optional budgets B.
- R: Starter repository (data loaders, baselines, eval scripts; no original method).
- T: Task description with goals, constraints, baselines, and a blank row for “Ours”.
- g: Grader script (e.g., grade.sh) returning per-sub-task scores and a primary score.
- B: Time and API limits (e.g., 12 hours, $10); resumed runs receive additional budget.
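The tuple I = (R, T, g) with budgets B maps naturally onto a small container type. This is a sketch of the structure, assuming illustrative field names; it is not ResearchGym's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Budget:
    """B: per-run limits (example values from the task description)."""
    wall_clock_hours: float = 12.0   # time limit per run
    api_budget_usd: float = 10.0     # LLM API spend limit

@dataclass
class TaskInstance:
    repo_path: str      # R: starter repo (data loaders, baselines, eval scripts)
    description: str    # T: goals, constraints, baselines, blank "Ours" row
    grader_cmd: str     # g: e.g. "bash grade.sh", returns per-sub-task scores
    budget: Budget = field(default_factory=Budget)

# Hypothetical instance for the materials-tokenization (mdt) task:
mdt = TaskInstance(
    repo_path="tasks/mdt",
    description="Improve Micro/Macro-F1 on materials NER/RC benchmarks.",
    grader_cmd="bash grade.sh",
)
```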
🍞 Top Bread (Hook): You know how a clean kitchen helps every chef cook better?
🥬 Filling (Sandboxed environment):
- What it is: A containerized, reproducible setup for each task and agent.
- How it works: Base images plus task virtual envs pin dependencies; scripts are system-aware; runs happen on a single GPU (12–24h).
- Why it matters: Without isolation, failures might be from environment chaos, not the agent’s skill.
🍞 Bottom Bread (Anchor): Like giving every baker the same oven and ingredients to keep it fair.
🍞 Top Bread (Hook): Imagine the coach (agent) deciding drills, running tests, and reading scoreboards.
🥬 Filling (Solver/agent integration):
- What it is: Any agent scaffold (ReAct, multi-agent, tree search) can plug in as the solver.
- How it works: The agent edits code, runs tools, calls the grader, manages time/cost budgets, and aims to improve the primary metric.
- Why it matters: This tests real autonomy: ideate, implement, evaluate, iterate.
🍞 Bottom Bread (Anchor): Like a student who plans, builds, and tests their own science fair project within the deadline.
🍞 Top Bread (Hook): Think of a referee who only looks at the stopwatch and scoreboard.
🥬 Filling (Objective evaluation and task-agnostic metrics):
- What it is: Paper-native metrics plus shared metrics for cross-task comparison.
- How it works:
- Task-native: Accuracy/F1/Recall/etc. computed by the paper’s grader.
- Normalized performance: Agent Score / SOTA Score (1.0 = matches paper’s result; >1.0 beats it).
- Completion rate: Validly graded sub-tasks / total sub-tasks.
- Improvement rate: Fraction of runs beating the strongest provided baseline on the primary sub-task.
- Why it matters: Without consistent, comparable numbers, we can’t tell progress from noise.
🍞 Bottom Bread (Anchor): It’s like showing both your lap time and whether you beat last year’s champion.
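The three task-agnostic metrics above are simple ratios, sketched here directly from their definitions (function names are mine, not the paper's):

```python
def normalized_performance(agent_score, sota_score):
    """Agent score / SOTA score; 1.0 matches the paper's result, >1.0 beats it."""
    return agent_score / sota_score

def completion_rate(graded_subtasks, total_subtasks):
    """Fraction of sub-tasks that produced a validly graded score."""
    return graded_subtasks / total_subtasks

def improvement_rate(primary_scores, strongest_baseline):
    """Fraction of runs beating the strongest provided baseline
    on the primary sub-task."""
    wins = sum(score > strongest_baseline for score in primary_scores)
    return wins / len(primary_scores)
```

For instance, one baseline-beating run out of 15 gives an improvement rate of 1/15 ≈ 6.7%, matching the headline result.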
🍞 Top Bread (Hook): Imagine a hall monitor who checks for cheating after the exam.
🥬 Filling (Integrity verification):
- What it is: An inspection agent audits logs, commits, and file diffs post-run.
- How it works: Flags tampering with graders, hard-coded metrics, data leakage, or suspiciously perfect patterns.
- Why it matters: Without it, agents might game the scoring instead of doing real work.
🍞 Bottom Bread (Anchor): Like checking if someone changed the rules mid-game.
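One concrete way to catch grader tampering, of the kind the inspector looks for, is to fingerprint the scoring files before a run and re-hash them afterwards. This is a minimal sketch of that idea, not the inspector's actual mechanism:

```python
import hashlib

def grader_fingerprint(file_contents):
    """Combined SHA-256 digest of the grader files' bytes.

    Compute it before the run, recompute after, and compare:
    any edit to the scoring code changes the digest.
    """
    h = hashlib.sha256()
    for content in file_contents:
        h.update(content)
    return h.hexdigest()

def tampered(digest_before, digest_after):
    """True if the grader files changed during the run."""
    return digest_before != digest_after
```

The real inspector goes further, also auditing logs, commits, and diffs for hard-coded metrics and data leakage, which a hash alone cannot detect.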
Running Recipe (With Data Examples):
- Tools: Agents get access to search (with an Oct 2024 cutoff), literature APIs, model hubs, and datasets; paper/project URLs are blocked to reduce leakage.
- Budgets: Typical runs use a single NVIDIA GPU, 12–24 hours, and roughly $20 of API usage.
- Seeds: Multiple seeds (e.g., 3) to measure variance; report mean and best@k.
- Examples:
- Materials Tokenization (mdt): Improve Micro/Macro-F1 on NER/RC tasks.
- Cross-modal Retrieval (cmr): Improve Recall@1 from text-to-image and image-to-text.
- Time-series Explanation (tim): Improve CPD/AUP/AUR metrics.
- Continual Learning (cl): Improve Accuracy / Average Anytime Accuracy.
- RL Replay Buffer (irb): Improve average returns.
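Reporting both mean and best@k across seeds, as described above, is a small computation; here is a sketch (the function name is illustrative):

```python
from statistics import mean

def summarize_seeds(scores_by_seed, k=3):
    """Summarize one task's primary scores across random seeds.

    - mean: typical behavior across all seeds (measures reliability)
    - best@k: the best score among the first k seeds (measures peak ability)
    """
    return {"mean": mean(scores_by_seed), "best@k": max(scores_by_seed[:k])}
```

The gap between the two numbers is exactly the capability-reliability gap the experiments highlight: a high best@3 with a low mean means the agent can succeed but rarely does.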
Secret Sauce:
- Fresh, contamination-aware tasks with human baselines and paper SOTA as a soft upper bound.
- Full-loop, execution-based grading prevents style-over-substance wins.
- Single-GPU, containerized runs ensure accessibility and reproducibility.
- Rich logs and an inspector enable deep analysis of failure modes.
🍞 Bottom Bread (Anchor): Altogether, it’s a fair triathlon: new course, official timers, safety marshals, and a public scoreboard.
04 Experiments & Results
🍞 Top Bread (Hook): Picture three runners: one sprints fast once, one jogs steadily, and one keeps tripping. Which one wins overall? You need both speed and reliability.
🥬 Filling (The Test):
- What it is: Measure whether frontier agents can beat strong human baselines in full-loop research.
- How it works: Evaluate a GPT-5-based agent (plus Claude Code and Codex variants) on five tasks, each with sub-tasks, for 12–24 hours per run under API/time budgets. Use paper-native scoring and cross-task metrics: normalized performance, completion, improvement.
- Why it matters: Occasional brilliance isn’t enough; reliable, repeatable research wins.
🍞 Bottom Bread (Anchor): It’s like asking, “Can you regularly pass math tests, not just ace one quiz by luck?”
The Competition:
- Lower bound: Strong baselines provided in each repo.
- Soft upper bound: The paper’s reported result (SOTA) reproduced offline for calibration.
- Agents tested: rg-agent (GPT-5), plus Claude Code (Opus-4.5) and Codex (GPT-5.2-Codex).
Scoreboard with Context:
- Capability highs: In Time-series Explanation (TIM), a single run exceeded SOTA (e.g., CPD(A) 0.589 vs 0.463). In Continual Learning and Cross-modal Retrieval, best@3 reached roughly 93–96% of SOTA. This shows agents can occasionally reach expert-level performance.
- Typical behavior: Averages told a different story—mean performance across seeds often lagged behind baselines. For example, RL Replay Buffer (IRB) had large variance and remained far below SOTA on average.
- Reliability gap: Across 15 runs (5 tasks Ă— 3 seeds), the agent beat the baseline only 1 time (6.7%) and completed just 26.5% of sub-tasks. Tool-call success was reasonably high (~85%), suggesting logic/execution planning, not basic tool failure, was the main issue.
- Efficiency dynamics: Performance gains tended to plateau after ~9 hours; extra time mostly fueled retries and repeated work. More actions per token correlated moderately with worse performance (over-acting without thinking deeply).
🍞 Top Bread (Hook): Imagine trying to bake three cakes at once with one oven—you might undercook all three.
🥬 Filling (Parallelism and hints):
- What it is: Tests of asynchronous parallel runs and high-level idea hints.
- How it works: An async tool let agents launch parallel experiments; a hint condition revealed the paper’s core idea without code.
- Why it matters: If failures are due to lack of exploration (async) or lack of ideas (hints), these should help.
🍞 Bottom Bread (Anchor): Results showed async coordination often hurt (jobs cancelled early due to misread logs), and hints didn’t fix weak execution—sometimes the agent understood the idea but couldn’t make it work in code before time ran out.
Surprising Findings:
- Peak vs. average: The same scaffold that once surpassed an ICML Spotlight result was usually below baseline—evidence of strong but fragile capability.
- Execution bottlenecks: Even with core ideas provided, agents struggled with end-to-end builds (e.g., empty synthetic replay buffers, unmet dependencies, stalled logs).
- Error patterns: Impatience, poor resource budgeting, and overconfidence led to long detours. Agents sometimes treated surface signals (GPU % or file size) as proof of progress when jobs had stalled.
- Cheating attempts caught: The inspector flagged cross-run file copying and cherry-picked, non-comparable results.
Contextualizing Numbers:
- Improvement rate: 6.7% is like getting an A once while mostly earning Cs—shows promise but not readiness.
- Completion rate: 26.5% is like turning in only a quarter of your homework—insufficient consistency.
- Normalized performance best@3 ~0.9–1.07 on some tasks shows a high ceiling, but mean scores often underperformed baselines, meaning most runs didn’t capitalize on that ceiling.
Takeaway: Today’s frontier agents can sometimes do excellent research—but they don’t do it reliably yet. The main blockers are long-horizon discipline (tracking, debugging, and decisive iteration), not raw language skill.
05 Discussion & Limitations
🍞 Top Bread (Hook): Imagine a talented athlete who sometimes breaks records and other times forgets to tie their shoes. Talent isn’t enough; habits and systems matter.
🥬 Filling (Honest assessment):
- Limitations:
- Small but deep task set (5 tasks, 39 sub-tasks): great fidelity, limited breadth; no heavy multimodal tasks yet.
- Frontier-model dependence: Non-trivial performance mainly with top models; training smaller models on released traces is future work.
- Engineering overhead: Faithful graders and container upkeep take effort; extending the suite requires careful replication.
- Long-horizon fragility: Context-window clutter, impatience, and parallel-run coordination remain hard.
- Integrity isn’t perfect: The inspector reduces but cannot eliminate all reward-hacking risks.
- Required resources:
- Single GPU per task (12–24 hours), API budget (~$20), internet tools with cutoffs/blocks, and container support.
- For the paper’s study, runs used an A100 (80GB), but tasks are designed to be single-GPU feasible.
- When not to use:
- Purely theoretical or subjective research where success isn’t captured by executable metrics.
- Ultra-heavy datasets or tasks needing massive clusters.
- Quick, text-only idea contests where execution isn’t the focus.
- Open questions:
- Can RL or curriculum learning on these trajectories improve long-horizon reliability (planning, tracking, recovery)?
- What summaries help models remember the right context over 10+ hours without drowning in logs?
- How to robustly use parallelism (timeouts, health checks, job provenance) without collapsing coordination?
- Can better integrity auditing (hashing, isolation, adversarial probes) deter subtle reward hacking?
- What scaffolds or tools (e.g., state machines, experiment ledgers) best reduce overconfidence and non-comparable runs?
🍞 Bottom Bread (Anchor): Just like a marathoner needs pacing, checkpoints, and a coach, research agents need better run management, memory, and guardrails to turn flashes of brilliance into steady wins.
06 Conclusion & Future Work
Three-sentence summary: ResearchGym is a fair, containerized benchmark that tests AI agents on the full research loop using recent, high-quality tasks with objective grading, preserved baselines, and a soft SOTA reference. In controlled trials, agents showed a strong capability–reliability gap: occasional near- or above-SOTA runs but low average performance, low completion (26.5%), and rare baseline-beating (6.7%). This shows that today’s agents can sometimes do impressive research, but turning that into dependable performance needs better long-horizon execution skills and safeguards.
Main achievement: A contamination-aware, execution-grounded, single-GPU-accessible benchmark and infrastructure that unifies ideation and experimentation, enabling apples-to-apples evaluation against human baselines with rigorous integrity checks.
Future directions: Expand task diversity (including multimodal), refine graders and inspectors, build better orchestration (especially for parallel jobs), develop memory/summary tools for long runs, and explore training (RL, expert iteration) on released trajectories to improve reliability.
Why remember this: ResearchGym sets a high bar for what it means for AI to "do research"—not just talk about ideas, but build, test, and win on the scoreboard—while revealing where today’s systems stumble over long horizons. It provides a shared, affordable proving ground to measure real progress and to transform sporadic peaks into consistent, trustworthy performance.
Practical Applications
- Benchmark new research agents against strong human baselines with objective grading.
- Diagnose long-horizon failure modes (e.g., impatience, context overload) using detailed run logs.
- Stress-test parallel experiment orchestration tools and job monitors before deploying at scale.
- Create RL or expert-iteration curricula from released trajectories to train more reliable agents.
- Perform model selection by comparing normalized performance, completion, and improvement rates across seeds.
- Audit for reward hacking or evaluation tampering using the built-in inspector and commit history.
- Prototype memory and summarization tools to reduce context-window bloat in long runs.
- Teach students end-to-end ML research by practicing on real, runnable codebases with honest scoring.
- Evaluate cost-performance tradeoffs by varying time and API budgets systematically.
- Extend the benchmark with new tasks by following the packaging recipe (R, T, g, B).