
MARS: Modular Agent with Reflective Search for Automated AI Research

Intermediate
Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam et al. Ā· 2/2/2026
arXiv Ā· PDF

Key Summary

  • MARS is an AI agent that runs AI research like a careful scientist and a thrifty engineer at the same time.
  • It plans experiments with a budget-aware search, so it prefers good ideas that are also fast and cheap to run.
  • It builds code as a neat, multi-file repository instead of one fragile mega-script, making changes safer and easier.
  • It learns lessons by comparing what changed between attempts and which changes actually improved results.
  • Its search uses a customized Monte Carlo Tree Search with an efficiency-guided reward that penalizes slow runs.
  • On the MLE-Bench challenge, MARS reached state-of-the-art among open-source agents and was competitive with top leaderboard systems.
  • Ablation studies show each pillar (budget-aware search, modular code, reflective memory) is necessary for the strong results.
  • MARS had many 'Aha!' moments, with 63% of its useful lessons transferring across different branches of the search tree.
  • It costs more tokens than some baselines (due to its memory), but it wins more medals, which repays the extra spend.
  • MARS keeps to the rules and generates original code rather than copying public notebooks.

Why This Research Matters

Many teams face hard limits on time and compute, so an agent that plans around budgets is more practical than one that just chases tiny accuracy gains. By building modular codebases, MARS makes AI projects easier to maintain, debug, and share—much like clean engineering in the real world. Its reflective memory captures hard-won insights so future runs don’t repeat old mistakes, saving money and time. In education, it models good scientific habits: start simple, change one thing at a time, and write down what truly helped. For companies, this translates into faster iteration cycles, more reliable ML pipelines, and better results under fixed deadlines. Overall, MARS turns trial-and-error into a reusable playbook for smarter, cheaper, and safer AI development.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you’re doing a long science fair project with only one weekend, one laptop, and a limited supply of batteries. You have to pick which tests to run, write down what worked, and not waste time rebuilding the whole volcano every time you want to tweak the baking soda.

🄬 The Concept (The World Before): Before systems like MARS, AI agents could write code and even fix bugs, but they struggled with AI research projects where trials are slow and expensive. Training a model can take hours, and the reason for an improvement is often hidden among many changes. Earlier agents usually wrote one giant script, pressed run, and hoped for the best. That’s like building a whole robot as a single glued block—if the claw fails, you have to rebuild everything.

  • How it worked: Agents read the task, produced a long monolithic script, ran it, and tried again if it failed.
  • Why it mattered: This made experiments fragile, hard to debug, and wasteful with time and money.

šŸž Anchor: Think of a baking recipe written on one long sticky note. If your cookies are too salty, you have to rewrite the whole note to fix just one line. That’s how many older agents handled research code.

šŸž Hook: You know how in team sports, it’s tricky to tell which player’s move actually led to the goal? AI research has the same issue when multiple things change at once.

🄬 The Concept (The Problem—Credit Assignment): Credit assignment is figuring out which change caused a performance jump.

  • What it is: A challenge where agents can’t easily tell which code change caused the improvement.
  • How it works (painfully): You try changes A, B, and C together, see a better score, and can't tell whether A, B, C, or their combination helped.
  • Why it matters: Without solving this, agents can’t learn good rules for the future, so they repeat low-impact ideas or chase noise.

šŸž Anchor: If you switch flour, oven temperature, and chocolate brand at once and your cake gets better, you don’t know which switch to keep next time.

šŸž Hook: Picture shopping with a strict allowance. Even if a premium snack tastes 1% better, you might skip it if it costs 10Ɨ more.

🄬 The Concept (Costs Are Real): In AI research, evaluation (like training) is expensive.

  • What it is: Every experiment consumes time and compute money.
  • How it works: Longer training runs block you from trying other ideas within the deadline.
  • Why it matters: A tiny accuracy bump is not worth quadruple runtime when you only have 24 hours.

šŸž Anchor: Choosing a 1-hour run that scores 92 instead of a 4-hour run that scores 92.2 can unlock three extra tries and a better final strategy.

šŸž Hook: Think of LEGO. Building with pieces lets you swap a wheel without remaking the whole car.

🄬 The Concept (Monolithic vs. Modular Repos): Modular construction means writing code in coordinated parts (data loader, model, trainer, utils) rather than one giant file.

  • What it is: A way to organize research code into smaller, testable files.
  • How it works: Design parts, implement each file, test them, and plug them together with a main script.
  • Why it matters: Faster debugging, safer upgrades, and easier reuse across attempts.

šŸž Anchor: If your model architecture is a separate file, you can replace just that file to try a new backbone without touching data processing.

šŸž Hook: Imagine keeping a lab notebook where you not only write what happened, but also compare experiments side-by-side to see which change mattered.

🄬 The Concept (Reflective Memory): A memory that compares differences between solutions to learn causal lessons.

  • What it is: Comparative Reflective Memory extracts which code or config change likely caused a metric shift.
  • How it works: 1) Read logs and diffs; 2) List changes; 3) Link each change to impact; 4) Save a general rule (a lesson) for future runs.
  • Why it matters: Without this, agents drown in logs and can’t reuse what worked.

šŸž Anchor: ā€œSwitching to stratified split raised validation F1; keep stratified split for imbalanced dataā€ becomes a reusable lesson cited in later ideas.

Putting it together: The world before MARS lacked three things at once—budget-aware planning, modular codebases, and true comparative learning from past attempts. The gap mattered in daily life because limited budgets are normal: startups, students, and busy labs need smart choices, not just maximal compute. MARS fills the gap with a toolkit that plans like a scientist: it favors efficient experiments, builds sturdy modular repos, and learns crisp, portable lessons.

02Core Idea

šŸž Hook: You know how a great coach balances stamina (don’t get tired too fast), playmaking (clear positions), and film study (learn from game tapes)?

🄬 The Concept (Aha! in One Sentence): MARS turns AI research into a budget-smart search over a modular code repository, while learning causal lessons from differences between attempts.

  • How it works:
    1. Plan with a budget-aware search that prefers good-yet-fast ideas.
    2. Build solutions as multiple modules using a Design–Decompose–Implement pipeline.
    3. Compare old vs. new solutions to distill lessons that guide the next moves.
  • Why it matters: It makes progress reliable under time and money limits, avoids fragile one-file scripts, and actually learns what caused wins.

šŸž Anchor: On a Kaggle-style task, MARS starts with a light model, upgrades components piece by piece, and keeps only upgrades that the comparison proves helpful.

šŸž Hook: Imagine choosing routes on a road trip with a fuel budget.

🄬 The Concept (Budget-Aware Planning): Plan experiments while tracking the cost.

  • What it is: Picking which run to try next by considering both expected performance and runtime.
  • How it works: Use a tree search that scores candidates higher if they’re good and fast, lower if they’re only slightly better but very slow.
  • Why it matters: Within a 24-hour cap, this yields more total attempts and a better final solution.

šŸž Anchor: Given two models scoring nearly the same, MARS prefers the one that trains in 45 minutes over the one that takes 4 hours.

šŸž Hook: Like testing several chess moves in your head before you choose one.

🄬 The Concept (Monte Carlo Tree Search, MCTS): A way to explore many choices smartly.

  • What it is: A strategy that simulates and scores branches (draft new, debug, improve) and then follows the most promising.
  • How it works: 1) Select a node using UCT; 2) Expand with a new action; 3) Run and measure; 4) Backpropagate reward to update the tree.
  • Why it matters: Beats greedy trial-and-error, balancing safe bets with creative tries.

šŸž Anchor: MARS uses MCTS to decide: start a fresh lightweight baseline, fix a broken run, or improve a working pipeline.

šŸž Hook: Think of building a robot from replaceable parts.

🄬 The Concept (Modular Construction): Organize the solution as a set of coordinated files.

  • What it is: Separate modules for data, model, training loop, configs, and utilities.
  • How it works: 1) Design modules; 2) Decompose tasks; 3) Implement; 4) Test each file; 5) Connect with a main script.
  • Why it matters: You can upgrade one piece without breaking the rest.

šŸž Anchor: Swap model.py from a small CNN to an efficient transformer while keeping dataset.py and engine.py the same.

šŸž Hook: Picture writing margin notes that say, ā€œThis specific change caused the boost—save it!ā€

🄬 The Concept (Comparative Reflective Memory): Learn causal lessons by comparing versions.

  • What it is: A memory that records which change led to what effect and generalizes it.
  • How it works: 1) Diff current vs. best repo; 2) Map changes to metrics; 3) Distill a lesson; 4) Reuse lessons later with explicit citations.
  • Why it matters: Prevents repeating old mistakes and accelerates to higher-quality strategies.

šŸž Anchor: ā€œSwitch to stratified split for class imbalance (Cite L12)ā€ appears in future plans automatically.

Before vs. After:

  • Before: Agents chased raw accuracy, ignored time, wrote one big script, and summarized logs without isolating causes.
  • After: Agents trade off accuracy and speed, build robust repos, and store high-signal, causal lessons that transfer across branches (63% transfer rate in MARS).

Why it works (intuition, not equations):

  • Efficiency-guided reward filters out slow dead ends early.
  • Modularity narrows the impact of changes so effects are clearer.
  • Comparative lessons compress the messy past into compact, actionable rules.

Building blocks:

  • Budget-Aware MCTS for planning.
  • Design–Decompose–Implement for code.
  • Diff-Based Editing for precise updates.
  • Lesson Learning for causal memory.
  • Curriculum-Based Exploration to grow from simple baselines to advanced ensembles.

03Methodology

High-level recipe: Input (task + data + metric) → Budget-Aware MCTS chooses an action → Design–Decompose–Implement builds/edits modules → Run and score with efficiency-guided reward → Reflective Memory distills lessons → Loop until budget ends → Output the best repository.
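
Before stepping through the stages, here is a compact, hypothetical orchestration loop for this recipe. Every component is a trivial stand-in (the names are illustrative, not from the paper's codebase); it only shows how the pieces hand off to one another.

```python
import math
import random

def select_action():
    # Stand-in for a Budget-Aware MCTS step.
    return random.choice(["draft", "debug", "improve"])

def build_or_edit(action, lessons):
    # Stand-in for Design-Decompose-Implement plus diff-based edits.
    return {"action": action, "lessons_used": len(lessons)}

def run_and_score(repo):
    # Stand-in for training and evaluation: (metric, runtime in hours).
    return random.random(), random.uniform(0.2, 3.0)

def distill_lesson(repo, best):
    # Stand-in for comparative reflective memory.
    return f"lesson from a '{repo['action']}' attempt"

def run_pipeline(budget_hours=24.0, w=0.1):
    best, spent, lessons = None, 0.0, []
    while spent < budget_hours:
        action = select_action()
        repo = build_or_edit(action, lessons)
        metric, runtime = run_and_score(repo)
        spent += runtime
        # Efficiency-guided reward; in MARS it is backpropagated through the tree.
        reward = metric * math.exp(-w * runtime)
        lessons.append(distill_lesson(repo, best))
        if best is None or metric > best[0]:
            best = (metric, repo)
    return best

print(run_pipeline())
```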

Step A: Task Preparation

  • What happens: Parse the instruction to find the metric and direction (maximize accuracy? minimize error?), generate metadata and splits, and run Exploratory Data Analysis (EDA).
  • Why it exists: Ground the problem and prevent data leakage.
  • Example: For a toxic-comment classification task, create an 80/20 stratified split and report class imbalance.
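
For instance, the stratified 80/20 split can be produced with scikit-learn; the library choice here is an assumption, since the paper does not prescribe specific tooling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels standing in for a toxic-comment dataset (10% positive).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y preserves the 90/10 class ratio in both splits, so the
# validation set cannot accidentally miss the rare class.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("train positive rate:", y_train.mean())  # 0.1
print("val positive rate:  ", y_val.mean())    # 0.1
```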

šŸž Hook: When packing a school bag, you first check the schedule so you bring the right books. 🄬 The Concept (Curriculum-Based Exploration): Start simple and steadily add complexity.

  • What it is: A plan to try lightweight baselines first, then stronger models and ensembles.
  • How it works: 1) Propose simple baseline; 2) Learn lessons; 3) Propose upgraded idea; 4) Repeat.
  • Why it matters: Avoids wasting budget on heavy runs before you know the basics.

šŸž Anchor: Begin with logistic regression or a small CNN; later add better backbones or ensembling if lessons say it helps.

Step B: Resource-Aware Planning with MCTS

  • What happens: At each iteration, MARS selects a tree node and applies one of three actions: Draft, Debug, or Improve.
  • Why it exists: To explore widely without blowing the time budget.
  • Example: If best-so-far hasn’t improved after two local tweaks, spawn a fresh draft.

šŸž Hook: Choosing the next move in a game by balancing safe and bold options. 🄬 The Concept (Actions—Draft, Debug, Improve): Three levers to evolve the repo.

  • What it is: Draft creates a new pipeline; Debug fixes runtime errors; Improve tweaks a working solution for better metrics.
  • How it works: MCTS selects which lever to pull based on UCT and past rewards.
  • Why it matters: Keeps progress even when one path stalls or breaks.

šŸž Anchor: If a run crashes due to a data path bug, pick Debug; if a run is stable but plateaued, pick Improve; if nothing’s promising, Draft anew.

Step C: Efficiency-Guided Reward

  • What happens: Score each run by performance normalized over history and modulated by runtime penalty.
  • Why it exists: To prefer fast, good candidates and discourage slow marginal gains.
  • Example: Two models with the same AUC? The 40-minute one scores higher reward than the 3-hour one.

šŸž Hook: Like grading both the quality and the time it took to finish a test. 🄬 The Concept (Efficiency-Guided Reward): A score that blends accuracy and speed.

  • What it is: Reward = normalized metric Ɨ time penalty factor.
  • How it works: Normalize metric relative to explored nodes; apply a small negative weight to longer runtimes.
  • Why it matters: Focuses compute on efficient branches, raising the effective improvement rate.

šŸž Anchor: Budget-aware MCTS achieved a higher effective solution rate (ā‰ˆ19.5% vs. 16.1% for vanilla) because it pruned slow paths.
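
One plausible way to write such a reward, assuming min-max normalization over previously explored nodes and an exponential runtime penalty with weight w; the paper's exact functional form may differ.

```python
import math

def efficiency_reward(metric, history, runtime_hours, w=0.1):
    """Blend normalized performance with a runtime penalty (illustrative form).

    metric: this run's validation score (higher is better).
    history: metrics of previously explored nodes, used for normalization.
    w: latency penalty weight; both the weight and the exponential
       shape are assumptions, not the paper's published formula.
    """
    lo, hi = min(history), max(history)
    normalized = (metric - lo) / (hi - lo) if hi > lo else 0.5
    return normalized * math.exp(-w * runtime_hours)  # slower runs earn less

history = [0.88, 0.90, 0.92]
# Same AUC, very different runtimes: the 40-minute run out-scores the 3-hour run.
print(efficiency_reward(0.92, history, runtime_hours=0.67))  # ~0.94
print(efficiency_reward(0.92, history, runtime_hours=3.0))   # ~0.74
```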

Step D: Modular Design–Decompose–Implement

  • What happens: Convert an idea into a repository plan, split into modules, implement each part, and assemble with a main script.
  • Why it exists: Makes editing, testing, and reuse precise and safe.
  • Example modules: dataset.py, model.py, engine.py, utils.py, config.py, loss.py, trainer.py.

šŸž Hook: Upgrading a PC by swapping the graphics card without touching the CPU or storage. 🄬 The Concept (Diff-Based Editing): Change only what’s needed, exactly where needed.

  • What it is: Edits are expressed as diffs: target file, block to replace, new code.
  • How it works: The agent patches multiple files atomically in one step.
  • Why it matters: No wasteful full rewrites; fewer new bugs.

šŸž Anchor: Replace just the optimizer settings in engine.py while leaving dataset.py untouched.
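
A minimal sketch of this edit format, where each edit names a target file, the block to replace, and the new code. The apply_edit helper is hypothetical; the real agent applies several such patches atomically across files in one step.

```python
from pathlib import Path

def apply_edit(repo_dir, filename, old_block, new_block):
    """Replace exactly one occurrence of old_block in the target file."""
    path = Path(repo_dir) / filename
    source = path.read_text()
    # Refuse ambiguous or missing matches so a patch never lands in the wrong place.
    if source.count(old_block) != 1:
        raise ValueError(f"{filename}: expected exactly one match for the block")
    path.write_text(source.replace(old_block, new_block))

# Example: swap only the optimizer settings in engine.py; dataset.py is untouched.
Path("repo").mkdir(exist_ok=True)
Path("repo/engine.py").write_text("optimizer = SGD(lr=0.1)\n")
apply_edit(
    "repo", "engine.py",
    old_block="optimizer = SGD(lr=0.1)",
    new_block="optimizer = AdamW(lr=3e-4, weight_decay=0.01)",
)
print(Path("repo/engine.py").read_text())  # optimizer = AdamW(...)
```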

Step E: Reflective Memory—Lesson Learning

  • What happens: After each run, analyze logs and diffs to extract two kinds of lessons: Solution Improvement and Debugging.
  • Why it exists: To store causal insights compactly and reuse them later.
  • Example: ā€œFor imbalanced classes, use stratified split and class-weighted loss.ā€

šŸž Hook: Keeping a cheat sheet of what actually helped last time. 🄬 The Concept (Lesson Learning): Distill and manage high-signal rules.

  • What it is: A pool of concise, cited lessons (top-K kept for context).
  • How it works: 1) Empirical analysis; 2) Compare with best; 3) Distill rule; 4) Review to remove duplicates; 5) Reuse with citations.
  • Why it matters: Boosts lesson-utilization (~65.8%) and cross-branch transfer (~63%).

šŸž Anchor: ā€œData augmentation X improved val F1 by 1.2 points—apply only for small datasets (Cite L7).ā€
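
A minimal sketch of a lesson pool with IDs for citation, a crude duplicate review, and a top-K context window. The data layout and ranking rule are assumptions; in MARS the distillation and review steps are performed by the LLM itself.

```python
from dataclasses import dataclass

@dataclass
class Lesson:
    lesson_id: str   # e.g. "L12", cited later in plans as "(Cite L12)"
    rule: str        # the distilled, general rule
    evidence: str    # which diff / metric change supported it
    uses: int = 0    # bumped each time a future plan cites the lesson

class LessonPool:
    def __init__(self, top_k=10):
        self.top_k = top_k
        self.lessons = {}

    def add(self, lesson):
        # Review step: skip rules that duplicate an existing lesson.
        if any(l.rule == lesson.rule for l in self.lessons.values()):
            return
        self.lessons[lesson.lesson_id] = lesson

    def context(self):
        # Keep only the top-K most-cited lessons in the prompt context.
        ranked = sorted(self.lessons.values(), key=lambda l: -l.uses)
        return [f"({l.lesson_id}) {l.rule}" for l in ranked[: self.top_k]]

pool = LessonPool()
pool.add(Lesson("L12", "Use a stratified split for imbalanced data",
                evidence="validation F1 rose after switching splits"))
print(pool.context())  # ['(L12) Use a stratified split for imbalanced data']
```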

End-to-end example (concrete): For iMet-2020-FGVC7, MARS begins with a lightweight ResNet baseline, learns that balanced sampling helps, upgrades to a stronger backbone, then finally ensembles two efficient models. Each step is chosen by budget-aware MCTS, implemented as diffs in a modular repo, and justified by cited lessons. The result climbs to a silver medal where prior agents stalled.

Secret sauce:

  • The trio works together: budget-aware planning decides which knob to turn; modularity makes the turn safe and local; reflective memory ensures we keep the good turns and skip the bad ones next time.

04Experiments & Results

The Test: Researchers used MLE-Bench (75 diverse Kaggle-style competitions) with a strict 24-hour budget per task on a single A100 GPU node. They measured three meaningful outcomes:

  • Above Median Rate: How often the agent beats the median competitor—like scoring above the class average.
  • Any Medal Rate: How often it wins at least bronze—like making the honor roll.
  • Gold Medal Rate: How often it reaches the top tier—like getting first place.

The Competition: MARS was compared with strong open-source agents (AIDE and AIRA-dojo) under identical hardware and models, and also contrasted with the official leaderboard (where setups differ). A scaled variant, MARS+, used two parallel search trees (2ƗA100s) to test scalability.

The Scoreboard (with context):

  • Controlled setting (same environment and LLMs):
    • MARS (Gemini-3-Pro-Preview) reached 65.8% Above Median and 56.0% Any Medal, with 31.1% Gold. Think of 31.1% Gold as getting an A+ nearly one out of three times—while many peers get B’s.
    • AIDE and AIRA-dojo scored notably lower Any Medal rates (e.g., 32.4% and 37.8% with Gemini-3-Pro-Preview), showing MARS’s advantage when rules are fair and costs equal.
  • Scaling up (MARS+): Above Median jumped to 73.3% and Any Medal to 59.6%, surpassing even resource-heavy competitors in similar metrics.
  • Across task difficulty splits (Lite, Medium, High): MARS consistently outperformed baselines; gains were largest on Medium where smart planning + lessons mattered most.

Surprising Findings:

  • Efficiency pays off: Budget-aware MCTS lifted the effective solution rate to about 19.5% (vs. 16.1% vanilla), proving that the runtime penalty helps prune slow, low-yield paths.
  • Lessons really transfer: About 65.8% of solutions reused prior lessons, and 63.0% of those lessons came from different branches—evidence of genuine ā€œAha!ā€ moments and cross-pollination.
  • More original than copycat: Code similarity checks showed that no medal-winning submission exceeded 60% similarity to top public notebooks, matching the originality of baselines and passing rule audits with zero violations.

Cost vs. Benefit:

  • MARS used more input tokens (it carries more context: modules + lessons), raising per-task LLM cost versus AIRA-dojo ($60.5 vs. $39.0). But Any Medal nearly doubled (43.1% vs. 24.4%), which is like paying a bit more for coaching that reliably gets you on the podium.

Big Picture: The combo of budget-aware planning, modular repos, and reflective lessons didn’t just look nice on paper; it produced medal-level lifts under strict, realistic time budgets.

05Discussion & Limitations

Limitations (be specific):

  • Higher context cost: Reflective memory and modular prompts increase input tokens, raising LLM spend. If your budget is extremely tight, this overhead may be a deal-breaker.
  • Hyperparameter sensitivity: The latency penalty weight (w) needs tuning; too small and the agent wastes time on slow runs, too large and it chases only fast-but-weak models.
  • Attribution noise: Comparative lessons infer causality from diffs and metrics, but confounders (random seeds, data quirks) can still sneak in, occasionally producing misleading rules.
  • Narrow scope today: The experiments focus on machine learning engineering tasks; other scientific domains may require extra domain tools and validators.
  • Log and metric dependence: If logs are incomplete or metrics don’t represent the true objective, lessons can drift.

Required resources:

  • One A100 GPU (or similar), ~12 vCPUs, ~220 GB RAM, 24 hours per task (baseline setup), plus a capable LLM (e.g., Gemini-2.5/3-Pro variants).
  • Stable file system access for building multi-file repositories and running multiple training jobs.

When NOT to use:

  • Tiny tasks solvable with a single quick baseline where search overhead dominates.
  • Ultra-tight time or money budgets where long-context prompts are unaffordable.
  • Problems requiring external web access or special licenses not allowed by the evaluation rules.

Open questions:

  • Smarter economy: Can we further shrink context cost via retrieval-augmented lesson selection, chunked memory, or learned compression without losing signal?
  • Stronger causality: Can we mix in controlled ablations or counterfactual replays to reduce confounders in lesson learning?
  • Broader domains: How does MARS adapt to robotics, simulation-heavy science, or multi-agent lab automation where evaluation is even costlier?
  • Early stopping + meta-learning: Can we predict losers sooner and warm-start winners better using learned runtime and metric curves?
  • Human-in-the-loop: What light-touch feedback (e.g., vetoing wasteful branches) gives the biggest boost per minute of expert time?

06Conclusion & Future Work

Three-sentence summary: MARS reframes automated AI research as a budget-conscious search over a modular code repository, powered by lessons distilled from comparing solution diffs. Its Budget-Aware MCTS, Design–Decompose–Implement workflow, and Comparative Reflective Memory work together to move fast, fix safely, and keep what truly helps. On MLE-Bench, this synergy delivered state-of-the-art open-source results and competitive leaderboard performance under strict 24-hour budgets.

Main achievement: Showing that explicit cost-awareness, repository modularity, and causal lesson learning can materially outperform monolithic, cost-agnostic agents on long-horizon ML engineering tasks.

Future directions:

  • Reduce context costs with smarter memory retrieval and compression.
  • Strengthen causal inference in lessons with targeted ablations and uncertainty estimates.
  • Extend to broader scientific fields and integrate early-stopping predictors.

Why remember this: MARS is a blueprint for doing more with less—turning trial-and-error into guided, budget-smart exploration that learns transferable rules. It’s not just better code generation; it’s research strategy encoded as an agent that plans, builds, and reflects like a seasoned engineer.

Practical Applications

  • Auto-build robust ML repositories for new datasets with clean module boundaries and documentation.
  • Run budget-conscious hyperparameter and architecture search to stay within fixed cloud costs.
  • Replicate and adapt past Kaggle solutions by transferring distilled lessons to new tasks.
  • Harden pipelines by learning recurring debug fixes and preventing repeated runtime errors.
  • Adopt curriculum strategies for faster onboarding: start with baselines, graduate to ensembles if lessons justify it.
  • Use diff-based editing to implement safe, auditable changes in regulated environments.
  • Prioritize efficient models for on-device or edge deployment where runtime matters.
  • Teach ML engineering best practices to students via transparent lesson citations and modular repos.
  • Accelerate R&D sprints by running two parallel search trees (MARS+) when compute allows.
  • Audit originality and compliance with built-in logs, similarity checks, and rule adherence.
#MARS#budget-aware MCTS#reflective memory#comparative lessons#modular agents#automated AI research#MLE-Bench#efficiency-guided reward#design–decompose–implement#diff-based editing#UCT selection#curriculum exploration#credit assignment#agentic code generation#search under constraints