
Can We Predict Before Executing Machine Learning Agents?

Intermediate
Jingsheng Zheng, Jintian Zhang, Yujie Luo et al. · 1/9/2026
arXiv · PDF

Key Summary

  • Machine learning agents usually improve by writing code, running it for hours, and then using the results to tweak the next try, which is very slow.
  • This paper asks a bold question: can an AI predict which solution will work better before we actually run anything?
  • The authors create a new task called Data-centric Solution Preference, where the AI picks the better of two code solutions by reasoning from a Verified Data Analysis Report.
  • On 18,438 real solution pairs from 26 tasks, reasoning-optimized LLMs predict the winner with up to 61.5% accuracy, beating random guessing and a 'complex-is-better' rule.
  • Turning prediction into action, the FOREAGENT system uses a Predict-then-Verify loop to try many ideas quickly, then only execute the most promising one.
  • This makes the agent 6× faster, lets it explore 3.2× more ideas, and still improves final performance by about +6% over a strong execution-based baseline.
  • A key ingredient is the Verified Data Analysis Report, which turns raw numbers into clear, checked stories the AI can reason about.
  • The models’ confidence is well calibrated, so when they feel sure, they are right more often, making them safe filters before expensive runs.
  • Scaling to bigger models alone doesn’t help much; reasoning-focused designs and good data representations matter more.
  • The approach won’t replace all execution but can slash wasted time, guiding agents toward better runs sooner.

Why This Research Matters

This work shows we can replace many slow training runs with fast, data-grounded predictions, cutting time and cost without sacrificing reliability. That means scientists and engineers can explore more ideas, find better solutions sooner, and use fewer resources. Calibrated confidence lets teams act safely on predictions, verifying only when it really counts. The approach is especially helpful in settings like healthcare or climate modeling, where runs are expensive and time-sensitive. By turning raw numbers into clear, verified data stories, we make AI reasoning more accurate and trustworthy. This is a practical step toward AI systems that think first and run second. It can broadly accelerate research and development across industries.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re practicing basketball. You could shoot for hours to see what works, or you could first look at the court, the wind, and your past shots to guess the best angle before shooting. Which is faster?

🥬 The Concept: Generate–Execute–Feedback.

  • What it is: Many ML agents follow a loop: write code (Generate), run it on data (Execute), read the results (Feedback), then try again.
  • How it works: 1) Draft a model or pipeline. 2) Train and evaluate it. 3) Use the numbers to decide what to fix. 4) Repeat many times.
  • Why it matters: Without this loop, agents wouldn’t improve. But it’s slow because training models takes a lot of time and compute. 🍞 Anchor: A Kaggle-style agent tries a CNN, trains for hours, sees the score, adjusts the learning rate, and tries again, over and over; a minimal sketch of this loop follows below.
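The control flow of this loop fits in a few lines. Here is a minimal, runnable Python sketch; `draft_solution` and `train_and_score` are hypothetical stand-ins for the LLM call and the hours-long training run, so only the loop structure reflects the paper.

```python
import random

def draft_solution(feedback_history):
    """Hypothetical stand-in for an LLM drafting code from past feedback."""
    return {"id": len(feedback_history), "notes": f"tweak based on {len(feedback_history)} past runs"}

def train_and_score(solution):
    """Hypothetical stand-in for the expensive Execute step (hours in practice)."""
    return random.random()  # pretend validation score

feedback_history = []
best_score = float("-inf")
for step in range(5):                                 # each iteration = one slow run
    solution = draft_solution(feedback_history)       # Generate
    score = train_and_score(solution)                 # Execute (the bottleneck)
    feedback_history.append((solution, score))        # Feedback
    best_score = max(best_score, score)               # keep the best so far
print(f"best validation score after 5 runs: {best_score:.3f}")
```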

🍞 Hook: You know how traffic jams make a short drive take forever?

🥬 The Concept: Execution Bottleneck.

  • What it is: The main slowdown in current ML agents comes from the expensive, time-consuming step of running models to get feedback.
  • How it works: Training deep models or full pipelines can take hours, and you need dozens of tries to improve, so time stacks up quickly.
  • Why it matters: If every guess needs hours, agents can’t explore many ideas and may miss great solutions. 🍞 Anchor: In some benchmarks, a single run can take up to 9 hours—like a huge traffic jam for every experiment.

🍞 Hook: Imagine using a flight simulator to practice landings before you touch a real plane.

🥬 The Concept: World Models.

  • What it is: A world model is an internal simulator that predicts what would happen without doing the real, expensive action.
  • How it works: 1) Learn patterns of how actions change the world. 2) Imagine outcomes for new actions. 3) Choose good actions based on these predictions.
  • Why it matters: If the simulation is good enough, you can skip many real trials and still learn fast. 🍞 Anchor: A self-driving car tests lane-change plans in its head before steering on the road.

🍞 Hook: Think like a chef picking the right recipe by first looking at the ingredients on the counter.

🥬 The Concept: Data-centric Solution Preference.

  • What it is: Given two code solutions and a data report, decide which solution will perform better—without running either one.
  • How it works: 1) Read the task description. 2) Read a Verified Data Analysis Report about the dataset. 3) Read two candidate solutions. 4) Predict the better one and provide confidence.
  • Why it matters: If we can often pick the winner before executing, we save tons of time and compute. 🍞 Anchor: For a small, noisy dataset, the AI prefers a regularized gradient-boosting model over a giant deep net, even before training. The task’s input/output contract is sketched below.
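To make that contract concrete, here is a minimal Python sketch of the task’s inputs and outputs as described above. The class and function names are our own illustrative choices, and the placeholder predictor is a trivial heuristic, not the paper’s LLM.

```python
from dataclasses import dataclass

@dataclass
class PreferenceInstance:
    """One Data-centric Solution Preference query."""
    task_description: str   # what the ML task is
    data_report: str        # Verified Data Analysis Report (plain language)
    solution_a: str         # candidate code A
    solution_b: str         # candidate code B

@dataclass
class PreferencePrediction:
    winner: int             # 0 = solution_a, 1 = solution_b
    confidence: float       # self-reported, in [0, 1]
    rationale: str          # reasoning behind the pick

def predict_preference(instance: PreferenceInstance) -> PreferencePrediction:
    """Hypothetical LLM call; here a toy placeholder that prefers the
    simpler (shorter) solution on small datasets. NOT the paper's model."""
    small_data = "small" in instance.data_report.lower()
    prefers_a = small_data and len(instance.solution_a) < len(instance.solution_b)
    return PreferencePrediction(winner=0 if prefers_a else 1,
                                confidence=0.6,
                                rationale="placeholder heuristic")

instance = PreferenceInstance(
    task_description="binary text classification",
    data_report="Small dataset (5k rows) with class imbalance.",
    solution_a="lightgbm with 5-fold stratified CV",
    solution_b="large transformer, single random split",
)
print(predict_preference(instance))
```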

The world before this paper: Agents were getting better at code generation and at using run-time logs to improve models. But the execution step stayed painfully slow. People tried speeding things up with heuristics, like pruning unlikely options or favoring fancier models, but those shortcuts often backfired. For instance, a complexity heuristic that says “deeper is better” fails when the dataset is tiny or skewed.

The problem: Can we replace many expensive runs with smart, data-grounded predictions? If an AI could reason about data properties (like size, imbalance, leakage risks) and match them to algorithm properties (like overfitting risks, capacity, sample efficiency) before running, we could explore more ideas in less time.

Failed attempts:

  • Pure complexity heuristics: Often wrong because they ignore data realities.
  • Code-only judgment: Better than random, but misses critical clues hidden in the dataset.
  • Raw numbers alone: LLMs don’t handle raw numeric streams as well as language; the signal stays buried.

The gap: We need a way to feed the AI a trustworthy, human-readable summary of the data that maps numbers to meaning—and then test whether the AI can use that to pick winners reliably.

The paper’s answer: Build a Verified Data Analysis Report that turns raw stats into a semantic story. Then test if LLMs can use that story to predict which solution will do better, before execution. Finally, plug this predictor into an agent so it can try many ideas quickly and only run the most promising ones.

Real stakes: Faster agents mean less energy use, lower costs, and quicker iteration in science and industry—from choosing the best medical imaging model to speeding up climate forecasting pipelines. Imagine replacing 9 hours of waiting with 1 second of thinking—again and again. That is why this matters.

02Core Idea

🍞 Hook: You know how a good coach can look at two practice plans and, just from the team’s stats and the drills, tell which one will help more—before anyone breaks a sweat?

🥬 The Concept: Predicting Before Executing.

  • What it is: The key insight is that an LLM can choose the better ML solution by reasoning from a verified, language-based data report and the code—without running anything.
  • How it works: 1) Convert raw data stats into a Verified Data Analysis Report. 2) Give the LLM the task, the report, and two solutions. 3) Ask it to pick the likely winner and give a confidence score. 4) Use that prediction to prune the search, then verify the top pick with a real run.
  • Why it matters: This cuts down wasted runs, accelerates exploration, and still keeps accuracy high by verifying the final choice. 🍞 Anchor: Faced with two text-classifiers and a report showing a small dataset with class imbalance, the LLM picks the simpler, regularized model; the later execution confirms it wins.

The “Aha!” in one sentence: If you feed an LLM a checked, plain-language story about the data, it can reason which algorithm fits that data better—often enough to skip many slow runs.

Three analogies:

  1. Cooking: The pantry report says, “We have ripe tomatoes, fresh basil, and no cream.” Before cooking, a chef picks pasta al pomodoro instead of alfredo.
  2. Sports: A coach reads player stats (small team, fast guards) and picks a fast-break strategy over a heavy post-up game—before tipoff.
  3. Travel: The weather-and-traffic report suggests subway over driving; you choose quickly without testing both.

Before vs. After:

  • Before: Agents relied on Execute-first thinking. They spent hours training to learn which idea was better.
  • After: Agents can Predict-then-Verify. They imagine results using data-grounded reasoning, execute only the best bet, and move faster.

Why it works (the intuition behind the math-free logic):

  • LLMs are great at understanding language. If we translate numeric quirks (imbalance, leakage risk, short texts, OOV rates) into a clear narrative, the model can line them up with algorithmic traits (capacity, regularization, sample efficiency) to infer fit.
  • This defeats bad shortcuts like “complexity always wins” by making the data’s needs explicit, so the model weighs trade-offs instead of chasing shiny architectures.

🍞 Hook: Picture a library of Lego pieces you snap together to build something new, fast.

🥬 The Concept: Building Blocks of the Approach.

  • What it is: The solution has four blocks—(1) Verified Data Analysis Report, (2) Pairwise Preference Prediction, (3) Confidence Calibration, (4) Predict-then-Verify loop.
  • How it works: 1) Profile-Verify-Verbalize the data. 2) Ask the LLM to choose between two solutions and give confidence. 3) Trust high-confidence picks more. 4) Use predictions to filter many candidates, then run the top choice.
  • Why it matters: Each block reduces waste: better inputs, smarter picks, safer decisions, and fewer runs. 🍞 Anchor: The agent generates 10 candidate codes, uses the predictor to select the top 1 with high confidence, and only trains that one—saving hours while still improving scores.

03Methodology

High-level recipe: Task + Data → Verified Data Analysis Report → Pairwise Prediction (with confidence) → Predict-then-Verify in the agent.

Step A: Curate the playground (the Preference Corpus)

  • What happens: The authors collect real solution trajectories from two ML agents (AIDE and AutoMind) across 26 tasks, then prune and deduplicate them with experts to keep high-quality, diverse examples. They turn 895 solid instances into 18,438 pairwise comparisons.
  • Why this step exists: We need many grounded A/B choices where the true winner is known from actual executions, so we can test whether predictions beat chance.
  • Example: For an image task, they might pair “ResNet with augmentations” vs. “Vision Transformer without augmentations,” with the label showing which actually scored higher on the hidden test. The pairing logic is sketched after this list.
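A rough sketch of how such a corpus can be assembled from executed solutions, assuming each solution carries a task id and its true score. The within-task pairing, ambiguity filtering, and position balancing follow the description above; the paper’s exact procedure may differ.

```python
from itertools import combinations

# Hypothetical executed solutions: (task_id, solution_code, true_score).
solutions = [
    ("img-cls", "resnet_aug.py", 0.91),
    ("img-cls", "vit_noaug.py", 0.88),
    ("img-cls", "cnn_small.py", 0.85),
]

pairs = []
for i, ((task_a, code_a, score_a), (task_b, code_b, score_b)) in enumerate(
        combinations(solutions, 2)):
    if task_a != task_b or score_a == score_b:
        continue                       # only within-task, unambiguous pairs
    label = 0 if score_a > score_b else 1
    if i % 2 == 0:                     # alternate order to avoid position bias
        pairs.append((code_a, code_b, label))
    else:
        pairs.append((code_b, code_a, 1 - label))
print(f"{len(pairs)} labeled pairs")
```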

🍞 Hook: You know how a teacher checks a student’s math notes to make sure the story problems are explained clearly, not just numbers copied from a calculator?

🥬 The Concept: Verified Data Analysis Report (Profile–Verify–Verbalize).

  • What it is: A trustworthy, human-readable summary of the dataset that connects observations to modeling implications.
  • How it works: 1) Code Profiling creates scripts to compute stats (masking labels/outcomes to avoid leakage). 2) Execution & Verification runs scripts to produce clean logs. 3) Verbalization turns logs into an explanation of what the stats mean for modeling.
  • Why it matters: LLMs reason better from language than raw numbers. This report gives them the right clues safely. 🍞 Anchor: For a patent-matching task, the report notes short phrases, character n-gram signals, anchor-target structure, and class imbalance, then explains why character-aware models and careful ranking losses may matter. A toy version of the three steps appears below.
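Here is a toy Python version of Profile-Verify-Verbalize for a tabular dataset, using pandas. The statistics chosen and the verbalization template are illustrative assumptions; the paper’s reports are richer and task-specific.

```python
import pandas as pd

def profile(df: pd.DataFrame, label_col: str) -> dict:
    """Profile step: compute stats with the label masked to avoid leakage."""
    features = df.drop(columns=[label_col])
    return {
        "n_rows": len(df),
        "n_features": features.shape[1],
        "missing_ratio": float(features.isna().mean().mean()),
        "minority_ratio": float(df[label_col].value_counts(normalize=True).min()),
    }

def verify(stats: dict) -> dict:
    """Verify step: sanity-check the numbers before trusting them."""
    assert stats["n_rows"] > 0
    assert 0.0 <= stats["missing_ratio"] <= 1.0
    return stats

def verbalize(stats: dict) -> str:
    """Verbalize step: map numbers to modeling implications (toy template)."""
    lines = [f"The dataset has {stats['n_rows']} rows and {stats['n_features']} features."]
    if stats["n_rows"] < 10_000:
        lines.append("It is small, so sample-efficient, regularized models are favored.")
    if stats["minority_ratio"] < 0.2:
        lines.append("Classes are imbalanced; stratified validation and class weights matter.")
    return " ".join(lines)

df = pd.DataFrame({"x1": [1, 2, 3, 4], "x2": [0.5, None, 0.1, 0.9],
                   "label": [0, 0, 0, 1]})
print(verbalize(verify(profile(df, "label"))))
```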

Step B: Define the prediction task

  • What happens: Each input bundle has the task description, the Verified Data Report, two code solutions, and a system prompt. The model outputs: (1) chain-of-thought reasoning, (2) predicted winner (0 or 1), and (3) a confidence score (0–1).
  • Why this step exists: The confidence is crucial as a gate for the agent—when the model is more certain, we trust it more to skip expensive runs.
  • Example: “Solution 1 likely wins because the dataset is small and imbalanced; Solution 1 uses strong regularization and stratified CV. Confidence: 0.78.” A sketch of assembling the input bundle and parsing this output follows below.
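A minimal sketch of building the input bundle and parsing the model’s answer. The prompt wording and the JSON reply schema are assumptions for illustration; the paper specifies a system prompt but not necessarily this exact format.

```python
import json

def build_prompt(task: str, report: str, sol_a: str, sol_b: str) -> str:
    """Assemble the input bundle; the exact wording here is illustrative."""
    return (
        "You are given an ML task, a Verified Data Analysis Report, and two "
        "candidate solutions. Reason from the data to the algorithms; do not "
        "assume more complex is better.\n"
        f"TASK:\n{task}\n\nDATA REPORT:\n{report}\n\n"
        f"SOLUTION 0:\n{sol_a}\n\nSOLUTION 1:\n{sol_b}\n\n"
        'Answer as JSON: {"reasoning": ..., "winner": 0 or 1, "confidence": 0-1}'
    )

def parse_response(text: str) -> tuple[int, float]:
    """Extract the predicted winner and confidence from a JSON reply."""
    reply = json.loads(text)
    winner, confidence = int(reply["winner"]), float(reply["confidence"])
    assert winner in (0, 1) and 0.0 <= confidence <= 1.0
    return winner, confidence

# Example with a mocked model reply:
mock_reply = ('{"reasoning": "small, imbalanced data favors the regularized '
              'model", "winner": 0, "confidence": 0.78}')
print(parse_response(mock_reply))  # -> (0, 0.78)
```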

🍞 Hook: Like choosing the best of two shoes by reading the trail conditions before hiking.

🥬 The Concept: Pairwise Preference Prediction.

  • What it is: Decide which of two solutions will perform better, given the task and the data report, without actually running them.
  • How it works: 1) Align data needs with algorithm traits. 2) Penalize risky mismatches (overfitting, leakage). 3) Prefer sample-efficient, well-validated setups when data is scarce. 4) Output a winner and confidence.
  • Why it matters: This reduces the candidate pool early, saving time. 🍞 Anchor: With only 5k examples, the predictor favors LightGBM with 5-fold CV over a huge transformer trained once with a random split.

Step C: Integrate into an agent (FOREAGENT)

  • What happens: The agent widens exploration by generating many candidates in parallel (m ≫ k), filters them using the predictor’s confidence gate (e.g., c ≥ 0.7), selects top-k, and only then executes the winner(s). This is the Predict-then-Verify loop.
  • Why this step exists: It decouples cheap exploration (thinking) from expensive verification (training).
  • Example: Generate 10 candidates, predict pairwise winners to build a shortlist, keep the top 1, train it, and move on quickly (see the sketch below).
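The Predict-then-Verify control flow might look like the following sketch, with stand-in functions for generation, prediction, and execution. The sequential champion-vs-challenger selection and the 0.7 gate mirror the description above, but FOREAGENT’s exact selection procedure may differ.

```python
import random

def generate_candidates(m: int) -> list[str]:
    """Stand-in for parallel LLM code generation (cheap)."""
    return [f"candidate_{i}" for i in range(m)]

def predict_pair(a: str, b: str) -> tuple[str, float]:
    """Stand-in for the LLM preference predictor: (winner, confidence)."""
    return random.choice([a, b]), random.uniform(0.5, 1.0)

def execute(candidate: str) -> float:
    """Stand-in for the expensive training run (verification)."""
    return random.random()

def predict_then_verify(m: int = 10, gate: float = 0.7) -> tuple[str, float]:
    candidates = generate_candidates(m)           # 1) high-volume generate
    champion = candidates[0]
    for challenger in candidates[1:]:             # 2) confidence-gated selection
        winner, conf = predict_pair(champion, challenger)
        if conf >= gate:                          # trust only confident predictions
            champion = winner
        # low-confidence pairs keep the current champion (conservative)
    return champion, execute(champion)            # 3) verify only the top pick

best, score = predict_then_verify()
print(best, f"{score:.3f}")
```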

🍞 Hook: It’s like trying many paper airplane designs in your head and only folding the best one with real paper.

🥬 The Concept: Predict-then-Verify Loop.

  • What it is: A conservative cycle where the model first predicts likely winners, then runs just the best to anchor progress.
  • How it works: 1) High-volume generate. 2) Confidence-gated selection. 3) Verify only the top pick. Repeat.
  • Why it matters: This keeps speed high and risk low; most bad ideas never reach the training stage. 🍞 Anchor: The agent explores 3.2× more nodes and still finishes 6× faster because it only trains the most promising candidates.

Step D: The safety features

  • Confidence calibration: The model’s self-reported confidence correlates with accuracy, so a 0.8 confidence really means “very likely right.” Without calibration, the agent could trust bad guesses too often (a simple binning check is sketched after this list).
  • Anti-heuristic training: Instructions forbid “complexity wins” shortcuts; predictions must link data observations to method choices.
  • Balance and filtering: Pair positions are balanced to avoid position bias; ambiguous pairs are filtered to protect label quality.
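A quick way to check this kind of calibration is to bin predictions by self-reported confidence and measure accuracy per bin. The sketch below uses synthetic data that is calibrated by construction, just to show the bookkeeping.

```python
import numpy as np

# Synthetic predictions: (self-reported confidence, was the prediction correct?)
rng = np.random.default_rng(0)
confidence = rng.uniform(0.5, 1.0, size=2000)
# Simulate a calibrated predictor: P(correct) equals the stated confidence.
correct = rng.uniform(size=2000) < confidence

bins = np.linspace(0.5, 1.0, 6)                  # five confidence bins
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (confidence >= lo) & (confidence < hi)
    if mask.any():
        print(f"confidence {lo:.1f}-{hi:.1f}: "
              f"accuracy {correct[mask].mean():.2f} over {mask.sum()} pairs")
# A well-calibrated predictor shows accuracy rising with the bin, which is
# what lets the agent gate expensive runs on high-confidence picks.
```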

Secret sauce:

  • Turning raw stats into semantic narratives makes LLM reasoning click.
  • Pairwise framing is easier and more reliable than full listwise ranking.
  • Confidence gating aligns prediction reliability with action—only strong bets trigger expensive runs.

What breaks without each piece:

  • No Verified Report: The LLM misses key data cues; performance drops toward code-only levels.
  • No Pairwise Framing: Global ranking becomes unstable; top-1 accuracy collapses.
  • No Confidence Gate: The agent wastes time on weak guesses and may slow down again.
  • No Verification: The system could drift on wrong beliefs; verification anchors reality.

04Experiments & Results

The Test: Can LLMs pick winners without running code?

  • Setup: 18,438 real pairwise comparisons from 26 tasks across CV, NLP, and Data Science. Each pair includes task description, Verified Data Report, two candidate solutions, and the ground-truth winner from actual execution.
  • Metric: Micro-averaged pairwise accuracy, i.e., how often the model picks the true winner across all pooled pairs (computed as in the sketch after this list).
  • Baselines: Random guess (50.0%) and a complexity heuristic (50.8%) that prefers fancier solutions.
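The metric itself is simple to compute; the sketch below uses synthetic labels. Pooling every pair across all tasks before averaging is what makes it “micro-averaged.”

```python
# Micro-averaged pairwise accuracy: pool every pair from every task and
# count how often the predicted winner matches the executed ground truth.
predictions = [0, 1, 1, 0, 1, 0, 0, 1]   # synthetic predicted winners
ground_truth = [0, 1, 0, 0, 1, 1, 0, 1]  # synthetic true winners

correct = sum(p == g for p, g in zip(predictions, ground_truth))
accuracy = correct / len(ground_truth)
print(f"micro-averaged pairwise accuracy: {accuracy:.1%}")  # 75.0%
```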

The Competition: State-of-the-art LLMs

  • DeepSeek-V3.2 in “Thinking/CoT” mode and GPT-5.1 with reasoning instructions, plus a breadth of Qwen models for scaling analysis.

The Scoreboard (with context):

  • DeepSeek-V3.2-Thinking: 61.5% accuracy. That’s like scoring an A- when random is a 50–50 coin flip and the “fancy is better” rule barely improves to 50.8%.
  • GPT-5.1: 58.8% accuracy. Solidly above chance and the heuristic.
  • Domain patterns: Strong on NLP (up to ~66.9%), decent on easy tasks (~63.9%), and struggles more with complex code or subtle intra-family differences.

Surprising findings:

  1. Verbal beats numeric: Performance rises stepwise from code-only (~56.7%) → numeric stats (~59.0%) → verbal reports (~61.3%). LLMs reason better from clear language than from raw numbers.
  2. Reasoning matters: Chain-of-thought (Thinking mode) jumps from ~55.9% to ~61.3%, and remains robust across temperatures.
  3. Confidence is trustworthy: Higher self-reported confidence bins show higher accuracy—useful for safe gating.
  4. Not just scale: Bigger models alone plateau; reasoning-centric designs outperform mere parameter increases.
  5. Ranking is harder: Beyond pairwise, global listwise ranking accuracy@1 drops notably (~31% at N=50), showing pairwise is the right granularity for now.

Agent-level impact (FOREAGENT):

  • 6× speedup: Converges to peak performance in about one-sixth the time of the execution-only baseline (AIDE).
  • 3.2× exploration: With prediction filtering, the agent evaluates many more ideas within the same time.
  • +6% Beat Ratio improvement: On AI4Science-style tasks (including unseen ones), the agent outperforms more human leaderboard submissions than the baseline on average.

Interpretation:

  • 61.5% vs. 50% sounds small, but it compounds: Every wrong run you skip saves hours. Multiply that across hundreds of attempts and you get massive time and cost savings.
  • Calibrated confidence enables a conservative policy: trust high-confidence picks first, then verify—delivering both speed and reliability.
  • The Verified Data Report is the engine: it lifts performance beyond code-only and defeats “complexity bias.”

05Discussion & Limitations

🍞 Hook: Think of using a weather app to decide quickly—useful most days, but you still look outside before leaving.

🥬 The Concept: Honest Limits and When Not to Use It.

  • What it is: The approach is powerful but not magic; it has boundaries and resource needs.
  • How it works: It works best when the data report captures the dataset’s key traits and when the competing solutions differ in meaningful, data-relevant ways. It struggles on extremely subtle distinctions within the same algorithm family or when the data story is incomplete.
  • Why it matters: Knowing the edges helps you deploy it wisely—use it to prune, then verify the final pick. 🍞 Anchor: If two ResNet variants differ only by tiny augmentation tweaks, the predictor may be unsure; run both if time allows.

Limitations:

  • Corpus imbalance: The corpus has more examples of popular tasks than niche ones, so results may generalize less to long-tail scientific domains.
  • Pairwise sweet spot: Listwise/global ranking is still weak. The method shines in A/B filtering, not in ordering long lists.
  • Ceiling from validation–test gaps: Even execution-based validation is an imperfect proxy for true test performance, so purely static prediction will always have a ceiling.

Required resources:

  • An LLM with strong reasoning (e.g., a “thinking” mode), plus compute to generate Verified Data Reports (profiling and verbalization).
  • A task environment to run occasional verifications and anchor predictions in reality.

When not to use:

  • Ultra time-insensitive contexts where training everything is fine.
  • Hyper-nuanced model comparisons with almost no data-signal differences and very tight performance margins.
  • Situations where you cannot produce a safe, faithful data report (e.g., strict privacy without approved profiling).

Open questions:

  • Can we improve listwise ranking to keep consistency across many candidates?
  • How far can specialized reasoning architectures push accuracy beyond ~61.5%?
  • Can interactive simulations or richer data-grounding further reduce the validation–test gap?
  • How to generalize reliably to rare domains with few examples while keeping calibration intact?

06Conclusion & Future Work

Three-sentence summary: This paper shows that LLMs can often predict which ML solution will work better by reading a Verified Data Analysis Report and the code—before running anything. By inserting this predictor into a Predict-then-Verify loop, the agent explores far more ideas, executes far fewer, and still improves outcomes. The result is a 6× speed boost, 3.2× broader search, and about +6% performance gains over a strong execution-only baseline.

Main achievement: Turning world-model style predictive reasoning into a practical, calibrated filter for ML agents, grounded in data semantics rather than fragile heuristics.

Future directions: Strengthen listwise ranking; design architectures that better integrate numeric data and language; build interactive simulators to further align prediction with real execution; and extend the approach into training-time reward models for even faster RL-style optimization.

Why remember this: It’s a blueprint for compressing hours of execution into seconds of reasoning—showing that good, verified data stories plus thoughtful prediction can reshape how ML agents search, choose, and improve.

Practical Applications

  • Speed up AutoML searches by pruning poor candidates before training.
  • Reduce cloud compute bills by executing only high-confidence model candidates.
  • Improve MLOps pipelines with a pre-execution filter that flags risky overfitting setups.
  • Accelerate Kaggle-style experimentation by quickly narrowing to the most promising ideas.
  • Assist data scientists in model selection for small or imbalanced datasets using the Verified Data Report.
  • Prioritize hyperparameter trials by predicting likely winners and skipping weak configurations.
  • Guide feature engineering choices by linking data traits to suitable model families.
  • Enable faster AI4Science workflows by expanding search breadth while keeping time budgets fixed.
  • Serve as an execution-free reward model to speed up RL-style agent training loops.
  • Provide calibrated risk assessments (confidence-gated) for when to run costly training jobs.
#World Models · #Predict-then-Verify · #Data-centric AI · #Preference Modeling · #LLM Agents · #AutoML · #Execution Bottleneck · #Confidence Calibration · #Verified Data Analysis Report · #Pairwise Preference · #FOREAGENT · #Reasoning with Data · #Agent Acceleration · #Heuristic Pruning · #Calibration Analysis