
Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Intermediate
Xinyu Zhu, Yuzhu Cai, Zexi Liu et al. Ā· 1/15/2026
arXiv Ā· PDF

Key Summary

  • This paper builds an AI agent, ML-Master 2.0, that can work on machine learning projects for a very long time without forgetting what matters.
  • Its big idea is cognitive accumulation: turning raw experiences into stable knowledge and then into reusable wisdom over time.
  • It uses a Hierarchical Cognitive Caching (HCC) system with three layers: L1 Evolving Experience, L2 Refined Knowledge, and L3 Prior Wisdom.
  • The agent carefully moves information between layers using context prefetching, context hits, and context promotion.
  • On OpenAI’s MLE-Bench (75 real Kaggle tasks), it achieved a 56.44% medal rate, beating previous methods across all difficulty levels.
  • HCC shrinks the bloated context while keeping the brain of the project intact, preventing confusion during long debugging cycles.
  • Ablations show each layer (experience, knowledge, wisdom) matters; removing any one hurts results.
  • This approach points toward ultra-long-horizon autonomy—agents that can sustain strategy over days or weeks of experimentation.
  • It offers a blueprint for AI that learns from projects like people do: remember the steps, extract lessons, and collect best practices.
  • The system is resource-intensive but shows how to scale deliberate, memory-aware AI research on purely computational tasks.

Why This Research Matters

Real ML work is a marathon, not a sprint, and this paper shows how to keep an AI’s head clear for the whole race. By separating raw details from settled lessons and long-term wisdom, agents stop wasting compute relearning the same basics. Teams gain faster iterations, fewer dead ends, and a growing library of proven playbooks. This can improve productivity for data scientists, lower costs for companies, and speed up research progress. Beyond Kaggle-like tasks, the method points to general scientific agents that can plan, remember, and improve over weeks. Smarter memory means safer, more explainable decisions because each phase’s summary shows what worked and why. In short, it’s a practical path to AI that learns how to learn, not just how to fit a model.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine building a giant LEGO city over many weekends. If you only remember the last few bricks you placed, the city ends up messy. You need a plan, notes about what worked, and a list of tricks you can reuse next time.

🄬 The Concept — Machine Learning Engineering (MLE):

  • What it is: MLE is the hands-on craft of turning data and models into working solutions that produce real predictions and scores.
  • How it works: 1) Understand the task and data, 2) Pick and code a model, 3) Train and validate it, 4) Submit predictions and read the score, 5) Improve by iterating.
  • Why it matters: Without MLE, ideas stay as theory; you won’t get reliable, leaderboard-ready solutions.

šŸž Anchor: In a Kaggle competition, MLE is everything from loading CSV files to tuning learning rates and saving submission.csv.

šŸž Hook: You know how a student can’t carry every textbook in their backpack, so they choose what to bring and what to summarize?

🄬 The Concept — Context Management:

  • What it is: Context management is choosing what information an AI keeps close at hand to think well right now.
  • How it works: 1) Collect signals from the environment (errors, logs), 2) Prioritize what’s important, 3) Keep the rest nearby in summaries, 4) Update as you learn.
  • Why it matters: Without it, the AI’s ā€œbackpackā€ overflows with details, and it loses track of the mission.

šŸž Anchor: When debugging code, you keep the latest error full text, but you only keep summaries of yesterday’s runs.

šŸž Hook: Think of planning a school musical. Rehearsals take weeks, and your choices today affect the big performance later!

🄬 The Concept — Ultra-long-horizon Autonomy:

  • What it is: The ability for an AI to keep strategy and adjust plans over very long stretches—days or weeks.
  • How it works: 1) Set long-term goals, 2) Run experiments, 3) Gather delayed feedback, 4) Correct course without forgetting past lessons.
  • Why it matters: Without it, the AI forgets why it started and chases random fixes.

šŸž Anchor: Training a model over 24 hours with many trials—only long-horizon autonomy keeps the plan steady while adapting.

šŸž Hook: You know how in class you do many practice problems, then remember key rules, and later you share tips that work for all subjects?

🄬 The Concept — Cognitive Accumulation:

  • What it is: Turning raw experiences into stable knowledge and finally into reusable wisdom over time.
  • How it works: 1) Do stuff (experience), 2) Summarize what really mattered (knowledge), 3) Distill cross-task patterns (wisdom), 4) Reuse them.
  • Why it matters: Without it, every new task restarts from zero.

šŸž Anchor: After many Kaggle tasks, you remember that GroupKFold avoids leakage on user-based datasets—wisdom you reuse next time.

šŸž Hook: Computers use fast small caches (close to the CPU), bigger slower memory, and even slower disks. Why? To put the right stuff at the right distance.

🄬 The Concept — Hierarchical Cognitive Caching (HCC):

  • What it is: A three-layer memory system for agents that separates short-term traces, mid-term summaries, and long-term wisdom.
  • How it works: 1) Keep fresh execution details nearby (L1), 2) Promote phase summaries as knowledge (L2), 3) Distill cross-task wisdom (L3), 4) Retrieve and update across layers.
  • Why it matters: Without tiers, the agent’s mind clogs with logs, losing big-picture guidance.

šŸž Anchor: Like L1/L2/L3 in computers, HCC keeps current logs handy, preserves conclusions, and packs best practices for future tasks.

The world before: LLM agents were good at short, single-shot answers but struggled in real research: long feedback loops, huge logs, and many cycles of trial and error. People tried bigger context windows, memory paging, and long summaries. These helped store more, but didn’t explain how messy, raw execution turns into clean, reusable strategy.

The problem: Static or heuristic memory can balloon with details and still miss the ā€œlessons learned.ā€

The failed attempts: Flat memories and ad-hoc summarization either lost key details or grew unmanageably, and often had no clear promotion or eviction policies.

The gap: We needed a process to evolve information—from raw to refined to transferable—plus clear rules for what to keep where and when to move it.

Real stakes: In real projects, wasted compute and confusion kill progress. For students, it’s like studying for weeks but not remembering the key theorems; for companies, it’s time and GPU money lost. This paper fills that gap with HCC: a principled way to keep agents strategically sharp over very long horizons by growing wisdom, not just logs.

02Core Idea

šŸž Hook: Imagine a chef who keeps today’s recipe notes on the counter, a binder of tried-and-true tips on the shelf, and a mental playbook of cooking wisdom for any cuisine.

🄬 The Concept — The ā€œAha!ā€ Moment:

  • What it is: Don’t stretch a single memory; evolve it. Distill execution traces into knowledge, then into wisdom, and structure it in layers so the agent stays sharp for days.
  • How it works: 1) Separate memory by timescale (now, soon, later), 2) Promote info upward only when stable, 3) Retrieve the right layer when needed, 4) Keep loops of learning going.
  • Why it matters: Without evolution and structure, the agent drowns in details or forgets strategy.

šŸž Anchor: Instead of re-reading all logs, the agent uses concise phase summaries and a library of prior best practices to plan the next move.

Multiple analogies:

  1. School Binder: Daily worksheet (L1), unit summary sheet (L2), semester cheat-sheet (L3). You write daily, summarize weekly, and study from the cheat-sheet for finals.
  2. Sports Team: Live play calls (L1), post-game analysis (L2), season playbook (L3). You act now, learn patterns, and refine your playbook for every future game.
  3. Map App: Current turn-by-turn directions (L1), route overview (L2), learned driving habits (L3). You avoid traffic now, choose smarter routes next trip.

Before vs After:

  • Before: Agents stuffed long contexts with raw histories; attention scattered; plans drifted; improvements slowed.
  • After: Agents keep tight working memories, roll up outcomes into phase knowledge, and bank cross-task wisdom. Planning stays coherent; learning compounds.

🄬 The Concept — Research Plan Phases:

  • What it is: Chunking work into planned phases with multiple exploration directions and concrete suggestions.
  • How it works: 1) Propose directions, 2) Run them (possibly in parallel), 3) Collect results, 4) Promote summaries, 5) Plan next phase.
  • Why it matters: Without phases, you can’t cleanly promote knowledge or steer strategy.

šŸž Anchor: Try three ideas—change features, adjust model, tweak loss—then summarize what worked to guide the next round.

Why it works (intuition):

  • Information changes stability over time. Fresh logs are volatile; conclusions harden after repeated validation; wisdom emerges when similar lessons recur across tasks.
  • By aligning storage to stability (fast-close for volatile, summarized for stable, distilled for transferable), the agent avoids overload while keeping strategy intact.
  • Promotion is the ā€œglueā€: it converts noise into signal on a schedule (after a phase or at task end), so lessons don’t get lost.

Building blocks:

  • L1 Evolving Experience: raw, high-fidelity traces for immediate debugging.
  • L2 Refined Knowledge: compact phase summaries that preserve reasoning and key judgments.
  • L3 Prior Wisdom: reusable, task-agnostic templates and priors, retrievable by similarity.
  • Migration Policies: prefetching (start with relevant wisdom), hit rules (pull raw if current, summaries if past), promotion (phase-level and task-level distillation).

🄬 The Concept — Cognitive Accumulation (revisited as the engine):

  • What it is: The ongoing cycle: act → summarize → distill → reuse.
  • How it works: 1) Try a plan, 2) Summarize what worked and why, 3) Distill cross-task rules at task end, 4) Fetch these rules in new tasks.
  • Why it matters: It turns time into compound learning—like interest but for ideas.

šŸž Anchor: After several image tasks, the agent ā€œremembersā€ strong backbones and augmentations that usually help, speeding up the next win.

03Methodology

At a high level: Input (task description + data) → Context Prefetching (retrieve similar-task wisdom) → Initial Coding + Execution (L1 experience grows) → Phase Planning (choose directions) → Parallel Exploration + Logging (L1) → Phase-level Promotion (to L2 knowledge) → Next Phases (repeat) → Task-level Promotion (to L3 wisdom) → Output (final code + submission + updated wisdom).
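
That pipeline can be condensed into a short loop around the cache sketch shown earlier. This is a hedged sketch: llm() is a hypothetical stand-in for the model calls and training jobs the paper describes, and run_task is an assumed name, not the system's interface.

```python
# Hedged sketch of the outer loop: prefetch -> baseline -> phases -> promotion.
def llm(prompt: str) -> str:
    return f"<result of: {prompt[:40]}...>"   # stub so the sketch runs end to end

def run_task(task: str, cache, n_phases: int = 3) -> str:
    context = list(cache.l3_wisdom)                                   # context prefetching from L3
    solution = llm(f"Write a baseline for {task} given {context}")
    for phase in range(1, n_phases + 1):
        plan = llm(f"Plan phase {phase} using knowledge {cache.l2_knowledge}")
        trace = llm(f"Execute {plan} on {solution}")                  # raw outcomes land in L1
        cache.log(trace)
        cache.promote_phase(llm(f"Summarize phase {phase}: {trace}")) # L1 -> L2, raw logs evicted
        solution = llm(f"Refine {solution} with {cache.l2_knowledge}")
    cache.promote_task(llm(f"Distill wisdom from {cache.l2_knowledge}"))  # L2 -> L3 at task end
    return solution
```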

🄪 Concept — L1 Evolving Experience:

  • What it is: The agent’s working memory of current raw traces—plans, code diffs, errors, metrics.
  • How it works: Keep raw logs only for the active phase plus key plan markers; discard or promote older raw traces after each phase.
  • Why it matters: Without L1, the agent can’t precisely debug or react to immediate errors.
  • Example: If Val F1 drops after enabling CutMix, L1 contains the exact training logs and code lines to revert or adjust.

🄪 Concept — L2 Refined Knowledge:

  • What it is: Mid-term summaries from a completed phase: judgments (e.g., feature X harmful), insights (e.g., leakage under split Y), rationale.
  • How it works: After running several directions, an LLM condenses results into a compact, decision-ready summary, then removes the bulky logs.
  • Why it matters: Without L2, strategic memory drowns in raw details; planning loses consistency.
  • Example: ā€œHigher image resolution with ConvNeXt-L boosted F1; Asymmetric Loss helped with imbalance; Stop trying MixUp-only.ā€

🄪 Concept — L3 Prior Wisdom:

  • What it is: Cross-task, reusable playbooks: templates, stable hyperparameter priors, robust pipelines.
  • How it works: At task end, distill durable, task-agnostic tips; index them by embeddings for semantic retrieval next time.
  • Why it matters: Without L3, every new task restarts from scratch.
  • Example: ā€œFor multilabel leaves, strong augmentations + high-res backbones + Asymmetric Loss are reliable starters.ā€

🄪 Concept — Research Plan Phases (operational):

  • What it is: Each phase proposes m directions with q concrete suggestions and runs them, often in parallel.
  • How it works: 1) Draft a hierarchical plan, 2) Execute suggestions (e.g., change backbone, loss, split), 3) Record metrics, 4) Summarize into L2.
  • Why it matters: Phases create clean boundaries for promotion and strategic resets.
  • Example: Phase 2 tests ViT-B/16 vs ConvNeXt-L vs Swin-T with fixed augmentations, then picks the winner for Phase 3 (a toy plan structure follows below).
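
To make the m-directions-by-q-suggestions idea concrete, here is a toy plan structure; the dictionary schema is an assumption for illustration, not the paper's format.

```python
# Toy phase plan with m = 3 directions, each carrying q = 3 concrete suggestions.
phase_plan = {
    "phase": 2,
    "directions": [
        {"name": "backbone",     "suggestions": ["ViT-B/16", "ConvNeXt-L", "Swin-T"]},
        {"name": "loss",         "suggestions": ["BCE", "Asymmetric Loss", "Focal Loss"]},
        {"name": "augmentation", "suggestions": ["baseline", "CutMix", "RandAugment"]},
    ],
}

# Each suggestion becomes a run whose logs land in L1 before phase-level promotion.
for direction in phase_plan["directions"]:
    for suggestion in direction["suggestions"]:
        print(f"Phase {phase_plan['phase']}: try {direction['name']} = {suggestion}")
```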

🄪 Concept — Context Prefetching:

  • What it is: A warm start that pulls relevant L3 wisdom before coding.
  • How it works: Embed the new task descriptor; retrieve past wisdom with cosine similarity above a threshold; inject into initial context.
  • Why it matters: Without prefetching, the agent spends hours rediscovering common-sense baselines.
  • Example: For multi-label plant disease, it fetches tips about Asymmetric Loss and high-res backbones (see the retrieval sketch below).
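
A minimal retrieval sketch, assuming wisdom entries are stored as (embedding, text) pairs produced by some sentence-embedding model; the 0.7 threshold is an arbitrary illustration, not a value from the paper.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def prefetch(task_embedding, wisdom_store, threshold=0.7):
    """wisdom_store: list of (embedding, text) pairs written at task-level promotion."""
    scored = [(cosine(task_embedding, emb), text) for emb, text in wisdom_store]
    scored.sort(key=lambda pair: pair[0], reverse=True)            # most similar wisdom first
    return [text for score, text in scored if score >= threshold]  # inject these into the prompt
```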

🄪 Concept — Context Hit (retrieval policy):

  • What it is: A rule that decides whether to fetch raw events (from L1) or summaries (from L2) into the model’s prompt.
  • How it works: If an event is from the active phase, include it raw; if from a finished phase, include its L2 summary; always include plan markers.
  • Why it matters: Without it, the context bloats (too many raw logs) or becomes vague (only summaries).
  • Example: While in Phase 3, include Phase 3 errors raw, but compress Phase 1–2 into a couple of L2 knowledge blocks, as in the sketch below.
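
A sketch of the rule as a filter over logged events; the event fields (phase, text, is_plan_marker) are assumptions for illustration.

```python
def select_context(events, phase_summaries, active_phase):
    """events: dicts like {"phase": int, "text": str, "is_plan_marker": bool};
    phase_summaries: {phase_id: L2 summary} for finished phases."""
    context = []
    for phase_id, summary in sorted(phase_summaries.items()):
        if phase_id != active_phase:
            context.append(f"[Phase {phase_id} summary] {summary}")   # finished phases -> L2
    for event in events:
        if event["phase"] == active_phase or event.get("is_plan_marker"):
            context.append(event["text"])                             # active phase -> raw L1
    return "\n".join(context)
```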

🄪 Concept — Context Promotion:

  • What it is: Turning raw trajectories into L2 knowledge (phase-level) and task histories into L3 wisdom (task-level).
  • How it works: Phase-level: summarize all directions’ logs into a compact, evaluative brief and evict raw logs; Task-level: after final solution, distill portable tips, embed, and store.
  • Why it matters: Without promotion, nothing stabilizes—insights vanish in the noise.
  • Example: Phase-level: ā€œResolution helps more than optimizer tweaks.ā€ Task-level: ā€œFor imbalance, Asymmetric Loss is a strong default.ā€ (Both steps are sketched below.)
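
Reusing the cache sketch from earlier, both promotion steps can be drafted as below; summarize() and distill() are hypothetical stand-ins for the LLM calls that write the evaluative brief and the portable tips.

```python
def summarize(raw_logs):
    # Stand-in: the real system asks an LLM for a decision-ready phase brief.
    return f"Phase brief over {len(raw_logs)} runs: resolution helped more than optimizer tweaks."

def distill(phase_briefs):
    # Stand-in: the real system asks an LLM for durable, task-agnostic tips.
    return "For class imbalance, Asymmetric Loss is a strong default."

def end_of_phase(cache):
    cache.promote_phase(summarize(cache.l1_experience))   # phase-level: condense, then evict raw logs

def end_of_task(cache):
    cache.promote_task(distill(cache.l2_knowledge))       # task-level: store reusable wisdom in L3
```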

Recipe with concrete example flow:

  1. Input → Prefetch: Read task description (e.g., plant pathology). Retrieve L3 wisdom about multilabel classification and augmentations.
  2. Initial Code: Draft a baseline (e.g., ViT-B/16, 224x224), train, print Val F1, write submission.csv.
  3. Plan Phase 1: Three directions—(A) increase resolution to 384, (B) swap backbone to ConvNeXt-L, (C) add Asymmetric Loss.
  4. Parallel Runs: Execute A, B, C; collect logs in L1.
  5. Phase Promotion to L2: Summarize resultsā€”ā€œB and C helped; A helped a bit; combine B+C; stop MixUp-only.ā€ Remove raw logs.
  6. Plan Phase 2: Try ConvNeXt-L + Asymmetric Loss + tuned LR schedule; compare cosine vs step decay; adjust augmentations.
  7. Repeat: Execute, summarize, refine.
  8. Task-level Promotion to L3: Distill the general recipe for similar image multilabel tasks; store with embedding key.

Secret sauce:

  • Structural differentiation (L1/L2/L3) decouples fast-changing execution from stable strategy and reusable wisdom.
  • Migration policies keep the right data at the right time, preventing context saturation while preserving the decision backbone.
  • Phase boundaries provide natural checkpoints for reflection and consolidation.

04Experiments & Results

🄪 Concept — MLE-Bench:

  • What it is: A test of 75 real Kaggle-style tasks to measure how well agents do full ML pipelines.
  • How it works: Agents get 24 hours per task to produce valid submissions and compete on leaderboard-like metrics.
  • Why it matters: It simulates real-world MLE with noisy logs, delayed feedback, and many iterations.
  • Example: Tasks include tabular, image, and text problems with different splits and metrics.

🄪 Concept — Medal Rate:

  • What it is: The percentage of tasks where an agent reaches Bronze/Silver/Gold-level performance.
  • How it works: Compare an agent’s score to competition thresholds (median, silver+, gold), then average across tasks.
  • Why it matters: It’s a user-friendly way to read overall success—like letter grades.
  • Example: 56.44% medal rate is like getting an A/Aāˆ’ when many peers are at B; the snippet below shows the arithmetic.
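
The arithmetic is simple. A small, hedged sketch (the thresholds are invented for illustration and assume higher scores are better; MLE-Bench defines the real per-competition cutoffs):

```python
def medal_rate(results):
    """results: list of dicts like {"score": float, "bronze_threshold": float},
    assuming higher scores are better for every task."""
    medals = sum(1 for r in results if r["score"] >= r["bronze_threshold"])
    return 100.0 * medals / len(results)

example = [
    {"score": 0.91, "bronze_threshold": 0.88},   # medal
    {"score": 0.74, "bronze_threshold": 0.80},   # no medal
]
print(f"Medal rate: {medal_rate(example):.1f}%")  # -> 50.0%
```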

The test: Researchers measured medal rate (Bronze/Silver/Gold), valid submission rate (does the run finish with a legal submission), and Median+ (beats at least half of the human participants). They compared ML-Master 2.0 against strong baselines including OpenHands, MLAB, AIDE, R&D-Agent, AIRA-dojo, FM Agent, MLE-STAR, Thesis, and Leeroo across low/medium/high difficulty.

The competition: Some baselines focus on iterative refinement, others on search/evolution, yet most manage memory either with large flat histories or with heuristics, not with cognitive differentiation plus rules for promotion and reuse.

The scoreboard (with context):

  • Overall medal rate: 56.44% for ML-Master 2.0. Think of it as consistently winning medals in more than half of all races—state-of-the-art among listed methods.
  • By difficulty: Low 75.8% (dominant), Medium 50.9% (first), High 42.2% (first). That means it maintains edge even when tasks get tough.
  • Robustness: 95.6% valid submission rate—so it rarely breaks—and 63.1% Median+, meaning it beats at least half the human participants in most tasks.
  • Relative gain: Compared to the previous ML-Master, the overall medal rate nearly doubles (from 29.3% to 56.4%, as reported in the paper), showing the power of HCC and its promotion rules.

Ablations (what happens if you remove pieces):

  • Remove L1 (Experience): the valid submission rate falls to ~54.5% and the medal rate to ~22.7%—without detailed working memory, the agent can’t fix bugs or respond to errors.
  • Remove L2 (Knowledge): medals drop—keeping everything raw or everything summarized doesn’t let strategy crystallize.
  • Remove L3 (Wisdom): above-median and medal rates decrease—starting cold wastes time rediscovering the basics.

Surprising findings:

  • Context control matters more than sheer size: With HCC, peak context shrank from >200k to ~70k tokens on a long task, yet performance improved. Less clutter, more clarity.
  • Compounding improvement over time: Medal rate curves climbed steadily as phases accumulated—evidence that cognitive accumulation truly compounds.
  • Warm-start pays off: Preloading with wisdom from 407 prior tasks boosted early phases, especially in medium/high complexity where blind exploration is costly.

05Discussion & Limitations

Limitations:

  • Domain scope: Results are on MLE-Bench; although diverse, it’s still a computational sandbox. Physical labs or extremely long multi-week pipelines may add new constraints.
  • Resource demands: The setup used many CPUs/GPUs and large RAM/SSD. Not every team can afford this scale, especially for parallel exploration.
  • Promotion quality: Phase/task summaries depend on LLM reasoning. If summarization misses a subtle bug or falsely crowns a weak idea, the agent may lock into suboptimal paths.
  • Similarity retrieval: Prefetching hinges on good task embeddings. If the descriptor is poor or the similarity threshold is off, wisdom fetched may mislead early phases.
  • Catastrophic accumulation: Poor governance (thresholds, timing) could bloat L2/L3 or harden bad habits; careful policies are essential.

Required resources:

  • Reliable execution environment (OS, packages, GPU drivers), fast storage for logs and checkpoints, and stable internet if needed for tools.
  • Access to strong LLMs for coding and promotion; summarization quality influences L2/L3 quality.
  • Time budgets that allow multiple phases; ultra-short budgets limit accumulation benefits.

When not to use:

  • Tiny, one-shot tasks where a simple script suffices; HCC overhead may be unnecessary.
  • Highly volatile tasks where rules change every hour; stored wisdom may go stale faster than it helps.
  • Strict real-time systems where summarization delays are unacceptable.

Open questions:

  • Automated policy learning: Can the agent learn optimal promotion thresholds and phase lengths from data rather than fixed heuristics?
  • Cross-domain transfer: How well does L3 wisdom move from tabular to images to text—or from Kaggle-like tasks to industrial settings?
  • Trust and verification: How can we audit L2/L3 to avoid locking in clever but brittle tricks?
  • Multi-agent collaboration: Can agents share L3 across a team and negotiate conflicts between their wisdom libraries?
  • Life-long maintenance: What pruning and versioning keep L3 fresh without losing rare but valuable tricks?

06Conclusion & Future Work

Three-sentence summary: ML-Master 2.0 turns long, messy project histories into a neat ladder of experience → knowledge → wisdom, managed by a three-level cache. This structured evolution keeps the agent strategically focused for days, enabling steady, compounding improvement. On 75 real Kaggle tasks, it set a new bar for medal rate while using less cluttered context.

Main achievement: Proving that governed cognitive accumulation—via Hierarchical Cognitive Caching and promotion policies—unlocks ultra-long-horizon autonomy far better than simply stuffing bigger contexts.

Future directions: Learn promotion and retrieval policies automatically, stress-test cross-domain transfer, and build team-level wisdom sharing with auditing. Explore hybrid setups that mix learned scoring with human-in-the-loop checkpoints for safety. Extend beyond MLE-Bench to real-world, longer-latency research loops (e.g., simulations with delayed signals).

Why remember this: It reframes memory not as a bigger bag but as a smarter brain. By separating what’s fresh, what’s settled, and what’s timeless, ML-Master 2.0 shows how AI can work like seasoned scientists—keeping the details that matter, the lessons that last, and the playbooks that travel.

Practical Applications

  • Bootstrap new ML projects by prefetching prior wisdom templates matched to the task descriptor.
  • Run multi-direction phase plans (model, data, loss) in parallel, then promote concise L2 summaries to steer the next phase.
  • Set thresholds for when to promote from L1 to L2 to keep prompts lean without losing key insights.
  • Build a shared L3 wisdom library across a team so future tasks start with strong priors.
  • Use context-hit rules in prompts to combine raw current errors with summarized past phases for focused debugging.
  • Automate task-level promotion to produce reproducible DATA/MODEL summaries for documentation and onboarding.
  • Track context length over time and trigger promotion when growth crosses safe limits.
  • Add similarity-based retrieval to your agent so it picks the right prior pipelines for new problems.
  • Instrument valid submission checks early to raise the system’s performance floor.
  • Log per-phase decisions and outcomes to enable postmortems and prevent repeating dead ends.
#Hierarchical Cognitive Caching#cognitive accumulation#ultra-long-horizon autonomy#machine learning engineering#context management#LLM agents#memory hierarchy#phase-level promotion#task-level distillation#MLE-Bench#medal rate#knowledge distillation#semantic retrieval#prior wisdom#research planning