Nested Learning: The Illusion of Deep Learning Architectures
Key Summary
- The paper introduces Nested Learning, a new way to build AI that learns in layers (like Russian dolls), so each part can update at its own speed and remember different things.
- It shows that common training tools (like Adam or momentum) are actually memory systems that compress and organize gradients over time.
- It reframes in-context learning as a natural result of having multiple learning levels, not just a Transformer trick.
- The authors propose Delta Gradient Descent (DGD), which adjusts learning using both the new data and the model’s current state, helping with non-i.i.d. data.
- They design a Continuum Memory System (CMS) where memories update at many time scales, reducing forgetting and enabling recovery of older knowledge.
- They combine these ideas into Hope, a self-modifying sequence model with CMS that shows promising results in continual learning, long-context reasoning, and few-shot tasks.
- They also propose an optimizer, M3, that mixes multiple momentums at different time scales and uses orthogonalization to remember more of the past.
- Overall, the paper argues that architecture and optimization are two sides of the same nested system and should be co-designed.
- Experiments suggest improvements (up to about 15% on some tasks) in continual and long-context benchmarks compared to strong baselines.
Why This Research Matters
AI systems today rarely keep learning after deployment without costly retraining or risking forgotten skills. Nested Learning offers a blueprint for models that adapt at multiple speeds, like the human brain’s rhythms. This can make assistants that grow with families, tutors that remember student progress, and robots that refine habits over months. Better memory handling reduces errors when tasks drift over time. Co-designing architecture and optimizer as one nested system could make future AI more reliable, resilient, and personalized.
Detailed Explanation
01 Background & Problem Definition
You know how some video games let you upgrade different parts of your character at different times—speed today, armor tomorrow, skills next week—so you get better in many ways, not just one? AI hasn’t really done that. Most modern deep learning models, especially big language models (LLMs), do a ton of learning during pre-training, then mostly freeze. They can use what’s in the current prompt (in-context learning), but they don’t keep learning from life as it happens.
🍞 Hook: Imagine a friend who remembers everything from years ago but forgets what happened five minutes ago. That’s like anterograde amnesia. Many LLMs act a bit like that: they’re great with what’s in their context window and what they learned long ago during pre-training, but they don’t write new long-term memories after deployment.
🥬 The Concept (Gradient Descent):
- What it is: Gradient descent is a step-by-step way for a model to make better guesses by moving a little downhill on the error curve each time.
- How it works:
- Make a prediction
- Measure how wrong it is, then find which way to nudge each parameter to reduce the error (the “gradient” gives that direction)
- Nudge parameters a small step to reduce that error
- Repeat many times
- Why it matters: Without it, models can’t learn from data efficiently. 🍞 Anchor: Like rolling a marble down a bumpy hill toward the lowest point; each roll is a learning step.
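If you like seeing it in code, here is a tiny runnable sketch of that loop; the target value and step size are made up for illustration:

```python
# Gradient descent on a one-parameter model with loss L(w) = (w - 3)^2.
# The target value 3.0 and the learning rate are made up for illustration.

def grad(w):
    return 2.0 * (w - 3.0)       # derivative of (w - 3)^2: the "uphill" direction

w = 0.0                          # initial guess
lr = 0.1                         # step size: how big each nudge is
for step in range(50):
    w -= lr * grad(w)            # nudge a small step downhill
print(round(w, 4))               # approaches 3.0, the bottom of the error "hill"
```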
🥬 The Concept (Associative Memory):
- What it is: Associative memory links a key to a value—like “question → answer” or “face → name.”
- How it works:
- Store pairs (keys, values)
- When a query arrives, find matching keys
- Return the linked value or a smart mix
- Why it matters: Without it, models can’t quickly recall useful facts or patterns. 🍞 Anchor: You hear the first bars of a song (key) and instantly remember the lyrics (value).
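A few lines of code capture the same write/read pattern; the sizes and data below are made up, and the keys are chosen orthonormal so recall is exact:

```python
import numpy as np

# A linear associative memory: "write" key->value pairs as a sum of outer
# products, "read" with a single matrix-vector product.

rng = np.random.default_rng(0)
d_k, d_v, n = 8, 4, 3

Q, _ = np.linalg.qr(rng.standard_normal((d_k, n)))
keys = Q.T                                # n orthonormal keys
values = rng.standard_normal((n, d_v))

M = np.zeros((d_v, d_k))
for k, v in zip(keys, values):
    M += np.outer(v, k)                   # write: bind this value to this key

recalled = M @ keys[1]                    # read: query with key 1
print(np.allclose(recalled, values[1]))   # True: the linked value comes back
```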
🥬 The Concept (In-Context Learning):
- What it is: In-context learning means a model figures out a new task from examples in its prompt without changing permanent weights.
- How it works:
- Read the provided examples
- Infer the pattern
- Apply the pattern to new items in the same context
- Forget when the context disappears
- Why it matters: It gives quick adaptability, but memories vanish with the prompt. 🍞 Anchor: Learning a new card game by watching a few rounds, then playing okay for that night—but forgetting the rules next day.
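Here is the idea in miniature, with a stand-in "model" whose weights never change; it only reads the examples sitting in its context. This is purely illustrative, not how an LLM is actually implemented:

```python
import numpy as np

# In-context learning in miniature: no weights change anywhere. The "model"
# reads the examples in its context and applies an attention-like
# nearest-neighbor rule over them.

context = [                                   # (input, label) pairs in the prompt
    (np.array([1.0, 0.0]), "cat"),
    (np.array([0.9, 0.1]), "cat"),
    (np.array([0.0, 1.0]), "dog"),
]

def predict(query):
    sims = [x @ query for x, _ in context]    # similarity to each in-context example
    return context[int(np.argmax(sims))][1]

print(predict(np.array([0.1, 0.9])))          # "dog", learned only from the prompt
# Empty the context and the skill vanishes: nothing was written into weights.
```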
The problem: Stacking more layers has given us powerful pattern recognizers, but several issues remain:
- Computational depth doesn’t always grow with more layers; some models still can’t run complex multi-step algorithms well.
- Capacity gains plateau in some components even as we scale parameter counts up.
- Training can settle into so-so solutions if the optimizer’s memory of past gradients is too shallow.
- Most crucially, models don’t naturally keep learning after pre-training without extra machinery and often forget older skills when they do (catastrophic forgetting).
Failed attempts so far:
- Re-training or fine-tuning after deployment: Expensive, risky for forgetting, and not continuous.
- External memory add-ons: Help, but add complexity and still need careful training.
- Learned optimizers: Promising, but brittle, compute-heavy, and hard to generalize.
The gap: We lack a simple, unified way to let different parts of a model learn at their own speeds, share knowledge across time, and keep memories from instantly fading or colliding.
Brain hint: Human brains use many time scales at once (fast gamma, medium beta, slow theta/delta waves). Different rhythms help us rapidly adapt and also lay down longer-term memories. The brain’s wiring is surprisingly uniform and reusable; areas can take on new roles when needed. This suggests a design principle: a uniform set of modules, each updating at its own frequency, working together.
This paper’s idea: See a whole model—not just its layers, but also its optimizer—as a family of nested learning problems. Each piece has its own “context flow” (what it pays attention to) and its own update frequency (how fast it learns). When you look this way, you realize:
- Optimizers like Adam are themselves associative memories, compressing gradient information over time.
- Pre-training is like in-context learning over a giant context (the whole training set).
- Transformers secretly already sit at two extreme “frequencies”: attention’s memory updates at every new token (a non-parametric read over the whole context), while MLP weights update at frequency zero during inference (frozen).
Why this matters in daily life: If your digital tutor, assistant, or household robot could keep learning gently from new experiences without forgetting old ones, it would feel less like a frozen app and more like a helpful companion that grows with you. That requires models to manage memories across many time scales and to treat optimization as part of learning, not just a separate training step.
02 Core Idea
🍞 Hook: Imagine a stack of Russian dolls. Each doll is complete, but the smaller ones live inside the bigger ones, and each can be opened at a different time. What if a model learned like that—many little learners inside bigger ones, each with its own pace?
🥬 The Concept (Nested Learning):
- What it is: Nested Learning (NL) treats a model as several learning problems inside each other, each with its own context and update speed.
- How it works:
- Break the system into levels (fast, medium, slow)
- Each level optimizes its own objective on its own context
- Levels pass knowledge across (by conditioning, initialization, gradients, or generation)
- The whole system learns together as an interconnected module
- Why it matters: Without nesting, we freeze most parts and lose rich, multi-timescale learning. 🍞 Anchor: A classroom with daily quizzes (fast), weekly projects (medium), and semester portfolios (slow) all feeding into each other.
Multiple analogies to the same idea:
- Orchestra: Violins (fast) play notes quickly, cellos (medium) carry themes, and bass (slow) anchors harmony. Together they make music.
- City traffic: Green lights change quickly (fast control), lane signs update monthly (medium), and road maps change yearly (slow). All coordinate flow.
- Sports team: Players adjust instantly (fast), coaches tweak plays weekly (medium), and managers set season strategy (slow). That’s nested learning.
🥬 The Concept (Context Flow):
- What it is: Context flow is the stream of information a level uses to learn (tokens for attention, gradients for optimizers, episodes for meta-learners).
- How it works:
- Define what a level sees (its context)
- Define how it updates from that context
- Decide how its state influences other levels
- Why it matters: If levels stare at the wrong stream (or none), they can’t learn the right things. 🍞 Anchor: Different rivers feed a lake from different directions; each river (context) matters.
🥬 The Concept (Multi-Level Optimization):
- What it is: The whole model is solved by several connected optimizations, not one monolithic training loop.
- How it works:
- Fast level adjusts quickly (e.g., attention, fast memory)
- Mid level updates occasionally (e.g., momentum, deep memory)
- Slow level updates rarely (e.g., base weights)
- Information flows between them via defined channels
- Why it matters: Without this structure, models adapt too slowly or forget too fast. 🍞 Anchor: A bakery runs ovens (fast), inventory (medium), and supplier contracts (slow). You need all three layers to deliver fresh bread daily.
Before vs. After:
- Before: Architecture and optimization were separate. In-context learning seemed like a Transformer-only quirk. Memory was split into “short-term vs. long-term” boxes.
- After: Architecture and optimization are two parts of one nested system. In-context learning is what happens whenever a fast level adapts on its context. Memory is a spectrum across many time scales.
Why it works (intuition):
- Levels can specialize: fast levels capture recent quirks; slow levels preserve robust structure.
- Optimizers as memories: momentum and preconditioners are not just math tricks; they’re learnable compressors of gradient history.
- Self-referential updates: A level can learn how to update itself based on current state and input, like DGD, making learning smarter for non-i.i.d. data.
Building blocks introduced in the paper:
- Expressive Optimizers: Treat optimizers (e.g., Adam, Muon) as associative memories and generalize them to hold richer, longer histories.
- Self-Modifying Learning Module: A sequence model that can learn its own update rules (learn how to learn).
- Continuum Memory System (CMS): Replace a single “short/long-term” split with many memories updating at different frequencies, allowing partial recovery if something is forgotten.
Put simply, the “aha!” is one sentence: Treat everything—layers and optimizers—as nested learners with their own contexts and speeds, then let them talk to each other.
03 Methodology
At a high level: Input tokens → Fast level writes/reads a short-term memory (like attention or linear fast weights) → Mid level updates deeper memories and optimizer states less often → Slow level updates base weights rarely → Output tokens. All levels exchange knowledge through defined channels (conditioning, initialization, backprop, or generation).
🥬 The Concept (Self-Modifying Learning Module):
- What it is: A model component that learns not just what to predict, but how to update itself.
- How it works:
- Observe input and current state
- Propose an update rule (e.g., how big a step to take, what to keep/forget)
- Apply the update and measure improvement
- Learn to make better updates next time
- Why it matters: Without this, the model can’t tailor its own learning to the data’s rhythms. 🍞 Anchor: It’s like a student who also learns better study habits over time, not just the subject.
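A hypothetical sketch of that loop is below. The controller here is a fixed stand-in that maps the current input and state to a step size and a retention gate; in the paper's full module, the controller itself is learned by a slower level:

```python
import numpy as np

# Hypothetical sketch of a self-modifying update. The controller's form is an
# assumption for illustration, not the paper's parameterization.

rng = np.random.default_rng(0)
d = 4
W = np.zeros((d, d))                       # fast weights (the "student")

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def controller(x, W):
    lr = 0.2 * sigmoid(x @ x - d)            # bigger steps on "louder" inputs (assumed)
    keep = sigmoid(4.0 - np.linalg.norm(W))  # forget more as the state saturates (assumed)
    return lr, keep

for _ in range(200):
    x = rng.standard_normal(d)
    y = np.tanh(x)                         # toy target for this token
    grad = np.outer(W @ x - y, x)          # gradient of 0.5 * ||W x - y||^2
    lr, keep = controller(x, W)
    W = keep * W - lr * grad               # the module applies its own update rule
```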
Step-by-step recipe of the NL system:
- Define levels and frequencies:
- Fast level: updates per token (e.g., fast weights, attention-like memory, in-context learners)
- Medium level: updates per chunk or episode (e.g., deep memory, momentum)
- Slow level: updates per dataset pass or phase (e.g., base MLP weights)
- Assign context flows:
- Fast sees tokens and local errors
- Medium sees sequences of gradients or chunk summaries
- Slow sees broad objectives (e.g., next-token prediction) and meta-learns initial states
- Choose knowledge transfer methods:
- Conditioning (non-parametric): fast output depends on medium state
- Initialization (meta-learning): slow level optimizes initial fast-memory states for quick adaptation
- Backprop across levels: gradients flow through level boundaries
- Generation: one level generates weights or contexts for another (hypernetworks or optimizer data)
- Train the whole module:
- Alternate or interleave updates according to each level’s schedule
- Keep each level’s objective and retention regularizer simple and stable
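Here is a schematic version of that schedule in code; the three levels, their periods, and the toy objective are illustrative choices, not the paper's configuration:

```python
import numpy as np

# Three "levels" as parameter blocks with their own update periods. Only the
# levels scheduled at a given step touch their parameters (and spend compute).

rng = np.random.default_rng(0)
levels = {
    "fast":   {"period": 1,    "theta": np.zeros(4)},   # per token
    "medium": {"period": 32,   "theta": np.zeros(4)},   # per chunk
    "slow":   {"period": 1000, "theta": np.zeros(4)},   # per phase
}

for step in range(3000):
    x = rng.standard_normal(4)              # next item in the context flow
    for lv in levels.values():
        if step % lv["period"] == 0:        # only scheduled levels update
            grad = lv["theta"] - x          # gradient of 0.5 * ||theta - x||^2
            lv["theta"] -= 0.1 * grad

# The fast level chases recent inputs; the slow level barely moves, keeping a
# stable long-run summary.
```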
Optimizers as associative memories (how, why, example):
- How: Momentum accumulates gradients (keys) into a state (values) via EMA or more advanced rules.
- Why: It gives the optimizer a longer, more informative view of the landscape to avoid bad local choices.
- Example: Adam keeps running averages of gradients and their squares (roughly a mean and a variance of the gradient history); Muon orthogonalizes steps to reduce interference.
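Written out, Adam's state really is a pair of running memories over the gradient stream. This is standard Adam with its usual defaults, applied to a toy quadratic:

```python
import numpy as np

# Adam as two running "memories" of the gradient stream: an exponential moving
# average of gradients (first moment) and of squared gradients (second moment).

def adam_step(theta, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # write the new gradient into memory 1
    v = b2 * v + (1 - b2) * grad**2     # write its magnitude into memory 2
    m_hat = m / (1 - b1**t)             # bias-correct short histories
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta = np.array([4.0, -2.0])
m = v = np.zeros(2)
for t in range(1, 501):
    grad = 2 * theta                    # gradient of ||theta||^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(np.round(theta, 2))               # far closer to the minimum at [0, 0] than the start
```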
🥬 The Concept (Delta Gradient Descent, DGD):
- What it is: A learning rule that updates weights using both the current gradient and the current weight state, adding an adaptive decay tied to the present input.
- How it works:
- Compute the usual gradient from the current example
- Mix in a state-aware term that gently pulls weights toward stability where needed
- Take the update step that balances new info with existing structure
- Repeat per token/sample
- Why it matters: Ordinary updates treat each step in isolation; DGD respects that sequence data are not i.i.d. and avoids overreacting. 🍞 Anchor: Like steering a bike knowing your current speed and leaning angle, not just the direction of the next turn.
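Since the exact DGD formula is not reproduced in this summary, here is a hedged sketch of the idea; the input-dependent gate alpha() is an assumption for illustration only:

```python
import numpy as np

# Sketch of the DGD idea: mix the usual gradient step with a decay that depends
# on the current input, so each update respects the state and the data in front
# of it. alpha() below is assumed, not the paper's exact rule.

rng = np.random.default_rng(0)
d = 4
W = np.zeros((d, d))
lr = 0.1

def alpha(x):
    # Assumed gate in (0, 1): decay weights less on faint inputs, more on
    # strong ones that will rewrite the state anyway.
    return 0.1 * (1.0 - 1.0 / (1.0 + x @ x))

for _ in range(500):
    x = rng.standard_normal(d)
    y = np.roll(x, 1)                      # toy sequential target (non-i.i.d. stand-in)
    grad = np.outer(W @ x - y, x)          # gradient of 0.5 * ||W x - y||^2
    W = (1.0 - alpha(x)) * W - lr * grad   # state-aware decay plus gradient step

print(np.round(W, 1))  # a damped copy of the "roll by one" map: decay trades fit for stability
```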
🥬 The Concept (Continuum Memory System, CMS):
- What it is: Many memories arranged by update frequency, from fast to slow, instead of a single short-vs-long split.
- How it works:
- Build several MLP-based memories that update on different schedules (e.g., every token, every 32 tokens, every 1,000 tokens)
- Each block compresses its own context into its parameters
- Aggregate their outputs (e.g., weighted sum or learned aggregator)
- Use meta-learning to set good initial states so fast memories adapt quickly without drifting
- Why it matters: If one memory overwrites something, a slower memory can still hold it, enabling recovery and reducing catastrophic forgetting. 🍞 Anchor: Multiple notebooks: a scratch pad (fast), a daily journal (medium), and a yearly scrapbook (slow). If you tear out a scratch page, the memory still lives in the journal or scrapbook.
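A minimal sketch of that schedule, with plain matrices standing in for the paper's MLP memory blocks and the same example periods as above:

```python
import numpy as np

# A Continuum Memory System in miniature: several memories watch the same
# stream, but each writes on its own schedule, so knowledge lives at many
# time scales at once.

rng = np.random.default_rng(0)
d = 8
periods = [1, 32, 1000]                        # per token, per chunk, long horizon
memories = [np.zeros((d, d)) for _ in periods]
buffers = [np.zeros((d, d)) for _ in periods]

def cms_read(query):
    # Aggregate the levels' outputs (a learned aggregator in the real system).
    return sum(M @ query for M in memories) / len(memories)

for step in range(1, 4001):
    k, v = rng.standard_normal(d), rng.standard_normal(d)
    for i, p in enumerate(periods):
        buffers[i] += np.outer(v, k)           # every level sees every token...
        if step % p == 0:                      # ...but writes only on schedule
            memories[i] = 0.9 * memories[i] + 0.1 * buffers[i] / p
            buffers[i][:] = 0.0                # 4000, 125, and 4 writes respectively

out = cms_read(rng.standard_normal(d))         # a query mixes fast and slow knowledge
```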
🥬 The Concept (Continual Learning):
- What it is: The ability to keep learning new tasks over time without erasing what you learned before.
- How it works:
- Use fast memories for rapid task pickup
- Periodically consolidate useful bits into slower memories
- Balance plasticity (learn new things) and stability (keep old things)
- Rehearse or replay through architectural design (e.g., CMS) rather than external buffers
- Why it matters: Without it, models either stop learning or forget. 🍞 Anchor: Learning new piano pieces while still remembering old ones because you practice at different tempos and schedules.
🥬 The Concept (Expressive Optimizers):
- What it is: Upgraded optimizers that store richer, longer gradient histories and map steps into smarter directions.
- How it works:
- Treat momentum as a deep memory (possibly MLP-based, with nonlinear outputs)
- Add multiple momentums at different time scales
- Orthogonalize steps (e.g., Newton–Schulz) to avoid interference
- Learn which history to keep or forget (delta rules)
- Why it matters: Standard EMA forgets too fast; expressive optimizers can steer better during long, changing tasks. 🍞 Anchor: Not just remembering yesterday’s weather, but trends over weeks and seasons to plan a trip.
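To make "orthogonalize steps" concrete, here is the classic Newton–Schulz iteration; Muon uses a tuned polynomial variant of the same idea:

```python
import numpy as np

# Newton-Schulz pushes a matrix toward its nearest orthogonal matrix (its
# polar factor), evening out the step's strength across directions.

def newton_schulz(G, iters=20):
    X = G / np.linalg.norm(G)              # scale down so the iteration converges
    I = np.eye(G.shape[1])
    for _ in range(iters):
        X = 0.5 * X @ (3.0 * I - X.T @ X)  # singular values flow toward 1
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 6))
U = newton_schulz(G)
print(np.round(U.T @ U, 2))                # approximately the 6x6 identity
```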
Putting the pieces together: Hope and M3
🥬 The Concept (Hope, a self-referential sequence model with CMS):
- What it is: A sequence model that learns to modify itself, equipped with a CMS so different parts update at different speeds.
- How it works:
- Fast level does in-context learning (attention or fast weights)
- CMS layers update on schedules, storing compressed knowledge at multiple time scales
- Meta-learned initial states let fast levels adapt quickly
- The whole system backprops across levels or uses initialization/conditioning to transfer knowledge
- Why it matters: It combines quick adaptation with lasting knowledge, aiding continual and long-context tasks. 🍞 Anchor: A student who takes notes every class (fast), organizes weekly (medium), and reviews before finals (slow).
🥬 The Concept (M3, Multi-scale Momentum Muon optimizer):
- What it is: An optimizer with multiple momentum terms at different time scales plus orthogonalization, inspired by CMS.
- How it works:
- Keep a fast momentum (per step) and a slow momentum (per chunk)
- Map both through orthogonalization (Newton–Schulz) to reduce interference
- Combine them (e.g., weighted) to choose a smart direction
- Update parameters
- Why it matters: It “remembers” both recent and older gradient structure, helping navigation of nonstationary tasks. 🍞 Anchor: Using both your short-term memory and your long-term plan when hiking a changing trail.
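A hedged sketch of such a step is below; the decay rates, mixing weight, and toy objective are assumptions for illustration, not the paper's settings:

```python
import numpy as np

# M3-style step: keep a fast and a slow momentum of the gradients,
# orthogonalize each with Newton-Schulz, and mix them into one direction.

def newton_schulz(G, iters=20):
    X = G / np.linalg.norm(G)
    I = np.eye(G.shape[1])
    for _ in range(iters):
        X = 0.5 * X @ (3.0 * I - X.T @ X)
    return X

rng = np.random.default_rng(0)
n = 6
W = rng.standard_normal((n, n))
m_fast = np.zeros((n, n))                   # short memory of recent gradients
m_slow = np.zeros((n, n))                   # long memory of older gradients

for step in range(500):
    G = W - np.eye(n)                       # gradient of 0.5 * ||W - I||^2
    m_fast = 0.90 * m_fast + 0.10 * G
    m_slow = 0.99 * m_slow + 0.01 * G
    direction = 0.5 * newton_schulz(m_fast) + 0.5 * newton_schulz(m_slow)
    lr = 0.05 / (1 + 0.02 * step)           # decay the step so W can settle
    W -= lr * direction

print(np.round(np.linalg.norm(W - np.eye(n)), 2))  # far smaller than at the start
```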
Training and efficiency notes:
- Only the levels scheduled to update at a given time step consume compute; others remain fixed.
- CMS enables sequence parallelization within chunks (fast training), similar to modern RNN/linear-attention tricks.
- You can initialize CMS blocks from a pre-trained Transformer’s MLPs to get strong starting points and then let them adapt at their own speeds.
04 Experiments & Results
The Test: The authors evaluated whether nesting levels and using CMS/expressive optimizers help with:
- Continual learning and in-context tasks (learning a new language, class-incremental learning, QA on new corpora)
- Long-context understanding (needle-in-a-haystack, BABILong)
- Language modeling and common-sense reasoning
- In-context recall/memorization and language recognition tasks
- Optimizer comparisons, including the proposed M3
What they measured and why:
- Accuracy/F1 on task performance (Do we get the right answers?)
- Few-shot/zero-shot adaptation (How fast does the model pick up new tasks?)
- Forgetting metrics (How much do older tasks degrade?)
- Long-context retrieval success (Can it find and use information far back in the input?)
- Efficiency proxies (How much extra overhead do multi-level updates add?)
The competition (baselines):
- Standard Transformers and strong LLM-style backbones
- Modern RNN/linear-attention models (fast-weight, delta-rule variants)
- Conventional optimizers (SGD, AdamW, Muon)
The scoreboard (with context):
- On continual learning setups, systems using NL ideas (Hope + CMS, and expressive optimizers like M3) showed improved retention and faster re-adaptation. Reported gains reached up to about 15% on selected tasks—like moving from a solid B to an A on tricky evolving exams.
- On long-context reasoning (e.g., needle-in-a-haystack), CMS-equipped models were better at not “dropping the needle” as the haystack grew. Think of it as remembering where you put your keys, even after a long day.
- For few-shot generalization and in-context recall, meta-learned initial states for fast memories helped the model “click” into a task more reliably with just a handful of examples.
- For optimizers, M3 often navigated changing objectives better than single-momentum baselines, highlighting the value of multi-time-scale gradient memories and orthogonalized steps.
Surprising findings:
- Optimizers as memories is not just a metaphor: treating them like associative memories led to practical designs (e.g., multiple momentums, delta-style forgetting) that made a measurable difference.
- Pre-training looks like mega in-context learning: viewing it this way clarifies why models can be great adapters in some settings but stubbornly static after deployment—it’s a levels-and-frequencies mismatch.
- CMS can sometimes “resurrect” knowledge: if a faster memory overwrote some patterns, a slower memory may still hold them, allowing recovery through backprop-linked initialization.
Caveats and fairness:
- Multi-level systems can add overhead; the authors aimed to update only the scheduled parts to control cost.
- Some results are proof-of-concept; broader scaling studies and ablations would help separate which components matter most in which settings.
- Benchmarks include diverse tasks, but more real-world, online settings (no clear train/test split) would further validate continual aspects.
Takeaway: Across varied tasks, nesting levels, using state-aware rules like DGD, and arranging CMS to span many timescales gave the models sturdier memory and quicker adaptation without dramatic forgetting, beating strong baselines on several fronts.
05 Discussion & Limitations
Limitations:
- Complexity and compute: Multi-level systems are trickier to implement and can cost more to run, especially if many levels update frequently.
- Stability and tuning: Each level has its own objective, schedule, and retention. Getting these to play nicely may require careful hyperparameter search and monitoring.
- Scaling orthogonalization: Methods like Newton–Schulz can be costly for very large matrices; approximations or low-rank tricks may be needed.
- Partial persistence: While CMS helps, it’s not magic. Catastrophic forgetting can still occur if levels are poorly coordinated or if new data are extremely adversarial.
Required resources:
- Compute with support for chunked/parallel sequence training and selective level updates
- Optimizer libraries that allow multi-momentum and custom preconditioning
- Datasets organized into episodes/streams to exercise different time scales
When not to use:
- Tiny datasets or one-off tasks where simple fine-tuning suffices
- Hard real-time systems with ultra-tight latency budgets if multi-level updates can’t be amortized
- Scenarios where strict immutability is a feature (e.g., legally frozen models)
Open questions:
- Best ways to choose and adapt frequencies: Can levels auto-tune their update schedules based on data drift?
- Theory of interference: How do we formally measure and minimize cross-level interference for both activations and gradients?
- Memory capacity vs. cost: What is the Pareto frontier between number of levels, performance gains, and compute?
- Task-aware optimizers: How do we co-design architecture and optimizer for specific domains (e.g., code, math, speech) under NL?
- Consolidation across sleep-like phases: Can offline consolidation (e.g., replay with CMS) safely improve long-term retention without external buffers?
Honest assessment: NL offers a powerful lens and practical prototypes (DGD, CMS, M3, Hope), but it shifts complexity from a single monolithic learner to a family of learners. The promise is big—continuous, resilient learning—but it will take engineering discipline and more large-scale studies to make it routine.
06 Conclusion & Future Work
Three-sentence summary: This paper reframes a model and its optimizer as a nested family of learners, each with its own context and update speed. By treating optimizers as associative memories and replacing a single “short/long” memory split with a Continuum Memory System, the model adapts quickly while preserving older knowledge. Prototypes like DGD, M3, and Hope show promising gains in continual, long-context, and few-shot tasks.
Main achievement: Unifying architecture and optimization under Nested Learning, then turning that lens into practical designs—especially CMS—that reduce forgetting and improve adaptation.
Future directions:
- Auto-tuning level frequencies and retention to match data drift
- Scalable, efficient orthogonalization and deep momentum memories
- Stronger theory of cross-level interference and consolidation
- Domain-specific co-design of architectures and optimizers under NL
Why remember this: It suggests a simple, deep idea—treat everything as a learner at its own time scale—and shows how that single shift can unlock continuous improvement, smarter memory, and models that grow with us instead of freezing in place.
Practical Applications
- Personal AI tutors that track progress daily (fast), unit-by-unit (medium), and over semesters (slow) without forgetting.
- Customer support agents that learn new product updates quickly while preserving older troubleshooting knowledge.
- On-device assistants that adapt to a user’s style across days and months without full retraining.
- Robotics systems that refine routines across shifts while retaining safety-critical procedures.
- Healthcare triage models that adjust to changing clinic patterns while preserving rare-case knowledge.
- Code assistants that keep up with project conventions over sprints while remembering legacy practices.
- Document QA systems that adapt to new corpora but still retrieve older, relevant facts.
- Recommendation engines that blend fresh trends with long-standing user preferences.
- Language models that handle very long documents, recalling early details even after many pages.
- Research agents that learn from iterative experiments while retaining foundational methods.