
Progressive Residual Warmup for Language Model Pretraining

Intermediate
Tianhao Chen, Xin Xu, Lu Yin et al. Ā· 3/5/2026
arXiv

Key Summary

  • Training big Transformers can wobble at the start because every layer tries to learn all at once.
  • This paper proposes Progressive Residual Warmup (ProRes), which lets early layers learn first and deeper layers join later.
  • ProRes multiplies each layer’s residual output by a small number that starts at 0 and slowly grows to 1, with deeper layers warming up more slowly.
  • Across many model sizes (130M → 7B) and setups, ProRes reduces perplexity and improves zero-shot accuracy.
  • It especially helps hard-to-train variants like Post-LN and very deep models, making training more stable with fewer loss/gradient spikes.
  • ProRes changes the optimization path so activations grow gently and layers settle in a smoother order.
  • A simple linear schedule for the warmup works well by default; "reverse" or "equal" schedules can hurt or even diverge.
  • On out-of-distribution tests like LAMBADA and WikiText, ProRes gives even bigger gains than on the training corpus.
  • The method is tiny to implement (a scalar per layer over time) and plays nicely with existing inits and norms.
  • Bottom line: making learning time-aware and depth-aware speeds convergence, improves generalization, and stabilizes pretraining.

Why This Research Matters

Training-phase-aware methods like ProRes save compute by preventing early instability that wastes steps. They also unlock deeper models that previously underperformed or diverged, enabling richer reasoning and longer-context understanding. Because ProRes is tiny to implement, existing pipelines can adopt it quickly with little engineering effort. The approach improves not just training loss but also generalization on hard, unseen datasets, which matters for real-world reliability. It offers a clean lens on optimization: ordering who learns when can be as impactful as model size or data scale. Finally, this idea can inspire similar time-aware strategies in other domains, from multimodal Transformers to reinforcement learning fine-tuning.


Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook) You know how when building a tower of blocks, you press the lower blocks firmly before stacking higher ones? If you rush to add the top blocks while the bottom wiggles, the whole tower can wobble.

🄬 Filling (The Actual Concept)

  • What it is: Training giant language models made of Transformer layers is like stacking many blocks; the bottom (early) layers support the top (deep) layers.
  • How it works: In classic training, every layer changes the representation at the same time; this can cause chaos early on when nothing is stable yet.
  • Why it matters: Without a careful order, training can be unstable, slow to converge, and deeper layers may learn poorly.

šŸž Bottom Bread (Anchor) Imagine 120 stacked layers all pushing the representation in different directions on step 1. That’s a recipe for wobbly learning.

šŸž Top Bread (Hook) Imagine a school hallway. If every class starts switching rooms at once, the hall clogs. But if grades go in waves—6th first, then 7th—everything flows.

🄬 Filling (The Actual Concept)

  • What it is: The Transformer architecture is a stack of layers that pass messages forward and backward during training.
  • How it works (Transformer Architecture):
    1. Tokens go in.
    2. Each layer refines the token’s hidden state via attention and a feed-forward network.
    3. A residual path adds the layer’s change back to the original signal so information isn’t lost.
  • Why it matters: This stack lets models learn deep, rich patterns, but only if updates stay stable.

šŸž Bottom Bread (Anchor) Like passing a note up the bleachers: each row adds a bit, but you still keep the original note so nothing important disappears.

šŸž Top Bread (Hook) You know how you write a draft in pencil before tracing over it in pen? The light pencil marks are like a gentle change before committing heavily.

🄬 Filling (The Actual Concept)

  • What it is (Residual Connection): A shortcut that adds a layer’s proposed change to the current representation instead of replacing it.
  • How it works: Input goes in; the layer proposes an update; the shortcut adds input + update.
  • Why it matters: Without residuals, deep networks can forget or blow up signals; residuals keep learning steady and reversible.

šŸž Bottom Bread (Anchor) It’s like adjusting a photo with a small slider and keeping the original nearby.

šŸž Top Bread (Hook) If every student whispers, the classroom stays calm; if they all shout, chaos! Teachers often normalize volume so learning can happen.

🄬 Filling (The Actual Concept)

  • What it is (Normalization): A way to keep signals at similar scales so layers learn at similar speeds.
  • How it works: Measure the size of the signal and scale it to a standard range before or after residuals.
  • Why it matters: Without normalization, some layers shout, others whisper, and training gets unstable.

šŸž Bottom Bread (Anchor) It’s like setting a classroom ā€œinside voiceā€ so everyone can be heard.

šŸž Top Bread (Hook) When you learn piano, you don’t start with concert pieces. You begin with scales, then songs, then performances.

🄬 Filling (The Actual Concept)

  • What it is (Layer-wise Learning): Letting different parts of a model learn at different paces.
  • How it works: Early layers often settle faster; deeper layers refine later.
  • Why it matters: If you force everyone to go fast at once, you trip over yourself.

šŸž Bottom Bread (Anchor) Piano first: left hand basics, then right hand, then together.

šŸž Top Bread (Hook) Think of a warmup before running sprints. Jog first, then accelerate.

🄬 Filling (The Actual Concept)

  • What it is (Warmup Strategy): Start with smaller updates, then gradually increase.
  • How it works: For some steps, keep learning gentle before allowing full-strength changes.
  • Why it matters: Avoids early shock to the system that can cause instability.

šŸž Bottom Bread (Anchor) Just like muscles need a warmup, so do layers.

The World Before:

  • Transformers became the backbone of large language models, using residual connections and normalization to enable very deep stacks.
  • But early training is noisy: gradients spike, activations can grow too fast, and deeper layers may act on shaky inputs.

The Problem:

  • All layers update from step 1, so deeper layers can inject noise before early layers stabilize.
  • This can cause slow or unstable convergence, especially in very deep or Post-LN variants.

Failed Attempts:

  • Better inits and norms (Pre-LN, DeepNorm, LayerNorm Scaling) help at step 0, but they stay fixed while training dynamics change.
  • Static tricks bound updates at initialization but can be too conservative later, limiting learning capacity.

The Gap:

  • We needed a training-phase-aware method that coordinates who learns when: early layers first; deeper layers later.

Real Stakes:

  • Faster, stabler training saves compute; better depth scaling unlocks stronger models; smoother learning improves generalization—even on new data like LAMBADA and WikiText.

02 Core Idea

šŸž Top Bread (Hook) Imagine building a skyscraper. Foundation first, then floors, then the rooftop garden. You wouldn’t pour the 50th floor before the 1st cures.

🄬 Filling (The Actual Concept)

  • What it is (Progressive Residual Warmup, ProRes): A simple rule that starts each layer’s residual at zero and gradually increases it to one, with deeper layers taking longer.
  • How it works:
    1. Give each layer l a scale α(l, t).
    2. Start α(l, t) = 0 so the network behaves like identity at step 0.
    3. Warm α(l, t) up toward 1 over time; shallow layers reach 1 sooner, deep layers later.
    4. Once warmed, each layer contributes fully.
  • Why it matters: Early chaos is reduced, shallow layers settle first, deep layers refine later—leading to faster, stabler, better learning.

šŸž Bottom Bread (Anchor) It’s like dimmer switches per floor: turn on the lobby lights first, then brighten higher floors as the building stabilizes.

The "Aha!" in one sentence:

  • Let early layers learn first by gradually turning up each layer’s residual signal in depth order, so deeper layers only fully engage once their inputs are stable.

Three analogies:

  • Orchestra: Strings set tempo softly (shallow layers), then brass joins (deep layers) once the melody is clear.
  • Cooking: SautĆ© onions (base flavor) before adding spices (refinements) so nothing burns or clashes.
  • Sports: Warm up with drills (foundation), then run full plays (advanced coordination).

Before vs After:

  • Before: All layers talk loudly from the start; deeper layers act on wobbly inputs.
  • After: A calm roll-in where early layers stabilize the language foundation; deeper layers polish reasoning and long-range patterns.

Why it works (intuition, no equations):

  • Identity at initialization means zero surprise: the model starts as a pass-through, preventing early explosions.
  • Bounded updates over time and depth tame the warmup chaos but relax later so capacity is not capped.
  • Respecting the stack’s dependency ensures deeper layers build upon increasingly clean, meaningful representations.

Building blocks:

  • Residual scaling per layer over time.
  • A schedule α(l, t) that increases from 0 to 1, slower for deeper layers.
  • Plug-in simplicity: works with Pre-LN, Post-LN, Sandwich-LN, DeepNorm, LNS; pairs with various inits.
  • Default linear schedule is strong; alternatives exist (e.g., linear-square, stagewise-L) and can be tuned per architecture.

šŸž Bottom Bread (Anchor) Think of a class play: narrators (early layers) enter first, then supporting cast (mid layers), then the dancers (deep layers). Timing makes the story work.

03 Methodology

High-level pipeline: Input → Token embeddings → Stacked Transformer layers with residual warmup scales → Output logits.

Core equations with gentle, kid-friendly explanations and concrete numbers:

šŸž Top Bread (Hook) Imagine adding a small tweak to a picture, then adding that tweak to the original so you don’t lose details.

🄬 Filling (The Actual Concept)

  • What it is: The standard Pre-LN residual update.
  • How it works (formula): x_{l+1} = x_l + F(Norm(x_l)). Example: if x_l = 2.0 and the layer proposes F(Norm(x_l)) = 0.3, then x_{l+1} = 2.0 + 0.3 = 2.3.
  • Why it matters: Adds refinements without erasing the base signal.

šŸž Bottom Bread (Anchor) Like adding a 0.3 brightness increase to an image while keeping the original intact.

šŸž Top Bread (Hook) Now put a dimmer on that tweak, starting near 0 and sliding up to full brightness.

🄬 Filling (The Actual Concept)

  • What it is: ProRes scales the residual by a time-and-depth-dependent number.
  • How it works (formula): x_{l+1} = x_l + α(l, t) Ā· F(Norm(x_l)). Example: if x_l = 2.0, F(Norm(x_l)) = 0.3, and α(l, t) = 0.5, then x_{l+1} = 2.0 + 0.5 Ɨ 0.3 = 2.15.
  • Why it matters: Early on, changes are gentle; later, they reach full strength.

šŸž Bottom Bread (Anchor) Like slowly turning up a music track from quiet to normal volume.

šŸž Top Bread (Hook) Think of a relay race: first runners go early; later runners wait their turn.

🄬 Filling (The Actual Concept)

  • What it is: A default linear warmup schedule for the per-layer scale.
  • How it works (formula): α(l, t) = min(t / (T Ɨ l), 1). Example: if T = 1000, layer l = 2, and step t = 1500, then α(2, 1500) = min(1500 / 2000, 1) = 0.75.
  • Why it matters: Shallow layers reach 1 sooner; deeper layers take longer, creating a smooth, ordered rollout.

šŸž Bottom Bread (Anchor) Like opening gates along a path one after another.

šŸž Top Bread (Hook) Sometimes you normalize the tweak itself so it doesn’t overdo it.

🄬 Filling (The Actual Concept)

  • What it is: Sandwich-LN with ProRes scales the normalized residual.
  • How it works (formula): x_{l+1} = x_l + α(l, t) Ā· Norm(F(Norm(x_l))). Example: if Norm(F(Norm(x_l))) = 0.4 and α(l, t) = 0.25, then x_{l+1} = x_l + 0.25 Ɨ 0.4 = x_l + 0.1.
  • Why it matters: Keeps residual changes well-behaved while still warming them up.

šŸž Bottom Bread (Anchor) It’s like adding pre-measured spice—then turning up how much of it you mix in over time.

šŸž Top Bread (Hook) For some methods you shrink deeper layers’ tweaks simply because they’re deep.

🄬 Filling (The Actual Concept)

  • What it is: LayerNorm Scaling (LNS) shrinks each layer update by 1/√l; ProRes can still gate it over time.
  • How it works (formula): x_{l+1} = x_l + α(l, t) Ā· F(Norm(x_l)) / √l. Example: if l = 4, then 1/√l = 1/2 = 0.5. If F(Norm(x_l)) = 0.6 and α(l, t) = 0.5, the added change is 0.5 Ɨ 0.6 Ɨ 0.5 = 0.15.
  • Why it matters: LNS tames depth growth at init; ProRes adds time-awareness so learning isn’t capped later.

šŸž Bottom Bread (Anchor) Like using a smaller spoon for higher shelves but still turning up how much you pour over time.

Step-by-step recipe to train with ProRes (Pre-LN example):

  1. Choose a schedule: start with linear warmup and pick T (e.g., T=1000).
  2. For each training step t and layer l, compute α(l, t) = min(t / (T Ɨ l), 1). Example: with T = 1000, l = 3, t = 1800, α = min(1800 / 3000, 1) = 0.6.
  3. Forward pass each layer with x_{l+1} = x_l + α(l, t) Ā· F(Norm(x_l)). Example: if F(Norm(x_l)) = 0.25 and α = 0.6, add 0.15.
  4. Backprop as usual; α(l, t) is a fixed scalar (no parameters to learn).
  5. Keep your usual optimizer, LR schedule, and clipping settings.
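The recipe above can be sketched as a toy forward pass with scalar "activations"; `layer_fns` stands in for each layer’s F(Norm(Ā·)) and is our own simplification, not the paper’s implementation:

```python
def prores_forward(x, layer_fns, step, T=1000):
    """Run a stack of layers, gating each residual by alpha(l, t)."""
    for l, F in enumerate(layer_fns, start=1):
        alpha = min(step / (T * l), 1.0)  # fixed scalar, nothing to learn
        x = x + alpha * F(x)              # gated residual update
    return x

# At step 0 every alpha is 0, so the whole stack is an identity map,
# no matter how wild the proposed updates are:
layers = [lambda x: 99.0] * 5
print(prores_forward(2.0, layers, step=0))  # 2.0
```

This makes the "identity at initialization" property concrete: stability at step 0 comes from the gates, not from the layers themselves.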

The secret sauce:

  • Time-aware, depth-aware residual gating that starts at identity (no surprises), bounds early updates, then frees capacity later.
  • Minimal code change; works with many norms/inits; improves both stability and final quality.

04 Experiments & Results

šŸž Top Bread (Hook) Think of a science fair: you don’t just say ā€œit’s betterā€ā€”you show scores, races against others, and how it behaves under stress.

🄬 Filling (The Actual Concept)

  • What it is: The authors trained many models (130M → 7B params) across variants (Pre-LN, Post-LN, Sandwich-LN, DeepNorm, LNS, DS-Init, Scaled Init) on datasets like C4-en, with tests on WikiText and LAMBADA, plus reasoning benchmarks.
  • How it works: They compared perplexity (lower is better) and zero-shot accuracy (higher is better), and watched for loss/gradient spikes.
  • Why it matters: Real improvements mean faster convergence, better generalization, and stability at scale.

šŸž Bottom Bread (Anchor) It’s like testing a car: lap times (speed), fuel use (efficiency), and how steady it is on bumpy roads (stability).

The test:

  • Pretraining on C4-en (50B tokens, seq len 1024) with AdamW and Warmup–Stable–Decay LR.
  • Benchmarks: PIQA, SIQA, HellaSwag, WinoGrande, ARC-Easy/Challenge, OBQA, RACE, LAMBADA, MMLU; plus perplexity on WikiText/LAMBADA.

The competition:

  • Baselines: Pre-LN, Post-LN, Sandwich-LN, DeepNorm, LayerNorm Scaling, DS-Init, Scaled Init.
  • ProRes applied to each, mostly using a simple linear schedule with T=1000.

Scoreboard highlights with context:

  • Pretraining perplexity on C4-en (1.3B Pre-LN): 10.32 → 9.86 with ProRes (a 0.46 drop). That’s like shaving meaningful seconds off a marathon time when others are already elite.
  • Consistent gains across sizes (130M, 350M, 1.3B) and across norms/inits; Post-LN benefits the most (it’s the wobbliest baseline, so ProRes stabilizes it greatly).
  • Zero-shot reasoning (1.3B averages): +1.27% over baselines—think moving from a solid B to a B+ across a tough test suite. Biggest jumps on HellaSwag, ARC-Easy, OBQA, and LAMBADA.
  • Out-of-distribution: on LAMBADA perplexity, the average reduction is ā‰ˆ 4.86—much larger than on the training corpus—showing stronger generalization.

Surprising findings:

  • The linear schedule is robust; but some schedules can backfire: ā€œequalā€ (all layers warm together) can hurt or diverge; ā€œreverseā€ (deep first) is especially bad.
  • For Post-LN, ā€œlinear-squareā€ or ā€œstagewise-Lā€ can beat plain linear—gentler introductions help.
  • LayerNorm Scaling (LNS) + ProRes gives smaller gains at large scale: static down-weighting by 1/√l plus time gating can over-dampen deep layers.

Depth scaling and stability:

  • As layers increase (12 → 120), Pre-LN with ProRes keeps improving and eventually leads the pack; LNS keeps up early but falls behind at great depth.
  • Spike scores (loss/gradient spikes ≄ 7 standard deviations) stay near zero with ProRes even as depth grows—like a smooth heart rate under heavy exercise.
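A rough way to count such spikes is to flag steps far from the run’s mean loss; this is our own crude proxy, and the paper’s exact spike-score definition may differ:

```python
from statistics import mean, stdev

def count_spikes(losses, n_std=7.0):
    """Count steps whose loss sits at least n_std standard deviations
    away from the mean loss of the run."""
    mu, sd = mean(losses), stdev(losses)
    return sum(1 for v in losses if abs(v - mu) >= n_std * sd)

smooth = [1.0, 1.1, 0.9] * 40          # a calm run: no spikes
spiky = [1.0] * 100 + [1000.0]         # one huge loss spike
print(count_spikes(smooth), count_spikes(spiky))  # 0 1
```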

Overall take:

  • ProRes gives a better optimization path: early stability, smoother activation growth, and cleaner layer-wise evolution, leading to faster convergence and stronger generalization.

05 Discussion & Limitations

šŸž Top Bread (Hook) Even great tools have best-use cases, required gear, and spots where they’re not ideal.

🄬 Filling (The Actual Concept)

  • Limitations:
    1. Schedule choice matters: ā€œequalā€ or ā€œreverseā€ can harm or diverge; overly long warmups may underuse deeper layers within a fixed budget.
    2. Architecture dependence: Best schedule varies (e.g., Post-LN likes gentler curves like linear-square).
    3. Interaction with static scalings (e.g., LNS) can over-dampen deep layers at large scale.
    4. Tested mainly on decoder-only LLMs with standard training; other modalities (vision, audio) and training recipes need further validation.
  • Required resources:
    • Same compute class as baseline training (the study used 8ƗH800 GPUs); ProRes adds negligible overhead—just per-layer scalars over time.
    • Usual data/optimizer setups; no special memory tricks.
  • When not to use:
    • Very shallow models (few layers) that already train stably.
    • Ultra-short training runs where deep layers won’t have time to warm up.
    • Pipelines already using strong, dynamic per-layer gating that achieves similar effects.
  • Open questions:
    1. Can the schedule be learned automatically (e.g., meta-learned α(l, t) or adaptive controllers)?
    2. How does ProRes interact with Mixture-of-Experts, curriculum learning, or RLHF stages?
    3. Can signals like spike score or activation growth guide on-the-fly schedule adjustment?
    4. Does ProRes improve multimodal Transformers and very long-context models?

šŸž Bottom Bread (Anchor) Think of ProRes as traffic lights turned on in sequence across a growing city grid. It works wonderfully when streets (layers) are many and rush hour (warmup) is chaotic, but you still need the right timing plan for each neighborhood (architecture).

06 Conclusion & Future Work

Three-sentence summary:

  • ProRes is a tiny change—multiply each layer’s residual by a scale that starts at 0 and warms to 1, with deeper layers warming more slowly.
  • This time-aware, depth-aware gating calms early chaos, respects the stack’s dependency, and frees full capacity later, yielding faster convergence, better generalization, and greater stability across many architectures and sizes.
  • Experiments show consistent perplexity drops, accuracy gains, fewer spikes, and stronger depth scaling up to 120 layers and 7B parameters.

Main achievement:

  • Turning residual learning into a staged process—identity at init, progressive activation across depth—demonstrating that who learns when is as crucial as how much you scale at init.

Future directions:

  • Auto-tuning or learning the schedule, architecture-aware schedules (e.g., Post-LN-specialized), extension to MoE/multimodal models, and feedback-driven adjustments using stability signals.

Why remember this:

  • A single scalar per layer over time can reshape optimization: small, principled timing beats one-size-fits-all static scaling. ProRes shows that scheduling learning across depth is a powerful, practical lever for making big models train better.

Practical Applications

  • Stabilize pretraining of deep LLMs (e.g., 48–120 layers) by enabling an early-layers-first learning order.
  • Rescue wobbly Post-LN runs by using gentler ProRes schedules (e.g., linear-square or stagewise-L).
  • Improve generalization on OOD text (e.g., LAMBADA) with minimal code changes.
  • Speed up convergence in fixed-budget pretraining by reducing early loss/gradient spikes.
  • Combine with existing inits (e.g., DS-Init, Scaled Init) and norms (RMSNorm, Sandwich-LN) for additive gains.
  • Scale to larger models (e.g., 7B) confidently by managing early training chaos.
  • Use spike-score monitoring to validate stability and, in future, adapt schedules on the fly.
  • Deploy ProRes as a default training knob: start with linear T=1000, then tune per architecture if needed.
  • Pair with Warmup–Stable–Decay LR for smoother optimization phases.
  • Apply to curriculum/data-mixture schedules to align layer warmup with data difficulty over time.
Tags: Progressive Residual Warmup Ā· ProRes Ā· Transformer training stability Ā· Residual scaling Ā· Layer-wise warmup Ā· Pre-LN Ā· Post-LN Ā· Depth scaling Ā· Initialization schemes Ā· Normalization Ā· Perplexity reduction Ā· Optimization trajectory Ā· Spike score Ā· LAMBADA Ā· C4-en