Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better
Key Summary
- Big idea: use a small, already-trained model to help a bigger model learn good habits early, so the big one trains faster and ends up smarter.
- LET guides the big model's early layers using the small model's late-layer features, then gently turns that guidance off as training continues.
- Across 1.4B–7B models, LET cuts time-to-quality by up to 1.6× while improving downstream accuracy by about 5% over standard training.
- It works even when the helper model is 10× smaller than the target model, unlike classic distillation that usually needs a larger teacher.
- Aligning the small model's last layer to the big model's early layer (L2E) is the most stable and best-performing strategy in ablations.
- LET consistently lowers language-modeling perplexity across vocabularies (OPT, Pythia, SmolLM), showing robust modeling gains.
- A simple schedule (weight λ that decays to zero by step S_stop) keeps guidance helpful early and out of the way later; λ≈0.1 works best.
- LET slightly reduces throughput early on, but the faster convergence more than compensates, yielding shorter overall time to strong performance.
- Quality of the helper matters: very old or low-quality helpers (e.g., GPT-2 era) can weaken gains, but LET stays more robust than RKD.
- LET is architecture-agnostic and extends beyond text (e.g., time series), making it a practical way to reuse community-trained models.
Why This Research Matters
Training giant models is expensive and slow, which limits who can build them and how quickly the field can improve. LET lets teams reuse the community's many small, pretrained models as "training wheels," delivering faster learning and better final results without architectural surgery. That means more progress per dollar and per kilowatt-hour, making AI research greener and more accessible. It also shortens iteration cycles, so safety fixes, bias reductions, and capability upgrades can land sooner. Beyond text, the idea generalizes to time series and potentially other domains, multiplying the value of existing open-source models. In short, LET helps everyone go faster together while spending less.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're learning basketball. A younger player shows you their favorite finishing moves (the end results), so in your first practices you copy those moves during warmups. You get the feel much faster, and later your coach lets you freestyle.
The Concept: Language Modeling
- What it is: Teaching a computer to predict the next word in a sentence so it can read and write well.
- How it works: (1) Feed the model a sentence up to a certain word; (2) It guesses the next word; (3) Compare its guess to the real word; (4) Nudge its brain (weights) to do better next time; (5) Repeat millions of times.
- Why it matters: Without this core training, the model can't understand patterns in language, making all later skills weak.
Anchor: For the prompt "Paris is the capital of ____," a good language model learns to say "France."
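To make the loop concrete, here is a toy PyTorch sketch of next-word prediction. Everything in it (the tiny two-layer model, the random tokens) is an illustrative stand-in, not the paper's setup:

```python
# Toy next-word prediction loop, purely illustrative.
import torch
import torch.nn.functional as F

vocab_size = 100
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),  # token id -> vector
    torch.nn.Linear(32, vocab_size),     # vector -> score for every next word
)

tokens = torch.randint(0, vocab_size, (1, 8))  # one toy 8-token "sentence"
logits = model(tokens[:, :-1])                 # guess a next word at each position
loss = F.cross_entropy(                        # compare guesses to the real next words
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
loss.backward()                                # nudge the weights to do better
```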
The World Before: Training large language models (LLMs) meant lots of data, lots of GPUs, lots of days. The recipe was: scale the model and the dataset, train with next-word prediction, and wait. This worked, but it was expensive and slow. Meanwhile, the community built many smaller open-source models that already understand a lot, yet we mostly ignored them when starting new, larger models from scratch.
Hook: You know how a 5th grader can show a 2nd grader neat tricks to make math faster? The 5th grader isn't a professor, but their shortcuts still help.
The Concept: Knowledge Distillation (KD)
- What it is: A way to train one model by having it imitate another modelâs outputs.
- How it works: (1) Run a teacher model on data; (2) Get the teacherâs predictions; (3) Train the student to match these predictions; (4) Keep going until the student behaves like the teacher.
- Why it matters: KD can compress big models into smaller ones. But if the teacher is huge, it's costly; if the teacher is small, the student can end up worse.
Anchor: A small student model learns to pick "France" because a big teacher said that "France" has the highest probability.
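For readers who want the mechanics, a textbook way to implement this imitation is a KL-divergence loss between softened teacher and student predictions. The sketch below is the standard recipe, assuming both models share a vocabulary; the temperature T is a conventional knob, not something this paper prescribes:

```python
# Textbook distillation loss: the student matches the teacher's softened predictions.
import torch
import torch.nn.functional as F

teacher_logits = torch.randn(4, 100)                     # 4 tokens, 100-word vocab
student_logits = torch.randn(4, 100, requires_grad=True)

T = 2.0  # temperature: softer distributions expose more of the teacher's knowledge
kd_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),  # student log-probabilities
    F.softmax(teacher_logits / T, dim=-1),      # teacher probabilities (the target)
    reduction="batchmean",
) * (T * T)                                     # conventional T^2 rescaling
kd_loss.backward()                              # only the student is updated
```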
The Problem: Can we make a big model learn faster without paying for a giant teacher? Could we reuse the many small, pretrained models already available to kickstart training of much larger models?
Failed Attempts: Reverse-KD (small teacher → big student) often underperforms when the teacher is much smaller. Other methods grow models over time but require architecture surgery and careful schedules.
The Gap: We lacked a simple, architecture-agnostic way to reuse a small model's know-how that doesn't shackle the big model later or require a massive teacher.
Real Stakes: Faster pretraining saves money and energy, speeds up research cycles, lowers the barrier for labs and startups, and can make improvements accessible beyond the biggest companies.
Hook: Think of learning a song. If someone shows you the final melody (the "later" knowledge), your fingers learn useful patterns early, so you practice smarter.
The Concept: Late-to-Early Training (LET)
- What it is: A new training style where a big model's early layers copy helpful signals from a small model's late layer at the start of training.
- How it works: (1) Run the small, pretrained "helper" model on the text and take its last hidden features; (2) Nudge the big model's early-layer features to look similar; (3) Slowly fade out this nudging over steps; (4) Let the big model take over and keep learning from data.
- Why it matters: Without LET, the big model wastes time relearning patterns the small model already figured out. LET gives it a head start.
Anchor: Early in training, the big model's first few layers learn to shape "Paris … capital … France" similarly to the small model's mature features, so it gets smart faster.
02 Core Idea
Hook: You know how teachers sometimes give you the answer key for a few problems so you see what a correct solution looks like before you practice on your own?
The Concept: Late-to-Early Training (LET)
- What it is (aha! moment): Use a small model's mature features (late layer) to guide a big model's early layers in the early training steps, then gradually remove that guidance.
- How it works: (1) Take the helper model's final-layer representation; (2) Align the big model's early-layer representation to it; (3) Use an extra term in the loss that rewards similarity; (4) Linearly reduce this extra weight to zero by a chosen step; (5) Continue standard training.
- Why it matters: It seeds the big model with proven "feature shapes" quickly, improving speed and final quality, even when the helper is 10× smaller.
Anchor: It's like training wheels: copy balance early, ride solo later.
Multiple Analogies:
- Hiking trail: The helper model leaves footprints (late features). The big model's early layers follow them at the start, then veer to better paths as it becomes confident.
- Coloring book: The helper provides outlines; the big model fills in and later draws without outlines.
- Lego building: Start with a small set's finished subassembly (late features) to snap the first pieces of a bigger set into the right shape quickly.
Before vs After:
- Before: Big models learned everything from scratch; teacher models helped only if they were large and costly.
- After: Even a much smaller helper can boost both speed (up to 1.6×) and quality (~5% accuracy gain) by donating mature patterns to early layers early on.
Why It Works (intuition, no equations):
- Early layers create the base shapes of understanding. If you aim them toward a helpful shape at the beginning, later layers can refine instead of fixing mistakes.
- Using the helper's final layer is like getting the most condensed summary of what matters; mapping it to early layers leaves plenty of room above to adapt and surpass the helper.
- Fading out the extra guidance prevents the big model from overfitting to the helper's limits.
Building Blocks (the smaller pieces):
- Data flow: Text → helper model's last features + big model's early features → compare similarity → add to normal training loss → update big model.
- Alignment: If feature sizes differ, resize the big model's early feature to match; then normalize and compare via a similarity score.
- Schedule: A weight (call it λ) starts reasonably small (≈0.1 works best in tests) and drops to zero by step S_stop so guidance doesn't linger (see the schedule sketch after this list).
- Choice of layers: Best is helper last layer to big model early layer (L2E). Other pairings are less stable.
Anchor: After a few thousand steps, the training wheels are off, yet the big model now rides faster and farther than if it had started without them.
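In code, the fade-out can be a simple linear ramp. This sketch plugs in the λ≈0.1 and S_stop=1500 values quoted in the text; the exact schedule shape is a plausible reading, not the paper's verbatim code:

```python
def let_weight(step: int, lam0: float = 0.1, s_stop: int = 1500) -> float:
    """Guidance weight λ: starts at lam0 and decays linearly to 0 by s_stop."""
    return lam0 * max(0.0, 1.0 - step / s_stop)

# Full guidance at step 0, half at step 750, off from step 1500 onward.
print(let_weight(0), let_weight(750), let_weight(1500))  # 0.1 0.05 0.0
```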
03 Methodology
At a high level: Input text → run both models → compare helper's last-layer features to big model's early-layer features → combine this with the usual next-word loss → update big model → gradually turn off the extra comparison.
Step-by-step (like a recipe):
- Prepare the ingredients
- What happens: Pick a helper model (small, already trained), a target big model, a dataset (e.g., The Pile), and two knobs: λ (initial guidance strength) and S_stop (when guidance fully fades out).
- Why it exists: Without these, the big model has no "training wheels" and you can't control how long to keep them.
- Example: Helper = SmolLM-135M; Big = 1.4B model; λ = 0.1; S_stop = 1500 steps.
- Forward pass through both models
- What happens: Feed the same batch of text to both. Save the helper's final-layer features and the big model's early-layer features (e.g., layer 3).
- Why it exists: We need two feature snapshots to compare shapes and nudge the big modelâs early layers toward the helperâs mature patterns.
- Example: Input "The Eiffel Tower is in ____." Helper produces a last-layer vector per token; big model's layer 3 produces its own vector per token.
- Make features comparable
- What happens: If the two feature sizes differ, resize the big model's early-layer feature to the helper's size (a simple 1D resampling along the feature dimension). Then normalize both so only direction, not magnitude, matters.
- Why it exists: Without compatible sizes and normalization, comparison would be biased or undefined.
- Example: If helper has 768 dims and big model has 1024 dims at layer 3, we interpolate the 1024-d vector down to 768-d before comparing.
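Here is one plausible PyTorch realization of this step. The paper describes a simple 1D resampling; linear interpolation via F.interpolate is our assumption about how to do it, and the shapes match the 1024-to-768 example above:

```python
# Resample the big model's 1024-d early features to the helper's 768 dims,
# then L2-normalize both so only direction (not magnitude) is compared.
import torch
import torch.nn.functional as F

big_feat = torch.randn(2, 16, 1024)    # (batch, tokens, dims) from an early layer
helper_feat = torch.randn(2, 16, 768)  # helper's last-layer features

# F.interpolate wants (batch, channels, length), so fold tokens into the batch.
resized = F.interpolate(
    big_feat.reshape(-1, 1, 1024), size=768, mode="linear", align_corners=False
).reshape(2, 16, 768)

big_n = F.normalize(resized, dim=-1)        # unit length per token
helper_n = F.normalize(helper_feat, dim=-1)
```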
- Compute two losses and add them
- What happens: (a) Usual language modeling loss (predict the next token correctly). (b) Alignment loss that rewards similarity between the two features (e.g., negative cosine similarity). Total loss = usual loss + λ × alignment loss, with λ shrinking linearly to 0 by S_stop.
- Why it exists: The usual loss teaches language; the alignment loss shapes early features to be useful sooner. If you drop (b), you lose speed; if you keep it forever, the big model might inherit the helper's limits.
- Example: Early steps: λ=0.1, so the alignment matters. After S_stop, λ=0, so training is back to standard language modeling.
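Putting (a) and (b) together might look like the sketch below. The feature tensors and the language-modeling loss are random stand-ins here; in real training they come from the two forward passes and the next-token objective:

```python
import torch
import torch.nn.functional as F

# Random stand-ins for what the two forward passes would produce.
big_n = F.normalize(torch.randn(2, 16, 768, requires_grad=True), dim=-1)
helper_n = F.normalize(torch.randn(2, 16, 768), dim=-1)  # helper side, frozen
lm_loss = torch.tensor(3.2, requires_grad=True)          # placeholder next-token loss

# (b) alignment loss: negative cosine similarity, averaged over tokens.
align_loss = -F.cosine_similarity(big_n, helper_n, dim=-1).mean()

lam = 0.1 * max(0.0, 1.0 - 500 / 1500)   # λ at step 500 under the linear schedule
total_loss = lm_loss + lam * align_loss  # (a) + λ × (b)
total_loss.backward()                    # gradients reach only the big-model side
```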
- Update only the big model
- What happens: Backpropagate the total loss and update the big model's weights. The helper model stays frozen.
- Why it exists: We want to accelerate and improve the big model; the helper is just a guide.
- Example: GPU updates the big model's parameters while the helper is used only for forward passes early on.
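In PyTorch terms, "the helper stays frozen" can be enforced like this. The names helper and tokens are hypothetical, and we assume a Hugging Face-style model that can return its hidden states:

```python
import torch

# `helper` and `tokens` are hypothetical stand-ins for the small pretrained
# model and a batch of token ids.
helper.eval()                             # inference mode (no dropout, etc.)
for p in helper.parameters():
    p.requires_grad_(False)               # no gradients, no optimizer state

with torch.no_grad():                     # forward only; build no backprop graph
    out = helper(tokens, output_hidden_states=True)
    helper_feat = out.hidden_states[-1]   # the helper's last-layer features
```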
- Fade out the guidance
- What happens: Each step, shrink λ toward zero so the big model gradually flies solo.
- Why it exists: Without fading, the big model might copy the helper too much and plateau at the helper's ceiling.
- Example: By step 1500, λ has reached 0; the alignment loss disappears and we continue with standard training.
The Secret Sauce (what's clever):
- Late-to-early layer alignment (L2E): Use the helper's last-layer features (most mature) to shape the big model's early layers (most foundational). This leaves later layers free to adapt and surpass the helper.
- Late-to-early step scheduling: Only use guidance early and then stop. This gives a head start without handcuffs.
- Architecture-agnostic: Since we match features, not logits or specific layers by design, helper and big model can come from different families.
Important parameters and choices (collected in the config sketch after this list):
- Which layers to align: Helper last layer → big model early layer works best in ablations; middle-layer alignments are weaker.
- λ (guidance weight): Too big (≥1.0) makes the big model cling to the helper and hurts learning; too small (0.01) helps but less; around 0.1 is a sweet spot.
- S_stop (when to stop): Longer guidance helps early progress but can limit late performance if kept too long; pick a step that balances kickstart vs autonomy.
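As promised, here is one way to collect those knobs in a single config object. It is purely illustrative; only the default values mirror numbers reported in the text:

```python
from dataclasses import dataclass

@dataclass
class LETConfig:
    helper_layer: int = -1   # helper's LAST layer ("late")
    target_layer: int = 3    # big model's EARLY layer, per the winning L2E pairing
    lam0: float = 0.1        # initial guidance weight; ~0.1 was the sweet spot
    s_stop: int = 1500       # step by which guidance has fully decayed to zero

cfg = LETConfig()  # defaults mirror the example values quoted in the text
```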
Mini example with data:
- Sentence: "Whales are the largest animals on ____."
- Steps 1–1000: Helper's final features and big model's early features align with λ=0.1; the big model quickly learns useful shapes connected to ocean/sea context.
- Steps 1000–1500: λ decays; the big model relies more on its own gradients from next-word prediction.
- After 1500: Pure language modeling; the big model can now surpass the helper's abilities.
04 Experiments & Results
The Test: The team measured two things: (1) Downstream tasks (9 benchmarks like ARC, PIQA, BoolQ, etc., in a one-shot setting) to see real-world skill, and (2) Perplexity on The Pile (how surprised the model is by test text; lower is better) to check modeling quality.
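As a quick reference for reading the perplexity numbers: perplexity is just the exponential of the average next-token loss, so lower loss means lower perplexity:

```python
import math

mean_nll = 2.0                   # average next-token loss (nats per token)
perplexity = math.exp(mean_nll)  # ≈ 7.39: like guessing among ~7 equally likely words
```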
The Competition: LET was compared to standard training (Baseline), SALT, and Reverse Knowledge Distillation (RKD). Models tested included 1.4B and 7B targets, with helpers around 125–160M (for 1.4B) and up to 1.7B (for 7B), from families like OPT, Pythia, and SmolLM.
The Scoreboard:
- Speed and quality: For a 1.4B model on The Pile, LET reached better downstream performance in under two-thirds of the steps (about a 1.6× speedup) while ending around 5% higher in average accuracy than Baseline. That's like finishing your homework in 40 minutes instead of 60, and still scoring higher.
- Robust modeling: Across vocabularies (OPT, Pythia, SmolLM), LET consistently lowered test perplexity throughout training, showing the gains aren't vocabulary tricks.
- Small-but-mighty helpers: Even when the helper was 10× smaller than the target, LET still beat Baseline; RKD often fell behind Baseline in the same setting.
- Layer-pairing ablations: Aligning helper last layer → big model early layer (L2E) was the clear winner in stability and final performance. Other pairings tended to wobble after alignment ended or simply underperformed.
- λ sweeps: λ≈0.1 worked best. Larger λ made the big model over-copy the helper; smaller λ gave weaker (but still positive) gains.
- Stronger than a bigger baseline: A LET-1.4B model outperformed a Baseline-3B model trained in the same regime, showing how much "smarter" the learning became.
Surprising/Notable findings:
- Architecture-agnostic: Helpers from different model families and attention variants still provided benefits; SmolLM as helper often performed best among similar-sized helpers.
- Tokenizer mismatch: Even with different tokenizers (helper vs target), LET stayed effective, though exact matches are slightly cleaner.
- Limits of old helpers: Using older, weaker helpers like GPT-2 (trained on earlier data) reduced gains and could underperform Baseline, yet LET still surpassed RKD there.
Training efficiency notes:
- Throughput: LET's early-stage throughput is slightly lower than Baseline (due to running a helper forward pass), but the faster convergence yields a net win in time-to-quality.
- Memory: LET uses helper features (not logits), which trims peak memory vs some alternatives; and because the helper is used only early, the added cost shrinks over total training time.
05 Discussion & Limitations
Limitations:
- Early overhead: LET runs a helper forward pass early on, slightly reducing throughput compared to Baseline in that phase.
- Scale not yet maxed: Experiments top out at 7B targets and ~20B tokens; behavior at 70B+ and trillion-token scales needs validation.
- Helper quality matters: Very old or low-quality helpers (e.g., GPT-2-era) can weaken or undo gains.
- Hyperparameter sensitivity: λ and S_stop matter; wrong choices can cause over-copying (too big/too long) or weak help (too small/too short).
- Domain mismatch: If the helper was trained on very different data, its features may be less useful.
Required Resources:
- One small pretrained helper model checkpoint; compute to run its forward pass during early steps.
- Usual LLM pretraining stack (GPUs, optimizer, data loader). No architecture surgery is needed.
When NOT to Use:
- If you must maximize raw throughput at the very start and can't afford any helper pass.
- If your only available helper is clearly low-quality or trained on mismatched, outdated data (empirically, this can hurt).
- If you need perfect reproducibility with zero extra moving parts (LET adds a schedule and alignment step to manage).
Open Questions:
- Scaling laws: How do λ, S_stop, and layer choices change with 70B+ models and massive datasets?
- Better alignment signals: Can alternatives to cosine (e.g., log-sum losses) or learned projectors push gains further at negligible cost?
- Multi-helper strategies: Could ensembles of small helpers, or curriculum-style swapping, help even more?
- Cross-domain: How far can LET go beyond text (e.g., speech, code, multimodal) and with what tweaks?
06 Conclusion & Future Work
3-Sentence Summary: LET lets big models borrow late-stage wisdom from small models during the earliest steps and layers, then removes that guidance so the big model can surpass its helper. This simple, architecture-agnostic trick speeds up training (up to 1.6×) and improves accuracy (~5%) even when the helper is 10× smaller. Across tasks and vocabularies, LET lowers perplexity, boosts downstream results, and remains robust when set up well.
Main Achievement: Showing that "late-to-early" alignment (helper last layer to big model early layer, early in training only) provides a dependable, scalable way to accelerate and strengthen LLM pretraining using readily available small models.
Future Directions: Validate at 70B+ scales; refine alignment losses and schedules; explore multiple helpers or curriculum helpers; extend to multimodal and specialized domains; automate λ/S_stop/layer selection.
Why Remember This: LET turns the community's many small pretrained models into practical training wheels that make huge models learn earlier, faster, and better, saving time, money, and energy while lifting final performance.
Practical Applications
- Pretrain a new 1–7B LLM faster by aligning its early layers to a 100–500M helper's final features for the first few thousand steps.
- Warm-start domain models (e.g., biomed or legal) by using a small domain helper to guide early features before switching to standard training.
- Reduce compute budgets for startups or labs by achieving target accuracy in fewer tokens/steps.
- Speed up research ablations: reach a comparable validation perplexity faster to test ideas more quickly.
- Improve data efficiency: outperform a larger baseline model at the same token budget by using LET on a smaller target.
- Cross-family bootstrapping: use a helper from a different architecture or attention variant to seed good early representations.
- Tokenizer-flexible pretraining: proceed even when helper and target tokenizers differ, accepting small trade-offs.
- Curriculum helpers: start with one small helper and swap to another slightly stronger helper mid-way (future extension of LET).
- Time-series acceleration: apply LET in non-text domains (e.g., classification with TimesNet as helper) to improve accuracy and speed.
- Low-VRAM setups: use feature alignment (not logits) to keep memory overhead manageable during the early phase.