
Improving Recursive Transformers with Mixture of LoRAs

Intermediate
Mohammadmahdi Nouriborji, Morteza Rohanian, Omid Rohanian · 12/14/2025
arXiv · PDF

Key Summary

  • Recursive transformers save memory by reusing the same layer over and over, but that makes them less expressive and hurts accuracy.
  • This paper adds Mixture of LoRAs (MoL): tiny, low-rank expert adapters placed inside the shared feed-forward network with a router that picks the best experts per token.
  • MoL brings back the lost expressivity of shared layers while keeping the model compact and fast.
  • A modernized architecture called ModernALBERT (50M–120M parameters) combines RoPE, GeGLU, and FlashAttention with MoL.
  • On GLUE, SQuAD-v2, and BEIR, ModernALBERT matches or beats bigger fully-parameterized baselines, setting a new bar for small models.
  • MoL outperforms alternatives like Mixture-of-Adapters and Relaxed Recursive Transformers in controlled comparisons.
  • A simple expert-merging step turns the many experts into one adapter at inference, keeping accuracy but cutting latency and memory use.
  • Knowledge distillation and careful initialization make training data-efficient, reaching strong results with a modest 30B-token budget.
  • Conditional computation inside the shared FFN is the key idea that restores layer-wise diversity without bloating parameters.

Why This Research Matters

Smaller yet smarter language models can run on everyday devices, bringing high-quality understanding to phones, classrooms, and clinics without giant servers. Conditional computation means models only work hard when they must, saving energy and cost—good for the planet and your battery. By merging experts for deployment, we keep accuracy high while cutting latency, making real-time apps (like chat, search, or assistive tools) feel snappy. Better retrieval and reasoning help search engines, customer support, and education tools give precise, trusted answers. Robust performance with modest training budgets democratizes AI research and development. Finally, these ideas point toward efficient, powerful future LLMs and multimodal systems that are more accessible to everyone.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you have one really good math teacher teaching every grade from 1st to 12th. It saves money because you only pay one teacher, but some lessons will feel too simple for older kids and too hard for younger kids.

🥬 The Concept (Recursive Transformers): What it is: A recursive transformer shares the same layer across many depths, like reusing the same teacher in every classroom. How it works: (1) Build one transformer block (attention + feed-forward). (2) Reuse that block many times to create a deep model. (3) Save tons of parameters because you store one set of weights instead of many. Why it matters: Without something extra, all layers act the same, so the model loses the special roles different layers usually play (like early layers noticing letters, later layers noticing meanings).

🍞 Anchor: ALBERT reused the same block and got small, strong models, but sometimes had to make layers wider to keep up, which eats into the efficiency gains.

🍞 Hook: You know how some tools are multi-purpose, like a Swiss Army knife? Great for packing light, but each tool can be a bit basic.

🥬 The Concept (Expressivity Trade-off): What it is: Sharing parameters shrinks model size but also shrinks the variety of things each layer can learn. How it works: (1) With sharing, every depth uses the same weights. (2) Layers stop specializing (e.g., syntax vs. semantics). (3) Performance on tricky tasks can drop. Why it matters: If all layers think the same way, the model can miss fine details or domain-specific patterns.

🍞 Anchor: It’s like trying to write essays, do algebra, and paint with only one pencil; you can do it, but it’s not ideal.

🍞 Hook: Think about calling in specialists—like a doctor for bones, another for skin, another for eyes—only when you need them.

🥬 The Concept (Mixture-of-Experts, MoE): What it is: A method where a router picks a few specialized experts to process each token. How it works: (1) A router scores which experts fit a token. (2) Only top experts run (sparse compute). (3) Combine their outputs. Why it matters: You get high capacity without running every expert all the time, but total parameters still get large.

🍞 Anchor: If a sentence mentions “stocks,” a finance expert helps; if it mentions “soccer,” a sports expert helps.
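To make the routing step concrete, here is a minimal PyTorch sketch of top-k expert selection for a single token. The function name, dimensions, and the simple linear router are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def moe_route(token, router, experts, k=2):
    """Score all experts for one token, run only the top-k, and blend their outputs."""
    scores = F.softmax(router(token), dim=-1)          # (1) router scores every expert
    top_scores, top_idx = scores.topk(k)               # (2) only the top-k experts will run
    weights = top_scores / top_scores.sum()            #     renormalize the chosen scores
    return sum(w * experts[i](token) for w, i in zip(weights, top_idx.tolist()))  # (3) combine

# Illustrative usage: 4 toy experts over a 16-dimensional token vector.
d_model, num_experts = 16, 4
router = torch.nn.Linear(d_model, num_experts)
experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]
out = moe_route(torch.randn(d_model), router, experts, k=2)
```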

🍞 Hook: Imagine adding little clip-on upgrades to a bike instead of buying a whole new bike.

🥬 The Concept (Low-Rank Adaptation, LoRA): What it is: A tiny, efficient adapter that gently nudges big weight matrices without changing the whole thing. How it works: (1) Freeze the big weights. (2) Add a small low-rank A and B to create a lightweight update. (3) Train only A and B to adapt. Why it matters: Massive savings in parameters and memory, with strong performance.

🍞 Anchor: Like adding a small booster motor to a bike that lets you go uphill without replacing the whole bike.
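A minimal sketch of the LoRA idea in PyTorch, wrapping a frozen nn.Linear base layer; the rank r and scaling alpha shown here are illustrative defaults, not the paper's settings.

```python
import torch

class LoRALinear(torch.nn.Module):
    """Frozen base weight plus a tiny trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: torch.nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                   # (1) freeze the big weights
        d_out, d_in = base.weight.shape
        self.A = torch.nn.Parameter(torch.randn(r, d_in) * 0.01)      # (2) small low-rank factors
        self.B = torch.nn.Parameter(torch.zeros(d_out, r))            #     B starts at zero: no change at init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T  # (3) only A and B are trained
```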

🍞 Hook: Imagine turning lights on only in the rooms you’re using.

🥬 The Concept (Conditional Computation): What it is: Only running parts of the network when needed. How it works: (1) A router decides which parts are useful. (2) Activate those parts. (3) Skip the rest. Why it matters: Saves time and energy while boosting capacity where it counts.

🍞 Anchor: In a library, you don’t read every book; you check the catalog and pick only the relevant ones.

🍞 Hook: Suppose your phone camera has clever math that keeps the right details and tosses the noise.

🥬 The Concept (Gated GELU, GeGLU): What it is: A gated activation that decides how much information to pass. How it works: (1) Split a signal. (2) One path becomes a gate (how much). (3) Multiply gate × content. Why it matters: Helps gradients flow and features stay expressive in feed-forward networks.

🍞 Anchor: Like a faucet controlling water flow so your sink doesn’t overflow.
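A minimal GeGLU sketch; using a single fused projection that is then split into gate and content halves is a common implementation choice assumed here.

```python
import torch
import torch.nn.functional as F

class GeGLU(torch.nn.Module):
    """Project the input, split it into a gate and a content path, and multiply them."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.proj = torch.nn.Linear(d_model, 2 * d_hidden)   # one projection, later split in two

    def forward(self, x):
        gate, content = self.proj(x).chunk(2, dim=-1)        # (1) split the signal
        return F.gelu(gate) * content                        # (2) the gate decides how much content passes
```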

🍞 Hook: Imagine labeling seats in a theater so everyone knows where to sit.

🥬 The Concept (Rotary Position Embeddings, RoPE): What it is: A smart way to encode token order into attention. How it works: (1) Rotate query/key vectors with position-aware angles. (2) Preserve relative positions. (3) Help attention compare distances naturally. Why it matters: Makes models better at understanding order and long-range relations.

🍞 Anchor: It’s like adding seat numbers so people find their row and seat easily.
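A simplified rotary-embedding sketch: adjacent channel pairs of each query/key vector are rotated by position-dependent angles. The base frequency of 10000 is the usual convention; the exact channel layout varies between implementations, so treat this as an illustration rather than the paper's code.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (seq_len, dim), with dim even."""
    seq_len, dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))  # one frequency per channel pair
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)       # (seq_len, dim/2) rotation angles
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                                 # rotate each (even, odd) channel pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out  # applied to queries and keys so attention sees relative positions
```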

🍞 Hook: Think of an express checkout lane at the supermarket.

🥬 The Concept (FlashAttention): What it is: A fast, memory-aware way to compute attention exactly. How it works: (1) Break attention into tiled chunks. (2) Keep data on-chip to reduce memory traffic. (3) Compute softmax safely and quickly. Why it matters: Speeds up training/inference and reduces memory use.

🍞 Anchor: Like scanning items in batches so the cashier is both fast and accurate.
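The paper relies on FlashAttention kernels; as a rough stand-in for the interface, PyTorch's built-in fused attention computes the same exact result while avoiding the full seq_len × seq_len score matrix in memory, dispatching to a FlashAttention-style kernel when one is available. The shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch=2, heads=8, seq_len=1024, head_dim=64.
q = torch.randn(2, 8, 1024, 64)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Exact attention, computed in tiles so the (1024 x 1024) score matrix is never fully materialized.
out = F.scaled_dot_product_attention(q, k, v)   # -> (2, 8, 1024, 64)
```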

🍞 Hook: Imagine a big-kid helping a little-kid with homework by explaining answers gently.

🥬 The Concept (Knowledge Distillation): What it is: A smaller student model learns from a larger teacher’s soft predictions. How it works: (1) Teacher produces probabilities. (2) Student tries to match them. (3) Student learns teacher’s dark knowledge (fine distinctions). Why it matters: The student becomes strong with less data and compute.

🍞 Anchor: It’s like getting study tips from a top student so you learn faster.
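A minimal soft-target distillation loss, assuming teacher and student produce logits over the same vocabulary; the temperature value is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between the teacher's softened predictions and the student's."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)            # (1) teacher's soft predictions
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)    # (2) student tries to match them
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Illustrative usage: a batch of 4 examples over a 30k-token vocabulary.
loss = distillation_loss(torch.randn(4, 30000), torch.randn(4, 30000))
```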

🍞 Hook: Think of a science fair where you test what happens if you remove one ingredient from slime.

🥬 The Concept (Ablation Studies): What it is: Experiments that remove or change parts to see what really helps. How it works: (1) Turn a feature off. (2) Measure performance change. (3) Repeat to isolate contributions. Why it matters: Separates what’s essential from what’s optional.

🍞 Anchor: Like baking cookies without sugar to learn sugar’s role.

The Problem Before This Work: Recursive (shared) transformers were tiny and efficient but lost layer-wise diversity. People tried static adapters (always on), Mixture-of-Adapters (experts after the FFN), and Relaxed Recursive Transformers (depth-specific LoRA without token routing). These helped, but none made the shared feed-forward network itself conditionally adapt during pretraining. The Gap: Put conditional computation inside the shared FFN to restore expressivity while keeping parameters tiny. The Stakes: Smaller, smarter models mean phones that understand you better, search engines that find exact answers faster, and greener AI that uses less energy—helping education, accessibility, and everyday apps run smoothly.

02 Core Idea

🍞 Hook: You know how a chef can use the same base sauce but add different spices depending on the dish? One base, many flavors.

🥬 The Aha! Moment: Inject tiny, specialized LoRA experts directly into the shared feed-forward network and use a router to pick the right ones per token—bringing back layer diversity without blowing up parameters.

Multiple Analogies:

  1. Spice Rack Analogy: The shared FFN is the base sauce; MoL experts are spices. The router is the chef’s taste test that picks paprika for paella and basil for pasta.
  2. Headphones EQ Analogy: The shared FFN is the music track; MoL experts are EQ presets (rock, jazz, podcast). The router picks the preset for each passage.
  3. School Tutor Analogy: The shared FFN is the class teacher; MoL experts are specialized tutors (grammar, logic). The router sends each question to the right tutors.

Before vs After:

  • Before (Plain Sharing): Every layer acts the same; to match big models you often need wider layers—costly.
  • After (MoL Inside FFN): The same shared block behaves differently per token because the router selects low-rank updates (experts) that tweak the FFN weights on the fly. You regain layer-like diversity while staying small.

Why It Works (Intuition, No Equations):

  • A shared FFN is a single, fixed path; adding LoRA experts creates many small alternate paths.
  • The router gives each token a map to the right mini-paths (top-2), so tokens about math, law, or sports trigger different micro-adjustments.
  • Because LoRA is low-rank, these adjustments are cheap but expressive enough to carve out specialized behaviors.
  • Conditional computation means we only pay compute for selected experts per token, so we scale capacity without scaling cost linearly.

Building Blocks (with mini-sandwich intros where needed):

  • 🍞 Hook: Like clipping small gadgets onto a keychain only when needed. 🥬 What it is: LoRA experts are small low-rank matrices that gently shift FFN weights. How it works: Freeze big weights, train tiny A/B matrices, scale and add. Why it matters: Keeps parameter growth tiny but impactful. 🍞 Anchor: A pocket tool that makes the main tool perfect for odd jobs.
  • 🍞 Hook: Choosing the right helper is half the job. 🥬 What it is: A router that scores experts per token. How it works: Compute scores, pick top-2, combine their outputs. Why it matters: Without routing, every token gets the same treatment; with routing, tokens get personalized processing. 🍞 Anchor: Like a librarian pointing you to the exact shelf.
  • 🍞 Hook: Slow and steady isn’t always best; quick lanes help. 🥬 What it is: FlashAttention speeds attention via IO-aware tiling. How it works: Process in chunks, keep data close, avoid memory thrash. Why it matters: Saves memory and time, enabling longer sequences and faster training. 🍞 Anchor: Like an express line at the store.
  • 🍞 Hook: Order matters, like steps in a recipe. 🥬 What it is: RoPE encodes relative positions via rotations. How it works: Imprints position into queries/keys so distances are preserved. Why it matters: Better sensitivity to order and long spans. 🍞 Anchor: Page numbers help you follow the story.
  • 🍞 Hook: Don’t flood the sink—use a valve. 🥬 What it is: GeGLU gates what passes through FFN. How it works: A gate modulates content, improving gradient flow. Why it matters: Stabilizes and enriches representations. 🍞 Anchor: A faucet that stops splashing.
  • 🍞 Hook: Learn from a pro to get good fast. 🥬 What it is: Distillation from ModernBERT. How it works: Student matches teacher’s soft predictions. Why it matters: Strong results with fewer tokens. 🍞 Anchor: A coach sharing winning tricks.

Net Effect: MoL restores the missing diversity caused by parameter sharing, making tiny models act like bigger, smarter ones—without paying a big memory or compute bill.

03 Methodology

High-Level Flow: Input tokens → Embeddings with RoPE → Multi-Head Attention (FlashAttention) → Shared FFN or MoL-FFN (with router, top-2 LoRA experts) → Residual/LayerNorm → Output representations
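Before the step-by-step details, here is a minimal sketch of the weight-sharing loop implied by this flow: one block per group, applied several times in a row. The class and argument names are hypothetical, and a real block would contain attention plus the (MoL-)FFN rather than the toy stand-in used below.

```python
import torch

class RecursiveEncoder(torch.nn.Module):
    """Reuse one shared block per group across several depths (ALBERT-style sharing)."""
    def __init__(self, shared_blocks, layers_per_group: int = 4):
        super().__init__()
        self.shared_blocks = torch.nn.ModuleList(shared_blocks)   # one set of weights per group
        self.layers_per_group = layers_per_group

    def forward(self, hidden):                        # hidden: (batch, seq_len, d_model)
        for block in self.shared_blocks:              # e.g. 3 groups ...
            for _ in range(self.layers_per_group):    # ... each applied 4 times = 12 "layers"
                hidden = block(hidden)                # the same weights are reused at every depth
        return hidden

# Illustrative usage with toy stand-ins for transformer blocks.
blocks = [torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU()) for _ in range(3)]
encoder = RecursiveEncoder(blocks, layers_per_group=4)
out = encoder(torch.randn(2, 128, 64))
```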

Step-by-Step Details:

  1. Tokenization and Embeddings
  • What happens: Convert words to token IDs, map to vectors, and add Rotary Position Embeddings (RoPE) so the model knows the order.
  • Why it exists: Without positions, the model can’t tell “dog bites man” from “man bites dog.”
  • Example: For tokens [The, dog, runs], RoPE helps attention understand that “The” comes before “dog,” guiding grammatical cues.
  2. Attention with FlashAttention
  • What happens: Compute attention among tokens using IO-aware tiling to save memory/time.
  • Why it exists: Standard attention can be slow and memory-heavy; FlashAttention keeps training fast and stable.
  • Example: In “Paris is the capital of France,” attention can link “Paris” with “capital” and “France” efficiently.
  3. Recursive Parameter Sharing (Groups)
  • What happens: The model is organized in groups; layers inside a group share the same weights (ALBERT-style). Some groups replace the shared FFN with a MoL-FFN.
  • Why it exists: Drastically cuts parameters by reusing one set of weights across depth.
  • Example: A 12-layer model might have 3 groups of 4 layers each; all 4 layers in a group share weights, keeping memory small.
  4. Standard FFN vs. MoL-FFN
  • What happens: Usually, tokens go through a feed-forward network with GeGLU activation (content Ă— gate). In selected groups, we swap this FFN with a Mixture-of-LoRAs (MoL) FFN.
  • Why it exists: The plain shared FFN is too uniform; MoL-FFN brings token-conditional diversity back.
  • Example: If a token is about “finance,” the router will choose experts that are good at numbers or economics tones.
  5. Inside the MoL-FFN (Secret Sauce)
  • What happens:
    • A tiny router looks at each token’s features and gives a score to each expert.
    • Only the top-2 experts run (sparse), saving compute.
    • Each chosen expert adds a low-rank LoRA update inside the FFN’s down and up projections, then GeGLU applies gating.
    • Outputs from the two experts are combined using their normalized scores.
  • Why it exists: Putting experts inside the FFN (not after it) lets the model change the internal transformation itself, which is more powerful than stacking adapters at the end.
  • Example with numbers: Suppose there are 8 experts. A token gets scores like [0.40, 0.35, 0.10, ...]. The router picks experts 1 and 2 (0.40 and 0.35), normalizes them to [0.53, 0.47], runs just those two, and blends their outputs ~53%/47%.
  6. Training Setup and Stability Aids
  • What happens:
    • Initialization: Start ModernALBERT from a fully-parameterized ModernBERT to give the shared layers a strong base.
    • Knowledge Distillation: Use ModernBERT’s soft targets to teach ModernALBERT better decision boundaries.
    • Curriculum: Warm up on RedPajama-1T for 20–30k steps, then train on RefinedWeb for another 70–80k steps (batch 384, seq len 1024).
    • Optimizer: AdamW with linear warmup and decay (peak LR 5e-4 or 5e-5).
  • Why it exists: Parameter sharing plus routing can be tricky to train; good initialization and distillation smooth training and boost data efficiency.
  • Example: Think of it as learning to ride with training wheels (teacher model) before biking solo.
  7. Router Pretraining/Fine-tuning Tweaks
  • What happens: For small datasets (e.g., RTE, CoLA), pretrain the router on MNLI and freeze it during fine-tuning to stabilize results. For large datasets, allow router to keep learning.
  • Why it exists: Routers can overfit on small data; freezing a good routing policy prevents instability.
  • Example: On RTE (small), a frozen router guides tokens to the right experts based on prior MNLI experience.
  8. Expert Merging for Inference (Deployment Trick)
  • What happens: To avoid routing overhead at inference, merge all experts into one dense adapter using either:
    • Uniform Averaging: Average all experts equally.
    • EMA Merging: Track the router’s average choices during fine-tuning and weight experts accordingly.
  • Why it exists: Conditional branching adds latency. Merging captures most of the learned expressivity in one adapter, speeding up deployment while keeping accuracy.
  • Example: EMA merging preserves more accuracy on RTE and SST-2 than uniform averaging, nearly matching the unmerged model.
  9. Model Variants
  • What happens: Tiny (4 experts, top-1), Medium/Base/Large (8 experts, top-2), with shared groups where MoL layers replace certain FFNs. All use RoPE, GeGLU, and FlashAttention.
  • Why it exists: Different sizes fit different hardware and latency budgets.
  • Example: ModernALBERT-large (~120M) reaches top GLUE scores among compact models, while tiny/medium shine in resource-limited settings.
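Pulling steps 4 and 5 together, here is a simplified sketch of a MoL feed-forward block: a shared GeGLU FFN whose up and down projections are nudged per token by the top-2 LoRA experts that the router selects. The dimensions, initialization, and the dense per-expert loop are illustrative simplifications, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class MoLFeedForward(torch.nn.Module):
    """Shared GeGLU FFN whose projections are modulated per token by top-k LoRA experts."""
    def __init__(self, d_model=256, d_hidden=1024, num_experts=8, r=8, k=2):
        super().__init__()
        self.k = k
        self.up = torch.nn.Linear(d_model, 2 * d_hidden)           # shared base weights
        self.down = torch.nn.Linear(d_hidden, d_model)
        self.router = torch.nn.Linear(d_model, num_experts)
        # One low-rank (A, B) pair per expert for each projection.
        self.up_A = torch.nn.Parameter(torch.randn(num_experts, r, d_model) * 0.01)
        self.up_B = torch.nn.Parameter(torch.zeros(num_experts, 2 * d_hidden, r))
        self.down_A = torch.nn.Parameter(torch.randn(num_experts, r, d_hidden) * 0.01)
        self.down_B = torch.nn.Parameter(torch.zeros(num_experts, d_model, r))

    def expert_ffn(self, x, e):
        """Run the shared FFN with expert e's low-rank deltas added to both projections."""
        h = self.up(x) + x @ self.up_A[e].T @ self.up_B[e].T        # LoRA update inside the up projection
        gate, content = h.chunk(2, dim=-1)
        h = F.gelu(gate) * content                                   # GeGLU gating
        return self.down(h) + h @ self.down_A[e].T @ self.down_B[e].T

    def forward(self, x):                                            # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)                   # per-token expert scores
        top_scores, top_idx = scores.topk(self.k, dim=-1)            # keep only the top-k experts
        weights = top_scores / top_scores.sum(dim=-1, keepdim=True)  # renormalize the chosen scores
        out = torch.zeros_like(x)
        for slot in range(self.k):                                   # blend the chosen experts per token
            for e in range(self.router.out_features):
                mask = top_idx[:, slot] == e                         # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.expert_ffn(x[mask], e)
        return out
```

A routed design like this only pays for two experts per token, which is the conditional-computation point in step 5.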

What Breaks Without Each Part:

  • No MoL inside FFN: You lose token-conditional internal changes; expressivity drops vs. MoL/MoE designs.
  • No router (or always-on experts): You waste compute and lose specialization benefits.
  • No FlashAttention: Training/inference slow down and memory pressure rises.
  • No RoPE: Weaker handling of order and long-range relations.
  • No GeGLU: FFN less expressive and less stable.
  • No distillation/init: Training takes longer and may underperform with the limited 30B-token budget.

Secret Sauce Summary: The clever move is putting conditional LoRA experts inside the shared FFN—where the representation is actually transformed—so the same shared block can act differently per token, restoring the diversity that parameter sharing removes.

04 Experiments & Results

The Test: The authors evaluate natural language understanding (GLUE), question answering (SQuAD-v2), and retrieval (BEIR). They also run ablations to see whether MoL beats Mixture-of-Adapters and Relaxed Recursive Transformers, and they test the expert-merging trick for deployment.

The Competition: Baselines include classic and modern encoders like BERT and RoBERTa, strong compact models (e.g., MosaicBERT, NomicBERT, GTE-en), a modern high-performing baseline (ModernBERT), and architectures aimed at parameter efficiency (ALBERT, RRT).

Scoreboard with Context:

  • GLUE: ModernALBERT-large (120M) scores 88.72 average, edging out ModernBERT-base (149M, 88.45). That’s like a smaller kid getting an A+ while a bigger kid gets a solid A—impressive given fewer parameters. On RTE, STS-B, MRPC, ModernALBERT-large hits state-of-the-art among base-class models (e.g., 92.7 MRPC, 92.1 STS-B), showing great semantic precision.
  • SQuAD-v2: ModernALBERT-base reaches 92.8 F1 and 86.1 EM, slightly beating ModernBERT-base (92.6 F1) and ALBERT-xxlarge (92.5 F1). Think of it as winning a close race by leaning at the finish line—small but meaningful.
  • BEIR (subset): ModernALBERT leads compact models on tasks like ArguAna (48.82 vs. ModernBERT’s 35.7), showing excellent domain adaptation—like a student who can ace pop quizzes in surprise subjects.

Surprising/Notable Findings:

  • MoL vs. Mixture-of-Adapters (MoA): With 8 experts and top-2 routing, MoL scores 77.24 vs. MoA’s 76.87 on a GLUE setting used for ablations. That’s a clear, consistent edge for putting experts inside the FFN rather than tacking them on after it.
  • Scaling Experts: Moving from 1 to 8 experts (top-2) gives about +1.16 GLUE points in the ablation setting, showing real gains from conditional capacity, not just more parameters.
  • MoL vs. Relaxed Recursive Transformers (RRT): Under matched initialization/training, MoL hits 81.94 GLUE vs. 80.95 for RRT—evidence that token-conditional routing inside the shared FFN trains more effectively than static, depth-only LoRA.
  • Expert Merging: After training with routing, merging experts into one adapter keeps accuracy close while slashing latency and memory. For example, ModernALBERT-tiny achieves ~9.46 ms latency, ~106k tokens/s, and ~0.196 GB memory—like turning a Swiss Army knife into a single, sharp tool for everyday carry.
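The merging trick in the last finding can be sketched as follows: collapse each expert's effective update (B·A) into a single dense adapter, weighting experts either equally or by how often the router picked them during fine-tuning. The tensor layouts and function name are assumptions for illustration, not the authors' code.

```python
import torch

def merge_experts(expert_A, expert_B, usage=None):
    """Collapse per-expert LoRA factors into one dense adapter for inference.

    expert_A: (num_experts, r, d_in), expert_B: (num_experts, d_out, r).
    usage: optional (num_experts,) running average of router picks; None = uniform averaging.
    """
    num_experts = expert_A.shape[0]
    if usage is None:
        weights = torch.full((num_experts,), 1.0 / num_experts)   # uniform averaging
    else:
        weights = usage / usage.sum()                             # EMA-style merging: weight by routing habits
    # Merge each expert's effective update B @ A (not the factors separately).
    delta = sum(w * (expert_B[e] @ expert_A[e]) for e, w in enumerate(weights))
    return delta  # (d_out, d_in): add this to the shared projection and drop the router at deployment
```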

Interpreting the Numbers:

  • Hitting 88.72 GLUE at 120M parameters is like scoring higher on the final with fewer hours of study—parameter efficiency paying off.
  • Beating larger baselines on SQuAD-v2 (F1) means the token-level reasoning is crisp even with shared layers, thanks to expert modulation.
  • Jumping ahead on BEIR’s ArguAna suggests the router learns useful domain splits so tokens trigger the right specialists—an important sign of real-world adaptability.

Takeaway: Conditional computation inside the shared FFN closes the expressivity gap of parameter sharing and sometimes flips the script—small, well-aimed models can outplay bigger dense ones.

05 Discussion & Limitations

Limitations:

  • Inference Overhead Before Merging: Routing and activating multiple experts per token adds branching and latency; although top-2 is light, it’s not free. Merging is a strong fix, but it removes dynamic routing at deployment time.
  • Global Attention Only: Without local/block-sparse patterns, very long-context tasks may suffer compared to architectures that blend local and global attention.
  • Router Sensitivity on Small Data: On tiny datasets, routers can overfit; freezing or pretraining helps but isn’t a perfect cure.

Required Resources:

  • Training: Modest by today’s standards (30B tokens), but still requires multi-GPU setups for pretraining and careful engineering (FlashAttention, RoPE, GeGLU).
  • Inference: Very efficient when experts are merged (tiny memory footprint and fast throughput), moderate if keeping routing live.

When NOT to Use:

  • Ultra-low-latency, on-device inference with no tolerance for routing overhead (unless you merge).
  • Extremely long-sequence tasks where local attention patterns yield big wins.
  • Settings where expert specialization is unlikely (very homogeneous data), making routing less beneficial.

Open Questions:

  • Can we learn even better routers that are robust on small data without freezing?
  • What’s the best way to place MoL layers across groups for maximum gain per parameter?
  • Can we combine MoL with local/block-sparse attention to handle very long contexts efficiently?
  • How does expert merging compare to knowledge distillation into a single adapter for different tasks?
  • Can these ideas generalize smoothly to large autoregressive LLMs and multimodal transformers while preserving efficiency?

Honest Assessment: MoL elegantly restores expressivity inside shared FFNs and shows clear empirical wins. The approach offers a practical training/deployment story (dynamic while learning, merged for speed in production) with a few caveats around routing cost and long-context modeling that future work can tackle.

06 Conclusion & Future Work

Three-Sentence Summary: Recursive transformers are tiny but lose diversity because every layer shares the same weights. This paper inserts Mixture of LoRAs directly inside the shared FFN and uses a router to pick experts per token, bringing back specialization without inflating parameters. The result, ModernALBERT, achieves state-of-the-art results among compact models and even beats larger dense baselines on key benchmarks, with an expert-merging trick that keeps deployment fast.

Main Achievement: Showing that conditional computation inside the shared FFN—via low-rank LoRA experts selected per token—is the key to restoring expressivity lost to aggressive parameter sharing.

Future Directions:

  • Blend MoL with local/block-sparse attention to handle very long contexts efficiently.
  • Extend the framework to autoregressive LLMs and multimodal models.
  • Explore smarter routers, adaptive expert placement, and advanced merging/distillation methods for even leaner deployments.

Why Remember This: It’s a blueprint for building small models that punch above their weight—train with the flexibility of experts, deploy with the speed of a dense model. That combination can make high-quality language understanding more accessible, greener, and ready for real-world apps on everyday hardware.

Practical Applications

  • On-device assistants that understand queries accurately while running fast and light.
  • Customer support bots that adapt to finance, travel, and tech topics via expert routing.
  • Search and retrieval systems that find exact answers across specialized domains.
  • Education tools that grasp nuances in questions and provide step-by-step help.
  • Healthcare triage chatbots that route medical terms to the right domain experts (with human oversight).
  • Document classification and tagging for enterprises with varied, evolving vocabularies.
  • Legal and policy text analysis where subtle, domain-specific phrasing matters.
  • Low-latency QA for mobile apps using merged-expert deployment.
  • Multilingual or code-mixed understanding by letting different experts specialize in language families.
  • Efficient fine-tuning for new tasks using lightweight LoRA experts instead of full retraining.
#Mixture of LoRAs#recursive transformers#parameter sharing#conditional computation#LoRA#Mixture-of-Experts#FlashAttention#Rotary Position Embeddings#GeGLU#knowledge distillation#adapter merging#GLUE#SQuAD-v2#BEIR#ModernALBERT
Version: 1