VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse
Key Summary
- Large language models get smarter when they get bigger, but storing all those extra weights eats tons of memory.
- VersatileFFN makes models smarter without adding many new weights by reusing the same feed-forward network in two clever ways: wide (many mini-experts) and deep (repeat steps for hard tokens).
- It builds "virtual experts" by slicing one shared FFN, so it feels like a Mixture-of-Experts but barely increases memory.
- It also lets hard tokens loop through the same FFN multiple times, giving them extra thinking without new parameters.
- A difficulty-aware gate decides, per token, whether to take the fast wide path or the thoughtful deep path, and how to mix them.
- Across many benchmarks and model sizes, VersatileFFN beats same-size or same-FLOPs baselines, including k-Loop and MoE, in average accuracy.
- Compared to MoE, it avoids a huge parameter jump while keeping adaptive routing benefits.
- Compared to just adding more loops, it uses compute more wisely, often reaching better accuracy with fewer FLOPs.
- This approach is practical for memory-limited settings like edge devices and cheaper cloud deployments.
- Key idea: add capacity with computation reuse, not with more stored weights.
Why This Research Matters
VersatileFFN helps powerful language models run on devices and servers with limited memory by adding capacity through computation, not stored weights. That means lower costs for companies serving many users and better access for schools, nonprofits, and small teams. It can cut energy and carbon use by avoiding the need to host massively larger models just to get better reasoning. On phones and edge devices, it makes smarter assistants more practical without huge downloads. In safety-critical settings, the adaptive depth can focus extra "thinking time" only when needed. Overall, it brings stronger reasoning to more places, more affordably.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook): You know how a bigger backpack can fit more books, but it also gets too heavy to carry? Big AI models are like that: more parameters (weights) make them smarter, but they get heavy on memory.
Filling (The Actual Concept): Parameters are the stored numbers a model keeps in memory to know what to do. How it works: 1) We add layers and width to store more knowledge. 2) This boosts performance but makes the model huge to store and serve. 3) Memory becomes the bottleneck even before raw compute. Why it matters: Without a smarter approach, we either pay a lot for big GPUs or split models across machines, which adds cost and delay.
Bottom Bread (Anchor): A 1-2B parameter model can already be tough to host on a single consumer GPU; imagine trying to deploy dozens at once.
Top Bread (Hook): Imagine a library that tries to fit in a tiny room. You could squish the books (compress), but the room doesn't get bigger.
Filling: Model compression (like pruning and quantization) reduces storage but doesn't add new thinking power. How it works: 1) Pruning removes less-important connections. 2) Quantization stores numbers with fewer bits. 3) Low-rank adapters add small task modules. Why it matters: These methods approximate the big model but can't break its representation ceiling; they don't make the architecture itself more capable.
Bottom Bread: A pruned or quantized model may run on your laptop, but if the original design couldn't learn a tough skill, the compressed version won't either.
Top Bread (Hook): Think of a restaurant with many kitchens (experts). Only a few open for each order to save time and energy.
Filling: Mixture-of-Experts (MoE) lets a model pick a few experts for each token. How it works: 1) A router scores which experts fit a token. 2) Only top-k experts run (sparse compute). 3) Their outputs are combined. Why it matters: Great compute efficiency, but storing many separate experts still explodes memory.
Bottom Bread: You save cooking time by opening two kitchens instead of twenty, but you still had to build all twenty kitchens.
Top Bread (Hook): When homework is easy, you solve it once. When it's tricky, you check and redo steps.
Filling: Recursion in models means reusing the same layer multiple times for hard inputs. How it works: 1) Run the same FFN again and again. 2) Decide per token how many repeats. 3) Hard tokens get more passes. Why it matters: You add thinking time without adding new weights. But traditional approaches don't mix this with width-wise variety.
Bottom Bread: For "cat" you pass once; for "quantum tunneling" you might pass three times to refine understanding.
The world before: We chased performance by adding more parameters (dense and MoE), hitting memory walls. The problem: Make models stronger under a fixed parameter budget, not just compress them. Failed attempts: Compression trades some quality for size; pure MoE saves compute but not memory; pure recursion adds compute but can be blunt and uniform. The gap: A design that expands capacity by reusing the same weights across both width (variety) and depth (steps) adaptively. Real stakes: Cheaper serving, greener AI, better on-device assistants, and wider access without massive hardware.
02 Core Idea
Top Bread (Hook): Imagine one Swiss Army knife that can act like many tools and can also use the same blade multiple times for tougher jobs.
Filling (The Actual Concept): VersatileFFN is a feed-forward layer that reuses the same weights in two ways (wide: many virtual mini-experts; deep: repeat passes) and mixes them per token with a difficulty-aware gate. How it works: 1) The width-versatile path slices one shared FFN into non-overlapping subspaces to form virtual experts and routes tokens to top-k of them. 2) The depth-versatile path applies the entire shared FFN multiple times for hard tokens, chosen by a learned loop controller. 3) A gate uses the predicted loop count as a "difficulty" signal to blend the two outputs. Why it matters: It adds capacity via computation, not memory, so you get MoE-like adaptability and recursive refinement without storing many extra weights.
Bottom Bread (Anchor): The token "the" goes quickly through a couple of virtual experts; the token "photosynthesis" gets more iterative passes. The model fuses the two results using the difficulty signal.
The "Aha!" in one sentence: Reuse the same FFN weights both across width (virtual experts) and depth (loops), then let token-level difficulty decide how to split compute between them.
Three analogies:
- City traffic: Add extra lanes (width) for many simple cars and let some cars take a longer scenic route (depth) when they need more careful driving.
- School help: A student can ask two classmates (width) for quick hints or spend extra time reworking the same problem (depth); the teacher decides which per question.
- Cooking: Use the same base sauce (shared FFN) but serve it in small flavored bowls (virtual experts) for quick tastes, or simmer the same pot longer (loops) for complex dishes.
Before vs After:
- Before: Either store many experts (MoE: memory heavy) or just add loops (k-Loop: compute heavy, blunt per-token).
- After: Get MoE-like variety and loop-like depth from one set of weights, with a smart gate that chooses how much of each per token.
Why it works (intuition):
- Many tasks need variety (different subskills) and steps (extra refinement). Slicing hidden dimensions yields diverse behaviors without extra parameters; looping enables progressive improvement for hard tokens. A difficulty signal from the loop predictor naturally indicates when to rely more on depth versus width.
Building blocks (each with its own mini-explanation):
- Hook: You know how a big closet can be partitioned into neat sections? Concept: Virtual experts are non-overlapping slices of one FFN's hidden units. How: assign each expert to a strided slice; route top-k per token. Why: Emulates MoE without storing many expert weights. Anchor: One closet, many shelves; no new closets bought.
- Hook: Redoing a math step can fix a mistake. Concept: Recursive depth applies the same FFN multiple times. How: predict loops per token; run the FFN that many times; aggregate or early-exit. Why: Hard tokens get more thinking without new parameters. Anchor: "quantum" loops more than "cat."
- Hook: A smart dispatcher can send easy chores to quick helpers and tough tasks to specialists. Concept: Difficulty-aware fusion. How: use the expected loop count as a difficulty proxy to set a mixing weight between the width and depth outputs. Why: Avoids wasting compute on easy tokens and under-thinking hard ones. Anchor: Stopwords get the quick path; reasoning words get deep passes.
03 Methodology
High-level pipeline: Input tokens → Self-Attention (unchanged) → VersatileFFN: [Width path in parallel] + [Depth path with loops] → Difficulty-aware fusion → Output tokens.
Step 0: Keep Attention As-Is
- What: The standard self-attention block computes context-aware representations H from X.
- Why: Focus the innovation on the FFN, making drop-in replacement easy.
- Example: For the sequence ["the", "cat", "slept"], attention mixes information so each token "knows" about its neighbors.
Step 1: Build Virtual Experts from One FFN (Width-Versatile)
- What happens: Take the FFNâs hidden dimension and slice it into N non-overlapping chunks. Each chunk behaves like a lightweight virtual expert with its own input slice and output slice aligned.
- Why this step exists: It gives MoE-like diversity (different subskills) while keeping just one set of stored weights. Without it, you lose width-wise specialization and adaptability.
- Example with data: Suppose d_hidden=2048 and we choose N=8 experts, each d_expert=256. The router scores 8 experts per token; top-2 run. For "cat", maybe experts 1 and 5 fire; for "quantum", experts 2 and 7.
How token routing works (sparse expert routing):
- What: A small router multiplies H by a gate matrix to score experts; pick top-k.
- Why: Saves compute: only a few experts run per token. Without routing, you'd compute all experts and waste FLOPs.
- Example: "the" → experts {0,3}; "photosynthesis" → {2,6}; outputs are weighted by their gate scores and summed.
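To make Step 1 concrete, here is a minimal PyTorch sketch of the width-versatile path under the assumptions above (8 virtual experts, top-2 routing). It is illustrative, not the authors' implementation: the class and variable names are made up, a plain GELU two-layer FFN stands in for the model's actual feed-forward block, and the slices are taken as contiguous chunks for readability (the paper describes non-overlapping, strided slices).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WidthPath(nn.Module):
    """Toy width-versatile path: virtual experts are slices of one shared FFN,
    and a small router activates the top-k slices per token."""

    def __init__(self, shared_up, shared_down, num_experts=8, top_k=2):
        super().__init__()
        self.up, self.down = shared_up, shared_down            # the single stored FFN, reused
        d_model, d_hidden = shared_up.in_features, shared_up.out_features
        assert d_hidden % num_experts == 0
        self.router = nn.Linear(d_model, num_experts)          # tiny added parameter cost
        self.num_experts, self.top_k = num_experts, top_k
        self.d_expert = d_hidden // num_experts                # e.g., 2048 / 8 = 256

    def forward(self, h):                                      # h: (tokens, d_model)
        scores = self.router(h)                                # (tokens, num_experts)
        topk_val, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)                  # gate scores over the chosen experts

        out = torch.zeros_like(h)
        for e in range(self.num_experts):
            rows = slice(e * self.d_expert, (e + 1) * self.d_expert)
            for k in range(self.top_k):
                sel = topk_idx[:, k] == e                      # tokens routed to virtual expert e
                if not sel.any():
                    continue
                # Only this expert's slice of the shared weights runs: a block of rows
                # of W_up and the matching block of columns of W_down.
                hidden = F.gelu(F.linear(h[sel], self.up.weight[rows], self.up.bias[rows]))
                out[sel] = out[sel] + weights[sel, k].unsqueeze(-1) * F.linear(
                    hidden, self.down.weight[:, rows])
        return out + self.down.bias                            # (tokens, d_model)
```

Because each virtual expert is just an index range into the shared weights, the only new parameters in this sketch are the router's.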
Step 2: Reuse the Whole FFN Recursively (Depth-Versatile)
- What happens: The same full FFN (no slicing) is applied multiple times per token. A loop predictor (small head) estimates how many iterations each token needs.
- Why this step exists: Some tokens need more progressive refinement. Without it, you can't give hard tokens extra reasoning without adding new layers.
- Example with data: Max loops L_max=4. For an easy token, predicted loops=1; for a hard token, loops=3. During training, a relaxed (Gumbel-Softmax) probability over {1,2,3,4} allows gradients; at inference, choose the argmax and run exactly that many iterations.
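A matching sketch of the depth-versatile path, again with hypothetical names; a residual update per pass and soft aggregation of intermediate states are assumptions made for illustration. During training a Gumbel-Softmax over loop counts keeps the choice differentiable; at inference the hard argmax is used (the separate early-exit optimization appears later).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthPath(nn.Module):
    """Toy depth-versatile path: reapply the same shared FFN up to max_loops times,
    with a small head predicting how many passes each token needs."""

    def __init__(self, shared_up, shared_down, max_loops=4):
        super().__init__()
        self.up, self.down = shared_up, shared_down            # same stored FFN as the width path
        self.loop_head = nn.Linear(shared_up.in_features, max_loops)
        self.max_loops = max_loops

    def forward(self, h, tau=1.0):                             # h: (tokens, d_model)
        logits = self.loop_head(h)                             # logits over {1, ..., max_loops}
        if self.training:
            # Relaxed one-hot keeps gradients flowing through the discrete loop choice.
            probs = F.gumbel_softmax(logits, tau=tau, hard=False)
        else:
            probs = F.one_hot(logits.argmax(-1), self.max_loops).float()

        def shared_ffn(x):                                     # the same weights, applied repeatedly
            return self.down(F.gelu(self.up(x)))

        state, y_depth = h, torch.zeros_like(h)
        for i in range(self.max_loops):
            state = state + shared_ffn(state)                  # assumed residual refinement per pass
            y_depth = y_depth + probs[:, i:i + 1] * state      # soft aggregation over loop counts
        expected_loops = (probs * torch.arange(
            1, self.max_loops + 1, dtype=h.dtype, device=h.device)).sum(-1)
        return y_depth, expected_loops                         # output and per-token E[L]
```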
Step 3: Turn Loop Count into Difficulty (Difficulty-Aware Fusion)
- What happens: Convert the (soft) predicted loop distribution into an expected loop count E[L]. Map this to a mixing weight λ that leans toward width for easy tokens (few loops) and toward depth for hard tokens (many loops). Final output Y = λ·Y_width + (1 - λ)·Y_depth.
- Why this step exists: It unifies the two strengths (quick variety and deep refinement) without manual tuning. Without it, you'd either overuse depth (slow) or overuse width (shallow).
- Example: If E[L]=1.2 out of 4, λ is high (closer to width). If E[L]=3.8, λ is low (closer to depth).
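A small sketch of the fusion step. The exact rule mapping E[L] to λ is not spelled out here, so the linear map below is an assumption chosen to reproduce the behavior described above (λ near 1 for E[L] ≈ 1, near 0 for E[L] ≈ L_max).

```python
import torch

def difficulty_aware_fusion(y_width, y_depth, expected_loops, max_loops=4):
    """Blend the two paths per token: easy tokens lean on width, hard tokens on depth."""
    # Assumed linear map: E[L]=1 -> lam=1 (all width), E[L]=max_loops -> lam=0 (all depth).
    lam = (max_loops - expected_loops) / (max_loops - 1)
    lam = lam.clamp(0.0, 1.0).unsqueeze(-1)                    # (tokens, 1)
    return lam * y_width + (1.0 - lam) * y_depth

# With max_loops=4: E[L]=1.2 gives lam ~ 0.93 (mostly width); E[L]=3.8 gives lam ~ 0.07 (mostly depth).
```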
Training details that keep it stable and efficient:
- Load balancing loss: Encourages the router to use experts fairly so one virtual expert doesnât hog all tokens.
- Temperature annealing for the loop predictor: Starts smooth so training explores options; ends sharper to make decisive loop choices.
- Soft aggregation during training: Combine intermediate loop states with soft weights to keep gradients flowing.
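Two of these training aids can be sketched in a few lines. The load-balancing loss below follows the common Switch-Transformer-style formulation, which is an assumption rather than the paper's exact auxiliary loss, and the linear temperature schedule is likewise only illustrative.

```python
import torch

def load_balancing_loss(router_logits, topk_idx, num_experts):
    """Penalize mismatch between each expert's routed-token share and its mean router
    probability, so no single virtual expert hogs the tokens (Switch-style, assumed)."""
    probs = torch.softmax(router_logits, dim=-1)               # (tokens, num_experts)
    importance = probs.mean(dim=0)                             # average router probability per expert
    counts = torch.zeros(num_experts, device=probs.device)
    counts.scatter_add_(0, topk_idx.reshape(-1),
                        torch.ones(topk_idx.numel(), device=probs.device))
    load = counts / counts.sum()                               # fraction of routing slots per expert
    return num_experts * (importance * load).sum()

def annealed_temperature(step, total_steps, tau_start=2.0, tau_end=0.3):
    """Illustrative linear schedule for the loop predictor's Gumbel-Softmax temperature:
    smooth (exploratory) early, sharp (decisive) late."""
    frac = min(step / max(total_steps, 1), 1.0)
    return tau_start + frac * (tau_end - tau_start)
```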
Inference-time optimizations (speed-ups):
- Discrete early-exit: Run exactly the predicted number of loops (no soft averaging), saving compute.
- Conditional parallelism: If λ is essentially zero, skip the width path; otherwise, run width and depth in parallel for throughput.
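A rough sketch of the discrete early exit at inference time, assuming the same hypothetical loop head and shared FFN as above: each token runs exactly its predicted number of passes, and the batch stops as soon as every token is done.

```python
import torch

@torch.no_grad()
def depth_path_early_exit(h, shared_ffn, loop_head, max_loops=4):
    """Inference-only depth path: no soft averaging, just run each token for its
    predicted loop count and exit early once all tokens have finished."""
    n_loops = loop_head(h).argmax(dim=-1) + 1                  # (tokens,) in {1, ..., max_loops}
    state = h.clone()
    for i in range(1, max_loops + 1):
        active = n_loops >= i                                  # tokens that still need this pass
        if not active.any():
            break                                              # early exit for the whole batch
        state[active] = state[active] + shared_ffn(state[active])
    return state, n_loops

# Conditional parallelism (sketch): if the fusion weight lam is ~0 for a token, its
# width-path output is multiplied by ~0 anyway, so that computation can be skipped;
# otherwise the width and depth paths are independent and can run in parallel.
```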
Concrete walk-through (toy):
- Input token "world" → Attention → H.
- Router scores 8 virtual experts; picks {1,4}; runs their slices; gets Y_width.
- Loop head predicts 2 loops; run FFN twice to get Y_depth.
- Compute λ from predicted difficulty (2 of 4 is medium), blend Y = λ·Y_width + (1 - λ)·Y_depth.
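Putting the sketches above together reproduces this toy walk-through end to end, with the key property that both paths read from the same stored up/down projections (all shapes and weights are random and purely illustrative).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_hidden = 512, 2048
up, down = nn.Linear(d_model, d_hidden), nn.Linear(d_hidden, d_model)   # the one shared FFN

width = WidthPath(up, down, num_experts=8, top_k=2)     # virtual experts = slices of up/down
depth = DepthPath(up, down, max_loops=4)                # loops = repeated passes through up/down
width.eval(); depth.eval()

h = torch.randn(1, d_model)                             # the token "world" after attention
with torch.no_grad():
    y_width = width(h)                                  # top-2 virtual experts fire
    y_depth, expected_loops = depth(h)                  # loop head picks a count in {1, ..., 4}
    y = difficulty_aware_fusion(y_width, y_depth, expected_loops, max_loops=4)

print(y.shape, expected_loops)                          # torch.Size([1, 512]) and the chosen loop count
```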
Secret sauce (whatâs truly clever):
- The same FFN weights generate width diversity (via structured slices) and depth power (via recursion). All extra capacity comes from compute, not memory. This dual reuse turns one "real" expert into many "virtual" behaviors and many "thinking steps," adaptively, per token.
04 Experiments & Results
The test: Can VersatileFFN raise accuracy while keeping parameters nearly fixed and using compute wisely? The authors train OLMo2-style models at three sizes (≈354M, 720M, 1.21B) on FineWeb-Edu and evaluate zero-shot on eight benchmarks (PIQA, HellaSwag, OBQA, SciQ, ARC-e, ARC-c, COMM, WINO).
The competition: Baselines include (a) MoE, which adds multiple real experts (top-2 active) like classic sparse layers, giving good compute use but heavy parameter counts; and (b) k-Loop, which keeps the dense model but repeats the FFN k times per layer, staying parameter-light but compute-heavy and not adaptive.
Scoreboard with context:
- 1.21B scale: VersatileFFN hits ~60.47% average accuracy, beating MoE (~59.65%) and 6-Loop (~60.05%). Think of it as edging ahead to an A- while the others sit at B+/A-, and doing so with far less memory than MoE.
- 720M scale: VersatileFFN ~57.03% vs MoE ~55.87% and 6-Loop ~56.55%, a clear bump despite not ballooning parameters.
- 354M scale: VersatileFFN ~52.33%, ahead of MoE (~51.48%) and 6-Loop (~51.94%). Even the smallest model benefits.
Efficiency comparisons (why itâs practical):
- Parameters: MoE inflates memory (e.g., 1.21B → 1.97B, +63%). VersatileFFN adds only tiny routing/loop heads, effectively the same memory as the base.
- FLOPs: k-Loop costs scale linearly with k (e.g., 4×, 6× the FFN cost). VersatileFFN typically uses fewer FLOPs than high-k loops yet reaches equal or better accuracy (e.g., at ~350M, about 45% fewer FFN FLOPs than 6-Loop while outperforming it).
Surprising findings:
- MoE sometimes shows lower pretraining loss than VersatileFFN, but this does not translate into higher zero-shot accuracy, suggesting VersatileFFN generalizes better on reasoning-heavy tasks (e.g., strong gains on ARC-e, COMM).
- Best depth is not always "more": ablations show accuracy peaks around 4 loops; 6 loops can slightly overfit or waste compute.
Behavioral insights (visual analyses):
- Loop allocation by layer: Smaller models push more loops late; medium models concentrate in the middle; the largest model front-loads loops early then stabilizes. This suggests size shapes where depth helps most.
- Word cloud by difficulty gate: Specific action words (e.g., "clean," "remove," "cut," "cup") tend to get more loops (lower λ), while generic, frequent words ("make," "use," "water," "will") get fewer loops, matching intuition.
Ablations (what matters):
- Each branch helps: Width-only and depth-only each beat the base; combining them with difficulty-aware fusion is best.
- Expert settings: 8 experts with top-2 routing works well; more isnât always better.
- From-scratch training: Even without continued pretraining, VersatileFFN outperforms others (e.g., ~51.14% vs base ~47.98%).
Bottom line: VersatileFFN consistently offers higher average accuracy per parameter and smart compute usage, beating both MoE (memory-hungry) and k-Loop (compute-hungry).
05 Discussion & Limitations
Limitations:
- Token-level controllers add moving parts: a router, a loop predictor, and a fusion rule. These can complicate training dynamics and require careful temperature annealing and load balancing.
- Fixed slicing: Virtual experts are created by static, non-overlapping slices. While simple and efficient, learned, overlapping, or adaptive partitions might capture richer subskillsâbut would be more complex.
- Latency variance: Adaptive loops mean per-token compute varies. On some hardware or batching regimes, this can cause throughput jitter without careful engineering (though conditional parallelism mitigates this).
- Attention unchanged: Gains come purely from the FFN side; interactions with advanced attention tricks (e.g., long-context routing) remain unexplored.
Required resources:
- Comparable to training a dense model of the same size, plus negligible overhead for routing heads and auxiliary losses. No extra parameters for multiple experts or deeper stacks.
- For best results, a decent pretraining corpus (40-100B tokens in the paper) still helps the shared FFN learn versatile features that the slices and loops can exploit.
When not to use:
- Ultra-low-latency micro-inference where any control logic is too costly and inputs are uniformly easy; a plain dense layer could be simpler.
- Tiny models with very small hidden sizes may not have enough room to carve effective virtual experts without hurting capacity per slice.
- Workloads where uniform, predictable compute per token is mandatory (e.g., hard real-time systems) may dislike adaptive depth.
Open questions:
- Can we learn the slice layout (width partition) end-to-end, or make it dynamic per layer? Would overlapping subspaces help or cause interference?
- What's the best way to coordinate attention with adaptive FFN depth? Can loops be triggered by attention uncertainty, too?
- Can we push depth further with curriculum or verification-style passes, yet keep FLOPs in check?
- How does this approach interact with quantization and sparsity when deployed on edge accelerators?
- Is there a theoretical trade-off frontier between slice count, loop depth, and generalization on reasoning tasks?
06 Conclusion & Future Work
Three-sentence summary: VersatileFFN reuses the same FFN both across width (as virtual experts) and across depth (as loops), then fuses them per token using a difficulty signal. This unlocks MoE-like adaptability and recursive refinement without adding many parameters, so capacity grows with computation rather than memory. Across sizes and benchmarks, it outperforms MoE and k-Loop baselines on average accuracy while keeping memory low and compute efficient.
Main achievement: Showing that a single, shared FFN can be repurposed to deliver both expert diversity and iterative reasoning, and that a simple difficulty-aware gate can steer compute to the right path per token, achieving parameter efficiency without sacrificing performance.
Future directions: Learnable or overlapping slice layouts; tighter coupling between attention and adaptive FFN depth; broader tests on long-context and multi-modal tasks; integration with quantization/pruning for edge devices; and theoretical analyses of optimal width-depth reuse. Also, explore curriculum-based loop scheduling and verification passes that selectively increase depth only when confidence is low.
Why remember this: It reframes scaling from "add more weights" to "reuse weights more cleverly." That shift matters for greener AI, cheaper inference, and getting advanced reasoning onto memory-constrained hardware, making powerful language tools more accessible to everyone.
Practical Applications
- Deploy stronger LLMs on memory-limited GPUs or edge devices without big weight growth.
- Reduce cloud inference costs by avoiding MoE-style parameter explosions while keeping adaptive routing benefits.
- Speed up reasoning-heavy tasks by giving only hard tokens extra loops instead of looping everything.
- Build multi-tenant inference services that fit more models per machine through parameter-efficient layers.
- Combine with quantization or pruning to further shrink memory while keeping performance via width-depth reuse.
- Create on-device copilots that stay small in memory but think deeper when needed (e.g., code hints, math help).
- Improve RAG and tool-use systems by letting tokens tied to complex steps trigger deeper refinement.
- Use difficulty signals to monitor uncertainty and trigger verification passes only when beneficial.
- Design curriculum or energy-aware schedulers that cap loops for easy inputs to save battery or cost.
- Retrofit existing Transformer stacks by swapping in VersatileFFN without changing attention.