MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling
Key Summary
- MeKi is a new way to grow a language model’s knowledge by using storage (ROM) instead of extra heavy calculations (FLOPs).
- It adds tiny “memory experts” for each token at every Transformer layer and blends their advice into the model’s thinking with a light gating step.
- During training, MeKi learns rich knowledge using flexible projections; before deployment, it folds this complexity into a static lookup table (re-parameterization).
- At inference, it only does cheap lookups plus small projections, so speed stays the same as a normal model on phones.
- On a smartphone NPU, a 1.7B-parameter MeKi matches a 4B dense model’s average accuracy while decoding about 2.26× faster.
- MeKi consistently beats same-sized dense baselines across 10 benchmarks and outperforms storage-based competitors like PLE and Engram.
- Both static memory (learned per-token tables) and dynamic memory (learned from global embeddings) help, and together they work best.
- Putting MeKi in parallel with the FFN block and using additive-sigmoid gating gives the strongest results.
- Performance scales smoothly with memory size, letting designers trade a bit of ROM for a lot of accuracy.
- This memory-first scaling is especially valuable for private, low-latency AI on edge devices like phones and wearables.
Why This Research Matters
MeKi makes powerful AI practical on everyday devices by trading heavy computation for fast memory lookups, keeping responses quick and energy-friendly. This enables private, offline assistants that don’t need to ship your data to the cloud for good results. It helps older or lower-cost phones run smarter apps without lag, improving accessibility worldwide. Developers can dial in how much extra ROM to use to unlock better accuracy without hurting speed. For users, this means faster, more reliable translations, summaries, and answers right on the device. For industry, it reduces server costs and latency while enhancing battery life and user satisfaction.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re studying for a quiz. You could try to think really hard every time (lots of brain work), or you could keep a small sheet of key facts in your pocket to peek at quickly (a fast memory lookup). Which one saves time when you’re in a hurry?
🥬 The Concept: Before MeKi, making language models smarter usually meant doing way more math. What it is: The AI world used to boost model performance mostly by adding parameters or doing extra thinking steps at test time. How it works: 1) Train bigger models with more layers and wider feed-forward parts, 2) Use Mixture-of-Experts to activate different sub-networks per token, or 3) Spend more time at inference using long chains of thought or search. Why it matters: On phones and small devices, this extra math turns into slow responses and drains battery, which makes the experience feel laggy.
🍞 Anchor: If your phone assistant took five seconds to answer every simple question, you’d stop using it—even if it were a genius.
🍞 Hook: You know how some toys are fun at home but don’t fit in your backpack? Big models love data centers but don’t travel well to tiny devices.
🥬 The Concept: Transformer architecture became the standard brain for language. What it is: A Transformer is a model that reads and writes text using attention and feed-forward layers. How it works: 1) Each token becomes a vector, 2) Attention figures out which tokens to focus on, 3) Feed-forward networks (FFNs) transform the vectors, 4) Stack many layers to get better understanding. Why it matters: Transformers are powerful, but bigger stacks and wider layers mean lots more calculations per token.
🍞 Anchor: It’s like building a tower of Lego floors: more floors make a cooler tower, but it gets heavier and harder to carry.
🍞 Hook: Think about counting your steps. More steps mean more work.
🥬 The Concept: FLOPs are how we count the model’s math steps. What it is: FLOPs measure the amount of computation (floating-point operations). How it works: 1) Every matrix multiplication adds lots of FLOPs, 2) The cost grows quickly as layers get wider, 3) FLOPs add up across many layers and tokens. Why it matters: On-device NPUs have strict power and memory limits, so too many FLOPs slow everything down.
🍞 Anchor: If every answer your phone gives takes a marathon of steps, your battery and patience run out.
🍞 Hook: A bookshelf holds lots of knowledge, even if you’re not reading it all at once.
🥬 The Concept: ROM (Read-Only Memory) is storage you can read fast but don’t rewrite during use. What it is: ROM on phones (like UFS storage) can deliver data quickly without doing heavy math. How it works: 1) Store fixed tables, 2) Look up by an index (like a page number), 3) Read small chunks when needed, 4) Keep compute light. Why it matters: ROM bandwidth on phones is often underused during inference, so it’s a great spot to stash extra knowledge.
🍞 Anchor: Instead of solving a problem from scratch every time, you open to the bookmarked page and grab the answer you need.
🍞 Hook: People tried to be clever by choosing only a few helpers each time.
🥬 The Concept: Mixture-of-Experts (MoE) makes models bigger but uses only some experts per token. What it is: MoE picks a few specialized FFNs for each token. How it works: 1) A router chooses experts, 2) Their outputs are combined, 3) Repeat per token and per layer. Why it matters: On servers it’s great, but on phones, constantly loading many separate expert weights from memory creates delays; memory traffic becomes the bottleneck.
🍞 Anchor: It’s like asking different friends for advice every sentence, but each friend lives across town, so you waste time driving around.
🍞 Hook: What if we grew knowledge without growing the number of math steps?
🥬 The Concept: This paper’s gap and idea. What it is: We need a way to scale model capacity using storage instead of compute. How it works: 1) Learn useful, token-specific knowledge during training, 2) Compress it into static tables, 3) At inference, do simple lookups plus tiny projections, 4) Keep speed. Why it matters: This lets small on-device models feel like bigger ones without slowing down.
🍞 Anchor: It’s like carrying a smart pocket guide—tiny to use, big in stored knowledge—so you answer quickly without doing long math in your head.
02 Core Idea
🍞 Hook: Imagine your brain having a tiny cheat sheet for each word, ready to whisper the most helpful hint the moment you see it.
🥬 The Concept: The Aha! MeKi scales model capacity with storage (ROM) by injecting token-level expert knowledge at every layer with almost zero extra compute at inference. What it is: A memory-based module that, for each token, grabs a small expert vector from a layer-specific table and blends it into the layer’s hidden state. How it works: 1) During training, learn rich expert vectors using both static tables and dynamic projections, 2) Before deployment, fold the dynamic parts into the table (re-parameterization), 3) At inference, just do a lookup and a small gate+projection, 4) Add the result to the residual stream in parallel with the FFN. Why it matters: You get the brains of a bigger model without paying extra time or power during use.
🍞 Anchor: On a phone, MeKi-1.7B performed like a 4B dense model in accuracy while keeping fast decoding—like racing with a smaller engine but a turbocharged memory.
🍞 Hook: You know how a well-organized binder makes homework faster even if your pencil isn’t any sharper?
🥬 The Concept: MeKi’s token-level memory experts. What it is: For each word (token) and for each layer, there’s a small vector of “what usually helps here.” How it works: 1) Look up the token’s vector from a layer’s memory bank (static part), 2) Add a learned, dynamic refinement from global embeddings (during training), 3) Blend with a light gate that reads the current context, 4) Project back to the model size and add to the layer’s output. Why it matters: Without these experts, the model’s FFNs must memorize and compute everything; with them, much of the “known stuff” is fetched quickly from storage.
🍞 Anchor: Seeing the token “photosynthesis,” the memory expert can nudge the model toward plant-energy facts without heavy extra thinking.
🍞 Hook: Changing the shape of a suitcase lets you fit the same clothes more neatly.
🥬 The Concept: Re-parameterization. What it is: A way to turn a complicated training-time pathway into a simple inference-time lookup. How it works: 1) Train with powerful nonlinear projections from global embeddings, 2) After training, precompute those projections for all tokens, 3) Merge them into the table, 4) At inference, skip the heavy math and just read the precomputed vectors. Why it matters: You keep learning flexibility but deploy with speed.
🍞 Anchor: It’s like pre-cooking meals on Sunday so weekday dinners are heat-and-eat.
🍞 Hook: A dimmer switch sets how bright a lamp should be; you don’t rebuild the lamp each time.
🥬 The Concept: Low-rank gating mechanism. What it is: A small, efficient controller that adjusts how much the memory expert should influence the current token. How it works: 1) Read the token’s current hidden state, 2) Pass it through a slim linear layer and a sigmoid to get a gate, 3) Add this gate to the expert vector (additive-sigmoid), 4) Project up and add to the residual. Why it matters: Without this contextual gate, the same memory hint would be used equally in all situations and could hurt understanding.
🍞 Anchor: For the word “bank,” the gate helps decide if we mean “river bank” or “money bank” based on the sentence.
🍞 Hook: Upgrading shelves instead of your calculator can still make homework faster.
🥬 The Concept: Before vs After. What it is: Before = scale with more parameters or test-time compute; After = scale with stored knowledge plus tiny compute. How it works: 1) Offload lots of facts into ROM, 2) Use lookups and small gates, 3) Keep RAM and FLOPs steady, 4) Achieve higher accuracy. Why it matters: Edge models become both smarter and still snappy.
🍞 Anchor: A 1.7B model with MeKi scored like a 4B dense model on average benchmarks but kept phone-level speed.
🍞 Hook: Picture a recipe broken into handy mini-steps you can do quickly.
🥬 The Concept: Building blocks of MeKi. What it is: (a) Layer-specific memory banks, (b) Static + dynamic expert vectors, (c) Additive-sigmoid low-rank gate, (d) Re-parameterized tables for inference, (e) Parallel placement with FFN. How it works: 1) For each token, fetch a small vector, 2) Contextually adjust it with a gate, 3) Project and add to the residual stream, 4) Train richly, deploy simply. Why it matters: Each piece is small, but together they free accuracy from compute.
🍞 Anchor: Like adding a helpful sidecar to a bike—you don’t rebuild the bike; you neatly attach extra carrying space.
03 Methodology
At a high level: Input tokens → Lookup layer-wise memory experts → Add a tiny context gate → Project and add to residual (in parallel with FFN) → Output tokens.
Step 1: Token and hidden states enter a Transformer layer
- What happens: The usual Transformer flow prepares a hidden state for each token via attention and normalization. MeKi taps into the same hidden states that the FFN sees.
- Why this step exists: We need the current context to decide how much the memory hint should matter; otherwise the same hint could be wrong in different sentences.
- Example: For the token “bark,” the surrounding words “tree” vs “dog” shape the hidden state differently.
🍞 Hook: You know how you grab an index card from a box when you see a vocabulary word?
🥬 The Concept: Layer-specific memory bank lookup. What it is: Each layer keeps a table where each token ID maps to a small expert vector. How it works: 1) Use the token’s ID like a card number, 2) Pull out its small vector from that layer’s table, 3) Optionally, during training, add a dynamic refinement computed from the global word embeddings, 4) Normalize and scale. Why it matters: Without the lookup, the model must recompute “obvious” facts every time. With the lookup, it reads them instantly.
🍞 Anchor: When you see “triangle,” the expert vector gently reminds you: three sides, angles add to 180 degrees.
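To make the lookup concrete, here is a minimal PyTorch-style sketch of what one layer's memory bank could look like. The class name LayerMemoryBank, the LayerNorm, and the shape of the dynamic projector are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LayerMemoryBank(nn.Module):
    """Sketch of one layer's token-indexed memory bank (illustrative, not the paper's code)."""

    def __init__(self, vocab_size: int, embed_dim: int, mem_dim: int):
        super().__init__()
        # Static part: one small expert vector per token ID, specific to this layer.
        self.static_table = nn.Embedding(vocab_size, mem_dim)
        # Dynamic part (training only): refines the expert from the global word embeddings.
        # It gets folded into the static table before deployment (re-parameterization, see below).
        self.dynamic_proj = nn.Sequential(
            nn.Linear(embed_dim, mem_dim), nn.SiLU(), nn.Linear(mem_dim, mem_dim)
        )
        self.norm = nn.LayerNorm(mem_dim)

    def forward(self, token_ids, global_embeds=None):
        expert = self.static_table(token_ids)              # (batch, seq, mem_dim) lookup
        if global_embeds is not None:                      # training-time refinement only
            expert = expert + self.dynamic_proj(global_embeds)
        return self.norm(expert)                           # normalize and scale
```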
Step 2: Add a context-aware gate in a low-rank space
- What happens: A slim linear layer reads the current token’s hidden state and passes it through a sigmoid to make a per-channel gate. We add this gate to the expert vector (additive-sigmoid) to personalize the hint.
- Why this step exists: Without the gate, the same hint is applied equally, even when context changes meaning.
- Example: For “bass,” the gate lifts music-related channels in a concert sentence and fish-related channels in a lake sentence.
🍞 Hook: Think of a volume knob that’s easy to turn and doesn’t use much power.
🥬 The Concept: Low-rank gating mechanism. What it is: A lightweight controller that decides how loudly the expert should speak. How it works: 1) Project hidden state down to the memory dimension, 2) Apply a sigmoid to get values between 0 and 1, 3) Add to the expert vector, 4) Keep everything compact for speed. Why it matters: This keeps compute tiny while making the memory context-sensitive.
🍞 Anchor: Like whispering “use the science hint now” when the sentence is about photosynthesis.
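A minimal sketch of the additive-sigmoid gate described in this step; the class and attribute names are assumptions for illustration rather than the paper's code.

```python
import torch
import torch.nn as nn

class AdditiveSigmoidGate(nn.Module):
    """Sketch of the low-rank, context-aware gate (illustrative names and shapes)."""

    def __init__(self, hidden_dim: int, mem_dim: int):
        super().__init__()
        # Slim down-projection: model dimension -> memory dimension.
        self.gate_down = nn.Linear(hidden_dim, mem_dim, bias=False)

    def forward(self, hidden, expert):
        gate = torch.sigmoid(self.gate_down(hidden))   # per-channel values in (0, 1)
        return expert + gate                           # additive-sigmoid fusion, not multiplicative
```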
Step 3: Project back up and add to the residual stream in parallel with FFN
- What happens: A small projection maps the combined vector back to the model dimension; then we add it to the layer’s residual output alongside the normal FFN result.
- Why this step exists: The model works in its full hidden size; we need to return the memory help to that size so it blends smoothly with everything else.
- Example: After the projection, the model’s main stream feels like a slightly wider layer without paying the cost of a truly wider FFN.
🍞 Hook: Pre-cooking saves time on school nights.
🥬 The Concept: Re-parameterization for fast inference. What it is: Turning the training-time dynamic projection into a precomputed table so inference is just a lookup. How it works: 1) During training, use a strong nonlinear projector to learn expressive features from global embeddings, 2) Before deployment, compute those features for every token once, 3) Fold them into the memory table, 4) At inference, skip the projector entirely. Why it matters: You keep expressive learning but ship with near-zero overhead.
🍞 Anchor: It’s like freezing homemade soup in single portions so dinner is microwave-quick.
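The folding step might look like the sketch below, assuming the LayerMemoryBank sketched earlier; the function name and the exact merge procedure are illustrative, not the authors' release code.

```python
import torch

@torch.no_grad()
def fold_dynamic_into_table(bank, global_embedding_weight):
    """Sketch of re-parameterization: precompute the dynamic refinement for every
    token ID once and merge it into the static table (assumes the LayerMemoryBank above)."""
    # global_embedding_weight: (vocab_size, embed_dim), one row per token ID.
    refinement = bank.dynamic_proj(global_embedding_weight)   # (vocab_size, mem_dim)
    bank.static_table.weight.add_(refinement)                 # fold into the lookup table
    bank.dynamic_proj = None                                  # inference path is lookup-only
```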
Concrete walkthrough with simple numbers
- Inputs: Suppose the model dimension is 2048 and the MeKi memory dimension is 256. The token is “gravity,” ID=1234.
- Lookup: At layer 7, we read row 1234 from the table: a 256-length expert vector.
- Gate: From the hidden state of “gravity,” a slim layer makes a 256-length gate via sigmoid; we add it to the expert vector.
- Projection: A small 2048Ă—256 matrix maps this back up. We add it to the residual alongside the FFN output.
- Output: The next layer now carries a hint-weighted understanding of “gravity,” tuned to the sentence.
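Putting the steps together with these numbers, here is a hedged sketch of the deployment-time path for a single token at one layer; the vocabulary size, tensor shapes, and the zeroed stand-in for the FFN output are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

hidden_dim, mem_dim, vocab_size = 2048, 256, 32000    # vocab size is an assumed placeholder

# Deployment-time pieces for one layer: a folded memory table, a slim gate, an up-projection.
table = nn.Embedding(vocab_size, mem_dim)              # layer-7 table (static + folded dynamic)
gate_down = nn.Linear(hidden_dim, mem_dim, bias=False)
up_proj = nn.Linear(mem_dim, hidden_dim, bias=False)   # the 2048x256 projection back up

token_ids = torch.tensor([[1234]])                     # "gravity" from the walkthrough
hidden = torch.randn(1, 1, hidden_dim)                 # hidden state after attention + norm

expert = table(token_ids)                              # cheap lookup: (1, 1, 256)
fused = expert + torch.sigmoid(gate_down(hidden))      # additive-sigmoid gate
meki_out = up_proj(fused)                              # back to model size: (1, 1, 2048)

ffn_out = torch.zeros_like(hidden)                     # stand-in for the layer's real FFN output
layer_out = hidden + ffn_out + meki_out                # MeKi added in parallel with the FFN
```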
What breaks without each step
- Without lookup memory: The model must reconstruct common knowledge with heavier compute and may forget niche facts.
- Without the gate: The same hint is applied even when context flips meaning, causing confusion.
- Without the projection back up: The help stays in the wrong size and can’t join the main computation.
- Without re-parameterization: Inference would be slower due to extra math, defeating the edge-device goal.
The secret sauce
- Parallel to FFN: MeKi acts like a stealth width-expander, adding capacity without slowing the original path.
- Additive-sigmoid fusion: More stable and better-performing than multiplicative or SiLU variants in tests.
- Storage-first scaling: Shifts the bottleneck from math to memory bandwidth that phones can handle, unlocking on-device smarts.
04 Experiments & Results
🍞 Hook: If two runners tie at the finish line but one used half the energy, which one would you take on a long trip?
🥬 The Concept: The test setup. What it is: MeKi models trained from scratch on 50B tokens from a high-quality web dataset and then tested on 10 well-known benchmarks. How it works: 1) Compare MeKi to same-sized dense baselines, 2) Also compare to storage-boosted methods like PLE and Engram, 3) Measure both accuracy and tokens-per-second on a real smartphone NPU. Why it matters: This shows whether MeKi really brings big-model brains to small devices without slowing them down.
🍞 Anchor: MeKi-1.7B scores like a 4B dense model on average while staying fast on a Snapdragon phone.
The competition
- Dense baselines: Standard Qwen3 models at 0.6B, 1.7B, and 4B parameters.
- Storage-based peers: PLE and Engram with similar ROM budgets.
- Hardware reality check: Measured decoding speeds on Qualcomm Snapdragon 8 Elite, KV cache length 10K.
The scoreboard with context
- Average zero-shot across 10 tasks: MeKi-1.7B scores about 59.7 vs the 4B dense model’s 60.5, like earning an A- next to an A, but with a smaller, zippier engine.
- Speed: While matching the 4B dense model’s average accuracy, MeKi-1.7B decodes about 2.26× faster on the phone NPU.
- Knowledge-heavy tasks: On ARC-Challenge and SciQ, MeKi-1.7B beats the 1.7B baseline by solid margins (e.g., +3–5 points), suggesting the ROM acts like an efficient fact store.
- Reasoning and language modeling: On BoolQ, HellaSwag, and LAMBADA, MeKi-1.7B also exceeds the 1.7B dense baseline and reaches 4B-level on some tasks, showing the experts serve as semantic anchors that help long-range predictions.
- Against PLE and Engram: MeKi leads the average by roughly 1–3 points at 0.6B and 1.7B scales, indicating the gate-and-injection design integrates stored knowledge more effectively and with lower latency.
Surprising or notable findings
- Placement matters: Putting MeKi in parallel with the FFN gives the best results; other placements underperform or create bottlenecks.
- Fusion style matters: Additive-sigmoid gating beats multiplicative or SiLU variants in training loss and downstream scores.
- Static vs dynamic memory: Each alone helps, but together they work best, suggesting they capture complementary knowledge (memorized priors + learned refinements).
- Validation loss and convergence: MeKi shows lower validation loss and lower layer-wise KL divergence (via LogitLens), meaning predictions “lock on” sooner within the network.
- Memory size scaling: Accuracy improves log-linearly as memory dimension grows, letting designers tune ROM budget vs. performance.
Why the numbers matter
- Same speed, more smarts: On-device NPU time often goes to moving and multiplying big FFN weights; MeKi shifts improvement into fast lookups. You keep responsiveness while boosting accuracy.
- Practical win: A 1.7B model that behaves like a 4B dense one opens the door to private, offline assistants that feel premium without cloud help.
05 Discussion & Limitations
Limitations
- ROM budget required: You need enough storage to hold the per-layer, per-token tables. Very tight-storage devices may need smaller memory dimensions and accept a smaller gain.
- Training complexity: The dynamic projector (e.g., SwiGLU) adds FLOPs during training. While this doesn’t affect inference, it raises training cost and engineering complexity.
- Task boundaries: The approach excels for token-level, language-style knowledge. Extremely novel or multimodal tasks may still require larger active compute or other techniques (e.g., retrieval from external documents).
- Vocabulary lock-in: Memory is indexed by token IDs. Large tokenizer changes or very different languages may require retraining or new tables.
Required resources
- Training: GPU/TPU resources for standard pretraining plus the added projector cost. Usual data pipelines for web-scale text.
- Deployment: A mobile SoC with decent ROM bandwidth (e.g., UFS 4.0) and enough RAM for the base model; ability to prefetch tables efficiently.
When not to use
- Ultra-tiny storage devices or strict install-size limits where extra ROM is unacceptable.
- Scenarios demanding heavy on-the-fly reasoning expansions (e.g., long chain-of-thought search) where compute scaling is the main driver.
Open questions
- Granularity beyond tokens: Could subword, morpheme, or phrase-level experts provide even more adaptable hints without lookup collisions?
- Dynamic memory growth: Can we incrementally augment tables post-deployment for continual learning without retraining the whole model?
- Cross-lingual sharing: How best to share or factorize memory across languages to save ROM while keeping quality?
- Safety and calibration: How to ensure stored priors remain accurate over time and avoid amplifying stale or biased facts?
- Compression: Which quantization or compression strategies preserve the benefits while shrinking ROM further?
06 Conclusion & Future Work
Three-sentence summary
- MeKi scales language model capacity by moving knowledge into ROM and injecting token-level expert vectors at every layer with a tiny, gated fusion—so inference stays fast. It learns rich features during training and then re-parameterizes them into static tables, turning heavy computations into simple lookups. On real smartphones, a 1.7B MeKi model rivals a 4B dense model’s accuracy while decoding much faster.
Main achievement
- Decoupling capacity from compute: MeKi proves that storage-first scaling can deliver big-model performance on edge hardware with essentially no extra latency.
Future directions
- Smarter memory organization (token- to phrase-level), adaptive or updatable tables after deployment, and better compression/quantization to shrink ROM without losing accuracy. Exploring multilingual sharing and safety calibration will make on-device assistants more broadly useful and trustworthy.
Why remember this
- MeKi flips the script: instead of buying accuracy with more math, it buys it with smart memory. That’s a durable idea for bringing private, fast, and capable AI to billions of devices that can’t run giant models but do have storage to spare.
Practical Applications
- Offline phone assistants that answer quickly without sending data to the cloud.
- On-device translation and transcription apps with faster, more accurate results.
- Wearables (watches, earbuds) that give smart suggestions while preserving battery.
- In-car infotainment that understands voice commands reliably with low latency.
- Smart home devices that process language locally for privacy and responsiveness.
- Educational apps that run rich language tutoring on low-cost tablets.
- Healthcare note-taking or dictation tools that work privately on-device.
- Customer service kiosks that stay responsive even with weak internet.
- Field operations (e.g., disaster zones) where models must run fully offline.
- AR/VR headsets that need instant language understanding with tight power budgets.