
ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression

Intermediate
Ammar Ali, Baher Mohammad, Denis Makhov et al. Ā· 2/11/2026
arXiv

Key Summary

  • ROCKET is a fast, training-free way to shrink big AI models while keeping most of their smarts.
  • It picks how much to compress each layer using a clever 'packing' plan called a knapsack solver, guided by a tiny calibration set.
  • Instead of many slow training loops, it does a single-step 'learn a dictionary + prune' move, then a quick least-squares fix.
  • It balances what matters in two spaces at once (the whitened activation space and the original weight space) to prune safely.
  • Across many models and tasks, ROCKET beats popular low-rank and sparse-dictionary baselines at 20–50% compression.
  • At about 30% compression, it keeps over 90% of the original accuracy without any fine-tuning.
  • A short healing step (about 30M tokens) recovers even more accuracy, nearly matching a native model of similar size.
  • It generalizes beyond text to vision–language and speech models, with minimal loss.
  • The dynamic programming allocator adds per-layer safety caps so no single layer is compressed past the breaking point.
  • It is simple, fast, greener, and compatible with standard dense math at inference (plus faster sparse kernels where helpful).

Why This Research Matters

Smaller models mean more people can use advanced AI on everyday devices without expensive hardware. Faster inference translates into snappier chatbots, better voice assistants, and smoother on-device apps that respect privacy by keeping data local. Lower compute needs reduce energy consumption and costs, which is good for both budgets and the planet. Developers can ship multiple model sizes from a single big model, speeding up experimentation and personalization. In fields like education and healthcare, deploying strong models on modest machines enables wider access and lower latency where it counts. ROCKET’s training-free nature also lowers the barrier to entry: you can compress today and deploy today. Finally, the approach generalizes to vision–language and speech, so one tool can help across many AI applications.

Detailed Explanation


01 Background & Problem Definition

šŸž You know how packing for a trip can be tricky when your suitcase is small? If you squeeze everything evenly, you might crush fragile items and still waste space on sweaters you won’t wear.

🄬 The Concept: Model compression is like smart packing for AI models. We want to make them smaller so they run on cheaper, smaller, and faster devices—without breaking what’s important.

  • How it works (before this paper): People used three main tricks—quantization (use fewer bits), distillation (teach a small student from a big teacher), and factorization/pruning (reshape or remove weights).
  • Why it matters: Modern transformers have billions of parameters. Without compression, many apps can’t run on phones, in clinics, or on low-latency servers.

šŸž Anchor: Imagine trying to run a giant language model on a laptop. Without compression, it’s like trying to drive a bus into a tiny garage.

šŸž You know how folding clothes into a single neat pile doesn’t always work for oddly shaped items?

🄬 The Concept: Low-rank factorization (like SVD) is a popular way to shrink weight matrices by forcing them to live in one shared, simple subspace.

  • How it works: Split a big matrix into two thinner ones with a small inner rank; multiply them to approximate the original.
  • Why it matters: It’s fast and simple, but when you compress a lot, forcing every column to share the same small subspace can remove important details.

šŸž Anchor: It’s like making all outfits from the same few clothing pieces—fine for casual days but bad for a fancy event.

šŸž Imagine a tool chest where each tool is different, and for each job you pick only the tools you need.

🄬 The Concept: Dictionary learning lets each column of a weight matrix pick its own few basis vectors (atoms), creating a union-of-subspaces rather than one shared subspace.

  • How it works: Iteratively pick atoms and adjust them (e.g., K-SVD, OMP) until reconstruction fits well.
  • Why it matters: It’s flexible and accurate—but very slow at LLM scale because it needs many alternating optimization steps.

šŸž Anchor: Great results, but like rearranging a warehouse by hand—it’s slow for really big spaces.

šŸž You know how you shouldn’t cut the same amount from every subject when trimming a study plan? Some classes matter more before a test.

🄬 The Concept: Uniform compression across layers assumes all layers are equally important, which isn’t true.

  • How it works: Compress each layer by the same amount.
  • Why it matters: It can harm critical layers and waste budget on robust ones, causing bigger accuracy drops.

šŸž Anchor: If you trim math study time as much as recess, the test might go poorly.

šŸž Imagine deciding which items go in a backpack with a strict weight limit—you pick the best combo of snacks, water, and jacket.

🄬 The Concept: What was missing was a training-free, globally optimal way to choose per-layer compression that is flexible like dictionaries but fast like SVD.

  • How it works: Use a tiny calibration set to measure sensitivity, propose options for each layer, and pick the best mix under a total budget.
  • Why it matters: This keeps accuracy high while meeting memory and speed goals, without retraining.

šŸž Anchor: It’s like packing exactly what you need for a hike after quickly checking the weather.

Real stakes in daily life:

  • Phones and laptops can run smarter assistants offline.
  • Hospitals and schools with modest hardware can deploy strong models safely and cheaply.
  • Lower energy cost and carbon footprint during deployment.
  • Faster responses for chat, search, and voice—less waiting, more doing.
  • Easier A/B testing of multiple model sizes from one big base model.

02 Core Idea

šŸž Imagine building a LEGO model: instead of forcing every part to be made from the same few bricks, pick different mini-kits per section, and spend more bricks where the model is fragile.

🄬 The Aha! Moment: ROCKET compresses models by (1) doing a single, calibration-guided sparse factorization that behaves like dictionary learning without slow loops, and (2) allocating compression budgets across layers with a knapsack optimizer to minimize total reconstruction error under a global size limit.

Multiple analogies:

  1. Suitcase packing: Don’t shrink every clothing item the same. Use a scale to weigh importance (calibration) and a packing planner (knapsack) to choose what gets extra room.
  2. Orchestra mix: Give more microphone volume to the soloist (important layers) and less to background instruments (robust layers), all decided from a short sound check (calibration).
  3. School study plan: Use a quick quiz (calibration) to see weak topics, then spend more time there. A planner (knapsack) spreads study time wisely to keep the overall grade high.

Before vs After:

  • Before: SVD forced all columns into one small subspace; dictionary learning was accurate but slow; uniform budgets often harmed key layers.
  • After: ROCKET gets dictionary-like flexibility in one step, then uses a global optimizer to place parameters where they help most—no training loops required.

Why it works (intuition):

  • Calibration makes the math look at directions that actually show up in real activations, not just raw weights.
  • Whitened space keeps directions fair (like leveling a playing field), but we still peek back at the original space to avoid surprises after unwhitening.
  • Sparsifying coefficients, not the basis, lets each output choose its own small set of useful directions (union-of-subspaces power) without heavy iteration.
  • A global knapsack allocator avoids the classic trap of over-compressing a few sensitive layers while under-compressing robust ones.

Building blocks (each with a sandwich):

  • šŸž You know how a quick practice test shows what you actually need to study? 🄬 Calibration set: A small batch of real data used to measure which directions matter for the model’s activations.

    • How: Run a few forward passes, compute a whitening transform to decorrelate inputs.
    • Why: Without calibration, we might prune the wrong things. šŸž Example: 256 text snippets reveal which word patterns the model uses most.
  • šŸž Imagine leveling a tilted table so marbles roll fairly. 🄬 Whitening: A math transform that evens out input directions so no direction is unfairly loud.

    • How: Compute a Gram matrix from activations and take its Cholesky factor to decorrelate.
    • Why: Without it, importance scores can be biased. šŸž Example: After whitening, a small but crucial direction won’t be ignored.
  • šŸž Think of spotlighting the main directions of variation in data. 🄬 Eigenvalue decomposition (EVD): Breaks a matrix into principal directions and strengths.

    • How: Find top eigenvectors of the whitened weight covariance; project weights onto them.
    • Why: Without a good basis, sparsifying is clumsy. šŸž Example: In a photo, EVD finds the few directions that capture most of the image structure.
  • šŸž Like picking only a few LEGO bricks per section instead of using every kind. 🄬 Structured sparsification: Keep only the most important coefficients per output column.

    • How: Score coefficients by combining importance in whitened and original spaces; over-prune, then reactivate the best globally to hit the exact budget.
    • Why: Without structure and dual-space scoring, you either miss key details or keep waste. šŸž Example: Each output neuron chooses its top few directions.
  • šŸž Fixing a nearly-finished drawing with one smooth stroke, not many sketches. 🄬 Closed-form dictionary update: After pruning, refit the left factor once by least squares.

    • How: Solve a small ridge-regularized system; no backprop.
    • Why: Tightens the fit cheaply after changing coefficients. šŸž Example: A quick re-fit sharpens the image.
  • šŸž You know how you budget pocket money across snacks, games, and savings? 🄬 Multi-choice knapsack + dynamic programming: For each layer, consider several (rank, sparsity) options, then pick the best combo under a global parameter limit with per-layer safety caps.

    • How: Precompute cost/error per option, then a DP finds the minimal total error within budget.
    • Why: Without a global planner, you over/under-spend in the wrong places. šŸž Example: Spend more on layers that buy you big accuracy gains per parameter.
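The dual-space scoring idea above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `dual_space_scores` and the per-direction `sensitivity` vector are hypothetical names, and the exact sensitivity definition used by ROCKET may differ.

```python
import numpy as np

def dual_space_scores(V, sensitivity):
    """Blend importance from two spaces with a geometric mean (illustrative sketch).

    V: (rank, d_out) coefficient matrix in the whitened space.
    sensitivity: (rank,) per-direction gain under the unwhitening transform,
    approximating how much each coefficient grows back in the original space.
    """
    whitened = np.abs(V)                         # importance as seen in whitened space
    original = whitened * sensitivity[:, None]   # rough size after unwhitening
    return np.sqrt(whitened * original)          # geometric-mean blend of both views
```

A coefficient that is small in the whitened space but sits on a high-sensitivity direction gets a boosted score, which is exactly the "no surprises after unwhitening" behavior the bullet describes.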

03 Methodology

At a high level: Input (pretrained model + small calibration set + target size) → [Calibrate & Whiten] → [Find Basis via EVD] → [Project to Coefficients] → [Dual-space Importance Scores] → [Two-stage Structured Sparsification] → [Closed-form Left-Factor Update] → [Layer Profiling (many candidate options)] → [Global Budget Allocation via Knapsack DP] → Output (compressed factors per layer).
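The whiten → basis → project → prune → refit core of this pipeline can be sketched with NumPy. This is a simplified sketch, not the paper's implementation: `rocket_factorize` and its arguments are illustrative names, plain per-column magnitude pruning stands in for the dual-space scoring and global reactivation, and the knapsack allocation happens outside this function.

```python
import numpy as np

def rocket_factorize(W, X, rank, keep_per_col, lam=1e-4):
    """One-step sparse factorization sketch: W (d_in, d_out) weights,
    X (n, d_in) calibration activations. Returns U (d_in, rank) and a
    column-sparse V (rank, d_out) with y ā‰ˆ x @ U @ V."""
    # 1. Whiten: Cholesky factor of the activation Gram matrix decorrelates inputs.
    G = X.T @ X / len(X) + lam * np.eye(W.shape[0])
    L = np.linalg.cholesky(G)                  # G = L @ L.T
    W_white = L.T @ W                          # weights viewed in whitened space

    # 2. Data-aware basis: top eigenvectors of the whitened weight covariance.
    eigvals, eigvecs = np.linalg.eigh(W_white @ W_white.T)
    B = eigvecs[:, -rank:]                     # (d_in, rank), strongest directions

    # 3. Project to coefficients: each output column gets its own coefficient vector.
    V = B.T @ W_white                          # (rank, d_out)

    # 4. Structured sparsification: keep top-s entries per column.
    #    (ROCKET scores in two spaces and reactivates globally; magnitude-only here.)
    mask = np.zeros_like(V, dtype=bool)
    top = np.argsort(-np.abs(V), axis=0)[:keep_per_col]
    np.put_along_axis(mask, top, True, axis=0)
    V = np.where(mask, V, 0.0)

    # 5. Closed-form left-factor refit: ridge least squares, no backprop.
    B = W_white @ V.T @ np.linalg.inv(V @ V.T + lam * np.eye(rank))

    # 6. Unwhiten: fold the inverse whitening transform into the left factor.
    U = np.linalg.solve(L.T, B)                # so that W ā‰ˆ U @ V in original space
    return U, V
```

Each output column of `V` keeps its own few directions, which is the union-of-subspaces flexibility the overview describes, while the only "training" is one eigendecomposition and one least-squares solve.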

Step-by-step with what/why/examples:

  1. Calibrate & Whiten
  • What: Run a tiny calibration set through the model to collect activations; compute a whitening transform that decorrelates inputs.
  • Why: If inputs are uneven, pruning decisions get biased. Whitening levels the field so importance reflects true use.
  • Example: Use 256 text sequences; build a Cholesky-based transform to make activation directions orthonormal.
  2. Build a Data-Aware Basis (EVD)
  • What: In the whitened space, compute the top eigenvectors of the weight covariance; these are the key directions.
  • Why: They summarize where the weight actually ā€œlivesā€ given typical activations, giving a strong starting basis.
  • Example: For a dƗd layer, keep r directions that capture most energy; r is tried over a small grid (e.g., 16, 32, 64...).
  3. Project to Coefficients
  • What: Project the whitened weights onto the basis to get a coefficient matrix (directions Ɨ outputs).
  • Why: Moving pruning to coefficients lets each output select its own small subset (union-of-subspaces flexibility).
  • Example: Column j keeps the few rows (directions) it really needs.
  4. Score Importance in Two Spaces
  • What: Compute importance per coefficient using both whitened-space magnitude and original-space sensitivity.
  • Why: A coefficient that looks small in whitened space might explode after unwhitening; dual scoring avoids nasty surprises.
  • Example: Multiply |coefficient| by a per-direction sensitivity from the unwhitening transform; blend with geometric mean.
  5. Two-Stage Structured Sparsification
  • What: For each column, hard-threshold to keep its top-s entries by importance. Then slightly over-prune (by a small beta), and globally reactivate the best masked entries across the whole matrix until the exact target sparsity is reached.
  • Why: Column-wise keeps structure; global reactivation fine-tunes the final budget with maximum benefit.
  • Example: Keep the top 8 entries per column, over-prune by about 7.5% beyond the target, then restore the most valuable masked entries anywhere in the matrix, one at a time, until the budget matches.
  6. Closed-form Left-Factor Update (Least Squares)
  • What: After pruning coefficients, refit the left factor once with ridge-regularized least squares in the whitened space.
  • Why: Pruning changes the target; a quick closed-form update tightens the approximation without slow training.
  • Example: Solve a small system per layer; store U = (unwhitening Ɨ new left factor), V = (sparse coefficients).
  7. Reconstruct to Original Space
  • What: Map the whitened solution back using the inverse whitening transform; store two factors per layer.
  • Why: This yields a compact representation compatible with standard matrix multiplies (and sparse kernels where helpful).
  • Example: The model now computes y ā‰ˆ xĀ·UĀ·V with standard matrix multiplies.
  8. Layer Profiling (Option Generation)
  • What: For each layer, precompute several candidate (rank, sparsity) pairs; for each, run steps 2–7 and record kept parameters (cost) and relative reconstruction error.
  • Why: We need a menu of realistic options per layer so the global allocator can pick the best combo.
  • Example: For a layer, try rank {16, 32, 64} Ɨ sparsity {4, 8, 16 kept per column}; store (cost, error) for each.
  9. Constrained Multi-Choice Knapsack via Dynamic Programming
  • What: Given per-layer option sets, pick exactly one option per layer to minimize total error under a global parameter budget, with per-layer error caps to avoid wrecking any single layer.
  • Why: Prevents lopsided solutions that over-compress a few sensitive layers to save budget.
  • Example: State DP[layers_seen][kept_params] = min_error; transition by trying each option of the next layer.
  10. Assemble the Compressed Model
  • What: Use the winning option per layer to set U, V factors; keep masks and dictionaries for inference.
  • Why: This creates a single, size-limited model ready to run.
  • Example: Attention layers may keep higher rank and lower sparsity; MLP layers often get more pruning to save many parameters.
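The allocator in step 9 can be sketched as a small dynamic program. This is an illustrative sketch: `allocate_budget` is a hypothetical name, parameter budgets are assumed pre-discretized, and a real implementation would index a DP table rather than dictionaries.

```python
def allocate_budget(layer_options, budget, err_cap=float("inf")):
    """Multi-choice knapsack: pick exactly one (cost, error) option per layer
    to minimize total error under a global parameter budget.

    layer_options: list over layers of [(cost_in_params, recon_error), ...].
    err_cap mimics the per-layer safety caps that protect sensitive layers.
    Returns (min_total_error, chosen_option_index_per_layer), or (inf, None)
    if no feasible assignment exists."""
    INF = float("inf")
    # dp maps params_used -> (best total error so far, option picks per layer)
    dp = {0: (0.0, [])}
    for options in layer_options:
        nxt = {}
        for used, (err, picks) in dp.items():
            for i, (cost, layer_err) in enumerate(options):
                if layer_err > err_cap:        # per-layer safety cap
                    continue
                c = used + cost
                if c > budget:                 # global size limit
                    continue
                cand = (err + layer_err, picks + [i])
                if c not in nxt or cand[0] < nxt[c][0]:
                    nxt[c] = cand
        dp = nxt
        if not dp:                             # no option fits for this layer
            return INF, None
    return min(dp.values(), key=lambda t: t[0])
```

For example, with two layers offering `[(10, 0.5), (6, 1.2)]` and `[(10, 0.4), (5, 2.0)]` under a budget of 16 parameters, the planner spends the cheap option on layer 1 and the accurate one on layer 2, exactly the "spend where it buys the most" behavior described above.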

The secret sauce:

  • Dual-space importance: balances activation fidelity (whitened) with true weight impact (original) so pruning choices hold up after unwhitening.
  • Single-step dictionary-like factorization: avoids slow alternating K-SVD/OMP but keeps their flexibility by pruning coefficients per column.
  • Global budget allocation with safety caps: a DP knapsack optimizer that spreads parameters where they give the biggest error reduction, while preventing any one layer from being over-harmed.

04 Experiments & Results

The test: The authors compressed popular LLMs (e.g., Qwen3-8B/14B, Llama3-8B, Llama3.2-1B) and evaluated zero-shot on common benchmarks like PIQA, HellaSwag, LAMBADA, ARC-E/C, SciQ, RACE, MMLU, plus perplexity on WikiText and LAMBADA. They also tried tougher benchmarks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro) and other modalities (Qwen3-4B-VL for vision–language, VibeVoice for speech).

The competition: ROCKET was compared to strong baselines: SVD-LLM (low-rank), CoSpaDi (sparse dictionary learning with K-SVD/OMP), ARS/Dobi-SVD/ARA (budget allocation in low-rank regimes), and structured sparsification/width/depth pruning methods (Wanda, Bonsai, LLM-Pruner, SliceGPT). Quantization add-ons were also explored.

The scoreboard with context:

  • Across 20–50% compression, ROCKET consistently outperformed SVD-LLM and CoSpaDi in accuracy and perplexity. For example, at 50% compression on Qwen3-8B, ROCKET achieved about 51.3 average accuracy versus 38.1 for SVD-LLM and 42.0 for CoSpaDi—like scoring a solid B when others slipped to Ds under the same study time.
  • At about 30% compression, ROCKET typically kept over 90% of original performance without any fine-tuning—like shrinking your backpack by a third but still bringing almost everything you need.
  • Against budget allocation baselines (Uniform, ARS, Dobi-SVD, ARA), ROCKET’s constrained knapsack selection preserved more capability under the same parameter budget, especially at aggressive compression levels—its planner simply spends parameters more wisely.
  • With quantization added after ROCKET, it matched or surpassed Dobi-SVD at 40–60% compression on Llama3.1-8B, showing the methods can stack.
  • Healing (brief fine-tuning): Compressing Qwen3-14B down to 8B and fine-tuning on only ~30M tokens boosted performance from ~63.6 to ~68.0 average accuracy—nearing the native Qwen3-8B (~70.5). That’s like fixing a few dents after moving furniture, not rebuilding the house.
  • Other modalities: At 20% compression, Qwen3-4B-VL kept over 90% of its average accuracy; VibeVoice kept almost identical WER and very close speech quality (UTMOS), signaling strong generalization.

Surprising findings:

  • Bigger models retain a higher fraction of their original performance after compression—suggesting headroom and robustness scale with size.
  • The DP allocator naturally prunes MLP layers more (they’re larger and more robust) while protecting attention layers—emerging behavior aligned with intuition.
  • Energy/runtime: The single-step pipeline is dramatically greener and faster than iterative dictionary learning—orders of magnitude less energy and time in reported cases.
  • Throughput: With suitable sparse kernels (like MACKO) where beneficial, ROCKET maintains or improves tokens/sec over competing sparse-factorization methods at similar budgets.

05 Discussion & Limitations

Limitations:

  • Scaling DP to many components: The dynamic programming allocator works great for standard dense models, but Mixture-of-Experts with many experts per block could explode the option space. Smarter pruning of options or approximate solvers may be needed there.
  • Fixed sparsity during healing: Keeping the sparsity pattern fixed simplifies training but can be suboptimal. Letting masks change during fine-tuning might recover more accuracy.
  • Calibration sensitivity: Although robust across several small datasets, extremely mismatched calibration data could mislead importance scores, especially for niche domains.
  • Discretization and caps: The DP relies on discretized budgets and per-layer error caps; poor granularity or cap choices can leave performance on the table.
  • Kernel dependence: Speedups from structured sparsity depend on good sparse kernels and layout choices; some hardware stacks favor dense ops unless sparsity is sufficiently high and well-structured.

Required resources:

  • A small calibration set (e.g., 256 sequences) and modest compute for per-layer EVD, projections, and one least-squares solve per candidate.
  • Memory to hold temporary factors during layer profiling; runtime grows with the number of (rank, sparsity) candidates considered.

When NOT to use:

  • If you need extreme compression beyond what structured sparse factorization can support without heavy fine-tuning.
  • If you cannot collect even a tiny calibration set representative of deployment data.
  • If your hardware has poor support for sparse or factorized inference and you cannot benefit from any structural speedups.
  • Ultra-dynamic settings where on-the-fly compression must happen without any calibration passes.

Open questions:

  • Can we jointly learn masks during healing to approach dense-model parity at higher compression?
  • How to scale allocation to MoE with many experts per block—learned option pruning, Lagrangian relaxations, or bandit-style approximations?
  • Can we automate calibration selection or augment it to be more robust across domains (e.g., active sampling)?
  • Are there even better dual-space or multi-space importance metrics that account for downstream loss more directly without backprop?

06 Conclusion & Future Work

Three-sentence summary:

  • ROCKET is a training-free compression method that combines a single-step, calibration-guided sparse factorization with a knapsack-based global budget allocator.
  • It preserves much more accuracy than classic low-rank or iterative dictionary methods at the same sizes, keeping over 90% performance at ~30% compression without fine-tuning.
  • A short healing step can recover even more, and the approach generalizes across text, vision–language, and speech models.

Main achievement:

  • Marrying dictionary-like flexibility (via coefficient sparsification and closed-form refit) with an optimal, global allocation strategy that runs fast and requires no training.

Future directions:

  • Adaptive sparsity patterns during healing; scalable allocation for MoE; smarter calibration selection; richer importance metrics that more directly track downstream loss.

Why remember this:

  • ROCKET shows you don’t need heavy training loops to get the flexibility of union-of-subspaces models: a smart one-step factorization plus a global budget planner can deliver state-of-the-art training-free compression that’s fast, green, and widely useful.

Practical Applications

  • Deploy responsive on-device assistants (text and voice) on laptops and phones with strong accuracy.
  • Serve more users per GPU in data centers by compressing models to fit memory and speed budgets.
  • Shrink multimodal models (vision–language) for edge cameras or AR devices with minimal quality loss.
  • Speed up A/B testing by compressing a single large model to several target sizes instead of training many models.
  • Reduce latency for chat, translation, and summarization services in low-bandwidth environments.
  • Lower energy costs and carbon footprint for large-scale AI deployments without heavy retraining.
  • Enable privacy-preserving apps by keeping inference local on user devices through smaller models.
  • Accelerate speech generation and transcription pipelines while keeping word error rates stable.
  • Bundle quantization after ROCKET to hit tighter memory targets with competitive accuracy.
  • Use short healing runs to quickly tailor a compressed model to a domain (e.g., legal or medical text).
Tags: model compression, training-free compression, sparse factorization, dictionary learning, low-rank approximation, eigenvalue decomposition, whitening, knapsack optimization, dynamic programming, layer-wise budget allocation, calibration-guided compression, structured sparsification, reconstruction error minimization, large language models, post-training methods