
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Intermediate
Jingxuan Zhang, Yunta Hsieh, Zhongwei Wang et al. · 2/23/2026
arXiv

Key Summary

  • Vision-Language-Action (VLA) robots are powerful but too big and slow for many real-world devices.
  • QuantVLA is a training-free way to shrink these models (using post-training quantization) while keeping them accurate.
  • It is the first PTQ framework to successfully handle a Diffusion Transformer (DiT) action head in VLA systems.
  • QuantVLA uses a selective quantization layout: it makes most linear layers low-bit but keeps the most fragile attention projections in floating point.
  • Two tiny calibrations steady the model after shrinking: Attention Temperature Matching (ATM) fixes attention sharpness, and Output Head Balancing (OHB) restores residual energy.
  • It needs only a small unlabeled calibration buffer and does not change the model’s architecture or execution order.
  • On LIBERO with π0.5 and GR00T N1.5, QuantVLA matches or beats full precision while saving up to about 70% memory on quantized parts.
  • It remains robust even at aggressive settings like W4A8 and shows strong results at W4A4.
  • Compared to prior methods built for language-only or vision-only models, QuantVLA is tailored to the tight coupling between language and diffusion-based action.
  • This makes low-power, long-horizon robot control much more practical.

Why This Research Matters

Robots that understand what they see and what we say can help at home, in hospitals, and in factories—but only if their brains fit on small, power-limited devices. QuantVLA cuts memory and compute without retraining, so capable VLA models can run on-board in real time. This enables longer planning horizons, multiple policies per device, and faster reactions in dynamic scenes. It reduces dependence on cloud connections, improving privacy and reliability. By stabilizing diffusion-based control at low precision, QuantVLA removes a key barrier to practical embodied AI. In short, it brings high-quality robot intelligence closer to affordable, safe, and scalable deployment.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine packing for a long trip with one backpack. You need clothes (vision), a guidebook (language), and tools (actions). If your backpack is too heavy, you can’t travel far or fast.

🥬 Filling (The Actual Concept):

  • What it is: Vision-Language-Action (VLA) models are robot brains that see pictures, read instructions, and output precise motions.
  • How it works:
    1. A camera turns images into tokens (like puzzle pieces).
    2. A language model turns the instruction into tokens too.
    3. A transformer mixes vision and language so the robot understands the task.
    4. An action head turns that understanding into a sequence of robot moves.
  • Why it matters: Without a good VLA, robots can’t follow natural instructions or handle complex tasks in changing scenes.

🍞 Bottom Bread (Anchor): Tell a robot, “Place the red block on the blue square.” It must see colors, understand your words, and move its arm smoothly. A VLA model does all three.

🍞 Top Bread (Hook): You know how phone photos can be saved in smaller sizes to save space without looking much worse? That’s shrinking data while keeping it useful.

🥬 Filling (The Actual Concept):

  • What it is: Quantization means storing numbers with fewer bits to save memory and speed up computing.
  • How it works:
    1. Pick a small number of levels (like 16 or 256) to represent values.
    2. Scale and round the real numbers to those levels (integers).
    3. During use, scale them back (dequantize) to approximate the originals.
  • Why it matters: Without quantization, big models don’t fit on small computers, and robots can’t react fast enough.

🍞 Bottom Bread (Anchor): Saving a drawing as a smaller file that still looks the same to your eye is like quantization for model numbers.
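The scale-and-round steps above can be sketched in a few lines of Python. This is a toy symmetric quantizer, not the paper's implementation; the bit width and values are illustrative:

```python
def quantize(values, num_bits=8):
    """Symmetric quantization: map floats to integers via one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 levels each side for 8 bits
    scale = max(abs(v) for v in values) / qmax  # one scale for the whole tensor
    ints = [round(v / scale) for v in values]   # store these low-bit integers
    return ints, scale

def dequantize(ints, scale):
    """Scale the integers back to approximate the original floats."""
    return [q * scale for q in ints]

ints, scale = quantize([0.12, -0.5, 0.98, 0.03])
approx = dequantize(ints, scale)  # close to the originals, stored in far fewer bits
```

Each recovered value differs from its original by at most half a quantization step (scale / 2), which is exactly the rounding error the calibration tricks later in the paper try to keep harmless.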

🍞 Top Bread (Hook): Imagine you’ve already baked a cake, but now you want to carry it in a smaller box without squishing it.

🥬 Filling (The Actual Concept):

  • What it is: Post-Training Quantization (PTQ) shrinks a trained model without retraining it.
  • How it works:
    1. Take the finished model.
    2. Measure typical activation sizes on a small sample (a calibration buffer).
    3. Choose scales, round to low-bit integers for weights and/or activations.
    4. Use simple math to map between integers and floats at inference.
  • Why it matters: Without PTQ, you’d need expensive retraining to fit on-device limits.

🍞 Bottom Bread (Anchor): Like gently pressing the cake into a snug box using a ruler (calibration) so it still looks and tastes right later.
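The "ruler" in that anchor, the calibration buffer, can be sketched as follows. The percentile value and the tiny buffer are illustrative stand-ins, not the paper's exact recipe:

```python
def calibrate_scale(calibration_buffer, num_bits=8, percentile=0.8):
    """Pick an activation scale from observed magnitudes, clipping rare spikes."""
    magnitudes = sorted(abs(x) for batch in calibration_buffer for x in batch)
    idx = min(int(percentile * len(magnitudes)), len(magnitudes) - 1)
    clip_value = magnitudes[idx]                # ignore the largest outliers
    qmax = 2 ** (num_bits - 1) - 1
    return clip_value / qmax                    # dequantization scale

# A tiny buffer of "typical" activations; the one rare spike should be ignored.
buffer = [[0.1, -0.4, 0.9], [0.2, 0.8, -0.7], [0.3, 50.0, -0.5]]
scale = calibrate_scale(buffer)
```

Because the scale is set from a high percentile rather than the raw maximum, the lone 50.0 spike does not blow up the quantization grid for every other value.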

🍞 Top Bread (Hook): Think of drawing a neat picture by erasing a bit of noise over many passes until it looks crisp.

🥬 Filling (The Actual Concept):

  • What it is: A Diffusion Transformer (DiT) is an action head that refines noisy action guesses step by step into smooth robot motions.
  • How it works:
    1. Start with a noisy action proposal.
    2. Use a transformer to look at vision-language context.
    3. Make a small correction.
    4. Repeat for several steps to produce a final, smooth action.
  • Why it matters: Without DiT, robot motions can be jerky or miss long-horizon plans.

🍞 Bottom Bread (Anchor): Like tracing a sketch, then going over it several times to make a clean, confident line.
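The refine-in-passes idea can be sketched as a toy loop. This is plain iterative correction, not a trained diffusion model: `correction_fn` stands in for the transformer that, in a real DiT, looks at vision-language context, and `target` is a hypothetical "clean" action:

```python
import random

def refine_action(noisy_action, correction_fn, num_steps=8):
    """Iteratively apply small corrections, like a DiT denoising an action proposal."""
    action = list(noisy_action)
    for _ in range(num_steps):
        # Each pass nudges the action a fraction of the way toward clean.
        action = [a + 0.5 * c for a, c in zip(action, correction_fn(action))]
    return action

target = [0.3, -0.1, 0.7]                                  # toy "clean" action
correction = lambda a: [t - x for t, x in zip(target, a)]  # pull toward target
noisy = [t + random.gauss(0, 0.5) for t in target]
smooth = refine_action(noisy, correction)                  # converges toward target
```

Each pass halves the remaining error here, so eight steps shrink it by a factor of 256; that is the sense in which small repeated corrections end in a smooth, confident action.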

The world before: VLA models like OpenVLA, π0.5, and GR00T have grown strong at understanding scenes and following instructions, but their memory and compute bills ballooned. Profiling shows much of the cost isn’t just seeing images; it’s the reasoning and control stack—especially long sequences and big hidden states. On robots (embedded devices with tight power and memory), that’s a deal-breaker.

The problem: Can we make these models smaller and faster without breaking their careful balance between language reasoning and the diffusion-based action head? Earlier efficiency tricks often focused on the vision part, leaving the action head fully precise because it’s finicky—small numeric shifts can derail control.

Failed attempts: Off-the-shelf PTQ for LLMs or ViTs (like SmoothQuant or rotation-based methods) help in unimodal settings. But when applied to VLA stacks, they ran into trouble. Quantization-induced scale drift made attention either too sharp or too flat (wrong “temperature”), and changed how much “energy” the action block injected into the residual pathway. In short, the DiT action head got unstable.

The gap: We needed a PTQ method designed for VLA’s tight coupling—language features flow into diffusion control. That means recognizing which layers are fragile and adding tiny calibrations that realign scales after quantization.

Real stakes: If this works, you can run capable VLA policies on cheaper onboard computers, get longer horizons (more context), and run more robots or multiple policies at once—all while saving power. That’s better factory automation, home assistance, and safer, quicker responses in the real world.

02Core Idea

🍞 Top Bread (Hook): You know how a bike rides better if the tires have the right pressure and the seat is at the right height? Small adjustments make a big difference.

🥬 Filling (The Actual Concept):

  • What it is: QuantVLA is a training-free PTQ recipe for VLA models that shrinks both the language backbone and the diffusion action head, then gently re-tunes two scales so attention sharpness and output energy stay just right.
  • How it works:
    1. Quantize most linear layers, but keep the fragile attention projections (Q, K, V, O) in floating point.
    2. Calibrate a tiny per-head scalar (ATM) so attention isn’t too sharp or too flat.
    3. Calibrate a tiny per-layer scalar (OHB) so the action block injects the right amount of signal into residuals.
    4. Fold these scalars into dequantization scales—no new ops, no retraining.
  • Why it matters: Without these choices, quantization drifts stack up in the DiT head and control quality drops.

🍞 Bottom Bread (Anchor): Like shrinking a violin case to save space but then re-tuning the strings so the music still sounds beautiful.

Multiple analogies for the same idea:

  • Thermostat analogy: ATM is like setting the room temperature so people neither shiver nor sweat; OHB is like setting the speaker volume so the concert isn’t too quiet or too loud.
  • Cooking analogy: ATM is adjusting the oven temperature so cookies bake evenly; OHB is portioning the serving size so the meal is satisfying, not overpowering.
  • Sports analogy: ATM keeps your focus at the right intensity for the play; OHB makes sure your pass has the right force so teammates can catch it.

Before vs. After:

  • Before: Generic PTQ made attention distributions drift (too sharp/flat) and changed residual energy. Models lost accuracy, especially on long tasks.
  • After: QuantVLA’s layout avoids quantizing the most sensitive attention projections. ATM corrects attention sharpness per head. OHB restores per-layer energy. The result: stable control and big memory savings without training.

Why it works (intuition behind the math):

  • Attention sharpness is mainly set by the product of Q and K scales; if quantization changes those, the softmax “temperature” is wrong. ATM estimates a simple per-head multiplier from a short, unlabeled calibration run to align the logits’ spread back to the teacher’s.
  • The energy that flows from attention through the output projection into the residual stream can get too big or too small after quantization. OHB measures teacher vs. student RMS at each layer’s output, then scales the student to match—restoring the residual injection gain that layer norms expect.
  • Keeping Q, K, V, O in float preserves the most fragile joints. Quantizing the LLM and the DiT MLP gives the main memory wins without poking the most sensitive parts.
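The temperature intuition can be written out as a scaling argument (a sketch in generic notation, not the paper's exact formulas): if quantization effectively rescales queries and keys by factors \(s_q\) and \(s_k\), the pre-softmax logits pick up their product, which behaves exactly like an inverse softmax temperature.

```latex
\ell_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}}
\quad\xrightarrow{\text{quantization}}\quad
\tilde{\ell}_{ij} = \frac{(s_q q_i)\cdot(s_k k_j)}{\sqrt{d}} = s_q s_k\,\ell_{ij},
\qquad
\operatorname{softmax}(s_q s_k\,\ell) = \operatorname{softmax}\!\left(\frac{\ell}{T}\right),
\quad T = \frac{1}{s_q s_k}.
```

So a per-head multiplier near \(1/(s_q s_k)\), which is what ATM estimates from calibration data, restores the original temperature \(T = 1\).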

Building blocks (each with the Sandwich pattern):

🍞 Top Bread (Hook): Imagine deciding which books to keep as hardcovers and which as paperbacks to save shelf space. 🥬 Filling:

  • What it is: Selective Quantization Layout chooses which layers to make low-bit and which to keep in float.
  • How it works: (1) Integerize all LLM linear layers; (2) Integerize MLPs in the DiT; (3) Keep Q, K, V, O (attention projections) in floating point; (4) Leave the operator schedule unchanged.
  • Why it matters: If you quantize the wrong spots, tiny errors snowball and control breaks. 🍞 Bottom Bread (Anchor): Keep the delicate glassware (attention projections) intact, box the sturdy plates (LLM + MLPs).
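The layout rule can be sketched as a filter over layer names. The names and the `"w4a8"`/`"fp16"` labels below are hypothetical; real VLA checkpoints use their own naming schemes:

```python
# Suffixes of the fragile attention projections (hypothetical naming convention).
FRAGILE_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj")

def quantization_plan(layer_names):
    """Low-bit for LLM linears and DiT MLPs; float for attention projections."""
    plan = {}
    for name in layer_names:
        if name.endswith(FRAGILE_SUFFIXES):
            plan[name] = "fp16"      # keep delicate glassware intact
        else:
            plan[name] = "w4a8"      # box the sturdy plates
    return plan

plan = quantization_plan([
    "llm.layers.0.mlp.up_proj",
    "dit.layers.0.attn.q_proj",
    "dit.layers.0.mlp.down_proj",
])
```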

🍞 Top Bread (Hook): You know how sunglasses can make things too dim or too bright? You want just the right tint. 🥬 Filling:

  • What it is: Attention Temperature Matching (ATM) is a per-head scalar that keeps attention neither too sharp nor too flat.
  • How it works: (1) Compare teacher vs. student logits spread on a small calibration set; (2) Compute a safe, clipped scalar α per head; (3) Fold α into dequantization; (4) Ignore tiny differences using a neutrality band so we don’t overcorrect.
  • Why it matters: Wrong temperature ruins which tokens get attention. 🍞 Bottom Bread (Anchor): Like adjusting a camera’s exposure so bright and dark areas both look right.
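A minimal sketch of the ATM scalar, using the α formula described in the methodology; the clip bounds and band width here are illustrative assumptions, not the paper's constants:

```python
def atm_alpha(teacher_std, student_std, clip=(0.5, 2.0), band=0.05, eps=1e-6):
    """Per-head temperature correction: match student logit spread to the teacher's."""
    alpha = teacher_std / (student_std + eps)   # ratio of logit spreads
    alpha = max(clip[0], min(clip[1], alpha))   # clip to a safe range
    if abs(alpha - 1.0) < band:                 # neutrality band: skip tiny corrections
        alpha = 1.0
    return alpha                                # folded into dequantization, no new ops

alpha = atm_alpha(teacher_std=2.0, student_std=2.5)  # student too flat: alpha sharpens it
```

Usage: when the student's logits are too spread out, α < 1 cools them back down; a near-1 ratio is left alone so calibration noise never over-corrects a healthy head.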

🍞 Top Bread (Hook): Think of setting the master volume on a stereo so the whole song sounds balanced. 🥬 Filling:

  • What it is: Output Head Balancing (OHB) is a per-layer scalar that restores the output energy after the attention’s output projection.
  • How it works: (1) Measure teacher vs. student RMS energy per layer; (2) Compute a safe, clipped scalar β; (3) Apply β to the output before it joins the residual stream; (4) Use a neutrality band to avoid jitter.
  • Why it matters: Too little or too much energy confuses residual pathways and layer norms downstream. 🍞 Bottom Bread (Anchor): Like making sure each instrument in a band isn’t drowning out the others.
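OHB follows the same pattern as ATM, but on RMS energy per layer. A minimal sketch, again with illustrative clip and band constants:

```python
def rms(xs):
    """Root-mean-square 'energy' of a vector."""
    return (sum(x * x for x in xs) / len(xs)) ** 0.5

def ohb_beta(teacher_rms, student_rms, clip=(0.5, 2.0), band=0.05, eps=1e-6):
    """Per-layer energy correction for the attention block's output."""
    beta = teacher_rms / (student_rms + eps)
    beta = max(clip[0], min(clip[1], beta))
    return 1.0 if abs(beta - 1.0) < band else beta

teacher_out = [0.4, -0.2, 0.6]
student_out = [0.2, -0.1, 0.3]                  # quantization halved the energy
beta = ohb_beta(rms(teacher_out), rms(student_out))
rebalanced = [beta * x for x in student_out]    # restored residual injection
```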

Finally, QuantVLA folds α and β into existing scales, so we keep integer kernels, add no new activations, and keep the original architecture intact.

03Methodology

High-level flow: Inputs (vision frames + instruction + robot state) → Vision and Language Encoders → Fused Transformer Features → DiT Action Head (selectively quantized) → Actions.

Step-by-step recipe (with reasons and examples):

  1. Prepare the model and calibration buffer
  • What happens: Take a trained VLA (e.g., π0.5 or GR00T N1.5) and collect a small unlabeled calibration buffer—just a few short rollouts.
  • Why: We need typical activation ranges to pick safe quantization scales.
  • Example: 32 mini-batches of typical scenes and instructions from LIBERO are enough to estimate robust scales.
  2. Reparameterize linear layers for friendly quantization
  • What happens: Use a rotation-and-smoothing style reparameterization (inspired by DuQuant) that redistributes outliers and balances channels before quantizing. This preserves the exact function in floating point but makes integerization more stable.
  • Why: Without this, a few large values dominate, and rounding breaks accuracy.
  • Example: Apply block-orthogonal rotations and per-channel smoothing with a mild coefficient so no single channel controls the scale.
  3. Selective Quantization Layout
  • What happens: Integerize all linear layers in the language model and the MLPs in the DiT; keep attention projections Q, K, V, O in floating point.
  • Why: Quantizing attention projections magnifies two drifts—attention temperature shift (Q·K scale) and residual energy shift (V then W_o). Keeping them in float avoids compounding these errors.
  • Example: On π0.5 and GR00T N1.5, this layout preserves accuracy best while saving memory across quantized parts.
  4. Choose bit widths and scales (W4A8)
  • What happens: Use 4 bits for weights and 8 bits for activations in quantized layers, estimating scales from the calibration buffer (with high-percentile clipping to ignore rare spikes).
  • Why: W4A8 gives large memory savings with stable accuracy for these VLA stacks.
  • Example: Per-channel weight scales and per-token activation scales minimize rounding error where it matters.
  5. Attention Temperature Matching (ATM)
  • What happens: For each attention head, compute α = Std(teacher logits) / (Std(student logits) + tiny ε), clip to a safe range, apply a neutrality band, and fold into dequantization.
  • Why: If logits are too small, attention becomes flat; if too large, it becomes spiky. Either way, focus is wrong.
  • Example: After ATM, the standard deviation of student logits closely tracks the teacher across layers in GR00T N1.5.
  6. Output Head Balancing (OHB)
  • What happens: For each DiT layer’s attention block, compute β = RMS(teacher output) / (RMS(student output) + tiny ε), clip and neutralize, then scale the output entering the residual path.
  • Why: Keeps the residual injection gain and layer-norm operating point aligned with the teacher.
  • Example: After OHB, post-projection RMS matches the teacher more closely, especially in deeper layers.
  7. Fold scalars and keep the operator schedule unchanged
  • What happens: Integrate α and β into existing dequantization scales or simple output scaling so no new ops or buffers are introduced. Integer GEMMs remain intact.
  • Why: Guarantees deployment simplicity and speed—no retraining, no graph edits, no extra memory.
  • Example: The only overhead is the one-time scale folding during calibration.
  8. Inference
  • What happens: Run the quantized VLA on LIBERO tasks or on-device. The LLM and DiT MLPs use integer math; attention projections run in float; ATM and OHB quietly keep distributions aligned.
  • Why: This delivers the desired memory cuts and speed, while behavior matches or beats full precision.
  • Example: On π0.5, QuantVLA reaches a 97.6% average success rate with memory reduced from 4.27 GB to 1.28 GB for the quantized components.
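The scale folding in step 7 can be sketched as a single multiply per head or layer. The per-head values below are hypothetical; the point is that the correction piggybacks on a multiply the integer pipeline performs anyway:

```python
def fold_scales(dequant_scale, alpha=1.0, beta=1.0):
    """Fold ATM/OHB scalars into an existing dequantization scale.

    The integer GEMM output is multiplied by dequant_scale regardless, so
    applying alpha (per head) or beta (per layer) is free: multiply once at
    calibration time, add zero ops at inference.
    """
    return dequant_scale * alpha * beta

# Per-head dequant scales for one attention layer (hypothetical values).
scales = [0.011, 0.009, 0.012]
alphas = [1.10, 1.00, 0.93]      # ATM corrections per head
folded = [fold_scales(s, alpha=a) for s, a in zip(scales, alphas)]
```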

What breaks without each step:

  • No selective layout: Quantizing Q/K/V/O causes major performance drops, especially on long-horizon tasks.
  • No ATM: Attention goes off-temperature, making focus too diffused or too peaky.
  • No OHB: Residual energy drifts, layer norms misbehave, and deep stacks become unstable.
  • No calibration buffer: Scales are guessed poorly, raising rounding and clipping errors.

The secret sauce:

  • Precision where it counts (keep attention projections in float), compression where it pays (LLM + DiT MLPs in low-bit), and two tiny, safe scalars (ATM and OHB) that are learned once on unlabeled data and folded into scales. This keeps the model’s “feel” the same after shrinking, especially across long, multi-step diffusion control.

04Experiments & Results

The test: Measure task success rates and memory usage on standard VLA simulation benchmarks.

  • Why these metrics: Success rate reflects real control quality. Memory reflects deployability on small devices.

Competition/baselines:

  • Full-precision models (teacher): π0.5 and GR00T N1.5 with FP16.
  • DuQuant-style PTQ (quantize both LLM and DiT): strong in unimodal settings.
  • SmoothQuant (LLM-only and LLM+DiT-MLP at W8A8): a widely used PTQ for LLMs.
  • QuantVLA (ours): selective layout + ATM + OHB at W4A8 (also tested W4A4).

Main scoreboard with context:

  • OpenPI π0.5 on LIBERO (Spatial, Object, Goal, Long):

    • FP16 baseline: 97.1% avg.
    • DuQuant (LLM+DiT) W4A8: 76.3% avg (like dropping from an A to a C).
    • QuantVLA (LLM-only) W4A8: 97.6% avg.
    • QuantVLA (full) W4A8: 97.6% avg with memory reduced from 4.27 GB to 1.28 GB (~70% relative savings on the quantized parts).
    • W4A4 ablation: 95.3% avg (still very strong for such low precision).
  • GR00T N1.5 on LIBERO:

    • FP16 baseline: 86.5% avg.
    • DuQuant (LLM+DiT) W4A8: 70.0% avg.
    • QuantVLA (LLM-only) W4A8: 87.0% avg.
    • QuantVLA (full) W4A8: 88.0% avg with memory cut from 2.02 GB to 0.91 GB (~55% relative savings on the quantized parts).
    • Denoising steps: at 8 steps, 88.0% (beats FP16 86.5%); at 16 steps, 88.5%.
  • SmoothQuant comparison (π0.5):

    • SmoothQuant (LLM) W8A8: 96.6% avg.
    • SmoothQuant (LLM+DiT-MLP) W8A8: 97.0% avg.
    • QuantVLA W4A8: 97.6% avg (matches/exceeds at fewer bits).
  • Extended benchmark: Pick-and-Can (GR00T)

    • FP16: 31/50 successes.
    • SmoothQuant W4A8: 16/50.
    • QuantVLA W4A8: 27/50 (substantially closer to FP16).
  • Beyond DiT: OpenVLA (non-DiT action head)

    • FP16 Spatial: 84.7%.
    • QuantVLA (W8A16 setting): 86.0% (matches/slightly exceeds), showing applicability beyond DiT-based heads.

Surprising findings:

  • QuantVLA sometimes exceeds full precision. Why? Low-bit noise, when scale-calibrated, can act like a mild regularizer while ATM/OHB keep the important distributions aligned.
  • Long-horizon tasks (Long) remained stable or even improved. This is where drift usually accumulates; targeted calibration seems to halt that snowball.
  • Even W4A4 on π0.5 held up impressively (95.3%), suggesting room for more aggressive compression when memory is very tight.

Takeaway: Methods from unimodal LLM/ViT quantization do not directly transfer to VLA because of the tight language-to-diffusion coupling. QuantVLA’s selective layout plus two tiny calibrations make low-bit VLA feasible and robust.

05Discussion & Limitations

Limitations:

  • Needs a small unlabeled calibration buffer. If this buffer is unrepresentative (very different scenes/instructions), scales and α/β might be slightly off.
  • The selective layout keeps attention projections in float, so it’s not pure end-to-end integer; that’s a deliberate trade-off for stability.
  • Extreme low-bit (e.g., W2A4) was not the main target; performance at ultra-low precision may require further tricks.

Required resources:

  • Access to the trained model and a short calibration run (e.g., 32 batches) to estimate scales and ATM/OHB.
  • Standard integer GEMM kernels for W4A8 inference; no special operators beyond scale folding.

When NOT to use:

  • If you can’t run any calibration at all (zero examples), or if your deployment requires fully integer attention projections with no float fallback.
  • If your action head is highly nonstandard (exotic attention variants) and the fragile spots differ (you may need to adapt which layers stay in float).

Open questions:

  • Can we push attention projections to low-bit with extra safeguards (e.g., per-head mixed precision) while keeping stability?
  • How to automate buffer selection so calibration remains robust under domain shift (new scenes, lighting, tasks)?
  • Can α and β be predicted on the fly from streaming stats to adapt across environments without a separate calibration pass?
  • What’s the best mixed-precision recipe for even longer horizons (e.g., adaptive precision over time steps in diffusion)?

06Conclusion & Future Work

Three-sentence summary:

  • QuantVLA is a training-free PTQ framework tailored to VLA models that selectively quantizes the language backbone and the DiT MLPs while keeping attention projections in float.
  • Two tiny calibrations—Attention Temperature Matching and Output Head Balancing—re-align attention sharpness and residual energy after quantization.
  • The result is large memory savings (around 55–70% on the quantized parts) with success rates that match or beat full precision on LIBERO and strong robustness across settings.

Main achievement:

  • The first practical, scale-calibrated PTQ method for VLA systems that stabilizes a DiT action head and delivers state-of-the-art low-bit performance without retraining.

Future directions:

  • Explore partial low-bit attention projections with per-head safeguards, adaptive α/β that update online, and automated calibration-set selection under domain shifts.
  • Extend to even lower bit widths or mixed-precision schedules that vary over diffusion steps or layers.

Why remember this:

  • QuantVLA shows that intelligent scale calibration plus selective precision, not brute-force retraining or redesign, can unlock real low-power, long-horizon robotic control—bringing capable embodied AI closer to everyday devices.

Practical Applications

  • Run advanced VLA policies on low-power embedded computers inside robots.
  • Extend temporal horizons (longer context windows) without increasing memory.
  • Deploy multiple control policies in parallel on the same hardware for task switching.
  • Accelerate on-device inference to reduce latency in closed-loop control.
  • Lower energy usage for mobile robots operating on batteries.
  • Improve edge reliability by reducing dependence on high-bandwidth cloud links.
  • Retrofit existing VLA models with PTQ post-training, avoiding costly retraining.
  • Scale fleet deployments (many robots) using the same hardware budget.
  • Enable robust long-horizon manipulation in constrained industrial settings.
  • Facilitate field robotics (agriculture, logistics) with compact, stable control stacks.
#Vision-Language-Action #Post-Training Quantization #Diffusion Transformer #Attention Temperature Matching #Output Head Balancing #Selective Quantization #W4A8 quantization #Integer inference #Embodied AI #Robot manipulation #Low-bit quantization #Calibration buffer #Residual stream #Softmax temperature #Memory-efficient deployment