
An Empirical Study of World Model Quantization

Intermediate
Zhongqian Fu, Tianyi Zhao, Kai Han et al. · 2/2/2026
arXiv · PDF

Key Summary

  • World models are AI tools that imagine the future so a robot can plan what to do next, but they are expensive to run many times in a row.
  • Post-Training Quantization (PTQ) shrinks these models to run faster and use less memory, but in planning tasks, tiny errors can pile up over many steps.
  • Using DINO-WM as the test case, the study compares popular PTQ methods across many bit-widths, activation settings, and planning horizons up to 50 steps.
  • 8-bit weights usually work almost as well as full precision, but 4-bit can be shaky and 3-bit often breaks planning.
  • Grouping weights (group-wise quantization) helps stabilize 4-bit rollouts, especially with group size 128, but doesn’t rescue extreme 3-bit settings.
  • Making activation quantization more fine-grained (per-token) gives mixed results; keeping a consistent scale (per-tensor) is often just as good or better for long rollouts.
  • The encoder is far more sensitive to quantization than the predictor; if the encoder is too low-bit, planning struggles no matter how long you plan.
  • On PushT, failures often come from planning geometry going off-course even when images look okay; on Wall, failures come from visual representation breaking down.
  • Very low-bit settings can break the link between the planner’s loss and actual task success, so optimizing longer doesn’t help.
  • These findings offer concrete advice for safely compressing world models under strict compute limits.

Why This Research Matters

Robots and interactive AI need to plan many steps ahead quickly and on small devices, so shrinking models safely is essential. This study shows exactly how to compress world models without breaking their planning ability, which saves time, energy, and money. It warns that some common tricks (like going too low-bit or over-fine activation scaling) can secretly harm long-horizon behavior. It highlights protecting the encoder as the top priority when precision is limited. These lessons help engineers deploy reliable, fast planners in homes, warehouses, hospitals, and on drones. With the right choices, we get both speed and trustworthiness. With the wrong ones, we get fast models that plan poorly.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re playing a board game where you must think a few moves ahead. You picture what might happen if you move here or there, then pick the best move. AI does something similar when it plans.

🥬 The Concept: World Models are AI systems that learn how the world changes so they can imagine possible futures before acting.

  • How it works:
    1. Watch: The model looks at the current situation (like an image from a camera).
    2. Imagine: It predicts what will happen if you take different actions.
    3. Choose: A planner picks the action that leads to the best predicted future.
  • Why it matters: Without world models, the agent guesses blindly each step. Planning becomes slow, clumsy, or fails in tricky tasks.

🍞 Anchor: A robot arm wants to push a block to a goal. The world model simulates many possible pushes in its head and picks the one that likely gets the block to the target.

🍞 Hook: You know how doing a chore once is easy, but doing it 50 times in a row gets tiring? AI planning often runs the model again and again.

🥬 The Concept: Planning Horizon means how many planning steps the agent rolls out into the future during decision-making.

  • How it works:
    1. Pick a horizon number (like 10, 20, or 50).
    2. For each candidate action sequence, roll out the world model that many steps.
    3. Compare predicted futures to the goal and pick the best.
  • Why it matters: Longer horizons are powerful but expensive, and small mistakes can grow with each extra step.

🍞 Anchor: If you plan 50 moves ahead in chess, a tiny evaluation error on move 5 can snowball into a losing plan by move 50.

🍞 Hook: Picture shrinking a big picture into a smaller file to send faster. If you shrink too much, it gets blurry.

🥬 The Concept: Post-Training Quantization (PTQ) makes a trained model smaller and faster by storing numbers with fewer bits instead of retraining it.

  • How it works:
    1. Collect a small set of example inputs (calibration data).
    2. Choose bit-width (like 8-bit or 4-bit) and scaling rules.
    3. Convert weights (and sometimes activations) to low-precision integers.
  • Why it matters: Without PTQ, world models can be too slow or memory-hungry to plan over long horizons on real devices.

🍞 Anchor: Like compressing a movie from 4K to 1080p so it streams smoothly on your tablet.

🍞 Hook: Think of bits like measuring cups. Using fewer cups makes cooking faster but less exact.

🥬 The Concept: Bit-width is how many bits are used to store each number in the model (e.g., 32-bit vs. 8-bit vs. 4-bit).

  • How it works:
    1. Fewer bits → fewer distinct values a number can take.
    2. Quantize full-precision values to the nearest available low-bit value.
    3. Use scale and zero-point to map real numbers to integers.
  • Why it matters: Too few bits can add noise; over many planning steps, this noise can compound into bad plans.

🍞 Anchor: A ruler with only centimeter marks (not millimeters) is faster to read but less precise; measuring many times adds up small errors.
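
To make the scale and zero-point idea concrete, here is a minimal Python sketch of uniform asymmetric quantization followed by dequantization. The function name and the random data are illustrative, not from the paper; real PTQ toolkits add clipping choices, symmetric variants, and finer-grained scales.

```python
import numpy as np

def quantize_dequantize(x, num_bits=8):
    """Map floats onto a low-bit integer grid and back (uniform, asymmetric)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)          # step size between levels
    zero_point = round(qmin - float(x.min()) / scale)    # integer that represents 0.0
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale                      # dequantized approximation

x = np.random.randn(1000).astype(np.float32)
for bits in (8, 4, 3):
    err = np.abs(x - quantize_dequantize(x, bits)).mean()
    print(f"{bits}-bit mean abs reconstruction error: {err:.4f}")  # grows as bits shrink
```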

🍞 Hook: When you organize crayons, you can sort by every color shade or just by big groups (like blues, reds, greens).

🥬 The Concept: Quantization Granularity is how finely we set scaling for numbers during quantization.

  • How it works:
    1. Coarser granularity (per-tensor) uses one scale for many numbers.
    2. Finer granularity (per-channel or per-token) uses more scales for sub-parts.
    3. Choose granularity to balance stability and flexibility.
  • Why it matters: In long-horizon planning, stable, consistent scaling can beat overly flexible scaling that adds variability each step.

🍞 Anchor: Using one volume knob for the whole band is simple and steady; separate knobs for every instrument are flexible but can create chaos if they drift.

🍞 Hook: Baking a dozen cookies at once is faster than baking each cookie one by one.

🥬 The Concept: Group-wise Weight Quantization puts weight channels into groups (like 128 at a time) and quantizes each group with its own scale.

  • How it works:
    1. Split the weight matrix into groups.
    2. Compute a scale per group.
    3. Quantize values within each group using that scale.
  • Why it matters: Without grouping, 4-bit quantization can wobble; groups often stabilize rollouts. But at extreme 3-bit, even groups can’t save the plan.

🍞 Anchor: Sorting books by shelf (groups) makes it easier to keep each shelf tidy, but if the labels are too vague, the whole library still gets messy.
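
A minimal sketch of group-wise weight quantization, assuming contiguous groups of 128 input channels with one symmetric scale each; production kernels pack the low-bit integers and fuse dequantization into the matmul.

```python
import numpy as np

def quantize_weights_groupwise(W, num_bits=4, group_size=128):
    """Quantize W in contiguous column groups, each with its own per-row scale."""
    qmax = 2 ** (num_bits - 1) - 1                       # e.g. 7 for signed 4-bit
    W_hat = np.empty_like(W)
    for start in range(0, W.shape[1], group_size):
        group = W[:, start:start + group_size]
        scale = np.abs(group).max(axis=1, keepdims=True) / qmax  # one scale per row-group
        scale = np.where(scale == 0, 1.0, scale)
        q = np.clip(np.round(group / scale), -qmax, qmax)
        W_hat[:, start:start + group_size] = q * scale
    return W_hat

W = np.random.randn(256, 1024).astype(np.float32)
for g in (1024, 128):                                    # whole row vs. group size 128
    err = np.abs(W - quantize_weights_groupwise(W, 4, g)).mean()
    print(f"4-bit, group size {g}: mean abs error {err:.4f}")
```

Smaller groups give each scale fewer values to cover, which is the intuition behind why a group size like 128 tends to stabilize 4-bit weights compared with one scale per whole channel.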

🍞 Hook: Imagine coloring a picture with fewer shades. You still want the main shapes to look right.

🥬 The Concept: Activation Quantization compresses the layer outputs (the activations) to lower precision during inference.

  • How it works:
    1. Observe activation ranges from calibration examples.
    2. Pick scales and zero-points.
    3. Quantize activations on-the-fly during inference.
  • Why it matters: Without careful activation quantization, outliers and changing scales can destabilize long rollouts.

🍞 Anchor: If your camera keeps changing brightness wildly, a time-lapse video looks jittery.

🍞 Hook: You can set one brightness for the whole photo, or tweak brightness for each small patch.

🥬 The Concept: Per-tensor vs. Per-token Activation Quantization is the choice of how many scales to use for activations.


  • How it works:
    1. Per-tensor: one scale for the entire activation tensor.
    2. Per-token: separate scales per token (position) for finer control.
    3. Compare stability vs. flexibility in long sequences.
  • Why it matters: In planning, per-token isn’t always better; its extra flexibility can introduce inconsistent scaling over time.

🍞 Anchor: Setting one classroom temperature is steady; letting each desk have its own thermostat can cause drafts and hot spots.
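
The sketch below contrasts the two granularities by computing one symmetric scale for a whole activation tensor versus one scale per token row; the shapes and data are illustrative only.

```python
import numpy as np

def activation_scales(acts, num_bits=8, per_token=False):
    """One scale for the whole tensor (per-tensor) or one per token row (per-token)."""
    qmax = 2 ** (num_bits - 1) - 1
    if per_token:
        return np.abs(acts).max(axis=-1, keepdims=True) / qmax   # shape (tokens, 1)
    return np.abs(acts).max() / qmax                             # scalar

acts = np.random.randn(196, 768).astype(np.float32)   # e.g., 196 patch tokens
qmax = 127
for per_token in (False, True):
    s = activation_scales(acts, 8, per_token)
    recon = np.clip(np.round(acts / s), -qmax, qmax) * s
    label = "per-token " if per_token else "per-tensor"
    print(label, "mean abs error:", float(np.abs(acts - recon).mean()))
```

Per-token usually shows lower one-shot reconstruction error in a check like this, which is exactly why the paper's finding that it does not reliably help long rollouts is the interesting part.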

🍞 Hook: Picture LEGO pieces from a high-quality kit guiding your build.

🥬 The Concept: DINO-WM is a world model that uses strong pre-trained visual features (like from DINO) to plan in new tasks without special rewards.

  • How it works:
    1. An encoder turns images into rich features.
    2. A predictor imagines future features given actions.
    3. A planner chooses actions that make predicted features match the goal image.
  • Why it matters: DINO-WM is powerful and a good testbed to study how quantization affects planning.

🍞 Anchor: It’s like using a detailed map (encoder) and a route simulator (predictor) to reach a destination without needing a hand-written set of turn-by-turn rules.
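
Here is a toy schematic of the encode, predict, and score loop described above; `encode`, `predict`, and the shapes are illustrative stand-ins, not the real DINO-WM interfaces.

```python
import numpy as np

def encode(image):                        # encoder: image -> latent features
    return image.mean(axis=(0, 1))        # toy feature: average color per channel

def predict(z, action):                   # predictor: (features, action) -> next features
    return z + 0.1 * action               # toy dynamics

def planning_loss(z_start, z_goal, action_seq):
    z = z_start
    for a in action_seq:                  # imagine the future one step at a time
        z = predict(z, a)
    return np.linalg.norm(z - z_goal)     # distance to the goal features

z0 = encode(np.random.rand(64, 64, 3))
zg = encode(np.random.rand(64, 64, 3))
candidates = [np.random.randn(10, 3) for _ in range(32)]       # 32 random 10-step plans
best_plan = min(candidates, key=lambda seq: planning_loss(z0, zg, seq))
print("best candidate loss:", planning_loss(z0, zg, best_plan))
```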

02Core Idea

🍞 Hook: You know how a tiny twist in a telescope becomes a huge miss when looking at a faraway star? Small errors grow with distance.

🥬 The Concept: The Aha! is that quantizing world models isn’t just about accuracy vs. size—because planning rolls out many steps, quantization changes the rollout dynamics and the planner’s objective alignment.

  • How it works:
    1. Compress weights/activations with PTQ.
    2. Roll out long horizons; quantization noise compounds over time.
    3. Some granularities (like group-wise weights) stabilize 4-bit rollouts.
    4. Activation granularity (per-token) gives inconsistent gains.
    5. Encoder precision is critical; predictor can be mildly noisy.
    6. Extreme low-bits can break the link between loss and success.
  • Why it matters: Without honoring these dynamics, a fast model can plan poorly, even if short tests look fine.

🍞 Anchor: A paper airplane that looks great for the first two meters but veers off by the tenth—quantization can make long flights go crooked.
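
A tiny numerical illustration of the compounding effect, assuming a toy linear predictor and a small per-step perturbation standing in for quantization noise; nothing here comes from the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.eye(8) + 0.05 * rng.standard_normal((8, 8))   # toy linear "predictor"
z_fp = z_q = rng.standard_normal(8)                  # same starting latent

for step in range(1, 51):
    z_fp = A @ z_fp                                   # clean rollout
    z_q = A @ z_q + 0.01 * rng.standard_normal(8)     # quantization-like noise each step
    if step in (5, 20, 50):
        gap = np.linalg.norm(z_fp - z_q)
        print(f"step {step:2d}: gap between rollouts = {gap:.3f}")  # gap grows with horizon
```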

Three analogies for the same idea:

  1. Dominoes: If the first domino (encoder representation) is tilted, all later dominoes (future steps) fall the wrong way, no matter how carefully you nudge the middle.
  2. Whisper game: A slightly muffled first whisper (low-bit encoder) turns into nonsense after 20 passes (long horizon), even if some people in the middle (predictor) speak clearly.
  3. GPS vs. road bumps: A fuzzy GPS (encoder) makes you head the wrong direction. Minor road bumps (predictor noise) are tolerable, but a wrong heading ruins the trip.

Before vs. After:

  • Before: People expected that if 8-bit works for vision or language, it should also work for planning, and that finer activation granularity is always safer.
  • After: We learned that rollout stability dominates: group-wise 4-bit weights can help, per-token activations don’t guarantee wins, encoder precision matters most, and extreme low-bits can break optimization itself.

Why it works (intuition):

  • Planning is iterative prediction. Tiny biases in the latent space get reapplied at every step.
  • Group-wise weight scaling reduces internal scale mismatch, dampening error growth.
  • Per-token activation scaling can vary across time/positions, injecting inconsistency that accumulates.
  • The encoder sets the coordinate system of the imagined world; if that system is distorted, no amount of extra planning fixes it.

Building blocks (new concepts introduced with sandwiches):

🍞 Hook: Think of a camera lens that defines what the scene looks like, and a weather app that predicts tomorrow.

🥬 The Concept: Encoder and Predictor are the two main parts of the world model: the encoder turns images into features; the predictor rolls those features forward in time.

  • How it works:
    1. Encoder: image → latent features.
    2. Predictor: current features + action → next features.
    3. Repeat to simulate futures.
  • Why it matters: If the encoder is wrong, the whole imagined movie is off; if the predictor is a bit noisy, planning sometimes compensates.

🍞 Anchor: A map that’s drawn wrong (encoder) ruins every route; a slightly bumpy road (predictor) is manageable.

🍞 Hook: If one light bulb is way brighter than the rest, the photo looks odd.

🥬 The Concept: Activation Outliers are activations that are much larger than the rest; smoothing methods (like SmoothQuant) rebalance scales between activations and weights to reduce the outliers' impact.

  • How it works:
    1. Measure activation scales.
    2. Shift some scale from activations to weights.
    3. Quantize with calmer ranges.
  • Why it matters: Without smoothing, low-bit activations clip or jitter, which snowballs in long rollouts.

🍞 Anchor: Wearing sunglasses evens out blinding glares so you can see the whole scene better.
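
A minimal sketch of the smoothing idea in the SmoothQuant style: divide each activation channel by a factor and fold that factor into the matching weight rows, so the matrix product is unchanged while activation outliers shrink. The alpha value and toy data are illustrative, and the details are simplified.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """Shift scale from activations X to weights W without changing X @ W."""
    act_max = np.abs(X).max(axis=0)                  # per-channel activation range
    w_max = np.abs(W).max(axis=1)                    # per-channel weight range
    s = act_max ** alpha / (w_max ** (1 - alpha) + 1e-8)
    return X / s, W * s[:, None]

X = np.random.randn(32, 64)
X[:, 0] *= 10.0                                      # make one channel an outlier
W = np.random.randn(64, 128)
X_s, W_s = smooth(X, W)
print("output preserved:", np.allclose(X @ W, X_s @ W_s))
print("activation range before/after:", float(np.abs(X).max()), float(np.abs(X_s).max()))
```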

🍞 Hook: When testing a recipe, you try a few bites to set the oven right.

🥬 The Concept: Calibration Data are small example runs used to pick quantization scales before deployment.

  • How it works:
    1. Gather short trajectories not used for final testing.
    2. Record activation ranges and layer outputs.
    3. Pick scales that minimize output error.
  • Why it matters: Without calibration, scales can be badly chosen, making errors explode in planning.

🍞 Anchor: You preheat the oven after testing a few cookies so the whole batch bakes evenly.
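
A minimal calibration sketch, assuming a simple running min/max over a handful of calibration inputs; methods like OMSE or OmniQuant instead optimize scales against an output-error objective.

```python
import numpy as np

def calibrate_activation_range(layer_fn, calib_inputs, momentum=0.9, num_bits=8):
    """Track running min/max of observed activations, then derive scale and zero-point."""
    running_min = running_max = None
    for x in calib_inputs:
        a = layer_fn(x)
        lo, hi = float(a.min()), float(a.max())
        if running_min is None:
            running_min, running_max = lo, hi
        else:
            running_min = momentum * running_min + (1 - momentum) * lo
            running_max = momentum * running_max + (1 - momentum) * hi
    scale = (running_max - running_min) / (2 ** num_bits - 1)
    zero_point = round(-running_min / scale)
    return scale, zero_point

calib = [np.random.randn(64) for _ in range(16)]      # stand-in for short calibration rollouts
print(calibrate_activation_range(np.tanh, calib))
```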

03Methodology

High-level pipeline: Input (pretrained DINO-WM + environments + calibration data) → Quantize (weights and/or activations with chosen bits and granularity) → Plan (rollouts up to 50 steps) → Measure success rate and rollout stability.

Steps in detail:

  1. Choose the base world model (DINO-WM) and tasks (Wall, PushT).
  • What happens: Load the public DINO-WM checkpoint; set up two visual planning environments.
  • Why this step exists: We need a strong, realistic testbed where long-horizon rollouts matter.
  • Example: On Wall and PushT, the task is to reach a goal image from a start image by simulating action sequences.
  2. Define quantization settings: weight-only and joint weight-activation.
  • What happens: Pick bit-widths (e.g., 8, 6, 4 for both weights and activations) and methods (RTN, OMSE, AWQ, SmoothQuant, OmniQuant).
  • Why this step exists: Different PTQ strategies behave differently under low bits; we compare them fairly.
  • Example: Try W8A8, W6A6, W4A8, W4A4 and also weight-only 3/4/8 bits.
  3. Pick quantization granularity for weights and activations.
  • What happens: For weights, compare per-channel vs. per-group (group size 128). For activations, compare per-tensor vs. per-token.
  • Why this step exists: Granularity changes stability vs. flexibility. Long rollouts may prefer stability.
  • Example: On Wall, W4 with group size 128 often stabilizes rollouts; on PushT, the gains are smaller but present in some settings.
  4. Calibrate scales using short trajectories.
  • What happens: Run horizon-2 rollouts on separate seeds, record activation ranges and outputs, then pick scales/zero-points.
  • Why this step exists: Good scales reduce initial quantization error that would otherwise compound.
  • Example: OMSE or OmniQuant use calibration loss to minimize output mismatch.
  5. Quantize the model modules, and optionally ablate encoder vs. predictor.
  • What happens: Apply PTQ to the full model, or isolate the encoder or predictor to test sensitivity.
  • Why this step exists: We need to know which parts are the bottlenecks at low bits.
  • Example: Set encoder at 6-bit and predictor at 8-bit to see how success rates change.
  6. Plan with rollouts up to 50 iterations and record metrics.
  • What happens: For each episode, generate candidate action sequences, roll out the world model, compute a planning loss (distance to goal in latent space), and pick the best.
  • Why this step exists: Planning performance, especially over long horizons, reveals how quantization noise accumulates.
  • Example: Compare success rates at 0, 5, 10, 20, 30, 40, 50 iterations.
  7. Analyze qualitative rollouts and loss alignment.
  • What happens: Visualize predicted frames and plot the planning loss over iterations.
  • Why this step exists: Numbers alone can hide whether failures come from representation collapse or geometric misalignment.
  • Example: On PushT, frames can look plausible but still fail; the loss curve may stop decreasing under very low bits. (A schematic of the full evaluation sweep follows this list.)
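
The overall sweep can be summarized as a nested loop over quantization configurations and planning horizons; the sketch below uses placeholder stubs for the quantizer, planner, and environment, not the authors' code or numbers.

```python
import random
from itertools import product

# Placeholder stubs: they only exist to show the shape of the evaluation sweep.
def quantize_model(config):
    return config                                     # pretend this returns a quantized model

def run_episode(quantized_model, horizon):
    return random.random() < 0.5                      # pretend success/failure outcome

bit_configs = ["FP32", "W8", "W4-g128", "W4A4"]
horizons = [0, 5, 10, 20, 30, 40, 50]
num_episodes = 50

success_rates = {}
for cfg, h in product(bit_configs, horizons):
    model = quantize_model(cfg)
    wins = sum(run_episode(model, h) for _ in range(num_episodes))
    success_rates[(cfg, h)] = wins / num_episodes     # one cell of a results table

for key in sorted(success_rates):
    print(key, f"{success_rates[key]:.2f}")
```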

Secret sauce (what makes this study clever):

  • Long-horizon focus: Instead of single-pass accuracy, it probes how tiny quantization errors evolve across many prediction steps.
  • Granularity sweep: It disentangles when stability (per-tensor, group-wise) beats flexibility (per-token, per-channel) for planning.
  • Module sensitivity: It pinpoints the encoder as the precision-critical part, guiding practitioners to protect it when budgets are tight.

More clarifying sandwiches for key choices:

🍞 Hook: Imagine deciding whether to shrink just your backpack or both your backpack and your jacket for a long hike.

🥬 The Concept: Weight-only vs. Joint Weight-Activation Quantization means compressing just the model’s weights or both weights and activations.

  • How it works:
    1. Weight-only: store parameters in low bits; keep activations in higher precision.
    2. Joint: also quantize activations during inference.
    3. Tune bits for each case.
  • Why it matters: Joint quantization saves more but risks more rollout instability.

🍞 Anchor: Carrying a lighter pack (weights) is good; also wearing an ultralight jacket (activations) saves more energy, but you might get cold if the weather (rollout) turns rough.

🍞 Hook: Picking a tool from a toolbox depends on the job.

🥬 The Concept: PTQ Methods (RTN, OMSE, AWQ, SmoothQuant, OmniQuant) are different recipes for choosing scales and handling outliers.

  • How it works:
    1. RTN: simple round-to-nearest.
    2. OMSE: calibrate to minimize output error.
    3. AWQ: protect important weight channels.
    4. SmoothQuant: smooth activation outliers by shifting scale to weights.
    5. OmniQuant: jointly calibrate across layers.
  • Why it matters: Methods that tame outliers or calibrate well tend to be more stable at 4–8 bits.

🍞 Anchor: For delicate glass, bubble wrap (smoothing/calibration) beats plain newspaper (naive rounding).

🍞 Hook: A scoreboard tells you who won—not just how pretty the game looked.

🥬 The Concept: Success Rate is the fraction of episodes where the planner reaches the goal.

  • How it works:
    1. Run many trials.
    2. Mark success if the final state matches the goal criteria.
    3. Average over trials for each setting.
  • Why it matters: It captures whether planning truly works, not just whether predictions look okay for a few steps.

🍞 Anchor: It’s like counting how often your paper airplane actually hits the target, not just how nice the flight path looks.

🍞 Hook: If your metal detector beeps less even when you’re on top of a coin, something’s wrong.

🥬 The Concept: Planning Loss Alignment means the loss you minimize during planning should go down when you move closer to success.

  • How it works:
    1. Define a latent-space distance to the goal.
    2. Optimize action sequences to shrink that distance.
    3. Check if lower loss reliably means higher success.
  • Why it matters: Extreme quantization can break this link; then extra optimization doesn’t help.

🍞 Anchor: If your compass points the wrong way, walking faster doesn’t get you to the treasure.
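
One simple way to monitor this alignment is to log each episode's final planning loss alongside its success flag and check their correlation. The data below is randomly generated purely to show the bookkeeping; it is not from the paper.

```python
import numpy as np

# Hypothetical logged data: (final planning loss, success flag) per episode.
rng = np.random.default_rng(1)
episodes = [(rng.random(), bool(rng.random() < 0.5)) for _ in range(100)]

losses = np.array([loss for loss, _ in episodes])
successes = np.array([float(ok) for _, ok in episodes])

# Healthy alignment shows up as a clearly negative correlation
# (lower planning loss -> more success); near zero or positive is a warning sign.
corr = np.corrcoef(losses, successes)[0, 1]
print(f"loss-success correlation: {corr:+.2f}")
```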

04Experiments & Results

The test: Measure planning success rates across horizons (0–50 steps on Wall, 0–30 on PushT) under different PTQ methods, bit-widths, and granularities. Also inspect qualitative rollouts and how planning loss changes with iterations.

The competition: Compare RTN, OMSE, AWQ, SmoothQuant, and OmniQuant against full precision (FP32), under weight-only and joint weight-activation quantization, with per-channel vs. per-group weights and per-tensor vs. per-token activations.

The scoreboard (with context):

  • 8-bit weights: Nearly FP32 performance across horizons (an A when FP32 is A+). This means moderate compression is usually safe for planning.
  • 4-bit weights (per-channel): Clear drops at short horizons; Wall partially recovers with more planning. On PushT, recovery is limited.
  • 4-bit weights (group-wise, G=128): Big stability gains on Wall—success can climb from poor at 0 iterations to near FP32 by 50 iterations (like turning a shaky start into a near-perfect finish). PushT improves but not to FP32.
  • 3-bit weights: Performance collapses (close to 0), even with grouping. That’s like trying to paint with only three colors—fine details vanish.
  • Joint quantization (e.g., W6A6, W4A8, W4A4): W6A6 is often usable; W4A8 can be borderline; W4A4 usually fails on Wall and PushT (planning loss may stop decreasing).
  • Activation granularity: Per-token doesn’t reliably beat per-tensor. In low-bit or long-horizon cases, per-token can add instability; per-tensor often remains steadier.
  • Module sensitivity: Encoder is the make-or-break part. Low-bit encoder wrecks performance quickly and cannot be rescued by longer planning. Predictor at moderate low-bits is more tolerable—longer horizons sometimes help recover.

Surprising findings:

  • Group-wise weight quantization acts like a stabilizer at 4-bit but offers little help at 3-bit—once precision is too low, structure is lost.
  • On PushT, images can still look okay while success drops. Failures come from trajectory geometry drifting off the narrow success corridor, not just from visual collapse.
  • In extreme low-bits, the planning objective can misalign with success. Loss curves flatten or rise over iterations, so more optimization hurts or doesn’t help.

Concrete examples from the tables:

  • Wall, weight-only, 4-bit with OmniQuant and G=128: success rises from ~0.20 (0 iters) to ~0.94 (50 iters), nearly matching FP32 (0.94). That’s like turning a C- into an A after more planning steps because grouping stabilized the model.
  • PushT, weight-only, 4-bit per-channel: success hovers low (~0.08–0.10), indicating limited recoverability; with G=128, it improves (e.g., RTN W4 to ~0.44 at 30 steps) but still far from FP32 (0.94)—a partial comeback.
  • Joint W4A4 on Wall: success remains near zero even with many iterations; the loss curves (Figure 3) often refuse to go down—evidence of broken alignment.

Takeaway: 8-bit is safe, 4-bit needs group-wise care (and even then, task-dependent), and 3-bit is generally too harsh. Protect the encoder’s precision, and don’t assume per-token activations will help long rollouts.

05Discussion & Limitations

Limitations:

  • Extreme compression (e.g., W3 or W4A4) can ruin the link between planning loss and true success, making extra optimization useless.
  • Results are centered on DINO-WM and two environments (Wall, PushT). Patterns should generalize to similar world models, but exact numbers may differ elsewhere.
  • Calibration used short trajectories (horizon=2). Different calibration strategies might further help or hurt stability.
  • The study focuses on PTQ; quantization-aware training (QAT) or hybrid-precision designs might improve low-bit viability but were not explored here.

Required resources:

  • A pretrained world model (e.g., DINO-WM) and modest calibration data.
  • A quantization toolkit implementing RTN, OMSE, AWQ, SmoothQuant, OmniQuant.
  • Compute to run long-horizon planning (up to 50 steps) for evaluation and to render qualitative rollouts.

When NOT to use these settings:

  • Don’t use 3-bit or W4A4 for long-horizon planning—they typically break rollout dynamics.
  • Avoid low-bit encoder quantization on tasks requiring fine-grained visual distinctions; performance often collapses.
  • Be cautious with per-token activation quantization for long rollouts—it may add instability despite looking better on short snippets.

Open questions:

  • Can we design planning-aware quantization that explicitly controls error growth across steps (e.g., step-wise error budgets)?
  • Would quantization-aware training targeted at the encoder preserve representation quality at 4-bit (or even 3-bit)?
  • Can hybrid precision (e.g., 8-bit encoder, 4–6-bit predictor) consistently match FP32 with strong speed/memory gains?
  • Are there better calibration schemes (longer horizons, task-tailored samples, curriculum) that improve alignment?
  • How do these findings translate to other world-model families (e.g., video-world simulators, multimodal models) and real robots with latency constraints?

06Conclusion & Future Work

Three-sentence summary: This paper shows that compressing world models with post-training quantization is not just about accuracy vs. size—because planning rolls out many steps, quantization reshapes rollout dynamics and can break the planning objective. Group-wise weight quantization stabilizes 4-bit settings, activation granularity gives mixed benefits, and the encoder is far more sensitive than the predictor. At very low bits, success can collapse and more planning doesn’t help.

Main achievement: A thorough, long-horizon, planning-centric map of where quantization is safe, where it’s shaky, and where it breaks—plus clear guidance: protect encoder precision, prefer group-wise weights at 4-bit, and don’t overtrust per-token activations.

Future directions: Explore planning-aware quantization and calibration, encoder-focused QAT, hybrid-precision designs, and evaluation on more diverse tasks and real robots. Develop metrics that directly capture loss–success alignment to diagnose misalignment early.

Why remember this: For world models, the big danger isn’t a tiny accuracy drop—it’s error snowballing across steps. This study explains when that happens and how to avoid it, so compressed planners can stay both fast and trustworthy.

Practical Applications

  • Deploy DINO-WM-like planners on edge robots using W8 or W6A6 to balance speed and reliability.
  • Use group-wise weight quantization (e.g., G=128) when targeting 4-bit weights to stabilize rollouts.
  • Keep the encoder at higher precision (e.g., 8-bit) and push the predictor lower if needed for extra savings.
  • Prefer per-tensor activation scaling for long-horizon planning unless careful tests prove per-token stability.
  • Avoid 3-bit weights and W4A4 for planning tasks; they often break optimization alignment.
  • Calibrate with short, representative trajectories and separate seeds to pick robust scales.
  • Monitor loss–success alignment during validation; if the loss stops correlating with success, raise precision or change granularity.
  • Run module-wise ablations (encoder vs. predictor) to find the safest precision budget for your task.
  • Use qualitative rollouts to diagnose representation collapse (Wall-like) vs. geometric misalignment (PushT-like).
  • Adopt hybrid precision (8-bit encoder, 4–6-bit predictor, per-tensor activations) as a strong default for planning on-device; a configuration sketch follows this list.
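
A minimal sketch of that hybrid-precision default, expressed as a per-module configuration dict; the module names and schema are illustrative, not a real toolkit API.

```python
# Per-module precision budget, expressed as a plain configuration dict.
hybrid_precision_config = {
    "encoder":   {"weight_bits": 8, "act_bits": 8, "act_granularity": "per_tensor"},
    "predictor": {"weight_bits": 4, "act_bits": 8, "weight_group_size": 128,
                  "act_granularity": "per_tensor"},
}

def precision_for(module_name, config=hybrid_precision_config):
    """Look up the bit budget for a module, defaulting to safe 8-bit settings."""
    cfg = config.get(module_name, {"weight_bits": 8, "act_bits": 8})
    return cfg["weight_bits"], cfg["act_bits"]

print("encoder:", precision_for("encoder"))
print("predictor:", precision_for("predictor"))
```
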
#world models  #post-training quantization  #DINO-WM  #weight quantization  #activation quantization  #quantization granularity  #group-wise quantization  #per-tensor vs per-token  #encoder sensitivity  #predictor robustness  #planning horizon  #long-horizon rollouts  #SmoothQuant  #AWQ  #OmniQuant