Spherical Leech Quantization for Visual Tokenization and Generation
Key Summary
- • This paper shows a simple, math-guided way to turn image pieces into tidy symbols (tokens) using points spread evenly on a sphere.
- • It unifies many older methods by viewing them all as special cases of arranging points on a geometric grid called a lattice.
- • The key idea is to choose the most evenly spaced points possible (from the famous 24-D Leech lattice) so the model doesn't need extra fixing losses.
- • This Spherical Leech Quantization (Λ-SQ) uses a big codebook (~200K tokens) while staying memory- and speed-friendly because the codes are fixed.
- • Autoencoders trained with Λ-SQ reconstruct images better than the best prior method (BSQ) while using slightly fewer bits.
- • An image generator trained on these tokens reaches FID 1.82 on ImageNet-1k, very close to the "oracle" 1.78, without special training tricks.
- • The method simplifies training (no entropy or commitment losses) yet improves both fidelity and diversity.
- • It explains, with geometry, why previous methods needed extra losses and how even spacing on a sphere avoids code collapse.
- • The approach suggests that, like language models with big vocabularies, visual models also benefit from much larger, well-designed visual vocabularies.
Why This Research Matters
Better pictures, fewer headaches: Λ-SQ makes image compression and generation both simpler and higher quality by letting geometry do the balancing. It cuts out fragile, extra losses while scaling visual vocabularies to sizes that match modern language models, enabling richer, more diverse image generation. This means sharper photo storage, faster editing tools, and more realistic AI art without complex training recipes. The fixed, symmetric codebook is friendly to memory and speed, making large-scale models more practical. As visual models grow, having a principled, well-spaced token set becomes a key ingredient, much like big, clean vocabularies helped language models leap forward.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine organizing a giant box of LEGO bricks. If you sort them into clear, evenly sized bins, it's quick to find the piece you need. But if your bins are messy or uneven, you waste time hunting and may even lose pieces.
🥬 Filling (The Actual Concept) – Visual Tokenization:
- What it is: Turning an image into a sequence of discrete tokens (symbols) that computers can process like words.
- How it works: (1) An encoder compresses the image into small feature patches; (2) a quantizer maps each feature to a code in a codebook; (3) a decoder rebuilds the image from those codes; (4) a generator can model sequences of codes to create new images.
- Why it matters: Without tokenization, image models are slower, use more memory, and learn less structure, making both reconstruction and generation harder.
🍞 Bottom Bread (Anchor): Like turning a picture into a puzzle of numbered pieces, where each number is a token the model can store and reuse to rebuild or imagine new scenes.
🍞 Top Bread (Hook): You know how a dictionary maps words to definitions? In AI, we also keep a "dictionary" of code vectors to represent data chunks.
🥬 Filling – Vector Quantization (VQ):
- What it is: A method that assigns each data vector to the nearest code in a fixed dictionary (codebook).
- How it works: (1) Build or learn a codebook of vectors; (2) for each input vector, find the nearest code; (3) store the code's index instead of the raw vector; (4) reconstruct using the codebook.
- Why it matters: Without VQ, storage and computation blow up; with it, we compress and standardize representations.
🍞 Bottom Bread (Anchor): Like rounding any paint color to the closest swatch in a paint store's palette.
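To make the lookup concrete, here is a minimal sketch of the nearest-code assignment that VQ performs. It is illustrative only: the random codebook, the sizes, and the `vq_assign` helper are placeholders, not the paper's implementation.

```python
import numpy as np

def vq_assign(features, codebook):
    """Assign each feature vector to its nearest codebook entry (Euclidean distance)."""
    # features: (N, d), codebook: (K, d)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    idx = d2.argmin(axis=1)            # index of the nearest code for each feature
    return idx, codebook[idx]          # token indices and their quantized vectors

# Illustrative usage with a random (not learned) codebook
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 8))   # K = 256 codes in d = 8 dimensions
features = rng.normal(size=(5, 8))
indices, quantized = vq_assign(features, codebook)
print(indices)                          # five indices in [0, 256); only these get stored
```

Classic VQ-VAE-style methods learn the codebook; the non-parametric methods below keep it fixed and only change how its points are arranged.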
🍞 Top Bread (Hook): If your bins are too few or clustered, some bins overflow while others are empty, which is wasteful and frustrating.
🥬 Filling – Non-Parametric Quantization (NPQ):
- What it is: Quantization with a fixed, hand-designed codebook shape, not learned by parameters (e.g., LFQ, BSQ, FSQ).
- How it works: (1) Predefine a set of code vectors with a pattern; (2) map inputs to their nearest fixed codes; (3) train only the encoder/decoder; (4) optionally add regularizers to encourage using all codes.
- Why it matters: Without good structure, NPQ can "collapse": the model uses only a few codes, wasting capacity.
🍞 Bottom Bread (Anchor): Think of a checkerboard (fixed pattern). If you only place checkers on two squares, you ignore most of the board's power.
🍞 Top Bread (Hook): When you seat students in class, you try to spread them out so everyone has space and the teacher can see them, with no crowding.
🥬 Filling – Entropy Regularization:
- What it is: An extra training term that pushes the model to use many codes evenly and make confident assignments.
- How it works: (1) Encourage each input to pick a clear nearest code; (2) encourage all codes to be used equally across the dataset; (3) balance these with weights.
- Why it matters: Without it, some NPQ methods (like BSQ) clump into a few codes, hurting quality.
🍞 Bottom Bread (Anchor): Like making a rule that each lunch table should have a similar number of students so no one table is overcrowded.
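For readers who want to see what this extra term looks like, below is a hedged sketch of one common form of the entropy regularizer used in LFQ/BSQ-style training; exact formulations and weights vary by method, and Λ-SQ's point is precisely that this term can be dropped.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-9):
    return -(p * np.log(p + eps)).sum(axis=axis)

def entropy_regularizer(logits):
    """One common form of the entropy penalty: make each assignment confident
    (low per-sample entropy) while keeping overall code usage spread out
    (high entropy of the batch-averaged assignment distribution)."""
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)        # soft assignment over codes, shape (N, K)
    per_sample = entropy(p).mean()            # want LOW: each input picks a clear code
    batch_usage = entropy(p.mean(axis=0))     # want HIGH: codes are used evenly overall
    return per_sample - batch_usage           # added to the training loss with a small weight

rng = np.random.default_rng(0)
print(entropy_regularizer(rng.normal(size=(32, 16))))   # toy batch of 32 inputs, 16 codes
```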
🍞 Top Bread (Hook): Picture a grid of city streets. House addresses fall on neat, repeating blocks.
🥬 Filling – Lattice Codes:
- What it is: A repeating geometric grid of points in space used as possible codes.
- How it works: (1) Choose a generator that builds a regular point pattern; (2) constrain points (like on a sphere or inside bounds); (3) quantize by picking the nearest lattice point; (4) profit from symmetry and structure.
- Why it matters: Without a principled pattern, code placement is ad hoc and needs extra fixes; with lattices, spacing becomes predictable and even.
🍞 Bottom Bread (Anchor): Like snapping LEGO pieces onto a pegboard where holes line up perfectly.
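To see "snap to the nearest lattice point" in code, here is a hedged sketch using two easy lattices: the integer lattice Z^n (just round) and D_n (integer points with an even coordinate sum), following the classic rounding rule from Conway and Sloane. These simple lattices only stand in for the much richer Leech lattice, whose nearest-point decoder works on the same principle.

```python
import numpy as np

def quantize_Zn(x):
    """Nearest point of the integer lattice Z^n: round each coordinate."""
    return np.round(x)

def quantize_Dn(x):
    """Nearest point of D_n (integers with even sum): round, and if the sum is odd,
    re-round the coordinate with the largest rounding error the other way."""
    r = np.round(x)
    if int(r.sum()) % 2 != 0:
        i = int(np.argmax(np.abs(x - r)))           # worst-rounded coordinate
        r[i] += 1.0 if x[i] - r[i] >= 0 else -1.0   # push it to the next integer instead
    return r

x = np.array([0.6, 1.2, -0.4, 2.9])
print(quantize_Zn(x))    # [ 1.  1. -0.  3.]  -> coordinate sum 5 (odd), so not in D_4
print(quantize_Dn(x))    # one coordinate is re-rounded so the sum becomes even
```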
🍞 Top Bread (Hook): Packing oranges into a box: hexagons beat squares because they waste less space.
🥬 Filling – Spherical Lattices and Densest Sphere Packing:
- What it is: Arranging points evenly on a sphere so the closest two are as far apart as possible.
- How it works: (1) Normalize vectors to the sphere; (2) choose a point set with maximal minimum distance; (3) use nearest-neighbor assignment on the sphere; (4) enjoy balanced usage without extra entropy losses.
- Why it matters: Without even spacing, many points are redundant while others are too close, causing confusion and collapse.
🍞 Bottom Bread (Anchor): Like placing flags on a globe so every flag has plenty of breathing room from its neighbors.
🍞 Top Bread (Hook): Imagine the most perfect, symmetric way to place 196,560 pins on a 24-dimensional ball so none crowd each other.
🥬 Filling – The Leech Lattice:
- What it is: A legendary 24-dimensional lattice with extraordinary symmetry and spacing; its first shell yields 196,560 evenly spread points on the sphere.
- How it works: (1) Start from the Leech lattice; (2) take the shortest nonzero vectors (first shell); (3) normalize them to unit length; (4) use them as your codebook.
- Why it matters: Without this symmetry, you need extra regularizers; with it, the geometry itself prevents collapse.
🍞 Bottom Bread (Anchor): Like the most efficient locker arrangement ever invented, where every locker is equally easy to reach.
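For the curious, the 196,560 first-shell vectors have a well-known breakdown by coordinate shape (this is the standard Conway-Sloane description from the lattice literature, not a detail specific to this paper), and the codebook size translates directly into bits per token:

```latex
\begin{aligned}
(\pm 4^{2},\,0^{22}) &: \; 2^{2}\binom{24}{2} = 1{,}104 \\
(\pm 2^{8},\,0^{16}) &: \; 759 \cdot 2^{7} = 97{,}152 \\
(\mp 3,\,\pm 1^{23}) &: \; 24 \cdot 2^{12} = 98{,}304 \\
\text{total} &: \; 1{,}104 + 97{,}152 + 98{,}304 = 196{,}560,
\qquad \log_{2} 196{,}560 \approx 17.58 \ \text{bits per token.}
\end{aligned}
```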
🍞 Top Bread (Hook): If you give artists a bigger, well-organized color palette, they can paint richer pictures without getting lost.
🥬 Filling – Spherical Leech Quantization (Λ-SQ):
- What it is: A quantizer that normalizes features onto a sphere and snaps them to the nearest Leech-lattice code.
- How it works: (1) Encoder makes 24-D features; (2) normalize to unit length; (3) pick the nearest of 196,560 Leech codes; (4) store its index; (5) decoder reconstructs; (6) AR models learn to predict these indices.
- Why it matters: Without Λ-SQ, you either need extra losses (complexity) or accept worse quality; with Λ-SQ, training is simpler and results are better.
🍞 Bottom Bread (Anchor): Like rounding any direction on a compass to the nearest of 196,560 perfectly spaced compass points, then using that symbol to rebuild or generate images.
The world before: Visual tokenization worked, but visual vocabularies were tiny compared with language models. Older NPQ methods (LFQ/BSQ/FSQ) scaled codebooks but needed hand-tuned entropy terms or heuristics. This made training fussy and limited performance.
The problem: How to scale visual codebooks massively, avoid code collapse, and keep training simple. Failed attempts: Ad hoc fixes (entropy penalties, commitment losses, subgrouping) made pipelines complex. The gap: a principled geometry that directly designs well-spaced codes so the model naturally uses them all.
Real stakes: Better compression saves storage and bandwidth; better reconstruction and generation improve photos, design tools, and creative apps. Simpler training reduces engineering time and makes large-vocabulary vision models more accessible, just as larger vocabularies boosted language models.
02 Core Idea
🍞 Top Bread (Hook): Think of laying out picnic blankets in a park. If you pick the most even pattern, everyone has space and no one needs a lifeguard telling them where to sit.
🥬 Filling – The "Aha!" in one sentence:
- Choose the most evenly spaced points on a sphere (from the Leech lattice) as your fixed codebook, so autoencoders and generators learn better tokens without extra entropy losses.
Multiple analogies:
- Parking lot: Use a parking grid that maximizes space between cars. Drivers (features) easily snap to the nearest spot, and the lot naturally stays balanced.
- Crayon box: A crayon set where colors are optimally spaced (no two are indistinguishably close), so artists always pick distinct shades without crowding.
- Globe cities: Place cities on Earth so the minimum distance between any two is as large as possible; planes (models) can plan routes (assignments) with less confusion.
Before vs After:
- Before: NPQ variants relied on entropy regularizers to keep codes balanced; training felt like juggling. Codebooks were smaller, and generation often needed special tricks.
- After: Λ-SQ uses a 24-D spherical code with a huge, evenly spaced codebook (~200K). Training needs only the classic trio (L1/MAE, GAN, LPIPS); no entropy/commitment losses. Reconstruction improves and large-vocab AR generation becomes stable and strong.
Why it works (intuition, no equations):
- Minimum distance is king. If the closest two codes are far apart, each input feature lands decisively on a single code. That clear "winner" means confident, diverse code usage without extra incentives.
- Symmetry cancels bias. The Leech lattice's extreme symmetry distributes codes uniformly, like a fair game with no loaded dice, so no single code hogs attention.
- Fixed, structured codes stabilize learning. The encoder/decoder learn to aim at a stable target; gradients don't have to chase a moving codebook, and memory stays low.
Building blocks (each introduced with a Sandwich):
🍞 Hook: Lining up pegs on a perfect board makes hanging tools a breeze. 🥬 Lattice Codes – What/How/Why: Regular point grids; build with a generator; quantize by nearest neighbor; avoid messy, uneven code placement. 🍞 Anchor: Pegboard with evenly spaced holes.
🍞 Hook: Spreading picnic blankets evenly leaves no awkward gaps. 🥬 Spherical Codes – What/How/Why: Points on a sphere with a big minimum angle; normalize, then snap; prevent code crowding. 🍞 Anchor: Flags on a globe spaced to avoid bumping.
🍞 Hook: If two lockers are close, you grab the wrong one; if far, you pick correctly. 🥬 Minimum Distance (δ_min) – What/How/Why: The smallest pairwise gap between codes; maximize it for clear decisions; otherwise confusion rises. 🍞 Anchor: Farther-apart parking spots reduce fender-benders.
🍞 Hook: The best tessellation packs oranges tight with no bruising. 🥬 Densest Packing – What/How/Why: Choose arrangements that maximize packing density; on spheres, that yields even coverage; prevents collapse. 🍞 Anchor: Hexagons beat squares for oranges.
🍞 Hook: A master keyring holds many keys but stays organized. 🥬 Leech Lattice – What/How/Why: A 24-D super-symmetric lattice with 196,560 shortest vectors; normalize them; get an ultra-even spherical codebook. 🍞 Anchor: The ultimate locker map, where every locker is equally reachable.
Together, these bricks form Λ-SQ: normalize to the sphere, quantize with Leech codes, drop entropy losses, scale the vocabulary, and win on both simplicity and performance.
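Since δ_min does so much of the work in this argument, here is a small sketch of how one could measure it for any spherical codebook: the smallest Euclidean distance between two distinct unit-norm codes, computed in chunks so a ~200K-entry codebook fits in memory. The random codebook below is only a stand-in for comparison.

```python
import numpy as np

def min_pairwise_distance(codes, chunk=1024):
    """Smallest Euclidean distance between any two distinct unit-norm codes.
    For unit vectors, dist^2 = 2 - 2*cos(angle), so the closest pair is the one
    with the highest cosine similarity."""
    codes = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    best = np.inf
    for start in range(0, len(codes), chunk):
        sims = codes[start:start + chunk] @ codes.T               # cosine similarities vs. all codes
        np.fill_diagonal(sims[:, start:start + chunk], -np.inf)   # exclude each code vs. itself
        best = min(best, float(np.sqrt(max(2.0 - 2.0 * sims.max(), 0.0))))
    return best

# Stand-in: random unit vectors in 24-D (a genuinely even code such as the Leech first shell scores higher)
rng = np.random.default_rng(0)
print(min_pairwise_distance(rng.normal(size=(4096, 24))))
```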
03 Methodology
At a high level: Image → Encoder → Normalize to unit sphere → Nearest Leech code (index) → Decoder (reconstruct) or AR Transformer (predict next index) → Output (image or next token).
Step-by-step (with a Sandwich for new ideas):
- Encoder: compress the image into features.
- What happens: A CNN/ViT turns an H×W image into a grid of d-dimensional vectors (here d=24 for the Λ-SQ used in the paper).
- Why this step exists: Without compression, there are too many pixels; learning would be slow and noisy.
- Example: A 256×256 image becomes, say, a 32×32 grid of 24-D vectors.
- Normalize to the unit sphere.
- What happens: Each 24-D vector is divided by its length so it lies on the sphere S^23.
- Why: Λ-SQ expects points on the sphere because its codebook lives there; without normalization, distances and directions are inconsistent.
- Example: A feature [2, 0, …, 0] becomes [1, 0, …, 0] after normalization.
🍞 Hook: Picking the nearest store from your house is faster when stores are well-spaced. 🥬 Nearest-Neighbor Quantization with the Leech codebook – What/How/Why:
- What happens: We compare the normalized vector with all 196,560 Leech code vectors and choose the nearest.
- Why: Snapping to the closest well-spaced point gives a clean, discrete token; without even spacing, nearest picks become ambiguous. 🍞 Anchor: Like choosing the closest bus stop when stops are laid out evenly across the city. (A hedged code sketch of this snapping step appears after the mini-example at the end of this section.)
- Store/transmit the index; optionally compress indices with arithmetic coding later.
- Why: Indices are compact. A 196,560-way index is about 17.58 bits; with entropy coding it can get even smaller.
- Example: Vector #73,421 is saved instead of 24 floats.
- Decoder: turn indices back into an image.
- What happens: The decoder takes the codes (or their embeddings) and reconstructs pixels, trained with a simple trio of losses.
🍞 Hook: Scoring homework needs clear rubrics. 🥬 The Loss Trio (MAE/L1, GAN, LPIPS) – What/How/Why:
- What it is: Three common losses: pixel accuracy (L1), realism (GAN), and perceptual similarity (LPIPS).
- How it works: (1) L1 improves PSNR; (2) GAN pushes textures to look natural; (3) LPIPS aligns with human perception.
- Why it matters: Without these, images may be blurry or fake-looking. Crucially, we do NOT need entropy or commitment losses here. 🍞 Anchor: Like grading by correctness, style, and clarity without needing extra penalties to keep students from all picking the same answer.
- Autoregressive (AR) generation with a huge codebook.
- What happens: A Transformer predicts the next code index given the previous ones (and a class label). Two heads are explored: (a) a single 196,560-way softmax; (b) a factorized d-itwise head that makes a 9-class prediction for each of the 24 dimensions.
- Why: With Λ-SQ, we can model richly varied code sequences, boosting diversity and recall.
- Example: Predicting the sequence of tokens for a tiger image so the decoder can draw one.
🍞 Hook: If you must choose from a huge menu, your calculator should be stable and fit in memory. 🥬 Memory-Stable Training Tricks – What/How/Why:
- Cut Cross-Entropy (CCE): Compute the loss more memory-efficiently when the vocabulary is huge. Without it, memory can blow up.
- Z-loss: Adds a tiny penalty to keep logits from exploding. Without it, training can become numerically unstable.
- Dion optimizer: Orthonormalized updates for deep layers; helps tame exploding gradients with large vocabularies. 🍞 Anchor: Like using a budget spreadsheet (CCE), a surge protector (Z-loss), and a seatbelt (Dion) so the ride stays smooth.
- Sampling (making images):
- What happens: Use classifier-free guidance (CFG) to steer toward the class; apply top-p/top-k to control randomness; optionally scale CFG and top-k by layer.
- Why: Without guidance and calibrated randomness, samples may be off-class or repetitive.
- Example: For "monarch butterfly," CFG sharpens butterfly features; top-p/top-k keep variety but avoid nonsense.
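To ground the sampling step, here is a hedged sketch of classifier-free guidance plus top-k filtering over next-token logits. The guidance scale and k below are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def cfg_topk_sample(logits_cond, logits_uncond, scale=4.0, k=600, rng=None):
    """Blend conditional and unconditional logits (classifier-free guidance),
    keep only the top-k candidates, renormalize, and sample the next token."""
    rng = rng or np.random.default_rng()
    logits = logits_uncond + scale * (logits_cond - logits_uncond)   # CFG: push toward the class
    kth = np.partition(logits, -k)[-k]                               # value of the k-th largest logit
    logits = np.where(logits >= kth, logits, -np.inf)                # mask everything below it
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))                              # index of the next visual token

# Illustrative call with a fake 196,560-way vocabulary
V = 196_560
rng = np.random.default_rng(0)
next_token = cfg_topk_sample(rng.normal(size=V), rng.normal(size=V), rng=rng)
```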
Secret sauce:
- Geometry replaces heuristics. The Leech lattice's even spacing (high minimum distance) spreads codes uniformly, removing the need for entropy regularizers in training. Fixed codes also cut memory and speed costs, because no codebook gradients are needed. The result is a simpler recipe that scales to ~200K tokens while improving both reconstruction and generation.
Concrete mini-example:
- Suppose a feature is pointing "northeast" in 24-D. Λ-SQ normalizes it, measures angles to all Leech codes, and snaps to the closest one, say code #105,992. The decoder learns how to turn #105,992 into a crisp patch. During AR generation, the model learns when to place #105,992 next to, say, #33,801 to shape a tiger stripe, and CFG ensures the stripes look like the specified class.
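Below is the promised sketch of the quantizer itself: normalize 24-D features onto the unit sphere, then snap each one to the highest-cosine entry of a fixed spherical codebook, processed in chunks so the 196,560-way comparison stays memory-friendly. The random codebook here is a placeholder; the real Λ-SQ codebook is the set of normalized Leech first-shell vectors, and each resulting index costs about log2(196,560) ≈ 17.58 bits before entropy coding.

```python
import numpy as np

def spherical_quantize(features, codebook, chunk=256):
    """Snap unit-normalized features to the nearest code on the sphere.
    For unit vectors, the nearest code in Euclidean distance is also the one with
    the highest cosine similarity, so each chunk needs a single matrix product."""
    z = features / np.linalg.norm(features, axis=-1, keepdims=True)   # project onto S^(d-1)
    z = z.astype(np.float32)
    indices = np.empty(len(z), dtype=np.int64)
    for start in range(0, len(z), chunk):                             # chunking bounds the (chunk x K) similarity matrix
        sims = z[start:start + chunk] @ codebook.T                    # cosine similarities against all codes
        indices[start:start + chunk] = sims.argmax(axis=1)
    return indices                                                    # token ids; dequantize with codebook[indices]

# Placeholder codebook of random unit vectors, standing in for the 196,560
# normalized first-shell vectors of the Leech lattice.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(196_560, 24)).astype(np.float32)
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

features = rng.normal(size=(32 * 32, 24))     # e.g., one 32x32 grid of 24-D patch features
tokens = spherical_quantize(features, codebook)
dequantized = codebook[tokens]                # what the decoder (or the AR model's embedding) consumes
```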
04 Experiments & Results
The test: Measure whether Λ-SQ improves reconstruction, compression, and generation.
- Reconstruction: PSNR (clarity), SSIM/MS-SSIM (structure), LPIPS (perceptual), rFID (distribution match on reconstructions).
- Compression: Bits per pixel (BPP) vs. quality on Kodak.
- Generation: FID/IS and improved precision/recall on ImageNet-1k.
The competition: Prior VQ, LFQ/BSQ/FSQ variants, and strong AR baselines (e.g., VAR, Infinity).
Scoreboard with context:
- Reconstruction (ViT autoencoder): Λ-SQ beats BSQ across the board on COCO and ImageNet validation. For example, on ImageNet-1k, Λ-SQ improves rFID from 1.14 (BSQ) to ~0.83 while using slightly fewer bits (~17.58 vs 18), which is like jumping from a solid B to an A+ with a shorter essay.
- Compression (Kodak): Λ-SQ achieves higher PSNR and MS-SSIM at slightly lower BPP than BSQViT and modern learned codecs, even without arithmetic coding; adding AR coding could cut the bitrate by a further ~25%.
- Generation (ImageNet-1k): With Infinity-CC + Λ-SQ, FID reaches 1.82, very close to the validation "oracle" of 1.78, and recall improves notably, meaning the model covers more of the true image variety. Think of it as drawing from a richer vocabulary so you can describe more kinds of pictures accurately.
Surprising findings:
- Random projection baselines look decent in low dimensions, but the Leech advantage grows in higher dimensions; δ_min (minimum distance) matters more as d rises.
- VF alignment (matching tokenizer latents to DINOv2 features) can slightly hurt raw reconstruction but helps generation converge faster and improves final recall, echoing results from continuous-latent work.
- Factorized d-itwise heads (24×9-way) are simpler but trade off some diversity versus the full 196,560-way head; still, Λ-SQ with a single softmax head is strong and simple.
Ablations worth noting:
- Dispersiveness → quality: Quantizers with a larger δ_min yield better rate-distortion tradeoffs. Λ-SQ has a much larger δ_min than BSQ at similar effective bitrates, aligning with its improved metrics.
- Scaling codebook size: Bigger visual vocabularies benefit larger AR models, mirroring language-model scaling laws and pushing precision-recall closer to the oracle frontier.
Bottom line: Λ-SQ simplifies training (no entropy penalties), scales vocabularies to ~200K, and improves both reconstruction and generation, landing within a whisker of the oracle on ImageNet.
05 Discussion & Limitations
Limitations:
- Very large codebooks imply heavy nearest-neighbor search. Although the codes are fixed (no gradients), efficient kernels and tiling/JIT tricks are needed to keep lookup fast.
- The method currently highlights the first shell of the 24-D Leech lattice; generalizing equally strong spherical codes to other dimensions or multi-shell mixes is open.
- Factorized d-itwise prediction is convenient but can reduce diversity; full softmax heads are stronger but costlier.
- VF alignment can slightly reduce reconstruction scores even as it helps generation; practitioners must choose based on goals.
Required resources:
- GPUs with enough memory to handle large-vocabulary logits and training batches (CCE and Z-loss help); fast nearest-neighbor kernels for code assignment; typical AE/AR training time comparable to BSQ/VQ pipelines.
When NOT to use:
- Tiny datasets where a ~200K codebook would be underutilized; ultra-low-power devices lacking compute for nearest-neighbor search; tasks where continuous latents are preferred (e.g., some diffusion backbones).
Open questions:
- Can we learn to select subsets or combine multiple shells for adaptive vocabularies that retain symmetry?
- What are the best dimensionalities beyond 24 that preserve much of the Leech advantage?
- How do Λ-SQ tokens interact with advanced entropy coding and multi-modal tokenizers?
- Can geometry-aware training further enhance AR stability without Z-loss/Dion?
- How far does the "larger model, larger visual vocabulary" trend go across datasets and resolutions?
06 Conclusion & Future Work
Three-sentence summary: This paper reframes non-parametric quantization as spherical lattice coding and chooses the most evenly spaced points, the first shell of the Leech lattice, to build a giant, fixed codebook. The resulting Spherical Leech Quantization (Λ-SQ) removes entropy penalties, improves reconstruction and compression, and powers state-of-the-art autoregressive generation with ~200K visual tokens. In short, geometry does the balancing for you, making training simpler and results stronger.
Main achievement: A principled, geometry-driven tokenizer that scales visual vocabularies to language-model levels while improving fidelity and diversity, achieving FID 1.82 on ImageNet-1k, near the oracle, with a simple loss recipe and no codebook learning.
Future directions: Explore multi-shell or adaptive Leech subsets for flexible vocabularies; extend spherical code designs to other dimensions; integrate advanced entropy coding; apply Λ-SQ to video and multi-modal tokenization; study scaling laws linking model size and visual vocabulary.
Why remember this: It shows that the right geometry can replace fragile heuristics: just as hexagons pack oranges best, Leech codes pack visual tokens best, unlocking simpler training, larger vocabularies, and better images.
Practical Applications
- • High-quality photo storage with lower bitrates for phones and cloud albums.
- • Fast, realistic image generation for creative tools (design, concept art, marketing).
- • Efficient game and AR/VR texture compression with crisp detail and low latency.
- • Medical and satellite image compression that preserves fine structures for diagnosis and analysis.
- • On-device tokenizers for privacy-preserving photo enhancement and restoration.
- • Scalable visual vocabularies for multimodal LLMs, improving vision-language understanding and generation.
- • Better visual codecs for streaming platforms to reduce bandwidth without losing fidelity.
- • Large-scale dataset tokenization for faster training of generative models.
- • Robust tokenizers for robotics and autonomous systems that need compact yet detailed visual cues.
- • Fine-grained retrieval and search using rich, evenly spaced visual tokens.