VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Key Summary
- VQRAE is a new kind of image tokenizer that lets one model both understand images (continuous features) and generate/reconstruct them (discrete tokens).
- It builds on a strong vision encoder (a Vision Foundation Model), then adds a high-dimensional vector-quantization codebook to turn meanings into tokens.
- A symmetric ViT decoder learns to turn those tokens back into pixels so the model can reconstruct images with fine details.
- Training happens in two stages: first learn to reconstruct pixels while the encoder is frozen, then gently fine-tune the encoder with a self-distillation teacher so it keeps its understanding skills.
- Unlike older VQ methods that used tiny codebooks, VQRAE trains a large, high-dimensional codebook with nearly 100% usage, which avoids collapse and preserves meaning.
- On benchmarks, it matches or beats many dual-encoder systems for understanding while keeping strong reconstruction and generation quality.
- The discrete tokens are well suited to fast autoregressive generation, and the continuous features keep the model smart for reasoning.
- This unified design simplifies systems, reduces training complexity, and opens the door to scalable multimodal models that both see and create.
- It also shows a counterintuitive result: semantic encoders need high-dimensional VQ codebooks to stay stable and useful.
- VQRAE points toward future models that tightly connect understanding, reconstruction, and generation for better performance across tasks.
Why This Research Matters
VQRAE makes it practical to build one model that both understands your photos and can create or fix images, which reduces complexity and cost. Its discrete tokens plug neatly into fast autoregressive training, while continuous features keep the model smart at reasoning. This unified approach helps assistants describe scenes, edit pictures, and follow visual instructions without switching systems. It also unlocks better synergy: what the model learns from drawing can improve how it explains, and vice versa. By stabilizing high-dimensional codebooks, VQRAE opens a path to more expressive and scalable visual generation. Over time, these ideas can extend to video, diagrams, and 3D, enabling richer multimodal applications. In short, VQRAE points to a future where "see and create" live under one reliable roof.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you have a Swiss Army knife. It's one tool that can cut, open, and fix things, without swapping to a different tool every time. Wouldn't it be great if AI had a Swiss Army knife for pictures: one tool that can understand, generate, and fix images?
The concept (The world before): Before this paper, many models used different tools for different jobs. One encoder for understanding images (like recognizing objects) and another encoder for generating images (like painting from a description). That's called the dual-encoder setup. It works, but it's bulky, harder to train, and the two parts don't talk to each other very well. Meanwhile, early unified attempts used discrete tokenizers trained only to reconstruct pixels. Those were great at tiny details (fur, grass, textures) but often forgot the big idea (what's in the picture), hurting tasks like visual question answering. Some tried to push the tokens to be more meaningful using contrastive learning, but that needed huge batches and careful balancing.
Anchor: It's like having one microscope for looking at cells and a totally different camera for taking photos: carrying both is heavy, and they don't share notes.
Hook: You know how you can take notes in two ways, summaries (ideas) or exact quotes (every word)? You pick the right style for the job.
The concept (The problem): AI needed a way to produce both kinds of notes from the same picture, continuous features (summaries of meaning) for understanding and discrete tokens (precise building blocks) for generation and reconstruction, without using two separate encoders. If we only use pixel-focused tokens, the model gets great at copying textures but can miss meaning. If we only use semantic features, generation becomes tricky for fast autoregressive models that love discrete tokens.
Anchor: It's like wanting both the short summary and the exact recipe from the same cooking show, without rewatching it twice with different tools.
Hook: Imagine sorting a big box of LEGO pieces. If you sort them by color only, building detailed models is hard. If you sort them by shape only, recognizing themes is hard.
The concept (Failed attempts): Dual encoders sorted in two separate ways, great detail vs. great meaning, but the two piles didn't mix. Other unified methods tried to force a single pile to serve both needs via contrastive losses or complex designs. These approaches often required massive compute, had unstable training, and still didn't match the best understanding-only systems.
Anchor: You keep ending up with two LEGO bins and a messy desk.
Hook: You know how a good translator keeps the meaning while choosing the right words? That's tricky but important.
The concept (The gap): We needed one tokenizer that could translate rich visual meaning into two forms: (1) continuous features for understanding and (2) discrete tokens for generation/reconstruction, cleanly, reliably, and efficiently, without losing semantics. And we needed a way to make the discrete part large and expressive without collapsing.
Anchor: One translator who can write a thoughtful summary and also a word-for-word script.
Hook: Think of school projects: sometimes you explain the idea, other times you draw the diagram. You want both to match.
The concept (Real stakes): In daily life, multimodal assistants must understand your photos (What fruit is this? Is my bike chain off?) and also generate or fix images (Make me a poster! Clean up this old scan!). Using two separate systems is slower, costlier, and less reliable. A single, scalable tokenizer unlocks faster training, simpler deployment, and better synergy between tasks.
Anchor: A phone app that can both describe your picture and create a matching sticker, without switching modes or models.
Now we introduce each key building block using the Sandwich pattern, in the order you'd learn them best.
- Hook: You know how a librarian makes a short summary card for each book to help you find it quickly? Vector Quantization (VQ): What it is: VQ turns continuous information into a small set of codewords (discrete tokens) chosen from a learned codebook. How it works: (1) Learn a set of representative vectors (the codebook). (2) For each feature chunk, pick the closest codeword. (3) Store/send the codeword index. Why it matters: Discrete tokens are great for fast, scalable autoregressive generation. Without VQ, next-token models can't easily handle images. Anchor: Like tagging each book with its closest matching summary card so you can stack, count, and predict the next card. (A minimal code sketch of this lookup follows this list.)
- Hook: Imagine compressing a song into a tiny file and then playing it back with good quality. Autoencoders: What it is: A model that learns to squeeze data into a compact form and then reconstruct it. How it works: (1) Encoder compresses. (2) Bottleneck stores key info. (3) Decoder rebuilds the original. Why it matters: It teaches the model what details really matter to recreate an image. Anchor: MP3 for pictures: smaller file, good playback.
- Hook: You know how an expert tour guide focuses on meaning: what's important to understand the place? Representation Autoencoders (RAE): What it is: An autoencoder that uses a strong pretrained vision encoder for rich, semantic features and a decoder trained to reconstruct pixels. How it works: (1) Freeze a powerful encoder trained on image-text. (2) Train a decoder to turn semantic features back into images. (3) Optionally fine-tune carefully. Why it matters: You get a structured, meaningful space that's easier for generators to learn. Anchor: A guidebook that's already well-written, paired with an artist who redraws the scene.
- Hook: Think of a very smart camera that already knows lots of objects. Vision Foundation Models (VFMs): What it is: Pretrained vision encoders (often ViTs) that produce strong semantic features. How it works: (1) Train on huge image-text pairs. (2) Learn general visual concepts. (3) Output feature maps that capture meaning. Why it matters: They give you a head start: great understanding from day one. Anchor: Using a camera that already recognizes cats, bikes, and street signs.
- Hook: If you fold a paper plane one way, you can unfold it the reverse way. Symmetric ViT Decoder: What it is: A ViT-style decoder mirroring the encoder to turn features or tokens back into pixels. How it works: (1) Project tokens to decoder size. (2) Pass through ViT blocks. (3) Map to RGB pixels at target resolution. Why it matters: Mirrors keep information aligned; no need for separate CNN decoders. Anchor: Unfolding a folded paper using the reverse steps to get back the flat sheet.
- Hook: When learning to swim, you first float with support, then try strokes on your own. Two-stage Training Strategy: What it is: Train in two steps to balance detail and meaning. How it works: (1) Stage 1: Freeze the encoder; learn the codebook and decoder with reconstruction. (2) Stage 2: Unfreeze the encoder a little; add a teacher to keep semantics (self-distillation); keep reconstructing for detail. Why it matters: Without Stage 1, the encoder may forget meaning; without Stage 2, images stay blurry. Anchor: First practice with floaties, then swim while the coach reminds you to keep your form.
- Hook: Sometimes you want the gist; other times you want exact pieces you can count. Continuous and Discrete Tokenization: What it is: Producing two forms from the same image: smooth features for understanding and countable tokens for generation. How it works: (1) Take encoder features (continuous). (2) Quantize them via VQ to get discrete tokens. (3) Use the right form for the right task. Why it matters: Without both, you either lose speed (no discrete) or lose meaning (no continuous). Anchor: A short movie review for meaning and subtitles for exact words.
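To make the codeword lookup concrete, here is a minimal NumPy sketch of vector quantization. The codebook size, feature dimension, and random inputs are illustrative placeholders, not VQRAE's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative codebook: K codewords, each D-dimensional.
# (VQRAE's codebook is reported to be far larger and higher-dimensional, around 16k x 1536.)
K, D = 512, 64
codebook = rng.normal(size=(K, D)).astype(np.float32)

def quantize(features):
    """Map each continuous feature vector to its nearest codeword (squared L2 distance).

    features: (N, D) array of encoder outputs.
    Returns (indices, quantized): discrete token ids and their codeword vectors.
    """
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    indices = dists.argmin(axis=1)       # the discrete tokens
    quantized = codebook[indices]        # what a decoder would actually see
    return indices, quantized

# Example: 256 patch features -> 256 discrete tokens.
patch_features = rng.normal(size=(256, D)).astype(np.float32)
tokens, quantized = quantize(patch_features)
print(tokens[:8], quantized.shape)       # first 8 token ids and (256, 64)
```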
02 Core Idea
Hook: Picture a bilingual student who can both summarize a story (ideas) and recite it word-for-word (details) using the same brain.
The concept (Aha! moment): One tokenizer, built on a pretrained vision encoder, can output continuous semantic features for understanding and, via a high-dimensional VQ codebook, discrete tokens for generation and reconstruction, trained in two careful stages so nothing important gets lost.
How it works (big picture recipe):
- Use a strong Vision Foundation Model (VFM) to extract rich, continuous features from an image.
- Pass those features into a high-dimensional VQ codebook to get discrete tokens for autoregressive generation and pixel-level reconstruction.
- Rebuild the image with a symmetric ViT decoder.
- Train in two stages: first freeze the encoder and learn to reconstruct; then unfreeze with a self-distillation teacher to keep meaning while adding detail.
Why it matters: Models no longer have to choose between brains (understanding) and hands (drawing). They keep semantics sharp and generation fast in one unified system.
Anchor: Like using the same map to give directions (summary) and to rebuild a tiny model city (detailed pieces).
Three analogies (same idea, new lenses):
- Library analogy: The VFM produces a thoughtful summary of each page (continuous), while VQ makes a set of index cards (discrete). You can quickly predict the next card to write a new chapter (generation) or read the summary to answer questions (understanding).
- Kitchen analogy: The VFM gives the flavor profile (meaning), while VQ turns it into exact ingredient packets (tokens). You can cook new dishes quickly by predicting the next packet, or explain the cuisine style by reading the flavor notes.
- Music analogy: The VFM captures the melody (semantics), while VQ stores notes on a staff (discrete). You can compose new music by predicting the next note or discuss the song's theme using the melody.
Before vs After:
- Before: Two encoders or heavy tricks were needed; discrete tokenizers trained on pixels forgot meaning; understanding-only encoders struggled to feed fast autoregressive generation.
- After: One encoder yields both continuous semantics and discrete tokens; generation remains fast and scalable; understanding stays strong; training and deployment are simpler.
Why it works (intuition, not equations):
- VFMs already know visual meaning from massive image-text data, so starting there keeps semantics.
- A high-dimensional VQ codebook matches the richness of VFM features, preventing collapse and letting discrete tokens carry meaning-rich signals.
- A symmetric ViT decoder speaks the same architectural "language" as the encoder, making reconstruction smoother.
- Two-stage training avoids fighting goals: first learn to copy well, then learn to copy well without forgetting meaning.
Building blocks (each with a job):
- Vision Foundation Model encoder: creates continuous semantic features.
- High-dimensional VQ codebook: turns those features into discrete tokens usable by autoregressive models.
- Symmetric ViT decoder: translates tokens back into pixels for reconstruction.
- Two-stage training with self-distillation: preserves understanding while improving detail.
- Disentangled outputs: continuous for understanding, discrete for generation/reconstruction.
Anchor: It's like learning a language by first listening (get meaning), then practicing writing with a clear alphabet (tokens), all while a teacher keeps you from picking up bad habits.
03 Methodology
At a high level: Image → VFM encoder (continuous features) → branch A: continuous features to understanding tasks; branch B: project + VQ to discrete tokens → symmetric ViT decoder → reconstructed image (and tokens for autoregressive generation).
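Here is a rough, self-contained sketch of that two-branch flow, assuming toy Transformer stand-ins for the VFM encoder and ViT decoder; the module names, layer counts, and pixel projection are illustrative, not the released VQRAE implementation.

```python
import torch
import torch.nn as nn

class ToyVQRAE(nn.Module):
    """Sketch of the two-branch tokenizer: continuous features for understanding,
    discrete tokens (plus a pixel reconstruction) for generation. Sizes are illustrative."""

    def __init__(self, dim=1536, codebook_size=16384, patch_pixels=16 * 16 * 3):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)   # stand-in for a pretrained VFM (ViT)
        self.codebook = nn.Embedding(codebook_size, dim)                # high-dimensional VQ codebook
        dec_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)   # stand-in for the symmetric ViT decoder
        self.to_pixels = nn.Linear(dim, patch_pixels)                   # each token -> one 16x16 RGB patch

    def forward(self, patch_embeddings):                  # (B, 256, 1536) pre-embedded image patches
        continuous = self.encoder(patch_embeddings)       # branch A: semantic features for understanding
        flat = continuous.reshape(-1, continuous.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)   # distance to every codeword
        token_ids = dists.argmin(dim=-1).view(continuous.shape[:2])     # branch B: discrete tokens
        quantized = self.codebook(token_ids)              # look the chosen codewords back up
        recon_patches = self.to_pixels(self.decoder(quantized))         # decode toward pixels
        return continuous, token_ids, recon_patches

model = ToyVQRAE()
x = torch.randn(1, 256, 1536)                  # stands in for a 256x256 image split into 16x16 patches
feats, tokens, recon = model(x)
print(feats.shape, tokens.shape, recon.shape)  # (1, 256, 1536), (1, 256), (1, 256, 768)
```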
Step 1: Use a Vision Foundation Model as a unified encoder
- What happens: A pretrained ViT-based encoder (like SigLIP2 or InternViT) turns the image into a grid of semantic features (continuous vectors). These features already capture objects, scenes, and relationships.
- Why this step exists: Starting from semantics keeps understanding strong. If we used a pixel-only encoder, we'd get detail but lose meaning.
- Example: Feed a cat photo. The encoder outputs features that cluster around "cat," "fur," "whiskers," and "sofa," not just raw colors.
Step 2: High-dimensional vector quantization (codebook)
- What happens: Project encoder features into a VQ space and pick the nearest codeword from a big, high-dimensional codebook (e.g., 16k entries, 1536-dimensional). This yields discrete tokens for each patch.
- Why this step exists: Discrete tokens are ideal for next-token prediction and efficient training on standard AI stacks. A high-dimensional codebook matches the richness of the VFM features and avoids collapse.
- Example: For a patch of tabby fur, you select a token that represents "brown-striped texture with soft edges," not just "brown pixel block."
Step 3: Symmetric ViT decoder for reconstruction
- What happens: The chosen discrete tokens are mapped back to a feature bottleneck and then passed through a ViT-style decoder that mirrors the encoder, producing an image.
- Why this step exists: Reconstruction forces tokens to preserve enough fine detail; mirroring architectures keeps information aligned. Without this, tokens might drift away from pixel fidelity.
- Example: Rebuild the cat image so stripes and whiskers look right, not blurry.
Step 4: Two-stage training (the secret to balancing goals)
- Stage 1 (freeze encoder): Optimize the VQ codebook and decoder using pixel reconstruction (L2/L1), perceptual loss (LPIPS), and optionally a small adversarial term. The encoder stays fixed so semantics remain stable while the decoder learns to map meaning to pixels.
- Why this matters: If you fine-tune the encoder too early, you risk erasing its semantic structure. If you never fine-tune, reconstructions may lack color/texture fidelity.
- Example: After Stage 1, images are recognizable but may miss some crispness or exact hues.
- Stage 2 (unfreeze gently + self-distillation): Now allow small encoder updates while a frozen teacher copy of the original encoder (self-distillation) nudges it to keep its semantic features. Continue reconstruction training so details improve.
- Why this matters: Distillation is the guardrail that says, "Get sharper, but don't forget what a cat is." Without it, understanding can degrade; with it, you get both sharpness and meaning.
- Example: After Stage 2, the cat's fur has better texture and color, and the model still answers "What animal is this?" correctly. (A rough loss sketch follows this list.)
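As a rough illustration of how the two stages differ, the sketch below assembles the loss terms described above: pixel reconstruction plus a perceptual term in Stage 1, and an extra self-distillation term against a frozen teacher in Stage 2. The loss weights and the LPIPS stand-in are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def stage1_loss(recon_pixels, target_pixels, perceptual):
    """Stage 1: encoder frozen; only the codebook and decoder learn to reconstruct."""
    pixel_term = F.l1_loss(recon_pixels, target_pixels)      # pixel-level L1 (L2 is also common)
    percep_term = perceptual(recon_pixels, target_pixels)    # LPIPS-style perceptual term
    return pixel_term + 1.0 * percep_term                    # illustrative 1.0 weight

def stage2_loss(recon_pixels, target_pixels, perceptual,
                student_feats, teacher_feats, distill_weight=1.0):
    """Stage 2: encoder gently unfrozen; a frozen teacher copy pulls its features back
    toward the original semantics (self-distillation) while reconstruction continues."""
    base = stage1_loss(recon_pixels, target_pixels, perceptual)
    distill = F.mse_loss(student_feats, teacher_feats.detach())   # guardrail on semantics
    return base + distill_weight * distill

# Smoke test with random tensors and an MSE stand-in for LPIPS.
recon = torch.rand(2, 3, 64, 64)
target = torch.rand(2, 3, 64, 64)
student = torch.rand(2, 256, 1536)
teacher = torch.rand(2, 256, 1536)      # in practice: features from a frozen copy of the original encoder
fake_lpips = lambda a, b: F.mse_loss(a, b)
print(stage1_loss(recon, target, fake_lpips).item())
print(stage2_loss(recon, target, fake_lpips, student, teacher).item())
```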
Step 5: Using the outputs for tasks
- Understanding: Use the encoder's continuous features directly with an MLLM (like Vicuna/Qwen) via a connector, with no quantization errors.
- Generation: Use the discrete tokens with an autoregressive LLM trained to predict the next token, conditioned on text (a minimal sampling sketch follows this list).
- Reconstruction: Use the decoder to map tokens to pixels, evaluating rFID/PSNR/SSIM.
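For the generation path, a hypothetical next-token sampling loop over visual tokens might look like the sketch below; `ar_model` and `vqrae_decoder` are assumed placeholder interfaces, not APIs from the paper.

```python
import torch

@torch.no_grad()
def generate_image_tokens(ar_model, prompt_ids, num_visual_tokens=256, temperature=1.0):
    """Sample visual token ids one at a time, conditioned on text token ids.
    `ar_model(seq)` is assumed to return next-token logits over the visual codebook
    for every position, shaped (batch, seq_len, codebook_size)."""
    seq = prompt_ids.clone()                                # (1, T_text) text conditioning
    visual_tokens = []
    for _ in range(num_visual_tokens):
        logits = ar_model(seq)[:, -1, :]                    # logits for the next position
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)  # sample one codebook index
        visual_tokens.append(next_tok)
        seq = torch.cat([seq, next_tok], dim=1)             # feed the token back in
    return torch.cat(visual_tokens, dim=1)                  # (1, 256) visual token ids

# Usage sketch (names are placeholders):
# tokens = generate_image_tokens(ar_model, prompt_ids)      # predict the image as tokens
# image = vqrae_decoder(tokens)                             # indices -> codewords -> pixels
```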
What breaks without each step:
- Without VFM encoder: You lose strong semantics; understanding tasks drop.
- Without high-dimensional VQ: Codebook collapses or can't capture meaning; generation weakens.
- Without symmetric decoder: Reconstruction is harder; fidelity drops.
- Without Stage 1: The encoder drifts; you forget semantics.
- Without Stage 2 + distillation: Details stay mushy or understanding degrades.
Concrete mini-walkthrough with data:
- Input: 256×256 image of a dog on grass.
- Encoder: Outputs a 16×16 grid of 1536-dimensional features (continuous).
- VQ: Each grid cell picks a codeword index from a 16k×1536 codebook.
- Tokens: A sequence of, say, 256 visual tokens represents the image.
- Decoder: Produces a reconstructed 256×256 RGB image; compute rFID/PSNR/SSIM.
- Understanding: Pass the continuous features to the MLLM to answer "What animal and where?" → "A dog on grass."
- Generation: Condition an LLM on text "a brown dog on bright green grass at sunset" and autoregressively predict the visual tokens; decode to pixels.
The secret sauce:
- High-dimensional semantic VQ: Counter to past wisdom, matching the VFM's high dimensionality stabilizes training, keeps codebook usage near 100%, and preserves semantics in discrete tokens.
- Two-stage + distillation: Cleanly separates "learn to copy" from "keep meaning while adding detail," delivering the trade-off unified models need.
- Pure ViT stack: Using ViTs on both sides avoids mixing architectures and keeps the representational language consistent.
04 Experiments & Results
The tests and why they matter:
- Reconstruction quality (ImageNet-1K 50k): Measures how faithfully tokens can rebuild images using rFID (lower is better) and PSNR/SSIM (higher is better). If this is weak, tokens aren't capturing enough detail. (A small metric example follows this list.)
- Multimodal understanding (LLaVA-style benchmarks): MME-Perception, GQA, TextVQA, MMBench, SEED, MMMU, AI2D. These probe if continuous features still carry meaning and support reasoning.
- Generation (GenEval, DPG-Bench): Check if discrete tokens work well for autoregressive generation across object alignment, attributes, relations, and counting.
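To ground the reconstruction metrics in the first bullet, here is a small example of how PSNR and SSIM are commonly computed with scikit-image (the images here are random, just to show the calls); rFID additionally compares deep-feature statistics of reconstructed vs. real images and needs a separate FID implementation.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
original = rng.random((256, 256, 3)).astype(np.float32)                  # stand-in for a real image
noise = 0.05 * rng.normal(size=original.shape)
reconstructed = np.clip(original + noise, 0.0, 1.0).astype(np.float32)   # imperfect "reconstruction"

psnr = peak_signal_noise_ratio(original, reconstructed, data_range=1.0)                 # higher is better
ssim = structural_similarity(original, reconstructed, channel_axis=-1, data_range=1.0)  # higher is better
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.3f}")
```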
The competition:
- Generative-only tokenizers (VQGAN, LlamaGen, VAR, Open-MAGVIT2, RAE): Good at reconstruction/generation but not designed to keep semantics strong for understanding.
- Unified dual-encoders (TokenFlow, Janus, MUSE-VL): Two separate encoders to split pixel vs. semantics; more complex and sometimes weaker cross-talk.
- Unified single-encoders with contrastive supervision (QLIP, UniTok, VILA-U): Simpler than dual encoders, but often need huge batches and still trail classic understanding-only models.
The scoreboard with context:
- Reconstruction (ImageNet-50k, 256×256): VQRAE (SigLIP2) gets rFID ≈ 1.31, PSNR ≈ 22.23, SSIM ≈ 0.762; the InternViT variant reaches rFID ≈ 1.39 with higher PSNR ≈ 22.88 and SSIM ≈ 0.784. This is like scoring an A when many prior unified models scored B's. It's competitive with strong generative tokenizers while using no convolution blocks.
- Understanding (LLaVA-1.5 settings): VQRAE maintains or improves scores versus other unified tokenizers and rivals strong MLLMs when simply swapping in the tokenizer, with no extra instruction tuning for the tokenizer. For example, on MME-Perception with a 13B LLM, VQRAE scores around 1491 vs. TokenFlow-L's 1365 (same LLM size), which is like jumping from a B+ to an A-. Larger-resolution variants push further.
- Generation (GenEval, DPG-Bench): With only 0.6B parameters for the AR generator, VQRAE hits strong overall scores (e.g., GenEval overall ≈ 0.76; DPG-Bench overall ≈ 86.7), surpassing or matching peers of similar size and approaching much larger systems on some subsets. That's like a small car keeping up with bigger trucks on the highway.
Surprising findings:
- High-dimensional codebooks work best for semantic encoders: Contrary to older VQ practices (8-256 dims), VQRAE needs high dimensions (≈1536) to keep codebook usage near 100% and avoid collapse. Low dimensions caused non-convergence. (A small usage-measurement sketch follows this list.)
- Bigger codebooks help, up to a point: Quality improves as size grows (e.g., 4k to 16k), but too large (e.g., 32k) can slow convergence and slightly hurt results.
- Two-stage training is necessary: End-to-end without distillation boosted reconstruction but damaged understanding. Adding self-distillation preserved semantics while still improving detail.
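The "near-100% codebook usage" claim is straightforward to measure: encode many images, collect the token indices, and count how many distinct codewords ever get picked. A minimal sketch of that bookkeeping, with made-up token indices:

```python
import numpy as np

def codebook_usage(token_ids, codebook_size):
    """Fraction of codebook entries that appear at least once in `token_ids`
    (an integer array of codeword indices gathered over many images)."""
    return np.unique(token_ids).size / codebook_size

rng = np.random.default_rng(0)
healthy = rng.integers(0, 16384, size=(1000, 256))    # 1000 images x 256 tokens, spread out
collapsed = rng.integers(0, 32, size=(1000, 256))     # only 32 codes ever selected
print(codebook_usage(healthy, 16384))    # close to 1.0 -> "near-100% usage"
print(codebook_usage(collapsed, 16384))  # about 0.002 -> a collapsed codebook
```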
Takeaway: VQRAE hits the sweet spot: strong reconstruction and discrete tokens for generation, while continuous features remain excellent for understanding, all in one unified tokenizer.
05 Discussion & Limitations
Limitations (be specific):
- Fine text and tiny details: Reconstruction and generation can still struggle with small fonts, tiny faces, and fingers; artifacts may appear without extra post-training.
- Trade-off tension: Even with two stages, pushing for ultra-sharp reconstructions can nibble at semantic purity; dialing in this balance remains tricky.
- Quantization loss: Discretizing inevitably drops some information compared to continuous VAEs; state-of-the-art continuous decoders can still outshine on absolute fidelity.
Required resources:
- Pretraining data at tens of millions of images (e.g., BLIP3-o mix) and compute to train the tokenizer and AR generator.
- A strong VFM backbone (SigLIP2/InternViT) and a ViT decoder of similar scale.
- Tooling for LPIPS/perceptual losses, optional adversarial losses, and distillation.
When NOT to use:
- If your only goal is maximum-fidelity reconstruction with no need for AR generation or multimodal understanding, a continuous VAE/RAE without quantization may be simpler and slightly better.
- If compute is extremely limited and you cannot run the two-stage training or a sizable codebook, a lighter, task-specific tokenizer might be preferable.
Open questions:
- Can reconstruction and generation actively boost understanding (and vice versa) beyond just coexisting? What curricula unlock that synergy?
- How far can we scale codebook size and dimension before convergence slows or overfitting appears? Can smarter training stabilize even larger dictionaries?
- Can reinforcement learning or instruction-following for visual tokens reduce artifacts in hands, text, and layout understanding?
- What are the best connectors and prompts to fully exploit continuous semantics while leveraging discrete tokens in the same MLLM conversation?
- Can similar ideas unify video, audio, and 3D with one tokenizer that serves both continuous understanding and discrete generation?
06 Conclusion & Future Work
Three-sentence summary: VQRAE is a unified tokenizer that uses a pretrained vision encoder to produce continuous semantic features for understanding and, via a high-dimensional VQ codebook, discrete tokens for generation and reconstruction. A symmetric ViT decoder and a two-stage training process with self-distillation let the model keep its meaning while sharpening pixel details. The result is competitive performance across understanding, generation, and reconstruction in a single, simpler system.
Main achievement: Proving that high-dimensional semantic vector quantization (with near-100% codebook usage) can coexist with strong reconstruction and preserve understanding, unlocking discrete, autoregressive generation without sacrificing semantics.
Future directions: Improve tiny-detail fidelity (text, faces, fingers) with RL or targeted data; explore curricula where reconstruction and generation enhance reasoning; scale codebooks and encoders safely; extend the approach to video, audio, and 3D. Also refine alignment so continuous and discrete branches collaborate more tightly within one MLLM.
Why remember this: VQRAE flips the old wisdom: semantic encoders want big, high-dimensional codebooks. It shows that one tokenizer can feed both brains (understanding) and hands (generation), simplifying multimodal systems today and paving the way for faster, smarter, unified models tomorrow.
Practical Applications
- Smart photo assistants that both describe images and generate matching illustrations or stickers.
- One-click marketing tools that understand a product photo and produce on-brand ad creatives.
- Educational apps that answer questions about a diagram and also redraw it more clearly.
- Document cleanup that recognizes scanned pages and reconstructs cleaner, readable versions.
- Design copilots that grasp a sketch's intent (continuous) and render high-fidelity mockups (discrete).
- Medical imaging helpers that explain findings and reconstruct de-noised views for review (with human oversight).
- Robotics vision that understands scenes while quickly simulating possible views or outcomes.
- Game tools that recognize scene layouts and generate consistent textures or assets on demand.
- Accessibility tools that describe photos and generate tactile or high-contrast versions.
- Creative AI that keeps character and style consistency across both analysis and content generation.