Towards Scalable Pre-training of Visual Tokenizers for Generation
Key Summary
- The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.
- They name this the pre-training scaling problem and show that just adding more compute to reconstruction-only training stops helping generation very early.
- Their key idea (VTP) is to pre-train the visual tokenizer with three objectives at once: image-text contrastive learning (global meaning), self-supervised learning (spatial-semantic perception), and pixel reconstruction (fine details).
- They show that better semantic understanding in the latent space strongly correlates with better generation quality.
- VTP scales well: when you spend more FLOPs, use larger models, or add more data to the tokenizer pre-training, generation quality keeps improving.
- On ImageNet, their best tokenizer reaches 78.2% zero-shot accuracy and 0.36 rFID and makes diffusion training converge 4.1× faster than strong distillation methods.
- With standard DiT training kept fixed, only upgrading the tokenizer via VTP yields a 65.8% FID improvement in downstream generation when pre-training compute is increased 10×.
- They use a ViT-based autoencoder and show that hybrid training (CLIP + SSL + AE) delivers the best trade-off between semantic meaning and pixel fidelity.
- The method avoids pitfalls of using fixed external encoders (color/texture artifacts) and outgrows the ceiling of distillation-based tokenizers.
Why This Research Matters
Better tokenizers mean better images from the same generator without changing your downstream training recipe. This reduces cost and speeds up projects because you can invest compute where it scales best (pre-training the tokenizer) while keeping generation code stable. It leads to more accurate, on-topic pictures for creative tools, education, and accessibility, not just sharper pixels. By showing that semantics drive generation, this work helps the whole field focus on meaning-first latents that scale cleanly with compute, data, and model size. The approach also opens doors to specialized tokenizers (e.g., for diagrams or UI) that make targeted applications shine. In short, smarter summaries make smarter pictures, and that benefits anyone building or using generative visual AI.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook): You know how a librarian makes a short summary card for each book so it's faster to find the right story later? If those cards only list page-by-page details (like how many times the word "blue" appears), you won't find the best story quickly.
Filling (The Actual Concept):
- What it is: Visual tokenizers are the "summary makers" for images; they compress pictures into short codes (latents) that generative models can use.
- How it works (step by step):
- Take an image and chop it into patches.
- Encode those patches into a small, hidden code called a latent.
- Decode that latent back into an image so the model learns to preserve details.
- Later, a generator (like a diffusion model) works in this smaller latent space to make new images faster.
- Why it matters: If the summary (latent) only remembers tiny details (like exact pixels) and forgets the big idea (like "a dog catching a frisbee"), the generator struggles to create images that look semantically right.
Bottom Bread (Anchor): Imagine a recipe card that lists every grain of salt but forgets the dish's name. You'll cook something salty, but not the dish you wanted.
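To make the encode/decode flow concrete, here is a minimal, hypothetical sketch of a patch-based image tokenizer in PyTorch. The dimensions (256×256 RGB images, 16×16 patches, 64-dim latents) follow the article's running example, but the tiny linear encoder/decoder, the class name, and everything else are illustrative assumptions, not the actual VTP architecture.

```python
# Minimal sketch of the encode -> latent -> decode flow (hypothetical, not VTP's real model).
import torch
import torch.nn as nn

class TinyTokenizer(nn.Module):
    def __init__(self, patch=16, latent_dim=64):
        super().__init__()
        self.patch = patch
        in_dim = 3 * patch * patch                    # one flattened RGB patch
        self.encoder = nn.Linear(in_dim, latent_dim)  # patch -> compact latent code
        self.decoder = nn.Linear(latent_dim, in_dim)  # latent code -> patch

    def encode(self, x):
        B, C, H, W = x.shape
        p = self.patch
        patches = x.unfold(2, p, p).unfold(3, p, p)   # (B, 3, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return self.encoder(patches)                  # (B, N, latent_dim)

    def decode(self, z, H=256, W=256):
        B = z.shape[0]
        p = self.patch
        patches = self.decoder(z).reshape(B, H // p, W // p, 3, p, p)
        return patches.permute(0, 3, 1, 4, 2, 5).reshape(B, 3, H, W)

tok = TinyTokenizer()
img = torch.randn(2, 3, 256, 256)
z = tok.encode(img)      # (2, 256, 64): the short "summary" a generator would work in
rec = tok.decode(z)      # (2, 3, 256, 256): rebuilt image used by the reconstruction loss
print(z.shape, rec.shape)
```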
Top Bread (Hook): Imagine a treasure map: it doesn't show every blade of grass, but it shows the path to the treasure. That's the point of a good summary.
The Concept (Latent Space Representation):
- What it is: A latent space is a hidden, compact place where the model stores the essence of an image.
- How it works:
- The encoder compresses the image into a low-dimensional code.
- Related images end up close together; unrelated ones are far apart.
- The generator learns to move around this space to create new images with the right meaning.
- Why it matters: If this space clusters by low-level pixels instead of high-level meaning, generation becomes confused and brittle.
Anchor: If all dog photos are near each other in latent space (not just all brownish photos), it's easier to generate a "dog running on grass."
Top Bread (Hook): Think of rebuilding a LEGO castle from a pile of mixed pieces: you check how close your rebuild is to the original.
The Concept (Reconstruction Loss):
- What it is: A score that measures how well the decoder can rebuild the original image from its latent code.
- How it works:
- Encode image → latent.
- Decode latent → reconstructed image.
- Compare pixels and perceptual features; reduce the difference.
- Why it matters: It keeps fine details sharp, but if you only chase this score, you might miss the big picture (semantics), which hurts generation.
Anchor: You can copy a painting's brushstrokes perfectly yet miss that it was supposed to be "a sunny beach," not "a cloudy lake."
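As a rough illustration of the pixel-plus-perceptual recipe described above, here is a hedged PyTorch sketch. The frozen `feature_net` is an untrained stand-in for a real perceptual network (e.g., VGG/LPIPS features), and the loss weights are arbitrary; none of this is the paper's exact loss.

```python
# Sketch of a reconstruction loss: pixel term + perceptual (feature-space) term.
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_net = nn.Sequential(                     # stand-in perceptual "judge"
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
).eval()
for p in feature_net.parameters():
    p.requires_grad_(False)                      # the judge itself is never trained

def reconstruction_loss(x, x_rec, w_pixel=1.0, w_perceptual=1.0):
    pixel = F.l1_loss(x_rec, x)                             # keeps colors and edges faithful
    with torch.no_grad():
        feat_real = feature_net(x)                          # target features, no gradient
    perceptual = F.mse_loss(feature_net(x_rec), feat_real)  # "looks right" in feature space
    return w_pixel * pixel + w_perceptual * perceptual

x = torch.rand(2, 3, 64, 64)
x_rec = torch.rand(2, 3, 64, 64, requires_grad=True)        # pretend decoder output
loss = reconstruction_loss(x, x_rec)
loss.backward()
print(loss.item())
```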
Top Bread (Hook): Picture a school that only grades handwriting, not ideas. Students learn neat letters but not storytelling.
The Concept (Pre-training Scaling Problem):
- What it is: Training a tokenizer only for reconstruction doesn't keep helping generation as you add more compute, bigger models, or more data; it hits a wall.
- How it works:
- The model gets excellent at pixel-accurate copying.
- It neglects high-level concepts like objects and scenes.
- Generators need those concepts, so quality stalls or worsens.
- Why it matters: Pouring more FLOPs into the wrong objective wastes resources and caps generation quality.
Anchor: Baking a bigger cake with a recipe that forgets sugar won't make it tastier, just larger and still bland.
The world before this paper looked like this: Latent Diffusion Models used autoencoders (often VAEs) as visual tokenizers trained mostly to reconstruct images. People assumed that better reconstructions would mean better generation. But a paradox appeared in practice: when researchers scaled up compute for reconstruction training, pixel scores improved, while generation quality (FID) often plateaued or even got worse. Some tried to patch this by distilling features from big vision models or by using those fixed features directly. Distillation helped a bit but hit a low ceiling; fixed features led to color shifts and texture artifacts. The gap: nobody had shown a tokenizer pre-training recipe that reliably scales generation quality simply by investing more in the tokenizer stage, without changing the downstream generator's training. The stakes are real: better tokenizers mean faster training, lower costs, and more accurate images for everyday uses like design tools, educational content, or assistive technology. This paper reframes the goal: build a latent space that captures meaning first, details second. Then scaling up finally works the way we hoped.
02 Core Idea
Top Bread (Hook): Imagine teaching a friend to draw from memory. If they only memorize pixels, their drawings look sharp but often wrong. If they understand the scene (dog, park, frisbee), they draw the right thing.
Filling (The Actual Concept):
- What it is: The key insight is that understanding drives generation, so the tokenizer must be trained to encode high-level semantics alongside details.
- How it works:
- Train the tokenizer with three tasks at once: contrastive image-text alignment (global meaning), self-supervised learning (spatial-semantic perception), and reconstruction (fine details).
- Use a ViT-based autoencoder to flexibly learn these representations.
- Keep diffusion model training unchanged; only upgrade the tokenizer.
- Why it matters: Now, when you scale compute, parameters, or data in pre-training, generation quality keeps getting better instead of stalling.
Bottom Bread (Anchor): With the new tokenizer, asking for "a red fox in snow" leads to images that actually look like a fox in winter, not just red-white pixel patterns.
Three analogies for the same idea:
- City map: A good map shows highways (semantics) and street names (details). A map of only bricks isn't useful for navigation.
- Study guide: A useful guide teaches the big ideas and gives examples; just memorizing spelling won't ace the test.
- Orchestra: Melody (semantics) plus harmony and rhythm (details) together make music; loudness alone doesnāt.
Before vs After:
- Before: Tokenizers fixated on pixel-perfect copying; scaling compute improved reconstruction but didn't grow generation quality.
- After: Tokenizers trained for meaning + details scale smoothly; more compute and data now translate into better, faster generation.
Why it works (intuition, no equations):
- Generators compose concepts. If the latent space clusters by meaning (objects, relations, scenes), the generator takes shorter, more reliable steps to reach a target image. Pixel-only latents are bumpy roads: tiny shifts cause big semantic mistakes. Hybrid losses flatten the road by aligning images with texts (global meaning), teaching part-whole structure (SSL), and preserving edges, colors, and textures (reconstruction). The result is a well-shaped space where sampling is both accurate and efficient.
Building blocks explained with sandwich explanations:
Top Bread (Hook): Matching words to pictures is like a classroom game: pair the correct caption with the photo. The Concept (Image-Text Contrastive Learning):
- What it is: A way to push images and their correct captions close together in feature space, and push mismatches apart.
- How it works:
- Encode images and texts.
- Compute similarities for all pairs in the batch.
- Reward correct matches; penalize wrong ones.
- Why it matters: It teaches global semantics (what the image is about), which is crucial for text-guided generation. Anchor: The caption "golden retriever catching a frisbee" gets pulled close to that exact image, not a cat on a sofa.
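A minimal sketch of this symmetric image-text contrastive loss, assuming placeholder embeddings; in a VTP-style setup the image embeddings would come from the tokenizer's encoder, the text embeddings from a text encoder, and the batch would be far larger than shown here.

```python
# Symmetric CLIP-style contrastive loss over a batch of paired embeddings (toy sizes).
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature            # similarity of every image with every caption
    targets = torch.arange(img.size(0))             # the true caption sits on the diagonal
    loss_i = F.cross_entropy(logits, targets)       # image -> its caption
    loss_t = F.cross_entropy(logits.t(), targets)   # caption -> its image
    return (loss_i + loss_t) / 2

img_emb = torch.randn(8, 512)   # stand-in pooled image features
txt_emb = torch.randn(8, 512)   # stand-in caption features
print(clip_loss(img_emb, txt_emb).item())
```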
Top Bread (Hook): Learning to solve a jigsaw by hiding pieces forces you to think about the whole picture, not just edges. The Concept (Self-Supervised Learning):
- What it is: Learning from the image itself without labels, using masked image modeling and self-distillation.
- How it works:
- Mask some patches; predict them from visible ones (MIM).
- Learn consistent features across different crops/views (self-distillation).
- Why it matters: It builds spatial-semantic perception (how parts relate to wholes), useful for composing scenes. Anchor: From a close-up of fur and grass, the model infers "dog on lawn," not just "brown-green pixels."
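Here is a small, hypothetical sketch of the masked-image-modeling half of SSL: hide a fraction of patch tokens and train a tiny Transformer to predict them from the visible context. The model size, the 75% mask ratio, and the regression-to-embeddings target are illustrative assumptions, not the paper's recipe (a self-distillation sketch appears later in the Methodology section).

```python
# Masked image modeling on patch tokens: hide patches, predict them from context.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMIM(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))   # learned "hole" filler
        layer = nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=4 * dim,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, dim)                 # predicts the hidden patch embedding

    def forward(self, tokens, mask_ratio=0.75):
        B, N, D = tokens.shape
        mask = torch.rand(B, N, device=tokens.device) < mask_ratio    # True = hidden patch
        corrupted = torch.where(mask.unsqueeze(-1),
                                self.mask_token.expand(B, N, D), tokens)
        pred = self.head(self.backbone(corrupted))
        return F.mse_loss(pred[mask], tokens[mask])     # score only the hidden positions

tokens = torch.randn(2, 256, 64)    # stand-in patch embeddings from the tokenizer encoder
print(TinyMIM()(tokens).item())
```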
Top Bread (Hook): Doing math, science, and reading in the same school day makes you a stronger thinker than doing only handwriting practice. The Concept (Multi-task Learning):
- What it is: Training the tokenizer with contrastive, self-supervised, and reconstruction tasks together.
- How it works:
- Share the same encoder.
- Apply different heads/losses for each task.
- Balance them so none dominates.
- Why it matters: The latent space becomes both meaningful and detailed, which is ideal for generation. Anchor: The model learns that "zebra" means stripes (detail) plus animal (semantics), not just black-and-white pixels.
Top Bread (Hook): Transformers are like attentive readers: they look at the whole page and focus where needed. The Concept (Vision Transformer):
- What it is: An image model that treats patches like tokens and lets them attend to each other to learn patterns.
- How it works:
- Split image into patches.
- Use attention layers to connect all patches.
- Produce a rich feature for the whole image and parts.
- Why it matters: ViTs flexibly host multiple objectives and learn global context well, making them perfect for a multi-skill tokenizer. Anchor: To understand "a cat under a table," attention links the cat patch and table patch into one coherent idea.
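A compact, hypothetical ViT encoder to make the patches-as-tokens idea concrete: patchify with a strided convolution, add positional embeddings, and let self-attention relate every patch to every other. The 256-pixel input, patch size 16, and tiny width/depth mirror the article's numbers but are otherwise assumptions.

```python
# Minimal ViT-style encoder: patchify, add positions, run self-attention over patch tokens.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=256, patch=16, dim=64, depth=2, heads=4):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        x = self.patch_embed(x)              # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)     # (B, N, dim): each patch becomes a token
        x = x + self.pos_embed               # remember where each patch came from
        return self.blocks(x)                # every patch attends to every other patch

feats = TinyViT()(torch.randn(2, 3, 256, 256))
print(feats.shape)                           # torch.Size([2, 256, 64])
```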
Top Bread (Hook): Understanding a story's plot lets you retell it, not just recite random sentences. The Concept (Semantic Understanding):
- What it is: Grasping the meaning of objects, actions, and relationships in images.
- How it works:
- Align images with text concepts (contrastive).
- Learn part-whole structure (SSL).
- Preserve details (reconstruction) so meaning is visible.
- Why it matters: Generation needs meaning to place the right things in the right places. Anchor: When asked for "two red apples on a white plate," semantic understanding avoids generating cherries on a napkin.
03 Methodology
At a high level: Input (image, caption) → encoder of the ViT-based autoencoder → three training heads/losses (contrastive, SSL, reconstruction) → updated tokenizer → later used by a fixed DiT for generation.
Step-by-step recipe:
- Architecture (ViT Autoencoder Core)
- What happens: The image is split into patches and fed to a Vision Transformer encoder, which produces a compact latent (e.g., 64 channels at 16× downsampling). A lightweight ViT-based pixel decoder reconstructs the image from this latent. The encoder's features are also used by a text encoder (for contrastive learning), an EMA teacher (for SSL), and the decoder (for reconstruction).
- Why this step exists: ViTs handle global context and are flexible enough to share one backbone across multiple tasks. Without this, we'd struggle to align global meaning with local details in one latent space.
- Example: A 256×256 image is split into a 16×16 grid of patches. The encoder outputs a 16×16×64 latent grid that the decoder can lift back to 256×256 pixels (see the quick shape check below).
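A quick back-of-the-envelope check of those shapes, using only the numbers stated above (256×256 RGB input, 16× downsampling, 64 latent channels):

```python
# Shape and compression check for the 256x256 -> 16x16x64 latent example.
H = W = 256
patch = 16                                      # 16x downsampling
latent_channels = 64
grid = H // patch                               # 16 tokens per side
pixel_values = 3 * H * W                        # 196,608 raw values in the image
latent_values = latent_channels * grid * grid   # 16,384 values in the latent
print(grid, grid, latent_channels)              # 16 16 64  -> the 16x16x64 latent grid
print(f"{pixel_values / latent_values:.0f}x fewer values for the generator to model")  # 12x
```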
- Reconstruction Objective (Two-stage)
- What happens: During pre-training, minimize pixel loss plus perceptual loss so reconstructions look right to both the eye and a feature-based judge. After pre-training, freeze the tokenizer and fine-tune the pixel decoder with a GAN loss for crisper textures.
- Why this step exists: Reconstruction keeps color, edges, and textures faithful. The two-stage design avoids unstable GAN training with ViTs while still delivering high-fidelity images later.
- Example: The model learns to rebuild the orange fur of a fox and the blue tint of snowy shadows; then GAN fine-tuning sharpens whiskers.
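A hedged sketch of what the second stage could look like: the tokenizer encoder is frozen and only the pixel decoder is fine-tuned with a reconstruction term plus an adversarial term. The stand-in conv encoder/decoder, tiny patch discriminator, hinge-style losses, and weights are all illustrative assumptions rather than the paper's exact setup.

```python
# Stage-2 sketch: freeze the tokenizer, fine-tune only the decoder with a GAN term.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Conv2d(3, 64, kernel_size=16, stride=16)            # stand-in frozen tokenizer encoder
decoder = nn.ConvTranspose2d(64, 3, kernel_size=16, stride=16)   # stand-in pixel decoder (trainable)
disc = nn.Sequential(                                            # tiny patch discriminator
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 4, stride=2, padding=1),
)
for p in encoder.parameters():
    p.requires_grad_(False)                                      # stage 2: the tokenizer stays frozen

opt_g = torch.optim.Adam(decoder.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

x = torch.rand(2, 3, 64, 64)                                     # placeholder training images
with torch.no_grad():
    z = encoder(x)                                               # latents are fixed in this stage

# Discriminator step: real images vs. current reconstructions (hinge loss).
x_rec = decoder(z)
d_loss = F.relu(1.0 - disc(x)).mean() + F.relu(1.0 + disc(x_rec.detach())).mean()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Decoder step: stay faithful to the pixels, but also fool the critic for crisper textures.
x_rec = decoder(z)
g_loss = F.l1_loss(x_rec, x) - 0.1 * disc(x_rec).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
print(d_loss.item(), g_loss.item())
```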
- Self-Supervised Learning (MIM + Self-Distillation)
- What happens: Create global and local crops of the image. Use an EMA teacher on the full global view; the student sees masked or cropped views. Losses encourage the student to predict masked patches (MIM) and match the teacherās outputs across views (self-distillation).
- Why this step exists: It teaches the model how parts fit into wholes and how objects persist across scales and views. Without SSL, latents miss structure, and generation struggles to place or shape objects correctly.
- Example: Mask the zebra's torso; the model infers stripes continue behind the mask and aligns features for both close-up stripes and the full animal.
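To complement the masking sketch shown earlier, here is a minimal, hypothetical sketch of the self-distillation half: the student sees an augmented view, the teacher sees the full view, the student is trained to match the teacher, and the teacher is then updated as a slow exponential moving average of the student. The toy models, the noise-based "crop", the cosine-style loss, and the 0.996 momentum are assumptions.

```python
# Self-distillation with an EMA teacher: student matches teacher; teacher trails the student.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                      # the teacher is never trained by gradients

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

global_view = torch.rand(4, 3, 32, 32)                            # full image view for the teacher
local_view = global_view + 0.1 * torch.randn_like(global_view)    # stand-in augmented/cropped view

with torch.no_grad():
    target = F.normalize(teacher(global_view), dim=-1)
pred = F.normalize(student(local_view), dim=-1)
loss = (2 - 2 * (pred * target).sum(dim=-1)).mean()               # cosine-style consistency loss

opt.zero_grad(); loss.backward(); opt.step()

m = 0.996                                                         # EMA momentum
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)                          # teacher drifts slowly toward student
print(loss.item())
```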
- Image-Text Contrastive Learning (CLIP-style)
- What happens: In a big batch (e.g., 16k), encode all images and captions. Pull true pairs together and push mismatches apart with a contrastive loss.
- Why this step exists: It injects global semantics, i.e., what the picture is "about." Without it, text-to-image generation can produce off-topic outputs even if details are sharp.
- Example: The caption "a blue vintage car parked by the shore" moves right next to the matching image in feature space.
- Balancing the Losses (Multi-task Training)
- What happens: Combine the three losses with weights: L_total = λ_rec L_rec + λ_ssl L_ssl + λ_clip L_clip. In practice, they reduce λ_rec (e.g., 0.1) so semantics don't get drowned out by pixels, and toggle λ_ssl/λ_clip to study effects.
- Why this step exists: The weights are the volume knobs. If reconstruction is too loud, semantics vanish; if too quiet, textures suffer. The paper shows a sweet spot where all improve together.
- Example: With λ_rec=0.1, λ_ssl=1, λ_clip=1, the model achieves strong semantics and still reconstructs colors and textures well.
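In code, the weighted combination above is a one-liner; the sketch below just plugs in the example weights (λ_rec = 0.1, λ_ssl = 1, λ_clip = 1) with placeholder loss values to show how the "volume knobs" interact.

```python
# Weighted multi-task total loss with the example weights from the text.
import torch

def total_loss(l_rec, l_ssl, l_clip, lam_rec=0.1, lam_ssl=1.0, lam_clip=1.0):
    # keep reconstruction quiet enough that semantics are not drowned out by pixels
    return lam_rec * l_rec + lam_ssl * l_ssl + lam_clip * l_clip

l_rec, l_ssl, l_clip = torch.tensor(0.8), torch.tensor(2.3), torch.tensor(3.1)  # placeholder values
print(total_loss(l_rec, l_ssl, l_clip))   # tensor(5.4800)
```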
- Batch Sampling Strategy
- What happens: Different tasks prefer different batch sizes. Contrastive likes very large batches (e.g., 16k) to see many negatives; SSL and reconstruction work fine with smaller batches (e.g., 4k and 2k). From a pool of image-caption pairs, use all for CLIP, and subsample for SSL and reconstruction.
- Why this step exists: It respects each task's data appetite. Without it, contrastive learning underperforms, and the latent space won't align cleanly with text.
- Example: From a 16k batch, randomly pick 4k items for SSL and 2k for reconstruction in the same step.
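A tiny sketch of that per-task sampling, assuming the 16k/4k/2k sizes from the example; the per-sample features are placeholders standing in for real image-caption pairs.

```python
# Per-task batch sampling: full batch for contrastive, random subsets for SSL and reconstruction.
import torch

full_batch, ssl_batch, rec_batch = 16_384, 4_096, 2_048
features = torch.randn(full_batch, 16)          # placeholder per-sample features

perm = torch.randperm(full_batch)
clip_items = features                           # contrastive uses every pair (many negatives)
ssl_items = features[perm[:ssl_batch]]          # smaller subset for SSL
rec_items = features[perm[:rec_batch]]          # even smaller subset for reconstruction
print(clip_items.shape[0], ssl_items.shape[0], rec_items.shape[0])   # 16384 4096 2048
```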
- Training Data and Specs
- What happens: Use a 277M-sample filtered subset of DataComp-1B for tokenizer pre-training. For understanding metrics, probe features from the latent bottleneck directly (no multi-layer tricks). For generation, keep DiT training fixed (e.g., LightningDiT-B on ImageNet-256 for 80 epochs) and only swap the tokenizer.
- Why this step exists: It proves improvements come from the tokenizer, not from changing the generator or training schedule.
- Example: Two runs with identical DiT configs: baseline AE vs VTP tokenizer. VTP yields lower FID with the same DiT.
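The "probe features from the latent bottleneck" protocol can be pictured with a linear probe: freeze the tokenizer, train only one linear layer on its features, and read off accuracy. The sketch below uses random stand-in features and labels; the feature dimension, class count, and optimizer settings are assumptions.

```python
# Linear probe on frozen bottleneck features: only one linear layer is trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, n_classes = 64, 10
probe = nn.Linear(feat_dim, n_classes)                 # the only trainable part
opt = torch.optim.SGD(probe.parameters(), lr=0.1)

feats = torch.randn(512, feat_dim)                     # stand-in frozen tokenizer features
labels = torch.randint(0, n_classes, (512,))           # stand-in class labels

for _ in range(20):                                    # a few probe-training steps
    loss = F.cross_entropy(probe(feats), labels)
    opt.zero_grad(); loss.backward(); opt.step()

acc = (probe(feats).argmax(dim=-1) == labels).float().mean().item()
print(f"toy probe accuracy: {acc:.2f}")
```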
- Secret Sauce
- The clever part is shaping the latent space with meaning first and pixels secondāby jointly solving three complementary tasks and carefully balancing their strengths. Contrastive gives global topic, SSL gives structure, reconstruction gives polish. Together they form a latent space that both understands and renders.
Sandwich recap of any new terms introduced:
Top Bread (Hook): Think of packing different snacks into the right-sized lunch boxes. The Concept (Batch Sampling by Task):
- What it is: Giving each training task the batch size it needs to learn best in the same training step.
- How it works:
- Start with a big batch of image-caption pairs.
- Use the whole batch for contrastive (needs many negatives).
- Subsample smaller sets for SSL and reconstruction.
- Why it matters: Each task learns efficiently without starving or overstuffing. Anchor: The contrastive learner gets a full buffet; SSL and reconstruction get just enough to digest well.
04 Experiments & Results
The Test: They measure three things that map to how humans judge images: understanding (does the latent space know what's in the image?), reconstruction (can it faithfully rebuild pixels?), and generation (does a fixed DiT make better images using this tokenizer?). Metrics include zero-shot accuracy and linear probe on ImageNet (understanding), rFID for reconstruction, and FID for generation (gFID in figures).
The Competition: Baselines include classic reconstruction-only autoencoders (like SD-VAE), distillation-based tokenizers (VA-VAE), and methods that use fixed encoders (RAE). They also compare to strong understanding models like CLIP, SigLIP, MAE, and DINOv2 for context.
Scoreboard with context:
- Paradox confirmed: Scaling reconstruction-only training improved rFID from about 2.0 to 0.5 (better pixel copying) but made generation worse (gFID rose from ~55.0 to ~58.6). That's like getting neater handwriting while your essay becomes less understandable.
- Hybrid helps: Adding either CLIP+AE or SSL+AE turns scaling into a win. As compute grows, understanding and generation improve together, while reconstruction stays healthy. That's like practicing both storytelling and grammar, so your essay gets clearer and still looks neat.
- Best of both worlds: Using CLIP+SSL+AE (VTP) beats single hybrids. With the same compute, VTP hits a higher generative upper bound (e.g., gFID ≈ 27.8) and stronger linear probe accuracy (≈ 74.9%) than CLIP+AE or SSL+AE alone.
- Parameter scaling: For reconstruction-only AEs, making the encoder larger barely helps generation (stuck around FID ≈ 57). VTP improves steadily as the encoder grows (e.g., gFID from ≈ 31.3 to ≈ 26.1) and also benefits from a stronger decoder (down to ≈ 24.1), showing a clean scaling curve.
- Data scaling: With more pre-training data for the tokenizer (100K → 100M+), the AE barely moves (≈ 58.4 → ≈ 56.7), while VTP plunges impressively (≈ 47.6 → ≈ 27.5). It's like studying: more books only help if your study method focuses on meaning.
- Headline results: The top VTP model reaches 78.2% zero-shot accuracy and 0.36 rFID on ImageNet, and enables 4.1× faster convergence in downstream generation than strong distillation approaches. Keeping DiT training unchanged, simply spending more FLOPs on VTP pre-training yields a 65.8% FID improvement in generation versus the baseline AE, which stagnates at only 1/10th the compute.
Surprising findings:
- More pixels can be less meaning: Reconstruction-only training gets stunningly good at copying details yet actively harms generation quality when scaled.
- Understanding is predictive: A stronger linear probe (better semantics) goes hand-in-hand with better FID; the paper plots show a tight positive correlation.
- Multi-objective synergy: Contrastive and SSL, though different, both lift generation similarly by injecting semantics. Together, they're even better, suggesting room to plug in future representation tasks.
Concrete examples observed:
- Reconstruction artifacts: Tokenizers built from fixed encoders can miscolor or blur textures; VTP's jointly learned latent avoids many of these pitfalls.
- Speed: Compared to LDM and VA-VAE, VTP's generative training converges much faster, suggesting a smoother, better-structured latent landscape for diffusion to navigate.
Bottom line: The experiments turn a previously fuzzy belief into clear evidence: semantic understanding in the latent space is the engine that makes generation scale with compute, model size, and data.
05 Discussion & Limitations
Limitations:
- Compute appetite: VTP benefits from large-scale pre-training (big batches for contrastive, large datasets). Not every lab can afford 16k-batch CLIP training or hundreds of billions of FLOPs.
- Data quality: The tokenizer learns from what it sees. If captions are noisy or images are biased, the latent space may inherit those issues, affecting downstream generation fairness and accuracy.
- Objective balancing: The sweet spot for λ_rec, λ_ssl, and λ_clip matters. Wrong weights can overfit to pixels or over-prioritize semantics, reducing either fidelity or accuracy.
- Domain shifts: Trained mostly on internet-scale photos, VTP may need adaptation for medical images, satellite data, or diagrams where textures and semantics differ.
Required resources:
- Large-scale image-text data (e.g., hundreds of millions of pairs) and strong data filtering.
- Hardware capable of very large batches for contrastive learning and efficient ViT training (e.g., multi-node accelerators).
- Stable training code for SSL (EMA teachers, multi-crop) and careful normalization (e.g., QK-norm) to avoid instability.
When not to use:
- Tiny data regimes where contrastive learning can't form reliable negatives.
- Strictly pixel-critical tasks (medical diagnosis visuals) where any semantic bias might risk hallucinations; a conservative reconstruction-first tokenizer may be safer.
- Situations where you must use a frozen external encoder for compatibility; VTP's benefit comes from joint training.
Open questions:
- Which additional perception tasks (e.g., depth, segmentation, optical flow) would further enrich the latent space without harming stability?
- Can we design adaptive loss weighting that auto-tunes λs as training progresses to reach the sweet spot reliably?
- How does data curation (attribute-rich subsets like text rendering, diagrams, UI screenshots) specialize tokenizers for target applications?
- What are the limits of scaling? Do we see new plateaus at even larger data/model sizes, and how do we break them?
- Can similar multi-objective principles make video tokenizers scale as well as image tokenizers?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that visual tokenizers trained only to copy pixels stop helping generation when scaled, causing a pre-training scaling problem. Their solution, VTP, jointly trains contrastive, self-supervised, and reconstruction objectives in a ViT autoencoder so the latent space captures meaning and details together. As a result, downstream generation with a fixed DiT improves dramatically with more compute, parameters, and data, achieving 78.2% zero-shot, 0.36 rFID, faster convergence, and up to 65.8% better FID solely from a stronger tokenizer.
Main achievement: They turn tokenizer pre-training into a reliably scalable lever for generation by showing that semantic understanding in the latent space is the key driver, and by delivering a practical recipe (CLIP + SSL + AE) that consistently scales.
Future directions: Explore new perception tasks (depth, segmentation, layout, text rendering) to further shape the latent space; develop adaptive loss balancing; curate domain-specific data to unlock specialized generators; extend the approach to video tokenizers and multimodal generation.
Why remember this: It rewires our intuition about where to invest compute (train the tokenizer to understand, not just to copy) and shows that once the latent space carries meaning, everything downstream becomes faster, better, and more scalable.
Practical Applications
- Upgrade an existing diffusion pipeline by swapping in a VTP-pretrained tokenizer to improve quality without changing DiT settings.
- Pre-train a tokenizer on domain-specific data (e.g., product photos) so downstream generation better follows brand styles and object semantics.
- Use CLIP+SSL+AE training to build a tokenizer that accelerates convergence, reducing cloud costs and time-to-deploy.
- Scale tokenizer pre-training (more FLOPs, larger ViT) to push generation FID lower while keeping your generator fixed.
- Balance loss weights (lower λ_rec, include λ_ssl and λ_clip) to tune the trade-off between semantic accuracy and texture fidelity for your use case.
- Leverage large-batch contrastive training (e.g., 16k) for stronger text-image alignment in text-to-image systems.
- Probe the tokenizer's bottleneck features to monitor understanding quality (linear probe, zero-shot) as an early indicator of generative performance.
- Curate pre-training data emphasizing needed attributes (e.g., text rendering, diagrams) to unlock specialized generation skills.
- Adopt the two-stage reconstruction (perceptual first, GAN fine-tune) to get both stable training and crisp final textures.