
NativeTok: Native Visual Tokenization for Improved Image Generation

Intermediate
Bin Wu, Mengqi Huang, Weinan Jia et al. Ā· 1/30/2026
arXiv Ā· PDF

Key Summary

  • This paper fixes a hidden mismatch in image generation: tokenizers make tokens without order, but generators need an order to predict the next token well.
  • NativeTok teaches the tokenizer to create tokens in a native, causal visual order (big shapes first, fine details later).
  • It splits the job into two parts: a Meta Image Transformer (MIT) learns the whole image context, and a Mixture of Causal Expert Transformer (MoCET) generates tokens one-by-one using that context.
  • Each MoCET expert specializes in one token position, so dependencies are clear and easy for the next-stage generator to learn.
  • A Hierarchical Native Training (HNT) strategy grows models efficiently by freezing old experts and training only new ones, then briefly fine-tuning everything.
  • On ImageNet-1K, NativeTok improves generation quality (lower gFID) versus strong baselines, even when its reconstruction score (rFID) is slightly worse.
  • In autoregressive setups, NativeTok cuts gFID from 7.45 (TiTok-L-32) to 5.23 using the same generator (LlamaGen-B).
  • In MaskGIT-style generation, NativeTok reaches a gFID of 2.16 with only 8 sampling steps, showing strong quality–efficiency trade-offs.
  • Simple causal masks alone didn’t fix the mismatch; NativeTok’s divide-and-conquer design did.
  • NativeTok makes second-stage models learn more precise, position-aware token relationships, boosting coherence in generated images.

Why This Research Matters

When the tokenizer outputs tokens in the same order the generator needs, training becomes easier and images look more coherent. This reduces weird artifacts like mismatched eyes or warped edges because dependencies are clearer and more learnable. It can shorten sampling steps (faster results) without sacrificing quality, which matters for interactive tools. Developers can get better results even with the same generator by swapping in a smarter tokenizer. The idea also hints at how to improve tokenization in other domains—like video or 3D—by respecting native causal orders. In short, NativeTok turns the tokenizer into a guide that teaches the generator how to build images step by step.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: You know how you build a LEGO set by following the instructions step by step? If the pieces came in a random order with no instructions, you’d waste time and make mistakes.

🄬 The Concept (Image Tokenization):

  • What it is: Image tokenization breaks a picture into a small set of pieces (tokens) so a model can handle them like words in a sentence.
  • How it works:
    1. Compress the image into a simpler, smaller form (latent space).
    2. Turn that form into discrete tokens from a codebook.
    3. Learn to decode those tokens back into the original image.
  • Why it matters: Without tokenization, generation is slow and heavy because the model must juggle millions of pixels. šŸž Anchor: Like slicing a pizza so it’s easier to share, tokenization slices the image into manageable bites the model can predict. (A minimal code sketch of the codebook lookup follows below.)
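
To make step 2 concrete, here is a minimal, generic sketch of codebook quantization (vector quantization). The codebook size, latent dimension, and variable names are illustrative assumptions, not the paper’s actual configuration:

```python
import torch

# Illustrative sizes, not the paper's configuration.
codebook_size, latent_dim = 4096, 16
codebook = torch.randn(codebook_size, latent_dim)    # learned embedding table (random here)

def quantize(latents: torch.Tensor) -> torch.Tensor:
    """Map each continuous latent to the index of its nearest codebook entry."""
    distances = torch.cdist(latents, codebook)       # (num_tokens, codebook_size) pairwise distances
    return distances.argmin(dim=-1)                  # closest code id per latent = discrete token

# Example: 32 encoder latents -> 32 discrete tokens, then embed them back for the decoder.
latents = torch.randn(32, latent_dim)
tokens = quantize(latents)                           # shape (32,), integer token ids
decoder_inputs = codebook[tokens]                    # shape (32, latent_dim), what the decoder reads
print(tokens[:5], decoder_inputs.shape)
```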

šŸž Hook: Imagine writing a story one sentence at a time. Each sentence depends on what you already wrote.

🄬 The Concept (Autoregressive Generation):

  • What it is: A model generates the next token based on all previous tokens, step by step.
  • How it works:
    1. Start with a start token or prompt.
    2. Predict token 1, then token 2 using token 1, and so on.
    3. Stop when you have the full token sequence, then decode to an image.
  • Why it matters: This orderly chain is how models keep scenes consistent—like putting the wheels on a car after the frame. šŸž Anchor: When making a sandwich, you add bread, then lettuce, then tomato. If you reverse the order, it’s a mess—just like unordered tokens. (A toy next-token loop is sketched below.)
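
As a toy illustration of this loop, the sketch below samples tokens one at a time from a stand-in scoring network. The network and sizes are placeholders, not the paper’s generator:

```python
import torch

vocab_size, seq_len = 4096, 32
embed = torch.nn.Embedding(vocab_size + 1, 64)        # +1 slot for a <start> token
head = torch.nn.Linear(64, vocab_size)                # stand-in scorer for the next token
START = vocab_size

@torch.no_grad()
def generate() -> list[int]:
    tokens = [START]                                  # 1. begin with a start token
    for _ in range(seq_len):
        prefix = embed(torch.tensor(tokens)).mean(0)  # summarize everything generated so far
        probs = head(prefix).softmax(-1)              # 2. score every candidate next token
        tokens.append(torch.multinomial(probs, 1).item())  # sample it; it joins the context
    return tokens[1:]                                 # 3. full token sequence, ready to decode

print(generate()[:8])
```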

The World Before: Visual tokenizers (like VQGAN, TiTok, and others) focused on two main things—compress better and reconstruct sharper images. Generators (diffusion or autoregressive) then learned to create images by predicting these tokens. This two-stage setup brought fast progress but treated the stages as separate. Tokenizers aimed for pixel-perfect reconstructions; generators tried to learn how tokens relate in sequence.

The Problem: The first stage didn’t care about token order. Tokens came out as a bag-of-pieces rather than a build-order. But the second stage absolutely needs an order to predict the ā€œnextā€ token. This mismatch made learning harder and caused bias, weaker coherence, or extra training struggles.

šŸž Hook: Think of a jigsaw puzzle. If someone hands you pieces sorted by edges-first, then colors and textures, you finish faster and make fewer errors.

🄬 The Concept (Causal Order in Images):

  • What it is: A natural sequence that goes from global structure to local details—like seeing the big shapes before the tiny textures.
  • How it works:
    1. Capture the whole scene’s context.
    2. Generate early tokens for big shapes and layout.
    3. Generate later tokens for textures and fine details.
  • Why it matters: If tokens follow this native visual order, the second stage learns real, meaningful dependencies. šŸž Anchor: When drawing, you sketch outlines first, then shade. Token order should work the same way.

Failed Attempts: A simple fix might be ā€œjust add a causal maskā€ in the tokenizer. But that forces one transformer to do two hard jobs at once: understand the whole image and produce ordered tokens. In practice, this barely helped.

The Gap: We needed a tokenizer that bakes in causal, position-aware dependencies (the order) while still seeing the full image context—so the second stage can learn easily and well.

Real Stakes: Better ordering means more coherent generations: faces with matching eyes, cars with correct wheels, and scenes that don’t ā€œmeltā€ at the edges. For apps like design tools, photo editing, and content creation, that’s the difference between ā€œalmost rightā€ and ā€œwow.ā€

02 Core Idea

šŸž Hook: Imagine packing a suitcase. If you put shoes at the bottom first and fragile items on top last, everything fits and nothing breaks.

🄬 The Concept (Native Visual Tokenization):

  • What it is: A tokenizer that produces tokens in the image’s native, causal order (big structure → fine detail) so the generator’s job matches the tokenizer’s output.
  • How it works:
    1. First learn the whole image context well.
    2. Then generate tokens one-by-one, each depending only on previous tokens and the fixed image context.
    3. Quantize those ordered tokens and decode the image.
  • Why it matters: Without native order, the generator learns from a scrambled instruction set. With it, learning becomes simpler and results get better. šŸž Anchor: Like building LEGO from the manual—start with the frame, then add panels, then stickers.

The ā€œAha!ā€ Moment in one sentence: Make the tokenizer respect causal, position-specific dependencies during tokenization so the generator learns exactly the same order it must use at inference.

Three Analogies:

  1. Recipe: Prep basics (boil pasta), then sauce, then garnish—don’t reverse it.
  2. Drawing: Outline first, color blocks second, highlights last.
  3. Sports play: Set the formation (structure) before running the play (details).

Before vs After:

  • Before: Tokenizers maximized reconstruction but ignored order; generators struggled to learn meaningful chains from unordered tokens.
  • After: Tokens arrive pre-ordered; the generator’s next-token task aligns with how tokens were made, improving coherence and sample quality.

Why It Works (intuition):

  • Separating concerns is powerful. First, a Meta Image Transformer (MIT) learns a clean, global picture of the scene (context). Then MoCET, a line of position-specialist experts, generates token i using: (1) the fixed global context and (2) all previous tokens. This makes dependencies explicit and stable. The generator later sees the same structure, so it learns faster and generalizes better.

Building Blocks:

šŸž Hook: You know how a tour guide gives you a bird’s-eye overview before you explore the city street by street?

🄬 The Concept (Meta Image Transformer, MIT):

  • What it is: A transformer that reads the whole image to build a rich, global context, then compresses it into a compact latent.
  • How it works:
    1. Use bidirectional attention to understand the full image.
    2. Pass through a dimension switcher (small MLP) to shrink features.
    3. Freeze this latent during token generation so every token sees the same context.
  • Why it matters: If context keeps shifting, token dependencies get noisy. A stable context is a steady map. šŸž Anchor: Like setting a classroom’s lesson plan first, so each student activity follows the same goals. (A minimal code sketch of this idea follows below.)
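
Below is a minimal sketch of the MIT idea as described above: bidirectional attention over image patches, plus a small MLP ā€œdimension switcherā€ that shrinks features into a compact latent. The use of learned latent query slots, the layer sizes, and the names are illustrative assumptions, not the paper’s exact architecture:

```python
import torch
import torch.nn as nn

class MetaImageTransformerSketch(nn.Module):
    """Toy MIT: read all patches with bidirectional attention,
    then compress to a compact latent via a small MLP."""

    def __init__(self, patch_dim=768, depth=4, latent_dim=16, num_latents=32):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=patch_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)   # bidirectional attention
        self.latent_queries = nn.Parameter(torch.randn(num_latents, patch_dim))
        self.dim_switcher = nn.Sequential(                              # small MLP "dimension switcher"
            nn.Linear(patch_dim, patch_dim), nn.GELU(), nn.Linear(patch_dim, latent_dim)
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim)
        batch = patches.shape[0]
        queries = self.latent_queries.expand(batch, -1, -1)
        x = torch.cat([queries, patches], dim=1)      # latent slots attend to the whole image
        x = self.encoder(x)
        latent = self.dim_switcher(x[:, :queries.shape[1]])   # keep only the latent slots, shrink dims
        return latent                                 # (batch, num_latents, latent_dim); frozen later

mit = MetaImageTransformerSketch()
context = mit(torch.randn(2, 256, 768))               # e.g. 16x16 patches of a 256x256 image
print(context.shape)                                  # torch.Size([2, 32, 16])
```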

šŸž Hook: Picture a relay race where each runner is trained for a specific leg of the track.

🄬 The Concept (Mixture of Causal Expert Transformer, MoCET):

  • What it is: A lineup of lightweight expert transformers where expert i only generates token i, using prior tokens and the fixed context.
  • How it works:
    1. Prepare a sequence of experts (one per token position).
    2. For step i, feed the frozen MIT latent, tokens 1..i-1, and a mask token into expert i.
    3. Expert i outputs token i; then it’s locked in, and we move to i+1.
  • Why it matters: Position specialization makes dependencies crisper than a one-size-fits-all transformer. šŸž Anchor: Like different chefs each perfecting one course: appetizer chef, entrĆ©e chef, dessert chef. (A minimal code sketch of the expert loop follows below.)
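
A minimal sketch of the MoCET loop described above: one lightweight expert per token position, each reading the frozen context plus the tokens already produced. A tiny MLP stands in for each expert transformer, and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MoCETSketch(nn.Module):
    """Toy mixture of causal experts: expert i produces only token i,
    reading the frozen global context plus tokens 1..i-1."""

    def __init__(self, num_tokens=32, latent_dim=16, width=64):
        super().__init__()
        self.mask_token = nn.Parameter(torch.randn(latent_dim))
        # One small expert per token position (a 2-layer MLP as a stand-in).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(latent_dim, width), nn.GELU(), nn.Linear(width, latent_dim))
            for _ in range(num_tokens)
        ])

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, num_latents, latent_dim), produced by MIT and kept frozen.
        batch = context.shape[0]
        generated: list[torch.Tensor] = []
        for expert in self.experts:
            prev = torch.stack(generated, dim=1) if generated else context[:, :0]
            mask = self.mask_token.expand(batch, 1, -1)
            # Expert i sees: frozen context + tokens 1..i-1 + a mask slot for token i.
            inputs = torch.cat([context, prev, mask], dim=1)
            token_i = expert(inputs).mean(dim=1)       # collapse to one vector for token i
            generated.append(token_i)                  # token i is now locked in
        return torch.stack(generated, dim=1)           # (batch, num_tokens, latent_dim), causal order

tokens = MoCETSketch()(torch.randn(2, 32, 16))
print(tokens.shape)                                    # torch.Size([2, 32, 16])
```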

šŸž Hook: Imagine growing a team: you hire new members and train only them while the veterans keep doing their jobs.

🄬 The Concept (Hierarchical Native Training, HNT):

  • What it is: A training strategy that freezes earlier experts and only trains newly added experts (then briefly fine-tunes all).
  • How it works:
    1. Train a short-token model fully (e.g., 32 tokens).
    2. Add more experts for longer sequences (e.g., +32), copy weights, freeze the old stack, and train only the new experts.
    3. At the end, fine-tune everything briefly.
  • Why it matters: This speeds up training and stabilizes learning as token count grows. šŸž Anchor: Like unlocking higher game levels: you keep earlier skills and learn only what the new level needs, then polish all skills together.

Together, these parts align tokenization with generation: the tokenizer speaks in the same ordered ā€œlanguageā€ the generator understands.

03 Methodology

High-level recipe: Input image → MIT (global context) → MoCET (ordered tokens) → Quantize tokens → Decoder (reconstruct image)

Step 1: Meta Image Transformer (MIT)

  • What happens: The image is split into patches and passed through several transformer layers to learn a rich representation. An MLP compresses this to a compact latent (X_latent) that captures global structure.
  • Why this step exists: If we try to learn token order and full-image understanding in one place, the model gets overloaded. MIT gives a solid, shared map first.
  • Example: For a 256Ɨ256 image of a dog on grass, MIT encodes the dog’s location, major shapes, and background layout into a latent that all future steps can read consistently.

Step 2: Lock the latent

  • What happens: Freeze X_latent for the whole token sequence generation.
  • Why it matters: If context changed mid-generation, early and late tokens would disagree, hurting coherence.
  • Example: The model won’t suddenly ā€œforgetā€ where the dog is when generating later fur-texture tokens.

Step 3: Mixture of Causal Expert Transformer (MoCET)

  • What happens: Use a chain of lightweight experts T1…TL. Expert i takes (X_latent, tokens 1…iāˆ’1, mask) and outputs token i; then token i is fixed.
  • Why this step exists: It enforces causal, position-specific dependencies without forcing one giant transformer to do everything.
  • Example with data: Suppose token 1 captures overall layout (sky/ground split), token 2 refines the dog’s pose, token 3 starts coarse fur regions, and so on, until token L adds whiskers or grass blades.

Step 4: Quantize and decode

  • What happens: The ordered continuous tokens are mapped to discrete codebook entries (vector quantization), then a transformer-based decoder reconstructs the image.
  • Why it matters: Discrete tokens are easier for second-stage generative models to predict; decoding tests if tokens preserved enough detail.
  • Example: Turning the ordered token sequence into a high-fidelity dog image at 256Ɨ256.

Secret Sauce (what’s clever):

  • Divide-and-conquer: MIT handles global understanding; MoCET handles causal ordering.
  • Position-specialized experts: Each expert learns the exact role of its token position, sharpening dependencies.
  • Frozen context: All tokens read the same stable scene map.
  • Hierarchical Native Training (HNT): Efficiently scales token length by reusing and freezing earlier experts.

Training details (as a recipe):

  1. Start with a 32-token NativeTok model; train all parameters.
  2. For 64 tokens, duplicate the first 32 experts into the new slots, freeze the old ones and MIT, train only the new experts and the decoder (about 56% trainable), then briefly fine-tune all.
  3. For 128 tokens, repeat the ā€œadd experts → train new only → brief full fine-tuneā€ loop.
  • Why this works: New positions learn new roles fast, while preserved experts keep earlier skills stable, speeding training and improving final reconstruction. (A freeze-and-grow sketch follows below.)
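
The freeze-and-grow step can be sketched as below. This is a generic illustration of the HNT idea (copy old experts into new slots, freeze the veterans, train only the newcomers, then briefly unfreeze everything); the function name and copy rule are assumptions, not the paper’s exact procedure:

```python
import copy
import torch.nn as nn

def grow_and_freeze(experts: nn.ModuleList, num_new: int) -> nn.ModuleList:
    """HNT-style growth sketch: freeze existing experts, seed new positions by
    copying old experts, and leave only the new ones trainable."""
    for expert in experts:                            # veterans keep their roles, frozen
        for p in expert.parameters():
            p.requires_grad = False
    new_experts = [copy.deepcopy(experts[i % len(experts)]) for i in range(num_new)]
    for expert in new_experts:                        # newcomers are the only trainable experts
        for p in expert.parameters():
            p.requires_grad = True
    return nn.ModuleList(list(experts) + new_experts)

# Usage: grow a 32-expert stack to 64, train, then briefly unfreeze everything to fine-tune.
stack32 = nn.ModuleList([nn.Linear(16, 16) for _ in range(32)])
stack64 = grow_and_freeze(stack32, num_new=32)
trainable = [p for p in stack64.parameters() if p.requires_grad]
print(len(trainable), "trainable tensors out of", len(list(stack64.parameters())))
```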

What breaks without each step:

  • Without MIT: MoCET must learn context and order at once, often failing to produce clean causal tokens.
  • Without frozen latent: Later tokens drift, reducing coherence.
  • Without position experts: A single encoder struggles to perfectly model all positions’ unique roles.
  • Without HNT: Training becomes slower and less stable as tokens grow.

Integration with generators:

  • Autoregressive (e.g., LlamaGen): The generator now learns the same causal order forged during tokenization.
  • MaskGIT-style: The ordered tokens still help because their dependencies are clearer, accelerating convergence and sampling.

Complexity and speed:

  • MoCET runs in a low-dimensional latent, so even with more expert steps, encoding speed remains reasonable compared to baselines. The observed drop in encoding throughput is modest given the O(n^2) attention cost and is outweighed by better generation quality.

04 Experiments & Results

šŸž Hook: Imagine grading drawings not just by how closely they match the original (copying skill) but also by how nicely you can create a new scene (creative skill).

🄬 The Concept (FID, rFID, gFID):

  • What it is: FID measures how close generated images are to real ones. rFID checks reconstructions; gFID checks brand-new generations.
  • How it works:
    1. Extract features from real and generated images using a reference network.
    2. Compare the two feature distributions.
    3. Lower FID means images look more like real ones.
  • Why it matters: It turns ā€œlooks goodā€ into a number we can compare. šŸž Anchor: Scoring 2.0 vs 5.0 is like moving from an A to a B- in visual realism. (A minimal FID computation is sketched below.)
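
For reference, the standard FID computation compares the mean and covariance of features from a reference network; rFID and gFID differ only in whether the compared images are reconstructions or brand-new generations. A minimal sketch, using random toy features instead of real Inception features:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """FrƩchet distance between two feature sets (rows = images, cols = feature dims)."""
    mu_r, mu_f = real_feats.mean(0), fake_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    cov_mean = sqrtm(cov_r @ cov_f)                   # matrix square root of the covariance product
    if np.iscomplexobj(cov_mean):                     # discard tiny imaginary parts from numerics
        cov_mean = cov_mean.real
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(cov_r + cov_f - 2 * cov_mean))

# Toy usage with random "features"; real FID uses features from a reference (Inception) network.
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(1000, 64)), rng.normal(0.1, 1.0, size=(1000, 64))))
```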

Setup:

  • Dataset: ImageNet-1K at 256Ɨ256 resolution.
  • Generators: LlamaGen (autoregressive) and MaskGIT-UVit-L (masked).
  • Baselines: VQGAN and TiTok, plus broader comparisons to recent tokenizers.

Autoregressive (AR) results with LlamaGen-B:

  • TiTok-L-32: gFID 7.45.
  • VQGAN: gFID 5.46.
  • NativeTok (32 tokens): gFID 5.23 (better than VQGAN by 0.23 and much better than TiTok). Notably, NativeTok’s rFID (2.57) is slightly worse than TiTok’s (2.21), yet its generation is better—evidence that ordered dependencies help AR generation more than raw reconstruction alone.
  • Takeaway: When the generator must predict next tokens, NativeTok’s ordered tokens make learning cleaner and outputs more coherent.

MaskGIT-style results with MaskGIT-UVit-L:

  • NativeTok achieves gFID 2.16 with only 8 sampling steps, showing strong quality–efficiency trade-offs.
  • Competing numbers: TiTok-S-128 can reach 1.97 but at 64 steps (2.50 at 8 steps). FlexTok reports 2.02; VAR reaches 1.92. NativeTok’s 2.16 at 8 steps is competitive while being efficient.
  • Takeaway: Ordered tokens help even outside strict AR settings.

Ablations (what mattered):

  • Causal mask vs NativeTok design: Simply adding causal masks to a standard tokenizer barely helped (rFID ~12.95 vs 12.99). NativeTok’s structure dropped rFID to 11.19 under matched steps, proving divide-and-conquer beats naive masking.
  • Hierarchical Native Training (HNT): Reusing weights and training only new experts sped training (1.53s → 1.15s per batch) and improved rFID (6.50 → 6.46), then fine-tuning trimmed it further (to 6.22).
  • Encoding speed: NativeTok is a bit slower than TiTok-L-32 (119.85 vs 136.32 samples/s) but stays practical.

Sensitivity visualization:

  • When earlier tokens are perturbed, NativeTok shows a larger change in the next-token distribution than TiTok. This means dependencies are sharper and more meaningful—exactly what generators need. (One way to quantify this shift is sketched below.)
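
One generic way to quantify such a shift is to compare the next-token distributions before and after the perturbation, for example with a KL divergence, as sketched below. The stand-in model and the choice of metric are illustrative assumptions, not the paper’s exact measurement:

```python
import torch

vocab_size = 4096
embed = torch.nn.Embedding(vocab_size, 64)
head = torch.nn.Linear(64, vocab_size)
model = lambda ids: head(embed(ids).mean(0))          # toy stand-in: prefix of ids -> next-token logits

def next_token_shift(prefix: torch.Tensor, perturbed: torch.Tensor) -> float:
    """KL divergence between next-token distributions before/after perturbing earlier tokens.
    A larger shift suggests sharper, more meaningful causal dependencies."""
    p = model(prefix).softmax(-1)
    q = model(perturbed).softmax(-1)
    return float((p * (p / q).log()).sum())

prefix = torch.randint(vocab_size, (16,))             # 16 already-generated token ids
perturbed = prefix.clone()
perturbed[0] = torch.randint(vocab_size, (1,)).item() # flip one early token
print(next_token_shift(prefix, perturbed))
```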

Bottom line with context:

  • AR: Dropping gFID from 7.45 to 5.23 is like going from a shaky B- to a strong A- in image realism and coherence.
  • MaskGIT: Hitting 2.16 with 8 steps shows NativeTok’s ordered tokens help models get great images faster, a big deal for real-world latency.

05 Discussion & Limitations

Limitations:

  • Still two-stage: Tokenizer and generator are trained separately. While NativeTok narrows the gap by enforcing order, a fully end-to-end joint training could improve even more.
  • Compute: Position-specialized experts and multiple scales (32→64→128 tokens) add parameters and steps, though HNT reduces the hit.
  • Encoding speed: Slightly slower than some baselines; acceptable but not the fastest.
  • Scale constraints: Some AR tests used the smallest variant; larger NativeTok versions might unlock even better generation, but need more compute.

Required resources:

  • Multi-GPU training (e.g., 4Ɨ A800), batch sizes around 256 globally, AdamW, cosine LR schedule, and 500K steps for strong results. For longer token sequences, HNT is recommended to keep training stable and affordable.

When NOT to use:

  • Ultra-low-latency encoding at tiny compute budgets where every millisecond counts and small quality gains don’t justify complexity.
  • Extremely small datasets where the benefits of ordered dependencies might not outweigh the overhead.
  • Non-sequential generators that ignore token order entirely and therefore cannot benefit from it.

Open questions:

  • Can we learn the optimal native order automatically per image or class?
  • Can we unify tokenization and generation into one end-to-end learner without losing stability?
  • What’s the best number of experts and token length for a given dataset and budget?
  • Can we extend native order ideas to video (space-time ordering) or 3D?
  • How do we align text–image grounding (in text-to-image) with native visual order for even better controllability?

06 Conclusion & Future Work

Three-sentence summary: NativeTok makes the tokenizer produce tokens in a natural, causal visual order so the generator learns exactly the kind of dependencies it needs. It separates global context learning (MIT) from position-wise token generation (MoCET), and scales efficiently with Hierarchical Native Training. The result is stronger, more coherent image generation across AR and MaskGIT settings, often beating baselines even when raw reconstruction is similar or slightly worse.

Main achievement: Baking causal, position-specific dependencies into the tokens themselves—turning the tokenizer from a ā€œgood reconstructorā€ into a ā€œgood teacherā€ for the generator.

Future directions:

  • Fully end-to-end training that jointly optimizes tokenization and generation.
  • Automatic discovery of the best native orders per class or instance.
  • Extensions to video and 3D with spatiotemporal native ordering.
  • Better efficiency via shared or dynamic experts and lightweight decoders.

Why remember this: NativeTok shows that getting the order right at the start makes everything downstream easier—like putting the right instructions in the box. It’s a simple but powerful idea that could shape how we design tokenizers for any modality that needs sequential generation.

Practical Applications

  • Improve text-to-image systems by feeding generators tokens ordered from structure to detail for crisper, more consistent outputs.
  • Speed up interactive image editing tools (inpainting, outpainting) by giving the generator clearer dependencies for local changes.
  • Enhance style transfer and photo-to-art pipelines with tokens that preserve layout first, then fine textures.
  • Boost product and scene design mockups where global layout accuracy (e.g., furniture placement) is critical before textures.
  • Support low-step sampling in masked generation workflows for faster previews during creative iterations.
  • Stabilize training of autoregressive image models in limited compute settings by simplifying next-token learning.
  • Provide cleaner supervision for multi-modal models (vision–language) by aligning visual token order with descriptive text order.
  • Prepare a foundation for video tokenization where frames or regions follow a native spatiotemporal order.
  • Assist dataset distillation or compression by producing structured, informative token sequences.
  • Improve downstream tasks (e.g., segmentation prompts) by offering tokens that reflect scene structure early.
#visual tokenization#autoregressive image generation#causal dependencies#Meta Image Transformer#Mixture of Experts#MoCET#Hierarchical Native Training#vector quantization#FID rFID gFID#MaskGIT#LlamaGen#latent space#ordered tokens#image reconstruction#codebook
Version: 1