Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
Key Summary
- Before this work, most text-to-image models used VAEs (small, squished image codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.
- This paper shows that training diffusion models in a shared, high-dimensional representation space (from a frozen encoder like SigLIP-2) with a simple decoder (RAE) works better and faster.
- Scaling the RAE decoder beyond ImageNet improves general image quality, but adding targeted text-rendering data is essential for sharp, readable text.
- A single, crucial trick, dimension-aware noise scheduling, makes high-dimensional diffusion stable; other fancy add-ons (wide denoising heads, noise-augmented decoding) matter little at large scale.
- Across model sizes (0.5B–9.8B DiTs) and LLMs (1.5B–7B), RAE-based models converge 4.0–4.6× faster and score higher on GenEval and DPG-Bench than VAE-based models.
- During fine-tuning, VAE models overfit badly after about 64 epochs, while RAE models remain stable and strong up to 256+ epochs.
- Combining synthetic and web data beats simply doubling either one alone; data composition matters more than raw data volume.
- Because understanding and generation share the same latent space, the LLM can judge images directly in latent space at test time, improving quality without decoding to pixels.
- Understanding accuracy stays intact when adding generation; RAE and VAE choices barely affect perception benchmarks because they share the same frozen encoder.
- Overall, RAEs are a simpler, stronger foundation than VAEs for large-scale text-to-image diffusion and for unified multimodal models.
Why This Research Matters
Better text-to-image models help people create clearer posters, readable diagrams, and accurate illustrations faster and cheaper. By training in rich, shared features, models can both understand and generate in the same space, making them easier to test, guide, and improve. Faster convergence means lower energy costs and more accessible research and development. Stable fine-tuning lets artists and companies customize models without the results collapsing into memorized copies. Finally, latent-space judging (picking the best sample without decoding) opens new ways to build safer, more controllable systems that can check their own work.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how a good library has both picture books and chapter books, and you choose differently depending on what you want to learn? Early AI image makers also chose different ways to store picture information, and those choices made a big difference.
🥬 The Concept: Diffusion Model
- What it is: A diffusion model is a system that learns to turn random noise into a clear image step by step.
- How it works:
- Start with pure noise (like TV static).
- Take tiny steps to make the noise look a little more like a real image.
- Keep stepping until the full image appears.
- Use a text prompt as a guide, like “draw a red fox on snow.”
- Why it matters: Without the careful step-by-step plan, the model would jump around and make blurry or wrong pictures. 🍞 Anchor: It’s like using an eraser to slowly reveal a hidden drawing from a page covered in pencil scratches.
🥬 The Concept: Variational Autoencoder (VAE)
- What it is: A VAE squeezes an image into a small code and learns to rebuild the picture from that tiny code.
- How it works:
- Encoder compresses the image into a low-dimensional code.
- Decoder expands that code back into pixels.
- Train both parts so the rebuilt picture looks like the original.
- Why it matters: VAEs make diffusion training faster than working on raw pixels, but the squeezing can lose fine details (like tiny text). 🍞 Anchor: Imagine packing a suitcase so small that your fancy hat gets squished—you can travel light, but the hat loses its shape.
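For readers who want to see the squeeze-and-rebuild idea in code, here is a minimal PyTorch sketch of a VAE. The layer sizes and the small KL weight are illustrative assumptions, not any production recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Toy VAE: compress an image to a small code, then rebuild pixels from it."""

    def __init__(self, latent_channels: int = 16):
        super().__init__()
        # Encoder: downsample the image into a small latent grid (mean and log-variance).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, 2 * latent_channels, 3, padding=1),
        )
        # Decoder: expand the small code back into pixels.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, image: torch.Tensor):
        mean, logvar = self.encoder(image).chunk(2, dim=1)
        # Reparameterization: sample a code near the mean during training.
        z = mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)
        recon = self.decoder(z)
        # Reconstruction loss plus a small KL term that keeps the code well-behaved.
        rec_loss = F.mse_loss(recon, image)
        kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
        return recon, rec_loss + 1e-4 * kl
```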
🥬 The Concept: Representation Learning
- What it is: Representation learning teaches models to create rich, meaningful features (like “fur,” “round,” or “text-like shapes”) from images.
- How it works:
- Show many images (with or without captions).
- Learn features that help tell things apart and connect images to words.
- Freeze these features so other models can use them.
- Why it matters: Good features help models understand scenes and objects, not just memorize pixels. 🍞 Anchor: It’s like learning the “language of pictures” so you can describe any photo using useful words in your head.
🥬 The Concept: Latent Space
- What it is: Latent space is a special “idea space” where images become feature vectors instead of pixels.
- How it works:
- An encoder maps an image to feature tokens.
- Models do their thinking in this space (e.g., generate or compare).
- A decoder turns features back into pixels.
- Why it matters: Working in latent space can be faster and more semantic than working on raw pixels. 🍞 Anchor: It’s like planning a LEGO build using instructions (ideas) first, then snapping bricks (pixels) together later.
The world before: Text-to-image generation took off by running diffusion in VAE latent spaces. This made training faster than pixel space, but the codes were tiny (often fewer than 100 channels). At the same time, vision-language models like CLIP and SigLIP learned powerful, high-dimensional features great for understanding, not generation. Many believed those features were too “abstract” or huge for making images.
The problem: Could we make a model that uses those rich, high-dimensional features for generation without falling apart? And could it scale to real, messy prompts from the web—not just tidy datasets like ImageNet?
Failed attempts: People tried supercharging VAEs (wider channels, better training) or quantizing features into codebooks. These helped a bit but still either lost detail (overcompression) or struggled with diversity and long fine-tuning (overfitting).
The gap: No clear recipe existed for stable, fast diffusion in high-dimensional representation spaces. Also, no one showed convincingly that such a recipe scales to big text-to-image transformers.
Real stakes: Faster convergence means less compute and cost. Better text fidelity means posters, memes, and signs look correct. More stable fine-tuning means creators can specialize models without them memorizing tiny datasets. And a shared space for understanding and generation means one model could check its own images without bouncing back and forth between different encoders and decoders.
02 Core Idea
🍞 Hook: Imagine you’re building a city. You can either draw tiny, simple maps (easy to carry but missing details) or use big, detailed blueprints (harder at first, but everything fits and makes sense). Which one would help you build a better city faster?
Aha! moment in one sentence: Do diffusion directly in a rich, high-dimensional representation space (from a frozen encoder) and just learn a simple decoder—this is both simpler and stronger than using VAEs for text-to-image at scale.
🥬 The Concept: High-Dimensional Latents
- What it is: High-dimensional latents are big, detailed feature tokens (many tokens, many channels) that capture meaning and structure.
- How it works:
- A strong encoder (e.g., SigLIP-2) turns an image into 256 tokens, each with 1152 channels.
- These tokens store semantic clues like shapes, textures, and text glyphs.
- A generator learns to model these tokens directly instead of tiny VAE codes.
- Why it matters: More room in the code means fewer important details get lost, especially small text and fine edges. 🍞 Anchor: It’s like having a full orchestra’s score instead of a hummed melody: you can reconstruct the music much more faithfully.
🥬 The Concept: Representation Autoencoder (RAE)
- What it is: RAE keeps the encoder frozen (use its great features) and trains only a lightweight decoder to turn those features back into pixels.
- How it works:
- Freeze the encoder (e.g., SigLIP-2) that already understands images well.
- Train a decoder to reconstruct images from encoder tokens.
- Run diffusion in that feature space, then decode to pixels at the end.
- Why it matters: You get the stability and meaning of a mature encoder plus a simpler, faster training path for generation. 🍞 Anchor: It’s like using a world-class translator’s dictionary (frozen) and just learning to read it out loud clearly.
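A minimal sketch of one RAE decoder training step, assuming placeholder `vision_encoder` and `decoder` modules: the encoder stays frozen and only the decoder updates. The plain MSE loss is a simplification; the paper’s decoder uses a richer loss mix (see the Methodology section).

```python
import torch
import torch.nn.functional as F

def rae_decoder_step(vision_encoder, decoder, optimizer, images):
    """One RAE decoder update: frozen encoder, trainable decoder (sketch)."""
    with torch.no_grad():                  # frozen encoder: no gradients, never updated
        tokens = vision_encoder(images)    # e.g. [B, 256, 1152] feature tokens
    recon = decoder(tokens)                # trainable decoder maps tokens -> [B, 3, H, W]
    loss = F.mse_loss(recon, images)       # simplest reconstruction loss (stand-in)
    optimizer.zero_grad()
    loss.backward()                        # gradients flow only into the decoder
    optimizer.step()
    return loss.item()
```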
🥬 The Concept: Latent Diffusion (in representation space)
- What it is: Latent diffusion means we do the “noise to image” dance on features, not on pixels.
- How it works:
- Add and remove noise to feature tokens during training.
- Learn a Diffusion Transformer (DiT) that predicts how to denoise features.
- At test time, sample clean features and decode to an image once.
- Why it matters: It’s faster than pixel space and, in RAE, more faithful than low-dimensional VAEs. 🍞 Anchor: It’s like sculpting a clay model (features) and only painting it (pixels) at the very end.
🥬 The Concept: Noise Scheduling (dimension-aware)
- What it is: A rule that adjusts how strong the noise should be, tuned to the size of the feature space.
- How it works:
- Measure the feature space “size” (tokens × channels).
- Shift the time/strength curve so noise fits this size.
- Train with the adjusted schedule; sample with it too.
- Why it matters: Without this, high-dimensional diffusion gets confused and performs much worse. 🍞 Anchor: It’s like changing your bike’s gear depending on a hill’s steepness—wrong gear, and you stall.
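One common way to implement such a shift is the SD3-style timestep warp, with the shift factor tied to the latent's total size. The sketch below assumes that form and an arbitrary `base_dim` reference, so the exact constants may differ from the paper's schedule.

```python
import math
import torch

def shifted_timesteps(num_steps: int, dim: int, base_dim: int = 4096) -> torch.Tensor:
    """Dimension-aware timestep grid (sketch).

    Applies the warp t' = a*t / (1 + (a - 1)*t) with a = sqrt(dim / base_dim),
    so larger latent spaces spend more of the trajectory at high noise levels.
    `base_dim` is an assumed reference size, not a value from the paper.
    """
    alpha = math.sqrt(dim / base_dim)
    t = torch.linspace(1.0, 0.0, num_steps + 1)   # flow-matching time: 1 = pure noise, 0 = clean
    return alpha * t / (1.0 + (alpha - 1.0) * t)

# Example: 256 tokens x 1152 channels is far "bigger" than a typical VAE latent.
t_grid = shifted_timesteps(num_steps=50, dim=256 * 1152)
```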
Three analogies for the whole idea:
- City blueprints: Use detailed blueprints (RAE features), not tiny sketches (VAE codes), so buildings (images) fit together faster and better.
- Orchestra score: Compose in full notation (RAE latents) so the final performance (pixels) is faithful and rich.
- Baking with a good starter: Begin with a robust sourdough starter (frozen encoder) and keep the recipe simple (decoder), so the loaf rises reliably.
Before vs. After:
- Before (VAE): Small codes, slower learning, loss of detail, overfitting during fine-tuning.
- After (RAE): Big semantic features, much faster convergence, clearer text and details, stable long fine-tuning.
Why it works (intuition):
- The frozen encoder gives a well-structured map of meaning. Learning to generate on this map is easier than inventing structure from scratch.
- High-dimensional space reduces information loss, especially important for small text and thin features.
- The dimension-aware noise schedule matches the math to the space, keeping training stable.
- Sharing one space for understanding and generation lets the LLM guide and critique results more directly.
Building blocks:
- Frozen representation encoder (SigLIP-2) producing 256 tokens with 1152 channels each.
- Trained RAE decoder to render pixels.
- Diffusion Transformer (DiT) trained with flow matching for fast, stable learning.
- Dimension-aware noise schedule shift (critical).
- Data composition tuned for text (add text-rendering data) and diversity (combine synthetic + web).
- Unified modeling via MetaQuery-style LLM with 256 query tokens.
- Optional latent-space test-time scaling, where the LLM picks the best sample without decoding first.
03 Methodology
High-level recipe: Text prompt → LLM with query tokens → small connector → Diffusion Transformer denoises high-dimensional features → RAE decoder renders pixels.
Stage 1: Train the RAE decoder beyond ImageNet
- What happens: Freeze the encoder (SigLIP-2) that outputs 256 tokens of 1152 channels; train a ViT-based decoder to reconstruct images. Use a mix of losses (pixel, perceptual LPIPS, Gram/style, and a light adversarial term) to get sharp, faithful results.
- Why this step exists: If the decoder can’t faithfully turn features back to pixels, great features still won’t look great as images.
- Example data effect: On text scenes, adding text-rendering data drops the text reconstruction score (lower is better) from about 2.64 to 1.62, showing much sharper glyphs.
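As a rough illustration of the loss mix in Stage 1, here is a hedged sketch combining pixel, LPIPS, Gram/style, and adversarial terms. The weights and the `lpips_model`, `feat_extractor`, and `discriminator` placeholders are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    # feats: [B, C, H, W] -> channel-correlation ("style") matrix per image
    b, c, h, w = feats.shape
    f = feats.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def decoder_loss(recon, target, lpips_model, feat_extractor, discriminator):
    pixel = F.l1_loss(recon, target)                         # pixel-level fidelity
    perceptual = lpips_model(recon, target).mean()           # LPIPS perceptual distance
    style = F.mse_loss(gram_matrix(feat_extractor(recon)),
                       gram_matrix(feat_extractor(target)))  # Gram/style term
    adv = -discriminator(recon).mean()                       # light adversarial (generator side)
    # Weights below are illustrative assumptions, not the paper's values.
    return pixel + 1.0 * perceptual + 0.1 * style + 0.05 * adv
```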
Stage 2: Pretrain the text-to-image model in representation space
- What happens: Use the MetaQuery setup. Prepend 256 learnable query tokens to the text, feed the combined sequence to an LLM (e.g., Qwen-2.5), project the query outputs into the DiT, and train the DiT with flow matching to denoise feature tokens produced by the frozen encoder.
- Why this step exists: The DiT must learn how to move from noisy features to clean, meaningful features that match prompts.
- Example metric: With the correct dimension-aware noise schedule, GenEval jumps from about 23.6 to 49.6.
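A rough sketch of the conditioning path in Stage 2, assuming an HF-style LLM that accepts `inputs_embeds` and returns `last_hidden_state`, plus assumed hidden sizes. In this sketch the queries sit after the prompt so a causal LLM can attend to the text; treat the ordering and the connector design as assumptions.

```python
import torch
import torch.nn as nn

class QueryConditioner(nn.Module):
    """MetaQuery-style conditioning sketch: learnable queries shaped by the LLM."""

    def __init__(self, llm, num_queries: int = 256, llm_dim: int = 1536, dit_dim: int = 2048):
        super().__init__()
        self.llm = llm                                      # placeholder LLM backbone (e.g. a Qwen-2.5 model)
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.connector = nn.Linear(llm_dim, dit_dim)        # small projector into the DiT (assumed width)

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds: [B, T, llm_dim] embedded prompt tokens.
        b = text_embeds.shape[0]
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries are placed after the prompt here so a causal LLM can condition
        # them on the text; treat the exact ordering as an assumption of this sketch.
        hidden = self.llm(inputs_embeds=torch.cat([text_embeds, q], dim=1)).last_hidden_state
        cond = hidden[:, -q.shape[1]:]                      # keep only the query positions
        return self.connector(cond)                         # conditioning tokens for the DiT
```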
Stage 3: Finetune on high-quality data
- What happens: Start from the pretrained LLM+DiT and fine-tune both on a smaller, carefully curated set. Keep the encoder frozen and the decoder fixed.
- Why this step exists: Fine-tuning polishes prompt alignment and style without losing generality.
- Example: RAE-based models stay strong up to 256+ epochs; VAE-based models overfit badly after ~64 epochs.
Data composition matters more than raw size
- Synthetic images (consistent style) train quickly and lower loss but lack full diversity.
- Web data (diverse) is harder but boosts semantic coverage.
- Together they synergize: mixing beats simply doubling either source alone.
Secret sauce 1: Dimension-aware noise scheduling
- Match the noise schedule to the total feature size (tokens × channels). This single change makes high-dimensional diffusion stable and effective.
Secret sauce 2: Simpler architecture at scale
- Fancy add-ons like a very wide denoising head help only in small models. At 2B+ parameter DiTs, standard widths already exceed latent width, so the add-on brings little gain.
Secret sauce 3: Targeted decoder data for text
- If you want sharp text, train the decoder with text-rendering scenes; scale alone won’t fix missing glyph detail.
🍞 Hook: Imagine a school that teaches reading, writing, and drawing in the same classroom so students can cross-check each other’s work instantly.
🥬 The Concept: Multimodal Model
- What it is: One model that understands text and images and can generate images from text.
- How it works:
- An LLM reads the text and shapes 256 learnable query tokens.
- The DiT uses these as guidance to denoise feature tokens.
- The RAE decoder renders pixels from the denoised features.
- Why it matters: Using the same representation space for seeing and drawing lets the model stay consistent and efficient. 🍞 Anchor: Like having a bilingual friend who both reads a recipe and cooks the dish in the same kitchen.
🍞 Hook: Suppose you can try three rough sketches and pick the best one using your own eyes—without asking anyone else.
🥬 The Concept: Test-Time Scaling (latent space best-of-N)
- What it is: Generate several candidates, then let the LLM score the feature tokens directly and keep the best—no pixel decoding needed.
- How it works:
- Sample N latent candidates with the DiT.
- Feed each latent + prompt back to the LLM.
- Score with prompt confidence or with a Yes/No alignment logit.
- Select the top-scoring latent and decode once.
- Why it matters: It boosts quality and saves time by avoiding repeated decode–re-encode steps. 🍞 Anchor: Like choosing the neatest pencil sketch before inking it, instead of fully coloring every sketch first.
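A compact sketch of latent-space best-of-N. Here `sample_latents`, `llm_score`, and `decoder` are placeholders for the DiT sampler, the LLM-based verifier (e.g., a Yes/No alignment logit), and the RAE decoder.

```python
import torch

def best_of_n(prompt: str, n: int, sample_latents, llm_score, decoder):
    """Generate N latent candidates, score them in latent space, decode only the winner."""
    candidates = [sample_latents(prompt) for _ in range(n)]     # no pixel decoding yet
    scores = torch.tensor([llm_score(prompt, z) for z in candidates])
    best = candidates[int(scores.argmax())]                     # highest verifier score wins
    return decoder(best)                                        # decode to pixels exactly once
```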
🍞 Hook: Picture a coach who teaches runners how to move smoothly from start to finish instead of guessing every step.
🥬 The Concept: Flow Matching
- What it is: A training target where the model learns the “velocity” that carries noisy features to clean ones in straight, efficient paths.
- How it works:
- Mix a real feature and noise at a random amount.
- Ask the DiT to predict the direction to move toward the real feature.
- Update the DiT so its predicted directions become accurate.
- Why it matters: It often trains faster and more stably than classic diffusion losses. 🍞 Anchor: Like teaching a paper airplane the exact wind direction to glide straight to your hand.
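A minimal flow-matching training step under the common rectified-flow convention (t = 0 is clean data, t = 1 is pure noise, straight-line interpolation); the paper's exact parameterization and loss weighting are not reproduced here.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(dit, optimizer, features, cond):
    """One flow-matching update on clean feature tokens (e.g. [B, 256, 1152])."""
    noise = torch.randn_like(features)
    t = torch.rand(features.shape[0], device=features.device)   # one time per sample
    t_ = t.view(-1, 1, 1)
    x_t = (1.0 - t_) * features + t_ * noise                    # straight-line interpolation
    target_v = noise - features                                 # velocity along that line
    pred_v = dit(x_t, t, cond)                                  # DiT predicts the velocity
    loss = F.mse_loss(pred_v, target_v)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```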
🍞 Hook: Think of a giant team of editors who all polish the same style guide so writing is consistent.
🥬 The Concept: Diffusion Transformer (DiT)
- What it is: A transformer backbone specialized for denoising steps in diffusion models.
- How it works:
- Takes in noisy feature tokens and prompt-conditioned signals from the LLM.
- Uses attention layers to refine tokens across steps.
- Outputs cleaner tokens at each sampler step.
- Why it matters: Scales well to billions of parameters and handles global structure effectively. 🍞 Anchor: It’s the orchestra conductor making sure every instrument (token) plays in harmony as the music (image) appears.
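To make the denoising transformer concrete, here is a single DiT-style block with adaLN-style conditioning. The widths, the pooled conditioning vector, and the block layout are illustrative assumptions; the actual model conditions on the full set of LLM query tokens rather than one pooled vector.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One transformer block with adaLN-style conditioning (illustrative sizes)."""

    def __init__(self, dim: int = 1152, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 6 * dim)      # conditioning -> per-block scale/shift/gate

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: [B, N, dim] noisy tokens; cond: [B, dim] pooled timestep + prompt signal.
        s1, b1, g1, s2, b2, g2 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]   # token mixing via attention
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)                              # per-token refinement
```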
Concrete I/O example
- Input: “A blue bicycle leaning on a red brick wall, crisp readable store sign above.”
- Steps: LLM shapes 256 queries → DiT denoises 256×1152 features using a 50-step Euler sampler with the shifted schedule → RAE decoder renders pixels.
- Output: Sharp bricks, correct colors, and readable sign text—thanks to high-dimensional features and text-focused decoder training.
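Putting the pieces together, here is a sketch of that 50-step Euler sampling loop over a dimension-shifted schedule, followed by a single decode. `dit`, `decoder`, `cond`, and `base_dim` are placeholders or assumptions.

```python
import math
import torch

@torch.no_grad()
def sample(dit, decoder, cond, steps=50, tokens=256, channels=1152, base_dim=4096):
    """Euler sampling over a dimension-shifted schedule, then one decode (sketch)."""
    # Shifted time grid from 1 (pure noise) to 0 (clean features), as sketched earlier;
    # `base_dim` is an assumed reference size.
    alpha = math.sqrt(tokens * channels / base_dim)
    t_lin = torch.linspace(1.0, 0.0, steps + 1)
    t_grid = alpha * t_lin / (1.0 + (alpha - 1.0) * t_lin)

    x = torch.randn(1, tokens, channels)              # start from pure noise
    for i in range(steps):
        t, t_next = t_grid[i], t_grid[i + 1]
        v = dit(x, t.view(1), cond)                   # predicted velocity at time t
        x = x + (t_next - t) * v                      # Euler step toward clean features
    return decoder(x)                                 # single decode from features to pixels
```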
04 Experiments & Results
The test: What did they measure and why?
- Reconstruction fidelity (rFID-50k): Do decoders faithfully rebuild images, especially tricky text scenes?
- Text-to-image alignment and quality: GenEval (object-focused, prompt alignment) and DPG-Bench (semantic alignment). Higher is better.
- Convergence speed: How quickly scores improve during pretraining.
The competition: Who/what was compared?
- RAE-based diffusion (frozen SigLIP-2 encoder + trained decoder; DiT in 256×1152 latent space).
- State-of-the-art VAE baseline from FLUX (compressive latent space).
- Multiple scales: DiTs from 0.5B to 9.8B parameters; LLMs from 1.5B to 7B.
The scoreboard with context
- Decoder scaling and data:
- Adding web and synthetic data brings moderate gains on diverse images (YFCC) but barely changes ImageNet.
- Adding text-rendering data dramatically improves text reconstruction (like going from a fuzzy C to a crisp C on a vision chart).
- RAE decoders beat SDXL-VAE on reconstruction but still trail FLUX-VAE; among RAEs, a strong SSL encoder (WebSSL) can reconstruct best.
- Pretraining speed:
- With a 1.5B LLM + 2.4B DiT, RAE converges about 4.0× faster on GenEval and 4.6× on DPG-Bench versus VAE.
- Think of this like finishing your homework in 15 minutes when others need a full hour—and getting a higher grade.
- Scaling DiT:
- Across 0.5B, 2.4B, 5.5B, 9.8B DiTs (LLM fixed at 1.5B), RAE consistently outperforms VAE.
- Returns start to flatten beyond ~6B unless data quality keeps improving—like lifting heavier weights without better nutrition.
- Scaling LLM:
- Upgrading the LLM from 1.5B to 7B gives extra gains, especially with RAE, likely because larger DiTs and trainable LLMs can better use rich language cues.
- Stability in fine-tuning:
- On a 60k high-quality set, VAE models peak then overfit after ~64 epochs (scores drop).
- RAE models stay strong and stable up to 256+ epochs, and even at 512 epochs show only mild decline.
- This is like a runner who keeps pace over a marathon while others sprint and fade early.
- Unified modeling and verification:
- Adding generation did not hurt understanding performance (MME, TextVQA, AI2D, SeedBench, MMMU): RAE and VAE setups score similarly because both use the same frozen encoder.
- Latent-space test-time scaling works: best-of-N with LLM-based scores (prompt confidence or Yes/No logit) consistently boosts GenEval, without decoding every candidate.
Surprising findings
- One simple trick dominates: the dimension-aware noise schedule shift nearly doubles GenEval in a tough setting; removing it tanks performance.
- Architectural extras that helped at small scale (e.g., a special wide denoising head) mostly stop helping once DiTs are big; standard DiTs are enough.
- Data composition beats data count: Mixing synthetic + web outperforms doubling either alone, showing complementary strengths.
- An SSL encoder (WebSSL) can power an RAE that’s still better than a VAE in T2I—even without explicit text alignment.
Numbers in plain words
- Noise schedule shift: like jumping from a D to a solid B+/A- on GenEval.
- Convergence: RAE hits good scores about four times sooner than VAE—less time, less compute, better images.
- Fine-tuning: VAE’s curve nose-dives after 64 epochs (overfitting), while RAE cruises comfortably to 256+.
- Test-time scaling: picking the best of 4, 8, 16, or 32 latent samples using LLM-based scores gives steady, noticeable boosts without extra decoding costs during selection.
05 Discussion & Limitations
Limitations
- Decoder ceiling: Although scaled RAEs outperform SDXL-VAE in reconstruction, they still trail the best FLUX-VAE on pure reconstruction in some domains. If you need the absolute sharpest pixel rebuilds without generative training, a top-tier VAE may still win today.
- Data dependency: Text fidelity requires targeted text-rendering data; simply adding more generic images (even lots of them) won’t fix missed letter shapes. Careful data composition is a must.
- Diminishing returns at scale: Once DiTs get very large (~6B+), gains flatten unless data quality and diversity keep scaling too. Bigger isn’t always better without better fuel.
- Encoder choice matters: SigLIP-2 works great; WebSSL can be even stronger for reconstruction in places, but the ecosystem still needs more head-to-heads across tasks and resolutions.
- Resolution and domains: Experiments commonly use 224–256 feature resolutions for RAEs and 256–512 outputs for VAEs; extending to very high-resolution photorealism or specialized domains (medical, maps) needs more testing and data.
Required resources
- Hardware: Large-scale training used TPU v4/v5p/v6e; similar GPU clusters work but still require serious compute for 2B–10B DiTs and 1.5B–7B LLMs.
- Data: Mixed web + synthetic + text-rendering corpora; curated fine-tuning sets.
- Software: A training stack that supports flow matching, dimension-aware schedules, and unified LLM–DiT pipelines.
When not to use
- Ultra-low-latency, on-device generation where a tiny VAE latent may still be preferable due to memory/compute limits.
- Scenarios needing maximum compression for storage or streaming; RAEs deliberately keep high-dimensional codes.
- Projects without the capacity to curate targeted data (e.g., text-rendering) when text fidelity is a must-have.
Open questions
- Any-resolution and high-resolution scaling: How do RAEs perform at 1024–4096 px generation? What decoder/training tweaks help?
- Jointly improving the encoder: Could light, safe updates to the frozen encoder further boost both understanding and generation?
- Video and 3D: Do RAEs help temporal coherence in video or structure in 3D/NeRF-style generation?
- Better verifiers: Which latent-space scoring methods (beyond prompt confidence and Yes/No logits) best predict human preferences?
- Safety and alignment: How do shared latent spaces affect content filtering, watermarking, and controllable generation?
06 Conclusion & Future Work
Three-sentence summary
- This paper shows that running diffusion in a high-dimensional representation space with a frozen encoder and a trained RAE decoder is a simpler, stronger foundation than VAEs for text-to-image at scale.
- A single crucial ingredient—dimension-aware noise scheduling—makes high-dimensional diffusion stable, while other architectural tricks matter little once models are large.
- With better data composition (web + synthetic + text-rendering), RAE-based models converge 4× faster, generalize better, and resist overfitting during long fine-tuning, and they enable latent-space test-time scaling in unified models.
Main achievement
- Establishing RAEs as a practical, superior backbone for large-scale text-to-image diffusion: faster convergence, better alignment, more stable fine-tuning, and a shared space where understanding and generation work together.
Future directions
- Push to higher resolutions and broader domains (e.g., documents, maps, medical) with tuned decoder recipes and schedules.
- Explore gentle encoder co-training, richer verifier signals for latent-space selection, and extensions to video/3D.
- Refine data composition strategies and open-source tools for targeted domains like typography and diagrams.
Why remember this
- The work overturns the “VAEs by default” assumption for T2I and offers a clear, simpler recipe that scales.
- It bridges seeing and drawing in one space, letting the model judge its own outputs without expensive detours.
- It’s a practical path to better, faster, and more robust generative systems that creators and engineers can build on today.
Practical Applications
- Design crisp marketing posters with readable small text and logos that stay sharp.
- Generate instructional diagrams and infographics where labels and arrows must be accurate.
- Create variations for product shots quickly, then select the best one via latent-space scoring.
- Prototype children’s books or comics with consistent characters and legible speech balloons.
- Produce UI mockups where fine fonts and icons remain clear through iterations.
- Build scientific visuals (charts, labeled sketches) that need precise, readable annotations.
- Speed up creative workflows by converging to good images in fewer training steps.
- Specialize models for certain styles (e.g., typography, signage) without heavy overfitting.
- Run quality control by letting the LLM verify generations directly in latent space before rendering.
- Support multilingual image generation where non-Latin scripts (e.g., Arabic, Chinese) must be legible.