
SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Intermediate
Minglei Shi, Haolin Wang, Borui Zhang et al. · 12/12/2025
arXiv · PDF

Key Summary

  ‱ This paper shows you can train a large text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.
  ‱ Their model, called SVG-T2I, treats text and image features as one long sequence and learns with flow matching, a stable way to train diffusion models.
  ‱ It reaches about 0.75 on GenEval and 85.78 on DPG-Bench, roughly matching strong VAE-based systems like SD3-Medium and FLUX.1 in overall quality.
  ‱ They provide two autoencoders: a simple one that just uses frozen DINOv3 features (Autoencoder-P) and an optional residual one (Autoencoder-R) that adds extra crisp details when needed.
  ‱ High-resolution inputs let DINOv3 features keep fine details well, so the simple autoencoder is often enough at larger sizes (like 1024×1024).
  ‱ They train the generator in four stages: alignment at 256 and then 512 resolution on 60M general images, detail refinement at 1024 on 15M realistic images, and a final aesthetic polish on 1M high-quality images.
  ‱ Compared to VAE latents, DINO features carry richer semantics but shift across scales (resolutions), which can make training across multiple sizes tricky.
  ‱ The team open-sourced the code, weights, and full training/evaluation pipeline so others can build on this representation-first pathway.
  ‱ Limitations include very fine facial details, tricky fingers, and reliable text rendering, which often need more specialized data and compute.
  ‱ The bigger idea is a unified visual space that supports understanding, perception, reconstruction, and generation with a single encoder.

Why This Research Matters

SVG-T2I shows that generation can happen directly in a semantic feature space, pointing toward a future where one encoder powers both understanding and creation. This could simplify AI systems used in art, design, education, accessibility, and content moderation by reducing the need for separate encoders. With fewer moving parts, developers may ship faster, more reliable tools that align closely with user intent. Bilingual, long-caption support hints at better global usability, serving creators across languages. Open-sourcing code and weights invites rapid community innovation, making advanced image generation more accessible. Insights about scale invariance will guide the next wave of robust, resolution-agnostic visual models. Overall, this work nudges AI toward cleaner, unified foundations that benefit practical apps and research alike.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how you can describe a picture to a friend, and they can imagine it in their head? Computers are learning to do the reverse: turn your words into pictures.

đŸ„Ź Filling (The Actual Concept):

  • What it is: Text-to-image diffusion models are AI systems that read a caption and gradually paint an image that matches it.
  • How it works:
    1. Start with noisy, messy pixels (like TV static).
    2. Take small steps to clean the noise while following the text’s clues.
    3. Keep refining until a clear image appears.
  • Why it matters: Without this process, the AI’s pictures would look random and not match your words.

🍞 Bottom Bread (Anchor): When you type “a red fox jumping over a log at sunset,” the model slowly turns static into a bright scene with the fox and the sunset.

🍞 Top Bread (Hook): Imagine a giant library in the AI’s head that organizes visual facts—shapes, textures, and meanings—so it can recognize and create images.

đŸ„Ź Filling (The Actual Concept):

  • What it is: Visual Foundation Models (VFMs) are big vision encoders trained without labels to learn powerful visual features.
  • How it works:
    1. See tons of images.
    2. Learn to notice patterns (edges, objects, layouts) by predicting or matching parts.
    3. Pack these patterns into feature maps (like smart summaries of images).
  • Why it matters: Without VFM features, the model would miss important semantic clues and struggle to align images with text.

🍞 Bottom Bread (Anchor): DINOv3 (a VFM, here ViT-S/16) splits a 1024×1024 photo into 16×16-pixel patches, giving a neat 64×64 grid of tokens, each holding 384 numbers that describe content and meaning.
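To make that size bookkeeping concrete, here is a tiny Python sketch of the patch-grid arithmetic, assuming a ViT-S/16-style encoder (16-pixel patches, 384-dimensional features); the function name is just for illustration.

```python
# Patch-grid arithmetic for a ViT-S/16-style encoder (16-pixel patches, 384-dim features).
# This only shows the shape bookkeeping, not a real DINOv3 forward pass.

def patch_grid_shape(height: int, width: int, patch_size: int = 16, dim: int = 384):
    """Return (grid_h, grid_w, dim) for the feature grid of one image."""
    assert height % patch_size == 0 and width % patch_size == 0, "sizes must divide evenly"
    return height // patch_size, width // patch_size, dim

print(patch_grid_shape(1024, 1024))  # (64, 64, 384): a 64x64 grid of tokens, 384 numbers each
print(patch_grid_shape(256, 256))    # (16, 16, 384): a much coarser grid at low resolution
```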

🍞 Top Bread (Hook): Think of shrinking a giant painting into a small postcard so it’s easier to mail—but you still want it to look great when someone sees it.

đŸ„Ź Filling (The Actual Concept):

  • What it is: Latent Diffusion Models (LDMs) generate images in a smaller “latent” space and then decode back to pixels.
  • How it works:
    1. An encoder compresses an image into a compact code.
    2. Diffusion is trained to denoise in this code space.
    3. A decoder turns the cleaned code back into an image.
  • Why it matters: Without latents, training and sampling would be much slower and more expensive.

🍞 Bottom Bread (Anchor): Stable Diffusion uses a VAE to compress images so generation runs faster, then decodes the final latent into a full picture.

🍞 Top Bread (Hook): Imagine using an artist’s sketchbook directly, instead of copying their sketches onto another notebook first.

đŸ„Ź Filling (The Actual Concept):

  • What it is: This paper generates images directly in the VFM feature space (DINOv3), skipping the usual VAE pathway.
  • How it works:
    1. Use DINOv3 to represent images as semantic features.
    2. Train a diffusion transformer to denoise in this feature space, guided by text.
    3. Decode features back to pixels with an image decoder.
  • Why it matters: Without using the VFM’s native space, we juggle multiple encoders (understanding vs. generation) and lose semantic structure.

🍞 Bottom Bread (Anchor): Instead of compressing with a VAE, SVG-T2I trains right on DINOv3 features and still produces high-quality 1024×1024 images aligned to text.

The world before: Most top text-to-image systems used VAEs to speed up diffusion. VAEs are excellent at compact compression and stable across scales (resolutions), but their latent spaces were not designed for semantic structure. That means the codes are great for reconstruction speed, yet less naturally aligned with tasks like understanding or reasoning, and you often need a separate vision encoder (like CLIP or SigLIP) for understanding.

The problem: Could we ditch the VAE and work directly in the VFM feature space—keeping strong semantics—without losing image quality or training stability at large scales and resolutions? And can one unified feature space support reconstruction, perception, and generation?

Failed attempts and limits: Earlier works trained on class-conditional ImageNet (cats, dogs, cars), but that setting is small and familiar to most encoders. It didn’t prove that the same VFM features would scale to open-ended, text-heavy prompts at high resolution.

The gap: We lacked a full, large-scale, text-to-image recipe running natively in a VFM’s feature space, plus open-source code and weights so the community could test and extend it.

Real stakes: One encoder for everything could simplify systems (fewer moving parts), reduce engineering overhead, and make it easier to mix perception (what’s in the image) and generation (make an image). For everyday life, this means faster, smarter creative tools, easier customization, and better alignment between what you ask for and what gets drawn—whether you’re illustrating homework or designing an ad.

02 Core Idea

🍞 Top Bread (Hook): Imagine building with LEGO bricks that already know what a house or a tree looks like, so you can assemble scenes faster and more accurately.

đŸ„Ź Filling (The Actual Concept):

  • What it is: The key insight is to scale text-to-image diffusion directly in the VFM feature space (DINOv3), removing the VAE while keeping a simple, strong pipeline.
  • How it works:
    1. Represent images as DINOv3 features (a semantic grid of patch tokens, one per 16×16-pixel patch, each with 384 channels).
    2. Use a single-stream diffusion transformer (Unified Next-DiT) that ingests both text and image tokens as one sequence for tight cross-modal interaction.
    3. Train with flow matching, a stable objective that learns how to move from noise to data in feature space.
    4. Decode features back to pixels with a lightweight decoder.
  • Why it matters: Without operating in the VFM space, you miss out on a unified, semantically organized latent that can serve understanding, reconstruction, and generation together.

🍞 Bottom Bread (Anchor): SVG-T2I reaches about 0.75 on GenEval and 85.78 on DPG-Bench, showing this ‘semantic-first’ route can hang with state-of-the-art VAE-based models.

Three analogies for the same idea:

  • Library analogy: Instead of stuffing books into random boxes (VAE latents), you shelve them in a well-labeled library (VFM features). Finding and using the right knowledge becomes easier.
  • Map analogy: Instead of a blurry, fast-driving route (VAE), you drive on a semantically labeled map (VFM) with clear signs (features) telling you where objects and textures likely belong.
  • Recipe analogy: Instead of mixing unknown powders (opaque latents), you use named ingredients (VFM features like edges, textures, parts), so the chef (diffusion) can cook the exact dish.

Before vs. After:

  • Before: Separate encoders for understanding and generation; VAE latents fast but semantically messy.
  • After: One encoder style for many tasks; diffusion learns directly on semantically rich features; comparable quality to top systems.

🍞 Top Bread (Hook): You know how a GPS doesn’t show equations but still gives you a clear path? Flow matching is like that—guiding the model from noise to features smoothly.

đŸ„Ź Filling (The Actual Concept):

  • What it is: Flow matching is a way to train the model to predict the direction and speed to move noisy features toward clean features.
  • How it works:
    1. Mix a clean feature with noise based on a time step t.
    2. Ask the model to predict the velocity that would unmix it.
    3. Learn this across many samples and times until it reliably points home.
  • Why it matters: Without flow matching, training can be less stable or slower, and the model may not learn smooth, accurate paths.

🍞 Bottom Bread (Anchor): When SVG-T2I sees a noisy feature patch at t=0.6, it predicts the push needed to reach the clean DINOv3 feature—step by step until the full image appears.
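To ground that picture, below is a minimal sampling sketch, assuming a rectified-flow convention where t = 1 is pure noise and t = 0 is clean data; `model` and its call signature `(x_t, t, text_emb)` are placeholders, not the released SVG-T2I interface.

```python
import torch

@torch.no_grad()
def sample_features(model, text_emb, shape=(1, 64 * 64, 384), steps=50, device="cpu"):
    """Euler-integrate a learned velocity field from noise (t=1) toward clean features (t=0).

    Assumes model(x_t, t, text_emb) returns the predicted velocity; the real interface
    and time convention in SVG-T2I may differ.
    """
    x = torch.randn(shape, device=device)              # start from pure noise in feature space
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        v = model(x, t.expand(shape[0]), text_emb)     # predicted "push" at this noise level
        x = x + (t_next - t) * v                       # one small Euler step toward the data
    return x                                           # denoised DINOv3-like feature tokens
```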

Building blocks:

  • DINOv3 encoder: creates semantic features from images (frozen for Autoencoder-P).
  • Autoencoders: Autoencoder-P (pure DINO features) and optional Autoencoder-R (adds a residual branch for extra crisp textures/colors when needed).
  • Unified Next-DiT: a single-stream transformer that processes text and DINO features in one sequence, improving cross-modal attention and parameter efficiency.
  • Decoder: maps features back to pixels.
  • Progressive training: low/mid-res for alignment → high-res for details → aesthetic fine-tuning for polish.

Why it works (intuition):

  • The VFM space is already a smart, semantically arranged playground. Learning to denoise there is like sculpting with pre-shaped clay. The single-stream DiT lets words and visual features talk directly, so the picture follows the prompt closely. High-resolution training unlocks the fine details DINO naturally encodes, making reconstructions and generations crisp without needing a heavy VAE.

03 Methodology

At a high level: Text prompt → tokenize (Gemma2-2B embeddings) → diffusion in DINOv3 feature space with Unified Next-DiT (flow matching) → decode features to pixels → final image.
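As a glue-level sketch of that flow, the function below just strings the three stages together; every argument is a placeholder callable, not the released SVG-T2I API.

```python
def generate_image(prompt, text_encoder, denoiser, feature_decoder):
    """Sketch of the pipeline above: prompt -> embeddings -> feature-space denoising -> pixels."""
    text_emb = text_encoder(prompt)     # Gemma2-2B-style contextual prompt embeddings
    features = denoiser(text_emb)       # flow-matching denoising in the DINOv3 feature space
    return feature_decoder(features)    # decode the feature grid back to an RGB image
```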

Step 0: Representations and autoencoders

🍞 Top Bread (Hook): Imagine photographing a scene and saving both a summary (features) and the full photo (pixels) so you can rebuild the picture later.

đŸ„Ź Filling (The Actual Concept):

  • What it is: The paper uses DINOv3 features as the working space and provides two autoencoders to decode features back to pixels.
  • How it works:
    1. Autoencoder-P: uses frozen DINOv3 features (ViT-S/16) as inputs to a decoder.
    2. Autoencoder-R: adds a trainable residual ViT branch to inject high-frequency details and fix color casts.
    3. Both share the same decoder architecture to reconstruct images.
  • Why it matters: Without a strong decoder, feature-space generations would never turn into sharp, realistic images.

🍞 Bottom Bread (Anchor): Feed a 1024×1024 image through DINOv3 (downsample 16×), then let the decoder rebuild pixels from the 64×64×384 feature grid.
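Below is a toy feature-to-pixel decoder, only to show how four 2× upsampling stages can undo the 16× spatial downsampling; the layer choices and widths are illustrative assumptions, not the paper's decoder architecture.

```python
import torch
import torch.nn as nn

class FeaturePixelDecoder(nn.Module):
    """Toy decoder: undo the 16x spatial downsampling of a ViT-S/16 feature grid.

    Four stride-2 transposed convolutions give 2*2*2*2 = 16x upsampling; channel widths
    here are arbitrary and far smaller than a real decoder's.
    """

    def __init__(self, feat_dim: int = 384, hidden: int = 64):
        super().__init__()
        widths = [hidden * 4, hidden * 4, hidden * 2, hidden, hidden]
        ups = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            ups += [nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1), nn.GELU()]
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim, widths[0], kernel_size=3, padding=1), nn.GELU(),
            *ups,
            nn.Conv2d(widths[-1], 3, kernel_size=3, padding=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, 384, h, w) -> image: (B, 3, 16*h, 16*w)
        return self.net(feats)

decoder = FeaturePixelDecoder()
feats = torch.randn(1, 384, 32, 32)   # grid for a 512x512 input (512 / 16 = 32)
print(decoder(feats).shape)           # torch.Size([1, 3, 512, 512]); a 64x64 grid would give 1024x1024
```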

Training the autoencoder (progressive):

  • Pretrain on ImageNet (about 1.2M images) at 256×256 for 40 epochs to learn stable decoding.
  • Multi-resolution fine-tune on a 3M set at native sizes: anchor at 512 for ~10M seen images, then 1024 for ~6M more.
  • Observation: At high resolution, plain DINOv3 features preserve details well; the residual branch becomes optional.

Step 1: Text conditioning

🍞 Top Bread (Hook): You know how a news anchor reads a script before going live? The model also needs a clear script from your prompt.

đŸ„Ź Filling (The Actual Concept):

  • What it is: A text encoder (Gemma2-2B) turns the prompt into token embeddings the image generator can understand.
  • How it works:
    1. Tokenize the prompt (max length 256 early, 512 in final tuning).
    2. Produce contextual embeddings that capture meaning, style, and details.
    3. Mix these embeddings into the diffusion transformer’s sequence.
  • Why it matters: Without rich text embeddings, the picture won’t match the request (objects, colors, positions).

🍞 Bottom Bread (Anchor): “A turquoise river in a dense forest” becomes a sequence of vectors that steer the DiT to paint teal water, mossy rocks, and trees.
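For a feel of what this stage looks like in code, here is a minimal sketch with Hugging Face transformers; the exact checkpoint, truncation length, and any extra projection the paper applies are assumptions, and the Gemma 2 weights are gated, so you need access to download them.

```python
# Minimal prompt -> contextual-embedding sketch with Hugging Face transformers.
# Checkpoint choice and preprocessing are assumptions, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "google/gemma-2-2b"                 # gated on the Hub; request access first
tokenizer = AutoTokenizer.from_pretrained(model_id)
text_encoder = AutoModel.from_pretrained(model_id)

prompt = "A turquoise river in a dense forest"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    # One contextual vector per token; these are the text tokens the DiT attends to.
    text_emb = text_encoder(**inputs).last_hidden_state   # (1, num_tokens, hidden_dim)

print(text_emb.shape)
```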

Step 2: Single-stream diffusion transformer (Unified Next-DiT)

🍞 Top Bread (Hook): Picture a roundtable where words and image pieces sit together, talking directly, instead of passing notes across rooms.

đŸ„Ź Filling (The Actual Concept):

  • What it is: A single-stream transformer that treats text tokens and image tokens as one long sequence.
  • How it works:
    1. Concatenate text embeddings and noisy DINO features.
    2. Apply attention layers so text can guide any image patch directly.
    3. Predict a velocity field (v-prediction) used by flow matching to denoise features.
  • Why it matters: Without this joint sequence, cross-modal alignment can be weaker, and parameters less efficiently used.

🍞 Bottom Bread (Anchor): The token for “red flower” can directly guide the patch near petals, strengthening color and shape at the right spots.
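Here is a stripped-down single-stream block to make the "one joint sequence" idea concrete, assuming text and image tokens are already projected to a shared width; the real Unified Next-DiT adds timestep conditioning, positional encodings, and many more layers.

```python
import torch
import torch.nn as nn

class SingleStreamBlock(nn.Module):
    """Toy single-stream transformer block: text and image tokens share one sequence,
    so self-attention lets any word attend to any image patch directly."""

    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, text_tokens, image_tokens):
        x = torch.cat([text_tokens, image_tokens], dim=1)     # one joint sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]     # joint self-attention
        x = x + self.mlp(self.norm2(x))
        return x[:, text_tokens.shape[1]:]                    # keep the (updated) image tokens

block = SingleStreamBlock()
text = torch.randn(1, 32, 384)           # 32 text tokens, projected to the shared width
patches = torch.randn(1, 64 * 64, 384)   # noisy 64x64 grid of DINOv3-style patch tokens
print(block(text, patches).shape)        # torch.Size([1, 4096, 384])
```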

Step 3: Flow matching training

🍞 Top Bread (Hook): Think of turning a blurry photo into a clear one by following arrows that tell you which way to sharpen.

đŸ„Ź Filling (The Actual Concept):

  • What it is: A training objective where the model learns the direction to move from a noisy feature to a clean one at each time step.
  • How it works:
    1. Mix clean DINO features with Gaussian noise based on t.
    2. Ask the model for the velocity to undo that mix.
    3. Optimize so predicted and true velocities match.
  • Why it matters: Without this stable target, training could wobble and produce artifacts.

🍞 Bottom Bread (Anchor): At t=0.8, the feature is very noisy; the model predicts a big corrective step; at t=0.2, the step is small, gently polishing details.
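A minimal training-step sketch of this objective, assuming the common rectified-flow parameterization (x_t = (1 - t)·clean + t·noise, so the target velocity is noise - clean); the paper's exact time sampling and weighting may differ, and `model` is a placeholder.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, clean_feats, text_emb):
    """One flow-matching training step in feature space (rectified-flow style sketch)."""
    b = clean_feats.shape[0]
    t = torch.rand(b, device=clean_feats.device)      # one random time per sample in [0, 1)
    t_ = t.view(b, 1, 1)                              # broadcast over tokens and channels
    noise = torch.randn_like(clean_feats)

    x_t = (1 - t_) * clean_feats + t_ * noise         # noisy interpolant between data and noise
    target_v = noise - clean_feats                    # true velocity along the straight path
    pred_v = model(x_t, t, text_emb)                  # model's predicted velocity

    return F.mse_loss(pred_v, target_v)               # big mismatch -> big corrective gradient
```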

Step 4: Progressive multi-stage training of the generator

🍞 Top Bread (Hook): Just like athletes train basics first and add complexity over time, the model learns alignment first, then fine detail, then style.

đŸ„Ź Filling (The Actual Concept):

  • What it is: A four-stage recipe that grows resolution and data quality.
  • How it works:
    1. Low-res stage (256): 60M general images (mixed captions) to lock in text–image alignment and low-frequency structure (about 140M seen samples).
    2. Mid-res stage (512): another 60M to stabilize composition and object relations (about 70M seen samples).
    3. High-res stage (1024): 15M realistic images to sharpen details like textures and edges (about 34M seen samples).
    4. HQ aesthetic tuning (1024): 1M high-aesthetic images, longer text max (512), to improve beauty and realism (about 30M seen samples).
  • Why it matters: Skipping stages risks poor alignment (early) or blurry details (late). The schedule smooths learning.

🍞 Bottom Bread (Anchor): The same prompt looks simpler and softer after stage 1, gains structure at stage 2, becomes crisp at stage 3, and looks polished and stylish after stage 4.
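The same curriculum written out as a plain config, so the numbers above are easy to sanity-check; the "passes over its dataset" figure is derived arithmetic (seen samples / dataset size), not a number reported in the text.

```python
# The four-stage curriculum described above, as a plain config.
STAGES = [
    {"name": "low-res alignment",   "resolution": 256,  "dataset_images": 60_000_000, "seen_samples": 140_000_000},
    {"name": "mid-res composition", "resolution": 512,  "dataset_images": 60_000_000, "seen_samples": 70_000_000},
    {"name": "high-res detail",     "resolution": 1024, "dataset_images": 15_000_000, "seen_samples": 34_000_000},
    {"name": "aesthetic tuning",    "resolution": 1024, "dataset_images": 1_000_000,  "seen_samples": 30_000_000},
]

for stage in STAGES:
    passes = stage["seen_samples"] / stage["dataset_images"]   # rough epochs over that stage's data
    print(f'{stage["name"]:<20} {stage["resolution"]:>4}px  ~{passes:.1f} passes over its dataset')
```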

Data and captions:

  • Bilingual captions (English 80%, Chinese 20%).
  • Caption lengths sampled (short/mid/long) with probabilities (0.10, 0.35, 0.55) in main sets; final tuning favors long.
  • Text embeddings from Gemma2-2B provide multilingual understanding.
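A tiny sketch of the caption-length mixing described above; bucket boundaries and the 80/20 English/Chinese split are handled elsewhere in the real pipeline, and the helper here is purely illustrative.

```python
import random

# Reported mixing probabilities for caption length buckets (short / mid / long).
LENGTH_BUCKETS = ["short", "mid", "long"]
WEIGHTS = [0.10, 0.35, 0.55]

def pick_caption(captions_by_length: dict) -> str:
    """Pick one caption for a training sample according to the length mixture."""
    bucket = random.choices(LENGTH_BUCKETS, weights=WEIGHTS, k=1)[0]
    return captions_by_length[bucket]

example = {
    "short": "a red fox",
    "mid": "a red fox jumping over a log at sunset",
    "long": "a red fox mid-leap over a mossy log at sunset, warm rim light, shallow depth of field",
}
print(pick_caption(example))
```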

Secret sauce:

  • Stay in the semantic-rich VFM feature space.
  • Use a single-stream DiT to maximize text–image interaction.
  • Rely on high-res training where DINOv3 features shine for fine details.
  • Flow matching for stable, efficient denoising learning.
  • Optional residual encoder only when extra crispness is needed, keeping the core pipeline simple.

What breaks without each step:

  • No text encoder: images drift off-prompt.
  • No single-stream: weaker object–attribute and spatial control.
  • No flow matching: less stable training; poorer convergence.
  • No progressive scaling: either misaligned (if you jump to high-res) or blurry (if you never scale up).
  • No decoder: features never become pixels.

Example with actual data:

  • Prompt: “A woodpecker on a mossy branch, soft daylight, shallow depth of field.”
  ‱ Text encoder: creates token embeddings for ‘woodpecker’, ‘mossy branch’, ‘soft daylight’, and ‘shallow depth of field’.
  • DiT + flow matching: denoises from random DINO features toward a bird shape with red crown, black-white patterning.
  • Decoder: renders feathers sharp, background smooth and blurred, matching the prompt.

04 Experiments & Results

The tests: The team measured how well SVG-T2I follows instructions (GenEval) and handles diverse prompts and details (DPG-Bench). Both are standard scorecards for text alignment, attributes, counting, color/position, and overall visual quality at 1024×1024.

Competition: They compared against strong baselines—SDXL, DALL-E 2/3, SD3-Medium, FLUX.1, HiDream-I1-Full, and others, including autoregressive and diffusion-based models.

Scoreboard with context:

  • GenEval ≈ 0.75 overall (with LLM-rewritten prompts), which is like getting an A when many strong classmates get A– or B+ (comparable to SD3-Medium and above SDXL/DALL-E 2 by a clear margin).
  • DPG-Bench = 85.78 overall, which lands in the same league as top VAE-based diffusion models like FLUX.1 and HiDream-I1-Full—think of placing within a few points of the class topper on a tough final exam.

What this means: Operating directly in DINOv3’s feature space doesn’t hurt large-scale performance; it holds up against heavy-hitter VAE pipelines. The unified, semantic latent works for real-world text prompts, not just class labels.

Surprising findings:

  • High-resolution inputs make DINOv3 features especially detail-rich. Reconstructions from 1024 inputs look much crisper than 256, suggesting the encoder holds more texture and geometry at larger sizes.
  • Compared to VAE latents (which are very scale-stable), DINO features shift across resolutions. This scale dependence didn’t block success, but it highlighted an important research need: make VFM features more consistent across sizes.

Ablations and recipes (takeaways):

  • Optional residual encoder is handy at lower resolutions or when extra crispness is crucial; at higher resolutions, plain DINOv3 features may suffice.
  • Longer captions (and bilingual data) improved alignment and instruction-following.
  • The four-stage curriculum steadily upgraded realism and aesthetics (visually confirmed in their stage-by-stage samples).

05 Discussion & Limitations

Limitations (be specific):

  • Very fine human details (eyes, eyebrows) can be inconsistent.
  • Hands/fingers under complex poses remain challenging (shape distortions).
  • Text rendering on images is less reliable (spelling/clarity can wobble).
  • Multi-scale training is harder because DINO features change across resolutions.

Required resources:

  • Large-scale image–text datasets (60M+ general, 15M realistic, 1M aesthetic) and bilingual captions.
  • Substantial compute for multi-stage training at 256→512→1024.
  • A modern text encoder (Gemma2-2B) and a large DiT (~2.6B parameters).

When NOT to use:

  • If you need rock-solid text rendering on images (logos, long phrases) without special training.
  • If your domain demands ultra-precise human anatomy (faces/hands) but you lack targeted data.
  • If you must train across many resolutions and need perfect scale invariance right now.

Open questions:

  • How to make VFM features scale-invariant without losing their strong semantics?
  • Can we jointly fine-tune the VFM encoder for both understanding and generation while keeping transferability?
  • What’s the best way to improve text rendering—data, architecture (token fusion), or loss design?
  • Can a single, universal encoder robustly power perception, reconstruction, and generation across all resolutions and modalities?
  • How far can flow matching and single-stream transformers scale with even larger, more diverse datasets?

06 Conclusion & Future Work

Three-sentence summary: This paper scales a text-to-image diffusion model directly in a Visual Foundation Model’s feature space, skipping the usual VAE. The resulting system, SVG-T2I, achieves competitive scores on GenEval (~0.75) and DPG-Bench (85.78), showing that a semantic, representation-first pathway can match strong VAE-based pipelines. The full open-source release (code, weights, training/eval pipeline) invites the community to build unified visual models.

Main achievement: Proving, at scale and high resolution, that you can train T2I diffusion right on DINOv3 features with a single-stream DiT and flow matching, maintaining strong text alignment and image quality without a VAE.

Future directions: Improve scale invariance of VFM features, strengthen face/hand and text rendering via targeted data and architectural tweaks, explore joint tuning of encoders for unified perception–generation, and push to larger, more multilingual datasets and resolutions.

Why remember this: It’s a clean, unified path—generate where semantics live. If VFMs become the common language for seeing and drawing, one encoder could power understanding, reconstruction, and creation, simplifying systems and accelerating progress across vision tasks.

Practical Applications

  ‱ Creative assistants that generate on-brief marketing images from long, detailed prompts.
  ‱ Educational tools that turn student-written descriptions into accurate visual diagrams.
  ‱ Design ideation for product concepts, packaging, and interior layouts using semantic prompts.
  ‱ Multilingual content creation where English and Chinese prompts both yield consistent images.
  ‱ Scientific communication aids that render clear illustrations from descriptive captions.
  ‱ Storyboarding and previsualization for films and games with fine-grained scene control.
  ‱ Brand-safe image generation by aligning semantics and attributes (colors, positions, counts).
  ‱ Rapid prototyping of UI mockups and iconography from textual specs.
  ‱ Image personalization with richer attribute control by operating in a semantic feature space.
  ‱ Research platform to study scale-invariant feature learning and unified encoders.
#text-to-image#diffusion transformer#flow matching#DINOv3#visual foundation model#autoencoder-free latent#single-stream DiT#latent diffusion#scale invariance#multilingual captions#representation learning#high-resolution generation#semantic features#unified encoder#open-source T2I