
GRAN-TED: Generating Robust, Aligned, and Nuanced Text Embedding for Diffusion Models

Intermediate
Bozhou Li, Sihan Yang, Yushuo Guan et al. · 12/17/2025
arXiv · PDF

Key Summary

  ‱ This paper is about making the words you type into a generator turn into the right pictures and videos more reliably.
  ‱ The authors build TED-6K, a fast, text‑only test that predicts how good a text encoder will be for image/video diffusion models—no costly model training needed.
  ‱ They add a tiny "context aggregator" so very different encoders (like CLIP, T5, LLMs, and MLLMs) can be compared fairly.
  ‱ Scores on TED-6K strongly match real generation quality, so teams can try many encoders quickly (about 720× faster than training a full model).
  ‱ Guided by this test, they fine‑tune an MLLM (Qwen3‑VL‑8B) and then combine its layers with smart, learnable weights to form a better text embedding.
  ‱ Their final encoder, called GRAN‑TED, gets the best score on TED-6K and boosts text‑to‑image and text‑to‑video benchmarks.
  ‱ A two‑step training trick (learn the layer weights, then freeze them) keeps the diffusion model stable and improves results.
  ‱ Surprising finding: multimodal models beat plain LLMs on a text‑only test, because their language better matches visual concepts.
  ‱ This work helps fix common failures like wrong object counts, mixed‑up relationships, and ignored negatives in prompts.
  ‱ The code and data make it practical for others to evaluate and build better text encoders for generation.

Why This Research Matters

When you ask an AI to draw something, you expect it to match your words—counts, colors, positions, and what not to include. This work creates a fast way to tell which text encoder will make that happen, saving days of trial and error. It also provides a better encoder (GRAN‑TED) that measurably improves pictures and videos people actually see. Artists get tighter control, educators get clearer visuals, and product teams ship more dependable creative tools. Because the benchmark predicts real outcomes, companies can iterate quickly without training full models each time. In short, it makes text‑to‑visual AI more accurate, faster to develop, and easier to trust.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine giving a very clear recipe to a robot chef—“Make two blue cupcakes on a red plate”—but it serves three green muffins on a napkin. Frustrating, right?

đŸ„Ź The Concept (Diffusion models): What it is: Diffusion models are AI artists that turn noise into pictures or videos while following your text instructions. How it works: (1) Start with random noise; (2) Step by step, remove noise; (3) Use a text guide to decide what shapes and colors should appear; (4) End with a final image or video. Why it matters: Without a good guide, the model may draw the wrong things—like wrong counts, mixed‑up relationships, or missing details. 🍞 Anchor: When you say “a small yellow bird sitting on a big brown branch,” the model needs to understand size, color, and who sits on what as it removes noise.
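A tiny Python sketch of this loop (a toy illustration, not the paper's actual sampler or update rule) shows the structure: start from noise, remove a little noise at each step, and let the text embedding steer each step:

```python
import numpy as np

def toy_denoise_step(x, text_embedding, step_size=0.05):
    """Hypothetical stand-in for a learned denoiser. A real diffusion model
    predicts the noise with a large neural network conditioned on the text
    embedding; here we just nudge the sample toward the text direction so
    the step-by-step structure is visible."""
    return x + step_size * (text_embedding - x)

rng = np.random.default_rng(0)
text_embedding = rng.normal(size=64)   # encoded prompt, e.g. "a small yellow bird on a big brown branch"
x = rng.normal(size=64)                # (1) start from pure random noise

for _ in range(50):                    # (2)-(3) remove a little noise each step, guided by the text
    x = toy_denoise_step(x, text_embedding)

final_sample = x                       # (4) the finished "image" (here just a 64-dim vector)
```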

🍞 Hook: You know how a good translator makes sure your meaning is not lost from one language to another? The same is true for turning words into visuals.

đŸ„Ź The Concept (Text encoder): What it is: A text encoder turns your words into numbers (an embedding) that the diffusion model can understand. How it works: (1) Read the whole prompt; (2) Build a meaning map across tokens; (3) Output a sequence of vectors that capture who, what, where, when, and how; (4) Feed these vectors to the image/video model as a compass. Why it matters: If the compass is fuzzy, the picture goes off course—wrong objects, wrong counts, wrong relationships. 🍞 Anchor: For “two red apples in a blue bowl on the left of a green vase,” the encoder must keep track of number, colors, and left/right.
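As a hedged, concrete example, here is how a prompt becomes a sequence of token vectors using a small T5 encoder from Hugging Face Transformers; the model name is only illustrative, since the paper compares CLIP, T5, LLM, and MLLM encoders:

```python
# Minimal sketch: a prompt becomes a sequence of token embeddings.
# "t5-small" is illustrative; the paper evaluates CLIP, T5, LLM, and MLLM encoders.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

prompt = "two red apples in a blue bowl on the left of a green vase"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs, output_hidden_states=True)

token_sequence = outputs.last_hidden_state  # (1, num_tokens, hidden_dim): the "compass" fed to the generator
all_layers = outputs.hidden_states          # per-layer token features, used later for layer fusion
```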

The world before: Early text‑to‑image models often used CLIP or T5 as text encoders. This worked, but prompts still got misread, especially complex ones with multiple objects, spatial relations, numbers, or negatives (“no shadows”). Teams started swapping in powerful LLMs and multimodal LLMs (MLLMs), hoping their deeper understanding of language would fix these issues.

🍞 Hook: Picture trying to pick the best microphone by singing an entire concert every time. That’s slow and expensive.

đŸ„Ź The Concept (Evaluation bottleneck): What it is: There was no quick, reliable way to predict how a text encoder would affect final generation quality. How it works: People either (1) used unrelated NLP tests or retrieval tasks, which didn’t match generation needs, or (2) trained full diffusion models end‑to‑end for each encoder—very costly. Why it matters: Without a fast, faithful test, teams can’t iterate quickly to find the best encoder. 🍞 Anchor: It’s like testing a soccer shoe by running a marathon—related but not the right test.

Failed attempts: Standard LLM exams (like knowledge tests) don’t measure visual alignment, and retrieval benchmarks squash a whole sentence into one vector, unlike diffusion models which use a full sequence of token embeddings. Visual evaluation with generated images is slow—every candidate encoder would require hours of training.

🍞 Hook: Imagine judging reading comprehension by looking only at the final drawing a student makes from a paragraph. From the drawing alone, you can’t tell where the misunderstanding started.

đŸ„Ź The Concept (Granularity mismatch): What it is: Retrieval tasks use single pooled vectors; diffusion models consume token‑by‑token sequences and subtle layer information. How it works: (1) Retrieval → compress the sentence into one vector; (2) Diffusion → use the full sequence and often specific layers; (3) These pipelines value different signals. Why it matters: A high retrieval score can still give weak conditioning for generation. 🍞 Anchor: A student who memorizes the main idea might ace a summary test but still miss details needed to draw the scene correctly.
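A tiny illustration of the mismatch (shapes are invented for the example):

```python
import torch

token_sequence = torch.randn(1, 12, 768)        # 12 token embeddings from some hypothetical encoder
retrieval_vector = token_sequence.mean(dim=1)   # retrieval-style: one pooled vector, shape (1, 768)
diffusion_condition = token_sequence            # diffusion-style: the full sequence, shape (1, 12, 768)
# A high score computed from the pooled vector says little about how useful
# the full token sequence is as conditioning for a generator.
```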

The gap: A fair, fast, text‑only benchmark that mirrors how diffusion models actually use text embeddings was missing. Also missing: a training plan to adapt powerful LLM/MLLM encoders so their knowledge becomes sharply usable by diffusion models.

Real stakes: Better text encoders mean the picture on your screen actually matches what you asked for—correct counts (“exactly three”), precise layouts (“left of”), faithful styles, and true negatives (“without watermark”). That helps artists iterate faster, companies build reliable creative tools, educators illustrate concepts clearly, and scientists or designers prototype with fewer errors.

02Core Idea

🍞 Hook: You know how chefs taste a spoonful of sauce to judge the whole dish before serving? A tiny, quick taste can predict the final result.

đŸ„Ź The Concept (Aha!): What it is: Build a text‑only "taste test"—TED‑6K—that predicts real image/video generation quality, then use it to train a sharper text encoder (GRAN‑TED) with a two‑stage process and learnable layer mixing. How it works: (1) Create TED‑6K: captions paired with true/false statements across many semantic skills; (2) Add a small context aggregator that mimics what a diffusion model consumes; (3) Score encoders by similarity: the caption should match the true statement more than the tricky false ones; (4) Pick a strong backbone (an MLLM), fine‑tune it on visually grounded Q&A/captions; (5) Fuse all its layers with learnable weights and a two‑step freeze for stability. Why it matters: You can pick and improve encoders 100s of times faster and actually see better generated images/videos. 🍞 Anchor: It’s like a quick, well‑designed quiz that strongly predicts your report card and also teaches you how to study smarter.

Three analogies:

  • Library analogy: The text encoder is a librarian turning your request into a book‑finding plan. TED‑6K checks if the plan points to the right shelves (actions, counts, locations). GRAN‑TED trains the librarian to be even sharper.
  • Orchestra analogy: Each model layer is an instrument. Norm‑Avg is everyone playing at the same volume; learnable weights are a conductor dynamically balancing sections. Two‑step training sets the balance, then keeps it stable so the music (generation) sounds right.
  • Map analogy: The context aggregator folds a long written route into a compact, useful map the driver (diffusion model) can follow.

Before vs. After:

  • Before: Teams guessed using unrelated benchmarks or retrained whole models—slow and noisy feedback. Encoders often missed counts, relations, or negatives.
  • After: With TED‑6K, they get fast, predictive scores; with GRAN‑TED, they gain a state‑of‑the‑art encoder that measurably lifts T2I/T2V quality.

🍞 Hook: Think of mixing paint colors—small changes can make the picture pop.

đŸ„Ź The Concept (Why it works—intuition): What it is: TED‑6K targets exactly the semantics generation needs (actions, spatial/temporal relations, attributes, counts, OCR). How it works: (1) Measure sentence‑level meaning via a trained aggregator (two attention layers + a learnable token); (2) Use hard negatives so the test is challenging; (3) Fine‑tune an MLLM so its language space aligns with visual content; (4) Fuse all layers so both shallow syntax and deep semantics contribute; (5) Freeze learned layer weights mid‑training to keep the conditioning steady, avoiding training instability. Why it matters: This pipeline transmits cleaner, richer instructions to the diffusion model, which shows up as better alignment. 🍞 Anchor: In an example with “VANILLA Latte” on a cup, the method prefers the exact OCR match over near‑miss spellings, just like you would.

Building blocks (with mini sandwiches):

  • 🍞 Hook: Imagine turning many paragraphs into one clean summary card. đŸ„Ź The Concept (Context aggregator): What it is: A tiny transformer that pools token‑level features into one context vector the generator can use. How it works: (1) Normalize/fuse layer outputs; (2) Prepend a learnable context token; (3) Apply two self‑attention blocks; (4) Read the token as the sentence embedding; (5) Train with contrastive pairs to pull same‑image captions together. Why it matters: It mirrors how DiTs consume text, so evaluation is fair and predictive. 🍞 Anchor: Two different captions of the same photo land close together; unrelated captions land far apart.
  • 🍞 Hook: Think of a bilingual friend trained to describe pictures better, not just talk. đŸ„Ź The Concept (Specialized MLLM): What it is: Start with Qwen3‑VL‑8B and fine‑tune on visually grounded VQA/captions. How it works: (1) Curate image/video + rich captions; (2) Build VQA about attributes, counts, spatial/temporal relations; (3) Fine‑tune so text embeddings align with visuals. Why it matters: The model’s language now carries visual instincts useful for generation. 🍞 Anchor: Ask, “How many candles?” or “Which side is the vase?”—it encodes the answers more crisply.
  • 🍞 Hook: In group work, not everyone should talk equally. đŸ„Ź The Concept (Layer‑wise weighting): What it is: Learn a weight for each encoder layer, then softmax‑mix their normalized outputs. How it works: (1) Normalize every layer; (2) Learn scalar weights; (3) Softmax to get a distribution; (4) Weighted sum yields the final embedding. Why it matters: Different layers carry different kinds of knowledge; smart mixing beats using any single layer. 🍞 Anchor: Some layers help with names and counts, others with relations; the mix keeps the best of all.
  • 🍞 Hook: When learning to ride a bike, training wheels help first, then you lock in the posture. đŸ„Ź The Concept (Two‑step freeze): What it is: Train layer weights early, then freeze them so the diffusion model gets a stable condition. How it works: (1) Jointly train encoder + diffusion + weights; (2) After convergence, freeze weights; (3) Continue training diffusion. Why it matters: Prevents a moving target that can destabilize learning. 🍞 Anchor: Their ablation shows continuous updating made things worse; freezing improved scores.

03Methodology

High‑level flow: Input (caption + candidate statements) → Text encoder (extract per‑token features) → Layer fusion (Last layer / Norm‑Avg / Learnable weights) → Context aggregator (two attention layers + learnable token) → Sentence embedding → Similarity scoring → Pick the true statement.

Step 1: Build the TED‑6K evaluation set

  • What happens: The team collects rich images/videos, writes very detailed captions (with Gemini 2.5 Pro), then for each caption creates one true statement and three tricky false statements across 8 skills: action, spatial relation, temporal order, coreference, adjectives, adverbs, quantity, OCR, plus basic event. Humans verify every item.
  • Why it exists: Hard, targeted statements measure the exact semantics that matter for generation and avoid the need to train a full diffusion model.
  ‱ Example: Caption mentions a cup labeled “VANILLA Latte.” The candidates include “VANILLA Latte” (true) and near‑miss spellings like “VAMILLA Latte” (false); a sketch of how such an item can be laid out follows this list.
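Here is a hypothetical sketch of one TED‑6K item in code, using the OCR example above; the field names, the truncated caption, and the last two misspellings are our illustrative assumptions, not the released data format:

```python
# Hypothetical layout of one TED-6K item; field names are ours, not the
# released schema, and the last two misspellings are illustrative negatives.
ted_item = {
    "skill": "ocr",
    "caption": "... a cup labeled 'VANILLA Latte' ...",   # the long, dense caption (truncated here)
    "candidates": [
        "The cup is labeled 'VANILLA Latte'.",   # true statement
        "The cup is labeled 'VAMILLA Latte'.",   # hard negative: near-miss spelling (from the paper's example)
        "The cup is labeled 'VANILA Latte'.",    # illustrative hard negative
        "The cup is labeled 'VANILLA Late'.",    # illustrative hard negative
    ],
    "answer_index": 0,   # an encoder does well if the caption is most similar to candidate 0
}
```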

🍞 Hook: It’s like checking whether two descriptions talk about the same scene. đŸ„Ź The Concept (Contrastive learning for aggregator): What it is: Train the aggregator so two captions of the same media land close together, different media land far apart. How it works: (1) Take two captions of the same image/video; (2) Pass through a frozen encoder; (3) Fuse layers; (4) The aggregator turns sequences into one vector; (5) Contrastive loss pulls positives together, pushes negatives apart. Why it matters: The aggregator becomes a fair, stable sentence‑level pooling method that mimics diffusion consumption. 🍞 Anchor: Two captions of a skier on a steep snowy ridge map close; a motorcycle caption maps far.
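Below is a minimal PyTorch sketch of this idea, assuming the aggregator is a learnable context token plus two self‑attention blocks and training uses a standard InfoNCE‑style contrastive loss; the exact dimensions, temperature, and architecture details are our assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAggregator(nn.Module):
    """Toy context aggregator: prepend a learnable token, run two
    self-attention blocks, read the token out as the sentence embedding."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.context_token = nn.Parameter(torch.randn(1, 1, dim))
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=2)

    def forward(self, token_sequence):              # (batch, tokens, dim): fused encoder features
        batch = token_sequence.size(0)
        ctx = self.context_token.expand(batch, -1, -1)
        x = torch.cat([ctx, token_sequence], dim=1)
        x = self.blocks(x)
        return x[:, 0]                              # the context token = sentence embedding

def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """InfoNCE-style loss: caption i should match the other caption of the
    same image (positives on the diagonal), not captions of other images."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage: two captions per image, encoded by a frozen text encoder (random here).
agg = ContextAggregator()
feats_a = torch.randn(4, 12, 768)   # caption 1 of 4 images
feats_b = torch.randn(4, 12, 768)   # caption 2 of the same 4 images
loss = contrastive_loss(agg(feats_a), agg(feats_b))
```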

Step 2: Unified feature extraction and aggregation

  • What happens: For each encoder under test (CLIP, T5, LLMs, MLLMs), extract token features either from a single layer (last/penultimate), from all layers with Norm‑Avg, or with the learnable layer‑wise weights; then feed them to the same trained aggregator.
  • Why it exists: Diffusion models actually use full token sequences and benefit from multiple layers. This makes comparisons fair across very different architectures.
  • Example: Qwen3‑VL‑8B with Norm‑Avg vs. last layer only—the former uses a richer mix and typically scores higher.

🍞 Hook: Think of adding a spotlight to the right parts of the text. đŸ„Ź The Concept (Layer‑wise weighting): What it is: Assign a learned importance to each encoder layer before aggregating. How it works: (1) LayerNorm each layer’s tokens; (2) Learn one scalar per layer; (3) Softmax to get weights; (4) Weighted sum of layers yields the final sequence; (5) Send to the aggregator. Why it matters: Different tasks need different layer mixtures; learned weights capture this nuance better than flat averaging. 🍞 Anchor: For counting, mid‑layers may matter more; for long‑range relations, deeper layers may carry the signal.
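A minimal sketch of both fusion strategies (shapes and layer counts are illustrative): Norm‑Avg mixes LayerNorm‑ed layers uniformly, while the learnable variant softmaxes one scalar per layer:

```python
import torch
import torch.nn as nn

class LayerFusion(nn.Module):
    """Toy layer fusion: LayerNorm each layer's token features, then mix them
    either uniformly (Norm-Avg) or with learned, softmax-normalized weights."""
    def __init__(self, num_layers, dim, learnable=True):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))
        self.weights = nn.Parameter(torch.zeros(num_layers)) if learnable else None

    def forward(self, hidden_states):                 # list of (batch, tokens, dim), one entry per layer
        normed = torch.stack([n(h) for n, h in zip(self.norms, hidden_states)])  # (layers, B, T, D)
        if self.weights is None:                      # Norm-Avg: every layer speaks at the same volume
            w = torch.full((len(normed),), 1.0 / len(normed))
        else:                                         # learnable: softmax turns scalars into a distribution
            w = torch.softmax(self.weights, dim=0)
        return (w.view(-1, 1, 1, 1) * normed).sum(dim=0)   # fused sequence, (batch, tokens, dim)

# Toy usage with 4 fake layers of features:
hidden_states = [torch.randn(2, 12, 768) for _ in range(4)]
fused = LayerFusion(num_layers=4, dim=768)(hidden_states)
```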

Step 3: Specialize the backbone encoder (GRAN‑TED Stage 1)

  • What happens: Start with Qwen3‑VL‑8B‑Instruct. Fine‑tune on curated captions and visually grounded VQA (attributes, counts, spatial/temporal relations). The goal is not storytelling but producing crisper token embeddings for generation.
  • Why it exists: MLLMs trained on multimodal tasks learn language that aligns with visual concepts. Fine‑tuning pushes that alignment to be generation‑friendly.
  • Example: VQA like “What is to the left of the vase?” or “How many pastries?” tunes the embeddings to disambiguate space and number.

🍞 Hook: Practice first, then standardize your moves. đŸ„Ź The Concept (Two‑step freezing in diffusion training): What it is: During training of the diffusion model, first learn the layer weights, then freeze them. How it works: (1) Train denoiser + layer weights jointly for an initial phase; (2) Once weights stabilize, freeze them; (3) Continue training only the denoiser (and other non‑frozen parts). Why it matters: Prevents a non‑stationary text condition that can confuse learning, leading to better convergence and quality. 🍞 Anchor: Their ablation shows continuous updating hurt performance; freezing improved GenAI‑Bench scores.
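A hedged sketch of the schedule (the freeze step, layer count, and stand‑in denoiser are placeholders, not the paper's exact recipe): before the threshold, the layer weights train jointly with the denoiser; afterwards they are frozen so the text condition stops being a moving target.

```python
import torch
import torch.nn as nn

FREEZE_STEP = 10_000                                 # placeholder threshold, not the paper's value

layer_weights = nn.Parameter(torch.zeros(36))        # one scalar per encoder layer (count is illustrative)
denoiser = nn.Linear(768, 768)                       # stand-in for the diffusion backbone

def configure_trainable(step: int) -> None:
    """Phase 1 (step < FREEZE_STEP): layer weights learn jointly with the denoiser.
    Phase 2: freeze the learned mixture so the conditioning stays stable."""
    layer_weights.requires_grad_(step < FREEZE_STEP)

for step in (0, FREEZE_STEP):                        # show the switch happening
    configure_trainable(step)
    print(f"step {step}: layer weights trainable = {layer_weights.requires_grad}")
```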

Step 4: Evaluate quickly with TED‑6K

  ‱ What happens: For each test item, encode the long caption and each candidate statement. Use the trained aggregator to get a single vector for each. Compute cosine similarities; the correct answer is the statement with the highest similarity to the caption (a minimal code sketch follows this list).
  • Why it exists: This mirrors how a generator checks if text embeddings and conditions match, without rendering any image or video.
  • Example: For the “VANILLA Latte” OCR item, the true statement should score highest over the three near‑miss spellings.
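A minimal sketch of the scoring step, with a stand‑in embedding function where the paper uses the frozen encoder, layer fusion, and trained aggregator:

```python
import torch
import torch.nn.functional as F

def score_item(embed, caption: str, candidates: list[str]) -> int:
    """`embed` is any function mapping text -> one sentence vector
    (in the paper: frozen encoder + layer fusion + trained aggregator).
    Returns the index of the candidate most similar to the caption."""
    cap = F.normalize(embed(caption), dim=-1)
    cands = F.normalize(torch.stack([embed(c) for c in candidates]), dim=-1)
    sims = cands @ cap                         # cosine similarities, one per candidate
    return int(sims.argmax())

# Toy usage with a dummy embedder (a real run would use the trained pipeline):
dummy_embed = lambda text: torch.randn(768)
predicted = score_item(dummy_embed, "caption ...", ["true stmt", "false 1", "false 2", "false 3"])
```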

Step 5: Validate with real generation

  • What happens: For a subset of encoders, the team actually trains/fine‑tunes full T2I/T2V models and measures alignment (GenAI‑Bench). They compute correlation with TED‑6K scores.
  • Why it exists: To prove TED‑6K is a good predictor, not just a nice idea.
  ‱ Example: Pearson correlations are very high (≈0.99 for T2I, ≈0.96 for T2V), confirming predictive power; a short sketch of this correlation check follows the list.
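A short sketch of how such a correlation check can be computed with SciPy; the score lists below are made‑up placeholders, not the paper's measurements:

```python
# Sketch of the validation step; the scores are made-up placeholders.
from scipy.stats import pearsonr

ted6k_scores = [52.1, 53.8, 55.0, 56.8, 57.4]          # TED-6K accuracy for five hypothetical encoders
genai_bench_scores = [74.0, 74.9, 75.6, 76.4, 77.3]    # alignment of models trained with those encoders

r, p = pearsonr(ted6k_scores, genai_bench_scores)
print(f"Pearson r = {r:.3f}, p = {p:.3g}")             # r near 1 means TED-6K predicts generation quality
```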

The secret sauce

  • Matching the consumer: The aggregator is designed to look like what Diffusion Transformers consume—sequence in, one context vector out—so the test actually reflects downstream use.
  • Hard negatives: Statement edits that are plausible but wrong (like near‑miss OCR or flipped relations) make the test discriminative.
  • Multilayer fusion: Using information from all layers, but with learned, then frozen weights, captures nuance without causing training drift.

04Experiments & Results

The test: TED‑6K measures whether a caption embedding matches the true statement more than three carefully crafted false ones across key skills (action, spatial/temporal relations, coreference, adjectives, adverbs, quantity, OCR, basic event). A learned aggregator turns token sequences into a sentence vector for fair comparison.

The competition: They evaluate many encoder families—encoder‑only (like UMT5), decoder‑only LLMs (Qwen3 sizes), and MLLMs (Qwen3‑VL, MiMo‑VL, Ovis‑2.5). They also compare feature strategies: last layer, penultimate layer, Norm‑Avg across layers, and the new learnable layer‑wise weighting.

The scoreboard (with context):

  • GRAN‑TED (full method with weighting) tops TED‑6K at 57.42, slightly above strong 32B models and above GRAN‑TED without weighting (57.22) and Qwen3‑VL‑8B Norm‑Avg (56.81). Think of this as going from a solid A to an A+ in a hard class.
  • Sub‑skills: GRAN‑TED improves most on Action (+2.32) and Temporal (+2.67), with smaller gains on Spatial (+0.79), Coreference (+1.39), OCR (+1.14), and Quantity (+0.86). Adjectives/Adverbs dip slightly, signaling room to grow on fine‑grained texture and manner words.
  • Speed: Evaluating an encoder with TED‑6K takes about 4 minutes versus ~50 hours to train a T2I model from scratch—a roughly 720× faster loop.

Surprising findings:

  • MLLMs excel on a text‑only test. Because they were trained to align language with vision, their word embeddings are already more visually grounded.
  • Instruction‑tuning effects are mixed; it’s not a guaranteed win for representation quality.
  • "Thinking" variants can help single‑layer features but may hurt when aggregating across layers.
  • Scaling laws show up clearly when using multi‑layer aggregation (Norm‑Avg); single‑layer features do not scale as predictably.

Correlation with real generation:

  • T2I: Pearson r ≈ 0.991 (p ≈ 1.1e‑4). That’s like a near‑perfect line: higher TED‑6K → higher GenAI‑Bench.
  • T2V: Pearson r ≈ 0.959 (p ≈ 0.041). Still a very strong match. This means TED‑6K is not just convenient; it’s meaningful.

Ablations that matter:

  • Learnable weights alone (kept changing during training) slightly hurt vs. fixed Norm‑Avg, confirming the “moving target” problem.
  • Two‑step training (learn weights, then freeze) beats both: GenAI‑Bench improves from 76.17 (Norm‑Avg) to 77.01.
  • Full GRAN‑TED (specialized fine‑tune + two‑step weighting) boosts downstream generation beyond strong baselines: +1.24 points for T2I (76.17 → 77.41) and +2.39 for T2V (77.94 → 80.33).

Robustness checks:

  • Shuffle test: Without the matching caption, accuracy drops to ~27–29% (near random for 4‑choice), proving items require the right context.
  • Aggregator vs. mean pooling: Mean pooling can mislead, especially for LLMs; the trained aggregator is necessary and aligns better with downstream results.
  • QA‑style evaluation: A very strong LLM can answer most questions perfectly regardless of encoder, making it non‑discriminative; similarity scoring is more telling here.

05Discussion & Limitations

Limitations:

  • Coverage: TED‑6K is broad but not complete. Some fine‑grained visual linguistics (tiny spatial nuances, micro‑attributes, rare OCR cases, subtle adverbs) remain challenging, reflected in slightly lower gains for adjectives/adverbs.
  • Proxy gap: It’s text‑only; while correlation is strong, it isn’t a perfect substitute for full end‑to‑end training across every domain (e.g., medical or scientific diagrams).
  • Aggregator dependence: The evaluation relies on a trained aggregator; poor training or domain shift could weaken its reliability.
  • Data construction cost: Generating dense captions and hard negatives (plus human verification) is non‑trivial.

Required resources:

  • Models: Access to an MLLM (e.g., around 8B parameters) for specialization.
  • Compute: GPUs for fine‑tuning encoder and training/fine‑tuning diffusion models; the aggregator is light.
  • Data: Curated captions/VQA sets aligned to generative semantics.

When NOT to use:

  • If you only need single‑vector retrieval or pure NLP reasoning, simpler benchmarks may suffice.
  • If your generator does not consume sequence‑level embeddings (or uses a very different conditioning path), TED‑6K may be less predictive.
  • For domains with highly specialized vocab/visuals (e.g., radiology), you may need domain‑specific TED variants.

Open questions:

  • Dynamic weighting: Could layer weights vary by diffusion timestep safely (e.g., different mixes for coarse vs. fine denoising) without causing instability?
  • Video temporality: How far can specialized training push temporal relation and motion verbs for long videos?
  • Negatives and safety: Can the encoder better enforce negative constraints (“no logo,” “no watermark”) under complex prompts?
  • Fairness and bias: How does performance vary across languages, cultures, or niche categories? What data additions improve equity?
  • Beyond transformers: Would alternative encoders or recurrent attention schemes offer more controllable, interpretable text conditions?

06Conclusion & Future Work

Three‑sentence summary: The paper introduces TED‑6K, a fast, text‑only benchmark with a small context aggregator that predicts how well a text encoder will guide diffusion models. Using this signal, the authors train GRAN‑TED—a specialized MLLM encoder with learnable, then frozen, layer‑wise weights—to produce richer, more stable text embeddings. Results show state‑of‑the‑art TED‑6K scores and clear boosts in text‑to‑image and text‑to‑video generation.

Main achievement: Turning text‑encoder selection and improvement from a slow guessing game into a rapid, predictive, and principled process—and delivering an encoder (GRAN‑TED) that measurably improves alignment.

Future directions: Expand TED‑6K’s coverage (more languages, domains, ultra‑fine spatial/OCR), explore safe timestep‑aware weighting, and strengthen negative constraint handling. Investigate training signals that further align text semantics with visual composition and aesthetics.

Why remember this: It shows that fixing the “language brain” of a generator—quickly, fairly, and with the right training—translates directly into images and videos that finally do what you asked. It’s a practical blueprint teams can adopt today to build more trustworthy creative AI.

Practical Applications

  ‱ Rapidly A/B test candidate text encoders for a new T2I/T2V product using TED‑6K instead of training full models.
  ‱ Upgrade an existing generator by swapping in GRAN‑TED to improve alignment on complex prompts.
  ‱ Tune encoder layer‑mixing with the two‑step freeze to stabilize and boost diffusion training.
  ‱ Diagnose prompt failures (counts, spatial/layout, temporal order) by checking TED‑6K sub‑scores.
  ‱ Build domain‑specific TED variants (e.g., ads, UI mockups, scientific diagrams) to select the best encoder for that niche.
  ‱ Automate regression checks in CI/CD by running TED‑6K to catch alignment drops after code/model changes.
  ‱ Guide dataset curation (more OCR or spatial cases) based on which TED‑6K skills lag for your model.
  ‱ Prototype lightweight aggregators for fair cross‑model comparisons in research benchmarks.
  ‱ Train safer negative‑constraint handling (e.g., “no logo”) by expanding TED‑style hard negatives.
  ‱ Educate users with prompt tips by surfacing which semantic aspects your encoder handles best (per TED‑6K).
Tags: diffusion models, text encoder, multimodal large language model, text-to-image, text-to-video, semantic alignment, contrastive learning, layer-wise weighting, Norm-Avg, context aggregator, OCR alignment, compositional generalization, benchmarking, TED-6K, GRAN-TED