Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers
Key Summary
- The paper shows that using information from many layers of a language model (not just one) helps text-to-image diffusion transformers follow prompts much better.
- It introduces a simple, unified way to mix these layers using normalized weights that can change with model depth and (optionally) time.
- Depth-wise Semantic Routing, which lets each transformer block pick the best mix of language-model layers, works best and is stable.
- Purely time-based mixing can make images blurrier because the model's actual cleaning speed at inference doesn't match the training timeline.
- A small gating network learns which language-model layers to trust at each depth, using LayerNorm and softmax to combine them safely.
- On GenAI-Bench, depth-wise routing greatly improved hard skills such as Counting, gaining +9.97 over the standard single-layer setup.
- The method adds little compute overhead but gives consistent gains in alignment and compositional generation.
- A tiny manual time shift partially fixes time-based mixing, confirming that the real issue is a train-inference trajectory mismatch.
- The framework is general, interpretable, and sets a strong baseline for future trajectory-aware conditioning.
- Bottom line: match language meaning levels to the right parts of the image generator, and pictures follow instructions better.
Why This Research Matters
This work makes text-to-image models follow instructions more faithfully, which helps people get exactly the pictures they ask for. Artists and designers can specify tricky details (counts, colors, and layouts) and see them respected more often. Educators and students can create accurate visuals for lessons and projects without endless retries. Businesses can generate on-brand, precise images for ads or catalogs with fewer manual edits. The approach is simple, efficient, and interpretable, so others can adopt and build on it easily. Finally, it spotlights a key pitfall, the time mismatch, guiding the community toward safer, smarter conditioning in future models.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're building a LEGO city from a written plan. Early on, you focus on big roads and tall buildings. Later, you place tiny streetlights and window frames. If your helper only ever reads one page of the plan, some steps will be off.
The Concept: Text-to-image Diffusion Transformers (DiTs) turn sentences into pictures by cleaning random noise step by step until an image appears. They use a text encoder (often a big language model, or LLM) to understand your words and guide every step. How it works (big picture):
1) Start with noisy pixels. 2) Read the prompt with an LLM to get meaning. 3) At each denoising step, the DiT asks the text features, "What should I draw next?" 4) Repeat until clean. Why it matters: If the text guidance is too simple or not matched to each stage, the image may miss details, mix up objects, or ignore parts of the prompt. Anchor: Think of drawing "a red balloon among pastel balloons." Early steps set "balloons here"; later steps decide "this one is red." You need the right kind of text hint at the right time.
Hook: You know how a thick book has chapters that go from basics to big ideas? LLMs are like that: early layers catch word meanings, middle layers catch phrases, and deep layers reason about complex ideas.
The Concept: Multi-layer LLM features are the hidden representations from many LLM layers, each carrying different levels of meaning, from simple word identity to complex relationships. How it works:
1) Feed the prompt to the LLM. 2) Save hidden states from multiple layers. 3) These layers act like a menu of meanings: shallow = simple, deep = abstract. 4) Choose and combine layers to guide the image. Why it matters: Using only one layer throws away helpful clues from other layers. Anchor: When asked for "five purple roses in a vase," some layers help with "roses," others with "five," and others with "in a vase." Mixing layers helps count and place correctly.
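As a concrete illustration of collecting this "menu," here is a minimal sketch (not the paper's code) of pulling hidden states from every layer of a Hugging Face-style LLM; the model name is a lightweight placeholder chosen only so the snippet stays small.

```python
# Minimal sketch: collect per-layer hidden states to serve as the semantic "menu".
# The model name below is an illustrative placeholder, not the paper's encoder.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder; the paper uses Qwen3-VL-4B
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)

prompt = "five purple roses in a vase"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = encoder(**tokens, output_hidden_states=True)

# out.hidden_states is a tuple (embeddings, layer 1, ..., layer L),
# each entry shaped [batch, seq_len, hidden_dim].
layer_feats = torch.stack(out.hidden_states[1:], dim=0)  # [L, B, T, D]
print(layer_feats.shape)
```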
Hook: Imagine a factory line: early stations shape big parts; later stations add tiny screws. Different stations need different instructions.
The Concept: Diffusion Transformers (DiTs) are image generators made of many transformer blocks (depth). Early blocks set structure; deeper blocks refine details. How it works:
1) A DiT has stacked blocks. 2) Each block edits the image latent a little. 3) Early blocks handle layout; later blocks polish textures. 4) Text features steer each block via attention or normalization. Why it matters: If every block hears the same text message, some blocks get too much or too little of what they need. Anchor: Telling both the first and last station "make it shiny" is wrong. The first station needs "make it big and round," the last needs "add shiny gloss lines."
Hook: You know how baking time changes your steps: mix at the start, frost at the end? Generation time in diffusion is similar.
The Concept: Text conditioning is the way the model injects language meaning into image-making throughout time and depth. How it works:
1) Encode the prompt with an LLM. 2) Feed those features into the DiT blocks. 3) Repeat at each denoising step. 4) Use the right "flavor" of text features when and where they help most. Why it matters: Without smart conditioning, the model might outline well but fail at colors, or count badly. Anchor: "A pizza with pepperoni on the left, mushrooms on the right" needs conditioning that keeps left/right steady from start to finish.
Hook: Think of a toolbox with many screwdrivers. You wouldn't always pick the same one.
The Concept: Before this paper, most systems used a single LLM layer (often the second-to-last one) for all blocks and times. How it works:
1) Encode text once. 2) Take one hidden layer. 3) Feed it to the whole generator. 4) Hope it works everywhere. Why it matters: This ignores the LLM's rich hierarchy and the DiT's different needs across depth and time. Anchor: It's like using a flathead screwdriver for every screw, even when a Phillips head would fit better.
Hook: Imagine a mixing board for music with sliders for bass, mids, and treble. Different parts of a song need different mixes.
The Concept: The paper's gap: we lacked a simple, fair, and adaptive way to mix many LLM layers that changes with DiT depth and (maybe) time. How it works:
1) Collect multi-layer features. 2) Normalize them. 3) Learn weights to mix them. 4) Let weights depend on block depth and/or time. Why it matters: Without this, improvements were inconsistent, hard to compare, or tied to special architectures. Anchor: A good DJ crossfades tracks at the right moments; a fixed blend sounds flat. This framework learns when to boost which semantic "track."
Hook: Why should you care? Because better text-image matching means less frustration and more control.
The Concept: The real stakes are about making images that follow instructions exactly: correct counts, colors, layouts, and relations. How it works:
1) Stronger conditioning → better understanding. 2) Adaptive mixing → the right help at the right stage. 3) Results → clearer, more faithful pictures. Why it matters: It saves time, reduces edits, and helps creators, teachers, and designers get what they asked for. Anchor: If you need "three blue cups to the right of a wooden fork," you really want three, blue, and to-the-right, all at once.
02 Core Idea
Hook: Think of a sports coach who gives different tips to different players at different moments in a game. That's smarter than shouting one message to everyone all the time.
The Concept: Aha! Dynamically route and mix multiple LLM layers so each DiT block (and optionally each time step) gets the best-fitting semantics. Do it with a simple, normalized, convex combination and tiny gates. How it works:
1) Grab hidden states from many LLM layers. 2) Normalize each (LayerNorm). 3) Compute weights (via softmax) that sum to 1. 4) The weights can depend on DiT depth and/or time. 5) Take the weighted sum and feed it to the block. Why it matters: Matching the right level of meaning to the right place in the generator greatly boosts alignment and compositional skills. Anchor: Early blocks prefer big-picture meaning; late blocks prefer fine-detail meaning. The router learns that pattern.
Three analogies for the same idea:
- Orchestra: Different instruments (LLM layers) play louder or softer depending on which section of the song (DiT block) you're in.
- Cooking: Early simmering needs broad flavors; final plating needs precise spices. The recipe (weights) changes per stage.
- Classroom: Younger grades need basic words; older grades need complex ideas. Teachers (router) adjust content by level.
Before vs After:
- Before: One-size-fits-all text layer for every block and time; misses nuance, struggles with counting and spatial relations.
- After: Depth-aware mixing consistently improves alignment and hard compositions; time-only mixing can be risky unless trajectory-aware.
Why it works (intuition, no equations):
- LLMs store meanings in a semantic ladder: shallow → words, mid → phrases, deep → reasoning. DiT blocks also form a ladder: early → layout, late → texture. The ladders align! Matching rungs yields cleaner guidance.
- Normalized convex mixing keeps features stable and interpretable. Softmax makes weights positive and sum to one, so the mix stays inside safe bounds.
Building Blocks (each with a sandwich):
Hook: Picking the right level of meaning for each station in a factory. The Concept: Depth-wise Semantic Routing lets every DiT block choose its own blend of LLM layers. How it works:
1) Learn one weight vector per block. 2) Normalize LLM-layer features. 3) Softmax to get blend weights. 4) Feed the blend to that block. Why it matters: Blocks doing structure vs. detail get the exact semantics they need. Anchor: The chassis station gets "car shape" hints; the paint station gets "exact shade of red" hints.
Hook: A microwave timer changes what you do: prep at 10 minutes, plate at 1 minute. The Concept: The Time-Conditioned Fusion Gate (TCFG) tries to change layer mixing as denoising time moves from coarse to fine. How it works:
1) Embed time with sinusoidal features. 2) A tiny MLP outputs logits for layer weights. 3) Softmax to get a time-varying blend. 4) Use the same blend across blocks (in time-only mode) or per block (in joint mode). Why it matters: In theory, early time needs global meaning; later time needs fine meaning. Anchor: Early: "balloon here"; late: "this one bright red."
Hook: If you're not sure which flavor is best, start by mixing everything evenly. The Concept: Uniform Normalized Averaging takes all layers, normalizes them, and averages equally. How it works:
1) LayerNorm each layer. 2) Average them with equal weights. 3) No learning needed. Why it matters: A strong, simple baseline that often beats single-layer choices. Anchor: Like giving everyone an equal slice of cake when you don't yet know who's hungriest.
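For flavor, here is what that baseline could look like in a few lines of PyTorch; the shapes and the function name are assumptions for illustration, not the paper's code.

```python
# Minimal sketch of uniform normalized averaging (the B2-style baseline).
import torch
import torch.nn.functional as F

def uniform_average(layer_feats: torch.Tensor) -> torch.Tensor:
    """layer_feats: [L, B, T, D] stacked hidden states from L LLM layers."""
    normed = F.layer_norm(layer_feats, layer_feats.shape[-1:])  # align scales
    return normed.mean(dim=0)  # equal weight 1/L per layer -> [B, T, D]
```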
Hook: A preset playlist that never changes. The Concept: Static Learnable Fusion learns one global set of layer weights that never changes with depth or time. How it works:
1) Train a weight vector. 2) Softmax to mix layers. 3) Use one blend everywhere. Why it matters: Simpler than adaptive methods, but can't tailor to different blocks or times. Anchor: One volume mix for an entire concert, even when a solo comes up.
Hook: Sometimes you need both where and when, like traffic lights that depend on the intersection and the hour. The Concept: Joint time-and-depth fusion gives each block its own time-aware gate. How it works:
1) One tiny gate per block reads time. 2) Each outputs its own layer weights. 3) Softmax, mix, feed per block per time. Why it matters: More flexible than time-only; often steadier. Anchor: Rush-hour signals differ by intersection; off-hours signals relax.
Hook: Mixing paints safely so colors don't explode off the palette. The Concept: Normalized convex fusion (LayerNorm + softmax) keeps the mixture stable and interpretable. How it works:
1) Normalize each layer to align scales. 2) Compute positive weights that sum to 1. 3) The weighted sum stays inside a safe "convex hull." Why it matters: Prevents wild activations and makes weights easy to understand. Anchor: If you mix 30% blue and 70% yellow, you stay between blue and yellow, not some unrelated color.
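A minimal PyTorch sketch of this building block is below; the class name, the [L, B, T, D] shapes, and the choice of a single shared LayerNorm are illustrative assumptions rather than the paper's exact implementation. With one global logit vector, this also corresponds to the static learnable fusion described above.

```python
# Minimal sketch: normalized convex fusion of L layer features with learnable
# logits. LayerNorm aligns scales; softmax keeps weights positive and summing to 1.
import torch
import torch.nn as nn

class ConvexLayerFusion(nn.Module):
    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)                 # per-layer scale alignment
        self.logits = nn.Parameter(torch.zeros(num_layers))  # zeros -> uniform mix at init

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: [L, B, T, D] -> fused conditioning: [B, T, D]
        normed = self.norm(layer_feats)
        weights = torch.softmax(self.logits, dim=0)          # convex coefficients
        return torch.einsum("l,lbtd->btd", weights, normed)
```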
03 Methodology
High-level recipe: Prompt → LLM multi-layer features → Normalize each layer (LayerNorm) → Compute fusion weights (by depth, by time, or both) → Softmax to get a convex mix → Fused text features → Feed to each DiT block's conditioning → Image.
Step-by-step with sandwich explanations where new ideas appear:
- Inputs: Multi-layer LLM features. Hook: Think of a layered cake; each layer tastes different. The Concept: Multi-layer LLM features are hidden states from many layers of the LLM, each capturing different meanings. How it works:
- Feed prompt into LLM.
- Save hidden states from L layers.
- Treat them as a menu of semantic flavors. Why it matters: More flavors let you match the right taste to the right bite of the image. Anchor: For "two cats on a sofa," some layers help with "two," others with "cats," others with placement.
- Normalize each layer. Hook: Before mixing paints, you put them in same-sized cups. The Concept: Layer Normalization (LayerNorm) scales and centers each layer's features to be comparable. How it works:
- Compute per-token mean and variance.
- Normalize to zero-mean, unit-variance-like scale.
- Keep affine parameters if needed. Why it matters: Without normalization, some layers dominate just because they're larger in scale. Anchor: If one microphone is twice as loud, it drowns out the others; normalization evens volumes.
- Compute fusion weights. Hook: A mixing board sets how loud each track plays. The Concept: Fusion weights tell how much each LLM layer contributes to the final mix. How it works:
- Start from logits (one score per layer).
- Apply softmax to make them positive and sum to 1.
- The logits can be learned per depth, per time, or both. Why it matters: Clear, stable, and interpretable blends. Anchor: 0.5 on a deep layer, 0.2 on a mid layer, and 0.3 on another gives a blend that leans on the deep layer.
- Parameterize the weights in three ways:
- S1: Time-wise fusion (shared over depth)
- S2: Depth-wise fusion (shared over time)
- S3: Joint time-and-depth fusion (per-block time gates)
Hook: A kitchen timer that changes your cooking steps. The Concept: The Time-Conditioned Fusion Gate (TCFG) makes weights depend on the denoising time. How it works:
- Encode time with sinusoids (so nearby times look similar).
- A tiny MLP maps the time embedding to logits.
- Softmax → weights for layers.
- Use the same weights for every block (S1) or one gate per block (S3). Why it matters: Early vs. late steps want different semantics, in principle. Anchor: Early: global placement; late: fine textures.
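A lightweight sketch of such a gate follows; the embedding size, MLP width, and the [0, 1] time range are assumptions chosen for illustration, not the paper's exact configuration.

```python
# Minimal sketch of a Time-Conditioned Fusion Gate: diffusion time -> softmax
# weights over L LLM layers (S1 shares one gate across all DiT blocks).
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """t: [B] diffusion times in [0, 1]; returns [B, dim] sinusoidal features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    angles = t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class TimeGate(nn.Module):
    def __init__(self, num_layers: int, embed_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.SiLU(), nn.Linear(hidden, num_layers)
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        logits = self.mlp(sinusoidal_embedding(t))  # [B, num_layers]
        return torch.softmax(logits, dim=-1)        # time-varying layer weights
```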
Hook: Each factory station needs a different instruction sheet. The Concept: Depth-wise Semantic Routing (S2) learns one weight vector per DiT block. How it works:
- For block d, keep learnable logits β_d (one per layer).
- Softmax(β_d) → weights for that block.
- Mix normalized layer features with those weights.
- Feed to block d's conditioning (e.g., cross-attention or adaLN controls). Why it matters: Structure-focused blocks get big-picture semantics; detail blocks get fine-grain semantics. Anchor: The layout block gets "where things go"; the texture block gets "fur strands and shine."
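In code, the whole router can be as small as a logits table indexed by block; the sketch below assumes stacked features of shape [L, B, T, D] and is illustrative rather than the paper's implementation.

```python
# Minimal sketch of Depth-wise Semantic Routing (S2): one learnable logit
# vector beta_d per DiT block, softmaxed into a convex blend of LLM layers.
import torch
import torch.nn as nn

class DepthwiseRouter(nn.Module):
    def __init__(self, num_blocks: int, num_layers: int, hidden_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        # One row of logits per DiT block: [num_blocks, num_layers].
        self.block_logits = nn.Parameter(torch.zeros(num_blocks, num_layers))

    def forward(self, layer_feats: torch.Tensor, block_idx: int) -> torch.Tensor:
        # layer_feats: [L, B, T, D]; returns fused conditioning for one block.
        normed = self.norm(layer_feats)
        w = torch.softmax(self.block_logits[block_idx], dim=-1)  # [L]
        return torch.einsum("l,lbtd->btd", w, normed)
```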
Hook: Sometimes you need both where and when. The Concept: Joint fusion (S3) gives each block its own time-aware gate. How it works:
- For each block d, have a tiny TCFG.
- At time t, produce logits z_{t,d}.
- Softmax → weights.
- Mix and feed to that block at that time. Why it matters: More flexibility than either axis alone; often smoother over time than S1. Anchor: Morning rush vs. evening calm differs by intersection.
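A self-contained sketch of this per-block, time-aware gating is shown below; the gate architecture and the sinusoidal time embedding are assumptions for illustration only.

```python
# Minimal sketch of joint time-and-depth fusion (S3): one tiny time gate per block.
import torch
import torch.nn as nn

class JointRouter(nn.Module):
    def __init__(self, num_blocks: int, num_layers: int, hidden_dim: int, t_dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.t_dim = t_dim
        self.gates = nn.ModuleList(
            nn.Sequential(nn.Linear(t_dim, 128), nn.SiLU(), nn.Linear(128, num_layers))
            for _ in range(num_blocks)
        )

    def time_embed(self, t: torch.Tensor) -> torch.Tensor:
        # Simple sinusoidal embedding; frequency range is an arbitrary choice here.
        freqs = torch.exp(torch.linspace(0.0, 6.0, self.t_dim // 2, device=t.device))
        ang = t[:, None] * freqs[None, :]
        return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

    def forward(self, layer_feats, t, block_idx):
        # layer_feats: [L, B, T, D]; t: [B] diffusion time for the current step.
        normed = self.norm(layer_feats)
        logits = self.gates[block_idx](self.time_embed(t))  # [B, L], per-block gate
        w = torch.softmax(logits, dim=-1)
        return torch.einsum("bl,lbtd->btd", w, normed)
```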
- Softmax convex mixing. Hook: Measuring scoops so the total always adds to one cup. The Concept: Softmax makes all weights positive and sum to 1, so the final feature is a convex mix. How it works:
- Exponentiate the logits.
- Divide by their sum.
- Multiply each normalized layer by its weight; sum. Why it matters: Prevents unstable, negative, or exploding mixes; keeps interpretability. Anchor: 25% layer A, 50% layer B, 25% layer C = a safe blend.
- Feed fused features into DiT blocks. Hook: Hand each station the exact instructions it needs. The Concept: Text conditioning injects these fused features into each block's attention or normalization controls so the image follows the prompt. How it works:
- Provide the fused sequence to cross-attention keys/values or to adaptive normalization.
- The block uses it to steer edits to the image latent.
- Repeat across blocks and time steps. Why it matters: This is how words become pixels. Anchor: "Five purple roses in a vase" persists through all steps so the final image counts correctly.
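To make the injection point concrete, here is a stand-in using PyTorch's built-in multi-head attention; real DiT blocks use their own attention and adaLN modules, so the shapes, sizes, and residual update below are assumptions, not the paper's architecture.

```python
# Minimal stand-in: fused text features act as keys/values in a block's
# cross-attention over image latent tokens.
import torch
import torch.nn as nn

dim = 512
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

image_tokens = torch.randn(2, 256, dim)  # [B, image tokens, D] latent patches
fused_text = torch.randn(2, 77, dim)     # [B, text tokens, D] routed LLM features

# Image tokens query the fused text features; the output steers the block's update.
steer, _ = cross_attn(query=image_tokens, key=fused_text, value=fused_text)
image_tokens = image_tokens + steer      # residual conditioning update
```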
Concrete toy example:
- Suppose 4 LLM layers after LayerNorm: L0, L1, L2, L3.
- For a mid-depth block, learned weights might be [0.10, 0.20, 0.50, 0.20].
- Fused = 0.10·L0 + 0.20·L1 + 0.50·L2 + 0.20·L3.
- Early block might prefer more L3 (global meaning), late block more L2/L1 (fine details), depending on what the router learns.
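The same toy example in runnable form (random tensors stand in for real normalized hidden states):

```python
# Worked version of the toy example: four normalized layer features and the
# learned blend [0.10, 0.20, 0.50, 0.20] for a mid-depth block.
import torch

L0, L1, L2, L3 = (torch.randn(4, 8) for _ in range(4))  # toy [tokens, dim] features
w = torch.tensor([0.10, 0.20, 0.50, 0.20])

fused = w[0] * L0 + w[1] * L1 + w[2] * L2 + w[3] * L3
assert torch.isclose(w.sum(), torch.tensor(1.0))         # convex: weights sum to 1
print(fused.shape)                                       # torch.Size([4, 8])
```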
Secret sauce:
- Match two ladders: the LLM's meaning ladder and the DiT's block ladder.
- Keep mixing simple, normalized, and tiny (low overhead).
- Let data teach which layers each block trusts most.
Optional advanced detail (kept simple): Hook: Walking from noise to image is like following a river current. The Concept: Flow matching describes the clean-up as following a learned vector field over time. How it works:
1) Start at noise. 2) The model learns directions that move you toward the data. 3) Integrate tiny steps along time. 4) End at an image. Why it matters: It's the backbone schedule the router must respect. Anchor: The current is stronger early (big moves), gentler late (polish).
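A generic sketch of that integration is below; it is not the paper's sampler, and the velocity_model callable plus the 0-to-1 time convention are assumptions for illustration.

```python
# Minimal sketch: Euler integration of a learned velocity field from noise
# toward data, the kind of schedule the routing must respect.
import torch

def euler_sample(velocity_model, text_cond, shape, steps: int = 50):
    x = torch.randn(shape)                   # start at pure noise
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for i in range(steps):
        t = ts[i].expand(shape[0])           # current nominal time for the batch
        v = velocity_model(x, t, text_cond)  # predicted direction toward data
        x = x + (ts[i + 1] - ts[i]) * v      # one small step along the flow
    return x
```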
04 Experiments & Results
Hook: Imagine a science fair where different teams build image generators. We judge them on how well they follow instructions, how neat they look, and how well they handle tough challenges like counting.
The Concept: The authors tested three adaptive strategies (time-wise S1, depth-wise S2, joint S3) against common baselines (single layer, uniform average, static learnable) and a deep-fusion model (FuseDiT). How it works:
- Train on a 30M high-quality subset of LAION-400M with dense captions.
- Use a strong DiT (about 2.24B params) and Qwen3-VL-4B as text encoder.
- Evaluate with GenEval and GenAI-Bench for text-image alignment, plus UnifiedReward style scores. Why it matters: These cover everyday alignment, compositional challenges (like counting and spatial relations), and aesthetics. Anchor: It's like testing whether the model draws "five balloons, one bright red" correctly, not just any balloons.
Scoreboard with context (selected highlights):
- Baseline (B1, single penultimate layer): GenEval 64.54; GenAI-Bench 74.96; UnifiedReward 3.02.
- Uniform averaging (B2): better than B1: GenEval 66.51; GenAI-Bench 76.82; UnifiedReward 3.06.
- Static learnable (B3): roughly similar to B2; not clearly better than uniform averaging.
- Deep-fusion baseline (FuseDiT): trails on GenEval (60.95) with similar GenAI-Bench (75.02); shows that reusing LLM internal states doesn't guarantee better conditioning here.
- Our strategies:
  - S1 (time-only): GenEval 63.41; GenAI-Bench 76.20; UnifiedReward 2.97. Sometimes blurrier images.
  - S2 (depth-only): best overall. GenEval 67.07; GenAI-Bench 79.07; UnifiedReward 3.06.
  - S3 (joint): strong but slightly below S2. GenEval 66.05; GenAI-Bench 77.44; UnifiedReward 3.06.
- Interpretation: S2's GenEval 67.07 is like getting an A- when the typical class score is a B; on GenAI-Bench, 79.07 is a clear jump over 74.96.
Fine-grained skills (GenAI-Bench):
- Biggest gains in "Advanced" skills (Counting, Comparison, Differentiation, Negation, Universal).
- Counting example: S2 beats B1 by +9.97 and B2 by +5.45, like moving from often miscounting to reliably getting numbers right.
Surprising findings: Hook: Sometimes, trying to be extra clever can backfire. The Concept: Purely time-wise fusion (S1) often hurt fidelity, causing blur, due to a train-inference trajectory mismatch. How it works:
1) At training, each "time" value maps to a certain noise level. 2) At inference, with classifier-free guidance (CFG), the model cleans faster than the training schedule predicts. 3) So the nominal "time" no longer matches the real cleanliness (SNR) of the image. 4) The time gate injects the wrong kind of semantics at the wrong moment. Why it matters: Clever timing needs the right clock; if the clock is off, timing-based mixing misfires. Anchor: It's like seasoning food as if it were still raw, even though it's already half-cooked; the flavors get muddy.
A small fix that proves the point:
- Manually shifting the time input a tiny bit forward (to "catch up" with the faster inference) slightly improves S1 (e.g., GenEval +0.24), confirming the mismatch cause.
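The fix is essentially a one-liner at inference time; in the sketch below, the shift size and the 0-noise-to-1-clean time convention are assumptions, and the sign of the shift would flip under the opposite convention.

```python
# Minimal sketch of the diagnostic time shift: give the fusion gate a slightly
# advanced time so it tracks the faster effective progress seen with CFG.
# Assumes t runs from 0 (pure noise) to 1 (clean); delta is hand-tuned.
import torch

def shifted_gate_time(t: torch.Tensor, delta: float = 0.05) -> torch.Tensor:
    return torch.clamp(t + delta, max=1.0)
```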
Compute overhead:
- Depth-wise S2 adds negligible parameters and roughly 8% latency; S3 adds a bit more but is still light.
- Despite lower FLOPs, FuseDiT underperforms here, suggesting that simple, interpretable routing can beat heavier internal state sharing.
Bottom line:
- Multi-layer beats single-layer.
- Depth-wise adaptive routing is the safest, strongest win.
- Time-only routing needs a trajectory-aware signal (a better clock) to shine.
05 Discussion & Limitations
Limits and caveats:
- Time-only fusion can blur images due to a train-inference mismatch: the model cleans faster during sampling (especially with CFG), so nominal time t doesn't reflect real progress (SNR). Without a "true progress" signal, timing-based gates inject the wrong semantics.
- Even though routing is lightweight, using a large DiT (about 2.24B params) still requires strong compute and memory; training on 30M pairs isn't trivial.
- Gains are shown for text-to-image at 256×256 with specific settings; other tasks (e.g., video, very high resolutions) may need extra care to transfer benefits.
- Deep-fusion approaches may still win in other contexts; here, the simple convex routing worked better, but that may depend on architecture and training.
Required resources:
- A capable LLM text encoder (Qwen3-VL-4B here) that exposes multi-layer features.
- A DiT backbone with cross-attention or adaptive norm hooks for conditioning.
- Enough data (tens of millions of pairs) and GPUs for multi-hundred-thousand training steps.
When not to use:
- If you cannot access internal multi-layer LLM states (e.g., closed APIs), you lose the main advantage.
- Ultra-low-latency edge settings where even ~8% extra cost is prohibitive.
- Pipelines with strong temporal schedulers already coupled to progress-aware signals; time-only gating may conflict unless aligned.
Open questions:
- How to build a robust, trajectory-aware time signal that tracks real SNR/progress at inference? (e.g., progress estimators, learned denoise meters, or schedule-adaptive controllers.)
- Can routing be extended to spatial regions (different parts of the image use different semantic mixes) safely and efficiently?
- What about token-level routing (different prompt tokens get different layer mixes) without destabilizing training?
- How does this scale to video, where both time in diffusion and time in content interplay?
- Can we distill or prune routing to keep benefits with even lower overhead?
06 Conclusion & Future Work
Three-sentence summary:
- The paper introduces a unified, lightweight framework that mixes multiple LLM layers using normalized convex weights conditioned on DiT depth (and optionally time), to better guide diffusion transformers.
- Depth-wise Semantic Routing consistently boosts text-image alignment and compositional skills, while pure time-wise fusion can hurt fidelity due to a train-inference trajectory mismatch.
- The approach is efficient, interpretable, and sets a strong baseline, highlighting the need for trajectory-aware signals to unlock safe time-dependent conditioning.
Main achievement:
- Showing that simply aligning the LLM's semantic hierarchy with the DiT's functional depth, via normalized, convex layer mixing, delivers robust, significant gains (e.g., +9.97 on Counting) with minimal overhead.
Future directions:
- Design trajectory-aware time signals that reflect actual inference progress (effective SNR) so time-based gates help rather than hurt.
- Explore spatial and token-level routing, and extend to video generation where multiple time axes interact.
- Distill routing into ultra-light policies for real-time deployment.
Why remember this:
- It provides a clean, general recipe: mix many LLM layers, normalize them, and let each DiT block choose what it needs.
- It's a rare win that is both simple and powerful, turning the LLM's hidden hierarchy into practical, controllable gains.
- It reframes conditioning as routing: the right meaning, to the right place, at the right moment.
Practical Applications
- Prompting product photos with exact counts and arrangements (e.g., "three mugs in a row, middle one red").
- Generating educational diagrams that respect numbers and spatial relations (e.g., "five planets orbiting to the left of the star").
- Design mockups with reliable color and layout constraints (e.g., "logo centered, two blue buttons below").
- Story illustration where character attributes persist (e.g., "the pilot with aviator sunglasses appears on every page").
- Marketing creatives that precisely follow brand guidelines (colors, positions, counts).
- Data augmentation for vision tasks requiring compositional control (objects, counts, relations).
- Rapid prototyping of UI elements with strict placement (e.g., "three icons aligned right, equally spaced").
- Visual tutoring tools that render math or science prompts faithfully (e.g., "a beaker with two red balls and one blue ball inside").
- Game asset generation with structured layouts (e.g., "five coins arranged in a cross pattern").
- Assistive tools for accessibility where exact visual attributes need to be consistent with text descriptions.