What matters for Representation Alignment: Global Information or Spatial Structure?
Key Summary
- This paper asks whether generation training benefits more from an encoder's big-picture meaning (global semantics) or from how features are arranged across space (spatial structure).
- Across 27 vision encoders and multiple diffusion model sizes, spatial structure strongly predicts generation quality, while ImageNet-style accuracy barely does.
- They define simple spatial self-similarity metrics (like LDS) that correlate with FID much better than linear probing accuracy.
- Surprisingly, encoders with lower ImageNet accuracy but stronger spatial structure often yield better generations when used for representation alignment.
- The authors introduce iREPA, a tiny (<4 lines) change to REPA: swap the MLP projector for a small convolution and add a spatial normalization layer.
- These two changes consistently speed up convergence and improve FID/IS across many encoders, model sizes, and training recipes (REPA, REPA-E, MeanFlow, JiT).
- Injecting extra global information (like mixing CLS into patch tokens) improves classification but hurts generation by washing out spatial contrasts.
- Even classic spatial features (without strong semantics) can help, showing that spatial signals alone can be valuable.
- Takeaway: for aligning diffusion features to a teacher encoder, preserve and accentuate spatial structure first; global semantics are not the main driver.
Why This Research Matters
Picking the wrong teacher encoder can waste time and compute, slowing creative tools, photo editors, and research workflows. This paper gives a fast, practical rule: pick and shape teacher features for spatial structure, not just high classification accuracy. The iREPA changes are minimal yet reliably speed training and improve image quality across many backbones and recipes. That means better results sooner for applications like design assistance, product visualization, and scientific imaging. It also encourages a shift in how we evaluate and pretrain vision encoders aimed at generation tasks. Ultimately, focusing on spatial coherence leads to more faithful shapes, cleaner edges, and more reliable generations.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're building a LEGO city. You can know the city's name (global meaning), but if you don't know where the roads and buildings go (spatial structure), your city won't look right.
The Concept: Diffusion model
- What it is: A diffusion model is a generator that learns to turn noisy images into clean images step by step.
- How it works:
- Start with random noise (like TV static).
- Take a tiny step to make it less noisy, guided by a learned rule.
- Repeat many times until a clear picture appears.
- The model learns these steps by practicing how to remove noise.
- Why it matters: Without this careful noise-removal process, you don't get realistic images; they'd look messy or blurry. Anchor: Like erasing pencil smudges one pass at a time to reveal a neat drawing.
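To make the step-by-step idea concrete, here is a tiny, illustrative sampling loop. The `denoiser(x, t)` interface and the simple Euler-style update are assumptions made for this sketch, not the paper's actual model or sampler.

```python
import torch

def sample(denoiser, shape, steps=50):
    """Minimal, illustrative denoising loop (a sketch, not the paper's sampler).
    `denoiser(x, t)` is a hypothetical trained network that predicts the
    update direction at noise level t."""
    x = torch.randn(shape)                       # start from pure noise ("TV static")
    for i in range(steps):
        t = 1.0 - i / steps                      # walk from very noisy (t=1) toward clean (t=0)
        t_batch = torch.full((shape[0],), t)
        x = x - denoiser(x, t_batch) / steps     # take one small clean-up step
    return x                                     # after many steps, a clear image
```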
Hook: You know how a student learns faster when a tutor shows good examples? That's what alignment is like.
The Concept: Representation Alignment (REPA)
- What it is: REPA trains a diffusion model faster by matching its internal features to a strong, pretrained vision encoder's features.
- How it works:
- Pass an image through a frozen teacher encoder to get patch-level features.
- Pass a noisy version through the diffusion model to get its features.
- Add a loss to make diffusion features line up with teacher features.
- Keep training so the student quickly learns rich visual cues.
- Why it matters: Without alignment, diffusion models take much longer to learn meaningful structure. Anchor: Like tracing over a good sketch to quickly learn good proportions.
Hook: Knowing a book's title doesn't tell you where each character is on the map of the story.
The Concept: Global Semantic Information
- What it is: Global semantics capture the "what is in the image" idea (like the label "dog").
- How it works:
- Squeeze many details into a compact summary.
- That summary helps with classification.
- It often blurs local differences between nearby parts.
- Why it matters: If you only keep global meaning, you might lose where things are and how they relate spatially. Anchor: Knowing a photo "is a beach" doesn't tell you exactly where the umbrella is.
Hook: If you rearrange puzzle pieces randomly, the picture's meaning is lost.
The Concept: Spatial Structure
- What it is: Spatial structure is how parts of an image relate across space, such as nearby patches being more similar than far-away ones.
- How it works:
- Break the image into patches.
- Compute how similar each patch is to others.
- Notice patterns: close patches should be more alike than distant ones.
- Preserve these patterns during training.
- Why it matters: Without spatial structure, images lose shape and coherence (e.g., eyes could drift off a face). Anchor: Streets near each other in a city are connected; if those links vanish, the map becomes useless.
Hook: A quick check-up tells you if a student can answer a simple kind of question.
The Concept: Linear Probing
- What it is: A simple test where we train a linear classifier on frozen features to see how well they predict labels (like ImageNet classes).
- How it works:
- Freeze the encoder's features.
- Train a single linear layer to predict labels.
- Measure accuracy on a validation set.
- Higher accuracy means better global semantics.
- Why it matters: If we rely only on this score, we may overlook spatial quality that matters for generation. Anchor: It's like a quiz that checks general knowledge but not map-reading skills.
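For readers who want to see the procedure, here is a minimal linear-probing sketch in PyTorch; the function name and training settings are illustrative assumptions, not a specific benchmark protocol.

```python
import torch
import torch.nn as nn

def linear_probe_accuracy(train_feats, train_labels, val_feats, val_labels,
                          num_classes, epochs=100, lr=1e-3):
    """Minimal linear-probe sketch: train one linear layer on frozen features
    (the feature tensors are assumed to be precomputed encoder outputs)."""
    probe = nn.Linear(train_feats.shape[1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(probe(train_feats), train_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        preds = probe(val_feats).argmax(dim=-1)
    return (preds == val_labels).float().mean().item()   # higher = stronger global semantics
```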
Hook: Think of a photo grid, like a tiled floor, where each tile describes what's under it.
The Concept: Patch Tokens
- What it is: Patch tokens are per-patch feature vectors that describe local image areas.
- How it works:
- Split the image into patches.
- Encode each patch into a vector (a token).
- Compare tokens to learn local relationships.
- Use these tokens to guide the generator.
- Why it matters: Without strong patch tokens, generators can't keep shapes aligned. Anchor: Each city block on a map gets its own note; together they form the full city plan.
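A toy sketch of turning an image into patch tokens follows; the patch size and plain flattening are assumptions, and a real ViT-style encoder would add a learned projection and positional information on top.

```python
import torch

def patchify(img, patch=16):
    """Toy patch-token sketch: cut an image into non-overlapping patches and
    flatten each one into a vector. img: [B, C, H, W] -> [B, N, C*patch*patch]."""
    B, C, H, W = img.shape
    x = img.unfold(2, patch, patch).unfold(3, patch, patch)   # [B, C, H/p, W/p, p, p]
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return x                                                  # one token per image patch
```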
The world before: People believed that "stronger global semantics" (high ImageNet linear-probe accuracy) in the teacher encoder would make aligned diffusion models generate better images. Teams picked encoders by their classification prowess and celebrated rising validation scores when aligning diffusion features.
The problem: Lots of real results didn't fit this belief. Some encoders with stellar ImageNet accuracy made generation worse. Other encoders, with modest classification accuracy, made generation better. Injecting extra global information into patch tokens (mixing in the CLS token) improved classification but hurt FID (image quality).
Failed attempts: Scaling to bigger encoders of the same family often gave little improvement in generation, or even made it worse, despite climbing linear-probe scores. Pushing more global cues into local tokens made spatial contrast weaker and generations blurrier.
The gap: No one had a simple, reliable predictor of which teacher features actually help generation. We were measuring the wrong thing (global labels) instead of the right thing (spatial organization among patches).
Real stakes: Choosing the wrong teacher slows training, wastes compute, and gives blurrier results. For creative tools, photo editors, design co-pilots, or science imaging, that's lost time and lower quality. The paper fills this gap by showing that spatial structure, not global semantics, drives success, and by offering a tiny, practical recipe (iREPA) to consistently speed up and improve training.
02 Core Idea
Hook: When you build a puzzle, the picture on the box (global meaning) helps, but the way pieces lock together (spatial structure) is what really makes it work.
Aha in one sentence: The spatial structure of teacher features, not their global semantic strength, is what actually powers representation alignment for better, faster generation.
Three analogies:
- Puzzle vs. Picture: The box art tells you "it's a beach," but the interlocking edges tell you how to assemble it; generators need the interlocking edges (spatial structure).
- City Map vs. City Name: Knowing it's "New York" (global) doesn't help you drive; knowing street-to-street connections (spatial) gets you there.
- Choir vs. Seating Chart: A strong choir (good global sound) is nice, but a performance falls apart if singers don't know who stands next to whom (spatial arrangement).
Before vs. After:
- Before: Researchers picked teachers by ImageNet linear-probe accuracy; they assumed better global semantics meant better generation.
- After: We pick teachers (or adjust them) for strong patch-to-patch structure. We measure spatial self-similarity and even enhance it. Results: faster convergence, better FID, broader generalization.
Why it works (intuition, no equations): Diffusion models reconstruct images by stitching local neighborhoods back together from noise. That demands crisp, local relationships: nearby patches should be more similar than far-away ones, and same-object patches should cohere while background stays distinct. Too much global information can wash out these differences, making many patches look too similar to everything else. When the teacher preserves strong spatial contrast, the student learns a clear blueprint for where edges, parts, and textures belong, so denoising has the right guide rails.
Building blocks of the idea:
Hook: You know how neighbors in a neighborhood usually look more alike than people across town?
The Concept: Spatial Self-Similarity Metric (LDS as an example)
- What it is: A simple score that checks whether nearby patches are more similar than far-away patches on average.
- How it works:
- Split an image into patches and extract features.
- Compute patch-to-patch cosine similarities.
- Average similarities for close pairs and for distant pairs.
- Subtract distant from close: bigger is better spatial structure.
- Why it matters: Without a way to measure spatial structure, we'd keep using the wrong yardstick (classification) to pick teachers. Anchor: Like checking if nearby houses share styles more than random houses across the city.
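The sketch below implements this "near minus far" recipe; it captures the spirit of a metric like LDS but is not necessarily the paper's exact definition, and the neighborhood radius is an assumed choice.

```python
import torch
import torch.nn.functional as F

def spatial_self_similarity(tokens, grid_hw, near_radius=1.5):
    """Simplified 'near minus far' spatial self-similarity score (a sketch in
    the spirit of LDS, not its exact definition).
    tokens: [N, D] patch features; grid_hw: (H, W) with H * W == N."""
    H, W = grid_hw
    feats = F.normalize(tokens, dim=-1)
    sim = feats @ feats.T                                    # patch-to-patch cosine similarity
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()
    dist = torch.cdist(coords, coords)                       # grid distance between patches
    off_diag = ~torch.eye(H * W, dtype=torch.bool)
    near = (dist <= near_radius) & off_diag
    far = dist > near_radius
    return (sim[near].mean() - sim[far].mean()).item()       # bigger = stronger spatial structure
```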
Hook: A tiny tool swap can make a job way easier, like using a rubber mallet instead of a steel hammer when you need gentler taps.
The Concept: iREPA
- What it is: A minimal upgrade to REPA that emphasizes spatial structure using two code-line changes.
- How it works:
- Replace the projection MLP with a small convolution (kernel 3) to respect the spatial grid.
- Add a spatial normalization layer that removes global overlays and heightens local contrast.
- Train alignment as usual.
- Enjoy faster convergence and better FID.
- Why it matters: Without these tweaks, the projector can blur spatial cues and the teacher features can carry too much global haze. Anchor: Like sharpening the outline on a coloring page before you start coloring; staying inside the lines gets easier and faster.
Hook: It's hard to color neatly if the lines are faint and everything looks the same shade.
The Concept: Spatial Normalization Layer
- What it is: A simple normalization over spatial tokens that removes the shared global component and boosts patch-to-patch contrast.
- How it works:
- Compute the mean and variance across the spatial grid for each feature channel.
- Subtract the mean (remove global bias).
- Divide by the standard deviation (normalize scale).
- Pass the contrast-enhanced tokens to alignment.
- Why it matters: Without it, foreground and background patches can look too similar, confusing the generator about edges and parts. Anchor: Like turning down a bright room light so local desk lamps reveal details on the page.
Hook: If you want to preserve a pattern on a quilt, don't mash all squares together in a blender.
The Concept: Convolution Layer (as projector)
- What it is: A small conv that maps features while respecting neighbors on the grid.
- How it works:
- Arrange tokens back into an H×W grid.
- Apply a 3×3 convolution with padding to mix local neighbors.
- Produce target-dimension features.
- Keep local relationships intact better than a fully-connected MLP.
- Why it matters: Without a spatially-aware projector, alignment can lose the very structure we're trying to transfer. Anchor: Like sewing quilt patches with neat, local stitches instead of gluing the whole blanket at once.
Put together, the core idea flips the selection rule (pick strong spatial encoders, not just high-accuracy ones), provides a better metric (LDS and friends), and gives a tiny, practical recipe (iREPA) that consistently speeds training.
03 Methodology
High-level recipe: Input (images + teacher encoder) → Measure spatial structure → Align diffusion features to teacher features → Accentuate spatial info (conv projector + spatial norm) → Output (faster convergence, better FID).
Step 1: Choose and evaluate teacher encoders by spatial structure (not just global accuracy)
- What happens: For each candidate encoder, extract patch tokens and compute a spatial self-similarity score (e.g., LDS). Optionally compute variants (CDS, SRSS, RMSC) to cross-check.
- Why this step exists: If we only check linear probing (classification), we might pick teachers that wash out local structure, leading to worse generations.
- Example: Encoder A has higher ImageNet accuracy than Encoder B, but B has higher LDS. According to the paper, B likely yields better FID when used for alignment.
Step 2: Set up REPA alignment
Hook: Copying neat handwriting makes your own neater faster.
The Concept: REPA Setup
- What it is: Matching student (diffusion) features to teacher (encoder) features on patch tokens.
- How it works:
- Teacher gets clean image patch features; student gets noisy input features.
- A projector maps student features to teacherās dimension.
- A loss encourages student features to match teacher features.
- Train jointly with the diffusion loss.
- Why it matters: Without this guide, the student takes longer to learn usable structure. Anchor: Like overlaying tracing paper to copy a great sketch.
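Schematically, the alignment objective can be sketched as below; the exact loss form, weighting, and token handling in REPA may differ, so treat the function as an illustrative assumption rather than the official implementation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_tokens, teacher_tokens, projector):
    """Schematic REPA-style alignment loss (an illustrative sketch):
    project student patch features into the teacher's dimension and
    reward patch-wise cosine similarity.
    student_tokens: [B, N, Ds] from the diffusion model on noisy input.
    teacher_tokens: [B, N, Dt] from the frozen encoder on the clean image.
    projector: any module mapping Ds -> Dt (an MLP in REPA, a small conv in iREPA)."""
    projected = projector(student_tokens)
    cos = F.cosine_similarity(projected, teacher_tokens.detach(), dim=-1)
    return -cos.mean()   # added (with a weight) to the usual diffusion loss
```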
Step 3: Replace the projector MLP with a small convolution
Hook: Use a comb to align hair strands locally, not a leaf blower.
The Concept: Convolutional Projector
- What it is: A 3×3 conv (padding 1) that maps student features while preserving neighborhood relations.
- How it works:
- Reshape student tokens into a 2D grid.
- Apply 3×3 conv to mix each patch with its neighbors.
- Output the target-dimensional features for alignment.
- Backprop along with the rest of training.
- Why it matters: Without it, an MLP can scramble or dilute local patterns critical for generation. Anchor: Like smoothing a wrinkled map with small local presses instead of flattening it with a giant roller that erases landmarks.
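A minimal sketch of such a projector is shown below; the class name and the way the patch-grid shape is passed in are assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class ConvProjector(nn.Module):
    """Sketch of a spatially-aware projector: a 3x3 conv applied on the
    patch grid (illustrative; not necessarily the paper's exact module)."""
    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        self.conv = nn.Conv2d(student_dim, teacher_dim, kernel_size=3, padding=1)

    def forward(self, tokens, grid_hw):
        B, N, D = tokens.shape
        H, W = grid_hw                                    # assumes H * W == N
        x = tokens.transpose(1, 2).reshape(B, D, H, W)    # tokens -> 2D feature map
        x = self.conv(x)                                  # each patch mixes with its 8 neighbors
        return x.flatten(2).transpose(1, 2)               # back to [B, N, teacher_dim]
```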
Step 4: Add a spatial normalization layer on teacher patch tokens
Hook: Dim the room's overhead light so you can see the glow of each lamp.
The Concept: Spatial Normalization on Teacher Features
- What it is: Normalize across space to remove a global overlay and boost local contrasts.
- How it works:
- For each feature channel, compute mean and std across all patches.
- Subtract mean (remove shared global component).
- Divide by std (balance scales).
- Use these normalized tokens as the alignment target.
- Why it matters: Without removing global bias, patches from different regions can look too similar, weakening spatial signals. Anchor: Imagine rebalancing a photo's lighting so edges and textures pop.
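Following the description above, a sketch of this normalization might look like the following; any learnable scaling or the exact placement in the paper's pipeline may differ from this assumed form.

```python
import torch

def spatial_normalize(teacher_tokens, eps=1e-6):
    """Sketch of spatial normalization on teacher patch tokens (assumed form):
    per channel, standardize across spatial positions so the shared global
    component is removed and patch-to-patch contrast is boosted.
    teacher_tokens: [B, N, D] patch features from the frozen encoder."""
    mean = teacher_tokens.mean(dim=1, keepdim=True)   # [B, 1, D] channel mean over patches
    std = teacher_tokens.std(dim=1, keepdim=True)     # [B, 1, D] channel std over patches
    return (teacher_tokens - mean) / (std + eps)      # use as the alignment target
```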
Step 5: Train as usual and evaluate with generation metrics
Hook: Report cards mean more when they measure what you practiced.
The Concept: FID (as the main scoreboard)
- What it is: A standard score that compares distributions of real vs. generated images; lower is better.
- How it works:
- Generate a lot of images.
- Extract features (Inception) from both real and generated sets.
- Compare their statistics.
- Report a single number; lower means closer to real.
- Why it matters: Without a reliable generation metric, we can't judge progress or compare methods. Anchor: It's like checking how closely your practice drawings match the originals.
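For reference, the standard FID computation from feature statistics (a common formula, independent of this paper) looks like this sketch.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_from_stats(mu_real, cov_real, mu_gen, cov_gen):
    """Standard Frechet Inception Distance from Inception-feature statistics:
    FID = ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 * sqrt(cov_r @ cov_g)).
    mu_*: [D] feature means; cov_*: [D, D] feature covariances."""
    diff = mu_real - mu_gen
    covmean = sqrtm(cov_real @ cov_gen)
    if np.iscomplexobj(covmean):        # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov_real + cov_gen - 2.0 * covmean))
```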
Concrete data examples from the paper:
- Injecting more global information (mixing CLS into patch tokens) raises linear-probe accuracy from ~70.7% to ~78.5% but worsens FID from ~19.2 to ~25.4.
- Across 27 encoders, spatial metrics correlate strongly with FID (|r| > 0.85), while linear probing correlates weakly (|r| ≈ 0.26).
- iREPA improves FID broadly, e.g., PE-G goes from ~32.3 to ~19-20 at 100K steps; DINOv3-B from ~21.4 to ~16.2.
The secret sauce:
- Keep what denoising needs most: crisp local relationships among patches.
- Prevent the projector from smearing spatial patterns (use conv).
- Prevent the teacherās global overlay from drowning local contrasts (use spatial norm).
- Measure and pick teachers by spatial metrics, not just classification accuracy.
- These steps are tiny to implement and consistently speed up training.
04 Experiments & Results
The test: The authors measured how well different teacher encoders help a diffusion model train under REPA-style alignment. They examined convergence speed and final image quality (mostly via FID), and looked for what properties of the teacher best predict performance.
The competition: 27 different vision encoders (including DINOv2/v3, WebSSL, PE family, CLIP, SAM2) paired with multiple diffusion model sizes (SiT-B/L/XL). They compared classic REPA versus iREPA (their improved recipe), and checked generalization to REPA-E, MeanFlow+REPA, and pixel-space JiT+REPA.
The scoreboard with context:
- Spatial beats global: Spatial self-similarity metrics (like LDS, SRSS, CDS, RMSC) show very strong correlation with generation quality (Pearson |r| > 0.85), while ImageNet linear-probe accuracy shows weak correlation (|r| ≈ 0.26). That's like a test that predicts A+ report cards almost perfectly vs. one that barely predicts a B-.
- Inversion examples: PE-Core-G (82.8% validation accuracy) yields worse FID (~32.3) than PE-Spatial-B (53.1% validation accuracy, ~21.0 FID). Similarly, WebSSL-1B (76.0%) underperforms compared to a lower-accuracy model with better spatial structure.
- Global hurts when over-mixed: Mixing the CLS token into patch tokens improved linear probing but degraded FID (from ~19.2 to ~25.4 as mixing increased), showing that extra global info can wash out spatial contrasts.
- Tiny model wins big: SAM2-S with very low validation accuracy (~24.1%) still gave better FID than some encoders with ~60% higher accuracy, because its spatial structure was stronger.
iREPA improvements (examples):
- Across encoders at 100K steps (SiT-XL/2), iREPA consistently lowers FID and raises IS. For DINOv3-B, FID improved from ~21.4 to ~16.2; for WebSSL-1B, ~26.1 to ~16.6; for PE-G, ~32.3 to ~18-19.
- Scaling holds: Larger diffusion models see even bigger percentage gains (e.g., SiT-XL shows stronger relative improvement than SiT-B), indicating the method scales.
- Across encoder sizes (PE-B/L/G), iREPA cuts FID substantially, and the percentage improvement grows with encoder size.
- Recipe variants: Adding iREPA on top of REPA-E and MeanFlow+REPA yields consistent boosts; with JiT+REPA (pixel-space diffusion), iREPA again converges faster across multiple encoders.
Surprising findings:
- More global isn't better: Injecting global info to improve classification can harm generation quality by reducing spatial contrast.
- Bigger isn't always better: Larger encoders in the same family can produce similar or worse FID if their spatial structure weakens during scaling.
- Spatial alone can help: Even encoders that don't shine at semantics (e.g., SAM2) can be excellent teachers for generation if their spatial coherence is strong.
Take-home: If you want better, faster diffusion training, choose and shape teacher features to maximize spatial structure, and use iREPA to preserve and amplify those cues during alignment.
05 Discussion & Limitations
Limitations:
- Scope of factors: While spatial structure explains a lot, generation quality can also depend on other aspects (texture diversity, long-range dependencies, training noise schedules). Spatial metrics are powerful predictors, but not the whole story.
- Metric choices: LDS and related metrics are simple and fast, but there could be corner cases where they mis-rank encoders with unusual structures or specialized domains.
- Task generality: The study focuses on image generation at 256×256 and common backbones; results may vary in other resolutions or niche domains without re-tuning.
Required resources:
- A pretrained vision encoder (teacher), a diffusion transformer (student), and typical training hardware for image generation (multi-GPU training for ImageNet-scale runs).
- Minimal code changes (<4 lines) for iREPA: a 3×3 conv projector and a spatial normalization pass on teacher tokens.
When not to use:
- If your task is pure classification, global semantics matter more, and iREPA's spatial emphasis might not help.
- If your teacher already has extremely strong spatial structure and minimal global overlay, the marginal gains from spatial normalization could be small.
- If your model architecture is non-grid or deliberately discards spatial layouts, a conv projector may not be the right fit.
Open questions:
- How do spatial metrics extend to higher resolutions and multimodal settings (video, 3D, text-to-image with complex layouts)?
- Can we learn a projector that balances local and long-range structure adaptively, rather than a fixed 3Ć3 conv?
- What is the best way to quantify and control the global-vs-spatial trade-off during pretraining of the teacher itself?
- Can new self-supervised objectives explicitly target strong spatial self-similarity without hurting semantics, giving the best of both worlds?
- How does spatial structure interact with guidance methods (e.g., classifier-free guidance) at different sampling budgets?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that for representation alignment in diffusion training, spatial structure, not global semantic accuracy, drives better generation. Simple spatial self-similarity metrics predict FID far better than linear probing, and a tiny upgrade called iREPA (conv projector + spatial normalization) consistently speeds convergence and improves quality. The message is clear: measure, preserve, and enhance spatial cues in teacher features to supercharge generative training.
Main achievement: Flipping the field's selection rule (choose teacher encoders by spatial structure and make alignment preserve it), backed by large-scale evidence and a practical, minimal implementation that works across models and recipes.
Future directions:
- Design teacher encoders that explicitly optimize spatial self-similarity without sacrificing necessary semantics.
- Develop adaptive projectors that preserve both local and long-range spatial relations.
- Extend spatial-structure evaluation and iREPA to higher resolutions, videos, and 3D.
Why remember this: When training generators, it's not just what's in the image that matters; it's where and how parts fit together. Keep the spatial map crisp, and the pictures come out faster and better.
Practical Applications
- Select teacher encoders for diffusion training by their spatial self-similarity (LDS), not their ImageNet score.
- Add a 3×3 convolutional projector (padding 1) to replace the MLP in REPA for better spatial transfer.
- Apply spatial normalization to teacher patch tokens to remove global overlays and boost local contrast.
- Avoid mixing CLS or global averages into patch tokens when the goal is better generation quality.
- Use spatial metrics (LDS/CDS/SRSS/RMSC) as a quick pre-screen to choose among encoders before expensive training.
- Adopt iREPA on top of REPA-E or MeanFlow+REPA to accelerate convergence further.
- Tune spatial normalization strength (gamma range) to balance global vs. local emphasis per dataset.
- Monitor both FID and spatial metrics during training to diagnose when spatial cues are being lost.
- For pixel-space diffusion (e.g., JiT), still use iREPA; it generalizes and speeds training there too.