Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching
Key Summary
- Flow Matching is like teaching arrows to push points from a simple cloud (source) to real pictures (target); most people start from a Gaussian cloud because it points equally in all directions.
- The authors built a special 2D sandbox that mimics high-dimensional geometry so we can actually see how these arrows learn and where they fail.
- Trying to make the source look too much like the data (density approximation) sounds smart but backfires: it misses rare parts of the data (mode discrepancy) and hurts results.
- Pointing the source in the same directions as the data (directional alignment) can also fail if it's too tight, because paths crash into each other (path entanglement).
- The Gaussian's secret power is omnidirectional coverage: during training every data point gets guidance from many angles, which makes learning robust.
- Two practical fixes work best together: Norm Alignment (match average sizes of source and data) during training, plus Pruned Sampling (skip bad directions) only during inference.
- Pruned Sampling is plug-and-play: you can add it to any already-trained Gaussian-source flow model and get better images without retraining.
- On CIFAR-10 and ImageNet64, these ideas consistently reduce FID and make sampling more reliable and efficient.
- Big lesson: in flow matching, keeping broad coverage while trimming obviously bad starts beats trying to perfectly copy the data's density.
- The paper offers clear guidelines for choosing source distributions and a ready-to-use recipe that upgrades today's models.
Why This Research Matters
Better source choices make generative models more reliable, which means fewer weird or broken images when speed matters. The proposed pruning can upgrade existing models without retraining, saving time, money, and energy. Matching norms removes a hidden difficulty so models learn meaningful structure faster. These improvements help applications like rapid design previews, educational visuals, and assistive tools that need trustworthy pictures. The work also offers a simple diagnostic mindset (separate size and direction) to debug future models. Finally, it pushes the field toward practical, geometry-aware design rather than fragile mimicry.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how when you're learning to ride a bike, it's easiest to start on a big, open playground where you can move in any direction without bumping into things?
The Concept: Flow Matching (FM) is a way to teach a model a smooth "arrow field" that carries points from a simple starting blob (source) to real data (target), like moving from a wide playground to a cozy neighborhood.
- How it works:
- Pick a simple source distribution (usually a Gaussian "cloud").
- Pair each source point with a target data point.
- Train a neural network to point arrows that smoothly push sources to targets over time.
- At test time, start from the source and follow the arrows to get new samples.
- Why it matters: Without a good start or clear arrows, the trip gets twisty, slow, or misses the neighborhood entirely. Anchor: Imagine dots spread in a fog (source) learning to glide into clusters that look like real pictures of animals (target).
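To make the training loop concrete, here is a minimal sketch of one conditional flow matching step with a Gaussian source and straight-line paths (the usual I-CFM setup). It assumes a PyTorch velocity network `v_theta(x_t, t)` acting on flattened data; the names and the omission of the small path noise are simplifying assumptions, not the paper's exact code.

```python
import torch

def cfm_training_step(v_theta, x1, optimizer):
    """One I-CFM step: independent Gaussian source, straight-line probability paths.

    x1: a batch of data samples, shape (B, D); v_theta: network predicting v(x_t, t).
    """
    x0 = torch.randn_like(x1)            # Gaussian source sample (independent pairing)
    t = torch.rand(x1.shape[0], 1)       # random time in [0, 1], one per sample
    x_t = (1 - t) * x0 + t * x1          # point on the straight path from x0 to x1
    target_velocity = x1 - x0            # the constant velocity along that path
    loss = ((v_theta(x_t, t) - target_velocity) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```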
Hook: Imagine most sprinkles on a cupcake end up on a ring near the edge; hardly any stay in the very center.
The Concept: In high dimensions, a Gaussian doesn't live near the origin; it mostly sits on a thin shell (a big sphere), which we can separate into "size" (norm) and "direction."
- How it works:
- A Gaussian sample can be seen as: radius r (size) times a unit direction s.
- r follows a chi distribution; s is uniformly spread over all directions.
- This χ-Sphere view keeps the same statistics but makes the geometry obvious.
- Why it matters: Knowing "size vs. direction" helps us diagnose when learning fails because of wrong sizes or missing directions. Anchor: Think of arrows on a compass (directions) and how far you walk (size); reaching a house needs both to be right.
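A small numpy sketch of this size-times-direction view, assuming dimension d. The two sampling routes below give the same distribution; the second one just makes the thin-shell geometry explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3072  # e.g. a flattened 32x32x3 image

# Route 1: a standard Gaussian sample.
x = rng.standard_normal(d)

# Route 2: the chi-Sphere view, radius r ~ chi(d) times a uniform unit direction s.
r = np.sqrt(rng.chisquare(df=d))          # chi-distributed radius (sqrt of chi-squared)
g = rng.standard_normal(d)
s = g / np.linalg.norm(g)                 # uniformly distributed direction on the sphere
x_chi_sphere = r * s

# In high dimensions both norms concentrate near sqrt(d): the "thin shell".
print(np.linalg.norm(x), np.linalg.norm(x_chi_sphere), np.sqrt(d))
```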
Hook: Have you ever been paired with a random project partner and wished you'd matched with someone closer to your topic?
The Concept: Pairing schemes decide who gets matched with whom when training flows.
- How it works:
- Independent pairing (I-CFM): pick a random source and a random target.
- OT-CFM: within a mini-batch, find the best matching that makes paths as short as possible.
- Global OT (ideal): find the best matching over everything (too expensive in practice).
- Why it matters: Bad pairings make arrows bend and swirl; better pairings straighten paths, but can miss learning from all directions. Anchor: Random partners expose you to many styles (robust but messy); carefully assigned partners make projects faster but you meet fewer styles.
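A hedged sketch of what mini-batch OT pairing does, using scipy's Hungarian solver; real OT-CFM code typically relies on dedicated optimal-transport libraries, but the batch-level idea is the same. Independent pairing (I-CFM) would simply keep the random order.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def minibatch_ot_pairs(x0, x1):
    """Re-pair a batch so the total squared distance between pairs is minimal.

    x0: source batch (B, D); x1: target batch (B, D).
    Returns x1 reordered so that (x0[i], x1_perm[i]) are the OT pairs.
    """
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)  # (B, B) pairwise squared distances
    _, col_idx = linear_sum_assignment(cost)                 # optimal one-to-one assignment
    return x1[col_idx]
```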
Hook: Reading a novel is easier with a map of the story world; a 2D map beats staring at tangled paragraphs.
The Concept: The authors build a special 2D simulation that still preserves the high-dimensional "size + direction" geometry.
- How it works:
- Sample directions on a circle and sizes from a chi distribution (to mimic a high-D shell).
- Make target clusters with different densities and smaller norms (like real images).
- Visualize learned trajectories and measure failures and distances.
- Why it matters: Regular 2D toys miss high-D geometry; this 2D-but-high-D-aware sandbox reveals the real training dynamics. Anchor: It's like shrinking a 3D city into a flat subway map that still shows where the lines cross and where jams happen.
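Here is a minimal sketch of such a 2D-but-high-D-aware setup: circle directions with chi-distributed radii for the source, and imbalanced clusters at smaller norms for the targets. The specific numbers (degrees of freedom, cluster centers, weights) are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_eff = 4000, 3072   # number of points; "effective" high-D degrees of freedom

# Source: uniform 2D directions with chi(d_eff) radii, i.e. a thin shell drawn in 2D.
theta = rng.uniform(0, 2 * np.pi, n)
radius = np.sqrt(rng.chisquare(df=d_eff, size=n))
source = radius[:, None] * np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Targets: imbalanced clusters placed well inside the shell (smaller norms, uneven density).
centers = np.array([[20.0, 0.0], [-15.0, 15.0], [0.0, -25.0]])
weights = np.array([0.70, 0.25, 0.05])          # one common mode, one medium, one rare
labels = rng.choice(len(centers), size=n, p=weights)
target = centers[labels] + 2.0 * rng.standard_normal((n, 2))
```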
Hook: If you copy only the busy parts of a city map, you'll miss quiet side streets where friends live.
The Concept: Past attempts tried to make the source look like the data (density approximation) or point in the same directions (directional alignment), but both can break.
- How it works:
- Density approximation misses rare modes (mode discrepancy).
- Too-tight directional alignment makes many paths squeeze together (path entanglement).
- OT pairing learns straighter but narrower paths; independent pairing learns broader but curvier paths.
- Why it matters: These trade-offs explain why simple Gaussian sources often win: they cover all directions. Anchor: A wide playground (Gaussian) gives safer practice from every angle; narrow lanes (tight alignment) cause pileups.
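For contrast, a hedged 2D sketch of a directionally aligned source: angles drawn from a von Mises distribution around one cluster's angle with concentration κ (the circle analogue of a vMF source), radii still chi-distributed. Pushing κ high is exactly the regime where path entanglement appears; the values below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_eff = 4000, 3072
kappa = 300.0                        # directional concentration; large kappa = very tight alignment
mode_angle = np.arctan2(0.0, 20.0)   # angle of a hypothetical target cluster at (20, 0)

theta = rng.vonmises(mu=mode_angle, kappa=kappa, size=n)   # concentrated directions
radius = np.sqrt(rng.chisquare(df=d_eff, size=n))          # same chi-distributed sizes as before
aligned_source = radius[:, None] * np.stack([np.cos(theta), np.sin(theta)], axis=1)
```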
Real Stakes:
- Before: People assumed "smarter" sources that mimic data should help.
- Problem: In high dimensions, mimicry often loses rare directions and sizes, making learning unstable.
- Gap: We needed a clear, visual, geometry-aware way to see why sources fail and what to fix.
- Why care: Better sources mean faster image generation, fewer broken samples, and simple upgrades to existing models, which is useful for apps that need quick, reliable visuals (education, design, accessibility, and more).
02 Core Idea
Hook: You know how building a fort goes best if you first pile pillows to the right height (size) and then block the drafty corners (bad directions)?
The Concept: The key insight is to keep Gaussian training for robust, all-around learning, but fix two things: match average sizes (Norm Alignment) and, only at sampling time, skip bad directions (Pruned Sampling).
- How it works:
- Train with Gaussian so every data point gets arrows from many directions.
- Scale norms so source and data have matching average size; this saves learning effort.
- During inference, prune directions that don't lead to data; follow the safer roads.
- Why it matters: You keep robustness while removing known troublemakers; no retraining is required to enjoy pruning. Anchor: Practice soccer on the whole field (learn everywhere), then on game day avoid muddy patches (prune) and wear the right-sized cleats (norm alignment).
Multiple Analogies:
- Map analogy: Train with a full compass (all directions), then draw "no-go" tape over swamps (prune) and set the right scale on your map so distances make sense (norm alignment).
- Cooking analogy: Learn a recipe by trying ingredients from all shelves (Gaussian), then during serving skip stale spices (prune) and use the right pot size (norm alignment).
- Classroom analogy: Study with questions from every topic (Gaussian), but on test day skip trick questions no one studied (prune) and match time per section to its weight (norm alignment).
Hook: Imagine a library that keeps every aisle open while marking a few clearly wrong exits.
The Concept: Omnidirectional coverage means the model sees supervision coming from many angles around each data mode.
- How it works:
- Gaussian training spreads starts in all directions.
- Independent pairing ensures each data point is approached from multiple angles.
- The vector field near modes becomes well-learned and robust.
- Why it matters: If pairing later is imperfect, the model still knows how to guide you from unusual angles. Anchor: It's like a city where every fire station has roads coming in from all sides, not just one highway.
Hook: Copying only the crowded streets ignores hidden cul-de-sacs where people live.
The Concept: Mode discrepancy happens when a data-like source forgets rare regions, leaving no good starts for those targets.
- How it works:
- Approximate sources cluster around common modes.
- Rare modes end up with few or zero source partners.
- Even OT pairing then creates long, twisted detours.
- Why it matters: Missing rare cases means worse coverage and lower image quality. Anchor: A bus route that skips small neighborhoods leaves riders stranded.
Hook: If too many kids rush through the same doorway at once, they get stuck.
The Concept: Path entanglement is when tightly aligned sources force many paths into the same narrow corridor, making learning unstable.
- How it works:
- Increase directional concentration too much.
- Paths start almost on top of each other.
- The model must learn very sharp, inconsistent arrows, so training gets shaky.
- Why it matters: Over-focusing directions can backfire, even if it seems geometrically neat. Anchor: A single-file hallway jam makes the whole class late to lunch.
Before vs After:
- Before: Try to imitate the data's density or directions closely; use OT to straighten paths.
- After: Keep Gaussian for broad learning, fix scale (norms), and only prune during sampling. This keeps robustness and avoids weakly trained regions.
Why It Works (intuition):
- Training breadth (Gaussian + independent pairing) builds a sturdy vector field around modes.
- Matching average norm removes a big, boring task (scale fixing) from the learner.
- Pruning at inference bypasses regions that the model barely practiced, so sampling succeeds more often.
Building Blocks:
- χ-Sphere view: separate size and direction to reason clearly.
- Robust training: Gaussian source, independent pairing.
- Scale fix: Norm Alignment (simple proportional rescaling).
- Safe decoding: Pruned Sampling (PCA-guided direction filter at test time).
03 Methodology
At a high level: Input images → Choose a source (Gaussian) → Train the vector field with Conditional Flow Matching → Add Norm Alignment during training → During inference, apply Pruned Sampling → Follow the ODE to generate images.
Hook: Planning a science fair needs both a small practice table and a big gym sketch to see crowd flow.
The Concept: Two pipelines run in this work: a 2D simulator to understand learning, and a practical training/inference recipe to improve real models.
- How it works:
- Analysis pipeline: a 2D high-D-aware sandbox to watch trajectories and failures.
- Practical pipeline: Train with Gaussian plus Norm Alignment; sample with pruning.
- Why it matters: Seeing the "why" (simulator) makes the "how" (recipe) reliable. Anchor: First, draw a traffic map; then, fix the road signs.
Part A: Analysis Pipeline (2D but high-D-aware)
- Build the χ-Sphere source in 2D
- What happens: Sample directions on a circle and radii from a chi distribution to mimic a high-D shell.
- Why it exists: Regular 2D toys don't capture the shell geometry of high-D Gaussians.
- Example: Think of beads on a bracelet (circle) but with bead sizes drawn from the chi distribution.
- Create realistic targets
- What happens: Place 2-3 clusters at smaller radii and with imbalanced densities.
- Why it exists: Real images (normalized to [-1, 1]) live inside the Gaussian shell and are unevenly distributed.
- Example: One big crowd, one medium crowd, and a tiny meetup inside the ring.
- Train I-CFM and OT-CFM
- What happens: Learn vector fields with random pairing (I-CFM) or mini-batch OT pairing (OT-CFM).
- Why it exists: Compare robustness (I-CFM) vs straightness (OT-CFM).
- Example: Random partners teach coverage; best-in-batch partners teach short paths.
- Try alternative sources (a toy sketch of one such source follows this list)
- What happens: Density-like sources (DCT/GMM/CNF) and direction-aligned sources (von Mises-Fisher/vMF) are tested.
- Why it exists: See when mimicry or tight alignment helps or hurts.
- Example: Copying downtown streets (density) or pointing only toward known hubs (direction).
- Visualize and score
- What happens: Plot trajectories; compute Normalized Wasserstein, failure rate, and distance metrics.
- Why it exists: Numbers plus pictures reveal mode loss and entanglements.
- Example: Bright path heatmaps show where the model really learned.
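As flagged in the list above, here is a toy sketch of a density-approximating source for the 2D sandbox: fit a Gaussian mixture to the target clusters and sample source points from it (scikit-learn is used for convenience; the paper's DCT/GMM/CNF constructions differ in detail). With imbalanced clusters, the rare mode can end up with very few starting points, which is mode discrepancy in miniature.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_source(target, n_components=3, n_samples=4000, seed=0):
    """Fit a Gaussian mixture to the target data and sample from it as a density-like source."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(target)
    samples, _ = gmm.sample(n_samples)    # sample() also returns component labels
    return samples

# Usage with the sandbox targets from the earlier sketch:
# density_like_source = gmm_source(target)
```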
Part B: Practical Training/Inference Pipeline
Hook: Before a race, you make sure everyone has similar-sized shoes; after the whistle, you avoid slippery lanes.
The Concept: Norm Alignment (training) and Pruned Sampling (inference) form a simple, effective recipe.
- How it works:
- Train with Gaussian + independent (or OT) pairing; rescale targets to match average source norm.
- After training, at sampling time, reject bad directions using a PCA-based test.
- Generate by integrating the learned ODE.
- Why it matters: You learn broadly, save effort on scaling, and avoid weakly trained areas when it counts. Anchor: Practice everywhere, wear right-fit shoes, and sprint on solid ground.
Step-by-Step (practical); a consolidated code sketch follows this list:
- Compute average norms
- What: Estimate the mean norm of the Gaussian (from the chi distribution χ(d), roughly √d) and of the dataset; scale targets so the averages match.
- Why: Removes costly scale mismatch so the model can focus on structure.
- Example: If source average is 55 and data is 27, multiply targets so both average to 55 during training; undo later.
- Train vector field
- What: Use standard FM/CFM loss with your usual UNet/architecture.
- Why: The Gaussian's omnidirectional coverage teaches robust arrows around each mode.
- Example: Each cat image gets approached from many angles.
- Learn pruning directions (no retraining)
Hook: Imagine labeling winds on a compass: which gusts lead nowhere?
The Concept: PCA-based Pruned Sampling identifies source directions far from the data manifold.
- How it works:
- L2-normalize the data and run PCA to get principal directions v1, ..., vd; include negatives to cover both signs.
- For each basis direction, compute its best cosine similarity with any normalized data point.
- Mark directions with low max-cosine as "irrelevant"; define a rejection threshold (slightly looser at inference for safety).
- Why it matters: Cuts off starts that the model barely practiced and often fail. Anchor: It's like taping an "X" over dead-end alleys on your city map.
- Sample with pruning
- What: Draw x0 ~ N(0, I). Keep it only if its direction isn't in the rejected set; otherwise redraw.
- Why: Steer starts toward regions with better learned guidance, improving quality and stability.
- Example: On ImageNet64, pruning consistently reduced FID at various step counts.
- Integrate the ODE
- What: Use your solver (e.g., Euler steps) for NFE steps; map source samples to images.
- Why: This is the actual generation; now it benefits from better starts and matched norms.
- Example: With 100 NFEs on CIFAR-10, pruning + norm alignment beats the Gaussian baseline.
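As promised above, a consolidated numpy sketch of the practical recipe: a norm-alignment scale factor, PCA-based pruning directions, rejection sampling of starts, and Euler integration of the learned ODE. The function names, thresholds, and the exact form of the cosine test are assumptions based on the description above, not the authors' released code; `v_theta(x, t)` stands in for the trained velocity network.

```python
import numpy as np

def norm_alignment_scale(data, d):
    """Factor that matches the data's mean norm to the Gaussian's mean norm (about sqrt(d)).

    Multiply training targets by this factor; divide generated samples by it afterwards.
    """
    gaussian_mean_norm = np.sqrt(d)        # mean of the chi(d) distribution is close to sqrt(d)
    data_mean_norm = np.linalg.norm(data, axis=1).mean()
    return gaussian_mean_norm / data_mean_norm

def pruned_directions(data, keep_threshold=0.1):
    """Mark +/- principal directions whose best cosine with any normalized data point is low."""
    x = data / np.linalg.norm(data, axis=1, keepdims=True)             # L2-normalized data
    _, _, vt = np.linalg.svd(x - x.mean(axis=0), full_matrices=False)  # PCA via SVD
    basis = np.concatenate([vt, -vt], axis=0)                          # include both signs
    max_cos = (basis @ x.T).max(axis=1)                                # best cosine per direction
    return basis[max_cos < keep_threshold]                             # the "irrelevant" set

def sample_start(rejected, d, reject_cos=0.2, rng=None):
    """Rejection-sample x0 ~ N(0, I) until its direction stays away from the rejected set."""
    rng = np.random.default_rng() if rng is None else rng
    while True:
        x0 = rng.standard_normal(d)
        s = x0 / np.linalg.norm(x0)
        if rejected.size == 0 or (rejected @ s).max() < reject_cos:
            return x0

def euler_generate(v_theta, x0, n_steps=100):
    """Integrate dx/dt = v_theta(x, t) from t=0 to t=1 with simple Euler steps (NFE = n_steps)."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v_theta(x, i * dt)
    return x
```

In practice the thresholds would be tuned as the steps above suggest (with the inference-time test kept slightly looser), and the generated sample would be divided by the norm-alignment factor before being reshaped into an image.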
The Secret Sauce:
- Keep omnidirectional supervision during training (don't prune then).
- Fix big, boring mismatches (norms) so training focuses on real structure.
- Only prune at inference, where avoiding weak zones pays off immediately; no retraining needed.
04 Experiments & Results
Hook: Think of a school race where we compare running times fairly: same track, same whistle, but different shoes and lanes.
The Concept: The authors test on standard image datasets (CIFAR-10, ImageNet64), compare against common baselines, and use meaningful scores like FID.
- How it works:
- Datasets: CIFAR-10 (32×32), ImageNet64 (64×64).
- Baselines: Gaussian source; density-like sources (DCT-filtered Gaussian, GMMs, CNF/FFJORD); directional vMF sources (oracle and clustered).
- Metrics: FID for image fidelity; Normalized Wasserstein and failure rates in the 2D sandbox for insight.
- Why it matters: Shows both real-world gains and why they happen. Anchor: It's like timing laps (FID) and also reviewing drone footage (trajectory heatmaps) to see where runners stumbled.
Hook: A grade of 87% only makes sense if you know everyone else got around 80%.
The Concept: FID puts numbers on image quality; lower is better (like fewer mistakes).
- How it works:
- Extract features (Inception net); compute mean and covariance for real vs generated.
- The Fréchet distance between the Gaussians fit to those statistics is the FID.
- Report across different step counts (NFE) and methods.
- Why it matters: Lets us say "A+ vs B-," not just "pretty vs not." Anchor: FID 4.0 is like scoring a 96 when classmates get 89.
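For completeness, a minimal sketch of the FID computation from two feature matrices (rows are Inception features of real and generated images). Standard FID packages are normally used in practice; this numpy/scipy version only shows what the number measures.

```python
import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    """Frechet Inception Distance between two feature sets of shape (N, D); lower is better."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # drop tiny imaginary parts from numerical error
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))
```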
Key Tests and Scoreboard (with context):
- Density approximation on CIFAR-10 (OT-CFM)
- DCT-Weak slightly better than Gaussian (around 4.20 vs 4.30-4.40 FID), but stronger approximations (DCT-Strong, GMM-k, CNF) got worse.
- Context: Approximating density sounds clever but increases mode discrepancy; rare modes get lost.
- Directional alignment via vMF
- Oracle vMF (κ very large) nearly reproduces training samples (trivial, near-zero FID) but proves a point: directions matter.
- Practical clustering (e.g., K=3) helps at mild κ (50-100) but hurts when too tight (κ ≥ 300) due to path entanglement.
- Context: Too much focus squeezes paths together; some spread is necessary for stable learning.
- Gaussian vs OT-CFM vs I-CFM (2D sandbox)
- I-CFM: omnidirectional learning around modes; more robust but some curved paths.
- OT-CFM: straighter local paths but misses broad angular supervision; failures occur from undertrained directions.
- Context: Heatmaps show where learning actually happened; dark zones predict failures.
- Pruned Sampling (plug-in) on CIFAR-10
- Train with Gaussian; sample with pruning: consistently better than the plain Gaussian baseline (training and sampling without pruning).
- Example: I-CFM FID improved (e.g., 4.36 → 3.95 at 100 NFE); OT-CFM also improved (e.g., 4.40 → 4.10 at 100 NFE), and even more at few steps in some settings.
- Context: Avoiding bad start directions pays off immediately.
- Norm Alignment (scaling) + Pruned Sampling
- On CIFAR-10 (100 NFE), combining both gives the biggest gains: OT-CFM down to about 3.88; I-CFM to about 3.64.
- At very low NFE, norm alignment alone can hurt due to increased path curvature (needs more steps to trace accurately).
- Context: Fix sizes and avoid weak directions; at few steps, prefer straighter paths.
- ImageNet64 scale-up
- OT-CFM with pruning improved FID across step counts (e.g., at 100 NFE, 9.10 → 8.78).
- Context: Method generalizes beyond tiny images.
Surprising Findings:
- Stronger data mimicry (GMM/CNF) performed worse than plain Gaussian.
- Tight direction alignment hurt; more isn't always better, and stability needs angular support.
- Training-time pruning disappointed, but inference-time pruning shined: robustness first, then trim.
Bottom line: The best combo was Train: Gaussian (+ Norm Alignment) and Sample: Pruned. It's robust, simple, and works now on existing models.
05 Discussion & Limitations
Hook: Even a great recipe can flop if your oven is tiny, you rush the bake, or you switch cuisines mid-meal.
The Concept: These methods are powerful but not magic; they have limits, needs, and open questions.
- How it works:
- Limits: Findings are from images; other modalities (text, audio, molecules) may differ. Very low NFE can make Norm Alignment worse due to curvature.
- Resources: PCA over the dataset (for pruning) needs memory/time; rejection sampling adds a small compute cost.
- When not to use: If you must sample in extreme low steps, avoid Norm Alignment alone; if data directions shift a lot over time, retrain or update pruning.
- Open questions: Can we prove optimal source designs? How to auto-tune pruning thresholds? How does this play with conditional generation and latents?
- Why it matters: Clear boundaries help you deploy wisely and spark the next advances. Anchor: It's like knowing your bike is great on roads but not for mountain rocks; choose paths accordingly and plan upgrades.
Specific limitations:
- Theory is explanatory, not fully formal; more math could give guarantees.
- Hyperparameters (pruning thresholds) need tuning; too aggressive pruning risks support loss.
- Results are strongest for unconditional image generation; conditional tasks need study.
Required resources:
- A pass over normalized data for PCA directions.
- Storage for PCA components; minimal changes to sampling code.
When not to use:
- Ultra-fast demos with extremely small NFE: skip Norm Alignment or increase steps.
- Datasets with shifting manifolds (e.g., streaming domains): refresh pruning periodically.
Open questions:
- Can we learn pruning masks end-to-end without PCA?
- Better surrogates for omnidirectional coverage with OT-style efficiency?
- Adaptive κ in directional schemes to avoid entanglement automatically?
- Extensions to latent token spaces and cross-modal settings?
06 Conclusion & Future Work
Three-sentence summary: Training with a Gaussian source gives robust, omnidirectional learning, but scale mismatches and bad start directions still cause failures. The paper shows that copying the data's density or over-focusing directions actually hurts due to mode discrepancy and path entanglement. The winning recipe is simple: align norms during training and prune directions only during inference, boosting quality without retraining.
Main achievement: A clear geometric explanation of why Gaussian works so well (omnidirectional coverage) and a practical, plug-and-play Pruned Sampling method, plus Norm Alignment, that consistently improves flow matching models.
Future directions: Develop theory for optimal source design; learn pruning masks automatically; extend to conditional and crossâmodal generation; adaptively balance straightness and coverage; test on larger and more varied datasets and latent spaces.
Why remember this: In flow matching, breadth beats brittle precision. Train wide (Gaussian), fix the big mismatch (norms), and only then trim the edges (prune). This mindset delivers immediate, reliable gains and a roadmap for designing better sources in high-dimensional generative modeling.
Practical Applications
- Upgrade an existing Gaussian-source flow model by adding Pruned Sampling at inference to reduce bad generations.
- Apply Norm Alignment during training to speed convergence and improve final FID at moderate or high step counts.
- Use the χ-Sphere viewpoint (size vs direction) to debug failures: check if issues are from wrong norms or missing directions.
- Prefer independent pairing when robustness is needed; use OT-CFM when you can afford potential angular narrowness and will prune at inference.
- Tune pruning thresholds to balance quality vs sampling time; start conservative to keep sufficient support.
- For datasets with known rare modes, avoid density-mimicking sources (e.g., tight GMMs) that may drop rare regions.
- In low-step (very fast) settings, consider skipping Norm Alignment or increasing steps to handle added curvature.
- Use clustering-based directional sources with mild concentration if you must bias directions, and watch for entanglement.
- Periodically recompute PCA for pruning if your data domain drifts over time.
- Combine pruning with better ODE solvers or schedulers to maximize gains at fixed compute.