Bidirectional Normalizing Flow: From Data to Noise and Back
Key Summary
- Normalizing Flows are models that learn how to turn real images into simple noise and then back again.
- Old flows demanded a perfect built-in reverse button, which forced slow, step-by-step decoding and limited model designs.
- BiFlow drops the must-be-perfect rule and trains a separate reverse model to learn the way back from noise to data.
- A new training trick called hidden alignment lines up the reverse model's hidden steps with the forward model's steps, like matching footprints on a trail.
- BiFlow folds denoising into the reverse model, removing a costly extra cleanup step used by prior flows.
- It can generate an image in a single pass (1-NFE), making it much faster than autoregressive flows like TARFlow.
- On ImageNet 256×256, BiFlow gets FID 2.39 with a base-size model and runs up to about 100× faster than strong NF baselines, depending on hardware.
- Because the reverse is learned, BiFlow can use flexible Transformers with bidirectional attention and perceptual loss for better-looking images.
- Training-time classifier-free guidance lets BiFlow keep the benefits of guidance without doubling compute at sampling.
- BiFlow sets a new state-of-the-art among NF-based methods and is competitive with top one-step (1-NFE) generators.
Why This Research Matters
BiFlow makes high-quality image generation fast enough for real-time use on everyday devices. This can power instant creative tools, rapid design previews, and educational apps where waiting many seconds per image isn't acceptable. The single-pass approach saves energy and cost in data centers and on phones, making AI more sustainable and accessible. Built-in flexibility (like perceptual loss and guidance) means outputs can be tuned for human judgment and specific tasks. The bidirectional mapping also enables simple, training-free edits like inpainting and class editing, opening doors for intuitive photo fixes and content creation. By modernizing a classical idea, BiFlow broadens the toolkit for both researchers and developers to build responsive, user-friendly AI experiences.
Detailed Explanation
01 Background & Problem Definition
You know how a good recipe lets you take ingredients and turn them into a cake, and a really great recipe also lets you reverse it, so you could, in theory, separate the cake back into ingredients? That second part is much harder!
The Concept: Normalizing Flows (NF)
- What it is: An NF is a model that learns a reversible path between real data (like pictures) and simple noise (like random "static").
- How it works:
- Forward process: take a real image and transform it step by step into simple noise.
- Reverse process: take simple noise and exactly reverse those steps to rebuild an image.
- Because the path is reversible, the model can also compute how likely each image is (the change-of-variables formula below makes this precise).
- Why it matters: Without a proper reverse, you can't reliably turn noise back into images, and you lose the ability to sample or score images. Anchor: Think of rolling cookie dough into a perfect ball (data→noise) and then unrolling it back into the exact same shape (noise→data). NFs learn both directions.
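In symbols: if f is the learned forward map taking data x to noise z = f(x), the exact likelihood follows from the standard change-of-variables identity (classic NF background, not specific to this paper):

```latex
\log p_X(x) \;=\; \log p_Z\big(f(x)\big) \;+\; \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```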
Imagine a maze with a strict rule: every move must be reversible with perfect certainty.
The Concept: Explicit Inverse Constraint
- What it is: Traditional NFs require the reverse path to be a perfect, analytic (exact math) inverse of the forward path.
- How it works:
- Design the forward steps so they can be exactly undone.
- Keep each step's math simple enough to compute how the volume changes (the Jacobian determinant).
- Build the whole model from these reversible blocks.
- Why it matters: Without this exactness, old NFs couldn't guarantee correct sampling or likelihoods, but the rule also chained model design to only special, invertible blocks. Anchor: It's like insisting every LEGO move be undone by snapping the exact same piece backward; no new tools allowed.
Think about writing a story one word at a time, never peeking ahead.
The Concept: Autoregressive Flows (like TARFlow)
- What it is: A kind of NF that transforms one token/pixel-block at a time, using only earlier parts for context (causal decoding).
- How it works:
- Split an image into a long sequence of tokens.
- For each step, update the next token using only the past ones.
- Repeat this many times across many blocks.
- Why it matters: This design uses powerful Transformers but forces slow, one-by-one decoding at inference time (sketched in the code after this list). Anchor: It's like building a 256-piece necklace by threading beads in strict order; you can't add bead #200 until #199 is in place.
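A minimal sketch of why causal decoding is slow. Here `reverse_step` is a hypothetical callable standing in for one causal-Transformer inversion step; it is not TARFlow's actual API:

```python
import torch

def causal_decode(reverse_step, z, num_tokens):
    # Autoregressive inversion: token t may only depend on tokens < t,
    # so decoding costs num_tokens strictly serial model calls.
    # z: noise tensor of shape (batch, num_tokens, dim).
    tokens = []
    for t in range(num_tokens):
        past = (torch.stack(tokens, dim=1) if tokens
                else z.new_zeros(z.size(0), 0, z.size(-1)))
        tokens.append(reverse_step(past, z[:, t]))  # one call per token
    return torch.stack(tokens, dim=1)
```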
Picture two different highways to the same city: one is a fixed, scheduled bus route; the other is a road you can choose yourself.
The Concept: Flow Matching/Diffusion (context)
- What it is: Modern cousins of NFs that pre-plan the path from noise to data with a schedule and often use many steps.
- How it works:
- Set a time schedule for how noisy the image is.
- Train a network to follow this schedule backward from noise to clean images.
- Sample by simulating many steps (see the multi-step sampler sketch after this list).
- Why it matters: They're strong but typically need many steps, and they don't learn a clean, direct two-way map like classic NFs. Anchor: It's like following a timetable for a train ride back to the city; reliable but multi-stop and not instantly reversible in one hop.
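For contrast with BiFlow's single pass, here is an illustrative Euler sampler for a flow-matching model. `velocity_net` is a hypothetical learned velocity field, and the time convention is one common choice, not the paper's:

```python
import torch

def euler_sample(velocity_net, z, num_steps=50):
    # Multi-step sampling: integrate a learned velocity field from
    # noise (t = 0) toward data (t = 1) in many small Euler steps.
    x, dt = z, 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.size(0),), i * dt, device=x.device)
        x = x + dt * velocity_net(x, t)  # one network call per step
    return x
```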
The World Before: Classic NFs promised a neat, learned two-way map, but demanded every step be precisely undoable. Autoregressive NF variants (like TARFlow) finally used modern Transformers, but their reverse pass had to tiptoe token by token, making sampling slow and hard to parallelize. Many people switched to diffusion-like methods that sample in many steps.
The Problem: Can we keep the NF spirit (learned forward trajectories and a real two-way map) without being trapped by the exact-inverse rule and the causal decoding bottleneck?
Failed Attempts: People tried complex invertible blocks, coupling layers, and autoregressive masks to stay strictly reversible. That kept likelihoods tractable but limited architecture choices and slowed generation.
The Gap: No one had fully decoupled the forward path (for likelihood) from a fast, flexible learned reverse path (for sampling) that wasnāt forced to be an exact analytic inverse.
Real Stakes: Faster, high-quality, one-pass image generation matters for creative tools, mobile devices, games, education, and accessibility: think instant previews, low-latency apps, and energy savings on everyday hardware.
02 Core Idea
Imagine learning to ride a bike forward on a winding path (data→noise), and then training a separate, super-skilled friend to carry you back quickly (noise→data) without retracing every pebble.
The Concept: Bidirectional Normalizing Flow (BiFlow)
- What it is: A framework with two models: one learns the forward map (data→noise) as in classic NFs; a separate reverse model learns an approximate inverse (noise→data).
- How it works:
- Train a forward NF with maximum likelihood (standard NF training).
- Freeze it; then train a reverse model to map noise back to data.
- Use a new "hidden alignment" loss so the reverse model's hidden steps match the forward model's hidden steps.
- Why it matters: We keep the NF benefits (learned trajectories and likelihoods) but drop the handcuffs of exact invertibility, unlocking fast, parallel, single-pass generation. Anchor: It's like having a trail map for going out and a helicopter to bring you back: faster and not forced to walk every switchback in reverse.
Three Analogies for the Key Insight:
- Translation Team: One expert translates English→French with strict grammar notes. A second expert learns to go French→English by studying the notes and examples, not by memorizing the exact reverse of every rule; still accurate, but freer and faster.
- Baking & Unbaking: The forward baker records each mixing step; the reverse chef learns how to assemble the cake directly from ingredients with guidance from the baking journal, rather than rewinding dough molecule by molecule.
- Maze Footprints: The forward walker leaves footprints; the reverse runner learns to align with those footprints at key spots (hidden alignment) and speeds to the exit in one go.
You know how landmarks on a trail help you know you're on the right path?
The Concept: Hidden Alignment Objective
- What it is: A training loss that lines up the reverse model's intermediate hidden representations with the forward model's intermediate states.
- How it works:
- Run data through the forward model; collect its hidden states.
- Run noise through the reverse model; collect its hidden states.
- Use learnable projection heads so we can compare apples to apples, then minimize differences along the whole trajectory (a minimal sketch follows this list).
- Why it matters: Without alignment, the reverse model might only learn to match the final output and miss the path structure; with alignment, it learns a stable step-by-step route back. Anchor: It's like checking in at every trail sign, not just at the finish line.
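A minimal PyTorch sketch of the idea, assuming simple linear projection heads and an unweighted per-block MSE; the paper's exact head design and loss weighting may differ:

```python
import torch
import torch.nn as nn

class HiddenAlignmentLoss(nn.Module):
    """Align reverse hidden states to frozen forward hidden states."""

    def __init__(self, num_blocks, rev_dim, fwd_dim):
        super().__init__()
        # One learnable projection head per block, so the two models
        # are free to use different hidden spaces.
        self.heads = nn.ModuleList(
            nn.Linear(rev_dim, fwd_dim) for _ in range(num_blocks)
        )

    def forward(self, reverse_hiddens, forward_hiddens):
        loss = 0.0
        for head, h_rev, h_fwd in zip(self.heads, reverse_hiddens, forward_hiddens):
            # Forward states come from the frozen NF, so detach them.
            loss = loss + torch.mean((head(h_rev) - h_fwd.detach()) ** 2)
        return loss
```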
Think about getting a full meal in one tray instead of many mini-courses.
The Concept: Single-Pass Generation (1-NFE)
- What it is: Generating an image in one forward pass of the reverse model.
- How it works:
- Sample noise once.
- Feed it through the reverse model once.
- Get the final image; no loops, no token-by-token decoding.
- Why it matters: Without this, sampling is slow and hard to parallelize; with it, we get major speedups and easy scaling. Anchor: Like pressing "print" once and getting the whole page, not letter by letter.
Imagine cleaning a photo with a built-in filter instead of running a separate cleanup program afterward.
The Concept: Learned Denoising Block
- What it is: A final block inside the reverse model that learns to remove noise as part of the same pass.
- How it works:
- Extend the reverse with one more block dedicated to denoising.
- Train it jointly with hidden alignment and reconstruction.
- Skip the old extra score-based denoising step entirely.
- Why it matters: Without learned denoising, you'd pay for an extra, expensive cleanup; with it, you get cleaner images in the same single pass. Anchor: It's like your camera app auto-fixing graininess when you snap the photo; no extra app needed.
When judging a painting, we care about the whole look, not just counting pixels.
The Concept: Perceptual Loss
- What it is: A loss that compares images by feature similarity (e.g., VGG/ConvNeXt features) so results look better to people.
- How it works:
- Decode latents to images via a VAE.
- Extract features with pre-trained networks.
- Add this loss to guide training toward visually pleasing outputs (a sketch follows this list).
- Why it matters: Pixel-only losses can be too strict or blind to semantics; perceptual loss steers the model toward sharper, more realistic images. Anchor: Like judging two songs by melody and rhythm, not by matching the exact sound wave at every microsecond.
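A hedged sketch of a VGG-based perceptual loss. The paper also uses ConvNeXt features; the truncation layer chosen here and the (omitted) ImageNet input normalization are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class PerceptualLoss(nn.Module):
    """Compare images in a pre-trained feature space instead of pixel space."""

    def __init__(self, layer_index=16):
        super().__init__()
        # Truncate VGG-16 at an intermediate layer and freeze it.
        self.features = vgg16(weights="IMAGENET1K_V1").features[:layer_index].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, generated, real):
        # Inputs are expected as ImageNet-normalized (B, 3, H, W) tensors.
        return torch.mean((self.features(generated) - self.features(real)) ** 2)
```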
Think of a radio with a clarity knob.
The Concept: Classifier-Free Guidance (CFG) in BiFlow
- What it is: A way to push generations to match the class label more strongly without a separate classifier.
- How it works:
- Train conditional and unconditional predictions together.
- During training, fold guidance into the model so sampling stays 1-NFE.
- Optionally condition on the guidance scale so you can tune it later (see the sketch after this list).
- Why it matters: Without CFG, images may be less faithful to the prompt/class; traditional CFG doubles compute, but training-time CFG keeps it one-pass. Anchor: A clarity knob that's already built into the speaker: better sound without extra boxes.
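A sketch of the training-time idea under stated assumptions: `model` is a hypothetical conditional generator, `None` denotes the null (unconditional) label, and the exact guidance convention varies across papers:

```python
import torch

@torch.no_grad()
def guided_target(model, z, labels, w):
    # Classic CFG combines two predictions at every sampling step; here
    # the combination is computed once during training and used as a
    # regression target for a single-pass model conditioned on w,
    # so inference stays 1-NFE. w = 0 recovers the plain conditional output.
    cond = model(z, labels)
    uncond = model(z, None)
    return uncond + (1.0 + w) * (cond - uncond)
```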
Before vs After:
- Before: Exact-inverse rule, causal decoding, extra denoising pass, limited architectures.
- After: Learned reverse with hidden alignment, fully parallel 1-NFE sampling, built-in denoising, flexible Transformers, perceptual loss.
Why It Works (intuition): Hidden alignment teaches the reverse model not just the destination but the whole path structure. Learning denoising inside the reverse simplifies the pipeline. Decoupling design frees the reverse model to use powerful bidirectional attention. And training-time CFG keeps guidance benefits without extra steps.
Building Blocks:
- Forward NF (e.g., improved TARFlow) trained by likelihood.
- Reverse Transformer with bidirectional attention.
- Projection heads for hidden alignment.
- Final denoising block.
- Flexible distance metrics (MSE + perceptual).
03 Methodology
High-level recipe: Input (image) → Forward NF to noise (and record hidden states) → Train the reverse model to map noise back to the image, aligning hidden steps → One-pass sampling from noise to image.
Step 0: Data domain
- Work in the VAE latent space: images are encoded into a 32×32×4 latent grid. This keeps models lighter and faster.
You know how a camera makes a smaller, compressed version of a photo (a thumbnail) for quick operations?
The Concept: VAE Tokenizer/Latent Space
- What it is: A pre-trained encoder/decoder that maps images to compact latents and back.
- How it works: Encode image → latent; model operates on latents; decode latent → image for viewing and perceptual loss (an encode/decode sketch follows this list).
- Why it matters: Without latents, models are heavier and slower; latents make training and sampling efficient. Anchor: Like sketching on a small notepad before painting on a big canvas.
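A minimal sketch using a common Stable-Diffusion-style VAE from diffusers, which maps 256×256×3 images to 32×32×4 latents; treating this as the paper's exact tokenizer is an assumption:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").eval()

@torch.no_grad()
def encode(images):  # images: (B, 3, 256, 256) scaled to [-1, 1]
    return vae.encode(images).latent_dist.sample() * vae.config.scaling_factor

@torch.no_grad()
def decode(latents):  # latents: (B, 4, 32, 32)
    return vae.decode(latents / vae.config.scaling_factor).sample
```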
Step 1: Train the forward model (NF)
- What happens: Train an improved TARFlow-like NF with maximum likelihood to map noisy images to a simple Gaussian prior. Record all intermediate hidden states along the forward trajectory.
- Why needed: Establishes the learned path from data to noise and provides exact pairings (x ↔ z) and hidden "checkpoints" for supervising the reverse model.
- Example: For an ImageNet cat image, the forward model outputs a latent z and a list of states x1...xB.
Think of turning down the volume when a song gets too loud so the whole playlist is comfortable to hear.
The Concept: Norm Control
- What it is: Techniques to keep hidden-state magnitudes stable across blocks so training signals stay balanced.
- How it works: Clip certain forward parameters within a range; normalize hidden states for alignment (illustrated in the sketch below).
- Why it matters: Without it, some blocks dominate the loss, confusing the reverse model. Anchor: Like setting consistent microphone levels so every speaker is heard clearly.
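Two illustrative norm-control utilities; the clip range and the choice of which parameters to clamp are assumptions, not the paper's settings:

```python
import torch

@torch.no_grad()
def clip_params(module, lo=-5.0, hi=5.0):
    # Clamp forward-model parameters in place so hidden-state
    # magnitudes stay bounded across blocks.
    for p in module.parameters():
        p.clamp_(lo, hi)

def normalize_hidden(h, eps=1e-6):
    # Scale a hidden state to unit norm before the alignment loss,
    # so no single block dominates the objective.
    return h / (h.norm(dim=-1, keepdim=True) + eps)
```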
Step 2: Design the reverse model
- What happens: Build a bidirectional Transformer (not causal) with B + 1 blocks, where the last block is a denoiser. Add learnable projection heads that map reverse hidden states into the forward state space for comparison.
- Why needed: Bidirectional attention enables full parallelism and richer context; projections allow flexible hidden spaces without forcing repeated back-and-forth to input size.
- Example: A base-size ViT with modern components and in-context conditioning runs all tokens at once.
Step 3: Hidden alignment training
- What happens: For each training image:
- Add training noise to the image (as in the TARFlow setup) and pass it through the forward NF to get the hidden trajectory and final z.
- Feed z into the reverse model to get reverse hiddens and a reconstructed clean latent x′.
- Use projection heads φi to align reverse hidden states to the forward hidden states across all blocks.
- Add a reconstruction loss on (x, x′).
- Why needed: Aligning the whole path teaches the reverse model a stable, inverse-like mapping, not just a lucky final guess (a training-step sketch follows).
- Example: If forward x7 captures large-scale structure and x15 fine texture, the reverse must learn complementary h7 and h15 that project to those levels.
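Putting the pieces together, a hedged sketch of one reverse-model training step. The interfaces are assumptions: a frozen `forward_nf` returning (z, hiddens), a `reverse_model` returning (reconstruction, hiddens), and the `HiddenAlignmentLoss` from the earlier sketch:

```python
import torch

def reverse_training_step(forward_nf, reverse_model, align_loss, images, labels):
    # 1) Frozen forward pass: data -> noise, recording hidden checkpoints.
    with torch.no_grad():
        z, fwd_hiddens = forward_nf(images, labels)
    # 2) Reverse pass: noise -> reconstructed data, with its own hiddens.
    x_rec, rev_hiddens = reverse_model(z, labels)
    # 3) Align the whole trajectory and reconstruct the input
    #    (the paper additionally uses perceptual terms on decoded images).
    loss = align_loss(rev_hiddens, fwd_hiddens)
    loss = loss + torch.mean((x_rec - images) ** 2)
    return loss
```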
Step 4: Learned denoising inside reverse
- What happens: Extend the reverse with a final denoising block that maps a slightly noisy prediction to a clean sample.
- Why needed: Removes the costly score-based denoising used by TARFlow, keeping generation to one pass.
- Example: The last block wipes faint speckles while preserving edges and color.
Step 5: Distance metrics (losses)
- What happens: Combine adaptively-weighted MSE for hidden alignment with perceptual losses (VGG, ConvNeXt) on the decoded image.
- Why needed: MSE keeps numeric alignment stable; perceptual loss improves visual fidelity and class faithfulness.
- Example: If two dog images are pixel-different but look the same to us, perceptual loss acknowledges that similarity.
Step 6: Training-time CFG for 1-NFE guidance
- What happens: Train the reverse model to produce guided outputs without requiring two passes at inference. Optionally feed the guidance scale as a condition so users can tweak it later.
- Why needed: Classic CFG doubles compute during sampling; training-time CFG keeps single-pass speed.
- Example: Dialing guidance from 0.0 to 2.0 sharpens class details while staying 1-NFE.
Step 7: Sampling
- What happens: Draw a noise sample z ~ N(0, I), run the reverse model once, and decode via the VAE to get an image (sketch below).
- Why needed: This is the fast path: no loops, no causal chains, no extra denoising.
- Example: One forward pass yields a crisp 256×256 class-conditional image.
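A 1-NFE sampling sketch under the same assumed interfaces as the training sketch above; `vae_decode` could be the decode helper from the VAE sketch:

```python
import torch

@torch.no_grad()
def sample(reverse_model, vae_decode, class_labels, latent_shape=(4, 32, 32)):
    # One noise draw, one reverse pass, one VAE decode: no loops.
    z = torch.randn(class_labels.size(0), *latent_shape)
    latents, _ = reverse_model(z, class_labels)  # single forward pass
    return vae_decode(latents)
```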
What breaks without each step?
- No hidden alignment: reverse learns only the endpoint, leading to instability and lower fidelity.
- No denoising block: you need an extra score-based pass, which is slower and more complex.
- No norm control: losses skew, some blocks mis-train.
- No perceptual loss: images may be numerically close but look worse.
- No training-time CFG: guided results cost twice the compute.
Secret sauce:
- Hidden alignment with learnable projections: keeps supervision rich and flexible.
- Learned denoising: merges cleanup into the same pass.
- Decoupled reverse: enables powerful, fully parallel Transformers.
- Training-time CFG: keeps guidance while staying 1-NFE.
04 Experiments & Results
You know how a race matters only if you time the runners and compare them to others?
The Concept: FID (Fréchet Inception Distance)
- What it is: A popular score that measures how close generated images are to real images; lower is better.
- How it works: Extract features from real and fake image sets and compare their distributions (the formula below makes this concrete).
- Why it matters: Without a fair score, we can't tell which model really makes better images. Anchor: It's like judging two photo albums by how similar their "style fingerprints" are.
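Concretely, FID fits a Gaussian to Inception features of the real (r) and generated (g) image sets, with means μ and covariances Σ, and computes (the standard definition, not specific to this paper):

```latex
\mathrm{FID} \;=\; \lVert \mu_r - \mu_g \rVert_2^2 \;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\big(\Sigma_r \Sigma_g\big)^{1/2} \right)
```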
The Test:
- Dataset: ImageNet 256×256, class-conditional generation, evaluated on 50,000 samples.
- Metrics: FID (lower is better) and Inception Score (higher is better).
- Setup: Models operate in the latent space of a pre-trained VAE (32×32×4). All timings reported include or note VAE decoding where relevant.
The Competition:
- Baseline forward model: improved TARFlow (iTARFlow) with various sizes (B/2 to XL/2), used both as a strong baseline and as BiFlow's forward.
- Comparisons: Classic NF baselines (e.g., TARFlow/STARFlow), other one-step (1-NFE) methods, GANs, and multi-step diffusion/flow-matching Transformers.
Scoreboard (highlights):
- BiFlow-B/2 (≈133M params): FID 2.39 with a single pass (1-NFE) and competitive IS.
- Versus improved TARFlow-XL/2 (≈690M params): BiFlow-B/2 achieves better FID while being dramatically faster (e.g., up to two orders of magnitude wall-clock speedup on some hardware for the generator; the overall speedup depends on whether VAE decoding is included).
- Speed: Thanks to 1-NFE and full parallelism, BiFlow runs extremely fast on TPU/GPU/CPU; the VAE decode can become the dominant remaining cost.
Context for Numbers:
- Saying "FID 2.39" is like scoring an A+ when many strong NF baselines get B's or C's, and doing it with a smaller model in one step.
- Speedups of 60× to 700× (hardware-dependent) are like downloading a movie in seconds instead of minutes.
Ablations (what mattered most):
- Reverse Learning Strategy: • Naive distillation (match only the final output) beat the exact-inverse baseline. • Hidden distillation (match every hidden state in input space) underperformed. • Hidden alignment (our method) won clearly: best FID and best reconstruction metrics.
- Learned Denoising: • Replaced TARFlow's extra score-based pass with a learned block, improving FID and cutting compute.
- Norm Control: • Clipping forward parameters or normalizing trajectories stabilized losses and improved results.
- Distance Metrics: • Adding perceptual loss (VGG + ConvNeXt) sharply improved FID; with strong perceptual features, the optimal CFG scale often approached zero.
- Guidance: • Training-time CFG matched or beat inference-time CFG while keeping 1-NFE.
Surprising Findings:
- A learned reverse can outperform the exact analytic inverse in quality. Why? It's trained directly to reconstruct real images (not to mimic the analytic inverse's outputs), uses perceptual feedback, and leverages hidden alignment to learn a stable global mapping.
- After adding ConvNeXt perceptual loss, scaling the reverse further gave diminishing returns, suggesting possible overfitting that future work can address.
Big Picture:
- Among NF methods, BiFlow sets a new bar: state-of-the-art NF FID with true 1-NFE speed.
- Among all one-step generators, BiFlow is highly competitive, showing that classic NF ideas can shine with modern tricks.
05 Discussion & Limitations
Limitations:
- Dependence on the forward model: If the forward NF learns a poor trajectory, the reverse training signals (including hidden alignment) weaken.
- Overfitting risk: With strong perceptual losses and large models, performance can plateau or even degrade; careful regularization or data strategies may be needed.
- VAE bottleneck: Because BiFlow's generator is so fast, the fixed VAE decoder can become a noticeable fraction of total runtime.
- Scope: Results are reported on ImageNet 256×256 with a specific VAE; generalization to other domains and high resolutions, while promising, requires engineering.
Required Resources:
- Forward NF training (iTARFlow) at scale and then reverse model training; modern accelerators (TPUs/GPUs) advised.
- Pre-trained VAE tokenizer/decoder.
- Feature networks (VGG/ConvNeXt) for perceptual losses.
When NOT to Use:
- If exact likelihood evaluation at sampling time via an explicit inverse is strictly required by the application (BiFlow keeps likelihood via the forward, but the reverse used for sampling is learned, not analytic).
- If your hardware cannot host the VAE decoder or desired Transformer backbone efficiently.
- If your domain lacks good perceptual features and forward trajectories are unstable.
Open Questions:
- Scaling laws: How do reverse capacity, data size, and perceptual losses interact without overfitting?
- Beyond images: How well does BiFlow extend to audio, video, or 3D where trajectories and perceptual metrics differ?
- Joint training: Can forward and reverse be co-trained (or alternated) for even stronger synergy while keeping stability?
- Editing toolbox: What new training-free edits (beyond inpainting and class editing) emerge from an explicit bidirectional map?
06 Conclusion & Future Work
Three-sentence summary:
- BiFlow keeps the Normalizing Flow spirit, learning data→noise trajectories, but replaces the rigid exact inverse with a learned, flexible reverse model.
- A hidden alignment loss shapes the reverse path, a built-in denoising block removes extra cleanup, and training-time guidance preserves one-pass (1-NFE) speed.
- The result is state-of-the-art NF quality on ImageNet 256×256 and massive sampling speedups over causal NF baselines.
Main Achievement:
- Showing that an approximately learned inverse, guided by hidden alignment and perceptual feedback, can outperform an exact analytic inverse in both quality and speed, without giving up NF principles.
Future Directions:
- Explore co-training forward and reverse; refine norm control and regularization to curb overfitting at larger scales; port BiFlow to video, audio, or 3D; and compress or fuse the VAE to reduce total latency further.
Why Remember This:
- BiFlow breaks a long-held NF rule ("the reverse must be exact") and proves that learning the way back can be better. It turns a classical idea into a modern, practical, one-pass generator, opening the door to fast, flexible, and high-fidelity synthesis across domains.
Practical Applications
- Instant class-conditional image generation in mobile apps with one-tap previews.
- Real-time inpainting (object removal or hole filling) without extra training steps.
- Fast class editing: keep structure but switch labeled content (e.g., change dog breed).
- Interactive design tools that update images live as sliders move (guidance scale).
- Low-latency dataset augmentation for training downstream vision models.
- On-device creative filters that require minimal compute and energy.
- Rapid storyboard or thumbnail creation for media workflows.
- Edge deployment for vision systems where bandwidth and latency are limited.
- Educational demos showing data→noise mapping in a single step.
- Research platform for studying learned inverses and path alignment in generative models.