One-step Latent-free Image Generation with Pixel Mean Flows
Key Summary
- This paper shows how to make a whole picture in one go, directly in pixels, without using a hidden "latent" space or many tiny steps.
- The new method, called pixel MeanFlow (pMF), asks the network to predict a cleaned-up picture (x) while it learns using a different signal (a velocity-style loss), and it connects the two with a simple bridge formula.
- Predicting the clean picture is easier because these pictures lie on a smoother, lower-dimensional "manifold," while velocities are noisy and hard to learn in pixel space.
- A clever two-time trick (using times r and t) plus a tiny "how-fast-is-it-changing" peek (JVP) makes one-step training work.
- On ImageNet, pMF reaches 2.22 FID at 256×256 and 2.48 FID at 512×512 in just one step, matching or beating many much heavier systems.
- Trying to predict velocity directly in pixels fails badly at high resolution; predicting images (x) works great.
- Because the model outputs real pixels, we can add a perceptual loss (LPIPS), which gives a big quality bump.
- Using the Muon optimizer speeds learning and improves results for this single-step setup.
- The method keeps compute low even for big images by using big patches and skipping any latent decoder overhead.
- This work is a strong step toward simple, end-to-end generators that map noise directly to pictures in one shot.
Why This Research Matters
Faster, simpler image generators mean creative tools that respond instantly, which is great for artists, designers, and students. Because pMF outputs pixels directly, it naturally supports perceptual tuning, which leads to images that look better to human eyes. Skipping latent decoders and multi-step solvers reduces compute and energy, making high-quality generation more accessible and eco-friendlier. The one-step pipeline is easier to build into phones, tablets, or web apps where speed and simplicity count. This approach can also inspire similar end-to-end designs in audio, video, and 3D content. Overall, pMF nudges AI image creation closer to push-button practicality without sacrificing quality.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how some kids build a LEGO castle by adding one brick at a time, while others can snap together big chunks in just one move? For image-making AIs, we've mostly used the "many-bricks" way (many steps) and often built in a hidden workshop (latent space) before showing the picture.
Deep Learning (the helper that learns patterns)
- What it is: Deep learning is a way for computers to learn patterns from lots of examples using layered âneuronâ blocks.
- How it works:
- Show the network many input-answer pairs (like noisy image → clean image).
- It guesses an answer and gets a score for how close it was.
- It nudges its knobs (weights) to do a bit better next time.
- Repeat until it gets very good.
- Why it matters: Without this, the computer can't improve itself from data.
Anchor: Like practicing free throws: shoot, see how close you were, adjust, repeat.
Generative Modeling (making new things)
- What it is: Generative modeling is teaching a computer to create new, realistic examples, not just label existing ones.
- How it works:
- Learn what real images look like.
- Start from something simple (like random noise).
- Transform the noise step by step into a believable image.
- Why it matters: Without generators, we can't make new art, photos, or designs on demand.
Anchor: Like a baker who learns the recipe so well they can invent new, tasty cupcakes.
Diffusion (denoising) Models (cleaning pictures)
- What it is: Diffusion models learn to remove noise to get a clean image back.
- How it works:
- Add noise to real pictures until they look like TV static.
- Train a model to undo that noise at different levels.
- At test time, start with noise and keep cleaning until you get a picture.
- Why it matters: Without a great "cleaner," the final images are blurry or weird. (A tiny code sketch of this many-step cleaning loop follows this card.)
Anchor: Like un-crumpling a crumpled drawing a little at a time until it's flat again.
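To make the "many steps" idea concrete, here is a minimal sketch of a generic denoising loop, assuming PyTorch; `denoise_step` is a stand-in for a trained model plus its update rule, not any specific library call:

```python
import torch

def toy_reverse_diffusion(denoise_step, shape, num_steps=50):
    """Generic many-step sampler sketch: start from static, clean a little at each step."""
    z = torch.randn(shape)                          # pure "TV static"
    for i in reversed(range(num_steps)):            # walk the noise level down
        t = torch.full((shape[0],), i / num_steps)  # current noise level for the batch
        z = denoise_step(z, t)                      # placeholder: remove a bit more noise
    return z                                        # ideally, a believable picture
```

This loop runs the network many times; the whole point of one-step methods like pMF is to replace it with a single forward pass.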
Latent Space (the hidden workshop)
- What it is: A smaller, squished version of pictures where it's easier to work.
- How it works:
- A tokenizer compresses an image into fewer numbers (latents).
- The generator works in that tiny space.
- A decoder expands it back to pixels.
- Why it matters: It saves compute, but it adds extra parts (tokenizer/decoder) and hides details.
Anchor: Like folding a big map into a pocket guide and later unfolding it.
Flow Matching (following motion to the goal)
- What it is: A method that learns a "velocity field" telling how to move noise toward a real image.
- How it works:
- Mix an image and noise to get an in-between picture.
- Learn the direction-and-speed (velocity) that would carry it closer to the real image.
- Repeat for many times and mixtures so the model knows how to move anywhere.
- Why it matters: Without a good motion plan, the model gets lost. (A tiny code sketch of this mixing-and-velocity recipe follows this card.)
Anchor: Like arrows on a treasure map that tell you which way to step from any spot.
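As a small illustration of the mixing recipe above, here is a sketch assuming the common linear-interpolation convention used in flow matching (PyTorch tensors; the names are illustrative):

```python
import torch

def make_flow_matching_pair(x):
    """Mix an image with noise and form the velocity target (a sketch of the standard recipe)."""
    eps = torch.randn_like(x)              # pure noise with the image's shape
    t = torch.rand(x.shape[0], 1, 1, 1)    # one random time per image in the batch
    z_t = (1 - t) * x + t * eps            # the in-between, partly-noisy picture
    v_target = eps - x                     # the instantaneous "push" along this straight path
    return z_t, t, v_target                # a network is trained to predict v_target from (z_t, t)

# Example with a batch of four random 3-channel 64x64 stand-in images.
z_t, t, v_target = make_flow_matching_pair(torch.randn(4, 3, 64, 64))
```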
Velocity Field (the motion arrows)
- What it is: A function that, for any in-between picture and time, says which direction to go next.
- How it works:
- Look at the current noisy picture.
- Figure out the best "push" to reduce noise and add structure.
- Keep doing this until you reach a clean image.
- Why it matters: If the arrows point the wrong way, you won't reach a good picture.
Anchor: Like wind arrows on a weather map showing how a leaf would drift.
MeanFlow (averaging the motion)
- What it is: MeanFlow learns the average push between two times (r → t) so one model can do big, fast moves, even in one step.
- How it works:
- Pick two times (start r, end t).
- Define the average velocity as the mean push needed from r to t.
- Train the model so this average push matches what real flows would do.
- Why it matters: Without the average view, one-step jumps are hard to aim. (For curious readers, the defining formula is sketched just below.)
Anchor: Like planning a road trip by the overall direction from morning to evening, not every tiny turn.
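For readers who like symbols, here is the average-velocity idea as a small math sketch, written from the description above in the standard MeanFlow-style notation (a paraphrase, not a quote of the paper's equations): the average push over [r, t] is the mean of the instantaneous pushes, and differentiating that definition links the two.

```latex
u(z_t, r, t) = \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau
\quad\Longrightarrow\quad
v(z_t, t) = u(z_t, r, t) + (t - r)\,\frac{d}{dt}\, u(z_t, r, t)
```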
Manifold Hypothesis (data lives on a smooth shape)
- What it is: Real images don't fill all pixel combinations; they lie on a smoother, lower-dimensional "surface" inside the giant pixel space.
- How it works:
- Notice that natural photos share structure (edges, textures).
- These rules squeeze possible images onto a simpler shape.
- Predicting points on that shape is easier than predicting anywhere in the huge space.
- Why it matters: If we predict noisy stuff off the manifold, learning gets much harder.
Anchor: Like train tracks (the manifold) guiding where the train (images) can realistically go.
The world before: Most strong image makers either (a) used many steps or (b) worked in a hidden latent space. We got great pictures but paid in complexity, extra decoders, and time. People improved single-step training (Consistency Models, MeanFlow) and also pixel-space Transformers (like JiT). But combining both, one step and pixels, stressed the network: one model had to control the whole jump and also handle raw, huge pixel spaces.
The problem: In pixels, predicting the velocity field is super noisy and high-dimensional. The network gets overwhelmed, especially at high resolution. We need a target that's easier to learn but still lets us train a one-step flow.
Failed attempts: Predicting raw velocity in pixels falls apart at 256×256 and beyond. Pre-conditioners that mix the input and output make the target leave the nice manifold. Training with only one time line (just r=t or just r=0) also fails because it misses the big picture of the trajectory.
The gap: We were missing a way to let the network predict something clean and low-dimensional (like a denoised picture) while still training with the right flow signals for one-step jumping.
Real stakes: A simple, one-step, pixel-space generator means faster creation, fewer moving parts, and no heavy latent decoder. It's easier to plug in "how humans see" (perceptual loss) because the model outputs real pixels. That can mean crisper photos on your phone, speedier design tools, and greener compute for big labs.
02 Core Idea
Hook: Imagine you're pushing a sled down a hill to a target spot. You can plan every micro-push (many steps), or you can learn the one perfect push that gets you there in one go, if you know how strong and where to push.
The Aha! (in one sentence)
- Let the network predict a denoised image (easy, smooth manifold) but train it using a velocity-style loss (the right physics), connecting them with a simple bridge formula.
Three analogies for the same idea:
- Map and destination: Predict the destination photo (denoised x) because destinations lie on clean roads (manifold). Then compute the needed average push to get there and learn from that push signal.
- Chef and sauce: Have the chef make the finished dish (x). Use a food judge who scores not just taste but also the "path to taste" (velocity loss). Connect dish → score with a recipe rule.
- Lego and instructions: Build the final model (x) directly. Also learn from how fast the build should change with time (velocity/JVP). A short rule translates between them.
Before vs After:
- Before: One-step models often worked in latent space, or pixel-space models needed many steps. Predicting pixel velocities directly was too noisy.
- After: pMF predicts clean images but trains with motion signals by converting x → average push u → a refined push V. This makes one-step, pixel-space generation feasible and strong.
Why it works (intuition, no math):
- Clean pictures lie on a smoother track (manifold). Predicting them is easier than predicting noisy arrows in pixel land.
- A short bridge formula says: "clean picture = current noisy picture minus time × average push." So, once you predict the clean picture, you can back out the average push.
- A tiny extra peek, how that push changes with time (JVP), aligns the model with the true immediate push. That's the secret to jumping in one step.
Building blocks:
- Two times (r, t): r is a start time, t is a later time. We learn average push from r to t. Training samples many (r, t) pairs across the whole triangle 0 ≤ r ≤ t ≤ 1.
- x-prediction: The network outputs a denoised image x̂ that stays near the image manifold (easy to learn).
- Convert to u: From x̂ and the mixed input z_t, compute the average push û = (z_t − x̂)/t.
- Refine to V: Add a "how-fast-it-changes" correction to get V̂ = û + (t − r) × (tiny time-derivative peek).
- Train with velocity loss: Compare V̂ to the true instantaneous push signal and improve (a short code sketch of this bridge appears after this list).
- Perceptual loss: Because the network outputs pixels, we can add a human-vision-style loss (LPIPS) when noise is low to boost realism.
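Here is a minimal code sketch of the bridge described in the list above (PyTorch-style; `net`, `z_t`, `r`, `t` are illustrative placeholders, not the authors' code):

```python
import torch

def bridge_to_average_push(net, z_t, r, t):
    """x-prediction -> average push: the simple conversion that links the two worlds."""
    x_hat = net(z_t, r, t)         # the network's direct output: a denoised image
    u_hat = (z_t - x_hat) / t      # bridge formula: average push implied by that guess
    return x_hat, u_hat            # the refined push V_hat = u_hat + (t - r) * d(u_hat)/dt is added during training
```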
Anchor: Like aiming a basketball: you predict where the ball should land (the hoop, x), then compute what shove you need (u, then V) and train your arms using that shove signal. Predicting the hoop is easier than predicting the air gusts directly.
Perceptual Loss (seeing like humans)
- What it is: A score that checks if two images "feel" the same to a vision network trained on real pictures.
- How it works:
- Pass both images through a pretrained vision net.
- Compare their feature maps, not just pixels.
- Penalize differences that humans would notice.
- Why it matters: Without it, textures that match pixel-by-pixel but look less realistic can slip through. (A short usage sketch follows this card.)
Anchor: Like judging two songs by rhythm and melody, not only by raw sound waves.
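As one concrete example, the widely used lpips Python package exposes this kind of perceptual distance in a few lines; this is a generic usage sketch, and pMF's exact perceptual setup (for instance, which feature network it uses, such as VGG or ConvNeXt-V2) may differ:

```python
import torch
import lpips  # pip install lpips

# Perceptual metric built on a pretrained VGG feature extractor.
perceptual = lpips.LPIPS(net='vgg')

# Two images scaled to [-1, 1], shape (batch, 3, H, W); random stand-ins here.
img_a = torch.rand(1, 3, 256, 256) * 2 - 1
img_b = torch.rand(1, 3, 256, 256) * 2 - 1

distance = perceptual(img_a, img_b)  # small value = the two images "feel" similar
print(distance.item())
```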
03 Methodology
At a high level: Input (image x and noise ε) → mix them into a noisy picture z_t → network predicts a cleaned picture x̂(z_t, r, t) → convert x̂ to an average push û → refine to V̂ with a small time-change peek → train so V̂ matches the true push → output is the clean picture in one step. (A minimal code sketch of one training step and one-step sampling appears after the recipe below.)
Step-by-step, like a recipe:
- Make an in-between picture
- What happens: Pick a time t between 0 and 1 and mix the real image x with noise ε to get z_t = (1 − t)·x + t·ε.
- Why it exists: The model must learn how to move from any noisy level back to clean.
- Example: If t = 0.7, z_t is 70% noise, 30% image, so it is quite messy.
- Feed z_t plus times (r, t) to the network
- What happens: The network sees the noisy picture and the pair (r, t). It outputs a denoised-looking picture x̂(z_t, r, t).
- Why it exists: We want the model's direct output to live on the easy manifold (denoised images), not on noisy velocity space.
- Example: Even when z_t is very noisy, x̂ should look like a blurred-but-sensible version of the true image.
- Convert the denoised x̂ into an average push û
- What happens: Use a simple bridge: û = (z_t − x̂)/t. Intuition: If you know where you are (z_t) and your guess of the clean image (x̂), the average push needed is their difference divided by the elapsed time t.
- Why it exists: We train using motion signals, but we predict images. This step connects the two worlds.
- Example: If z_t is far from x̂, û is big (needs a strong push); if close, û is small.
- Make a refined push V̂ using a tiny time-change peek
- What happens: We add a correction (t − r) × (how û changes with time). This "peek" is computed with an efficient trick (JVP) and we stop its gradient to keep training stable.
- Why it exists: This aligns the model's average push with the true instantaneous push the loss uses.
- Example: If push needs to change quickly as time moves, the correction helps the one-step aim.
- Compute the training target push
- What happens: The true instantaneous push at time t is basically (ε − x). With classifier-free guidance (optional), we form a guided target that improves class-conditional sharpness.
- Why it exists: The model needs a clear, consistent signal to learn the right motion.
- Example: Think "where the wind should blow right now" to clean the picture fastest.
- Measure error (loss) and update the network
- What happens: Compare V̂ and the target push with an L2 loss. Also, when noise is not too heavy (t below a threshold), compare x̂ to the true image x with a perceptual loss (LPIPS) to reward human-pleasing structure.
- Why it exists: Two signals, physics-style motion and human-vision similarity, guide learning.
- Example: Even if pixels differ a bit, if the features match, LPIPS says, "looks right!"
- Train with smart choices
- What happens: Sample many (r, t) pairs across the whole triangle 0 ≤ r ≤ t ≤ 1 so the model learns the entire field, not just a line. Use the Muon optimizer to speed up early learning, which matters more here because the target uses the model's own predictions.
- Why it exists: Restricting time sampling (only r=t or only r=0) fails; faster early progress improves the moving training target.
- Example: It's like practicing throws from everywhere on the court, not just one spot.
- One-step inference (make a picture)
- What happens: Start from pure noise ε, set up the condition (like a class) and a guidance scale, run the network once to get x̂, and that's your final image: no decoder, no multi-step solver.
- Why it exists: Simplicity and speed: a single forward pass to pixels.
- Example: Press a button, get a picture.
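Putting the recipe together, here is a compact sketch of one training step and one-step sampling. It assumes PyTorch with `torch.func.jvp` for the time-change peek; the `net` interface, the MeanFlow-style tangent used in the JVP, the omitted classifier-free guidance, and the LPIPS wiring are illustrative assumptions rather than the authors' released code.

```python
import torch
from torch.func import jvp

def pmf_training_loss(net, x, lpips_fn=None, t_lpips_max=0.5):
    """One pMF-style training step (illustrative sketch of the recipe above)."""
    B = x.shape[0]
    eps = torch.randn_like(x)                      # noise endpoint
    t = torch.rand(B, 1, 1, 1).clamp(min=1e-3)     # avoid dividing by t = 0 in the bridge
    r = torch.rand(B, 1, 1, 1) * t                 # sample (r, t) over 0 <= r <= t <= 1
    z_t = (1 - t) * x + t * eps                    # in-between picture
    v_target = eps - x                             # instantaneous push (guidance omitted here)

    def u_fn(z, time):
        # Average push implied by the x-prediction (bridge formula).
        return (z - net(z, r, time)) / time

    # Time-change peek: how the average push changes along the trajectory,
    # computed as a JVP with tangents (v_target for z, 1 for t), MeanFlow-style.
    u_hat, du_dt = jvp(u_fn, (z_t, t), (v_target, torch.ones_like(t)))
    v_hat = u_hat + (t - r) * du_dt.detach()       # stop-gradient on the peek for stability

    loss = ((v_hat - v_target) ** 2).mean()        # velocity-style L2 loss

    if lpips_fn is not None:                       # optional perceptual term at mild noise levels
        x_hat = z_t - t * u_hat                    # predicted clean image in pixels
        weight = (t < t_lpips_max).float()
        loss = loss + (weight * lpips_fn(x_hat, x)).mean()
    return loss

@torch.no_grad()
def sample_one_step(net, shape):
    """One-step inference: pure noise in, picture out, in a single forward pass.

    Class conditioning and the guidance scale are omitted for brevity.
    """
    eps = torch.randn(shape)
    t_one = torch.ones(shape[0], 1, 1, 1)
    r_zero = torch.zeros_like(t_one)
    return net(eps, r_zero, t_one)                 # the x-prediction is the final image
```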
Mini data example (toy intuition):
- Suppose an 8×8 grayscale image and t=0.5. We mix the clean image with 50% noise to get z_t.
- The network sees (z_t, r, t) and predicts x̂ that looks like a softened version of the image (edges still there, details fuzzy).
- From x̂, compute û = (z_t − x̂)/0.5. If the difference is big around edges, û is big there.
- Add the small time-change correction to get V̂, compare to the target push, and learn.
Secret sauce (why this is clever):
- Decoupled spaces: Predict in the easy space (denoised images) but train in the correct space (velocity). Best of both worlds.
- Two-time training: Learning average pushes across (r, t) helps one-step jumps.
- What-you-see-is-what-you-get: Because outputs are real pixels, perceptual loss cleanly fits, improving realism.
- Compute savvy: No latent decoder; big patches keep FLOPs low even for big images.
04 Experiments & Results
The test: The team measured Fréchet Inception Distance (FID), where lower is better, on ImageNet. They focused on 1-NFE (one network function evaluation), meaning truly one-step generation directly to pixels.
The competition: They compared against multi-step latent models (like DiT/SiT), multi-step pixel models, one-step latent models (like iMF), GANs, and the only other one-step pixel method (EPG). Most rivals either need many steps, use a latent decoder (extra compute), or both.
Scoreboard with context:
- Core results: pMF gets 2.22 FID at 256×256 and 2.48 FID at 512×512 in one step. That's like scoring an A+ while others need many tries or extra tools.
- Against one-step latent methods: pMF narrows or beats gaps while avoiding decoder overhead (which alone can be hundreds to over a thousand GFLOPs at 256–512 resolutions).
- Against multi-step pixel methods: Many are strong but require 100–1000 steps; pMF gets comparable quality in one.
- Against GANs: Top GANs are also one-step in pixels; pMF is competitive on FID and scales well with Transformer backbones and big patches for lower compute.
Key ablations and findings:
- Predicting x vs predicting velocity u in pixels: At 64×64, both work. At 256×256, u-prediction collapses (FID ~164!), while x-prediction is fine (~9.56 baseline before extra tricks). This shows the manifold advantage of x.
- Perceptual loss (LPIPS): Adding it drops FID by about 6 points (e.g., 9.56 → 3.53 in the ablation), a huge boost. Because outputs are pixels, LPIPS fits naturally.
- Optimizer: Muon trains faster and better than Adam in this one-step setting, improving both convergence speed and final FID.
- Time sampling: Training across the full (r, t) triangle is crucial; sticking to only r=t (flow matching line), only r=0 (consistency-like line), or just both lines fails badly.
- High resolution with constant sequence length: By increasing patch size (e.g., 32 or 64), pMF keeps compute roughly steady from 256 to 1024 while staying strong (e.g., 2.48 at 512×512); a quick token-count check follows this list.
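A quick back-of-the-envelope check of this constant-compute trick, assuming the standard Vision Transformer tokenization where the number of tokens is (resolution / patch size) squared; the specific resolution/patch pairings below are illustrative:

```python
def num_tokens(resolution: int, patch: int) -> int:
    """Tokens for a square image split into square, non-overlapping patches."""
    return (resolution // patch) ** 2

# Doubling the patch size along with the resolution keeps the token count fixed,
# so the Transformer's sequence length (and much of its compute) stays steady.
for resolution, patch in [(256, 16), (512, 32), (1024, 64)]:
    print(resolution, patch, num_tokens(resolution, patch))  # 256 tokens in every case
```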
Surprising discoveries:
- u-prediction in pixel space fails catastrophically at high dimension, even though both setups use the same velocity loss. The difference is the prediction target's learnability.
- Pre-conditioners that mix input and output (common in other one-step models) underperform pure x-prediction here because they drag predictions off the clean manifold.
- The latent decoder's cost, often ignored, can exceed the whole pMF generator, so pixel-space one-step can be more compute-friendly than it first appears.
05 Discussion & Limitations
Limitations:
- One-step may still trail the absolute best multi-step systems on some ultra-fine textures, especially without extra tricks (e.g., adversarial losses) not explored here.
- Training is sensitive: you need the two-time setup, stable JVP handling, and good time sampling; poor choices can collapse performance.
- The method is shown on ImageNet class-conditional images; text-to-image or cross-modal tasks need extra conditioning work.
- Very high resolutions are feasible with big patches, but some tiny details may benefit from refiners or multi-scale heads.
Required resources:
- Big Vision Transformers, lots of data, and many training epochs.
- Modern accelerators (TPUs/GPUs), mixed-precision training, and memory for large batches.
- Feature networks for perceptual loss (e.g., VGG or ConvNeXt-V2).
When NOT to use it:
- If you already rely on a well-tuned latent pipeline that must support many guidance passes or extra modules (e.g., super-res refinement).
- If your target domain isn't image-like (where the "denoised x lies on a manifold" story doesn't hold).
- If you require calibrated likelihoods or explicit density estimates (this is a sampler, not a likelihood model).
Open questions:
- Theory: Can we formally prove that the generalized denoised field x(z_t, r, t) stays near the image manifold for all r<t?
- Better time samplers: Are there smarter ways to cover the (r, t) triangle for faster training?
- Loss design: Can hybrid perceptual/physics losses do even better? What about learned perceptual metrics?
- Beyond images: How does pMF extend to audio, video, or 3D, where manifolds differ?
- Efficiency: Can we distill pMF into even smaller, mobile-friendly models without losing quality?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces pixel MeanFlow (pMF), a one-step method that generates images directly in pixel space by predicting a clean image while training with a velocity-style loss. A simple bridge converts the predicted image into an average push, plus a small time-change peek to match the true instantaneous motion, making one-shot sampling accurate. The result is state-of-the-art one-step, latent-free image generation on ImageNet with low compute and strong quality at multiple resolutions.
Main achievement: Decoupling the network's prediction space (denoised images on a manifold) from the training loss space (velocity) and linking them with a short, effective formula enables reliable one-step, pixel-space generation.
Future directions:
- Extend pMF to text-to-image, video, and multi-modal tasks with richer conditioning.
- Improve theory and time sampling for even faster, stabler training.
- Explore learned or adaptive perceptual losses and tiny refiner heads for ultra-fine textures.
- Distill pMF for edge devices and interactive applications.
Why remember this: pMF shows that "noise → pixels" in one clean step is not only possible but competitive, simplifying the pipeline, saving compute, and opening doors to faster, more accessible generative tools.
Practical Applications
- Instant style transfer or filter previews directly on mobile devices.
- Rapid thumbnail and mockup generation in design tools with immediate feedback.
- Fast class-conditional dataset augmentation for training image classifiers.
- On-device creative assistants that sketch ideas in a single tap without cloud compute.
- Interactive education demos that show how noise turns into images in one step.
- Low-latency content generation for games (textures, props) during runtime.
- Efficient A/B testing of product imagery without heavy pipelines.
- Quick restoration of lightly noisy or compressed photos using perceptual tuning.
- Edge deployment in kiosks or AR glasses where compute and power are limited.
- Prototype-to-visual concepting for marketing teams under tight deadlines.