
Self-Refining Video Sampling

Intermediate
Sangwon Jang, Taekyung Ki, Jaehyeong Jo et al. · 1/26/2026
arXiv · PDF

Key Summary

  • This paper shows how a video generator can improve its own videos during sampling, without extra training or outside checkers.
  • The key trick is Predict-and-Perturb (P&P): at the same noise level, the model first cleans its guess, then lightly re-noises it to try again.
  • They reinterpret flow matching as a time-aware denoising autoencoder, which justifies these repeated clean-and-noise mini-loops.
  • An uncertainty-aware version only refines parts of the video the model is unsure about, avoiding over-bright, over-sharpened artifacts.
  • Across strong base models like Wan2.1/2.2 and Cosmos-2.5, motion looks more coherent and physics looks more believable.
  • Human viewers prefer the refined videos over default sampling and guidance baselines by more than 70% in motion tests.
  • In robotics-style videos, grasping and task-following improve notably, even beating rejection sampling with a verifier at lower cost.
  • Physics benchmarks (like free-fall and physical commonsense) also improve, and spatial consistency under big camera moves gets better.
  • Computation grows only moderately (about 1.5× NFEs), far less than running many rejections or doubling steps everywhere.
  • The method is plug-and-play at inference and works for both flow-matching and diffusion-style video generators.

Why This Research Matters

Videos that move like the real world are crucial for robots, safety training, sports, and science learning. This method improves motion and physics without retraining or depending on outside judges, making it practical to deploy on today’s models. It focuses compute where the model is unsure, so pictures don’t get overcooked while motion gets smoother. Human viewers prefer the refined videos, and robots in tests handle objects more correctly, showing real impact beyond pretty visuals. Because it is plug-and-play, many video systems can adopt it quickly. Better physics in generated video can also help downstream AI learn more trustworthy world models.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) You know how when you draw a flipbook, each page needs to line up so the motion looks smooth and real? If one page is off, the flipbook suddenly looks weird.

🥬 Filling (The Actual Concept)

  • What it is: Modern AI video generators can make amazing videos, but they still trip over tricky physics, like hands gripping objects or balls flying in arcs.
  • How it works (the situation before this paper): Most video generators learn in a big training phase, then at test time they sample frames from noisy seeds into finished videos. But small mistakes—like a hand passing through a cup—can sneak in, especially during the first steps that decide motion.
  • Why it matters: If the AI shows wobbly people, floating objects, or jumpy motion, robots might pick wrong actions, and viewers lose trust.

🍞 Bottom Bread (Anchor) Imagine a robot watching a video to learn how to place a bowl on a cloth. If the bowl slides without touching the hand, the robot might copy that mistake in real life.

🍞 Top Bread (Hook) Imagine taking a math test where a helper peeks at answers and rejects wrong ones until you finally pass. It can work, but it’s slow and wasteful.

🥬 Filling (The Problem)

  • What it is: People have tried two main fixes: external verifiers that “grade” each video and reject bad ones, or extra training on more curated or synthetic data.
  • How it works: External verifiers repeatedly reject samples until one looks physically okay (rejection sampling). Extra training fine-tunes the model with more physics-friendly data.
  • Why it matters: These approaches are expensive, slow, and often still miss fine, frame-to-frame motion details.

🍞 Bottom Bread (Anchor) It’s like baking 10 cakes and throwing away 9 just to find one good slice. That’s costly and still might not fix a tiny crack in the frosting.

🍞 Top Bread (Hook) Think of a student who has already studied a lot. During the test, they can re-check their own work and fix small slips, without calling the teacher.

🥬 Filling (The Gap)

  • What it is: Video generators don’t usually have a built-in “self-check” loop at inference that lets them fine-tune their own output on the fly.
  • How it works: Unlike language models that can read their own text and revise it, video models lacked a simple, reliable way to feed back their own signals to fix motion and physics while sampling.
  • Why it matters: We need a way to improve realism using the model’s own knowledge, quickly, and without retraining.

🍞 Bottom Bread (Anchor) Imagine doing a tiny erase-and-redraw on a sketch when a line looks off—right in the moment, not weeks later in art class.

🍞 Top Bread (Hook) You know how the first strokes of a painting set the whole scene? Early choices matter most.

🥬 Filling (Failed Attempts and Limits)

  • What they tried: More steps in the sampler, stronger guidance tricks, or heavy post-training. These help a bit but either cost a lot or overcook the image (too bright/sharp) and still miss delicate temporal details.
  • Why it didn’t fully work: Many fixes don’t target the root motion decisions made early in sampling, or they push the whole frame globally instead of just the uncertain parts.
  • What was missing: A lightweight, model-internal, stepwise refinement loop that can nudge motion toward realism exactly when it matters.

🍞 Bottom Bread (Anchor) Think of balancing on a bike: tiny nudges early keep you straight. Big corrections later can wobble you more.

🍞 Top Bread (Hook) Imagine a movie where characters teleport slightly between cuts—very distracting.

🥬 Filling (Real Stakes)

  • What it is: Realistic motion and physics are crucial for robots, safety videos, sports analysis, science education, and any world-modeling application.
  • How it works: If videos respect gravity, contact, and consistency, downstream decisions (like a robot’s grasp) become far more reliable.
  • Why it matters: Better physical alignment builds trust and unlocks practical uses in homes, factories, and research labs.

🍞 Bottom Bread (Anchor) A robot that watches a self-refined video of a proper grasp is more likely to pick up your mug without dropping it.

02Core Idea

🍞 Top Bread (Hook) Imagine cleaning a room: you tidy up, then take a small step back to see new clutter you missed, then tidy again—little loops that make things better each time.

🥬 Filling (The “Aha!” Moment)

  • One sentence: Let the video generator act like its own cleaner by repeatedly predicting a cleaner video latent and then gently re-adding noise at the same level so it can try again—focused self-refinement without retraining.

Multiple Analogies

  1. Art eraser: Lightly erase a messy line, redraw cleaner, smudge a bit to blend, and redraw cleaner again.
  2. Bowling with bumpers: Roll, see drift, gently nudge with bumpers (small noise), roll again, ending closer to the center.
  3. Puzzle snap-fit: Press pieces together (predict), wiggle slightly (perturb), press again—fit improves without getting new pieces.

Before vs After

  • Before: One straight pass from noise to video; early motion errors can lock in; fixing them needs extra models or big compute.
  • After: Each step has a mini-loop: predict clean → add tiny noise → predict again. Motion gently migrates toward smoother, more physically plausible trajectories.

Why It Works (Intuition, no equations)

  • The authors re-interpret flow matching as a time-aware denoising autoencoder. Denoisers are trained to map noisy inputs toward realistic data. So, if at a given noise level you keep doing a clean-then-light-re-noise cycle, you gradually pull the latent toward the high-density “real-looking” region.
  • Early steps get larger local exploration, helping avoid bad early lock-ins for motion.
  • Selectively refining only uncertain regions prevents over-saturating stable backgrounds.

Building Blocks (Sandwich explanations)

🍞 Top Bread (Hook) You know how a mirror can help you fix your hair by showing what’s off?

🥬 Self-Refining Video Sampling

  • What it is: A way for a pre-trained video generator to improve its own samples during inference using only its internal predictions.
  • How it works: At each step, the model briefly re-cleans and re-tries parts of the video latent, nudging it toward realism; no new training or outside judge needed.
  • Why it matters: Without it, small mistakes can stick around and spoil motion and physics.

🍞 Bottom Bread (Anchor) A gymnast’s arm looks doubled; a few self-refine loops align it into one smooth arm swing.

🍞 Top Bread (Hook) Imagine guessing a riddle, then being allowed a tiny hint and a re-guess.

🥬 Predict-and-Perturb (P&P)

  • What it is: A two-step mini-loop: Predict a cleaner latent, then Perturb it with small noise at the same level, and repeat a couple times.
  • How it works: Predict pulls the current guess toward realism; Perturb lets the model re-sample nearby options so it doesn’t get stuck (a small code sketch follows below).
  • Why it matters: Without Perturb, the model might overcommit too early; without Predict, noise won’t get organized into realism.

🍞 Bottom Bread (Anchor) For a fastball video, P&P steers the ball’s path into a physically believable arc instead of a jittery zigzag.
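
To make the mini-loop concrete, here is a minimal sketch of one P&P cycle at a fixed noise level. It assumes a rectified-flow style model whose `velocity(x_t, sigma)` call returns the learned velocity field; the function name, the noise convention, and the loop count are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def predict_and_perturb(x_t, sigma, velocity, num_loops=2):
    """One P&P refinement cycle at a fixed noise level sigma (illustrative sketch).

    Assumes the rectified-flow convention x_t = (1 - sigma) * x0 + sigma * eps,
    where the model predicts the velocity v = eps - x0, so x0_hat = x_t - sigma * v.
    """
    for _ in range(num_loops):
        # Predict: estimate the clean latent from the current noisy latent.
        v = velocity(x_t, sigma)
        x0_hat = x_t - sigma * v
        # Perturb: re-noise the cleaned estimate back to the *same* noise level,
        # so the next Predict pass can pick a nearby, hopefully better, option.
        x_t = (1.0 - sigma) * x0_hat + sigma * torch.randn_like(x_t)
    return x_t
```

In the full sampler this cycle runs only at early timesteps, and only on the regions flagged by the uncertainty gate described next.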

🍞 Top Bread (Hook) Think of using a gentle sponge to clean only the smudged spots on a window.

🥬 Uncertainty-aware sampling

  • What it is: A way to refine only the parts the model is unsure about, while keeping confident areas untouched.
  • How it works: The model compares its own consecutive predictions; where they disagree more, it’s less certain. It applies P&P only there (sketched in code below).
  • Why it matters: Without this, repeating guidance can over-brighten or oversharpen quiet backgrounds.

🍞 Bottom Bread (Anchor) Refine the moving hands and the thrown ball; leave the calm blue sky alone.
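
A rough illustration of that gating idea is below: it builds a binary mask from how much two consecutive clean-latent predictions disagree. The quantile-based threshold and the 0.7 default are assumptions chosen for the example, not the paper's exact rule.

```python
import torch

def uncertainty_mask(x0_prev, x0_curr, keep_fraction=0.7):
    """Mark latent positions where consecutive predictions disagree most (a sketch)."""
    disagreement = (x0_curr - x0_prev).abs()                      # large = model is unsure
    threshold = torch.quantile(disagreement.flatten(), keep_fraction)
    return (disagreement > threshold).float()                     # 1 = refine, 0 = keep

def masked_blend(x_refined, x_original, mask):
    """Apply refinement only where the mask says the model was uncertain."""
    return mask * x_refined + (1.0 - mask) * x_original

# Example: refine roughly the 30% most-uncertain positions of a toy latent, keep the rest.
a, b = torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8)
mask = uncertainty_mask(a, b)
print(mask.mean())   # roughly 0.3
```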

🍞 Top Bread (Hook) Imagine a vacuum that knows where the dust is and doesn’t vacuum clean spots twice.

🥬 Flow Matching (as DAE)

  • What it is: A training approach where the model learns a vector field to move noisy latents toward clean data; you can see this as learning to denoise at each time.
  • How it works: At any fixed noise level, a denoise pass estimates a cleaner latent. Doing tiny corrupt-and-clean cycles keeps creeping toward real-looking video space (the toy example below makes this visible).
  • Why it matters: This link justifies running inner “denoise + slight re-noise” loops at inference.

🍞 Bottom Bread (Anchor) On a toy 2D dataset, samples spiral closer to the true data curve after a few P&P rounds.
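
That toy 2D picture can be reproduced in a few lines. Here the data manifold is the unit circle, and a closed-form posterior-mean denoiser over that dataset stands in for the trained network; everything in this snippet is an illustrative stand-in, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=512)
data = np.stack([np.cos(angles), np.sin(angles)], axis=1)        # points on the unit circle

def ideal_denoiser(x_t, sigma):
    """E[x0 | x_t] for the empirical dataset, assuming x_t = (1 - sigma) * x0 + sigma * eps."""
    diff = x_t[None, :] - (1.0 - sigma) * data
    logw = -np.sum(diff ** 2, axis=1) / (2.0 * sigma ** 2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return w @ data                                              # softmax-weighted average

sigma, x = 0.3, np.zeros(2)                                      # start at the least informative point
for k in range(5):
    x0_hat = ideal_denoiser(x, sigma)                            # Predict
    print(f"round {k}: |x0_hat| = {np.linalg.norm(x0_hat):.3f}") # climbs toward 1.0 (the circle)
    x = (1.0 - sigma) * x0_hat + sigma * rng.normal(size=2)      # Perturb at the same level
```

Each round, the clean estimate moves closer to the circle, which is exactly the "pull toward the data manifold" intuition above.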

🍞 Top Bread (Hook) You know how a coach can yell louder to make a player follow a play more strictly?

🥬 Classifier-Free Guidance (CFG)

  • What it is: A common trick to push the model to follow the text prompt more strongly, without a separate classifier.
  • How it works: It scales the difference between conditioned and unconditioned predictions (see the one-line formula in the sketch below).
  • Why it matters: Too much CFG, repeated many times, can over-saturate colors or simplify textures—hence gating by uncertainty helps.

🍞 Bottom Bread (Anchor) If you keep turning up the volume on instructions for a static background, it can look too bold; better to only boost where motion is uncertain.
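
For reference, here is the standard CFG combination in code form; the model signature is a generic placeholder (real samplers usually batch the two passes), and the scale value is just an example.

```python
import torch

def cfg_velocity(model, x_t, t, text_emb, null_emb, guidance_scale=5.0):
    """Classifier-free guidance for a velocity-predicting model (generic sketch)."""
    v_cond = model(x_t, t, text_emb)     # prediction that follows the prompt
    v_uncond = model(x_t, t, null_emb)   # prediction with the prompt dropped
    # Amplify the direction that the prompt adds on top of the unconditional guess.
    return v_uncond + guidance_scale * (v_cond - v_uncond)

# Stand-in model that just echoes its conditioning, to show the arithmetic:
dummy = lambda x, t, c: c
print(cfg_velocity(dummy, torch.zeros(3), 0.5, torch.ones(3), torch.zeros(3)))
# tensor([5., 5., 5.]) -> the conditional signal is boosted by the scale
```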

03Methodology

High-Level Overview: Input → Base ODE Step → Early-Step Self-Refine (P&P mini-loops) → Uncertainty Gate (refine only unsure regions) → Continue to Next Step → Output Video. A compact code sketch of this loop appears after the step-by-step recipe below.

🍞 Top Bread (Hook) Imagine making a smoothie: you blend, pause to taste, add a tiny splash of juice if it’s too thick, blend again, and only fix the parts that need fixing.

🥬 Filling (Step-by-Step Recipe)

  1. Inputs
  • Start from the usual setup: a pre-trained video generator (flow matching in latent space). Begin with noisy latents and a text/image prompt.
  • Why this exists: Without the regular pipeline, there’s nothing to refine.
  • Example: Prompt: “A gymnast performs a back handspring.” Start at high noise like normal.
  2. Base ODE Step
  • What happens: Take one standard solver step (e.g., UniPC/Euler-like) that moves the latent a bit toward a denoised, video-like latent.
  • Why it exists: This is the usual progress-maker from noise to video. If missing, you wouldn’t move toward a final sample.
  • Example: After this step, the gymnast’s body layout appears roughly in place, but arms or legs may look off.
  3. Predict-and-Perturb (P&P) at the Same Noise Level (early steps only)
  • What happens: At early timesteps (where motion is set), run K small inner loops: a) Predict: Use the model’s denoiser to estimate a cleaner latent (its guess of a more realistic video latent right now). b) Perturb: Add a touch of noise to the cleaned latent to re-sample nearby, then feed it back for another predict.
  • Why it exists: Early decisions define motion. Tiny local re-sampling avoids locking into bad motion paths and nudges the sample toward plausible movement.
  • Example with data feel: On a 40-step schedule, you might apply P&P only for steps 3–14, with K=2–3 loops per step. After a few loops, the gymnast’s double arm becomes one arm following a believable arc.
  4. Uncertainty-Aware Masking (during P&P)
  • What happens: For each P&P loop, build an uncertainty map by comparing consecutive predictions: where they differ more, the model is less sure. Create a binary mask that marks those spots as “refine” and others as “keep.”
  • Why it exists: Repeating guidance at full blast everywhere can over-saturate static regions. Masking focuses compute and preserves stable backgrounds.
  • Example: The moving hands and torso (high uncertainty) get refined; the still mat and sky (low uncertainty) remain stable.
  5. Masked Update to the Next Step
  • What happens: Use the mask to combine the refined and previous latents for the next ODE update, so only uncertain areas get changed.
  • Why it exists: Without masking, the background might get over-processed, causing color shifts or too much sharpness.
  • Example: You keep the gym floor texture unchanged, while the gymnast’s limb motion becomes smoother.
  6. Continue the Sampling
  • What happens: Move on to the next global timestep. Outside the early motion window, run fewer or no P&P loops to save compute.
  • Why it exists: Late steps mainly add details; most motion is already set. Skipping P&P here saves time with little quality loss.
  • Example: The gymnast’s final landing looks stable; late steps just refine small details like fingers and shadows.
  7. Output
  • What happens: Decode latents into the final video.
  • Why it exists: This is the visible result that people watch or robots analyze.
  • Example: The final back handspring has smooth arcs, clean contacts, and no doubled arms.
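
Putting the recipe above together, here is a compact sketch of the whole sampler. It uses a plain Euler update instead of UniPC, reuses the conventions from the earlier sketches, and every default (window, loop count, threshold) is an illustrative assumption rather than the paper's tuned setting.

```python
import torch

def self_refining_sample(velocity, x, sigmas, refine_window=(3, 14),
                         num_loops=2, keep_fraction=0.7):
    """Euler sampling with early-step Predict-and-Perturb refinement (a sketch).

    velocity(x, sigma) : pre-trained flow-matching denoiser (velocity prediction)
    sigmas             : decreasing noise levels, e.g. torch.linspace(1.0, 0.0, 41)
    """
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]

        # Early-step self-refinement: P&P mini-loops gated by an uncertainty mask.
        if refine_window[0] <= i <= refine_window[1]:
            x0_prev = x - sigma * velocity(x, sigma)                          # first clean estimate
            for _ in range(num_loops):
                x_cand = (1 - sigma) * x0_prev + sigma * torch.randn_like(x)  # Perturb
                x0_curr = x_cand - sigma * velocity(x_cand, sigma)            # Predict
                disagree = (x0_curr - x0_prev).abs()                          # where is the model unsure?
                thresh = torch.quantile(disagree.flatten(), keep_fraction)
                mask = (disagree > thresh).float()
                x = mask * x_cand + (1 - mask) * x                            # refine only unsure regions
                x0_prev = x0_curr

        # Base ODE (Euler) step toward the next, lower noise level.
        x = x + (sigma_next - sigma) * velocity(x, sigma)
    return x

# Smoke test with a dummy model that predicts zero velocity:
out = self_refining_sample(lambda x, s: torch.zeros_like(x),
                           torch.randn(1, 4, 8, 8), torch.linspace(1.0, 0.0, 41))
print(out.shape)   # torch.Size([1, 4, 8, 8])
```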

Per-Step Breakdown (why each matters and what breaks without it)

  • Base ODE step: Core progress engine. Without it, you’d never converge.
  • Predict (denoise) pass: Pulls you toward realism. Without it, perturbation alone is just random noise.
  • Perturb (tiny noise): Lets you try near variations. Without it, you might get stuck in an early mistake.
  • Uncertainty mask: Protects stable regions from over-refinement. Without it, repeated guidance can oversaturate or simplify textures.
  • Early focus: Motion set early. If you wait too late, cross-frame consistency makes changes harder.

Concrete Micro-Example

  • Suppose at t=0.1 (early), the ball’s path jitters. a) Predict: Clean estimate shows a smoother arc. b) Perturb: Add a little noise at the same level. c) Predict again: Even smoother arc, closer to physics. d) Mask: Only the ball and hand regions differ a lot → refine them; keep the bleachers. e) Move to next step.

The Secret Sauce

  • The DAE view of flow matching justifies the inner loop: at a fixed noise level, denoise pulls you toward the data manifold.
  • Doing small re-noise plus denoise cycles acts like local search: it explores nearby options without resetting everything.
  • The uncertainty mask gates where power is applied, preventing the guidance from “overcooking” static regions.
  • Efficiency: Only a couple of P&P loops (often 2–3) at early steps, adding about 1.5× NFEs, versus 2×–4× for doubling steps or many rejection attempts (the quick cost check below spells this out).

🍞 Bottom Bread (Anchor) Think of it like steering a sled at the top of a hill: make small, smart nudges early, mainly where the snow is bumpy, and you glide straight the rest of the way.
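
A quick back-of-the-envelope check of that overhead claim, using round numbers consistent with the figures quoted in this article (the exact split between refined steps and inner loops is an assumption):

```python
base_steps = 40                     # a standard sampling schedule
refined_steps = 10                  # early steps that get P&P (roughly the motion-setting window)
extra_passes = 2                    # additional model calls added per refined step

refined_nfe = base_steps + refined_steps * extra_passes   # 60
double_nfe = 2 * base_steps                               # 80  (just doubling the schedule)
best_of_4_nfe = 4 * base_steps                            # 160 (verifier-based rejection, 4 tries)
print(refined_nfe / base_steps, double_nfe / base_steps, best_of_4_nfe / base_steps)
# 1.5 2.0 4.0 -> self-refinement sits at roughly 1.5x the base cost
```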

04Experiments & Results

🍞 Top Bread (Hook) Imagine a science fair where you don’t just say “we did better,” you show lots of tests with judges and stopwatches.

🥬 Filling (The Test)

  • What they measured and why: They tested motion coherence (does movement stay smooth and believable?), physical plausibility (does it obey common-sense physics like gravity and contact?), spatial consistency (do scenes stay consistent under big camera moves?), and alignment with prompts (does the video match the text?). They used both human judgments and automated metrics.

The Competition (Baselines)

  • Default ODE solver (UniPC), same solver with twice the steps (NFE×2), CFG-Zero guidance, and FlowMo (a training-free motion guidance method). They also tried verifier-based rejection sampling using an external video critic.

Scoreboard with Context

  • Motion Coherence (challenging motions): Using Wan2.2 T2V, human evaluators preferred the new method in motion about 73% of the time over the default sampler and 70% over FlowMo, like getting an A when others get B’s. Automated VBench motion/consistency metrics also ticked up.
  • Robotics I2V (PAI-Bench-G): On grasp tasks and robot-QA, self-refinement outperformed baselines and even best-of-4 rejection sampling, while being cheaper. For example, grasp success improved by about +11.0% on Cosmos-Predict-2.5 and +8.4% on Wan2.2 I2V. Robot-QA accuracy also rose.
  • Physics Alignment in the Wild: On VideoPhy2 and PhyWorldBench, human raters preferred the refined videos for physical common sense, and automated scores improved too, especially on motion-centric tests. On PisaBench (free-fall), refined trajectories matched real physics better (lower L2/CD, higher IoU).
  • Spatial Consistency: Under large camera rotations, the refined videos kept the background and scene content more consistent. SSIM went up, L1 went down, and PSNR rose—like remembering the room’s layout after spinning around.

Costs and Efficiency

  • The method typically adds about 1.5× NFEs (e.g., 60 vs 40 steps), which is less than doubling everywhere or running many rejections, and far less than heavy guidance methods that need gradients. It is a simple plug-in sampler change.

Surprising Findings

  • Mode-Seeking but Helpful: Running many P&P loops on images can reduce diversity; in videos, the same looping instead reduces temporal jitter and flicker. In other words, it seeks stable motion modes, which is good for physics realism.
  • Local vs Global Reasoning: Self-refinement helped tasks like graph traversal (big jump in success), which benefit from smoother, consistent progression. But it didn’t fix maze solving, which needs global, discrete decisions—pointing to limits of local refinement.
  • Early Steps Matter Most: Applying P&P only in early timesteps gave the best motion improvements with minimal overhead; late-only refinements did less.

🍞 Bottom Bread (Anchor) Think of a judge watching two videos of a gymnast: the refined one gets more thumbs-ups for smooth, believable swings and clean landings, and it took only a few extra small nudges early on to get there.

05Discussion & Limitations

🍞 Top Bread (Hook) Imagine tuning a guitar: small turns can make music sweet, but too much twisting can snap a string.

🥬 Filling (Honest Assessment)

  • Limitations

    1. Over-Refinement Risk: Too many inner loops or too-strong guidance can over-saturate colors or simplify textures, especially in static areas.
    2. Local Search Nature: P&P is a gentle local nudge, great for smoothing motion, less suited for global, discrete fixes (like solving a maze with walls).
    3. Parameter Balance: Choosing how many loops, early-step window, and uncertainty threshold needs care; overly cautious settings weaken gains.
    4. Reliance on Base Model Knowledge: If the generator lacks the right physics prior, self-refinement can’t invent it from scratch.
  • Required Resources
    • One GPU (e.g., a single H100 in their tests) and only moderate extra NFEs (~1.5×). No extra training, no external verifier, no gradients.

  • When NOT to Use
    • If you need maximal diversity across very different outcomes (many P&P loops may reduce variety).
    • If tasks require strict symbolic or discrete correctness (e.g., exact pathfinding), where a global planner or verifier is needed.
    • If your base model already over-saturates badly under guidance; in that case, lean on the uncertainty mask and smaller loop counts.

  • Open Questions
    • Can thresholds adapt over time automatically, refining more early and less later without hand-tuning?
    • Can we mix light global search or a tiny verifier with P&P for the best of both worlds on hard reasoning tasks?
    • What’s the optimal schedule of loops across timesteps for different motion types?
    • Can self-refinement signals teach the model during post-training, closing the loop between inference and learning?

🍞 Bottom Bread (Anchor) Like a careful editor, this method polishes sentences (motion) very well, but it won’t rewrite a whole chapter’s plot (global logic) without help.

06Conclusion & Future Work

🍞 Top Bread (Hook) Think of a chef who tastes as they cook—stir, taste, tiny fix, repeat—ending with a dish that just feels right.

🥬 Filling (Takeaway)

  • 3-Sentence Summary: This paper turns a video generator into its own refiner by adding tiny, training-free inner loops: predict a cleaner latent, gently re-noise, and try again at the same noise level. Seeing flow matching as a denoising autoencoder justifies these local refinement cycles. An uncertainty-aware gate focuses the fixes where the model is unsure, improving motion, physics, and consistency with modest extra compute.
  • Main Achievement: A simple, plug-and-play sampling method (P&P) that reliably boosts motion coherence and physical plausibility across strong base models, beating heavier baselines in both quality and efficiency.
  • Future Directions: Adaptive thresholds and loop schedules, hybridizing with lightweight verifiers for hard discrete tasks, and using these self-refinement signals to guide post-training. Extending the approach to broader generators while protecting diversity is another direction.
  • Why Remember This: It shows that, with the right inner loop, powerful video models can clean up their own outputs on the fly—no extra teachers needed—pushing videos closer to the real, physical world.

🍞 Bottom Bread (Anchor) Just a few early, smart nudges turn a wobbly back handspring into a smooth, believable one—right when it counts.

Practical Applications

  • Improve robot training videos for grasping, placing, and tool use, boosting real-world success.
  • Generate safer simulation clips for factory workflows that respect contact and gravity.
  • Create sports analysis videos with smoother motion for coaching or play visualization.
  • Produce educational science clips that demonstrate correct physics (like free-fall and waves).
  • Enhance storytelling videos with consistent scenes during big camera moves.
  • Refine product demos so object interactions (pouring, cutting, stacking) look realistic.
  • Strengthen data for world-model learning by reducing motion jitter and artifacts.
  • Upgrade AR/VR previsualizations with stable motion and fewer flickers.
  • Assist video-based planning systems with more reliable action outcomes.
  • Polish marketing or explainer videos where tiny motion corrections make a big quality difference.
#video generation#flow matching#denoising autoencoder#self-refinement#predict-and-perturb#uncertainty-aware sampling#classifier-free guidance#temporal coherence#physical plausibility#motion coherence#world models#sampling#plug-and-play inference#video diffusion#spatial consistency