
FourierSampler: Unlocking Non-Autoregressive Potential in Diffusion Language Models via Frequency-Guided Generation

Intermediate
Siyang He, Qiqi Wang, Xiaoran Liu et al. Ā· 1/30/2026
arXiv Ā· PDF

Key Summary

  • Diffusion language models (dLLMs) can write text in any order, but common decoding methods still prefer left-to-right, which wastes their superpower.
  • This paper studies dLLMs in the frequency domain and finds a simple rule: low-frequency signals carry structure (the big plan), and high-frequency signals carry details (the small pieces).
  • Using this insight, the authors design FourierSampler, a decoding recipe that first locks in structure and then fills in details, like sketching before shading.
  • FourierSampler slides a window over frequencies during generation, scoring tokens with a Translated Fourier Score and mixing it with the model’s own confidence via an Adaptive Fourier Calibrator.
  • On math and code tasks, FourierSampler beats other decoding tricks (PC-Sampler, RWS) and even surpasses similar-size autoregressive models in several cases.
  • Gains are large and consistent: up to 20.4% improvement on MBPP for LLaDA1.5-8B, and up to 45.1% on Countdown for SDAR-1.7B-Chat.
  • Larger decoding blocks make the frequency guidance even stronger, because the model can see a longer ā€œsongā€ to analyze.
  • Parts-of-speech analysis matches the theory: conjunctions and prepositions (structure) are low-frequency; nouns (specific items) are high-frequency.
  • The approach is training-free, plugs into inference, and offers a principled, internal signal for better non-autoregressive planning.
  • It opens a new lens—frequency—to guide text generation schedules without extra reward models or rules.

Why This Research Matters

Better planning during generation means fewer logic slips in math, cleaner control flow in code, and stronger fill-in-the-middle edits for everyday writing. Because the method is training-free and internal, teams can upgrade existing dLLMs without costly retraining or extra reward models. For students, this can mean clearer step-by-step explanations and fewer off-track answers. For developers, it improves code assistants that first lay out correct scaffolding before inserting exact variables and constants. For writers and editors, it supports structured drafting—headlines and outlines first, then precise wording—leading to more coherent documents. In broader AI systems, principled scheduling reduces error cascades and may lower hallucinations by delaying fragile details until the structure is set.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine building a LEGO castle. First you place the big base plates (the structure), then you add tiny flags and windows (the details). If you start with flags, your castle might wobble.

🄬 The Concept: Diffusion Language Models (dLLMs)

  • What it is: dLLMs are language models that generate text by gradually cleaning up noisy guesses, and they’re free to fill in words in any order, not just left-to-right.
  • How it works: (1) Start with many masked or noisy tokens. (2) Repeatedly unmask or denoise positions. (3) Use context from both sides to refine the text. (4) End with a clean, coherent answer.
  • Why it matters: Without this freedom, the model can get stuck following only the past words, missing helpful clues that appear later. šŸž Anchor: When solving a math word problem, a dLLM can consider both the question and the ending steps while filling in the middle, like having the whole puzzle box picture while placing pieces.
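
To make the unmask-and-refine loop concrete, here is a minimal toy sketch in Python. The fake_model function, the tiny vocabulary, and the choice of revealing three tokens per step are illustrative stand-ins, not the paper's implementation; the only point is that positions are revealed by confidence, with no left-to-right constraint.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK = "<mask>"
vocab = ["def", "fib", "(", "n", ")", ":", "return", "if", "else", "0", "1"]

def fake_model(tokens):
    """Stand-in for a dLLM forward pass: returns a guessed token and a
    confidence for every position (random here, purely illustrative)."""
    guesses = rng.choice(vocab, size=len(tokens))
    confidence = rng.random(len(tokens))
    return guesses, confidence

tokens = [MASK] * 12            # start fully masked
k = 3                           # tokens revealed per step
while MASK in tokens:
    guesses, conf = fake_model(tokens)
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    # Reveal the k masked positions the model is most confident about --
    # note there is no left-to-right constraint.
    for i in sorted(masked, key=lambda i: conf[i], reverse=True)[:k]:
        tokens[i] = guesses[i]
    print(" ".join(tokens))
```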

šŸž Hook: You know how you don’t always do chores in order? You might first sort laundry piles (big plan), then deal with socks later (details).

🄬 The Concept: Non-Autoregressive Generation

  • What it is: A way to write text where you don’t have to add words strictly from left to right.
  • How it works: (1) Look at all blank and filled spots. (2) Choose which spots to fill based on usefulness. (3) Fill several places in parallel. (4) Repeat until done.
  • Why it matters: If you force a single order, you can miss better choices that need future context. šŸž Anchor: Filling in a crossword: you bounce around the grid, using clues from all directions to complete the answers.

šŸž Hook: Think about humming a tune. The steady beat holds the song together, while the twinkly notes add sparkle.

🄬 The Concept: Positional Bias in dLLMs

  • What it is: Even though dLLMs can write in any order, many decoders still prefer certain positions (often early ones), like a habit that won’t go away.
  • How it works: (1) Use token confidence to pick positions. (2) Early positions get favored. (3) The model behaves almost like left-to-right. (4) Global planning power is underused.
  • Why it matters: If the model can’t truly choose freely, we lose the benefit of better planning and fill-in abilities. šŸž Anchor: It’s like having permission to color anywhere on the page but always starting at the top-left corner because that’s the habit.

šŸž Hook: When you listen to a song, you can feel the bass (low rumble) and the sparkle (high notes). Both matter, but they do different jobs.

🄬 The Concept: Frequency-Domain Analysis

  • What it is: A way to look at signals (like hidden states in a model) by separating slow-changing parts (low frequency) from fast-changing parts (high frequency).
  • How it works: (1) Take the sequence of hidden states. (2) Apply a Fourier transform to see its frequencies. (3) Inspect energy in low vs. high bands. (4) Relate bands to meaning in text.
  • Why it matters: Without this view, you only see the words in order, not the hidden rhythm that carries structure vs. details. šŸž Anchor: Like using an equalizer to understand which parts of a song are bass vs. treble, we can see which parts of text signals are global vs. local.
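
A small NumPy sketch of this kind of analysis: take a (positions Ɨ features) matrix, apply a real FFT along the sequence axis, and compare the energy in the low and high bands. The matrix H here is random noise used purely for illustration, and the split point is an arbitrary choice; in the paper, the input would be the dLLM's last-layer hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 64, 32                       # sequence length x hidden size (toy values)
H = rng.standard_normal((T, D))     # stand-in for last-layer hidden states

# Real FFT along the sequence (position) axis: shape (T//2 + 1, D)
spectrum = np.fft.rfft(H, axis=0)
energy = np.abs(spectrum) ** 2      # energy per frequency bin and feature

n_bins = spectrum.shape[0]
cut = n_bins // 4                   # arbitrary split point for illustration
low_energy = energy[:cut].sum()     # slow-changing, "structural" part
high_energy = energy[cut:].sum()    # fast-changing, "detail" part
print(f"low-band share:  {low_energy / energy.sum():.2f}")
print(f"high-band share: {high_energy / energy.sum():.2f}")
```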

šŸž Hook: First draw the outline of a cat, then add whiskers. Starting with whiskers makes the drawing hard to place.

🄬 The Concept: Low- vs. High-Frequency Components in Text

  • What it is: Low frequency = global structure and long-range links; High frequency = local detail and sharp changes.
  • How it works: (1) Compute each token’s low/high-frequency energy. (2) Notice keywords like if/return are low-frequency. (3) Notice variable names and numbers are high-frequency. (4) Plan generation from structure to detail.
  • Why it matters: If you add details before the plan, you often need to erase and redo, causing errors. šŸž Anchor: In code, words like if and return (structure) come first; names like gcd or 47 (details) come later.
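
Building on the bullets above, the per-token split can be sketched like this: filter the spectrum to one band, transform back to token space, and measure each position's remaining energy. With the random stand-in data below the ranking is meaningless; with real hidden states, the paper reports structural tokens ranking high under the low band and names/numbers under the high band.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 64, 32
H = rng.standard_normal((T, D))          # stand-in hidden states (positions x features)

spec = np.fft.rfft(H, axis=0)
cut = spec.shape[0] // 4                 # arbitrary low/high boundary

low_spec, high_spec = spec.copy(), spec.copy()
low_spec[cut:] = 0                       # keep only slow-changing components
high_spec[:cut] = 0                      # keep only fast-changing components

H_low = np.fft.irfft(low_spec, n=T, axis=0)
H_high = np.fft.irfft(high_spec, n=T, axis=0)

# Per-token energy under each band: which positions look "structural" vs "detail"
low_energy_per_tok = (H_low ** 2).sum(axis=1)
high_energy_per_tok = (H_high ** 2).sum(axis=1)
print("most low-band-heavy positions: ", np.argsort(-low_energy_per_tok)[:5])
print("most high-band-heavy positions:", np.argsort(-high_energy_per_tok)[:5])
```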

The world before: Autoregressive models wrote left-to-right and were strong but struggled with using future context. dLLMs promised true arbitrary-order generation to plan better, but common decoders still acted positional—like wearing running shoes but walking.

The problem: How can we schedule which tokens to write when, using a principled, internal signal, not hand-made rules or extra reward models?

Failed attempts: Confidence-only unmasking, rule-based position boosts (PC-Sampler), and reward-guided sampling (RWS) help but either re-impose order, require extra models, or depend on dataset-specific tuning.

The gap: No one was using the model’s own frequency signature to decide a smart order—structure first, details next.

The stakes: Better math chains, fewer coding logic slips, stronger fill-in-the-middle editing, and more reliable long-context planning connect directly to homework help, coding assistants, and document drafting.

02 Core Idea

šŸž Hook: When building a sandcastle, you press the big bucket mold first (shape), then carve windows and draw seashell lines (details).

🄬 The Concept: The Aha! Moment

  • What it is: The key insight is that a dLLM’s hidden states split into low-frequency parts (structure) and high-frequency parts (details), so we should guide decoding to go from structure to detail.
  • How it works: (1) Analyze frequencies of hidden states. (2) Start generation favoring low-frequency energy (global plan). (3) Slide attention toward high-frequency energy (local details) over time. (4) Blend this guidance with the model’s own confidence.
  • Why it matters: Without this guidance, dLLMs slip back into positional habits and miss their superpower of global planning. šŸž Anchor: Like sketching a stick-figure pose before drawing eyes and shoelaces, the model first writes connective, structural words, then fills names and numbers.

Three analogies for the same idea:

  • Drawing analogy: Outline first (low frequency), shading later (high frequency).
  • Music analogy: Bassline first (keeps the song stable), then add riffs and trills (sparkle and detail).
  • Puzzle analogy: Build the edges and big chunks first, then place the tiny sky pieces.

Before vs. After:

  • Before: Decoders often followed token confidence that favored early positions, acting semi left-to-right and weakening non-autoregressive planning.
  • After: FourierSampler schedules decoding by frequency—first lock the backbone, then refine details—improving coherence and accuracy.

šŸž Hook: Imagine a flashlight whose beam slowly narrows and moves, lighting up the parts you need next.

🄬 The Concept: FourierSampler

  • What it is: A decoding strategy that slides a window over frequency bands during generation to prioritize structure first, then details.
  • How it works: (1) Take hidden states in a block. (2) Transform to frequency space. (3) Keep a band that starts low and slides higher each step. (4) Score tokens by energy in this band and mix with confidence. (5) Unmask top-scoring tokens.
  • Why it matters: Without a smart schedule, details can appear too soon and conflict with later logic. šŸž Anchor: In code, words like if/else rise early; variable names and exact numbers finalize later.
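
The paper's exact window schedule isn't reproduced here, but a simple linear slide from the lowest to the highest band captures the idea. The function name sliding_band, the band-width fraction rho, and the linear schedule are assumptions made for illustration only.

```python
import numpy as np

def sliding_band(step, total_steps, n_bins, rho=0.25):
    """Return the (start, end) frequency-bin indices of the active band.

    Assumption: the band covers a fixed fraction `rho` of the spectrum and
    its start moves linearly from the lowest to the highest frequencies as
    decoding steps advance (structure first, details later).
    """
    width = max(1, int(rho * n_bins))
    max_start = n_bins - width
    start = int(round(max_start * step / max(1, total_steps - 1)))
    return start, start + width

n_bins = 33                      # e.g. rfft bins for a block of 64 positions
for step in range(8):
    print(step, sliding_band(step, total_steps=8, n_bins=n_bins))
```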

šŸž Hook: You know how a teacher might say, ā€œFocus on the big idea, not the exact phrasing—yet.ā€

🄬 The Concept: Translated Fourier Score

  • What it is: A score that tells which token positions are most active in the currently highlighted frequency band.
  • How it works: (1) Filter hidden states by the sliding band. (2) Measure each token’s energy under this filter. (3) Normalize within the block. (4) Use it to rank which tokens to write now.
  • Why it matters: Without this score, the decoder can’t follow the structure-to-detail plan step by step. šŸž Anchor: It’s like giving extra points to the puzzle pieces that match the frame you’re working on right now.
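
Here is a sketch of the score as the bullets describe it: band-filter the block's hidden states, return to token space, and rank positions by their energy under the filter, normalized within the block. The min-max normalization and the helper name translated_fourier_score are assumptions; the paper's exact formula may differ.

```python
import numpy as np

def translated_fourier_score(H, band, eps=1e-8):
    """Per-position score for the current frequency band.

    H: (T, D) hidden states for the decoding block.
    band: (start, end) frequency-bin indices to keep.
    Returns a length-T score in [0, 1], normalized within the block.
    """
    T = H.shape[0]
    spec = np.fft.rfft(H, axis=0)
    keep = np.zeros(spec.shape[0], dtype=bool)
    keep[band[0]:band[1]] = True
    spec[~keep] = 0                              # keep only the active band
    H_band = np.fft.irfft(spec, n=T, axis=0)     # back to token space
    energy = (H_band ** 2).sum(axis=1)           # per-position band energy
    return (energy - energy.min()) / (energy.max() - energy.min() + eps)

rng = np.random.default_rng(0)
H = rng.standard_normal((64, 32))                # stand-in block hidden states
print(translated_fourier_score(H, band=(0, 8))[:8])
```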

šŸž Hook: Think of a smart volume knob that turns the music up when the room is noisy and down when it’s quiet.

🄬 The Concept: Adaptive Fourier Calibrator

  • What it is: A controller that adjusts how strongly the frequency score influences decoding, based on how confident the model already is.
  • How it works: (1) Look at the model’s confidence spread across positions. (2) If the model’s already sure, rely less on frequency guidance. (3) If it’s unsure, rely more. (4) Smoothly update this each step.
  • Why it matters: Without adaptation, guidance could overpower the model when it’s already right, or be too weak when it’s confused. šŸž Anchor: Like a coach who steps in with more advice when a player hesitates, and backs off when they’re on a hot streak.
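
The post only says the guidance weight should shrink when confidences are clearly spread out and grow when they are flat. One simple way to realize that, sketched below, maps the standard deviation of per-position confidences to a weight between beta_min and beta_max; this particular mapping is an assumption, not the paper's formula.

```python
import numpy as np

def adaptive_weight(confidences, beta_min=0.0, beta_max=1.0):
    """Map the spread of per-position confidences to a guidance weight.

    Flat confidences (model unsure which position to fill) -> weight near
    beta_max, so frequency guidance leads. Widely spread confidences ->
    weight near beta_min, so the model's own ranking dominates.
    A sketch only; the paper's calibrator may use a different statistic.
    """
    spread = np.std(confidences)
    # Normalize: the std of values in [0, 1] is at most 0.5.
    t = np.clip(spread / 0.5, 0.0, 1.0)
    return beta_max - t * (beta_max - beta_min)

print(adaptive_weight(np.full(16, 0.5)))             # flat -> strong guidance (1.0)
print(adaptive_weight(np.array([0.05, 0.95] * 8)))   # spread -> weak guidance (~0.1)
```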

Why it works (intuition, no equations):

  • Low-frequency signals are steady across positions, carrying the skeleton—great for early planning. High-frequency signals are spiky and precise—great for late polishing. By sliding from low to high, we reduce wrong early commitments and trim the search space for details.

Building blocks:

  • Frequency analysis of hidden states → Translated Fourier Score → Adaptive Fourier Calibrator → Structure-to-detail schedule → Better coherence and accuracy.

03 Methodology

At a high level: Input prompt → dLLM proposes hidden states for a block → FourierSampler scores positions (frequency band + model confidence) → Unmask top positions → Repeat sliding band from structure to detail → Output text.

šŸž Hook: Picture reading with a ruler that slides down the page so you focus exactly where you should.

🄬 The Concept: Decoding Blocks

  • What it is: The model works on chunks (blocks) of positions at a time, updating several tokens per step.
  • How it works: (1) Choose a block size B. (2) Within each block, take S steps. (3) Each step, pick which masked tokens to reveal. (4) Move to the next block.
  • Why it matters: Without blocks, the model either moves too slowly (one token) or loses structure (too many at once). Blocks balance context and control. šŸž Anchor: Like cleaning your room corner by corner instead of all at once.
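
A skeleton of block-wise decoding as just described, assuming a placeholder model callable that returns a guessed token and a ranking score for every position in the current block; the block size B, step count S, and mask token are illustrative values, not the paper's settings.

```python
def decode_blockwise(model, prompt, n_blocks, B=32, S=8):
    """Block-by-block decoding skeleton (illustrative, not the paper's code).

    `prompt` and the return value are token lists. `model(context, block)`
    is assumed to return (guesses, scores): a token guess and a ranking
    score for every position in `block`.
    """
    text = []
    for _ in range(n_blocks):
        block = ["<mask>"] * B
        per_step = max(1, B // S)                # tokens revealed per step
        for _ in range(S):
            guesses, scores = model(prompt + text, block)
            masked = [i for i, t in enumerate(block) if t == "<mask>"]
            if not masked:
                break
            # Reveal the highest-scoring masked positions in this block.
            for i in sorted(masked, key=lambda i: scores[i], reverse=True)[:per_step]:
                block[i] = guesses[i]
        text.extend(block)
    return text
```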

Step-by-step recipe (a code sketch pulling these steps together follows the list):

  1. Gather the hidden states
  • What happens: For the current block, collect the last-layer hidden states (a matrix: positions Ɨ features).
  • Why this step: These are the model’s ā€œthoughtsā€ we’ll analyze; without them, we can’t read the frequency signals.
  • Example: In a code task, this includes positions for if, return, fib, and numbers.
  2. Transform to frequency space
  • What happens: Apply a real-valued Fourier transform along the sequence dimension to split slow vs. fast-changing parts.
  • Why this step: Frequency space reveals structure (low) vs. detail (high); skipping it hides the signal we need.
  • Example: The spectrum shows strong low-frequency energy around if/return and spikes for variable names.
  3. Slide a frequency window
  • What happens: Keep a band of frequencies whose width is a fraction ρ of the full spectrum; its start position moves steadily from low to high as steps advance.
  • Why this step: This enforces a structure-to-detail schedule across steps; without it, the model might jump to details too soon.
  • Example: Early steps highlight the bass-like low band; later steps move the spotlight to treble-like high bands.
  4. Go back to token space (filtered)
  • What happens: Inverse-transform the filtered spectrum to get a ā€œband-focusedā€ hidden state for each position.
  • Why this step: We need per-token signals in regular space to score which positions to fill now.
  • Example: After low-band filtering, tokens like if/elif have strong energy; fib or 47 are weaker now.
  5. Compute the Translated Fourier Score
  • What happens: For each position, measure its energy under the filtered hidden state and normalize within the block to get a score.
  • Why this step: This is the priority list for the current band; without it, we can’t decide which tokens match the current plan.
  • Example: Early steps: conjunctions, control-flow words top the list. Late steps: numbers and variable names rise.
  6. Measure vanilla confidence
  • What happens: For each masked position, read the model’s top probability (its plain confidence).
  • Why this step: We respect the model’s own beliefs; ignoring them can override good instincts.
  • Example: If the model is certain about ā€œreturnā€ at a spot, we shouldn’t fight it.
  7. Adapt guidance strength (Adaptive Fourier Calibrator)
  • What happens: Compute how spread out those confidences are across positions. If spread is big (model knows priorities), weaken frequency guidance; if small (uncertain), strengthen it.
  • Why this step: Right-sized help avoids both bossiness and passivity.
  • Example: When many positions look equally uncertain, let frequency guidance lead more.
  8. Fuse the scores and unmask
  • What happens: Final score = model confidence + adaptive weight Ɨ Translated Fourier Score. Unmask the highest-scoring positions.
  • Why this step: This combines internal certainty with the structure-to-detail plan; skipping fusion loses balance.
  • Example: Early: unmask if/elif. Later: unmask fib, n, and exact digits.
  9. Repeat across steps and blocks
  • What happens: Advance the window, refresh scores, and keep decoding until the block (and then the whole sequence) is complete.
  • Why this step: The sliding window gently walks from big plan to fine polish.
  • Example: In a long solution, paragraphs and key connectors appear early, formulas and constants finalize later.
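
Pulling the recipe together, the sketch below implements one decoding step: compute the Translated Fourier Score under the current band, derive an adaptive weight from the spread of confidences, fuse the two scores, and pick the top masked positions to reveal. All model inputs are random stand-ins, and the helper functions mirror the earlier sketches, so treat this as an illustration of the schedule rather than the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def tfs(H, band, eps=1e-8):
    """Translated Fourier Score: per-position energy under a band filter."""
    T = H.shape[0]
    spec = np.fft.rfft(H, axis=0)
    keep = np.zeros(spec.shape[0], dtype=bool)
    keep[band[0]:band[1]] = True
    spec[~keep] = 0
    energy = (np.fft.irfft(spec, n=T, axis=0) ** 2).sum(axis=1)
    return (energy - energy.min()) / (energy.max() - energy.min() + eps)

def fourier_sampler_step(H, confidence, is_masked, step, total_steps,
                         rho=0.25, k=4):
    """Pick which masked positions to reveal at this step (illustrative)."""
    n_bins = H.shape[0] // 2 + 1
    width = max(1, int(rho * n_bins))
    start = int(round((n_bins - width) * step / max(1, total_steps - 1)))
    score_f = tfs(H, (start, start + width))              # steps 2-5
    # Step 7: flat confidences -> strong guidance, spread -> weak guidance.
    weight = 1.0 - np.clip(np.std(confidence[is_masked]) / 0.5, 0.0, 1.0)
    fused = confidence + weight * score_f                 # step 8
    fused[~is_masked] = -np.inf                           # only reveal masked slots
    return np.argsort(-fused)[:k]

# Toy inputs: a 64-position block with random "hidden states" and confidences.
H = rng.standard_normal((64, 32))
confidence = rng.random(64)
is_masked = np.ones(64, dtype=bool)
print("reveal positions:", fourier_sampler_step(H, confidence, is_masked,
                                                step=0, total_steps=8))
```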

šŸž Hook: Like adding salt only when a soup needs it.

🄬 The Concept: The Secret Sauce

  • What it is: A principled, training-free controller that turns internal frequency patterns into a decoding schedule.
  • How it works: (1) Read hidden-state frequencies. (2) Slide from low to high. (3) Score tokens per step. (4) Adapt how much to trust the frequency hint.
  • Why it matters: Without a principled signal, decoders rely on rules or extra reward models; this one is built-in and general. šŸž Anchor: It’s like using the song’s own beat to know when to bring in each instrument.

Concrete example with data:

  • Prompt: ā€œWrite a Python function to compute Fibonacci.ā€
  • Early steps (low band): def, if, return, else appear—laying out control flow.
  • Later steps (high band): fib, n, base-case numbers 0 and 1, and the exact addition come in—pinning down specifics.
  • Result: Clean structure first, correct details next, fewer backtracks.

04 Experiments & Results

šŸž Hook: Imagine a school relay race: we compare runners (decoders) fairly on the same track (datasets) and see who finishes best.

🄬 The Concept: The Test Setup

  • What it is: A head-to-head comparison of decoding strategies on math and code tasks using well-known benchmarks.
  • How it works: (1) Models: LLaDA1.5-8B, LLaDA-8B-Instruct, SDAR-4B-Chat, SDAR-1.7B-Chat. (2) Baselines: PC-Sampler (rules) and RWS (reward-guided). (3) Tasks: GSM8K, MATH, MBPP, HumanEval, Countdown. (4) Same evaluation framework and settings.
  • Why it matters: Without fair tests, we can’t trust who really improved. šŸž Anchor: Like testing bicycles on the same hill to compare gears, not on different roads.

The competition:

  • Baselines: Confidence-only decoding (original), PC-Sampler (position-aware rules), RWS (reward-weighted sampling). Also compared to similar-size autoregressive models (Llama, Qwen) for context.

The scoreboard (with context):

  • LLaDA1.5-8B + FourierSampler: Up to +20.4% relative gain on MBPP, +6.8% on MATH, +14.1% on Countdown; best average across tasks. In effect, moving from a solid B to an A-, beating PC-Sampler and RWS.
  • LLaDA-8B-Instruct + FourierSampler: Consistent gains across tasks; +16.0% on MBPP and +13.8% on Countdown. On average, the method raises performance to the top among non-training decoders.
  • SDAR-4B-Chat + FourierSampler: +13.0% on MBPP, +7.4% on HumanEval, +26.5% on Countdown; best average, surpassing RWS.
  • SDAR-1.7B-Chat + FourierSampler: +14.5% on HumanEval, a huge +45.1% on Countdown; strong average uplift.
  • Notably, FourierSampler helps LLaDA1.5-8B bridge and even surpass similarly sized autoregressive models (e.g., Llama3.1-8B-Instruct) on average—big news for non-autoregressive text generation.

Surprising and insightful findings:

  • Bigger decoding blocks help: With more tokens in view, the frequency analysis becomes clearer, so structure-first guidance shines brighter. At B=64 and above, gains grow, sometimes even reversing the accuracy drop that larger blocks usually cause.
  • Structure-to-detail is visible: Heatmaps show structural words (if, return, else) getting picked early, while variables and numbers lock in later—exactly the intended schedule.
  • Language parts match frequencies: Conjunctions and prepositions (skeleton words) live in low frequency; nouns (specific objects) live in high frequency. This aligns cleanly with the theory.

Bottom line with meaning:

  • The average improvements are not tiny tweaks; they’re like raising a team’s season average from mid-table to top tier, and doing so without extra training or reward models.

05 Discussion & Limitations

šŸž Hook: No tool is magic; even a great map doesn’t drive the car for you.

🄬 The Concept: Honest Assessment

  • What it is: Understanding where FourierSampler shines and where to be careful.
  • How it works: (1) Strengths: training-free, principled, general across dLLM types, large gains in math/code. (2) Limits: depends on clear frequency patterns, benefits grow with reasonable block sizes, adds some compute for FFTs. (3) Needs: access to last-layer hidden states, a stable block decoder, and a few hyperparameters (ρ, β range).
  • Why it matters: Knowing boundaries helps you choose the right jobs for the tool and avoid surprises. šŸž Anchor: You wouldn’t use a paint roller for tiny lettering; it’s great for walls, not signatures.

Limitations (be specific):

  • Domain coverage: Tested mainly on math and code; narrative or creative writing needs more study.
  • Short outputs: Very short sequences offer poor frequency resolution, so benefits shrink.
  • Block sensitivity: Too-small blocks fragment the signal; very large blocks may raise latency.
  • Language variety: Morphologically rich languages or tokenizations with different rhythms may shift frequency patterns.
  • Rigid formats: Tasks needing exact token-by-token order (e.g., strict streaming protocols) may not suit structure-first scheduling.

Required resources:

  • Compute: Extra Fourier transforms per step add overhead (still lightweight relative to model forward passes), and experiments used GPUs like H200.
  • Access: You need internals (hidden states) during decoding to compute scores.

When NOT to use:

  • Ultra-low-latency settings where any overhead is unacceptable.
  • Streaming left-to-right chat where strict autoregressive order is required.
  • Tiny prompts/answers where frequency bands can’t be reliably estimated.

Open questions:

  • Can the frequency window be learned per layer or per task instead of fixed sliding?
  • Can training-time objectives encourage clearer structure/detail separation?
  • How does this interact with positional encodings and attention heads?
  • Can we auto-tune block size and ρ online?
  • What about multilingual and multimodal dLLMs—do the same spectral rules hold?

06 Conclusion & Future Work

Three-sentence summary:

  • This paper discovers that dLLM hidden states carry structure in low frequencies and details in high frequencies.
  • Using this, FourierSampler guides decoding from structure to detail via a sliding frequency window and an adaptive calibrator that balances guidance with model confidence.
  • The result is strong, consistent gains on math and code tasks across multiple dLLMs, often outpacing rule-based and reward-guided decoders—and even matching or beating similar-size autoregressive models.

Main achievement:

  • A principled, training-free decoding schedule that unlocks non-autoregressive potential by leveraging the model’s own frequency signature, introduced alongside the Translated Fourier Score and Adaptive Fourier Calibrator.

Future directions:

  • Learnable or per-layer frequency schedules; integration with post-training to amplify spectral separation; automatic block/ρ selection; expansion to narrative writing, multilingual setups, and multimodal diffusion LMs.

Why remember this:

  • It reframes ā€œwhat to write nextā€ as ā€œwhich frequencies to trust now,ā€ turning hidden rhythms into a roadmap. That simple lens—structure first, details later—can make non-autoregressive generation clearer, stronger, and more reliable in practice.

Practical Applications

  • Plug FourierSampler into existing dLLMs to boost math reasoning without retraining.
  • Use structure-first decoding for code assistants to generate control flow before filling variable names.
  • Improve text infilling/editing by locking paragraph structure first, then polishing exact phrasing.
  • Enhance long-context tasks (summaries/outlines) by prioritizing connective tissue and section headers early.
  • Guide multi-step planning (e.g., tool-use plans) by drafting the plan skeleton first, then inserting parameters.
  • Stabilize data-to-text reports by anchoring global templates before inserting numbers and entities.
  • Reduce error cascades in program synthesis by deferring brittle literals until logic is fixed.
  • Support educational tutors that produce organized solution structures before numerical details.
  • Speed human-in-the-loop review by surfacing structural tokens first for early validation.
  • Adapt decoding strength on-the-fly for uncertain prompts using the Adaptive Fourier Calibrator.
Tags: diffusion language models, non-autoregressive generation, frequency-domain analysis, Fourier transform, decoding strategy, structure-to-detail generation, translated Fourier score, adaptive calibrator, positional bias, LLaDA, SDAR, math reasoning, code generation, token scheduling, inference-time techniques