
Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models

Intermediate
Ziwei Luo, Ziqi Jin, Lei Wang et al. Ā· 2/2/2026
arXiv Ā· PDF

Key Summary

  • The paper introduces a new way to sample text from masked diffusion language models that is smarter and less greedy.
  • Instead of trusting only the most confident token at each step, it runs several parallel guesses (called particles) and rewards the whole path that looks consistently good.
  • This reward is the model’s own trajectory-level confidence, so no extra training or external reward model is needed.
  • A classic algorithm called Sequential Monte Carlo (SMC) is used to repeatedly resample better particles and drop weaker ones.
  • On standard text quality tests, the method greatly lowers generative perplexity, which means the text looks more like real data.
  • On math and coding benchmarks (like GSM8K, MATH, HumanEval, MBPP), it improves accuracy across two diffusion LLMs (LLaDA-1.5 and Dream-7B).
  • It stays robust even when sampling temperature changes, avoiding repetition failures that some baselines suffer at low temperatures.
  • The method converts extra parallel compute at inference time directly into better answers, with strong gains at 2–4 particles.
  • It’s general, plug-and-play, and works with existing masked diffusion models and block diffusion decoders.
  • Bottom line: more thoughtful exploration at inference time leads to higher-quality, more reliable text without retraining.

Why This Research Matters

Better sampling at inference time helps models write clearer essays, solve math problems more accurately, and produce cleaner code, all without retraining. Because it’s self-rewarding, teams don’t need to design or tune special reward models for each task. The method is robust to different sampling temperatures, reducing failures like repetition loops. It keeps diversity high, so you get quality without turning all outputs into the same bland text. It scales naturally with available compute, turning parallelism into better results. This makes diffusion-based language models more competitive with autoregressive models, broadening practical choices for developers.

Detailed Explanation

01 Background & Problem Definition

šŸž Hook: Imagine trying to finish a crossword puzzle where many squares are blank. If you always fill in only the square you’re most confident about next, you might get stuck in a corner and miss better answers that fit the whole puzzle.

🄬 The Concept: Generative Modeling

  • What it is: Teaching a computer to create new things (like text) that look real.
  • How it works: It studies lots of examples, learns patterns, then produces new samples following those patterns.
  • Why it matters: Without generative modeling, computers can only copy or classify; they can’t create. šŸž Anchor: A music app learns jazz styles and then composes a new jazz song that sounds like the pros.

šŸž Hook: You know how a foggy window slowly clears, and you can see the picture behind it better and better?

🄬 The Concept: Diffusion Processes

  • What it is: A method that starts from noise (or masks) and gradually reveals a clean signal by reversing a noisy process.
  • How it works:
    1. Add noise/masks step by step during training so the model learns how things get hidden.
    2. At generation time, start from full noise/masks.
    3. Remove the noise/masks in steps, each time predicting missing parts.
  • Why it matters: If you remove noise without guidance, you get nonsense. Diffusion gives a careful recipe to clean things up. šŸž Anchor: Like un-scrambling a blurred photo, one gentle swipe at a time, to get the original image back.

šŸž Hook: Think of a sentence with many [MASK] blanks. You keep guessing the hidden words until the sentence makes sense.

🄬 The Concept: Masked Diffusion Language Models (MDLMs)

  • What it is: Language models that generate text by repeatedly unmasking tokens using a diffusion-like process.
  • How it works:
    1. Start with a sequence of [MASK] tokens.
    2. The model predicts likely words for some masked spots.
    3. Keep the confident ones; re-mask uncertain ones; repeat until done.
  • Why it matters: It allows parallel updates and can be efficient, but needs a smart way to decide what to keep or retry. šŸž Anchor: Like filling in a Mad Libs page: reveal a few good words, hide doubtful ones, repeat until the story reads well.

šŸž Hook: You know when choosing fruit, you grab the ones that look the freshest? That’s focusing on the best-looking option right now.

🄬 The Concept: Confidence-based Sampling (token-level)

  • What it is: A rule that keeps only the highest-probability token guesses at each step and remasks the rest.
  • How it works:
    1. For each masked position, compute probabilities for every word.
    2. Select the most confident guesses.
    3. Freeze those; keep working on the others.
  • Why it matters: It’s simple and fast, but greedy—so it can lock in early choices that later turn out to be wrong for the whole sentence. šŸž Anchor: You pick the shiniest apple first, but later discover it doesn’t fit your recipe’s flavor, and now you’re stuck.
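
A minimal sketch of this token-level rule, assuming the model returns a probability distribution over the vocabulary for each position; names like `confidence_unmask_step` and `threshold` are illustrative, not from the paper:

```python
import numpy as np

def confidence_unmask_step(probs, masked, threshold=0.9):
    """One greedy unmasking step (token-level confidence rule).

    probs:  (seq_len, vocab_size) per-position token probabilities.
    masked: boolean array, True where the position is still [MASK].
    Returns the best token per position and which positions to freeze now.
    """
    best_tokens = probs.argmax(axis=-1)        # most likely token at each position
    best_conf = probs.max(axis=-1)             # its probability (the "confidence")
    keep = masked & (best_conf >= threshold)   # freeze only the confident guesses
    return best_tokens, keep
```

Everything not kept stays masked and is retried in the next round, which is exactly where the greediness comes from: once a position is frozen, this baseline rule never revisits it.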

The World Before: Autoregressive (AR) models write text left-to-right, one token at a time, and they’ve been very strong. Diffusion for text (especially MDLMs and block-diffusion variants) emerged as an exciting alternative: generate multiple parts in parallel, refine, and finish. But most MDLMs used the greedy, token-level confidence rule: at each step, only keep the most confident tokens. This speeds things up but causes a problem—myopic (short-sighted) choices.

The Problem: Because the model commits to locally confident tokens, it can’t explore different promising paths. If an early token is locked in and later conflicts with the rest of the sentence, the model has a hard time escaping that dead end. Diversity shrinks, and errors snowball.

Failed Attempts: People tried inference-time guidance using external reward signals (like ā€œbe fluent,ā€ ā€œstay safe,ā€ ā€œfollow a formatā€). Those can help, but they require hand-crafted rewards, training extra models, and tuning per task. That’s heavy and not general.

The Gap: We need a general, training-free way to explore multiple trajectories and prefer whole-path quality, not just step-by-step confidence.

The Stakes: In everyday life, this affects code generation (bugs vs. clean code), math problem solving (step errors vs. correct chains), and long-form writing (coherent plots vs. tangled stories). If we only trust the shiniest token at each moment, we risk beautiful starts with broken endings. We need a method that looks at the entire journey and keeps the paths that stay solid all the way through.

02 Core Idea

šŸž Hook: Imagine four teams racing through a maze. Instead of only letting the team that looks best now keep going, you check which team’s entire path seems safest overall—and then you copy that team’s plan more often.

🄬 The Concept: The ā€œAha!ā€ Moment

  • What it is: Run several parallel generations (particles), measure how confident each complete path is so far (trajectory-level confidence), and repeatedly resample to keep and copy the best paths.
  • How it works:
    1. Start multiple masked diffusion runs at once.
    2. After each step, compute a whole-path confidence score (product of the kept tokens’ probabilities at that step, accumulated over time).
    3. Give higher weight to paths with higher trajectory confidence; resample so good paths get more copies.
    4. Repeat until fully unmasked; pick the best path at the end.
  • Why it matters: You stop being greedy about just this token and become wise about the entire trajectory. šŸž Anchor: Like a spelling bee team tournament: teams with steady, round-after-round performance get more teammates promoted, so the overall squad improves.

Three Analogies:

  1. Hiking guides: Several hikers try different trails up the mountain. At checkpoints, you favor the groups that stayed on stable ground the whole way, not just who jumped fastest at one point.
  2. Baking batches: You bake multiple trays of cookies. After each stage, you keep more of the trays that look consistently right (not just one perfect cookie on a tray of duds).
  3. Detective work: Several detectives follow different clue chains. You back the team whose whole story holds together, not the one with a single flashy clue.

šŸž Hook: You know how you don’t judge a movie by a single scene—you judge the full storyline?

🄬 The Concept: Trajectory-level Confidence

  • What it is: A score that measures how strong the whole generation path has been so far, not just the latest guess.
  • How it works:
    1. At each step, look at the tokens you decided to keep.
    2. Multiply their confidences into the path’s running score.
    3. Use this score to decide which paths to copy (resample) and which to drop.
  • Why it matters: Without it, you over-trust flashy single tokens and miss globally coherent answers. šŸž Anchor: A coach chooses players not by one great play, but by steady performance across the entire game.
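
In code terms, trajectory-level confidence is just a running product of accepted-token probabilities, usually accumulated in log space for stability. A small sketch (all names illustrative):

```python
import numpy as np

def update_path_score(log_score, accepted_probs):
    """Fold this step's accepted-token confidences into the whole-path score."""
    return log_score + np.log(np.asarray(accepted_probs) + 1e-12).sum()

# A path that accepted tokens with confidences 0.92, 0.88, 0.90 this step:
score = update_path_score(0.0, [0.92, 0.88, 0.90])   # log(0.92 * 0.88 * 0.90)
```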

šŸž Hook: Think of a classroom activity where several student groups try the same challenge, and after each round, you duplicate the groups with the best progress.

🄬 The Concept: Sequential Monte Carlo (SMC) with Particles

  • What it is: A method that keeps many candidates (particles) alive, scores them, resamples the stronger ones, and repeats.
  • How it works:
    1. Keep N parallel candidates.
    2. After each step, compute weights (here, trajectory-level confidence).
    3. Resample: pick more copies of high-weight candidates.
    4. Continue stepping until done; choose the top-weight result.
  • Why it matters: Without resampling, you waste effort on weak paths or let all weight collapse onto one path too early. šŸž Anchor: Like a science fair where the best projects get extra helpers and time each round, so the overall quality goes up.
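
The resampling step itself can be sketched as a weighted draw with replacement (multinomial resampling shown here; the paper may use a different resampling scheme, so treat this as illustrative):

```python
import numpy as np

def resample(particles, weights, rng=None):
    """Copy particles in proportion to their weights; reset weights afterwards."""
    rng = rng or np.random.default_rng()
    n = len(particles)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalize into a distribution
    idx = rng.choice(n, size=n, p=w)             # strong particles get picked more often
    new_particles = [particles[i] for i in idx]
    return new_particles, np.full(n, 1.0 / n)    # equal weights after resampling
```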

Before vs. After:

  • Before: MDLMs mostly used token-level confidence. Fast but greedy; easily stuck with early mistakes; limited exploration.
  • After: Self-rewarding SMC considers the whole journey. It explores, rescues promising but initially less-confident paths, and consistently improves text quality and reasoning.

Why It Works (intuition, no equations):

  • Each particle is a full attempt at the text. At every step, we score how trustworthy the newly accepted tokens are and multiply that into the particle’s weight. This turns the model’s own probabilities into a self-reward: stable, consistent paths keep winning. By resampling, we concentrate compute on these winners without needing any outside reward model.

Building Blocks:

  • Particles: parallel candidate generations.
  • Token-level confidence: the model’s probability for kept tokens at a step.
  • Trajectory-level confidence: multiply-and-accumulate those confidences across steps.
  • Resampling: copy better particles more often; drop weaker ones.
  • Adaptive resampling: only resample when needed (using effective sample size) to avoid too much randomness.
  • Gumbel-Max trick: a stable way to sample discrete tokens with a temperature knob.
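
For readers who want the formula behind "multiply-and-accumulate": if particle $i$ accepts the token set $\mathcal{A}_t^{(i)}$ at step $t$, its importance weight is updated by the product of those tokens' model probabilities. The notation here is ours, reconstructed from the description above, not copied from the paper:

```latex
w_t^{(i)} \;=\; w_{t-1}^{(i)} \cdot \prod_{k \in \mathcal{A}_t^{(i)}} p_\theta\big(x_k \mid x_{t-1}^{(i)}\big),
\qquad
\text{trajectory confidence after } t \text{ steps} \;=\; \prod_{s \le t}\;\prod_{k \in \mathcal{A}_s^{(i)}} p_\theta\big(x_k \mid x_{s-1}^{(i)}\big).
```

Here $x_{t-1}^{(i)}$ denotes particle $i$'s partially unmasked sequence before the current step.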

šŸž Anchor: Instead of betting your lunch money on one horse after the first furlong, you spread your bets, keep watching the whole race, and keep shifting your money to horses that run steadily well. That’s how you finish richer—here, with better text.

03 Methodology

At a high level: Input → Initialize particles → Repeat [Resample → Propagate (predict/unmask) → Re-weight] → Output the highest-weight sequence.

šŸž Hook: Picture starting a jigsaw puzzle with several friends. Each friend works on a copy. After each minute, you look at whose picture is coming together best and assign more helpers to that friend.

🄬 The Concept: Initialization

  • What it is: Start N particles, each a fully masked sequence, all with equal weight.
  • How it works:
    1. Choose number of particles N (e.g., 2–4 for strong gains).
    2. Create N sequences of [MASK] tokens.
    3. Set all weights to 1/N.
  • Why it matters: Without multiple starting points, you can’t explore different trajectories in parallel. šŸž Anchor: Give four teams the same blank crossword and the same starting time.

Step-by-step recipe:

  1. Resample (sometimes):

    • What happens: You pick particles proportionally to their current weights to form a new set of N particles (good ones may be duplicated).
    • Why it exists: It moves compute toward promising paths and avoids weight degeneracy (where a few particles dominate and others don’t matter).
    • Example: Suppose weights are [0.1, 0.5, 0.3, 0.1]; after resampling, the particle with 0.5 might appear twice; the 0.3 might also reappear; the 0.1 ones might vanish.
  2. Propagate (masked diffusion step):

    • What happens: (a) for each particle, the model predicts token distributions for the masked spots; (b) a remasking policy (top-k or threshold) decides which tokens to keep this round, remasking the rest; (c) the diffusion transition is applied: unmasked tokens stay, low-confidence ones are remasked, and the newly accepted tokens are sampled.
    • Why it exists: This is the actual text-construction step—predict and accept some tokens each round.
    • Example: In ā€œIf [MASK], [MASK] outside,ā€ the model proposes ā€œsunnyā€ with 0.95 and ā€œgoā€ with 0.90. If threshold = 0.9, accept both; otherwise, maybe keep just ā€œsunny.ā€
  3. Re-weight (trajectory-level confidence):

    • What happens: Multiply the particle’s weight by the product of the probabilities of the newly accepted tokens this step. This accumulates into a path score.
    • Why it exists: It rewards whole-path stability; particles that repeatedly make strong, confident choices gain more weight.
    • Example: If you accepted three tokens this round with confidences 0.92, 0.88, 0.90, multiply the weight by 0.92Ɨ0.88Ɨ0.90.
  4. Repeat until all tokens are unmasked or steps are done.

  5. Output: Choose the particle with the highest final weight.
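
To make the recipe concrete, here is a compact sketch of the full loop under stated assumptions: `model_probs` stands in for one masked-diffusion prediction pass, the accept rule is a plain confidence threshold with greedy argmax (the paper pairs this with Gumbel-Max sampling, described below), and every name is illustrative rather than the paper's code:

```python
import numpy as np

def sr_smc_generate(model_probs, seq_len, n_particles=4, n_steps=64,
                    threshold=0.9, mask_id=0, rng=None):
    """Self-rewarding SMC sketch: resample -> propagate -> re-weight."""
    rng = rng or np.random.default_rng()
    particles = [np.full(seq_len, mask_id) for _ in range(n_particles)]
    log_w = np.zeros(n_particles)                     # equal weights at the start

    for _ in range(n_steps):
        # 1) Adaptive resampling: only when the weights become too uneven (low ESS).
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        if 1.0 / np.sum(w ** 2) < n_particles / 2:    # ESS below N/2
            idx = rng.choice(n_particles, size=n_particles, p=w)
            particles = [particles[i].copy() for i in idx]
            log_w = np.zeros(n_particles)

        # 2) Propagate: predict the masked positions and accept the confident ones.
        for i, x in enumerate(particles):
            probs = model_probs(x)                    # (seq_len, vocab) distribution
            masked = (x == mask_id)
            conf = probs.max(axis=-1)
            tokens = probs.argmax(axis=-1)
            accept = masked & (conf >= threshold)
            x[accept] = tokens[accept]

            # 3) Re-weight: the self-reward is the product of accepted-token
            #    confidences, accumulated in log space for numerical stability.
            log_w[i] += np.log(conf[accept] + 1e-12).sum()

    return particles[int(np.argmax(log_w))]           # highest trajectory confidence
```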

šŸž Hook: You know how you only call for more helpers when your best teams start to fall behind? No need to reshuffle every minute.

🄬 The Concept: Adaptive Resampling (Effective Sample Size)

  • What it is: A rule to resample only when needed, based on how spread out the weights are.
  • How it works:
    1. Compute effective sample size (ESS) from the weights.
    2. If ESS is low (e.g., below N/2), resample now; else, skip resampling this step.
  • Why it matters: Resampling too often adds extra randomness; too rarely leaves you stuck with weak paths. ESS balances this. šŸž Anchor: A teacher rearranges groups only when a few groups start dominating and others can’t keep up.
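
The ESS check is a one-liner on the normalized weights; a sketch reusing the example weights from the recipe above:

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2): equals N for equal weights, approaches 1 when collapsed."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

weights = [0.1, 0.5, 0.3, 0.1]
resample_now = effective_sample_size(weights) < len(weights) / 2   # N/2 rule of thumb
```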

šŸž Hook: Choosing from many options can be messy. What if you had a fair, simple ā€œdrawā€ that still prefers better choices?

🄬 The Concept: Gumbel-Max Trick

  • What it is: A stable way to sample a token from probabilities using random ā€œticketsā€ (Gumbel noise) and a temperature knob.
  • How it works:
    1. For each token option, draw a random Gumbel noise value.
    2. Scale the noise by the temperature Ļ„ and add it to that option’s logit (Ļ„=0 reduces to plain argmax; larger Ļ„ adds diversity).
    3. Pick the option with the largest adjusted value.
  • Why it matters: It avoids brittle, fully greedy choices and lets SR-SMC explore without chaos. šŸž Anchor: Like drawing raffle tickets where better options get a head start, but surprises can still happen.
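
A minimal sketch of that draw; taking the argmax of `logits + Ļ„ Ā· gumbel_noise` samples from the softmax of `logits / Ļ„`, and Ļ„ = 0 recovers the plain argmax (illustrative code, not the paper's):

```python
import numpy as np

def gumbel_max_sample(logits, tau=1.0, rng=None):
    """Sample one token id from softmax(logits / tau) via the Gumbel-Max trick."""
    rng = rng or np.random.default_rng()
    if tau == 0.0:
        return int(np.argmax(logits))                    # purely greedy choice
    u = rng.uniform(low=1e-12, high=1.0, size=len(logits))
    gumbel = -np.log(-np.log(u))                         # standard Gumbel noise
    return int(np.argmax(np.asarray(logits) + tau * gumbel))
```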

Concrete walk-through with tiny data:

  • Input: A masked sentence of length 6: [M][M][M][M][M][M]. N=3 particles.
  • Round 1: Each particle predicts tokens. Using threshold 0.9, suppose each accepts 2 tokens. Re-weight by multiplying those two confidences. If particle B’s accepted tokens were 0.96 and 0.94, it gets higher weight.
  • ESS check: If weights are very uneven, resample so B gets duplicated. Now maybe we have particles A, B, B.
  • Round 2+: Continue. A might find a surprisingly good phrase and overtake B later. SR-SMC keeps that possibility alive.
  • Finish: Pick the path with the highest cumulative weight. That’s your output text.

Secret Sauce (why this is clever):

  • It turns the model’s own token probabilities into a path-level self-reward—no extra reward model or training.
  • It uses a principled SMC loop (resample → propagate → re-weight) that is well-known to reduce variance and sharpen the search.
  • It converts parallel compute directly into better sampling quality: more particles → more exploration → better odds of globally coherent text.

Practical settings from the paper:

  • Particles: default 4.
  • Resampling frequency: adaptive; for some models every 128 steps; for block diffusion, per block.
  • Temperature: often Ļ„=1; the method stays robust across Ļ„.
  • Compatible remasking policies: top-k or threshold.
  • Works with MDLMs, block diffusion LMs (BD3-LMs), and diffusion LLMs (LLaDA-1.5, Dream-7B).
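
For orientation, these defaults could be gathered into one configuration object. Every field name below is illustrative, and the threshold value in particular is a placeholder rather than a number reported in the paper:

```python
from dataclasses import dataclass

@dataclass
class SRSMCConfig:
    n_particles: int = 4               # default number of parallel particles
    temperature: float = 1.0           # Gumbel-Max temperature tau
    ess_fraction: float = 0.5          # resample when ESS < N * ess_fraction
    resample_every: int = 128          # step interval for some models; per block for BD3-LMs
    remask_policy: str = "threshold"   # "threshold" or "top_k"
    confidence_threshold: float = 0.9  # placeholder value, not from the paper
```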

04 Experiments & Results

The Test: Two categories.

  1. Text sample quality (OpenWebText-trained models): Measure generative perplexity (lower is better, meaning the samples look closer to real data; a sketch of how this metric is typically computed follows this list). Also track entropy (to make sure diversity doesn’t collapse) and the number of function evaluations (NFE) as a rough compute cost.
  2. Reasoning and coding (diffusion LLMs): Evaluate accuracy on GSM8K and MATH (math reasoning) and HumanEval and MBPP (code generation). Test at lengths 256 and 512 with block size 32.
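
Generative perplexity is usually computed by scoring the generated samples with a separate pretrained language model; the article does not name the evaluator, so the sketch below assumes GPT-2 via Hugging Face `transformers` purely as an illustration (a simple per-text average is shown; a token-weighted mean is also common):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def generative_perplexity(texts, eval_model_name="gpt2"):
    """exp(mean token-level negative log-likelihood) of texts under the eval model."""
    tok = AutoTokenizer.from_pretrained(eval_model_name)
    model = AutoModelForCausalLM.from_pretrained(eval_model_name).eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt").input_ids
            losses.append(model(ids, labels=ids).loss.item())   # mean cross-entropy per token
    return float(torch.exp(torch.tensor(sum(losses) / len(losses))))
```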

The Competition: Baselines include standard MDLM and BD3-LMs (block diffusion), auto-regressive models, and other diffusion samplers (SEDD, SSD-LM). For diffusion LLMs (LLaDA-1.5, Dream-7B), compare standard parallel decoding vs. SR-SMC.

The Scoreboard (made meaningful):

  • OpenWebText sample quality (length 1024):
    • MDLM baseline: Gen. PPL ā‰ˆ 46.8 → with SR-SMC ā‰ˆ 25.8. That’s like going from a C to a solid A in matching real text patterns.
    • BD3-LMs (block sizes 16/8/4): e.g., L′=16 baseline ā‰ˆ 33.4 → SR-SMC ā‰ˆ 21.1. Similar jumps hold for longer length 2048.
  • Diversity intact: Entropy stays high (near data’s level), showing SR-SMC improves quality without collapsing variety.
  • Diffusion LLMs (LLaDA-1.5, Dream-7B):
    • Across GSM8K, MATH, HumanEval, MBPP, SR-SMC improves accuracy consistently.
    • Average gains at L=256: LLaDA-1.5 from ~49.3 to ~52.1; Dream-7B from ~51.9 to ~56.4. Think of this as increasing a team’s win rate by several percentage points across four different tournaments.

Surprising/Notable Findings:

  • Scaling particles helps: Moving from 1 to 2–4 particles steadily boosts accuracy; the sweet spot is often around N=3–4.
  • Overtake phenomenon: In about 24–31% of blocks, a particle that wasn’t leading at the start wins by the end. This shows SR-SMC isn’t just copying the early leader; it allows comeback paths when a non-greedy choice pays off later.
  • Temperature robustness: Some baselines (e.g., Dream-7B) collapse at low temperature (repetition issues). SR-SMC remains stable across a wide range of Ļ„, showing its exploration-and-resampling loop avoids degenerate, repetitive loops.
  • Longer generations: Gains persist or even grow for longer sequences (e.g., length 512), suggesting path-level resampling reduces error accumulation.

Context on costs:

  • NFEs reflect compute; SR-SMC converts extra parallel inference into better results. With modest N (e.g., 2–4), you get strong quality gains without any retraining.

Takeaway of Results:

  • SR-SMC is a general, training-free, plug-in inference method that improves masked diffusion sampling quality, boosts math/coding accuracy, maintains diversity, and behaves robustly under different sampling temperatures.
  • It narrows the gap between diffusion-based text generators and strong autoregressive baselines, especially with block diffusion, without heavy engineering or external reward models.

05 Discussion & Limitations

Limitations:

  • Extra inference compute: Running N particles costs more than one pass. Although N=2–4 already gives strong gains, very large N may not be practical for latency-sensitive applications.
  • Objective mismatch: Trajectory confidence uses the model’s own likelihood. That’s task-agnostic and simple, but it doesn’t directly optimize for correctness or human preferences. In safety-critical settings, you may still want external rewards or checks.
  • Dependency on remasking policy: While SR-SMC works with top-k or threshold remasking, poor policy choices could still bottleneck exploration.

Required Resources:

  • A pretrained masked diffusion model or diffusion LLM.
  • Modest extra compute for N parallel particles (GPU RAM for parallel decoding and caches; the paper tested on NVIDIA H200/A800 setups).
  • Implementation of adaptive resampling (ESS) and Gumbel-Max token sampling.

When NOT to Use:

  • Ultra-low-latency scenarios where even 2–4Ɨ extra inference cost is unacceptable (e.g., real-time edge devices).
  • Tasks that absolutely require optimization of specific external metrics (e.g., formal verification) where a dedicated reward model or post-checkers are non-negotiable.
  • Very small models with weak base likelihoods; if the underlying model is too underpowered, trajectory confidence may not be discriminative enough to help.

Open Questions:

  • Better proposals: Can we design "look-ahead" or "twisted" proposals that explore even more smartly than the current bootstrap approach?
  • Hybrid rewards: How best to blend self-reward (likelihood) with lightweight task-specific signals for further gains without heavy engineering?
  • Budgeting compute: What is the optimal schedule for N and resampling frequency as sequence length grows or difficulty changes?
  • Theory-to-practice gap: How do formal SMC variance-reduction guarantees translate to large-scale, block-wise diffusion LLMs across diverse tasks?
  • Safety and alignment: Can trajectory-level confidence be combined with safety filters to preserve quality while reducing harmful outputs?

06 Conclusion & Future Work

Three-sentence summary: This paper introduces a self-rewarding Sequential Monte Carlo method for masked diffusion language models that rewards entire generation paths using the model’s own trajectory-level confidence. By running multiple particles in parallel and repeatedly resampling the stronger paths, it turns extra inference compute directly into higher-quality, more reliable text. It achieves consistent gains across text quality, math reasoning, and coding tasks without retraining or external rewards.

Main achievement: A general, plug-and-play, training-free inference-time scaling algorithm that unifies diffusion sampling and remasking with a principled SMC view, using trajectory-level confidence as importance weights.

Future directions: Smarter proposal distributions (look-ahead/twisted transitions), lightweight blended rewards, adaptive particle schedules, and safety-aware variants. Exploring these could further improve efficiency, robustness, and controllability.

Why remember this: It shifts the mindset from greedy, token-by-token confidence to whole-path confidence, enabling masked diffusion models to explore wisely and finish strong. With minimal engineering overhead, it boosts performance across domains, making diffusion-based language generation more practical and competitive in real-world applications.

Practical Applications

  • Boost math reasoning accuracy in educational tutors by sampling multiple solution paths and picking the most consistent one.
  • Improve code generation assistants by exploring several implementations in parallel and keeping the most coherent program.
  • Enhance long-form writing tools to maintain plot coherence across chapters by favoring globally consistent trajectories.
  • Stabilize enterprise report drafting where accuracy and consistency matter (e.g., finance summaries, legal drafts).
  • Increase robustness in data-to-text generation (e.g., product descriptions) without retraining, by smarter inference.
  • Support safer outputs by combining SR-SMC with lightweight filters that downweight low-confidence or policy-violating paths.
  • Accelerate prototyping: plug SR-SMC into existing MDLMs or block diffusion models to get better results fast.
  • Mitigate repetition failures in low-temperature decoding by keeping multiple candidates and resampling away from loops.
  • Improve structured output tasks (e.g., JSON/XML generation) by preferring paths that keep format-consistent tokens.
  • Assist interactive writing: let users spend extra compute on tricky paragraphs (increase particles) for higher quality.
Tags: masked diffusion language models, sequential Monte Carlo, self-rewarding sampling, trajectory-level confidence, importance weights, adaptive resampling, Gumbel-Max trick, inference-time scaling, block diffusion LMs, diffusion LLMs, OpenWebText, GSM8K, MATH, HumanEval, MBPP