VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation
Key Summary
- Visual Autoregressive (VAR) models draw whole grids of image tokens at once across multiple scales, which makes standard reinforcement learning (RL) unstable.
- The paper identifies the root cause as asynchronous policy conflicts: early and late steps behave very differently because they handle wildly different numbers of tokens.
- They upgrade GRPO (a popular RL method) with three pieces: Value as Middle Return (VMR), Per-Action Normalization Weighting (PANW), and Mask Propagation (MP).
- VMR inserts a smart, structure-preserving intermediate reward so early steps get useful feedback without changing the true best policy.
- PANW rebalances learning by downweighting steps that output many tokens, preventing high-resolution steps from dominating.
- MP focuses updates on the exact spatial tokens that actually affect the final reward, reducing noise across space and time.
- On text rendering (CVTG-2K), their method boosts Word Accuracy from 0.5536 to 0.7841 and NED from 0.7816 to 0.9081 while keeping CLIPScore high.
- On HPSv3, it tops strong diffusion-centric baselines in most categories, showing that the gains extend beyond text rendering.
- Ablations show prefix-focused training, a moderate sample count (K=2), mask propagation, and fine-grained alternation give the best stability-performance trade-offs.
- Overall, carefully shaping rewards and balancing step-wise updates fixes RL instability in VAR and yields better, more faithful images.
Why This Research Matters
This work makes fast, multi-scale image generators much more reliable by fixing the main cause of RL instability in VAR. Clearer, more faithful text in images helps everyday tasks like posters, packaging, educational graphics, and UI mockups. The method also boosts general visual quality and prompt following, making creative tools more predictable and useful. Because it preserves optimality while improving training stability, teams can align models confidently without fear of breaking them. The ideas (mid-horizon rewards, step balancing, and targeted credit) can generalize to video, 3D, and other complex generators.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine building a LEGO city. You start with big base plates (coarse details), then add roads and buildings (medium details), and finally place tiny signs and flowers (fine details). If someone judges your city only at the very end, it's hard to know which step helped or hurt the final look.
🥬 Filling (The Actual Concept):
- What it is: This paper studies how to teach Visual AutoRegressive (VAR) image generators with reinforcement learning (RL) so they follow goals (like correct text) without becoming unstable.
- How it works (story of the field):
- Before VAR, two stars dominated: diffusion models (great quality but many steps) and raster-scan AR models (predicting tokens one-by-one, slower at inference).
- VAR changed the game by drawing token grids in parallel, scale by scale (coarse-to-fine). This is fast and aligns with modern vision transformers.
- But when people tried to fine-tune VAR with RL (like GRPO), things got shaky: early and late steps looked too different (tiny vs huge token counts), so training became unstable.
- Why it matters: Without fixing this, RL can make images worse or wobble during training, especially with limited on-policy data compared to pretraining.
🍞 Bottom Bread (Anchor): Think of coaching a choir that starts with small groups and ends with a massive chorus. If you only grade the final song, the small groups never get clear feedback and the big chorus overwhelms everything. You need fair scoring for each group and a way to trace which voices shaped the result.
🍞 Top Bread (Hook): You know how a puppy learns tricks? It tries something and gets a treat (reward). Over time, it does more of what gets treats.
🥬 Filling (Reinforcement Learning):
- What it is: RL is a way for models to learn by trying actions and receiving rewards, favoring choices that lead to better outcomes.
- How it works:
- Model acts (draws tokens).
- Environment gives a reward (e.g., text is readable).
- Model updates to prefer actions that got higher rewards.
- Why it matters: RL aligns image generation with goals like correct spelling, style, and user preferences.
🍞 Bottom Bread (Anchor): If the goal says "Write 'VOTE BY MAIL' clearly," the model gets more reward when the letters are correct and readable.
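To make the try-and-reward loop concrete, here is a toy sketch (a three-armed bandit with made-up rewards, not the paper's image setup) showing how probability mass shifts toward actions that earn more reward:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)                      # preferences over three "tricks"
treats = np.array([0.0, 0.2, 1.0])        # reward for each trick (illustrative)

for _ in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    action = rng.choice(3, p=probs)       # the model acts
    reward = treats[action]               # the environment hands back a reward
    grad = -probs
    grad[action] += 1.0                   # gradient of log prob(action) w.r.t. logits
    logits += 0.1 * reward * grad         # do more of what got rewarded

print(probs.round(3))                     # most of the probability ends up on the best trick
```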
🍞 Top Bread (Hook): Imagine painting in stages: first big blobs, then shapes, then tiny highlights. Each stage handles a very different number of brushstrokes.
🥬 Filling (VAR Models):
- What it is: VAR models generate images by predicting grids of discrete tokens at multiple resolutions, from coarse to fine, in parallel per scale.
- How it works:
- Start at a small grid (e.g., 64×64 tokens) capturing big layout.
- Move to 128×128, 256×256… adding details.
- At each scale, predict all tokens in that grid simultaneously, conditioned on previous scales.
- Why it matters: This is faster than drawing one token at a time and matches how modern backbones process images.
🍞 Bottom Bread (Anchor): It's like sketching a poster: you block the layout, add shapes, then write sharp text; each pass refines the picture.
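A minimal sketch of that coarse-to-fine loop, with a random stand-in for the real VAR transformer (the `predict_scale` helper and the scale schedule below are illustrative assumptions):

```python
import numpy as np

def predict_scale(prev_grids, h, w, vocab_size=4096):
    """Stand-in for the VAR transformer: predict an (h, w) grid of discrete
    token ids, conditioned on all previously generated (coarser) grids."""
    rng = np.random.default_rng(len(prev_grids))
    return rng.integers(0, vocab_size, size=(h, w))

def generate(scales=((1, 1), (2, 2), (4, 4), (8, 8), (16, 16))):
    grids = []
    for h, w in scales:                    # coarse -> fine
        grid = predict_scale(grids, h, w)  # all h*w tokens of this scale in parallel
        grids.append(grid)                 # the next scale conditions on this one
    return grids                           # the finest grid is decoded into pixels

print([g.shape for g in generate()])       # [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16)]
```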
🍞 Top Bread (Hook): Picture two tug-of-war teams pulling at different times and strengths. If they aren't coordinated, the rope jerks unpredictably.
🥬 Filling (Asynchronous Policy Conflict):
- What it is: When early and late steps in VAR face very different tasks (few tokens vs many tokens), a single RL policy update can push them in conflicting directions.
- How it works:
- Early steps decide global layout with few tokens.
- Late steps write fine details with many tokens.
- If you treat all steps the same, late steps (with more tokens) dominate gradients, drowning out early steps and causing instability and slow convergence.
- Why it matters: Unbalanced learning makes the model wobble or misalign with goals like correct text.
🍞 Bottom Bread (Anchor): If the poster's background color (early) keeps changing while the text painter (late) tries to correct letters, they fight and the final poster looks messy.
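A quick back-of-the-envelope check shows why this happens when every token contributes equally to the loss (the scale schedule here is illustrative, not the paper's exact one):

```python
# Tokens per scale, coarse -> fine (illustrative schedule).
token_counts = [1 * 1, 4 * 4, 16 * 16, 64 * 64, 256 * 256]
shares = [n / sum(token_counts) for n in token_counts]
print([f"{s:.1%}" for s in shares])
# ['0.0%', '0.0%', '0.4%', '5.9%', '93.8%']  -> the finest step alone supplies
# roughly 94% of the summed per-token loss, so its gradient swamps the
# layout-setting early steps.
```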
🍞 Top Bread (Hook): Think of grading a group project by comparing team members within the same group to decide who improved more.
🥬 Filling (GRPO):
- What it is: Group Relative Policy Optimization is an RL method that learns by ranking multiple candidate outputs per prompt and nudging the policy toward better ones, without training a separate value model.
- How it works:
- Sample several candidates for a prompt.
- Score them with a reward model.
- Push the policy toward higher-ranked samples, while staying close to the old policy.
- Why it matters: It's simple, efficient, and strong, but in VAR, step imbalance breaks its stability.
🍞 Bottom Bread (Anchor): For "Write 'FISHING CHALLENGE'," GRPO prefers the image whose letters are most correct, then updates the policy toward that style.
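A minimal sketch of the group-relative scoring at GRPO's core: each candidate is judged against its own prompt group's mean and spread (the full objective typically also constrains the policy ratio and adds the KL leash described next):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """How much better or worse each candidate is than its prompt-group's
    average, measured in units of the group's spread."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four candidates sampled for the same prompt and scored by a reward model.
print(group_relative_advantages([0.42, 0.71, 0.55, 0.90]).round(2))
# Positive entries get pushed up in probability, negative ones get pushed down.
```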
🍞 Top Bread (Hook): Like adjusting a recipe carefully so each version stays close to the last, avoiding wild swings.
🥬 Filling (KL-Regularized RL):
- What it is: A stability leash that keeps the new policy close to the old one while still chasing higher rewards.
- How it works:
- Measure how much the new policy differs from the old.
- Penalize moving too far.
- Balance exploration (reward) with stability (penalty).
- Why it matters: Prevents the model from drifting into bad habits during RL, especially with few samples.
🍞 Bottom Bread (Anchor): When fixing text, it stops the model from suddenly changing art style or composition too drastically.
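A sketch of what the leash looks like for a single token; the penalty coefficient and the particular KL estimator below are commonly used choices in GRPO-style implementations, not values taken from the paper:

```python
import numpy as np

def kl_regularized_term(logp_new, logp_old, advantage, beta=0.04):
    """Chase the advantage through the probability ratio, but subtract a
    penalty that grows as the new policy drifts from the old one."""
    ratio = np.exp(logp_new - logp_old)
    # One common low-variance estimate of the KL between new and old policies.
    kl = np.exp(logp_old - logp_new) - (logp_old - logp_new) - 1.0
    return ratio * advantage - beta * kl

print(kl_regularized_term(logp_new=-1.1, logp_old=-1.0, advantage=0.8).round(4))
```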
The World Before → The Problem → Failed Attempts → The Gap → Stakes:
- Before: Diffusion and AR were strong; VAR sped things up with parallel grids.
- Problem: RL on VAR was unstable because different scales contributed very unevenly; final-only rewards made early steps guessy.
- Failed attempts: Vanilla GRPO across full scales; training only at partial prefixes sometimes worked better, hinting at step conflict.
- Gap: We need RL that gives early steps helpful, low-variance feedback, balances gradients by token count, and focuses credit on the truly responsible tokens.
- Stakes: This directly impacts everyday experiences: clearer in-image text (menus, signs, UI mockups), better prompt following, and more reliable image generation for design, education, and accessibility.
02 Core Idea
🍞 Top Bread (Hook): Imagine a relay race. If the first runner never gets split-time feedback until the very end, they can't pace well; if the last runner is ten times faster, their stretch dominates the coach's notes.
🥬 Filling (The "Aha!" Moment):
- One-sentence insight: Split the RL learning into two coordinated stages with a trusted mid-race score (VMR), rebalance each stage by how many actions it took (PANW), and highlight only the truly responsible actions (MP).
Multiple Analogies (3 ways):
- Cooking: VMR is tasting the stew halfway (early feedback), PANW is measuring ingredients so big spoons don't overpower small ones, MP is only salting the spots that need it.
- Orchestra: VMR is a mid-rehearsal check, PANW ensures the loud brass doesn't drown the strings, MP shines a spotlight on the sections that shape the finale.
- School project: VMR is a midterm that matches the final, PANW gives every teammate fair grading weight, MP comments only on the parts a student wrote.
Before vs After:
- Before: One giant RL objective at the end; high-resolution steps with many tokens dominated updates; early steps lacked guidance; training curves jittered.
- After: Two-stage GRPO with a middle return gives dense, low-variance feedback to early steps; PANW evens out per-step influence; MP directs gradients to the exact spatial-temporal areas that affected the reward. Training stabilizes and converges faster.
Why It Works (intuition):
- Early decisions shape everything downstream. If they only get end-of-process feedback, it's noisy and unfair. VMR gives them an on-mission, structure-preserving reward, like a trusted midterm that predicts the final, so they learn the right foundations.
- Steps differ by action count. PANW normalizes contribution per action, so a 1024×1024 grid (huge) doesn't squash a 128×128 grid (small). This balances gradients and KL across scales.
- Not all tokens matter equally for the reward. MP finds the paths from reward back to responsible tokens and gates updates to those areas, slashing variance in both space and time.
Building Blocks (with Sandwich explanations):
🍞 Top Bread (Hook): You know how getting a midterm grade helps you adjust before the final?
🥬 Filling (Value as Middle Return, VMR):
- What it is: A structure-preserving intermediate reward at a chosen middle step that splits learning into prefix (early) and suffix (late) stages without changing what the true best policy is.
- How it works:
- Pick a middle step (e.g., 128×128).
- Estimate a soft value at that step by sampling a few continuations and log-averaging their rewards.
- Train prefix to maximize this middle value; train suffix to maximize final reward, each with GRPO and KL.
- Why it matters: Early steps get dense, low-variance feedback that matches the final objective; the best overall policy remains unchanged in theory.
🍞 Bottom Bread (Anchor): When writing a poster, a mid-check at the 128×128 stage tells you if the layout sets you up for perfect lettering later.
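The "log-averaging" above can be written as a log-mean-exp over the sampled continuation rewards; the temperature `tau` and this exact normalization are illustrative assumptions around the paper's description of a soft, risk-sensitive average:

```python
import numpy as np

def soft_middle_value(terminal_rewards, tau=1.0):
    """Soft aggregate (log-mean-exp) of a few continuation rewards sampled
    from the same middle state; larger tau behaves like a plain mean,
    smaller tau leans toward the best continuation."""
    r = np.asarray(terminal_rewards, dtype=float)
    return tau * np.log(np.mean(np.exp(r / tau)))

# Two rollouts (K = 2) continued from the 128x128 middle step of one trajectory.
print(round(soft_middle_value([0.62, 0.78]), 3))   # slightly above the plain mean,
                                                   # pulled toward the better rollout
```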
🍞 Top Bread (Hook): If one teammate writes 5 pages and another writes 50, you shouldn't let the 50-page section completely dominate the grade.
🥬 Filling (Per-Action Normalization Weighting, PANW):
- What it is: A per-step weight that shrinks as the number of tokens at that step grows, keeping each step's influence balanced.
- How it works:
- Compute grid size (h×w) at each step.
- Apply a decay (exponent α) so large grids don't swamp the loss.
- Normalize across steps so gradients are comparable.
- Why it matters: Prevents high-resolution steps from overpowering updates and drowning early steps.
🍞 Bottom Bread (Anchor): Your grading rubric divides points per page so both the short intro and long body get fair influence.
🍞 Top Bread (Hook): When you find a spelling mistake in a poster, you trace which strokes caused the wrong letter.
🥬 Filling (Mask Propagation, MP):
- What it is: A mechanism that follows the reward signal back through the modelās multi-scale hierarchy to select the exact tokens most responsible for the final score.
- How it works:
- Start from reward-determining outputs (e.g., OCR-detected text boxes).
- Propagate masks backward from fine to coarse scales.
- Gate rewards and gradients so only relevant tokens get updated.
- Why it matters: Focuses credit where it's due, reducing noise and stabilizing learning across space and time.
🍞 Bottom Bread (Anchor): If "CHALLENGE" is misspelled, the mask highlights those letters and their earlier building blocks so updates fix the right pieces.
03 Methodology
At a high level: Prompt and current state → Sample candidate token grids (coarse to fine) → Compute rewards (final and middle) → Two-stage GRPO updates with PANW and Mask Propagation → Updated VAR policy.
Step 1. Formalize VAR as a simple RL problem (deterministic MDP).
🍞 Top Bread (Hook): Think of a choose-your-own-adventure book where each page you pick is fixed once chosen.
🥬 Filling (Deterministic MDP for VAR):
- What it is: A way to model the generation as states (what's drawn so far), actions (next token grid), and deterministic transitions (the next state is exactly what you append).
- How it works:
- State: all previously generated grids up to a step.
- Action: produce the next grid (e.g., 256×256 tokens) in parallel.
- Transition: new state is just old state plus that grid; no randomness from the environment.
- Reward: given at the end (final image), like OCR-based text accuracy or a human preference score.
- Why it matters: This framing lets us use GRPO and analyze optimal policies with a KL leash.
🍞 Bottom Bread (Anchor): You build a poster layer-by-layer; each new layer becomes part of the fixed design, and only at the end do you get graded.
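In code, the deterministic transition is nothing more than appending the chosen grid to the state (the tiny data structure below is a hypothetical illustration, not the paper's implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VarState:
    prompt: str
    grids: tuple = ()          # token grids generated so far, coarse to fine

def transition(state: VarState, action_grid) -> VarState:
    """Deterministic transition: the next state is exactly the old state with
    the newly predicted grid appended; the environment adds no noise."""
    return VarState(state.prompt, state.grids + (action_grid,))

s0 = VarState("Write 'VOTE BY MAIL' on a poster")
s1 = transition(s0, ((1, 2), (3, 4)))    # a toy 2x2 token grid
print(len(s1.grids))                      # 1 grid appended; the reward only arrives at the end
```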
Step 2. Two-stage GRPO with Value as Middle Return (VMR).
🍞 Top Bread (Hook): You wouldn't wait until the concert ends to tell the opening act how they did.
🥬 Filling (VMR in practice):
- What it is: Insert a middle return at step m (e.g., 128×128) to split training into prefix (1…m−1) and suffix (m…T−1) subproblems.
- How it works:
- Suffix stage: From state at step m, roll out to the end for multiple candidates; score by the final reward; apply GRPO with KL to improve later steps.
- Middle return estimation: For each state at step m, take a small number of on-policy rollouts (e.g., K=2) and compute a soft, log-mean-exp of their terminal rewards (a risk-sensitive average that's stable and informative).
- Prefix stage: Train the early steps to maximize that middle value using GRPO with KL, so early decisions directly chase a reliable predictor of final success.
- Why it matters: Early steps get rich feedback, training becomes stable, and theory guarantees we don't change the family-optimal policy.
🍞 Bottom Bread (Anchor): Like giving a midterm that perfectly anticipates the final, the early chapters can be improved with confidence.
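Putting the two stages together, training alternates prefix and suffix GRPO updates around the middle step. The skeleton below keeps only that structure; every helper is a hypothetical toy stand-in for the real rollout, reward, and update machinery:

```python
import random

# Hypothetical stand-ins for the real rollout / reward / update utilities.
def rollout_to(policy, prompt, stop_at):   return {"prompt": prompt, "step": stop_at}
def rollout_from(policy, state):           return {"image": "toy"}
def final_reward(sample):                  return random.random()
def soft_middle_value(rewards):            return sum(rewards) / len(rewards)  # see log-mean-exp sketch above
def grpo_update(policy, states, scores, stage):
    print(f"{stage} update, mean score {sum(scores) / len(scores):.2f}")

def train_two_stage(policy, prompts, m="128x128", prefix_per_suffix=3, group=4, K=2):
    """Alternate GRPO on the two sub-problems split at middle step m: several
    prefix updates (scored by the middle value) per suffix update (scored by
    the final reward), mirroring the 3:1 schedule described later."""
    for i, prompt in enumerate(prompts):
        states_m = [rollout_to(policy, prompt, m) for _ in range(group)]
        if i % (prefix_per_suffix + 1) == prefix_per_suffix:
            scores = [final_reward(rollout_from(policy, s)) for s in states_m]
            grpo_update(policy, states_m, scores, stage="suffix")
        else:
            scores = [soft_middle_value([final_reward(rollout_from(policy, s))
                                         for _ in range(K)]) for s in states_m]
            grpo_update(policy, states_m, scores, stage="prefix")

train_two_stage(policy=None, prompts=["poster 1", "poster 2", "poster 3", "poster 4"])
```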
Step 3. Balance step influence with Per-Action Normalization Weighting (PANW).
🍞 Top Bread (Hook): If one player takes 100 shots and another takes 10, you might compare their shooting percentages, not just total points.
🥬 Filling (PANW details):
- What it is: A per-step weight inversely related to the number of tokens (with a decay exponent α) that balances gradients across resolutions.
- How it works:
- For each step t, compute grid size (h_t×w_t).
- Compute a weight that reduces the dominance of large grids (best α around 0.6–0.8 in ablations).
- Normalize across steps in a batch so updates are comparable.
- Why it matters: Without PANW, late, high-res steps overwhelm updates; with PANW, early structure gets learned.
🍞 Bottom Bread (Anchor): Your coach normalizes stats so benchwarmers and starters both get fair feedback.
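One simple way to realize such a weight is to raise each step's token count to a negative power and renormalize; the exact functional form and α = 0.7 below are assumptions consistent with the description, not the paper's formula:

```python
import numpy as np

def panw_weights(grid_shapes, alpha=0.7):
    """Per-step weights that shrink as the token count grows
    (token_count ** -alpha), renormalized so they average to 1."""
    counts = np.array([h * w for h, w in grid_shapes], dtype=float)
    w = counts ** (-alpha)
    return w / w.mean()

shapes = [(1, 1), (4, 4), (16, 16), (64, 64)]   # illustrative coarse-to-fine grids
print(panw_weights(shapes).round(3))
# The coarse, layout-setting steps get the largest weights,
# so the finest grid no longer dominates the combined loss.
```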
Step 4. Aim updates precisely with Mask Propagation (MP).
🍞 Top Bread (Hook): When fixing a misspelled word on a poster, you don't repaint the sky; you fix the letters.
🥬 Filling (MP mechanics):
- What it is: A spatiotemporal mask that starts from the rewardās source (like OCR text regions) and flows backward across scales to gate learning.
- How it works:
- Build initial masks from outputs tied to the reward (e.g., detected word boxes from OCR).
- Propagate these masks from fine to coarse levels through the modelās hierarchy.
- Use masks to gate intermediate rewards and gradients so only relevant tokens get strong updates.
- Why it matters: Cuts noise, stabilizes credit assignment, and yields better text fidelity.
🍞 Bottom Bread (Anchor): If "GOTCHA!" is wrong, masks highlight those letters and their earlier scaffolding.
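A minimal sketch of the fine-to-coarse propagation, using max-pooling so a coarse token is kept whenever any fine token it covers lies inside the reward region; this is one plausible realization of the idea, not the paper's exact operator:

```python
import numpy as np

def propagate_mask(fine_mask, coarse_shape):
    """Project a fine-scale binary mask (e.g., OCR word boxes) onto a coarser
    token grid: a coarse cell is kept if any fine cell it covers is kept."""
    H, W = fine_mask.shape
    h, w = coarse_shape                      # assumes H % h == 0 and W % w == 0
    return fine_mask.reshape(h, H // h, w, W // w).max(axis=(1, 3))

fine = np.zeros((8, 8), dtype=int)
fine[2:4, 1:6] = 1                           # tokens under the misspelled word
print(propagate_mask(fine, (4, 4)))          # the responsible region one scale up
print(propagate_mask(fine, (2, 2)))          # ...and at an even coarser scale
```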
Step 5. Rewards for specific tasks.
🍞 Top Bread (Hook): If you're grading spelling in a poster contest, you check completeness, correctness, and penalize extra or missing letters.
🥬 Filling (Text rendering reward design with OCR):
- What it is: A reward that blends completeness (did the required words appear?), similarity (are characters correct and in order?), and a length mismatch penalty (avoid over/under-generation), weighted by OCR confidence.
- How it works:
- Use OCR to read predicted words and confidences.
- Completeness: count required words, discount duplicates by minimum confidence.
- Similarity: use string similarity (like edit distance) times confidence to reward near-matches.
- Penalty: penalize extra/missing characters to prevent cheating by spamming letters.
- Combine to get the final reward.
- Why it matters: Encourages correct, readable text without gaming the metric.
🍞 Bottom Bread (Anchor): The model gets more points for exactly writing "VOTE BY MAIL," fewer for "VOT BY MIAL," and loses points for random extra letters.
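A toy version of such a reward, blending the three ingredients (completeness, confidence-weighted similarity, and a length-mismatch penalty); the weights and exact formulas here are illustrative, not the paper's:

```python
from difflib import SequenceMatcher

def text_reward(required_words, ocr_words, ocr_confs,
                w_complete=0.4, w_sim=0.5, w_len=0.1):
    """Blend completeness, confidence-weighted string similarity, and a
    penalty for generating too many or too few characters."""
    found = sum(any(req.lower() == o.lower() for o in ocr_words) for req in required_words)
    completeness = found / max(len(required_words), 1)

    target, pred = " ".join(required_words), " ".join(ocr_words)
    similarity = SequenceMatcher(None, target.lower(), pred.lower()).ratio()
    confidence = sum(ocr_confs) / len(ocr_confs) if ocr_confs else 0.0

    length_penalty = abs(len(pred) - len(target)) / max(len(target), 1)
    return w_complete * completeness + w_sim * similarity * confidence - w_len * length_penalty

print(round(text_reward(["VOTE", "BY", "MAIL"], ["VOTE", "BY", "MIAL"], [0.95, 0.9, 0.8]), 3))
print(round(text_reward(["VOTE", "BY", "MAIL"], ["VOTE", "BY", "MAIL"], [0.95, 0.9, 0.9]), 3))
# The correctly spelled rendering earns the higher reward.
```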
Step 6. Training and sampling recipe.
- Candidates per prompt (group size): 16; prompts per update (batch): 16.
- On-policy updates up to ~1,200; learning rates around 1e-6 to 1e-5 with AdamW; no CFG during training (CFG=5 at sampling/evaluation).
- Alternation: do three prefix GRPO updates for each suffix update (fine-grained alternation works better than coarse-grained).
- Middle step m: best around 128×128 or 256×256; default m=128×128.
- Middle value estimation: small K=2 is best trade-off.
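Collected in one place, the recipe above might look like the following configuration object (field names are illustrative, not taken from the authors' code):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    group_size: int = 16            # candidates sampled per prompt
    prompts_per_update: int = 16    # batch of prompts per update
    max_updates: int = 1200         # on-policy updates (roughly)
    lr: float = 1e-6                # AdamW, reported range 1e-6 to 1e-5
    cfg_train: float = 0.0          # no CFG during training
    cfg_sample: float = 5.0         # CFG = 5 at sampling/evaluation
    prefix_per_suffix: int = 3      # fine-grained 3:1 alternation
    middle_step: str = "128x128"    # default middle step m
    vmr_rollouts_K: int = 2         # rollouts for the middle-value estimate
    panw_alpha: float = 0.7         # decay exponent, best around 0.6-0.8

print(TrainConfig())
```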
The Secret Sauce:
- VMR gives early steps reliable, structure-preserving guidance.
- PANW equalizes step influence despite huge action-count differences.
- MP sharpens who gets credit, across both space and time. Together, they remove the main sources of RL instability in VAR and make alignment stick.
04 Experiments & Results
The Test (what and why):
- Text rendering (CVTG-2K): Tests if the model can generate images with accurate, readable text, which is crucial for posters, packaging, and UI mocks. Metrics: Word Accuracy (strict), NED (character similarity), CLIPScore (semantic alignment).
- Human Preference Score (HPSv3): Uses a learned preference model to judge visual quality and prompt faithfulness across many categories, approximating human taste.
The Competition (baselines):
- NextFlow (VAR-based) before RL.
- Vanilla GRPO on VAR (unstable baseline).
- Diffusion-centric strong baselines (Flux-dev, Kolors, Playground-v2.5, etc.).
The Scoreboard (with context):
- CVTG-2K:
  - NextFlow-RL vs. NextFlow: Word Accuracy jumps from 0.5536 to 0.7841 (like going from a mid C to a solid A-), NED from 0.7816 to 0.9081 (clearer letters), and CLIPScore edges up from 0.8068 to 0.8224 (the meaning is kept while the text is fixed).
  - Among diffusion baselines, NextFlow-RL is competitive or better on text fidelity while preserving semantics, showing that VAR + RL can match or beat larger diffusion models for text rendering.
- HPSv3:
  - Overall ("All") rises from 8.43 to 10.64, a big leap that users would notice as sharper, cleaner images that follow prompts better.
  - Category wins or near-wins: Architecture, Animals, Natural Scenery, Plants, Food, Others; second-best on Characters and Design. The improvements are not limited to text; they generalize to many visual styles and subjects.
Surprising Findings:
- Prefix wins: Training the prefix (early steps) yields the largest gains; once the foundation is right, later steps can polish effectively.
- Small K is sweet: K=2 middle-value samples work best; higher K surprisingly can add variance due to trajectory heterogeneity and interactions with masking.
- Fine-grained alternation: Doing several prefix updates per suffix update (3:1) beats coarse cycles (e.g., 300 prefix then 100 suffix). Frequent, localized updates resolve step conflicts faster.
- Mask Propagation matters: Turning on MP boosts text fidelity without hurting CLIPScore, confirming that targeted credit assignment pays off.
- Best middle step: m around 128×128 or 256×256 gives strong results; earlier middle returns reduce variance for early credit assignment.
What it looks like (qualitative):
- Before: Letters often jumbled, missing, or with wrong glyphs; long phrases suffer misordering.
- After: Corrected character order, fewer missing/extraneous letters, better readability across fonts and layouts; globally, images look cleaner with crisper structure and better adherence to prompts.
Takeaway in plain terms:
- The method turns a wobbly coach (vanilla GRPO on VAR) into a calm, fair one: early players get mid-game feedback (VMR), everyone's voice is balanced (PANW), and only the right notes are corrected (MP). Scores shoot up in both text accuracy and general image appeal.
05 Discussion & Limitations
Limitations (be specific):
- Hyperparameter sensitivity: Choosing the middle step m and decay α matters; picking m too late can underperform, and α outside 0.6–0.8 can over- or under-normalize.
- Reward dependence: OCR-based rewards and HPSv3 guide improvements, but any bias or blind spot in these reward models can steer learning in unintended ways.
- Compute and latency: On-policy sampling with K rollouts and group size 16 per prompt increases cost; running a fast reward service (OCR/HPS) is necessary to keep throughput high.
- Task specificity: Mask Propagation relies on deriving accurate reward-linked masks (e.g., text boxes). For tasks without clear spatial hooks, MP may need redesign.
- Theoretical scope: VMR preserves family-optimality under the VAR factorization, but guarantees outside this family (or with different architectures) aren't established here.
Required Resources:
- A capable VAR backbone (e.g., NextFlow/TokenFlow) and tokenizer.
- On-policy RL infrastructure for GRPO (grouped sampling, ranking, KL control).
- Fast reward models/services (OCR, HPSv3) and GPU memory for multi-candidate rollouts.
- Implementation of PANW and MP integrated with the modelās multi-scale hierarchy.
When NOT to Use:
- If you can only afford off-policy or single-sample training with no reliable reward signals.
- If your task has no meaningful intermediate structure to anchor masks, and rewards are extremely sparse and noisy.
- If late-stage details are all that matter and early structure is fixed by design (then VMR gives less benefit).
Open Questions:
- Adaptive middle step: Can we learn or schedule m dynamically per prompt to match content difficulty?
- General reward shaping: Beyond OCR/HPS, how to design robust, low-bias rewards for style, safety, and complex compositional goals?
- Broader architectures: How well do VMR, PANW, and MP transfer to video VAR, 3D generation, or hybrid AR-diffusion models?
- Efficient credit assignment: Can we approximate mask propagation with learned attention maps or gradient-free signals to reduce overhead?
- Theory under stochastic dynamics: The analysis assumes deterministic transitions; what changes when environment noise or stochastic decoders enter the loop?
06 Conclusion & Future Work
Three-Sentence Summary:
- VAR models are fast and powerful but hard to align with RL because different scales contribute unevenly, causing asynchronous policy conflicts.
- This paper stabilizes RL for VAR by inserting a structure-preserving middle return (VMR), balancing step influence (PANW), and focusing updates on responsible tokens (MP).
- The approach sharply improves text fidelity and general image quality, outperforming strong baselines while maintaining semantic alignment.
Main Achievement:
- A principled, practical RL recipe for VAR that preserves optimality within the model family and resolves cross-scale instability, demonstrated by large gains on text rendering and human preference metrics.
Future Directions:
- Automate picking the middle step and α via meta-learning or adaptive schedules; extend reward design beyond OCR/HPS to composition and safety; scale MP to video and 3D with temporal-spatial consistency.
Why Remember This:
- It shows that aligning fast, multi-scale generators isn't about bigger models alone; it's about fair feedback, balanced updates, and precise credit. With the right structure (VMR), balance (PANW), and focus (MP), RL can reliably turn VAR's speed into faithful, high-quality results.
Practical Applications
- Generate marketing posters with accurate, readable slogans and product names.
- Create UI mockups where button labels and menus are spelled correctly.
- Design packaging images with precise brand text and nutritional info.
- Produce educational diagrams with legible labels and formulas.
- Refine storyboards or comics where captions and speech bubbles must be clear.
- Localize images across languages by aligning with OCR-based multilingual rewards.
- Improve product listing photos that include sharp, correct on-image text.
- Boost adherence to complex prompts in concept art and advertising visuals.
- Enhance signage and wayfinding mockups with correct text placement.
- Pre-press checks: use the reward to automatically fix text fidelity before printing.