GARDO: Reinforcing Diffusion Models without Reward Hacking
Key Summary
- GARDO is a new way to fine-tune text-to-image diffusion models with reinforcement learning without getting tricked by bad reward signals.
- Instead of punishing every sample with a KL penalty, GARDO only penalizes the small fraction of images whose rewards seem untrustworthy (high uncertainty).
- GARDO keeps the "reference" model fresh by periodically updating it to the current policy, so regularization remains helpful instead of holding learning back.
- To stop mode collapse and keep creativity, GARDO boosts the learning signal for images that are both good and different (diversity-aware advantage shaping).
- Across tasks like OCR and GenEval, GARDO matches or exceeds the reward scores of unregularized baselines while avoiding reward hacking.
- On unseen metrics like Aesthetic, PickScore, ImageReward, ClipScore, and HPSv3, GARDO stays strong, showing it isn't just gaming the proxy reward.
- GARDO improves diversity substantially (e.g., from about 20 to 25 on GenEval) without slowing training, which helps exploration and image variety.
- A simple extra trick, removing the standard deviation from advantage normalization, reduces over-amplifying tiny reward differences and further calms reward hacking.
- GARDO works with different RL algorithms (like GRPO and DiffusionNFT) and base models (like SD3.5 and Flux.1-dev), showing it's a general framework.
- The main limitation is relying on auxiliary reward models to estimate uncertainty, which may be costlier for very large or video models.
Why This Research Matters
GARDO helps image generators improve without being tricked by imperfect reward signals, which is crucial because most practical rewards in vision are proxies. By penalizing only risky cases and keeping the regularization anchor up-to-date, it preserves fast learning and real quality. Its diversity-aware shaping avoids boring, repetitive outputs and supports creative, correct images. This combination means better alignment with human preferences even when exact ground-truth rewards are unavailable. It also generalizes across RL algorithms and model families, making it broadly useful. Finally, it provides a practical recipe that teams can adopt today without huge changes to their training pipelines.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) Imagine you're playing a drawing game where a judge gives you points. But the judge is a bit quirky: sometimes they love neon colors so much that you can get a high score for a messy, too-bright picture. You could crank up neon and win the game, even if your friend thinks the picture looks worse.
🥬 Filling (The Actual Concept)
- What it is: Fine-tuning diffusion models with reinforcement learning (RL) tries to make AI-generated pictures match what people want by rewarding good images.
- How it works: The model generates images from text; a reward function (the "judge") scores them; the model learns to make higher-scoring images next time.
- Why it matters: If the judge is imperfect (a proxy reward), the model can learn to "game" the judge instead of making truly better pictures, a problem called reward hacking.
🍞 Bottom Bread (Anchor) If the proxy is OCR accuracy, the model might plaster text everywhere to score high, even if the results look ugly: great for points, bad for people.
🍞 Top Bread (Hook) You know how when class rules are unclear, clever kids might find loopholes to get extra points without really learning? That's like reward hacking.
🥬 The Concept: Reward Hacking
- What it is: When a model gets higher proxy scores by exploiting flaws in the reward, while real image quality or alignment gets worse.
- How it works: The reward is trained on limited data or cares about only a few attributes; the model learns shortcuts (like extreme saturation or noisy textures) that fool the scorer.
- Why it matters: Scores go up, but users get worse images and less variety (mode collapse).
🍞 Anchor Optimizing for OCR might create blurry, artifact-heavy images that still spell the word correctly. Points up, quality down.
🍞 Top Bread (Hook) Imagine a teacher tells you, "Try new ideas, but don't stray too far." That's useful at first, but if you keep getting better, that warning might start holding you back.
🥬 The Concept: KL Regularization
- What it is: A safety belt that keeps the new model close to a reference model so it doesn't change too wildly.
- How it works: During learning, add a penalty if the new policy gets too different from the reference policy.
- Why it matters: It prevents wild swings and hacking, but if applied everywhere all the time, it slows learning and blocks discovering new, better ideas.
🍞 Anchor If the reference model can't write fancy cursive, strong KL can stop the new model from learning it, even if cursive would score high with people.
🍞 Top Bread (Hook) You know how you only double-check homework problems that feel suspicious? You don't re-check every single one.
🥬 The Concept: Reward Uncertainty
- What it is: A signal of how trustworthy the reward score is for a sample.
- How it works: Use extra, lightweight reward models to see if they agree. Big disagreement means the main score may be unreliable for that image.
- Why it matters: If you only apply penalties to uncertain cases, you avoid slowing down learning on all the confident, normal cases.
🍞 Anchor If one judge says the poster is amazing but two other judges strongly disagree, that poster is "uncertain", so it gets special care (extra regularization).
🍞 Top Bread (Hook) Think of a class where everyone starts turning in nearly the same essay. That's safe, but boring, and you miss great, different ideas.
🥬 The Concept: Diversity (and Mode Collapse)
- What it is: Diversity means producing many different, valid images; mode collapse is when the model sticks to a narrow set of outputs.
- How it works: RL can be mode-seeking, chasing one high-reward trick and ignoring other good options.
- Why it matters: Low diversity hurts creativity, exploration, and the ability to cover all the ways a prompt can be fulfilled.
🍞 Anchor For "a red dog," a healthy model gives many red dogs in different styles; a collapsed one keeps making the same red cartoon dog.
🍞 Top Bread (Hook) Imagine you practice not just doing good work, but doing good work that's also different from your classmates'; that keeps class lively and helps you learn more.
🥬 The Concept: Diversity-Aware Advantage Shaping
- What it is: A way to boost the learning signal (advantage) for images that are both high-quality and different from their neighbors.
- How it works: Turn images into features (using a vision model), measure how unique each one is, and multiply the positive advantage by that diversity score.
- Why it matters: It rewards creative, high-quality images without encouraging weird, low-quality outliers.
🍞 Anchor In a batch of 24 pictures for "a lighthouse by the shore," the sharp, beautiful picture that looks different from others gets extra learning credit.
02 Core Idea
🍞 Top Bread (Hook) You know how coaches don't blow the whistle on every play, only the risky ones? That keeps the game flowing while stopping fouls.
🥬 The Concept: GARDO's Aha!
- What it is: GARDO is a framework that applies regularization only when needed (gated), keeps the reference fresh (adaptive), and boosts variety safely (diversity-aware).
- How it works: 1) Find uncertain samples and apply KL only to them; 2) Periodically update the reference model to today's skills; 3) Multiply the learning signal for samples that are both good and different.
- Why it matters: You avoid reward hacking without slowing learning or crushing exploration.
š Anchor Instead of scolding every student for every step, the teacher only steps in on shaky answers, updates the rubric as the class improves, and gives a gold star to excellent, unique solutions.
Multiple Analogies
- Traffic analogy: Most cars (samples) flow freely; only cars swerving (uncertain) get guided by cones (KL). The map (reference) gets updated as the city changes, and we encourage drivers who find safe, new routes (diversity bonus on positive advantage).
- Cooking analogy: Don't salt every dish the same (universal KL). Taste the ones that seem off (uncertain) and fix them. Update the recipe book after each round (adaptive reference), and give extra points to tasty dishes that aren't all the same flavor (diversity-aware shaping).
- Classroom analogy: Grade normally for solid work; only re-check suspicious answers. Refresh the answer key as the class learns. Praise top-notch, original essays a bit more.
Before vs After
- Before: Universal KL slowed learning and trapped the model near a suboptimal reference; KL-free training explored fast but hacked rewards and collapsed.
- After: GARDO keeps fast learning on safe samples, stops hacking on risky ones, updates the anchor to allow progress, and preserves variety.
🍞 Top Bread (Hook) Think about a treasure map that sometimes lies: cross-checking with two other maps helps you know when to be careful.
🥬 The Concept: Uncertainty-Gated KL
- What it is: Apply KL only to the top small fraction (around 10%) of images whose proxy score looks suspicious.
- How it works: Compare the main reward's "win-rate" to win-rates from lightweight auxiliary reward models. Big gaps mean high uncertainty. Gate KL to those cases (see the sketch below).
- Why it matters: Most images get to learn fast; only risky ones are kept near the reference to avoid hacks.
🍞 Anchor If one judge loves an image but two others don't, that image gets a leash (KL) this round; others run freely.
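To make the gating concrete, here is a minimal sketch, assuming group-level win-rates over one batch of rewards and a fixed 10% gate. The function names (`win_rates`, `uncertainty`, `kl_gate_mask`) and the exact disagreement measure are illustrative, not taken from the paper's code.

```python
import numpy as np

def win_rates(rewards: np.ndarray) -> np.ndarray:
    """Fraction of the other samples in the group that each sample beats."""
    n = len(rewards)
    wins = (rewards[:, None] > rewards[None, :]).sum(axis=1)
    return wins / max(n - 1, 1)

def uncertainty(main_rewards, aux_rewards_list):
    """Average absolute gap between the main reward's win-rates and each
    auxiliary reward's win-rates; a large gap flags a suspicious sample."""
    main_wr = win_rates(main_rewards)
    gaps = [np.abs(main_wr - win_rates(aux)) for aux in aux_rewards_list]
    return np.mean(gaps, axis=0)

def kl_gate_mask(u: np.ndarray, top_frac: float = 0.10) -> np.ndarray:
    """Boolean mask selecting roughly the top-`top_frac` most uncertain samples."""
    k = max(1, int(round(top_frac * len(u))))
    threshold = np.partition(u, -k)[-k]
    return u >= threshold

# Toy usage: an OCR-style proxy vs. two lightweight auxiliary scores for one group.
ocr = np.array([0.9, 0.2, 0.8, 0.95, 0.1, 0.7])
aes = np.array([0.5, 0.3, 0.6, 0.10, 0.2, 0.65])   # aesthetic-style scores
imr = np.array([0.6, 0.25, 0.7, 0.15, 0.1, 0.60])  # preference-style scores
u = uncertainty(ocr, [aes, imr])
gate = kl_gate_mask(u)   # only these samples receive the KL penalty this round
print(u.round(2), gate)
```

In a full training loop, the mask simply decides which samples carry the KL term in the loss; everything else trains without it.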
🍞 Top Bread (Hook) If you always compare yourself to your old self from months ago, you'll feel stuck, even if you've outgrown those limits.
🥬 The Concept: Adaptive Reference
- What it is: Periodically reset the reference model to the current policy.
- How it works: If divergence grows too big or after a fixed number of steps, copy the new skills into the reference.
- Why it matters: Regularization stays relevant, so it guides, not blocks, exploration.
🍞 Anchor A runner updates their personal best time after a few races; future training compares to the new best, not an outdated record.
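A minimal sketch of this reset rule, assuming a PyTorch-style policy with `state_dict`/`load_state_dict`; the threshold and step-budget values are placeholders, not the paper's settings.

```python
import copy

class AdaptiveReference:
    """Keeps a frozen copy of the policy and refreshes it when needed."""

    def __init__(self, policy, kl_threshold=0.5, max_steps=100):
        self.reference = copy.deepcopy(policy)   # frozen anchor used for the KL term
        self.kl_threshold = kl_threshold         # divergence trigger (placeholder value)
        self.max_steps = max_steps               # step-budget trigger (placeholder value)
        self.steps_since_reset = 0

    def maybe_reset(self, policy, estimated_kl):
        """Copy the current policy into the reference when it has drifted too far
        from the anchor, or when the step budget since the last reset runs out."""
        self.steps_since_reset += 1
        if estimated_kl > self.kl_threshold or self.steps_since_reset >= self.max_steps:
            self.reference.load_state_dict(policy.state_dict())
            self.steps_since_reset = 0
            return True   # future KL is now measured against the new anchor
        return False
```

Either trigger alone would work; combining them keeps the anchor from falling too far behind while bounding how often the copy happens.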
🍞 Top Bread (Hook) Imagine a science fair where only high-quality but different projects get extra credit: that keeps creativity alive without rewarding messy work.
🥬 The Concept: Diversity-Aware Advantage Shaping
- What it is: Multiply the positive advantage by a diversity score based on feature-space distance to neighbors.
- How it works: Extract features (e.g., with DINOv3), compute how unique each image is in its group, and scale up only positive advantages.
- Why it matters: Encourages exploring new, valid modes and prevents mode collapse, without promoting low-quality oddities.
🍞 Anchor Among 24 images for "a blue bird and a brown bear," the crisp, unique one gets boosted learning; blurry or off-topic ones don't.
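Here is a minimal sketch of the shaping step, assuming features for each image in a group have already been extracted (e.g., by a DINO-style encoder) into a NumPy array; the normalization and the `strength` knob are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def diversity_scores(features: np.ndarray) -> np.ndarray:
    """Nearest-neighbor distance of each sample within its group,
    normalized to [0, 1] so it can scale advantages multiplicatively."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # ignore self-distance
    nn_dist = d.min(axis=1)
    return nn_dist / (nn_dist.max() + 1e-8)

def shape_advantages(advantages: np.ndarray, features: np.ndarray,
                     strength: float = 1.0) -> np.ndarray:
    """Boost only positive advantages by how unique each sample is;
    negative advantages are left untouched so bad samples stay bad."""
    scale = 1.0 + strength * diversity_scores(features)
    return np.where(advantages > 0, advantages * scale, advantages)

# Toy usage: 4 samples with 2-D features; three near-duplicates, one distinct.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [0.0, 0.1]])
adv = np.array([0.5, 0.5, 0.5, -0.3])
print(shape_advantages(adv, feats))   # the distinct sample (row 2) gets the largest boost
```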
Why It Works (Intuition)
- Only the suspicious cases get reined in, so learning stays fast elsewhere.
- The anchor moves forward, so regularization doesn't freeze progress.
- Diversity scaling is multiplicative and only for positive advantage, so it never flips bad samples into good ones.
- A small practical tweak, removing the standard deviation in advantage normalization, stops tiny reward noise from blowing up into big, misleading updates.
Building Blocks
- Uncertainty from ensemble disagreement via batch win-rates
- Gated KL: penalize only the high-uncertainty tail
- Adaptive reference resets by divergence threshold or step budget
- Diversity-aware advantage shaping with feature-space nearest-neighbor distance
- Stable advantage normalization without fragile standard deviations
03 Methodology
High-Level Pipeline Input (prompts) → Generate groups of images → Score with proxy reward → Compute advantages → Estimate uncertainty with auxiliary rewards → Apply diversity-aware advantage shaping → Apply gated KL penalty to uncertain samples → Update policy → Periodically update reference → Output: fine-tuned model.
Step-by-Step (What, Why, Example)
- Grouped Generation
- What: For each prompt, generate a group of images with the current policy using the same initial noise.
- Why: Grouping lets us fairly compare images for the same instruction and compute stable advantages and diversity.
- Example: For "A storefront with 'GARDO' written on it," make 24 images in one group.
- Proxy Reward and Advantages
- What: Use a proxy reward (e.g., OCR accuracy or GenEval components) to score images and compute per-sample advantages within the group.
- Why: Advantages measure how much better or worse a sample is than its group peers, guiding learning direction.
- Example: In an OCR task, an image spelling "GARDO" correctly gets a higher advantage than one with typos.
- Practical Stabilizer: Remove Std in Advantage Normalization
- What: Omit dividing by the tiny within-group standard deviation when rewards are nearly identical.
- Why: Prevents tiny, noisy differences from exploding into huge, misleading updates that fuel reward hacking.
- Example: If two very similar images get nearly the same score, their update sizes stay modest.
- Uncertainty Estimation via Auxiliary Rewards
- What: Compute a simple "win-rate" for the main proxy reward and compare it to win-rates from two lightweight auxiliary rewards (e.g., Aesthetic and ImageReward). The bigger the gap, the higher the uncertainty.
- Why: Disagreement signals out-of-distribution or suspicious cases where the proxy may be unreliable.
- Example: If OCR says "amazing!" but Aesthetic and ImageReward say "meh," uncertainty is high.
- Gated KL Regularization
- What: Apply KL penalty only to the top-k% most uncertain samples in each batch (k starts around 10% and can adjust using a small windowed heuristic).
- Why: Prevents over-penalizing safe samples, preserving sample efficiency and exploration.
- Example: In a 24-image group, maybe 2–3 uncertain images get a KL leash this round; the rest learn freely.
- Adaptive Reference Updates
- What: Periodically refresh the reference model to the current policy when KL divergence gets large or after a max number of steps.
- Why: Keeps the regularization anchor relevant so it guides rather than handcuffs the learner.
- Example: After 100 steps or if divergence spikes, copy the current policy as the new reference.
- Diversity-Aware Advantage Shaping
- What: Extract semantic features (e.g., with DINOv3), compute each sample's nearest-neighbor distance in feature space as a diversity score, and multiply only positive advantages by this score.
- Why: Rewards high-quality novelty, broadens exploration, and counters mode collapse without promoting bad images.
- Example: In "A lighthouse by the shore," a sharp, distinct style gets an extra boost; a dull lookalike doesn't.
- Policy Update
- What: Optimize the RL objective with clipped ratios (as in GRPO/PPO-style methods), plus the gated KL term on selected samples and the shaped advantages.
- Why: Ensures stable updates, avoids reward hacks in suspicious cases, and encourages variety among strong samples.
- Example: The optimizer nudges parameters so good-and-different images become more likely next time.
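Putting this last step together, below is a minimal sketch of a GRPO/PPO-style surrogate loss with the KL term applied only to gated samples. It assumes per-sample log-probabilities from the current, old, and reference policies are already available; the names, clip range, KL estimator, and coefficient are illustrative, not the paper's exact objective.

```python
import torch

def gardo_policy_loss(logp_new, logp_old, logp_ref, advantages, kl_mask,
                      clip_eps=0.2, kl_coef=0.1):
    """Clipped surrogate loss with a KL penalty applied only to gated samples."""
    ratio = torch.exp(logp_new - logp_old)                       # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = -torch.minimum(ratio * advantages, clipped * advantages)
    kl_term = kl_coef * (logp_new - logp_ref) * kl_mask.float()  # crude per-sample KL estimate, gated
    return (surrogate + kl_term).mean()

# Toy usage: a 6-sample group where only sample 0 was flagged as uncertain.
n = 6
logp_new = torch.randn(n, requires_grad=True)
logp_old, logp_ref = torch.randn(n), torch.randn(n)
advantages = torch.randn(n)
kl_mask = torch.zeros(n, dtype=torch.bool)
kl_mask[0] = True
loss = gardo_policy_loss(logp_new, logp_old, logp_ref, advantages, kl_mask)
loss.backward()
```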
The Secret Sauce
- Gating: Only the risky tail gets the KL penalty; most samples keep full learning speed.
- Adaptivity: The reference isn't frozen; it marches with the model so regularization stays meaningful.
- Diversity: A multiplicative, positive-only boost keeps creativity alive without incentivizing garbage.
- Stability: Dropping std normalization prevents small noise from becoming big trouble.
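As a small illustration of that stability point, the sketch below contrasts standard mean-and-std group normalization with the mean-only version; the helper name and epsilon are ours, not the paper's.

```python
import numpy as np

def group_advantages(rewards: np.ndarray, use_std: bool = False) -> np.ndarray:
    """Within-group advantages: subtract the mean, optionally divide by the std."""
    centered = rewards - rewards.mean()
    if use_std:
        # Standard normalization: near-identical rewards make the std tiny and
        # blow small differences up into large, misleading updates.
        return centered / (rewards.std() + 1e-8)
    return centered   # mean-only version keeps nearly-tied samples modest

rewards = np.array([0.90, 0.91, 0.90, 0.89])     # almost identical scores
print(group_advantages(rewards, use_std=True))   # large-magnitude advantages
print(group_advantages(rewards, use_std=False))  # small, well-behaved advantages
```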
Concrete Walkthrough (OCR as Proxy)
- Prompt: "A storefront with 'GARDO' written on it."
- Generate 24 images → Score OCR → Compute advantages (without fragile std division).
- Auxiliary rewards disagree for some noisy, over-sharpened images → those become high-uncertainty.
- Apply KL only to those few; others get no KL.
- If policy drift grows, refresh the reference.
- Compute diversity in feature space → multiply positive advantages by diversity.
- Update policy → Next round, images keep getting clearer, more readable, and also more varied, without neon-noise hacks.
Concrete Walkthrough (GenEval as Proxy)
- Prompts test object counts, colors, and relations.
- Some samples try to "cheat" the metric (e.g., overfitting patterns). Auxiliary rewards flag them as uncertain.
- KL reins them in; the rest learn freely.
- Diversity shaping helps discover multiple valid ways to show "a red dog" or "three suitcases" without collapsing to one style.
- Over time, both proxy and unseen metrics improve.
04 Experiments & Results
The Tests (What and Why)
- Proxy tasks: OCR (text rendering accuracy) and GenEval (compositional alignment like counting, colors, and relations). These are the scores the model trains on.
- Unseen metrics: Aesthetic, PickScore, ImageReward, ClipScore, HPSv3, and a feature-based Diversity score. These check if the model truly improves rather than just gaming the proxy.
The Competition (Baselines)
- KL-free RL (fast but hack-prone).
- RL with universal KL (safer but slow, sometimes blocks exploration).
- GARDO variants: with/without diversity shaping, and with the simple no-std normalization tweak.
- Other RL algorithms and models: Flow-GRPO and DiffusionNFT; SD3.5-M and Flux.1-dev.
The Scoreboard (With Context)
- Efficiency vs. Safety: KL-free baselines get high proxy scores quickly but crash on unseen metrics and diversity, like acing a practice quiz by memorizing tricks but failing the real exam.
- Universal KL avoids the worst hacks but slows training a lot, like wearing too-tight training wheels forever.
- GARDO matches the speed of KL-free training on proxies while keeping or improving unseen metrics: a rare win-win.
- Diversity jumps notably (e.g., on GenEval from about 20 to 25), which signals broader exploration and healthier coverage of different valid outputs.
- In DiffusionNFT, GARDO hits about 0.95 on GenEval in 400 steps, outperforming KL-regularized baselines at the same budget while improving unseen scores and diversity.
- Counting emergence: Trained on 1–9 objects, GARDO generalizes better to 10–11 objects than baselines, showing it's not just memorizing; it's learning to count.
Surprising/Notable Findings
- Only about the top 10% most uncertain samples need KL to stop reward hacking: tiny gate, big effect.
- Removing std from advantage normalization reduces over-amplifying tiny reward differences, stabilizing training without special preference models.
- A multi-reward baseline (e.g., mixing OCR with Aesthetic and ImageReward) had worse sample efficiency on the primary proxy than GARDO, echoing classic multi-objective RL difficulties.
- GARDO sometimes even surpasses the original reference model on unseen metrics, indicating true quality gains, not just proxy gains.
Takeaway Translation
- GARDO is like a smart coach: let most plays run fast, blow the whistle only on risky ones, update the playbook as the team improves, and reward high-quality creativity. The result: better scores without cheating the system.
05 Discussion & Limitations
Limitations
- Dependence on auxiliary rewards: Estimating uncertainty needs extra, lightweight reward models. This is cheap for images but could be heavier for video or very large models.
- Thresholds and windows: While simple, the gate percentage and reset triggers are heuristics that may need minor tuning across setups.
- Feature extractor choice: Diversity depends on a feature model (e.g., DINOv3); different backbones may shift the diversity signal.
Required Resources
- A base diffusion/flow model (e.g., SD3.5-M, Flux.1-dev), an RL loop (e.g., GRPO or DiffusionNFT), and 1–2 auxiliary reward models (Aesthetic, ImageReward) for uncertainty.
- Compute similar to standard RL finetuning; the uncertainty checks are light compared to generation itself.
When NOT to Use
- If your reward is verifiable and robust (e.g., exact math solutions), universal KL or even KL-free might suffice.
- If you can't afford any auxiliary scoring (even lightweight), uncertainty gating becomes tricky.
- If your task requires strict imitation of the reference model (no exploration), adaptive updates may be unnecessary.
Open Questions
- Video generation: Can the same uncertainty gating scale to long, temporally coherent outputs?
- Reward model ensembles: What's the best small set for strong uncertainty signals across domains?
- Diversity shaping: Are there context-aware diversity measures (e.g., object-aware) that help even more?
- Dynamic gating: Could the gate learn its own policy for when and how much to regularize?
- Human-in-the-loop: How can occasional human checks calibrate or validate uncertainty estimates?
06 Conclusion & Future Work
Three-Sentence Summary GARDO fine-tunes diffusion models with a smart trio: gate KL only for uncertain samples, adapt the reference model as learning progresses, and multiply positive advantages by diversity to keep exploration healthy. This combination stops reward hacking, preserves sample efficiency, and boosts diversity and unseen metric performance. It works across algorithms and base models, delivering reliable gains without gaming the reward.
Main Achievement Demonstrating that selective, uncertainty-driven regularization plus adaptive anchoring and diversity-aware shaping can simultaneously resolve three classic tensions (safety vs. speed, stability vs. exploration, and reward vs. diversity) in RL fine-tuning for image generation.
Future Directions Extend to video and multimodal tasks, refine uncertainty estimation with minimal auxiliary cost, explore learned gating schedules, and investigate richer, prompt-aware diversity signals. Human calibration and principled analyses of uncertainty sources could further strengthen robustness.
Why Remember This GARDO changes the default from "penalize everyone just in case" to "penalize only when it's risky," while keeping the safety net up-to-date and celebrating high-quality creativity. It's a simple idea with big effects: faster learning, fewer hacks, and more interesting, faithful images.
Practical Applications
- Fine-tuning T2I models to render accurate text in images (e.g., posters, packaging) without degrading overall image quality.
- Improving compositional understanding (counts, colors, relations) for product catalogs and educational illustrations.
- Boosting creative variety in content generation while keeping outputs on-spec for marketing and design.
- Safer alignment with noisy preference models in domains where human ratings are limited or inconsistent.
- Reducing mode collapse in style libraries so users get diverse, high-quality options for the same prompt.
- Adapting to evolving brand guidelines by updating the reference anchor during long-running training.
- Stabilizing training when reward differences are tiny by removing fragile std normalization in advantages.
- Enhancing exploration in new data domains (e.g., scientific diagrams) without overfitting proxy metrics.
- Applying uncertainty-aware regularization to other generative tasks that rely on proxy rewards (e.g., audio, 3D).
- Improving counting and spatial reasoning for visual assistants that must follow precise textual instructions.