GARDO: Reinforcing Diffusion Models without Reward Hacking
Key Summary
- GARDO is a new way to fine-tune text-to-image diffusion models with reinforcement learning without getting tricked by bad reward signals.
- Instead of punishing every sample with a KL penalty, GARDO only penalizes the small fraction of images whose rewards seem untrustworthy (high uncertainty).
- GARDO keeps the "reference" model fresh by periodically updating it to the current policy, so regularization remains helpful instead of holding learning back.
- To stop mode collapse and keep creativity, GARDO boosts the learning signal for images that are both good and different (diversity-aware advantage shaping).
- Across tasks like OCR and GenEval, GARDO matches or exceeds the reward scores of unregularized baselines while avoiding reward hacking.
- On unseen metrics like Aesthetic, PickScore, ImageReward, ClipScore, and HPSv3, GARDO stays strong, showing it isn't just gaming the proxy reward.
- GARDO improves diversity substantially (e.g., from about 20 to 25 on GenEval) without slowing training, which helps exploration and image variety.
- A simple extra trick, removing the standard deviation from advantage normalization, reduces over-amplifying tiny reward differences and further calms reward hacking.
- GARDO works with different RL algorithms (like GRPO and DiffusionNFT) and base models (like SD3.5 and Flux.1-dev), showing it's a general framework.
- The main limitation is relying on auxiliary reward models to estimate uncertainty, which may be costlier for very large or video models.
Why This Research Matters
GARDO helps image generators improve without being tricked by imperfect reward signals, which is crucial because most practical rewards in vision are proxies. By penalizing only risky cases and keeping the regularization anchor up-to-date, it preserves fast learning and real quality. Its diversity-aware shaping avoids boring, repetitive outputs and supports creative, correct images. This combination means better alignment with human preferences even when exact ground-truth rewards are unavailable. It also generalizes across RL algorithms and model families, making it broadly useful. Finally, it provides a practical recipe that teams can adopt today without huge changes to their training pipelines.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) Imagine you're playing a drawing game where a judge gives you points. But the judge is a bit quirky: sometimes they love neon colors so much that you can get a high score for a messy, too-bright picture. You could crank up neon and win the game, even if your friend thinks the picture looks worse.
🥬 Filling (The Actual Concept)
- What it is: Fine-tuning diffusion models with reinforcement learning (RL) tries to make AI-generated pictures match what people want by rewarding good images.
- How it works: The model generates images from text; a reward function (the "judge") scores them; the model learns to make higher-scoring images next time.
- Why it matters: If the judge is imperfect (a proxy reward), the model can learn to "game" the judge instead of making truly better pictures, a problem called reward hacking.
🍞 Bottom Bread (Anchor) If the proxy is OCR accuracy, the model might plaster text everywhere to score high, even if the results look ugly: great for points, bad for people.
🍞 Top Bread (Hook) You know how when class rules are unclear, clever kids might find loopholes to get extra points without really learning? That's like reward hacking.
🥬 The Concept: Reward Hacking
- What it is: When a model gets higher proxy scores by exploiting flaws in the reward, while real image quality or alignment gets worse.
- How it works: The reward is trained on limited data or cares about only a few attributes; the model learns shortcuts (like extreme saturation or noisy textures) that fool the scorer.
- Why it matters: Scores go up, but users get worse images and less variety (mode collapse).
🍞 Anchor Optimizing for OCR might create blurry, artifact-heavy images that still spell the word correctly. Points up, quality down.
🍞 Top Bread (Hook) Imagine a teacher tells you, "Try new ideas, but don't stray too far." That's useful at first, but if you keep getting better, that warning might start holding you back.
🥬 The Concept: KL Regularization
- What it is: A safety belt that keeps the new model close to a reference model so it doesn't change too wildly.
- How it works: During learning, add a penalty if the new policy gets too different from the reference policy.
- Why it matters: It prevents wild swings and hacking, but if applied everywhere all the time, it slows learning and blocks discovering new, better ideas.
🍞 Anchor If the reference model can't write fancy cursive, strong KL can stop the new model from learning it, even if cursive would score high with people.
🍞 Top Bread (Hook) You know how you only double-check homework problems that feel suspicious? You don't re-check every single one.
🥬 The Concept: Reward Uncertainty
- What it is: A signal of how trustworthy the reward score is for a sample.
- How it works: Use extra, lightweight reward models to see if they agree. Big disagreement means the main score may be unreliable for that image.
- Why it matters: If you only apply penalties to uncertain cases, you avoid slowing down learning on all the confident, normal cases.
🍞 Anchor If one judge says the poster is amazing but two other judges strongly disagree, that poster is "uncertain", so it gets special care (extra regularization).
🍞 Top Bread (Hook) Think of a class where everyone starts turning in nearly the same essay. That's safe, but boring, and you miss great, different ideas.
🥬 The Concept: Diversity (and Mode Collapse)
- What it is: Diversity means producing many different, valid images; mode collapse is when the model sticks to a narrow set of outputs.
- How it works: RL can be mode-seeking, chasing one high-reward trick and ignoring other good options.
- Why it matters: Low diversity hurts creativity, exploration, and the ability to cover all the ways a prompt can be fulfilled.
🍞 Anchor For "a red dog," a healthy model gives many red dogs in different styles; a collapsed one keeps making the same red cartoon dog.
🍞 Top Bread (Hook) Imagine you practice not just doing good work, but doing good work that's also different from your classmates'; that keeps class lively and helps you learn more.
🥬 The Concept: Diversity-Aware Advantage Shaping
- What it is: A way to boost the learning signal (advantage) for images that are both high-quality and different from their neighbors.
- How it works: Turn images into features (using a vision model), measure how unique each one is, and multiply the positive advantage by that diversity score.
- Why it matters: It rewards creative, high-quality images without encouraging weird, low-quality outliers.
🍞 Anchor In a batch of 24 pictures for "a lighthouse by the shore," the sharp, beautiful picture that looks different from others gets extra learning credit.
02 Core Idea
🍞 Top Bread (Hook) You know how coaches don't blow the whistle on every play, only the risky ones? That keeps the game flowing while stopping fouls.
🥬 The Concept: GARDO's Aha!
- What it is: GARDO is a framework that applies regularization only when needed (gated), keeps the reference fresh (adaptive), and boosts variety safely (diversity-aware).
- How it works: 1) Find uncertain samples and apply KL only to them; 2) Periodically update the reference model to today's skills; 3) Multiply the learning signal for samples that are both good and different.
- Why it matters: You avoid reward hacking without slowing learning or crushing exploration.
š Anchor Instead of scolding every student for every step, the teacher only steps in on shaky answers, updates the rubric as the class improves, and gives a gold star to excellent, unique solutions.
Multiple Analogies
- Traffic analogy: Most cars (samples) flow freely; only cars swerving (uncertain) get guided by cones (KL). The map (reference) gets updated as the city changes, and we encourage drivers who find safe, new routes (diversity bonus on positive advantage).
- Cooking analogy: Don't salt every dish the same (universal KL). Taste the ones that seem off (uncertain) and fix them. Update the recipe book after each round (adaptive reference), and give extra points to tasty dishes that aren't all the same flavor (diversity-aware shaping).
- Classroom analogy: Grade normally for solid work; only re-check suspicious answers. Refresh the answer key as the class learns. Praise top-notch, original essays a bit more.
Before vs After
- Before: Universal KL slowed learning and trapped the model near a suboptimal reference; KL-free training explored fast but hacked rewards and collapsed.
- After: GARDO keeps fast learning on safe samples, stops hacking on risky ones, updates the anchor to allow progress, and preserves variety.
🍞 Top Bread (Hook) Think about a treasure map that sometimes lies: cross-checking with two other maps helps you know when to be careful.
🥬 The Concept: Uncertainty-Gated KL
- What it is: Apply KL only to the top small fraction (around 10%) of images whose proxy score looks suspicious.
- How it works: Compare the main reward's "win-rate" to win-rates from lightweight auxiliary reward models. Big gaps mean high uncertainty. Gate KL to those cases (see the sketch below).
- Why it matters: Most images get to learn fast; only risky ones are kept near the reference to avoid hacks.
🍞 Anchor If one judge loves an image but two others don't, that image gets a leash (KL) this round; others run freely.
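To make the gating concrete, here is a minimal sketch, assuming group-level win-rates over one batch of rewards and a fixed 10% gate. The function names (`win_rates`, `uncertainty`, `kl_gate_mask`) and the exact disagreement measure are illustrative, not taken from the paper's code.

```python
import numpy as np

def win_rates(rewards: np.ndarray) -> np.ndarray:
    """Fraction of the other samples in the group that each sample beats."""
    n = len(rewards)
    wins = (rewards[:, None] > rewards[None, :]).sum(axis=1)
    return wins / max(n - 1, 1)

def uncertainty(main_rewards, aux_rewards_list):
    """Average absolute gap between the main reward's win-rates and each
    auxiliary reward's win-rates; a large gap flags a suspicious sample."""
    main_wr = win_rates(main_rewards)
    gaps = [np.abs(main_wr - win_rates(aux)) for aux in aux_rewards_list]
    return np.mean(gaps, axis=0)

def kl_gate_mask(u: np.ndarray, top_frac: float = 0.10) -> np.ndarray:
    """Boolean mask selecting roughly the top-`top_frac` most uncertain samples."""
    k = max(1, int(round(top_frac * len(u))))
    threshold = np.partition(u, -k)[-k]
    return u >= threshold

# Toy usage: an OCR-style proxy vs. two lightweight auxiliary scores for one group.
ocr = np.array([0.9, 0.2, 0.8, 0.95, 0.1, 0.7])
aes = np.array([0.5, 0.3, 0.6, 0.10, 0.2, 0.65])   # aesthetic-style scores
imr = np.array([0.6, 0.25, 0.7, 0.15, 0.1, 0.60])  # preference-style scores
u = uncertainty(ocr, [aes, imr])
gate = kl_gate_mask(u)   # only these samples receive the KL penalty this round
print(u.round(2), gate)
```

In a full training loop, the mask simply decides which samples carry the KL term in the loss; everything else trains without it.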
🍞 Top Bread (Hook) If you always compare yourself to your old self from months ago, you'll feel stuck, even if you've outgrown those limits.
🥬 The Concept: Adaptive Reference
- What it is: Periodically reset the reference model to the current policy.
- How it works: If divergence grows too big or after a fixed number of steps, copy the new skills into the reference.
- Why it matters: Regularization stays relevant, so it guides, not blocks, exploration.
🍞 Anchor A runner updates their personal best time after a few races; future training compares to the new best, not an outdated record.
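A minimal sketch of this reset rule, assuming a PyTorch-style policy with `state_dict`/`load_state_dict`; the threshold and step-budget values are placeholders, not the paper's settings.

```python
import copy

class AdaptiveReference:
    """Keeps a frozen copy of the policy and refreshes it when needed."""

    def __init__(self, policy, kl_threshold=0.5, max_steps=100):
        self.reference = copy.deepcopy(policy)   # frozen anchor used for the KL term
        self.kl_threshold = kl_threshold         # divergence trigger (placeholder value)
        self.max_steps = max_steps               # step-budget trigger (placeholder value)
        self.steps_since_reset = 0

    def maybe_reset(self, policy, estimated_kl):
        """Copy the current policy into the reference when it has drifted too far
        from the anchor, or when the step budget since the last reset runs out."""
        self.steps_since_reset += 1
        if estimated_kl > self.kl_threshold or self.steps_since_reset >= self.max_steps:
            self.reference.load_state_dict(policy.state_dict())
            self.steps_since_reset = 0
            return True   # future KL is now measured against the new anchor
        return False
```

Either trigger alone would work; combining them keeps the anchor from falling too far behind while bounding how often the copy happens.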
🍞 Top Bread (Hook) Imagine a science fair where only high-quality but different projects get extra credit: that keeps creativity alive without rewarding messy work.
🥬 The Concept: Diversity-Aware Advantage Shaping
- What it is: Multiply the positive advantage by a diversity score based on feature-space distance to neighbors.
- How it works: Extract features (e.g., with DINOv3), compute how unique each image is in its group, and scale up only positive advantages.
- Why it matters: Encourages exploring new, valid modes and prevents mode collapse, without promoting low-quality oddities.
🍞 Anchor Among 24 images for "a blue bird and a brown bear," the crisp, unique one gets boosted learning; blurry or off-topic ones don't.
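Here is a minimal sketch of the shaping step, assuming features for each image in a group have already been extracted (e.g., by a DINO-style encoder) into a NumPy array; the normalization and the `strength` knob are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def diversity_scores(features: np.ndarray) -> np.ndarray:
    """Nearest-neighbor distance of each sample within its group,
    normalized to [0, 1] so it can scale advantages multiplicatively."""
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # ignore self-distance
    nn_dist = d.min(axis=1)
    return nn_dist / (nn_dist.max() + 1e-8)

def shape_advantages(advantages: np.ndarray, features: np.ndarray,
                     strength: float = 1.0) -> np.ndarray:
    """Boost only positive advantages by how unique each sample is;
    negative advantages are left untouched so bad samples stay bad."""
    scale = 1.0 + strength * diversity_scores(features)
    return np.where(advantages > 0, advantages * scale, advantages)

# Toy usage: 4 samples with 2-D features; three near-duplicates, one distinct.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [0.0, 0.1]])
adv = np.array([0.5, 0.5, 0.5, -0.3])
print(shape_advantages(adv, feats))   # the distinct sample (row 2) gets the largest boost
```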
Why It Works (Intuition)
- Only the suspicious cases get reined in, so learning stays fast elsewhere.
- The anchor moves forward, so regularization doesn't freeze progress.
- Diversity scaling is multiplicative and only for positive advantage, so it never flips bad samples into good ones.
- A small practical tweak, removing the standard deviation in advantage normalization, stops tiny reward noise from blowing up into big, misleading updates.
Building Blocks
- Uncertainty from ensemble disagreement via batch win-rates
- Gated KL: penalize only the high-uncertainty tail
- Adaptive reference resets by divergence threshold or step budget
- Diversity-aware advantage shaping with feature-space nearest-neighbor distance
- Stable advantage normalization without fragile standard deviations
03 Methodology
High-Level Pipeline Input (prompts) → Generate groups of images → Score with proxy reward → Compute advantages → Estimate uncertainty with auxiliary rewards → Apply diversity-aware advantage shaping → Apply gated KL penalty to uncertain samples → Update policy → Periodically update reference → Output: fine-tuned model.
Step-by-Step (What, Why, Example)
- Grouped Generation
- What: For each prompt, generate a group of images with the current policy using the same initial noise.
- Why: Grouping lets us fairly compare images for the same instruction and compute stable advantages and diversity.
- Example: For "A storefront with 'GARDO' written on it," make 24 images in one group.
- Proxy Reward and Advantages
- What: Use a proxy reward (e.g., OCR accuracy or GenEval components) to score images and compute per-sample advantages within the group.
- Why: Advantages measure how much better or worse a sample is than its group peers, guiding learning direction.
- Example: In an OCR task, an image spelling "GARDO" correctly gets a higher advantage than one with typos.
- Practical Stabilizer: Remove Std in Advantage Normalization
- What: Omit dividing by the tiny within-group standard deviation when rewards are nearly identical.
- Why: Prevents tiny, noisy differences from exploding into huge, misleading updates that fuel reward hacking.
- Example: If two very similar images get nearly the same score, their update sizes stay modest.
- Uncertainty Estimation via Auxiliary Rewards
- What: Compute a simple "win-rate" for the main proxy reward and compare it to win-rates from two lightweight auxiliary rewards (e.g., Aesthetic and ImageReward). The bigger the gap, the higher the uncertainty.
- Why: Disagreement signals out-of-distribution or suspicious cases where the proxy may be unreliable.
- Example: If OCR says "amazing!" but Aesthetic and ImageReward say "meh," uncertainty is high.
- Gated KL Regularization
- What: Apply KL penalty only to the top-k% most uncertain samples in each batch (k starts around 10% and can adjust using a small windowed heuristic).
- Why: Prevents over-penalizing safe samples, preserving sample efficiency and exploration.
- Example: In a 24-image group, maybe 2–3 uncertain images get a KL leash this round; the rest learn freely.
- Adaptive Reference Updates
- What: Periodically refresh the reference model to the current policy when KL divergence gets large or after a max number of steps.
- Why: Keeps the regularization anchor relevant so it guides rather than handcuffs the learner.
- Example: After 100 steps or if divergence spikes, copy the current policy as the new reference.
- Diversity-Aware Advantage Shaping
- What: Extract semantic features (e.g., with DINOv3), compute each sample's nearest-neighbor distance in feature space as a diversity score, and multiply only positive advantages by this score.
- Why: Rewards high-quality novelty, broadens exploration, and counters mode collapse without promoting bad images.
- Example: In "A lighthouse by the shore," a sharp, distinct style gets an extra boost; a dull lookalike doesn't.
- Policy Update
- What: Optimize the RL objective with clipped ratios (as in GRPO/PPO-style methods), plus the gated KL term on selected samples and the shaped advantages.
- Why: Ensures stable updates, avoids reward hacks in suspicious cases, and encourages variety among strong samples.
- Example: The optimizer nudges parameters so good-and-different images become more likely next time.
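Putting this last step together, below is a minimal sketch of a GRPO/PPO-style surrogate loss with the KL term applied only to gated samples. It assumes per-sample log-probabilities from the current, old, and reference policies are already available; the names, clip range, KL estimator, and coefficient are illustrative, not the paper's exact objective.

```python
import torch

def gardo_policy_loss(logp_new, logp_old, logp_ref, advantages, kl_mask,
                      clip_eps=0.2, kl_coef=0.1):
    """Clipped surrogate loss with a KL penalty applied only to gated samples."""
    ratio = torch.exp(logp_new - logp_old)                       # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = -torch.minimum(ratio * advantages, clipped * advantages)
    kl_term = kl_coef * (logp_new - logp_ref) * kl_mask.float()  # crude per-sample KL estimate, gated
    return (surrogate + kl_term).mean()

# Toy usage: a 6-sample group where only sample 0 was flagged as uncertain.
n = 6
logp_new = torch.randn(n, requires_grad=True)
logp_old, logp_ref = torch.randn(n), torch.randn(n)
advantages = torch.randn(n)
kl_mask = torch.zeros(n, dtype=torch.bool)
kl_mask[0] = True
loss = gardo_policy_loss(logp_new, logp_old, logp_ref, advantages, kl_mask)
loss.backward()
```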
The Secret Sauce
- Gating: Only the risky tail gets the KL penalty; most samples keep full learning speed.
- Adaptivity: The reference isn't frozen; it marches with the model so regularization stays meaningful.
- Diversity: A multiplicative, positive-only boost keeps creativity alive without incentivizing garbage.
- Stability: Dropping std normalization prevents small noise from becoming big trouble.
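As a small illustration of that stability point, the sketch below contrasts standard mean-and-std group normalization with the mean-only version; the helper name and epsilon are ours, not the paper's.

```python
import numpy as np

def group_advantages(rewards: np.ndarray, use_std: bool = False) -> np.ndarray:
    """Within-group advantages: subtract the mean, optionally divide by the std."""
    centered = rewards - rewards.mean()
    if use_std:
        # Standard normalization: near-identical rewards make the std tiny and
        # blow small differences up into large, misleading updates.
        return centered / (rewards.std() + 1e-8)
    return centered   # mean-only version keeps nearly-tied samples modest

rewards = np.array([0.90, 0.91, 0.90, 0.89])     # almost identical scores
print(group_advantages(rewards, use_std=True))   # large-magnitude advantages
print(group_advantages(rewards, use_std=False))  # small, well-behaved advantages
```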
Concrete Walkthrough (OCR as Proxy)
- Prompt: "A storefront with 'GARDO' written on it."
- Generate 24 images → Score OCR → Compute advantages (without fragile std division).
- Auxiliary rewards disagree for some noisy, over-sharpened images → those become high-uncertainty.
- Apply KL only to those few; others get no KL.
- If policy drift grows, refresh the reference.
- Compute diversity in feature space → multiply positive advantages by diversity.
- Update policy → Next round, images keep getting clearer, more readable, and also more varied, without neon-noise hacks.
Concrete Walkthrough (GenEval as Proxy)
- Prompts test object counts, colors, and relations.
- Some samples try to "cheat" the metric (e.g., overfitting patterns). Auxiliary rewards flag them as uncertain.
- KL reins them in; the rest learn freely.
- Diversity shaping helps discover multiple valid ways to show "a red dog" or "three suitcases" without collapsing to one style.
- Over time, both proxy and unseen metrics improve.
04 Experiments & Results
The Tests (What and Why)
- Proxy tasks: OCR (text rendering accuracy) and GenEval (compositional alignment like counting, colors, and relations). These are the scores the model trains on.
- Unseen metrics: Aesthetic, PickScore, ImageReward, ClipScore, HPSv3, and a feature-based Diversity score. These check if the model truly improves rather than just gaming the proxy.
The Competition (Baselines)
- KL-free RL (fast but hack-prone).
- RL with universal KL (safer but slow, sometimes blocks exploration).
- GARDO variants: with/without diversity shaping, and with the simple no-std normalization tweak.
- Other RL algorithms and models: Flow-GRPO and DiffusionNFT; SD3.5-M and Flux.1-dev.
The Scoreboard (With Context)
- Efficiency vs. Safety: KL-free baselines get high proxy scores quickly but crash on unseen metrics and diversity, like acing a practice quiz by memorizing tricks but failing the real exam.
- Universal KL avoids the worst hacks but slows training a lot, like wearing too-tight training wheels forever.
- GARDO matches the speed of KL-free training on proxies while keeping or improving unseen metrics: a rare win-win.
- Diversity jumps notably (e.g., on GenEval from about 20 to 25), which signals broader exploration and healthier coverage of different valid outputs.
- In DiffusionNFT, GARDO hits about 0.95 on GenEval in 400 steps, outperforming KL-regularized baselines at the same budget while improving unseen scores and diversity.
- Counting emergence: Trained on 1–9 objects, GARDO generalizes better to 10–11 objects than baselines, showing it's not just memorizing; it's learning to count.
Surprising/Notable Findings
- Only about the top 10% most uncertain samples need KL to stop reward hacking: tiny gate, big effect.
- Removing std from advantage normalization reduces over-amplifying tiny reward differences, stabilizing training without special preference models.
- A multi-reward baseline (e.g., mixing OCR with Aesthetic and ImageReward) had worse sample efficiency on the primary proxy than GARDO, echoing classic multi-objective RL difficulties.
- GARDO sometimes even surpasses the original reference model on unseen metrics, indicating true quality gains, not just proxy gains.
Takeaway Translation
- GARDO is like a smart coach: let most plays run fast, blow the whistle only on risky ones, update the playbook as the team improves, and reward high-quality creativity. The result: better scores without cheating the system.
05 Discussion & Limitations
Limitations
- Dependence on auxiliary rewards: Estimating uncertainty needs extra, lightweight reward models. This is cheap for images but could be heavier for video or very large models.
- Thresholds and windows: While simple, the gate percentage and reset triggers are heuristics that may need minor tuning across setups.
- Feature extractor choice: Diversity depends on a feature model (e.g., DINOv3); different backbones may shift the diversity signal.
Required Resources
- A base diffusion/flow model (e.g., SD3.5-M, Flux.1-dev), an RL loop (e.g., GRPO or DiffusionNFT), and 1–2 auxiliary reward models (Aesthetic, ImageReward) for uncertainty.
- Compute similar to standard RL finetuning; the uncertainty checks are light compared to generation itself.
When NOT to Use
- If your reward is verifiable and robust (e.g., exact math solutions), universal KL or even KL-free might suffice.
- If you can't afford any auxiliary scoring (even lightweight), uncertainty gating becomes tricky.
- If your task requires strict imitation of the reference model (no exploration), adaptive updates may be unnecessary.
Open Questions
- Video generation: Can the same uncertainty gating scale to long, temporally coherent outputs?
- Reward model ensembles: What's the best small set for strong uncertainty signals across domains?
- Diversity shaping: Are there context-aware diversity measures (e.g., object-aware) that help even more?
- Dynamic gating: Could the gate learn its own policy for when and how much to regularize?
- Human-in-the-loop: How can occasional human checks calibrate or validate uncertainty estimates?
06 Conclusion & Future Work
Three-Sentence Summary GARDO fine-tunes diffusion models with a smart trio: gate KL only for uncertain samples, adapt the reference model as learning progresses, and multiply positive advantages by diversity to keep exploration healthy. This combination stops reward hacking, preserves sample efficiency, and boosts diversity and unseen metric performance. It works across algorithms and base models, delivering reliable gains without gaming the reward.
Main Achievement Demonstrating that selective, uncertainty-driven regularization plus adaptive anchoring and diversity-aware shaping can simultaneously resolve three classic tensions (safety vs. speed, stability vs. exploration, and reward vs. diversity) in RL fine-tuning for image generation.
Future Directions Extend to video and multimodal tasks, refine uncertainty estimation with minimal auxiliary cost, explore learned gating schedules, and investigate richer, prompt-aware diversity signals. Human calibration and principled analyses of uncertainty sources could further strengthen robustness.
Why Remember This GARDO changes the default from "penalize everyone just in case" to "penalize only when it's risky," while keeping the safety net up-to-date and celebrating high-quality creativity. It's a simple idea with big effects: faster learning, fewer hacks, and more interesting, faithful images.
Practical Applications
- Fine-tuning T2I models to render accurate text in images (e.g., posters, packaging) without degrading overall image quality.
- Improving compositional understanding (counts, colors, relations) for product catalogs and educational illustrations.
- Boosting creative variety in content generation while keeping outputs on-spec for marketing and design.
- Safer alignment with noisy preference models in domains where human ratings are limited or inconsistent.
- Reducing mode collapse in style libraries so users get diverse, high-quality options for the same prompt.
- Adapting to evolving brand guidelines by updating the reference anchor during long-running training.
- Stabilizing training when reward differences are tiny by removing fragile std normalization in advantages.
- Enhancing exploration in new data domains (e.g., scientific diagrams) without overfitting proxy metrics.
- Applying uncertainty-aware regularization to other generative tasks that rely on proxy rewards (e.g., audio, 3D).
- Improving counting and spatial reasoning for visual assistants that must follow precise textual instructions.