RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards
Key Summary
- RealGen is a new way to make computer-made pictures look so real that they can fool expert detectors and even careful judges.
- It uses two helpers: a language model to rewrite short, vague prompts into rich, detailed ones, and a diffusion model to draw the picture.
- A special reward, called a Detector Reward, comes from smart detectors that spot common AI artifacts like plastic-looking skin and weird noise patterns.
- RealGen learns, using reinforcement learning (GRPO), to create images that “escape” both visible-artifact detectors and deep feature-level detectors.
- A new test set, RealBench, scores realism automatically with two methods: Detector-Scoring and Arena-Scoring (pairwise battles judged by an MLLM).
- Across many tests, RealGen beats strong general models (like GPT-Image-1, Qwen-Image) and realism-focused models (like FLUX-Krea).
- The prompt-rewriting LLM already helps a lot, and training the diffusion model with detector rewards boosts photorealism even more.
- Unlike human preference rewards (which can bias toward artsy styles), detector rewards directly reduce AI-looking artifacts.
- On held-out judges (not used during training), RealGen still wins big, showing it learned real realism, not just how to game one detector.
- This matters because people want AI images that look truly natural for uses like product photos, portraits, and design previews.
Why This Research Matters
RealGen pushes AI images closer to true photographs, which helps people trust what they see in products, fashion, homes, and education. Photorealistic images save time and money for mockups and previews without hiring photographers for every shot. By using detectors as teachers, the approach scales better than collecting endless human ratings and avoids bias toward cartoonish or stylized looks. RealBench offers a fair, automatic way to compare realism so progress can be measured quickly and reproducibly. Strong results on held-out judges mean the method generalizes and isn’t just gaming one specific detector. This improves the quality bar for creative tools, advertising, and design while opening the door to more lifelike video and 3D in the future.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you’re drawing a portrait with colored pencils. If you smooth everything too much, the face can look like a plastic doll—shiny forehead, blurry pores, perfect-but-fake skin. People can tell it isn’t real.
🥬 Filling (The Actual Concept): What the world looked like before RealGen
- What it is: Text-to-image (T2I) models turn words into pictures, and modern ones are very good at matching the words but often still look a bit fake.
- How it works (before this paper): Big generative models like diffusion models read your prompt (like “a girl smiling in sunlight”) and then gradually ‘paint’ an image. They can place objects right and use the right colors, but small details like human skin texture, lighting grit, and tiny imperfections often look off.
- Why it matters: Even when the picture matches the prompt, you still see the “AI look”—oily skin, plastic shine, over-smooth textures—so it doesn’t feel like a real photo.
🍞 Bottom Bread (Anchor): Think of a cookie that looks perfect but tastes odd—pretty, but not convincing. That’s how many T2I images felt: correct, but not truly real.
🍞 Top Bread (Hook): You know how teachers grade your essay for content and also for spelling and grammar? Good ideas aren’t enough—you also need the “tiny details” right.
🥬 The Concept: Photorealism and AI Artifacts
- What it is: Photorealism means images look indistinguishable from real photos, down to pores, shadows, noise, and small imperfections. AI artifacts are the telltale signs that give away fakery.
- How it works: People can see visible artifacts (like waxy skin), while machines can also catch hidden, feature-level patterns (weird frequency noise) you don’t notice at first glance.
- Why it matters: Without fixing both visible and hidden artifacts, images still feel fake, even if the content is correct.
🍞 Bottom Bread (Anchor): If a detective can always find the hidden clue that says “this is fake,” then it’s not photorealistic yet.
🍞 Top Bread (Hook): Imagine trying to bake a perfect cake by guessing what customers like—some prefer purple frosting, others love red sprinkles. You might please tastes but still bake a cake with the wrong texture.
🥬 The Concept: Human Preference Rewards (why they fell short)
- What it is: Earlier methods trained models using human preference scores that reward what people “like,” not necessarily what looks real.
- How it works: People pick the prettier image; models learn to chase these choices, which can bias toward bright, stylized, or cartoony looks.
- Why it matters: A plain, real-life snapshot can be highly realistic yet score lower for aesthetics, so the model learns style over realism.
🍞 Bottom Bread (Anchor): It’s like winning a beauty pageant but failing a lie detector. Pretty isn’t always real.
🍞 Top Bread (Hook): Think of a spelling checker for images that says, “This looks fake because of shiny plastic skin.” What if we used that checker as a coach?
🥬 The Concept: Detector-Guided Rewards (what was missing)
- What it is: Use strong fake-image detectors as judges to give a realism score. Higher scores mean fewer AI artifacts.
- How it works: Two kinds of detectors help: a semantic detector (spots visible issues like greasy skin) and a feature detector (spots deep, hidden artifacts in pixel patterns). Their scores guide learning.
- Why it matters: This gives an objective, scalable, human-free way to aim directly at realism, not just style.
🍞 Bottom Bread (Anchor): If the goalie (detector) keeps blocking your shots, learn to shoot where the goalie can’t reach—soon you score like a pro, and your images stop looking fake.
🍞 Top Bread (Hook): You know how a good recipe starts with the right shopping list? Better prompts are like better lists for image models.
🥬 The Concept: Prompt Optimization with an LLM
- What it is: An LLM rewrites short, vague prompts into richer, more specific ones that help the image model add real-world details.
- How it works: The LLM adds natural imperfections (like “faint pores,” “subtle uneven lighting,” “slight motion blur”) and useful context (like “shot on iPhone,” “overcast afternoon”).
- Why it matters: Simple prompts give too little info, so models default to smooth, generic, fake-looking outputs.
🍞 Bottom Bread (Anchor): Saying “draw a dog” is vague; saying “a golden retriever with slightly messy fur in soft evening light” gets the texture and light right.
Putting it all together, RealGen mixes better prompts, a strong diffusion painter, and detector rewards that directly measure real-vs-fake signals. It also brings RealBench, a new way to score realism without needing human votes every time, so we can test progress quickly and fairly. The real stakes: ads, product photos, portraits, design mockups, and education all benefit when AI-made images look truly natural and trustworthy.
02 Core Idea
🍞 Top Bread (Hook): Imagine a soccer team that improves by training against the best defenders. If you can score on them, you’ll score on anyone.
🥬 The Concept: The “Aha!”
- What it is (one sentence): RealGen trains a text-to-image system to make photos so realistic that even strong fake-image detectors have trouble catching them, using the detectors’ feedback as a reward.
- How it works (like a recipe):
- An LLM rewrites your prompt to add realistic details.
- A diffusion model generates images from this richer prompt.
- Two detectors judge the image: a semantic one (visible issues) and a feature one (hidden patterns).
- Their scores become a Detector Reward.
- Using GRPO (a kind of reinforcement learning), the LLM and the diffusion model learn to reduce artifacts and boost realism.
- Why it matters: Without direct realism coaching, models drift toward pretty-but-fake looks; detector rewards push them toward truly photographic results.
🍞 Bottom Bread (Anchor): It’s like practicing piano with a metronome and a tuner—if you can pass both, your performance sounds genuinely professional.
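The recipe above can be sketched as a toy training loop. This is a minimal illustration, not the paper's implementation: the real system uses Qwen3-4B as the rewriter, FLUX.1-dev as the generator, and Forensic-Chat/OmniAID as detectors, while every function below is a made-up stand-in. What it does show faithfully is the GRPO shape: sample a group of rewrites, score each resulting image with a detector reward, and turn the rewards into group-relative advantages.

```python
import random

random.seed(0)

# Toy stand-ins for the real components (LLM rewriter, diffusion generator,
# fake-image detectors). All names and behaviors here are illustrative.

def rewrite_prompt(prompt):
    """LLM rewriter stand-in: appends realism cues to a short prompt."""
    cues = ["faint pores", "soft window light", "slight motion blur"]
    k = random.randrange(1, len(cues) + 1)  # vary how many cues are added
    return prompt + ", " + ", ".join(cues[:k])

def generate_image(rich_prompt):
    """Diffusion stand-in: returns a fake 'image' with a random gloss level."""
    return {"prompt": rich_prompt, "gloss": random.random()}

def detector_reward(image):
    """Detector stand-in: less plastic gloss -> higher realism reward."""
    semantic = 1.0 - image["gloss"]       # visible-artifact judge
    feature = 1.0 - 0.5 * image["gloss"]  # hidden-pattern judge
    return semantic + feature

def grpo_step(prompt, group_size=4):
    """One GRPO-style step: sample a group, score it, compute advantages."""
    samples, rewards = [], []
    for _ in range(group_size):
        y = rewrite_prompt(prompt)
        img = generate_image(y)
        samples.append((y, img))
        rewards.append(detector_reward(img))
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]  # group-relative signal
    return samples, advantages

samples, advantages = grpo_step("a girl smiling in sunlight")
```

In a real run, the advantages would weight policy-gradient updates to the rewriter (Stage 1) or the diffusion model (Stage 2); note that group-relative advantages always sum to zero, so above-average samples are reinforced at the expense of below-average ones.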
Multiple Analogies (three ways to see it)
- Security Check Analogy 🍞 Hook: You know how airports use scanners to catch hidden items? 🥬 Idea: Detectors are scanners for images. If your picture passes both the visible-artifacts scanner and the hidden-patterns scanner, it’s probably real-looking. 🍞 Anchor: A bag that passes all security checks is safe; an image that passes detectors feels real.
- Cooking Analogy 🍞 Hook: Chefs taste food and adjust salt, sweetness, and texture. 🥬 Idea: RealGen tastes images with detectors. If it’s too glossy (fake), reduce the gloss; if the texture is off, adjust noise and detail. 🍞 Anchor: Like correcting a too-sugary sauce, the model fixes oily skin and odd noise until it tastes (looks) right.
- Exam + Study Guide Analogy 🍞 Hook: You use a study guide to prepare for a tough exam. 🥬 Idea: The detectors are the exam. The LLM is your study guide that rewrites your notes (prompts) to cover realistic details you might miss. 🍞 Anchor: If you ace the detector exam, your image likely appears truly photographic.
Before vs After 🍞 Hook: Think of a camera with a smudged lens (before) vs a clean lens (after). 🥬 Idea: Before—models matched prompts but looked plasticky. After—models keep the content but gain real textures, true lighting, and believable imperfections. 🍞 Anchor: The same portrait gains skin pores, subtle shadows, and natural reflections—suddenly it “clicks” as real.
Why It Works (intuition, not equations) 🍞 Hook: If you train with the toughest coach, you get stronger fast. 🥬 Idea: Detectors encode patterns that separate real from fake. Optimizing to beat both visible and hidden tests teaches the generator to mimic nature’s messiness—fine-grained textures, realistic noise, balanced lighting, and small flaws. 🍞 Anchor: Like learning accents from native speakers, the model picks up the tiny cues that make a photo feel authentic.
Building Blocks (small pieces)
- 🍞 Hook: You know how you add spice last to get the taste just right? 🥬 LLM Prompt Rewriting: The LLM adds exact details (camera hints, lighting, imperfections) so the painter (diffusion) knows what to paint. 🍞 Anchor: “A face” becomes “a face with faint pores, soft window light, slightly uneven skin.”
- 🍞 Hook: Sketch first, then paint in layers. 🥬 Diffusion Generation: Start from noise, then gradually de-noise into a photo, using the prompt as guidance. 🍞 Anchor: Like sharpening a blurry photo step by step until it’s crisp.
- 🍞 Hook: A referee calls both obvious fouls and sneaky ones. 🥬 Two Detectors: Semantic (visible issues like oily sheen) and feature-level (hidden pixel-frequency traces). 🍞 Anchor: Passing both refs means you’re playing clean—and look real.
- 🍞 Hook: Practice, feedback, improve, repeat. 🥬 GRPO RL Optimization: Turn detector feedback into a reward and update the LLM and diffusion model to do better next time. 🍞 Anchor: Each practice loop makes the next image more photorealistic.
03 Methodology
High-level Overview: Input → LLM Rewriter → Diffusion Generator → Detectors Score (semantic + feature + alignment) → GRPO Update (Stage 1: LLM; Stage 2: Diffusion) → Output
🍞 Top Bread (Hook): Imagine building a LEGO city: first plan the map (LLM), then construct buildings (diffusion), and finally have inspectors check safety (detectors). If inspectors complain, you fix and rebuild smarter next time (RL).
🥬 Step-by-step (what happens, why it exists, example)
Step 0: Ingredients and Setup
- What: Use Qwen3-4B as the LLM, FLUX.1-dev (with LoRA) as the diffusion model. Do a short supervised warm-up (SFT) so each component knows the basic routine.
- Why: Cold-start ensures the LLM can “think-plan-then-generate” prompts and the diffusion model follows prompts reasonably well.
- Example: Before RL, the LLM already adds small details; the diffusion model already draws decent photos but still with AI shine.
Step 1: LLM Prompt Rewriting (Stage 1 RL) 🍞 Hook: You know how a good grocery list makes cooking easier?
- What: The LLM turns a short prompt into a rich, realistic description by adding context (lighting, materials, imperfections, camera hints).
- How: For each user text x, the LLM samples several rewrites y1..yN. The diffusion model generates images I from each y. Detectors score each I, producing rewards.
- Why: Simple prompts have low information. The LLM adds detail so the diffusion model stops guessing and stops defaulting to plastic-like priors.
- Example: “A young girl with light brown hair and green eyes” → “...wearing a pink T-shirt, standing outdoors by a glass window on an overcast afternoon; natural light, faint pores, slight skin sheen, subtle reflections on glass.” 🍞 Anchor: Better shopping list → better dish; better prompt → more realistic image.
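To make the rewriting step concrete, here is a toy rule-based enrichment that mimics what the RL-trained LLM learns to do, using the paper's own example. The category names and template are illustrative assumptions; the actual rewriter is a trained Qwen3-4B policy, not a fixed template.

```python
def enrich_prompt(base):
    """Toy stand-in for the RL-trained prompt rewriter: append context,
    lighting, and imperfection cues that steer the generator away from
    over-smooth, plastic-looking defaults. Cue categories follow the
    paper's worked example; the template itself is hypothetical."""
    context = "standing outdoors by a glass window on an overcast afternoon"
    lighting = "natural light"
    imperfections = ["faint pores", "slight skin sheen",
                     "subtle reflections on glass"]
    return f"{base}, {context}; {lighting}, {', '.join(imperfections)}"

rich = enrich_prompt("A young girl with light brown hair and green eyes")
```

The key design point is that the added cues are small, truthful imperfections (pores, sheen, reflections) rather than style buzzwords, since the detectors reward natural messiness, not prettiness.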
Step 2: Diffusion Image Generation 🍞 Hook: Think of un-blurring a foggy window to see the scene behind it.
- What: Diffusion starts from noise and iteratively de-noises to form an image that matches the rewritten prompt.
- Why: This gradual process is great at fine detail—edges, textures, lighting—if guided well.
- Example: Each step sharpens skin texture, balances highlights, and forms natural reflections. 🍞 Anchor: From snowy TV static to a clear photo, step by step.
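The iterative de-noising idea can be shown with a deliberately simplified loop. A real diffusion sampler uses a trained network to predict the noise at each step; this sketch just shrinks the noise geometrically, which is enough to show the "refine step by step toward a clean image" structure and nothing more.

```python
def denoise(noisy, steps=10):
    """Toy diffusion sampler: each step removes a fixed fraction of the
    remaining noise, pulling the sample toward a clean target (here,
    all zeros). Real samplers replace the fixed 0.5 factor with a
    learned noise prediction conditioned on the prompt."""
    x = list(noisy)
    for _ in range(steps):
        x = [v * 0.5 for v in x]  # pretend the model removes half the noise
    return x

out = denoise([1.0, -2.0, 0.5])  # starts as pure "static"
```

After ten halvings the residual noise is tiny, which mirrors why Step 5 later insists on judging the fully de-noised sample: detectors are trained on clean photos, not on the half-finished intermediates.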
Step 3: Detector Reward (semantic, feature, alignment) 🍞 Hook: Two inspectors check both what you can see and what you can’t.
- What:
- Semantic detector (Forensic-Chat) judges visible artifacts: oily skin, odd blurs, weird hands.
- Feature detector (OmniAID) judges hidden, pixel-level traces: abnormal frequencies and noise.
- Text-image alignment (Long-CLIP) keeps the image faithful to the prompt.
- Why: Fixing only visible flaws may still leave hidden fingerprints. Fixing only hidden traces may let visible flaws slip. Alignment prevents drifting off-prompt to game detectors.
- Example: A portrait with natural pores (semantic pass) and realistic sensor-like noise (feature pass) still matches “pink T-shirt” and “glass reflections” (alignment pass). 🍞 Anchor: Like passing both an eye test and an X-ray, and still answering the right questions.
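The three-part reward can be sketched as a simple fusion function. The equal-weight sum and the [0, 1] score range are illustrative assumptions, not the paper's exact formula; the point is that an image which games one judge while failing the others cannot win.

```python
def combined_reward(semantic, feature, alignment):
    """Fuse the three signals described above: semantic (visible
    artifacts, cf. Forensic-Chat), feature (hidden pixel traces,
    cf. OmniAID), and alignment (prompt fidelity, cf. Long-CLIP).
    Scores are assumed normalized to [0, 1]; the equal-weight sum
    is an illustrative choice."""
    for s in (semantic, feature, alignment):
        assert 0.0 <= s <= 1.0
    return semantic + feature + alignment

# A balanced, on-prompt image beats one that only fools the semantic judge.
good = combined_reward(semantic=0.90, feature=0.85, alignment=0.90)
gamed = combined_reward(semantic=0.95, feature=0.20, alignment=0.40)
```

The alignment term is the guardrail: without it, the generator could drift into arbitrary but detector-friendly scenes that no longer match the user's text.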
Step 4: GRPO Reinforcement Learning for the LLM (Stage 1) 🍞 Hook: Practice with a coach who gives a score after every try.
- What: Convert the detector scores into a normalized advantage signal. Update the LLM policy so future rewrites earn higher rewards (i.e., produce more realistic images).
- Why: Without RL, the LLM won’t systematically learn which words (like “soft window light,” “faint pores,” “shot on iPhone”) increase realism.
- Example: Over time, the LLM learns to include small, truthful imperfections and avoid buzzwords that cause plastic shine. 🍞 Anchor: Each graded draft helps the next essay get better—same for prompts.
Step 5: GRPO Reinforcement Learning for the Diffusion Model (Stage 2) 🍞 Hook: Now the painter, not just the planner, gets coaching.
- What: Freeze the LLM (keep good prompts). Train the diffusion model using Flow-GRPO-style exploration: run full denoising to evaluate realism; explore a short span of steps to improve efficiency and stability.
- Why: Training only on partial/noisy steps can mislead detectors (they’re not trained on noisy intermediates). Evaluating on full clean samples keeps the reward meaningful. Short-span exploration still lets the model learn.
- Example: The model learns to avoid over-smooth cheeks and to add subtle micro-texture without banding or plastic glare. 🍞 Anchor: Don’t grade a cake batter; judge the baked cake, then teach the baker which mixing step to tweak.
Step 6: Score Normalization and Fusion 🍞 Hook: If one judge uses 1–10 and another uses 1–100, you must rescale before averaging.
- What: Normalize each reward dimension within the batch (semantic, feature, alignment), then sum to get a single advantage.
- Why: Balanced feedback prevents the model from gaming only one detector while ignoring others or the prompt.
- Example: A photo that’s aligned but still oily won’t get a high combined score; the model learns to improve both. 🍞 Anchor: Like weighing math and English grades fairly before finalizing the report card.
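Step 6 can be written out directly. This sketch z-scores each reward dimension within the batch before summing, which is one standard way to implement "normalize each reward dimension, then sum"; the exact normalization in the paper may differ.

```python
def normalize(values):
    """Z-score one reward dimension within the batch so dimensions on
    different scales contribute comparably (a judge scoring 1-10 and one
    scoring 1-100 end up on the same footing)."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5 or 1.0
    return [(v - mean) / std for v in values]

def fused_advantages(semantic, feature, alignment):
    """Normalize each dimension separately, then sum per sample to get
    a single advantage used for the GRPO update."""
    dims = [normalize(d) for d in (semantic, feature, alignment)]
    return [sum(vals) for vals in zip(*dims)]

adv = fused_advantages(
    semantic=[0.9, 0.4, 0.6],   # visible-artifact scores for 3 samples
    feature=[0.7, 0.7, 0.2],    # hidden-pattern scores
    alignment=[0.8, 0.9, 0.1],  # prompt-fidelity scores
)
```

Because each dimension is centered before fusion, a sample that is merely average on every axis gets roughly zero advantage; only images that beat their batch-mates across the board are strongly reinforced.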
Secret Sauce (what’s clever) 🍞 Hook: To learn to hide, study the best detectives. 🥬 The twist is using both a visible-artifact detector and a hidden-feature detector as the reward, plus a prompt-fidelity guardrail. Together with two-stage GRPO (first the writer, then the painter), the system converges toward images that feel real to humans and remain robust to unseen judges. 🍞 Anchor: Train against the toughest refs now, so you can play clean in any stadium later.
Concrete Mini-Example (numbers kept simple)
- Input prompt: “A cozy living room.”
- LLM rewrites: “A cozy living room at dusk, warm lamplight, slight shadow falloff, soft fabric grain on the couch, faint lens noise, wooden table with subtle scratches, shot on a phone camera.”
- Diffusion draws image.
- Detectors say: semantic=high (no plastic shine), feature=high (natural noise), alignment=high (matches text).
- RL updates: Keep using “faint lens noise,” “soft fabric grain,” and “subtle scratches” cues; reduce any over-smoothing the detector spotted.
- Next time: Even cozier—and more like a real photo.
04 Experiments & Results
🍞 Top Bread (Hook): Imagine a science fair where robots paint pictures. Instead of people voting for the prettiest, we bring expert detectives and a fair judge to pick which pictures look most like real photos.
🥬 The Test: What was measured and why
- RealBench created: 1,000 real photos with captions across seven categories, many portraits (a hard test for skin and lighting) to focus on photorealism.
- Two evaluation styles:
- Detector-Scoring: How often advanced detectors call the image “real” (higher is better). Includes semantic (Forensic-Chat and a held-out judge, GPT-5) and feature-level (OmniAID and a held-out expert Effort) detectors.
- Arena-Scoring: Pairwise battles—two images, one choice. GPT-5 plays the judge and must pick the more realistic image, including battles versus actual real photos.
- Why: This cuts human-labor costs, reduces bias toward artsy styles, and gives a stable, scalable realism test.
🍞 Bottom Bread (Anchor): It’s like testing a coin with both a magnifying glass (visible flaws) and a metal scanner (hidden patterns), then having a referee choose which coin looks legit in head-to-head matchups.
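The two RealBench metrics reduce to simple aggregations, sketched below. The tie-counts-half convention in the arena scorer is an assumption for illustration (the paper's judge may force a binary choice); the detector score is just the fraction of images a detector labels "real".

```python
def detector_score(verdicts):
    """Detector-Scoring: fraction of generated images that a detector
    (e.g., Forensic-Chat, OmniAID, or a held-out judge) calls 'real'."""
    return sum(verdicts) / len(verdicts)

def arena_win_rate(battles):
    """Arena-Scoring: pairwise battles judged for realism. Counting a
    tie as half a win is an illustrative convention; a model truly
    indistinguishable from real photos converges to ~0.5 against them."""
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(points[b] for b in battles) / len(battles)

score = detector_score([True, True, False, True])
rate = arena_win_rate(["win", "loss", "tie", "win"])
```

Under this reading, RealGen's roughly 50% arena win rate against actual photographs is the best achievable outcome: the judge can do no better than a coin flip.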
The Competition: Who was compared
- Strong general models: GPT-Image-1, Qwen-Image, FLUX-Pro, Nano-Banana, SDXL, SD-3.5-Large, etc.
- Realism-focused baselines: FLUX-Krea, SRPO.
- Our method: RealGen with and without the LLM rewrite component (Ours and Ours*).
Scoreboard Highlights (with context)
- Detector-Scoring: Ours* achieved standout results across detectors, including held-out ones (e.g., higher “real” probabilities with GPT-5 and Effort), indicating generalization beyond the training rewards. Think of getting an A+ from your teacher and also from a substitute you’ve never met.
- Arena-Scoring vs Real Photos: RealGen reached around a 50% win rate, meaning that about half of the time, the judge considered RealGen’s output as realistic as an actual photo—this is like tying the champion in a fair match.
- Arena-Scoring vs Other Models: RealGen scored the highest overall win rate in head-to-head battles, consistently chosen as more realistic across open-source peers.
- HPDv2 “Photo” subset: RealGen not only led detector realism but also stayed competitive on aesthetic preference scores (HPSv3), showing it can be realistic and pleasing.
Surprising/Notable Findings
- LLM prompt optimization alone gives a big jump: Better prompts reduce fake-looking defaults.
- Detector rewards beat human-preference rewards for realism: PickScore/HPSv2.1 tended to push toward stylized looks and still left oily skin; detector rewards targeted the actual artifacts.
- Held-out success: Excelling on unseen judges (GPT-5 and Effort) suggests RealGen learned broad realism, not just how to cheat a particular detector.
🍞 Bottom Bread (Anchor): If you can pass both your school’s exam and a surprise pop quiz from a guest teacher, you really know the material—not just the answer key.
05 Discussion & Limitations
🍞 Top Bread (Hook): Even the best race cars have limits—they’re fast on smooth tracks but may struggle on dirt roads.
🥬 Limitations and Honest Assessment
- Limitations:
- Domain gaps: Some rare scenes or extreme lighting may still show artifacts the detectors didn’t catch during training.
- Prompt dependence: If the input is extremely vague or conflicting, results can drift or look plain; the LLM helps, but can’t invent perfect context for everything.
- Detector shift risk: If future detectors evolve, the reward target moves; ongoing updates may be needed to stay state-of-the-art.
- Compute cost: Two-stage RL (LLM then diffusion) with detectors is GPU-heavy and needs careful engineering to keep training stable.
- Required Resources:
- A capable LLM (e.g., Qwen3-4B), a strong diffusion base model (e.g., FLUX.1-dev with LoRA), multiple GPUs (paper used 8×H200), detector models (Forensic-Chat, OmniAID, plus held-out ones for eval), and curated prompts/data.
- When NOT to Use:
- If you want stylized or cartoon art intentionally—preference rewards might fit better than detector rewards here.
- If you can’t afford RL compute; standard fine-tuned diffusion may suffice for basic tasks.
- If your evaluation must be purely human-aesthetic, not photorealism-targeted.
- Open Questions:
- Can we add video realism with temporal detectors for flicker-free, real-looking clips?
- How to balance aesthetics and realism adaptively per user intent?
- Can we design detector rewards robust to novel camera sensors or post-processing pipelines?
- How to ensure fairness and reduce bias in both detectors and generated portraits across diverse populations?
🍞 Bottom Bread (Anchor): Like tuning a musical instrument, you’ll need occasional retuning (updating detectors and RL) to keep RealGen playing in perfect pitch as the stage changes.
06 Conclusion & Future Work
🍞 Top Bread (Hook): Think of learning to paint by practicing against expert art forgers and the best art detectives; if they can’t tell it’s fake, you’ve mastered realism.
🥬 Three-Sentence Summary
- RealGen trains a prompt-rewriting LLM and a diffusion image generator using Detector Rewards from both visible-artifact and feature-level fake-image detectors.
- With GRPO reinforcement learning, it steadily reduces telltale AI artifacts while preserving text alignment, producing images that look truly photographic.
- RealBench evaluates realism automatically using Detector-Scoring and Arena-Scoring, and RealGen tops strong baselines on held-out judges, showing broad generalization.
Main Achievement
- A practical, scalable “detection-for-generation” pipeline: use detectors not just to catch fakes but to teach the generator how to stop looking fake.
Future Directions
- Extend to video and 3D, add richer detectors (e.g., lighting/physics realism), adaptively trade off between aesthetics and realism, and continue improving fairness and robustness.
Why Remember This
- RealGen flips the script: the very tools meant to spot AI images become the coaches that teach models to look real. That simple twist—turning detection into a learning reward—pushes text-to-image closer to the original dream: pictures that are indistinguishable from reality.
Practical Applications
- Create realistic product photos for e-commerce (textures, reflections, and lighting that match real catalogs).
- Generate natural-looking portraits for concept art, casting previews, or photography planning.
- Produce convincing architectural interior/exterior mockups to visualize materials and lighting before construction.
- Make lifelike advertising visuals for quick A/B testing without full photo shoots.
- Design movie and TV previsualization frames with true-to-life skin, props, and environments.
- Auto-rewrite user prompts in creative apps to improve realism without manual prompt engineering.
- Benchmark and select the most realistic text-to-image model using RealBench without human raters.
- Augment training sets for perception systems with near-photographic synthetic data (with proper labeling and ethics).
- Rapidly prototype packaging and branding visuals that look like real photos on shelves.
- Create educational illustrations that accurately mimic real-world textures and lighting (e.g., science models, museum exhibits).