PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling
Key Summary
- This paper teaches image models to keep things consistent across multiple pictures (the same character, art style, and story logic) using reinforcement learning (RL).
- The authors build a special reward model, PaCo-Reward, that judges how consistent two images are by answering 'Yes' or 'No' and explaining why.
- They create a large, diverse training set (PaCo-Dataset) by smartly pairing sub-images from grids, then add human rankings and short reasoning to guide learning.
- Their RL recipe, PaCo-GRPO, speeds up training by using lower-resolution images during learning but keeps high-resolution quality at test time.
- They also invent a 'log-tamed' way to blend multiple rewards so no single reward overwhelms the others, keeping training steady.
- Across two big tasks, Text-to-ImageSet generation and Image Editing, the method beats strong baselines on consistency without hurting how well prompts are followed.
- PaCo-Reward matches human preferences better than prior open models, improving correlation by about 8-15% on benchmarks.
- The whole system nearly doubles training efficiency while delivering state-of-the-art consistency.
- This makes it practical to build storyboards, characters, product lines, and instruction sequences that really stick together visually.
- The approach is scalable, interpretable (thanks to the 'Yes/No + reasons' format), and works with existing generation models.
Why This Research Matters
Many real projects need multiple images that clearly belong together: think storyboards, brand kits, or how-to guides. This work makes it practical to train image models that keep identity, style, and logic aligned across a set, without requiring massive labeled datasets. The reward model is fast, interpretable, and closely tracks how people judge visual consistency. The RL method is efficient, so teams can train on realistic budgets and timelines. Together, this opens the door to reliable, scalable multi-image pipelines for education, entertainment, design, and e-commerce.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how comic books keep a hero looking the same from page to page, with the same face, the same costume, and a story that makes sense from one panel to the next? If any page suddenly changed the hero's look or the plot jumped around, it would feel wrong.
The concept (Visual Consistency): Visual consistency means multiple images stick to the same identity (who it is), style (how it looks), and logic (what happens next makes sense). How it works: (1) Decide which things must stay the same (character's face, font style, setting). (2) Let other things vary (poses, backgrounds, steps in a process). (3) Check that every new image matches the rules. Why it matters: Without consistency, characters drift, fonts wobble, and step-by-step stories break; the result feels untrustworthy.
Anchor: A children's story about a dentist looks right only if the dentist keeps the same face and outfit across scenes, and the daily events connect.
Before this paper: Image models got great at making one pretty picture from text, but struggled to make several pictures that belong together. Supervised training, which needs huge labeled datasets, wasn't enough because: (1) Almost no large datasets explicitly tell models what "consistency" looks like across many images. (2) Human taste is subtle: we often "feel" when fonts match or a storyboard flows, but that's hard to write as simple rules. (3) Editing is tricky: change one attribute, but keep everything else (style, identity) unchanged.
Hook: Imagine teaching a robot chef to cook by giving it a thumbs-up or thumbs-down after each dish. Even without a full recipe book, it can learn from feedback.
The concept (Reinforcement Learning, RL): RL is a way for models to learn by trying things and receiving rewards. How it works: (1) The model makes images. (2) A judge (reward model) scores them. (3) The model adjusts to get higher scores. Why it matters: Without RL, you'd need tons of labeled pairs; RL learns from feedback signals instead, capturing human-like preferences.
Anchor: The model draws four café menus; a reward model says which pair looks more consistently "chalkboard". The model tweaks itself to earn more "Yes" answers.
Past attempts and why they struggled: (1) Generic reward models judged aesthetics or prompt following (did the image match the text?) but not whether images matched each other in identity/style/logic. (2) Image similarity tools (like simple feature comparisons) missed the human, multi-faceted sense of "these belong together". (3) RL for images was costly: sampling many large images per step is slow and expensive, and juggling multiple goals (consistency and prompt alignment) can go unstable when one reward dominates.
The gap: We needed (1) a reward model trained specifically to judge pairwise visual consistency the way people do, and (2) an RL method that is efficient and stable even when generating multiple images per prompt and blending multiple rewards.
Real stakes (why you should care): (1) Storytelling: keep characters and scenes steady across pages. (2) Product lines: same brand style across posters, mugs, and T-shirts. (3) Education: clear step-by-step diagrams that grow logically. (4) Editing photos: add abs or change expression without altering identity or style. (5) Enterprise workflows: save time and cost by reducing manual curation to fix inconsistencies.
02 Core Idea
Hook: Imagine a music judge who compares two singers at a time, says "Singer A fits the genre better: Yes", and briefly explains why. Do this many times and you teach what "consistent style" sounds like, with no giant rulebook needed.
The concept (Key Insight): Turn visual consistency into a simple pairwise "Yes/No + short reason" judging game, then use RL to steer the image model toward more "Yes" answers, quickly and stably, even for multi-image tasks. How it works: (1) Build a dataset that naturally produces many meaningful image pairs. (2) Train a reward model that decides if two images are consistent, answering "Yes/No" in the same next-token style language models use. (3) Plug this judge into an efficient RL loop that (a) learns on lower resolution to save time, and (b) softly balances multiple rewards so none runs away. Why it matters: Without this, models either ignore consistency or become too slow and unstable to train well.
Anchor: To make four café menu images match, the reward model keeps saying "Yes" for consistent chalkboard fonts and "No" for oddball ones; the generator gradually locks into a unified style.
Three analogies for the same idea: (1) Judge at a bake-off: taste two cupcakes, pick the one closer to the theme, and explain the choice; over many rounds, bakers learn the style. (2) School rubric: a teacher compares two essays for tone and structure; brief feedback guides future essays. (3) Sports drills: scrimmage two plays, pick the more "on-strategy" one; the team learns what "cohesive play" looks like.
Before vs After: Before, reward models focused on "Is this one image pretty and on-prompt?" After, the system asks "Do these images belong together?" and optimizes for that. Before, RL was heavy and brittle for multi-image tasks; after, training uses smaller images for speed and a log-tamed mixer to keep rewards balanced.
Hook: You know how stories flow when each scene follows logically from the last?
The concept (Autoregressive Scoring): The reward model answers "Yes" or "No" as its very next word, then gives a short reason, exactly how language models naturally talk. How it works: (1) Feed two images (and the prompt) into a vision-language model. (2) Ask it to output the first token: "Yes" or "No". (3) Optionally, continue with a couple of reason tokens. Why it matters: Without aligning the reward to the model's natural next-token prediction, we'd have to bolt on clunky scoring heads or long explanations that slow RL.
Anchor: When shown two dentist images, the judge outputs "Yes" immediately if the faces, scrubs, and clinic vibe match; then adds a short reason.
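To make "the score is the probability of 'Yes'" concrete, here is a minimal sketch, assuming you already have the judge's logits at the verdict position and a Hugging Face-style tokenizer; the function name and the two-way renormalization against 'No' are illustrative choices, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def yes_probability(first_token_logits: torch.Tensor, tokenizer) -> torch.Tensor:
    """Map a pairwise judge's first-token logits to a consistency score in [0, 1].

    first_token_logits: (batch, vocab_size) logits at the position where the
    judge emits its verdict token. tokenizer: anything exposing
    convert_tokens_to_ids (e.g., a Hugging Face tokenizer).
    """
    yes_id = tokenizer.convert_tokens_to_ids("Yes")
    no_id = tokenizer.convert_tokens_to_ids("No")
    # Renormalize over just the two verdict tokens so the score ignores
    # probability mass spread across unrelated vocabulary entries.
    two_way = torch.softmax(first_token_logits[:, [yes_id, no_id]], dim=-1)
    return two_way[:, 0]  # P('Yes') per image pair
```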
Building blocks (the idea in pieces):
- PaCo-Dataset: A smart way to create many consistent/inconsistent pairs by slicing image grids and pairing sub-figures, then adding human rankings and short, reasoned annotations.
- PaCo-Reward: A pairwise, "Yes/No + brief reason" reward model that maps the probability of "Yes" to a consistency score.
- PaCo-GRPO: An RL recipe that (a) trains at lower resolution to save compute, (b) blends multiple rewards using a log-tamed aggregator to avoid domination, and (c) keeps sampling diverse enough to explore better solutions.
Anchor: For a sketch-to-drawing progression, the judge keeps an eye on step order and style, saying "Yes" only when the sequence grows sensibly while staying in the same art style.
03 Methodology
High-level flow: Input (prompt and, if editing, a reference image) → Generate a small batch of candidate images (often a 2×2 grid) → Score pairs for consistency with PaCo-Reward (and prompt alignment with another scorer) → Combine rewards with log-taming → Update the generator with PaCo-GRPO → Output higher-consistency images.
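As a rough picture of how these stages fit together, here is a stubbed, runnable sketch of one training iteration. Every name and value in it (sample_candidates, consistency_reward, alignment_reward, the group size of 4, the 512-pixel training resolution) is a placeholder standing in for the components described below, not the authors' code.

```python
import random

def sample_candidates(prompt: str, group_size: int = 4, resolution: int = 512):
    """Stub: pretend to sample `group_size` candidate 2x2 grids at training resolution."""
    return [f"{prompt} @ {resolution}px, sample {i}" for i in range(group_size)]

def consistency_reward(candidate: str) -> float:
    """Stub standing in for PaCo-Reward's pairwise 'Yes' probabilities."""
    return random.random()

def alignment_reward(candidate: str) -> float:
    """Stub standing in for a prompt-alignment scorer such as CLIP-T."""
    return random.random()

def train_step(prompt: str):
    candidates = sample_candidates(prompt)
    rewards = [(consistency_reward(c), alignment_reward(c)) for c in candidates]
    # Log-tamed blending and the PaCo-GRPO update are sketched in later blocks.
    return candidates, rewards

print(train_step("four images of the same dentist in scrubs"))
```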
Hook: Imagine cutting a photo collage into squares and mixing pieces between collages to ask, "Do these two squares feel like they belong to the same theme?"
The concept (PaCo-Dataset via Sub-figure Pairing): It's a dataset built by slicing grid images and pairing sub-figures to create many consistency comparisons. How it works: (1) Write diverse prompts (e.g., by an assistant model). (2) Generate 2×2 image grids with strong internal coherence. (3) Slice grids into four sub-images. (4) Pair sub-images across grids of the same prompt to make tons of pairs spanning identity, style, and logic. (5) Add human rankings and concise reasoning. Why it matters: Without many varied, realistic pairs, the reward model can't learn human-like consistency.
Anchor: Four café-menu panels from different seeds are cross-paired so the dataset learns when chalkboard fonts truly match.
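Below is a minimal sketch of the sub-figure pairing step, assuming the grids are PIL images; the within-grid vs. cross-grid flag is only a heuristic for illustration, since the real dataset relies on human rankings and written reasoning for labels.

```python
from itertools import combinations
from PIL import Image

def slice_grid(grid: Image.Image, rows: int = 2, cols: int = 2):
    """Cut a grid image into rows*cols sub-images (left to right, top to bottom)."""
    w, h = grid.size
    tile_w, tile_h = w // cols, h // rows
    return [
        grid.crop((c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h))
        for r in range(rows)
        for c in range(cols)
    ]

def build_pairs(grids):
    """Pair sub-images within and across grids generated for the same prompt.

    Returns (tile_a, tile_b, same_grid) triples: same-grid pairs lean consistent,
    cross-grid pairs supply harder comparisons for later human ranking.
    """
    tiles = [(gi, tile) for gi, grid in enumerate(grids) for tile in slice_grid(grid)]
    return [(a, b, ga == gb) for (ga, a), (gb, b) in combinations(tiles, 2)]
```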
Hook: Think of a referee who answers first, "Yes" or "No", then gives a one-sentence call.
The concept (PaCo-Reward: Pairwise "Yes/No + reason"): A vision-language reward model trained to judge whether two images are consistent. How it works: (1) Input two images (and task-aware instructions). (2) The model outputs the first token: "Yes" or "No". (3) It then emits a short chain-of-thought reason. (4) During training, a weighted objective emphasizes the first token (the decision) while still learning from the reasons. Why it matters: Without fast, aligned decisions, RL becomes slow or misaligned; this design makes the score the probability of "Yes", which fits perfectly with next-token prediction.
Anchor: Show two portraits meant to be the same person; the model outputs "No" if hair color and facial structure drift, then briefly explains.
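To illustrate the "weighted objective that emphasizes the first token", here is a sketch of one way such a loss could look, assuming logits and labels are already aligned for next-token prediction (shifting handled by the caller). The decision_weight value and the exact weighting scheme are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def decision_weighted_loss(logits, labels, decision_pos, decision_weight=5.0):
    """Cross-entropy over the 'Yes'/'No' + reason sequence, with extra weight
    on the decision token.

    logits: (batch, seq, vocab); labels: (batch, seq) with -100 on prompt tokens;
    decision_pos: (batch,) index of the 'Yes'/'No' token in each sequence.
    """
    vocab = logits.size(-1)
    per_token = F.cross_entropy(
        logits.reshape(-1, vocab), labels.reshape(-1),
        ignore_index=-100, reduction="none",
    ).view(labels.shape)
    weights = torch.ones_like(per_token)
    weights[torch.arange(labels.size(0)), decision_pos] = decision_weight
    mask = (labels != -100).float()
    return (per_token * weights * mask).sum() / (weights * mask).sum()
```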
Hook: Training a team is faster on a small field, as long as they play matches on a full field later.
The concept (Resolution-Decoupled Training): Train with lower-resolution images to save compute, but generate at high resolution at test time. How it works: (1) During RL, sample smaller images (e.g., 512×512 grids) for quick generation and scoring. (2) Use these rewards to update the model. (3) At inference, run full resolution (e.g., 1024×1024) for quality. Why it matters: Without this, sampling many high-resolution images per RL step is too slow and costly.
Anchor: A storyboard model practices on small panels but delivers crisp, final comic pages.
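Here is a minimal sketch of what resolution decoupling can look like in code, assuming the diffusers FluxPipeline API for a FLUX.1-dev base model; the RL wrapper, batching, and sampler settings are omitted, and the 512/1024 values mirror the setup described above.

```python
import torch
from diffusers import FluxPipeline

# Standard diffusers loading; the RL plumbing around the pipeline is not shown.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

def sample(prompt: str, training: bool):
    # RL rollouts render a smaller canvas so each step is cheap to generate and
    # score; deployment keeps the full resolution for final quality.
    res = 512 if training else 1024
    return pipe(prompt, height=res, width=res).images[0]
```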
Hook: When judging a talent show, you don't want one loud judge to drown out everyone else.
The concept (Log-Tamed Multi-Reward Aggregation): A safer way to combine multiple rewards (e.g., consistency and prompt alignment). How it works: (1) Measure how "wobbly" each reward is (its coefficient of variation). (2) If a reward swings too widely, gently compress it with a log transform. (3) Average the adjusted rewards and normalize the advantages. Why it matters: Without taming, one reward can dominate, pushing the model to ignore other important goals.
Anchor: If the consistency reward starts shouting, log-taming turns down its volume so prompt alignment still counts.
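A small numerical sketch of the idea with NumPy: compute each reward's coefficient of variation over the sampled group, log-compress the ones that swing too much, then average and normalize into advantages. The threshold, the particular log1p transform, and the equal weighting are illustrative assumptions; the paper's exact taming rule may differ.

```python
import numpy as np

def log_tamed_aggregate(reward_groups, cv_threshold=0.5):
    """Blend several per-sample reward vectors into one advantage vector.

    reward_groups: list of 1-D arrays, one per reward (e.g., consistency,
    prompt alignment), each scored over the same group of sampled images.
    """
    tamed = []
    for r in reward_groups:
        r = np.asarray(r, dtype=np.float64)
        cv = r.std() / (abs(r.mean()) + 1e-8)   # coefficient of variation
        if cv > cv_threshold:                    # reward is "wobbly": compress it
            r = np.log1p(r - r.min())            # shift to >= 0, then log-tame
        tamed.append(r)
    blended = np.mean(tamed, axis=0)             # equal-weight average of rewards
    return (blended - blended.mean()) / (blended.std() + 1e-8)  # normalized advantages
```

For example, log_tamed_aggregate([[0.9, 0.2, 0.8, 0.1], [0.55, 0.5, 0.6, 0.45]]) compresses the wide-swinging first reward before blending it with the steadier second one.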
Hook: Imagine practicing plays that add a tiny bit of randomness so a team explores more strategies, but with rules that keep learning stable.
The concept (PaCo-GRPO for Image Generators): An RL update rule adapted to image models that adds controlled sampling noise for exploration and clips updates for stability. How it works: (1) Generate a small group of image candidates per prompt. (2) Score them with PaCo-Reward and a prompt-alignment scorer. (3) Compute a balanced advantage for each sample. (4) Update the generator with a clipped policy objective (relative to the previous policy), optionally adding a KL penalty to stay near a reference policy. Why it matters: Without stable, exploratory updates, training can either stagnate or spiral.
Anchor: The model tries several menu designs per prompt, gets balanced feedback, and nudges its weights toward the ones that look coherent and on-prompt.
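The clipped update can be sketched as below, assuming you can evaluate (or approximate) the log-probability of each sampled image under the current, rollout, and reference policies; for diffusion or flow generators this is usually done per denoising step, which the sketch glosses over. Names and hyperparameter values are illustrative, not the paper's.

```python
import torch

def grpo_loss(logp_new, logp_old, advantages,
              clip_eps=0.2, kl_coef=0.0, logp_ref=None):
    """Clipped, group-relative policy objective for a batch of sampled images.

    logp_new / logp_old: log-probabilities of each sample under the current and
    rollout policies; advantages: group-normalized values from the log-tamed
    aggregator; logp_ref: optional reference-policy log-probabilities.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    loss = -torch.minimum(unclipped, clipped).mean()   # pessimistic, clipped surrogate
    if kl_coef > 0.0 and logp_ref is not None:
        loss = loss + kl_coef * (logp_new - logp_ref).mean()  # simple KL-style anchor
    return loss
```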
Concrete example with data: For Text-to-ImageSet, we prompt: "Generate four images of the same dentist in scrubs in various medical scenarios." The model samples a 2×2 grid at 512×512. PaCo-Reward compares sub-images pairwise, producing "Yes" probabilities and short reasons (e.g., "Yes: face shape, hair, and outfit match" or "No: different jawline and scrubs color"). CLIP-T scores prompt alignment. The log-tamed aggregator blends the two. The RL step pushes the generator to make future samples that increase both consistency and prompt faithfulness. After training, we generate at 1024×1024 and see noticeably steadier identity and style.
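For the prompt-alignment side, a CLIP-T style score is just the cosine similarity between the prompt embedding and each image embedding. A minimal sketch using the Hugging Face CLIP implementation is shown below; the specific checkpoint is an assumption, not necessarily the one used in the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-large-patch14"  # assumed checkpoint for illustration
clip = CLIPModel.from_pretrained(MODEL_ID)
clip_proc = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def clip_t_scores(images, prompt):
    """Cosine similarity between the prompt and each generated image (one score per image)."""
    inputs = clip_proc(text=[prompt], images=images, return_tensors="pt", padding=True)
    out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(-1)
```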
Secret sauce: (1) Pairwise, autoregressive "Yes/No + reason" scoring aligns rewards with how vision-language models naturally operate, making it fast and interpretable. (2) Resolution-decoupled training slashes RL cost without hurting final quality. (3) Log-tamed aggregation keeps multi-objective learning balanced, preventing reward domination.
04 Experiments & Results
Hook: Think of a science fair where judges test if your machine does what it promises, consistently and better than last year's projects.
The concept (The Test): The authors evaluate two things: (1) Is PaCo-Reward a better judge of visual consistency than existing reward models? (2) Does using PaCo-Reward inside RL actually make generators produce more consistent multi-image sets and better edits, without ruining prompt alignment? How it works: (1) Compare PaCo-Reward with baselines on dedicated benchmarks of human preferences. (2) Plug PaCo-Reward into RL and measure improvements across tasks. Why it matters: Without strong, fair testing, we can't trust the approach.
Anchor: If PaCo-Reward agrees with people more often, and RL with PaCo-Reward makes better storyboards, we've got a win.
Benchmarks and baselines: They use ConsistencyRank (~3k ranked instances) and EditReward-Bench (~3k pairs with labels for Prompt Following and Consistency) to test reward models. For RL tasks, they use T2IS-Bench for Text-to-ImageSet and GEdit-Bench for Image Editing. Baselines include CLIP-I, DreamSim, InternVL3.5-8B, Qwen2.5-VL-7B, the EditScore family, and proprietary models like GPT-4/5 and Gemini 2.5.
Results on reward modeling (RQ1): PaCo-Reward-7B outperforms prior open models by sizeable margins, around 8-15% better alignment with human preferences depending on the metric. Notably, on ConsistencyRank, plain large multimodal models lag behind simpler similarity tools, showing consistency is a special skill. After fine-tuning on PaCo-Dataset, PaCo-Reward flips the script and leads across Accuracy, Kendall's tau, Spearman's rho, and Top1-Bottom1. On EditReward-Bench, PaCo-Reward-7B beats open baselines and approaches closed models (e.g., GPT-5), indicating broad generalization.
Hook: Imagine practicing soccer on a smaller field but scoring just as many goals on the big field later.
The concept (Efficiency and Stability of PaCo-GRPO): The study checks whether lower-resolution training keeps final performance and whether log-tamed blending prevents any reward from overpowering the others. How it works: (1) Train at 512×512 vs 1024×1024 and compare evaluation curves; (2) Track the ratio between the consistency and prompt-alignment rewards with and without log-taming. Why it matters: Without efficiency, training is too slow; without stability, results wobble or collapse.
Anchor: Training time is cut nearly in half at 512×512, yet final scores catch up to 1024×1024.
Results on Text-to-ImageSet (RQ2): With PaCo-Reward inside RL, FLUX.1-dev shows strong gains in visual consistency across identity, style, and logic, beating open baselines and closing the distance to some closed systems. Average gains of roughly 0.10-0.12 on consistency metrics are like going from a solid B to an A- while others hover at B-level.
Results on Image Editing (RQ2): On GEdit-Bench, PaCo-Reward improves semantic consistency and prompt quality together, even for strong editors like Qwen-Image-Edit, suggesting the reward model doesn't force a trade-off; instead, it steers toward edits that are faithful yet stable in style and identity.
Ablation on resolution-decoupled training (RQ3): Curves show that while 256×256 loses too much detail for reliable judging, 512×512 reaches parity with 1024×1024 after more epochs, with higher variance that actually helps exploration. Training time nearly halves (e.g., 6 hours vs 12 hours in a representative setup), giving a practical path to efficient RL.
Ablation on log-tamed aggregation (RQ3): Without log-taming, the consistency reward dominates after about 50 epochs (reward ratio > 2.5), causing imbalance. With log-taming, the ratio stays under roughly 1.8, keeping both consistency and prompt alignment improving together, like an orchestra where the violins don't drown out the woodwinds.
Surprises: (1) VLMs fine-tuned on pairwise consistency tasks can beat generic large VLMs and even strong similarity baselines, showing that the exact training objective matters more than raw size. (2) Lower-resolution training can still teach high-resolution behavior, as long as it preserves enough detail for reliable judging.
Scoreboard meaning: Saying "Accuracy 0.77" is nice, but here it means PaCo-Reward picks the human-preferred side much more often than prior open models, giving RL a more trustworthy coach. Saying "training efficiency nearly doubled" means practitioners can actually run these RL loops on real schedules and budgets, not just in theory.
05 Discussion & Limitations
Hook: Even the best recipe needs the right kitchen and ingredients; otherwise, it's hard to cook a feast.
The concept (Limitations): What it is: The method works best with capable base generators, several GPUs, and careful setup of data and prompts. How it works (constraints): (1) Compute: RL with groups of images and VLM-based rewards still needs strong hardware; the paper reports H100-class GPUs. (2) Data realism: While sub-figure pairing scales well, coverage still depends on the prompts and seeds used; missing edge cases may limit generalization. (3) Resolution floor: Very low training resolutions (e.g., 256×256) can undercut reward reliability. (4) Reward scope: PaCo-Reward focuses on consistency; if you care about other nuanced factors (e.g., subtle lighting aesthetics), you may need extra rewards. Why it matters: Without enough compute or the right data mix, results can be weaker or training unstable.
Anchor: If your kitchen only has a microwave, a seven-course meal is tough, even with a great recipe.
Required resources: (1) A solid base image model (e.g., FLUX or similar). (2) A VLM backbone for PaCo-Reward (e.g., 7B-class). (3) GPUs for parallel image sampling plus a fast VLM serving stack. (4) Time to curate or extend prompts and verify reward prompts.
When not to use: (1) If you only need single images with no cross-image ties, simpler alignment methods may suffice. (2) If compute is severely limited, pure RL loops may be overkill; consider offline finetuning with static pairwise data. (3) If the domain has almost no shared structure (e.g., completely unrelated images), pairwise consistency scoring adds little.
Open questions: (1) Can we learn consistency without any paired data, using only clever self-training? (2) How do we generalize to videos, where temporal consistency adds more constraints? (3) Can we personalize consistency (e.g., a specific brand or artist's micro-style) with a tiny number of examples? (4) How do we auto-select the best mix of rewards beyond log-taming, with learned weighting instead of fixed rules? (5) Can we push efficiency further with adaptive resolution or partial scoring (only scoring tough pairs)?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces PaCo-RL, which teaches image generators to keep identity, style, and logic consistent across multiple images using pairwise "Yes/No + reason" rewards. The key ingredients are PaCo-Reward, a specialized, autoregressive, pairwise judge trained on a large, smartly built dataset, and PaCo-GRPO, an RL recipe that speeds training with low-resolution images and stabilizes multi-reward learning with log-tamed aggregation. The result is state-of-the-art consistency with better efficiency and steady prompt alignment.
Main achievement: Showing that a simple, fast, and interpretable pairwise reward, aligned with next-token prediction and combined with resolution-decoupled RL and log-taming, can reliably deliver consistent multi-image generations and robust edits.
Future directions: Expand to video (temporal consistency), personalize to specific brands or characters with minimal data, learn automatic reward weighting, and explore adaptive-resolution and sparse-scoring tricks to cut compute further.
Why remember this: It reframes "consistency" as a practical, pairwise judging game that vision-language models excel at, and it packages RL so teams can actually use it: faster training, steadier learning, and images that finally belong together.
Practical Applications
- Automated storyboarding where characters and scenes remain visually consistent across panels.
- Brand asset generation (logos, packaging, mockups) that share a unified style across products.
- Educational step-by-step diagrams with logical progression and consistent visuals.
- Product catalogs where different views and variants keep identity and style aligned.
- Photo editing that alters attributes (e.g., add abs, change expression) without breaking identity or style.
- UI theme packs that maintain the same typography and color system across screens.
- Marketing campaigns with matching posters, social tiles, and web banners.
- Children's book illustration sets where characters and art style stay consistent page to page.
- Process or lifecycle illustrations (e.g., decay, construction, cooking) with coherent step transitions.
- Game asset sheets (multi-view characters, items) with stable identity and art direction.