Enhancing Spatial Understanding in Image Generation via Reward Modeling
Key Summary
- This paper teaches image generators to place objects in the right spots by building a special teacher called a reward model focused on spatial relationships.
- The authors first create SpatialReward-Dataset, 80,000 carefully checked pairs of images: one correct layout and one with a deliberate spatial mistake.
- They train SpatialScore, a reward model that scores how well an image follows the prompt’s left/right/front/behind style instructions.
- SpatialScore beats even leading proprietary systems on a new spatial preference benchmark, showing it really understands multi-object layouts.
- They plug SpatialScore into online reinforcement learning so the image model gets rewarded for correct layouts and penalized for errors.
- A clever top-k filtering trick keeps training fair when some prompts are easy and others are hard, speeding learning and saving compute.
- Across multiple benchmarks (short and long prompts), the RL-tuned generator shows large, consistent gains in spatial accuracy.
- Compared to training with a rule-based evaluator (like GenEval), this method generalizes better to long, complex prompts without weird artifacts.
- The approach uses widely available components (a VLM backbone plus LoRA) and fits into current GRPO-style training for flow/diffusion models.
Why This Research Matters
When people give step-by-step visual instructions—like laying out a desk, arranging a shelf, or designing a room—they expect the AI to follow them exactly. Better spatial understanding means fewer retries, saving time and making AI tools feel reliable and respectful of user intent. This helps creative work (storyboards, posters), practical planning (classroom seating, warehouse layouts), and education (diagrams that match instructions). It can reduce mistakes in sensitive contexts like safety signage or medical illustrations where left/right or front/behind matters. The method is also a blueprint: build a focused judge for the exact skill you care about, then train the generator with it.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re giving directions to draw a scene: “Put the cup to the right of the laptop, with a plant behind it.” If your friend hears every word but mixes up left and right, the drawing looks wrong even if it’s pretty.
🥬 The Concept: Text-to-image models can create beautiful images, but they often struggle with spatial relationships—where things go relative to each other.
- What it is: Spatial relationships are instructions like left/right, in front/behind, centered, aligned with edges, and ordered sequences (A left of B, B left of C, etc.).
- How it works (in today’s systems): A model reads a prompt and samples an image. If the prompt is simple, it’s often fine. But with long, detailed prompts that include many objects and relations, the model may misplace items.
- Why it matters: Without solid spatial understanding, you get images that look plausible but violate instructions, forcing users to resample many times.
🍞 Anchor: If your prompt says, “A mug to the right of the keyboard and a notebook to the left,” but the image shows the mug on the left and the notebook missing, the scene fails the instruction despite looking nice.
🍞 Hook: You know how a teacher’s feedback helps you improve? AIs also need feedback to get better at specific skills like layout.
🥬 The Concept: A reward model is an automatic judge that scores images based on how well they follow the prompt’s rules.
- What it is: A learned scorer that outputs a number—higher if the spatial layout matches the text, lower if it doesn’t.
- How it works: Show it the prompt and image → extract features (with a vision-language backbone) → output a score (modeled as a distribution) → prefer correct over incorrect.
- Why it matters: If the judge can’t see spatial mistakes, it gives the wrong advice, and the generator learns the wrong lessons.
🍞 Anchor: If the judge says a wrong chess move is great, you’ll keep losing games. Similarly, a bad reward model trains bad layouts.
🍞 Hook: Think of older checklists that only count objects and colors but don’t see who’s left of whom.
🥬 The Concept: Many past evaluators (object detectors and broad “text-image alignment” scores) miss fine-grained spatial reasoning.
- What it is: Rule-based checks (like “is there a banana?”) or general preference scorers that favor pretty, semantically plausible images.
- How it works: Detect items and colors; loosely compare text and image embeddings; or ask VQA-like questions.
- Why it matters: They can pass simple prompts but fail on long, multi-object layouts, especially with occlusion or ordering.
🍞 Anchor: A detector might see two bananas and say “good,” even if the prompt said one banana left of a cup and one right of a plate—details the detector didn’t verify.
🍞 Hook: Imagine practicing with worksheets designed exactly for “left/right/front/behind” skills.
🥬 The Concept: SpatialReward-Dataset is a big, carefully checked set of paired examples to teach spatial judging.
- What it is: 80k preference pairs: one image that matches complex spatial instructions (the “winner”) and one with a targeted spatial error (the “loser”).
- How it works: Write a detailed “perfect” prompt → copy and perturb some relations (swap left/right, change in front/behind) → generate both images with the same model → humans verify that the winner truly follows the prompt and the loser truly breaks it.
- Why it matters: The judge learns by contrasting almost-right vs. truly-right images, focusing on spatial details.
🍞 Anchor: Like showing two pictures of a desk—one with the mouse on the right of the keyboard, one on the left—and labeling which is correct.
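The perturbation step can be sketched in a few lines of Python. This is only an illustration of the "flip one relation" idea: the relation table and prompts here are made up, and the real pipeline also generates both images with the same model and has humans verify each pair.

```python
# Map each spatial relation to its opposite; illustrative, not the paper's list.
SWAPS = {
    "left of": "right of",
    "right of": "left of",
    "in front of": "behind",
    "behind": "in front of",
}

def perturb_relation(prompt):
    """Flip the first spatial relation found, producing the 'loser' prompt."""
    for relation, flipped in SWAPS.items():
        if relation in prompt:
            return prompt.replace(relation, flipped, 1)
    return prompt  # no relation found; such a pair would be discarded

winner = "a mug to the right of the laptop, a plant behind it"
loser = perturb_relation(winner)
# The pair (winner, loser) differs in exactly one relation.
```

Because only one relation changes, the judge trained on such pairs cannot fall back on style cues; it has to look at the layout.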
🍞 Hook: Now give the AI a ruler and a coach.
🥬 The Concept: SpatialScore is the learned judge trained on SpatialReward-Dataset to score spatial accuracy.
- What it is: A reward model built on a vision-language backbone that outputs a spatial consistency score for a prompt–image pair.
- How it works: Feed prompt + image → encode both → a special “reward token” attends to everything → a small head outputs a score distribution → during training, it learns to rank the correct image above the perturbed one.
- Why it matters: This judge is specialized for spatial reasoning and, on their benchmark, even outperforms top proprietary systems.
🍞 Anchor: Given “cup to the right of the laptop,” SpatialScore gives the correct image a high score and the wrong-side image a low score.
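The summary says the head outputs a score "as a distribution." One common way to realize that idea (an assumption here, not a detail confirmed above) is a distribution over discrete score bins, with the scalar reward taken as its expectation:

```python
def expected_score(probs):
    """Scalar reward as the expectation of a distribution over score bins 1..N.
    `probs` is the head's probability for each bin; bins are illustrative."""
    return sum(b * p for b, p in enumerate(probs, start=1))

# Mass concentrated on high bins -> high spatial-consistency score.
good = expected_score([0.0] * 8 + [0.2, 0.8])   # mostly 9s and 10s -> 9.8
bad = expected_score([0.7, 0.3] + [0.0] * 8)    # mostly 1s and 2s  -> 1.3
```

Either way, the training signal is the ranking: the correct image's score must land above the perturbed image's score.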
🍞 Hook: Think of training with instant feedback as you draw, not waiting until the end.
🥬 The Concept: Online reinforcement learning (RL) lets the image model try, get scored by SpatialScore, and adjust on the fly.
- What it is: A learning loop where the generator samples multiple images, the reward model scores them, and the generator updates to prefer higher-scoring samples.
- How it works: Sample a group for each prompt → score with SpatialScore → compute “advantages” (how much better than the group average) → update the model with policy gradients (GRPO-style) → repeat.
- Why it matters: Directly optimizes for spatial correctness, not just pretty pictures.
🍞 Anchor: Like drawing several desk scenes, keeping the best, learning from mistakes, and improving the next batch.
🍞 Hook: Sometimes classes have many A students (easy test) or few (hard test). A plain average can mislead.
🥬 The Concept: Top-k filtering balances training by focusing on the best and worst samples in each group.
- What it is: After scoring a group, keep the top-k and bottom-k images to compute learning signals.
- How it works: Rank images by reward → select top and bottom k → normalize advantages using only those → update → this reduces bias when a group is mostly good or mostly bad.
- Why it matters: Prevents good samples from getting negative signals in “easy” groups and speeds stable learning while cutting compute.
🍞 Anchor: Grading only the top and bottom few essays in a stack can reveal real differences, avoiding a misleading class average.
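A minimal sketch of the filtering idea, assuming advantages are computed only over the kept samples (function names and reward numbers are illustrative):

```python
import statistics

def topk_advantages(rewards, k):
    """Keep only the top-k and bottom-k samples and compute their
    group-relative advantages. Excluded middle samples get no gradient,
    so decent-but-average images are neither rewarded nor punished."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    keep = order[:k] + order[-k:]                # bottom-k and top-k indices
    subset = [rewards[i] for i in keep]
    mean = statistics.mean(subset)
    std = statistics.pstdev(subset) or 1.0       # guard against zero spread
    return {i: (rewards[i] - mean) / std for i in keep}

# An "easy" group: most of the six samples already score well.
rewards = [9.4, 9.2, 9.3, 9.1, 9.0, 2.0]
adv = topk_advantages(rewards, k=1)
# Only the best and worst images contribute to the update.
```

With plain whole-group normalization, the images scoring 9.0–9.2 would sit below the group mean and receive negative signal despite being good; filtering sidesteps that.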
02 Core Idea
🍞 Hook: You know how maps help you find where things go—left, right, above, below? Imagine giving a map to an AI painter so it stops guessing and starts placing things correctly.
🥬 The Aha! Moment: Train a specialized “spatial judge” (SpatialScore) on carefully constructed preference pairs (SpatialReward-Dataset), then use that judge to directly reward and improve an image generator’s layouts with online RL—plus a top-k trick to keep learning steady and efficient.
Multiple analogies:
- Coach and drills: Build drills (dataset) that test foot placement exactly. The coach (SpatialScore) scores each move. The player (image model) practices live (online RL) and gets better at positioning.
- GPS for drawing: The dataset provides routes with checkpoints. SpatialScore is the GPS that says “on track” or “recalculate.” RL is your car adjusting speed and direction to follow the route.
- Baking with a taster: The dataset supplies recipes (precise layout instructions). SpatialScore is the taster grading each batch. RL is adjusting the mix in real time until the pastry matches the recipe.
Before vs. After:
- Before: Models often nailed object presence and style but flipped left/right or missed alignments on long, complex prompts; evaluators prized aesthetics over precise layout.
- After: The generator gets explicit spatial feedback; layouts align with multi-object instructions; improvements hold up on both short and long prompts.
Why it works (intuition, no equations):
- Focused supervision: Preference pairs pin the judge’s attention on just the spatial relation that changed, removing noise from style differences.
- Ranking beats raw scores: Teaching the judge to prefer A over B (when only one relation changes) is more robust than absolute scoring.
- Closed training loop: The same kind of feedback (spatial correctness) you care about at test time is used during training.
- Top-k stability: Normalizing learning signals using the most informative examples avoids misleading group averages.
Building blocks (with mini Sandwich for each):
- 🍞 Hook: Like flashcards that differ in one tiny detail. 🥬 SpatialReward-Dataset: 80k pairs; one correct, one perturbed. Step-by-step: write perfect prompt → perturb a relation → generate both with the same model → human-verify. Why: Teaches the judge to spot just the spatial difference. 🍞 Example: Mouse right of keyboard vs. mouse left of keyboard.
- 🍞 Hook: A referee trained on those flashcards. 🥬 SpatialScore: A VLM-based reward model with a special token that outputs a spatial-consistency score (as a distribution). Why: Gives accurate, low-cost, automatic spatial feedback. 🍞 Example: Scores 9/10 for correct desk scene; 2/10 for the swapped one.
- 🍞 Hook: Practice, score, adjust, repeat. 🥬 Online RL (GRPO-style): Sample multiple candidates, score them, compute advantages, and update the generator toward higher-scoring layouts. Why: Directly optimizes for what we care about—where things go. 🍞 Example: Generate 24 desk scenes, keep signals from best/worst, and improve.
- 🍞 Hook: When averages hide truths. 🥬 Top-k filtering: Use top-k and bottom-k to normalize advantages and reduce bias in easy/hard groups; also cuts compute. Why: Prevents good samples from being mistakenly punished. 🍞 Example: In an easy batch with many good images, still keep the top and bottom few to compute fair updates.
Bottom bread (Anchor): After training this way, when you ask for “a cup to the right of the laptop, a notebook left, and a plant behind them,” the model places each object correctly—even in longer, multi-step prompts.
03 Methodology
High-level recipe: Prompt → (1) Build spatial preference data → (2) Train SpatialScore judge → (3) Plug judge into online RL → (4) Top-k filter for fair, efficient updates → Output: a generator with stronger spatial understanding.
Step 1: Build SpatialReward-Dataset
- What happens: Create 80k preference pairs that differ mainly by one or a few spatial relations. Each pair uses the same base generator but different prompts (perfect vs. perturbed). Humans verify correctness.
- Why this step exists: The judge needs crystal-clear examples where only the layout relation changes; otherwise it learns to prefer style or aesthetic differences.
- Example: Perfect prompt: “Keyboard centered; mouse to the right aligned with the front edge; notebook to the left.” Perturbed: swap left/right for mouse and notebook. Generate both, then label correct vs. incorrect.
Step 2: Train SpatialScore (the reward model)
- What happens: Use a vision-language backbone to read both text and image. Add a special “reward token” that attends to everything; a small head outputs a score distribution. Train it with preference learning so the correct image scores higher than the perturbed one.
- Why this step exists: A strong, specialized judge is the core enabler; without it, RL optimizes the wrong signals.
- Example: For the desk pair, SpatialScore learns to output a higher score for the right-side mouse image and a lower score for the left-side mouse image given the same prompt.
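The ranking objective can be sketched as a Bradley–Terry style preference loss, a standard choice for this kind of pairwise training (the summary doesn't name the exact loss, so treat this as an assumption; the scores are placeholders):

```python
import math

def preference_loss(score_winner, score_loser):
    """Bradley–Terry style loss: -log sigmoid(s_winner - s_loser).
    Small when the judge ranks the correct image above the perturbed one."""
    margin = score_winner - score_loser
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

confident = preference_loss(8.0, 2.0)  # judge strongly prefers the winner
wrong = preference_loss(2.0, 8.0)      # judge prefers the spatially wrong image
```

Minimizing this loss pushes the winner's score up and the loser's down, which is exactly the ranking behavior Step 2 describes.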
Step 3: Online RL with SpatialScore
- What happens (like a classroom loop):
  1. Sample a prompt (often from the same complex-prompt pool used to build the dataset).
  2. The policy (image generator; here a flow/diffusion family model) creates a group of images for that prompt.
  3. SpatialScore rates each image’s spatial correctness.
  4. Compute advantages: how much each image’s score is above/below the group reference.
  5. Update the generator with policy gradients (GRPO-style), nudging it toward higher-scoring layouts while keeping it close to a stable reference (via a KL regularizer).
- Why this step exists: Directly trains the generator to satisfy spatial relations, not just be pretty or semantically close.
- Example: Prompt: “Two apples on the table, red on the left, green on the right; two more on a rack above in the same color order.” The generator proposes 24 images; SpatialScore highlights which ones nailed both rows’ left-right order; the model learns from those.
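One iteration of that loop can be sketched with stand-in functions for the generator and SpatialScore (the real update also weights each sample's log-probability gradient by its advantage and applies the KL regularizer and LoRA adapters):

```python
import random
import statistics

def grpo_step(prompt, generate, score, group_size=24):
    """Sketch of one GRPO-style iteration: sample a group, score it,
    and return per-sample advantages used to weight the policy update."""
    images = [generate(prompt) for _ in range(group_size)]
    rewards = [score(prompt, img) for img in images]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    advantages = [(r - mean) / std for r in rewards]
    return images, advantages

random.seed(0)
images, adv = grpo_step(
    "a cup right of the laptop",
    generate=lambda p: random.random(),  # stand-in "image" for the sketch
    score=lambda p, img: img,            # stand-in reward for the sketch
)
```

Group-relative advantages sum to zero by construction: above-average layouts get reinforced, below-average ones get discouraged.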
Mini Sandwiches inside Step 3:
- 🍞 Hook: Like exploring different drafts by adding a bit of randomness. 🥬 Stochastic sampling for exploration: Convert the generator’s deterministic sampler into a stochastic one (SDE-style) so the model can try diverse layouts each step. Why: RL needs exploration or it can’t discover better placements. 🍞 Example: Slight noise produces variations in object positions, helping the learner compare and improve.
- 🍞 Hook: Class averages can be unfair when most essays are great—or not. 🥬 Advantage normalization: Compare each image’s score to the group statistics to get a fair learning signal. Why: Without normalization, updates can be unstable across easy vs. hard prompts. 🍞 Example: In a mostly-good batch, we still recognize which are best and which slip up.
Step 4: Top-k filtering (the secret sauce)
- What happens: After scoring and ranking the group, keep only the top-k and bottom-k images to compute the group mean/std and to update the policy.
- Why this step exists: Prevents a high group average (in easy prompts) from accidentally giving negative signals to still-good samples; also reduces the number of required function evaluations, saving compute.
- Example: With group size 24 and 6 denoising steps, filtering to 2Ă—k images (e.g., k=6) cuts training NFEs dramatically, yet preserves the most informative signals.
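Using the numbers quoted above (group size 24, 6 denoising steps, k = 6), and assuming the update passes run only over the kept samples, the per-prompt saving works out as:

```python
group_size = 24      # images sampled per prompt
denoise_steps = 6    # function evaluations per image
k = 6                # top-k and bottom-k kept for the update

nfe_full = group_size * denoise_steps     # update over the whole group
nfe_filtered = 2 * k * denoise_steps      # update over top-k + bottom-k only
saving = 1 - nfe_filtered / nfe_full      # fraction of update NFEs avoided
```

Halving the samples that enter the update halves its NFEs, while keeping the extremes, which carry most of the learning signal.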
Putting it all together in practice:
- Inputs: Long, spatially complex prompts; a base image generator that supports long text; SpatialScore for evaluation.
- Process: For each prompt, sample a diverse group; SpatialScore rates; top/bottom-k filtered normalization; GRPO update with a small KL regularizer and LoRA adapters for efficient fine-tuning.
- Outputs: A generator that more reliably places objects as instructed across both short and long prompts.
What breaks without each step:
- Without the specialized dataset: The judge learns aesthetics instead of relations.
- Without preference training: Absolute scores drift; rankings provide sharper supervision.
- Without online RL: The generator never directly optimizes for spatial faithfulness.
- Without top-k filtering: Easy/hard prompt imbalance biases updates and wastes compute.
Concrete walkthrough:
- Prompt: “On a beach: towel centered in front of the umbrella pole; basket left of the towel; volleyball right of the towel.”
- The model proposes 24 images with small layout differences.
- SpatialScore gives scores: correct left/right placements get high; swapped ones get low.
- Keep top-6 and bottom-6, compute fair advantages.
- Update the generator. Next round, more candidates land the left/right placements right without sacrificing style.
04 Experiments & Results
The test: What did they measure and why?
- Pairwise preference accuracy for the reward model: Given a prompt and two images (one correct, one perturbed), does the judge pick the right one? This checks if SpatialScore truly understands spatial relations.
- Spatial alignment on generation benchmarks: After RL-tuning, does the image generator place objects correctly on both short and long, multi-object prompts?
The competition: Who was compared against whom?
- Reward models and VLMs: PickScore, ImageReward, HPS variants, UnifiedReward, open-source Qwen2.5-VL series (7B→72B), and leading proprietary systems.
- Generation baselines: The original base model (Flux.1-dev) and a variant tuned with a rule-based evaluator (GenEval) via Flow-GRPO.
The scoreboard with context:
- Reward benchmark (pairwise preference): SpatialScore (7B) achieved around 95.8% accuracy, which is like scoring an A+ when strong competitors got A or B. On their curated evaluation set, it even edged out top proprietary systems.
- In-domain SpatialScore metric (generator after RL): Flux.1-dev improved from about 2.18 to 7.81 on their spatial score—think of moving from barely passing to strong proficiency on spatial quizzes.
- Out-of-domain spatial sub-dimensions (e.g., DPG-Bench, TIIF-Bench, UniGenBench++): The RL-tuned model showed consistent gains on both short and long prompts, with especially notable improvements on relation-heavy tasks and 2D/3D layout checks.
- Versus rule-based GenEval training: The Flow-GRPO model trained on GenEval sometimes improved short, simple prompts but struggled or regressed on long, complex prompts, and occasionally produced artifacts (like floating objects), showing weaker generalization.
Surprising findings:
- Specialized beats general: A 7B specialized SpatialScore trained on targeted pairs outperformed much larger, general-purpose open-source VLMs on spatial preference—and, on their benchmark, even top proprietary models.
- Long-prompt generalization: Training on human-verified, relation-rich prompts plus online RL generalized better to long instructions than rule-based evaluators.
- Efficiency from top-k: Selecting only top/bottom-k for updates not only stabilized learning but also lowered compute (fewer function evaluations per step) without hurting—and sometimes improving—final scores.
Concrete example insight:
- Where rule-based checks falter (occlusion, ordering, subtle alignment), SpatialScore still judges correctly. For instance, counting bananas as a “bunch” might pass a detector, but if the prompt requires a precise left/right arrangement with other objects, SpatialScore catches incorrect layouts.
Takeaway: A focused, high-quality reward model transforms RL for image generation: the model doesn’t just include the right objects; it reliably puts them in the right places across complex, multi-object scenes.
05 Discussion & Limitations
Limitations:
- Video temporal reasoning: The method targets spatial layouts in single images. It doesn’t yet handle sequences where relations change over time (e.g., swap A and C after moving A left of B).
- Coverage of visual complexities: While strong on spatial relations, it doesn’t directly optimize for fine-grained textures, global aesthetics, or photorealism beyond what the base model already provides.
- Dataset biases: Even with three strong generators and human checks, some scene types or relation styles may be underrepresented.
Required resources:
- Training hardware: Reported runs used multiple H20 GPUs (e.g., 8 for reward model training; 32 for online RL). LoRA helps keep adaptation efficient, but RL remains compute-intensive.
- Data curation: Building high-quality preference pairs requires careful prompt design, generation, and human verification.
When not to use:
- If your prompts rarely include relative positions (e.g., abstract art prompts), the gains may be minimal.
- If you need only single-object presence or color checks, simpler, cheaper evaluators may suffice.
- Ultra-low compute settings may find online RL too heavy; consider offline reranking with SpatialScore instead.
Open questions:
- Video extension: How to couple spatial and temporal rewards so models respect changing relations across frames?
- Multi-skill rewards: Can we combine spatial, color, count, and style into a balanced multi-objective RL without over-optimization?
- Robustness: How to make the reward model even more resilient to occlusion, clutter, and camera viewpoint changes?
- Data scaling: What’s the return on scaling data beyond 80k pairs or diversifying sources/models further?
- Safety and bias: How to audit and mitigate any unintended biases introduced by reward modeling in scene composition?
06 Conclusion & Future Work
Three-sentence summary:
- The paper builds SpatialReward-Dataset—80k human-verified preference pairs focused on spatial relations—and trains SpatialScore, a specialized reward model that accurately scores spatial consistency.
- Plugging SpatialScore into online RL (with GRPO-style updates and a top-k filtering trick) teaches an image generator to place objects correctly, especially in long, multi-object prompts.
- Experiments show strong, consistent gains across in-domain and out-of-domain benchmarks, often outperforming models trained with rule-based evaluators.
Main achievement:
- Proving that a targeted, high-fidelity spatial reward model—trained on adversarial preference pairs—can reliably drive online RL to fix the specific, stubborn problem of spatial misplacement in text-to-image generation.
Future directions:
- Extend from static images to videos where spatial relations evolve over time; explore multi-objective reward modeling that balances spatial accuracy with counting, color, and style; study reward robustness to occlusion and viewpoint shifts; and scale and diversify datasets for broader generalization.
Why remember this:
- It shows that “better judging” unlocks “better generation.” With the right, specialized teacher (SpatialScore), image models don’t just include the right things—they put them in the right places, making AI imagery more trustworthy and useful for real-world, instruction-heavy tasks.
Practical Applications
- Interior design mockups that honor exact furniture placement (left/right/behind) from client prompts.
- Product photography generation where items must be arranged in specific orders and alignments.
- Educational diagrams (e.g., science or geography) that precisely follow positional instructions.
- Retail planograms ensuring shelves and items appear in correct locations relative to each other.
- Robotics scene simulation images where object layouts must match task constraints for training.
- UI/UX storyboard generation with widgets placed in exact positions for design reviews.
- Safety posters and wayfinding graphics that require accurate directional relationships.
- Comics and storyboards where characters and props must keep consistent spatial continuity.
- Event layout previews (seating charts, stages, booths) following explicit, multi-object instructions.
- AR preview scenes that place digital objects correctly relative to real-world anchors described in text.