
Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Beginner
Zengbin Wang, Xuecai Hu, Yong Wang et al. · 1/28/2026
arXiv · PDF

Key Summary

  • Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.
  • This paper introduces SpatialGenEval, a big, careful test that checks a model’s spatial smarts using long, information-packed prompts.
  • It measures 10 kinds of spatial skills, from basic object attributes to tricky ideas like occlusion, motion, and cause and effect.
  • Each prompt comes with 10 multiple-choice questions so we can see exactly where a model succeeds or fails.
  • Across 23 leading models, the hardest parts are spatial reasoning tasks like comparison and occlusion, where accuracy often hovers near random guessing.
  • Models that use stronger text encoders (like LLM-based encoders) do better at understanding dense prompts and placing things correctly.
  • They also build SpatialT2I, a 15,400-pair training set created by rewriting prompts to match generated images, which helps models improve.
  • Fine-tuning popular models with SpatialT2I raises spatial accuracy by about 4–6 percentage points.
  • The benchmark uses careful evaluation rules, including a ‘None of the above’ choice and 5-vote stability, to avoid lucky guesses.
  • This work shifts evaluation from ‘what is in the picture’ to ‘where, how, and why things are arranged and interacting’.

Why This Research Matters

When images guide decisions—like assembling furniture, planning a room, or showing a safety procedure—objects must be in the right places and interact realistically. This benchmark makes sure AI can handle those details by testing not only what appears in the picture, but where, how, and why. Designers and advertisers can trust that layouts, sizes, and directions match their intentions. Teachers and students can use AI images that correctly depict cause and effect, like how wind makes a flag wave. Robotics and embodied AI benefit from clearer, more accurate depictions of spatial relationships. E-commerce images become more dependable, reducing confusion about product sizes and placements. Overall, this work helps AI move from pretty pictures to purposeful, trustworthy visuals.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): Imagine asking a friend to draw your room: bed on the left, desk facing the window, a lamp behind the chair, and your cat peeking from under the bed. If they draw everything but put the bed in the middle and the cat on top, it looks nice but it’s not what you asked for.

🥬 Filling (The Actual Concept):

  • What it is: Text-to-image (T2I) models are computer artists that turn descriptions into pictures.
  • How it works: You give a detailed description; the model decodes the words into visual ideas; then it paints an image step by step.
  • Why it matters: If the model misunderstands where things go, even a beautiful image can be wrong for design, safety, or storytelling.

🍞 Bottom Bread (Anchor): When you say 'a red ball under a glass table to the right of a blue chair,' you don’t want the ball on the chair or behind the table.

🍞 Top Bread (Hook): You know how a good Lego build needs pieces placed in the right spots, facing the right directions? Building a scene in an image is just like careful Lego placement.

🥬 Filling (The Problem):

  • What it is: Many T2I models nail the ‘what’ (objects and their attributes) but stumble on the ‘where,’ ‘how,’ and ‘why’ (positions, layouts, motion, and cause-effect).
  • How it works: Current tests often use short prompts and yes/no checks, so models aren’t pushed to follow complex spatial rules.
  • Why it matters: Without hard tests, models look strong but secretly miss the skills needed for complex, real-world scenes.

🍞 Bottom Bread (Anchor): A model might draw two dogs and a bridge (great!), but if the prompt said, 'two small black dogs and a large yellow dog leap across the bridge to the right, one after another,' the order, sizes, and direction must be correct too.

🍞 Top Bread (Hook): Think of a school quiz with only true/false questions. You can guess and still pass, but that won’t tell you whether you really understand.

🥬 Filling (Failed Attempts):

  • What it is: Older benchmarks focused on short prompts and coarse checks: Is a 'dog' present? Is it 'brown'?
  • How it works: These tests reward spotting objects but rarely challenge spatial reasoning like 'A is twice as tall as B' or 'C blocks D.'
  • Why it matters: Models pass easy tests yet fail on harder, real-life placements and interactions.

🍞 Bottom Bread (Anchor): Passing a yes/no test doesn’t mean you can follow a map.

🍞 Top Bread (Hook): Imagine a treasure map with all the clues tightly packed: landmarks, distances, directions, and what blocks your path. That’s the kind of prompt models actually need to prove they understand space.

🥬 Filling (The Gap):

  • What it is: We lacked a benchmark that uses long, information-dense prompts and checks many spatial skills at once.
  • How it works: A comprehensive test should cover foundations (objects, attributes), perception (position, orientation, layout), reasoning (comparison, proximity, occlusion), and interaction (motion, causality).
  • Why it matters: Without this, we can’t see exactly where models struggle or how to fix it.

🍞 Bottom Bread (Anchor): If a story says 'the taller tree blocks the sun' and 'birds fly toward the mountain,' a good model must show all of that at once, not just pick one detail.

🍞 Top Bread (Hook): Picture a busy kitchen: an oven mitt to the right of the stove, a pot in front, and steam rising and causing condensation on the window. If an AI can place and connect all these pieces correctly, it’s closer to truly understanding scenes.

🥬 Filling (Real Stakes):

  • What it is: Spatial intelligence affects design mockups, e-commerce photos, education, robotics, and safety illustrations.
  • How it works: Correct placement and interactions make images useful and trustworthy.
  • Why it matters: When images control expectations in the real world (e.g., assembly, navigation, instructions), wrong spatial details can mislead or cause errors.

🍞 Bottom Bread (Anchor): An architect’s concept image with doors placed behind walls isn’t just funny—it’s unusable.

02 Core Idea

🍞 Top Bread (Hook): You know how a good dance judge doesn’t just look at costumes, but scores timing, spacing, formations, and lifts too? One score can’t capture all that skill.

🥬 Filling (The 'Aha!' Moment):

  • What it is: The key insight is to test spatial intelligence with long, tightly packed prompts and grade 10 spatial skills at once using multi-choice questions, then use the same idea to build training data that teaches models to do better.
  • How it works: Create prompts that require many spatial constraints simultaneously; pair each prompt with 10 questions (one per skill); evaluate the generated image only by what’s visible; add guardrails (no prompt to the judge, 'None of the above' option, and multi-vote stability); and construct a rewritten text-image dataset (SpatialT2I) to fine-tune models.
  • Why it matters: This exposes exact weaknesses (like occlusion or comparison) and provides a data-centric path to fix them.

🍞 Bottom Bread (Anchor): It’s like giving a cooking exam where students must bake, plate, time, and season perfectly—and then using their mistakes to build better practice recipes.

🍞 Top Bread (Hook): Imagine three analogies:

  1. Orchestra: Not just correct instruments (objects), but who sits where (layout), who plays louder (comparison), and who cues whom (causality).
  2. Sports play: Players’ positions (position), facing (orientation), formations (layout), who is closest (proximity), who blocks (occlusion), who moves where (motion), and what causes the score (causal).
  3. Lego city: Pieces exist (foundation), are placed precisely (perception), sized and ordered (reasoning), and interact (interaction).

🥬 Filling (Before vs After):

  • Before: Benchmarks checked simple object presence with short prompts; models looked strong but hid spatial gaps.
  • After: SpatialGenEval reveals detailed strengths and weaknesses across 10 sub-domains; models clearly lag on higher-order reasoning.
  • And: SpatialT2I shows that information-dense, aligned data measurably lifts spatial skills via fine-tuning.

🍞 Bottom Bread (Anchor): Previously, a model could get an 'A' for drawing a 'dog' anywhere. Now, it needs to draw the dog under the table, facing left, smaller than the chair, and blocking a shadow—then it gets graded on each part.

🍞 Top Bread (Hook): Think of scoring a dance routine step by step, not just clapping at the end.

🥬 Filling (Why It Works—Intuition):

  • What it is: Dense prompts force the model to juggle multiple constraints; targeted questions isolate each skill; strict judging prevents cheating by text-only clues.
  • How it works: If the image doesn’t visibly show a fact, the judge can pick 'None'—so the model can’t pass by luck.
  • Why it matters: The setup rewards true spatial understanding, not shortcuts.

🍞 Bottom Bread (Anchor): If the prompt says 'red chair is twice as large as the blue chair,' the question asks exactly that. If the picture doesn’t show it, the model loses that point.

🍞 Top Bread (Hook): You know how pilots run through checklists before takeoff? This benchmark is a spatial checklist for AI.

🥬 Filling (Building Blocks):

  • What it is: A hierarchical framework: 4 domains and 10 sub-domains.
  • How it works: Foundations (objects, attributes) → Perception (position, orientation, layout) → Reasoning (comparison, proximity, occlusion) → Interaction (motion, causality).
  • Why it matters: This tracks growth from easy to advanced spatial skills and pinpoints exact trouble spots.

🍞 Bottom Bread (Anchor): If a model draws the right car (foundation) but places it behind a wall it should block (occlusion) or forgets it’s smaller than a nearby truck (comparison), we know exactly which skill failed.
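If you like to see the checklist written out, here is a minimal sketch of that 4-domain, 10-sub-domain hierarchy as a plain Python mapping. The S1–S10 labels follow the paper; the dictionary layout itself is just our illustrative convenience, not code from the authors.

```python
# Hypothetical sketch of the SpatialGenEval hierarchy: 4 domains -> 10 sub-domains.
# Labels follow the paper's S1-S10 naming; the dict layout is our own convenience.
SPATIAL_DOMAINS = {
    "foundation":  ["S1_object_category", "S2_object_attribution"],
    "perception":  ["S3_position", "S4_orientation", "S5_layout"],
    "reasoning":   ["S6_comparison", "S7_proximity", "S8_occlusion"],
    "interaction": ["S9_motion", "S10_causality"],
}

# Flat list of all 10 skills, e.g. for per-sub-domain score bookkeeping.
ALL_SUBDOMAINS = [s for skills in SPATIAL_DOMAINS.values() for s in skills]
assert len(ALL_SUBDOMAINS) == 10
```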

03 Methodology

At a high level: Scene selection and definitions → Generate a long, information-dense prompt (10 spatial constraints) → Generate 10 matching multi-choice questions (one per skill) → T2I model generates image → MLLM judge answers questions using only the image with a 5-vote rule and 'None' option → Compute accuracy per sub-domain and overall.
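Here is a minimal Python sketch of that loop, to make the flow concrete. The function names (generate_image, ask_judge) and data layout are hypothetical placeholders for the paper's actual tooling; only the overall structure (one image per prompt, 10 questions per image, 5-vote scoring with a 4-vote agreement threshold, per-sub-domain accuracy) mirrors the pipeline described above.

```python
from collections import Counter, defaultdict

VOTES, MIN_AGREEMENT = 5, 4  # a question counts only if at least 4 of 5 votes agree

def evaluate_model(prompts, generate_image, ask_judge):
    """Hypothetical evaluation loop mirroring the SpatialGenEval pipeline.

    prompts: list of dicts like {"text": ..., "questions": [10 MCQs, each with
             "subdomain", "options" (A-E, where E is 'None'), and "answer"]}
    generate_image(text) -> image; ask_judge(image, question) -> option letter.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for p in prompts:
        image = generate_image(p["text"])          # the T2I model sees the prompt
        for q in p["questions"]:                   # the judge sees only the image
            votes = Counter(ask_judge(image, q) for _ in range(VOTES))
            pick, count = votes.most_common(1)[0]
            ok = count >= MIN_AGREEMENT and pick == q["answer"]
            correct[q["subdomain"]] += ok
            total[q["subdomain"]] += 1
    per_skill = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_skill, overall
```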

🍞 Top Bread (Hook): You know how building a puzzle starts with the border, then sections, then final checks? This pipeline builds and checks scenes the same way.

🥬 Filling (Step-by-step):

  • What it is: A semi-automated, human-in-the-loop recipe to create tough prompts, precise questions, and fair evaluations.
  • How it works: Three stages: prompt creation, question creation, and evaluation.
  • Why it matters: Each stage prevents shortcuts, errors, and leaks, ensuring scores reflect true spatial skill.

🍞 Bottom Bread (Anchor): It’s like writing a riddle, writing the answer key, and making sure the solver only sees the picture, not the riddle’s text.

Stage 1: Information-dense, spatial-aware prompt generation

🍞 Hook: Imagine asking for a sandwich with specific bread, fillings, exact layers, and cut direction. That order matters.

🥬 The Concept: Prompts must pack all 10 spatial sub-domains into one coherent, ~60-word scene description.

  • How it works (recipe):
    1. Pick a scene from 25 real-world scenes (e.g., park, kitchen, library, forest, classroom).
    2. An MLLM (Gemini 2.5 Pro) drafts prompts that include objects, attributes, positions, orientations, layouts, comparisons, proximity, occlusion, motion, and causality.
    3. Human experts refine: remove odd words, fix logic (no impossible cycles), ensure clarity and completeness.
  • Why it matters: Short prompts hide weaknesses; dense prompts surface them.

🍞 Anchor: 'Under a giant oak, five kids sit in a semi-circle facing a storyteller; a kite’s string crosses the sun, casting a shadow on the middle child’s book...'—all in one prompt.
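As a rough illustration of Stage 1 (not the authors' actual instruction), a drafting request to the MLLM could be assembled roughly like this. The template wording is hypothetical, but it shows how one scene, the 10 sub-domain requirements, and the ~60-word budget come together in a single drafting call before human refinement.

```python
# Hypothetical drafting template for Stage 1; the real instructions given to
# Gemini 2.5 Pro are not reproduced in this summary.
PROMPT_TEMPLATE = """Write ONE coherent, roughly 60-word scene description set in a {scene}.
It must simultaneously specify, for concrete objects in the scene:
object categories, attributes, positions, orientations, an overall layout,
a size or quantity comparison, a proximity relation, an occlusion relation,
a visible motion, and a cause-and-effect link.
Keep it natural, unambiguous, and physically plausible."""

def draft_prompt(scene: str, mllm_call) -> str:
    """mllm_call is a placeholder for the MLLM API; human experts then refine the draft."""
    return mllm_call(PROMPT_TEMPLATE.format(scene=scene))
```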

Stage 2: Omni-dimensional multi-choice QAs generation

🍞 Hook: Think of a 10-question pop quiz where each question checks a different skill.

🥬 The Concept: Each prompt gets 10 multiple-choice questions, one per sub-domain, with plausible distractors and a ground-truth answer drawn from the prompt.

  • How it works (recipe):
    1. The MLLM generates 10 QAs guided by domain definitions and examples.
    2. Humans remove any wording that leaks the answer and make questions visually checkable.
    3. Add option E: 'None' so judges can refuse to guess if the image doesn’t match any choice.
  • Why it matters: Fine-grained QAs let us diagnose exact strengths and weaknesses.

🍞 Anchor: 'How are the five birds arranged?' Choices: straight line, scattered, V formation, circle, or None.
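A minimal sketch of what one such question could look like as data; the field names and example content here are invented for illustration. Only the shape (one question per sub-domain, five options with E reserved for 'None') follows the paper.

```python
# Hypothetical record for a single Stage 2 question (sub-domain S5: layout).
example_question = {
    "subdomain": "S5_layout",
    "question": "How are the five birds arranged in the image?",
    "options": {
        "A": "In a straight line",
        "B": "Scattered randomly",
        "C": "In a V formation",
        "D": "In a circle",
        "E": "None of the above",   # lets the judge refuse to guess
    },
    "answer": "C",  # ground truth taken from the prompt, hidden from the judge
}
```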

Stage 3: Evaluation with an MLLM judge

🍞 Hook: Picture a referee who only watches the instant replay and cannot hear the coach’s plan.

🥬 The Concept: The judge sees only the image and the 10 questions—never the text prompt—to avoid 'answer leakage.'

  • How it works (recipe):
    1. Use a strong open-source judge (Qwen2.5-VL-72B) for reproducibility; also compare with GPT-4o.
    2. Enforce rules: no external knowledge, allow 'None,' and use 5-round voting; a question is correct only with at least 4 identical picks.
    3. Report balanced accuracy per sub-domain and overall.
  • Why it matters: Prevents cheating by memorizing the text and stabilizes scores.

🍞 Anchor: If the image doesn’t clearly show which object is closest, the judge should pick 'None' rather than guess.
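For instance, the instruction handed to the judge might be assembled along these lines (a hypothetical sketch; the paper's actual judge prompt is not quoted here). The key point is what is missing: the original text prompt never appears.

```python
# Hypothetical judge-side instruction for Stage 3. Note what is absent: the T2I
# prompt is deliberately NOT included; the judge answers from the image alone.
JUDGE_RULES = (
    "Answer using ONLY what is clearly visible in the image. "
    "Do not use outside knowledge or guess. "
    "If no option is clearly supported by the image, answer E (None)."
)

def build_judge_query(question: dict) -> str:
    """Turn one multiple-choice question into the text shown alongside the image."""
    opts = "\n".join(f"{k}. {v}" for k, v in question["options"].items())
    return f"{JUDGE_RULES}\n\n{question['question']}\n{opts}\nReply with a single letter."
```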

Focused Sub-domains (each as mini sandwich)

  • 🍞 Hook: You know how you first check what Lego bricks you have? 🥬 Spatial Foundation (S1 Object Category, S2 Object Attribution): What it is: Can the model draw the right objects with the right attributes? How: Verify category completeness (no missing or extra objects) and attribute binding (colors, materials) to the correct items. Why: If the basics are wrong, everything else collapses. 🍞 Anchor: 'Green snail on the left, brown cracked-shelled snail on the right.'

  • 🍞 Hook: Next, you place bricks exactly where they belong. 🥬 Spatial Perception (S3 Position, S4 Orientation, S5 Layout): What it is: Precise placement, facing, and group arrangements. How: Check absolute/relative positions, facing left/right/up/down, and formations (line, circle, sequence). Why: Without this, scenes feel jumbled. 🍞 Anchor: 'Two kids back-to-back on the rug; five children in a semi-circle around the storyteller.'

  • 🍞 Hook: Now compare sizes and distances, and who hides whom. 🥬 Spatial Reasoning (S6 Comparison, S7 Proximity, S8 Occlusion): What it is: Relative size/quantity, nearest/far, and 3D layering. How: Ask 'twice as big,' 'closest to,' 'partially blocking.' Why: This is where many models fail—harder than just placing items. 🍞 Anchor: 'Red chair appears twice as large as blue chair; vase tips and partially obscures the book.'

  • 🍞 Hook: Finally, catch the action and what causes what. 🥬 Spatial Interaction (S9 Motion, S10 Causality): What it is: Mid-action poses and cause-effect logic. How: Detect jumping, flying, pouring, and what directly caused a result (e.g., sound wave shatters glass). Why: Brings scenes to life logically. 🍞 Anchor: 'Player jumping to spike the volleyball; the sound wave hits glass and it shatters.'

Secret Sauce

🍞 Hook: The best recipes have both the right ingredients and the right tasting test.

🥬 The Concept: Combine dense prompts, omni-QAs, no-leak judging, 'None' option, voting, and a new SpatialT2I dataset that rewrites prompts to match images for training.

  • Why it matters: You can both measure and improve spatial intelligence.

🍞 Anchor: After testing, take the model’s mistakes and turn them into targeted practice data.

04 Experiments & Results

🍞 Top Bread (Hook): Imagine a league table where teams aren’t just judged on goals, but also passing accuracy, formation discipline, and smart plays.

🥬 Filling (The Test):

  • What it is: Evaluate 23 top text-to-image models—diffusion, autoregressive, unified, and closed-source—on the 10 spatial sub-domains.
  • How it works: For each of 1,230 prompts across 25 scenes, models generate an image; the judge answers 10 multi-choice questions; accuracy is averaged per sub-domain and overall.
  • Why it matters: This reveals exactly where models shine or stumble.

🍞 Bottom Bread (Anchor): If a model draws all objects correctly but fails the 'who blocks whom' question, we know it struggles with occlusion.

The Competition

  • Diffusion (e.g., SD-XL, FLUX.1, Qwen-Image), Autoregressive (OmniGen2, NextStep-1, Infinity), Unified (Janus-Pro, Bagel, UniWorld-V1), and closed-source (GPT-Image-1, DALL-E-3, Nano Banana, Seed Dream 4.0).

The Scoreboard (with context)

  • Top overall scores are in the low 60s: Seed Dream 4.0 ≈ 62.7%, Qwen-Image ≈ 60.6%, closed-source leaders near 60–63%. This is like barely passing when the test is very hard for everyone.
  • Foundation (objects, attributes) often >70% for top models; but reasoning tasks (comparison, occlusion) often drop below 30%—near random guessing (20%). That’s like getting an A on spelling but a D on story logic.
  • Text encoder strength matters: models with powerful LLM-based encoders outperform those using only standard CLIP encoders on dense prompts.
  • Unified architectures can be parameter-efficient: Bagel (7B) competes with larger diffusion models.

Surprising Findings

  • Motion interaction errors are lower than expected (often under 18%), while relational reasoning (comparison, occlusion) remains the main bottleneck.
  • Open-source vs closed-source gap is shrinking; best open-source approaches are catching up.
  • Judge robustness: Rankings are consistent between Qwen2.5-VL-72B and GPT-4o, and a human-alignment study shows roughly 80–84% balanced accuracy against human judgments.
  • Anti-leak and stability help: Not showing prompts to the judge, adding 'None,' and 5-vote rules reduce guessing; without an image, accuracy falls below random, confirming visual grounding.

Data-Centric Improvement (SpatialT2I)

  • Built 15,400 image-text pairs by rewriting prompts to match generated images while keeping information density. Fine-tuning shows consistent gains: +4.2% on Stable Diffusion-XL, +5.7% on UniWorld-V1, +4.4% on OmniGen2.
  • Ablations suggest higher-quality subsets yield bigger boosts; scaling data steadily improves performance.
  • Prompt rewriting (without retraining) also helps, especially on explicit relations like position and comparison; it helps less with implicit 3D reasoning like occlusion.

🍞 Bottom Bread (Anchor): Think of a student who keeps mixing up 'who’s taller' and 'who’s in front.' The new practice set drills exactly those weak spots until scores go up.
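As a rough sketch of the data-centric idea (the function and the rewriting call are hypothetical, and the paper's exact pipeline is not reproduced here), each SpatialT2I training pair keeps the generated image and swaps in a caption rewritten to describe what that image actually shows, at a similar information density:

```python
def build_spatialt2i_pair(prompt: str, image, rewrite_mllm) -> dict:
    """Hypothetical sketch of one SpatialT2I pair: the generated image is kept,
    and the text is rewritten by an MLLM so it matches what is actually depicted,
    while preserving the dense, multi-constraint style of the original prompt."""
    rewritten = rewrite_mllm(
        image=image,
        instruction=(
            "Rewrite the following description so that every spatial detail "
            "(positions, orientation, layout, comparison, proximity, occlusion, "
            "motion, causality) matches this image exactly, keeping a similar "
            f"length and information density:\n{prompt}"
        ),
    )
    return {"image": image, "text": rewritten}  # used as one fine-tuning example
```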

05 Discussion & Limitations

🍞 Top Bread (Hook): Even a great map can’t show every pebble on the road, and you still need a good driver.

🥬 Filling (Limitations):

  • What it is: Two main limits—scaling data creation and coverage scope.
  • How it works: The semi-automated, human-in-the-loop process ensures quality but is labor-intensive. The 10 sub-domains are broad but cannot capture every spatial phenomenon (like fluid dynamics or complex deformations).
  • Why it matters: Some real-world spatial challenges remain outside this benchmark’s current reach.

Resources Required

  • To use the benchmark at scale, you need: access to T2I models, a capable MLLM judge (open-source or API), and moderate compute (the paper reports about 1.8 seconds per image on 8×H20 GPUs using vLLM).
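For teams wiring this up themselves, a rough sketch of loading an open-source judge with vLLM might look like the following. This is an illustrative guess at a setup, not the authors' script; it assumes a recent vLLM build with multimodal chat support for Qwen2.5-VL, and the exact arguments and message format may differ across versions (the image is passed by URL or data URI here).

```python
# Hypothetical judge setup; assumes a recent vLLM with multimodal chat support.
# Not the authors' code: argument names and message format may vary by version.
from vllm import LLM, SamplingParams

judge = LLM(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    tensor_parallel_size=8,             # the paper reports running on 8 GPUs
    limit_mm_per_prompt={"image": 1},   # one generated image per question
)
params = SamplingParams(temperature=0.0, max_tokens=8)  # expect a single letter

def judge_once(image_url: str, query_text: str) -> str:
    """Ask one multiple-choice question about one image (URL or data URI)."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": query_text},
        ],
    }]
    out = judge.chat(messages, sampling_params=params)
    return out[0].outputs[0].text.strip()
```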

When Not to Use

  • If you only care about style or single-object renders.
  • If your domain requires physics beyond everyday scenes (e.g., turbulent fluid flow, advanced mechanics), or purely abstract art unconstrained by spatial logic.

Open Questions

  • How do we best teach 3D and physical reasoning so models understand occlusion and causality more reliably?
  • Can unified models or novel text encoders close the gap further without massive scale?
  • What’s the right balance between prompt engineering, data-centric fine-tuning, and reinforcement learning from spatial feedback?
  • How should this extend to video, where time and motion must remain consistent across frames?

🍞 Bottom Bread (Anchor): We now have a strong gym for training spatial muscles, but the heaviest lifts—true 3D and physics understanding—still need new workout plans.

06 Conclusion & Future Work

Three-sentence summary: SpatialGenEval is a comprehensive benchmark that tests text-to-image models on 10 spatial skills using long, information-dense prompts and precise, image-grounded multiple-choice questions. Experiments across 23 leading models show that while basic object drawing is strong, higher-order spatial reasoning—like comparison and occlusion—is the main bottleneck. A data-centric path, SpatialT2I, improves these skills via rewritten prompt-image pairs, yielding consistent gains with fine-tuning.

Main achievement: Turning spatial intelligence into a clear, testable checklist—then using that same structure to create training data that measurably lifts performance.

Future directions: Scale the benchmark and training data; push into text-to-video for spatio-temporal consistency; integrate stronger 3D and physics priors; explore unified models and reinforcement learning guided by spatial feedback; and expand to more nuanced interactions (e.g., deformable objects, fluids, multi-agent scenes).

Why remember this: It shifts the field’s focus from merely drawing the 'what' to understanding the 'where, how, and why'—a crucial step toward AI that can compose scenes like humans do, not just paint them.

Practical Applications

  • Product staging: Generate catalog images that maintain correct sizes, positions, and occlusions among multiple products.
  • Interior design mockups: Place furniture with correct orientation and layout to reflect realistic room plans.
  • Instructional visuals: Create step-by-step assembly images where each part is shown in the right place and order.
  • Education content: Illustrate physics and geometry scenes with proper motion and causal effects.
  • Storyboarding and comics: Ensure characters’ positions, facing, and interactions remain consistent panel to panel.
  • Safety posters: Show cause-and-effect clearly (e.g., why a hazard occurs) with accurate spatial cues.
  • AR/VR prototyping: Generate scenes where objects’ relative sizes and occlusions match user expectations.
  • Robotics simulation assets: Produce environment images that respect proximity and barrier relationships.
  • Urban planning sketches: Visualize layouts (streets, signs, traffic flows) with correct spatial logic.
  • QA for T2I systems: Use the benchmark to automatically spot and fix common spatial failure modes.
#text-to-image#spatial intelligence#occlusion#spatial reasoning#information-dense prompts#multimodal evaluation#benchmark#fine-tuning#LLM text encoder#multi-choice VQA#comparison and proximity#motion and causality#data-centric training#unified multimodal models#prompt rewriting