GEBench: Benchmarking Image Generation Models as GUI Environments
Key Summary
- This paper introduces GEBench, a new test to check if image generation models can act like real app screens that change when you click or type.
- Instead of judging only how pretty a picture is, GEBench scores whether the model followed the instruction, kept the logic of the app, stayed consistent, looked like a real UI, and had good visual quality.
- The benchmark has 700 samples across five tasks: single-step changes, multi-step plans, make-believe (fiction) apps, rare real-app journeys, and precise point-click grounding.
- GE-Score is a five-part report card: Goal Achievement, Interaction Logic, Consistency, UI Plausibility, and Visual Quality.
- Models do well on simple, one-step changes but lose track over multi-step sequences and struggle a lot with clicking exact coordinates.
- Common failures include misreading icons, messy text rendering (especially Chinese), and putting changes in the wrong place on the screen.
- A VLM-as-a-Judge system (three strong vision-language models) scores results and agrees closely with human experts (correlation about 0.99).
- Top commercial models lead the board; open-source models lag, especially on longer plans and grounding.
- The big idea is to shift from "pretty pictures" to "working screens" so future AI agents can safely learn in realistic, low-cost GUI worlds.
- GEBench points to clear next steps: better grounding, smarter planning across steps, and stricter text/icon handling.
Why This Research Matters
Apps run our daily lives: paying bills, booking visits, sharing locations. To train helpful AI agents safely, we need realistic, low-cost practice worlds where clicking a button leads to the right next screen. GEBench checks if image generation models can provide those worlds by scoring not just how good screens look, but whether they behave correctly. This reduces the risk of agents learning bad habits from illogical or blurry screens. It also pushes model builders toward better grounding, clearer text, and more stable multi-step planning. In the long run, that means smarter, safer assistants that can truly help people get things done on their devices.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're playing a tablet game where buttons switch screens. You tap "Settings," and instantly you see sliders and toggles. It's not a slow morph like a video; it's a jump to a new screen that still makes sense.
🥬 The Situation (The World Before): For years, AI artists (image generation models) got great at drawing beautiful, detailed pictures from text. Video AIs learned to make smooth, continuous motions, like a cat walking. But apps aren't like that. Apps are made of screens that change in jumps: click a button, and bam, a new screen appears. People started dreaming: what if we could train helpful AI agents inside fake app worlds made by image models? Agents could practice safely and cheaply, without needing every real app installed. The catch: we had no good way to check if these generated app screens behaved like real ones when you click or type.
🥬 The Problem: Existing tests focused on how nice images look or how smoothly videos move. They did not check the special logic of GUIs (Graphical User Interfaces): did the right screen appear after a tap? Did icons mean what they should? Did the text stay readable? Could the model place new elements exactly where a click happened? In short, we were grading paintings and movies, not checking if buttons and screens worked like real apps.
🥬 Failed Attempts: People tried using classic image scores (like FID or CLIP-score) and video benchmarks to judge these GUI changes. But these tools miss app logic. A picture could look lovely and still be a nonsense screen: wrong popup, broken menu, or changes appearing in the wrong place. Some tried full simulators tied to specific operating systems or apps, but those are expensive to build and not very flexible. They also don't test whether image models can learn app behavior from pictures and instructions alone.
🥬 The Gap: What was missing was a GUI-specific benchmark that judged: (1) Did the right change happen? (2) Was the step realistic for an app? (3) Did unchanged parts stay stable? (4) Did the UI look like a real app? (5) Was the image clear and readable? Also, real usage needs different kinds of tasks: simple one-step edits, longer five-step plans, make-believe apps (to test imagination), rare real-app situations, and precise coordinate clicks. No existing test covered all of that.
🥬 Why It Matters (Real Stakes): If we want smart assistants that can do things on your phone or computer, like booking a doctor visit, paying a bill, or turning on a setting, they need safe practice worlds. Generative GUI environments are like flight simulators for app agents. But if those worlds are illogical (wrong popups, wobbly layouts, unreadable text), the agents will learn bad habits. That could waste money, cause mistakes (like paying the wrong bill), or create accessibility problems for users who rely on assistive tech.
🍞 Anchor: Think of learning to drive in a simulator. If the simulator shows pretty roads but the steering wheel sometimes controls the radio, you won't learn real driving. We need tests that make sure the simulator acts like a real car. GEBench is that test, but for app screens.
To make the later ideas simple, let's introduce two prerequisites using the Sandwich pattern.
🍞 Hook: You know how apps have screens with buttons, lists, and text labels? 🥬 The Concept: A Graphical User Interface (GUI) is the screen layout you click and type on: buttons, menus, tabs, and text.
- How it works: 1) The app shows a screen. 2) You take an action (tap, type, scroll). 3) The app switches to the correct next screen or updates parts of the screen.
- Why it matters: Without a GUI, you can't easily tell the app what to do or see results. 🍞 Anchor: On your phone, tapping the Wi-Fi icon opens the Wi-Fi settings screen; that's a GUI doing its job.
🍞 Hook: Imagine describing a picture to a super-artist who draws it from your words. 🥬 The Concept: An image generation model makes pictures from instructions (and sometimes from a reference image too).
- How it works: 1) Read your text. 2) Understand key objects and layout hints. 3) Paint pixels to match the request.
- Why it matters: It lets computers quickly create visual scenes, designs, or edits. 🍞 Anchor: Say "a blue button labeled 'Search' at the top." The model draws a screen with that button.
02 Core Idea
🍞 Hook: Picture a school science fair. If we only judge posters by how colorful they are, we might give top prize to something pretty but scientifically wrong. That wouldn't be fair, or helpful.
🥬 The Aha! Moment (one sentence): Don't just grade how nice a generated GUI looks; grade whether it behaves like a real app when a user acts.
🥬 Multiple Analogies:
- Game Referee: A soccer ref doesn't judge jersey colors; they judge legal moves and goals. GEBench is the ref for GUI behavior, not just style.
- Comic Strip Logic: Each frame in a comic should follow from the last. GEBench checks that GUI frames change logically after actions.
- Treasure Map: It's not enough to have a beautiful map; the X must be where the treasure is. GEBench checks if taps lead to changes in the right spot (grounding).
🥬 Before vs After:
- Before: Models got praise for pretty images but weren't checked for correct app logic. A popup might appear, but not the right one; text might look stylish but be unreadable; icons might be misinterpreted.
- After: With GEBench and GE-Score, we evaluate five must-haves: did you reach the goal, was the interaction logical, did the screen stay consistent, does the UI look real, and is it visually clear?
🥬 Why It Works (intuition without math): GUI interactions are discrete jumps caused by actions (clicks, types). A single score can hide important failures, so GEBench splits the grade into five parts, like checking all the nuts and bolts on a bike, not just the paint. Using strong Vision-Language Models (VLMs) as judges scales up fair scoring and agrees closely with human experts, so we can trust the process.
🥬 Building Blocks (explained with the Sandwich pattern in the best learning order):
- 🍞 Hook: Imagine a playground where you can test if a slide is safe, a swing is sturdy, and the sandbox is clean. 🥬 The Concept: GEBench is a benchmark that tests image generation models as GUI environments: can they produce the correct next screens from actions?
- How it works: 1) Give a current screen and an instruction (or a coordinate click). 2) The model generates the next screen or a 5-step sequence. 3) VLM judges score five dimensions. 4) Scores are combined (GE-Score) and compared across tasks/models.
- Why it matters: Without a GUI-specific test, we can't tell if models behave like real apps. 🍞 Anchor: Start on a home screen; the instruction says "Open Settings." GEBench checks if the next screen really looks like Settings and makes sense.
- 🍞 Hook: Like a report card with subjects: math, reading, science, art, and PE. 🥬 The Concept: GE-Score is the combined five-part score that measures GUI behavior quality.
- How it works: 1) Judge each dimension from 0-5. 2) Normalize to percentages. 3) Average across tasks and samples.
- Why it matters: A single "pretty" score hides logic mistakes; GE-Score reveals them. 🍞 Anchor: A model could get high Visual Quality but low Interaction Logic, warning us it looks good but acts wrong.
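The judge-normalize-average recipe can be sketched in a few lines. This is an illustrative sketch, not the authors' released code; it assumes a plain unweighted mean over the five dimensions, which is what the text describes.

```python
# Minimal GE-Score sketch, assuming an unweighted mean: each dimension is
# rated 0-5, scaled to a 0-100 percentage, then averaged across dimensions
# and across samples.

DIMENSIONS = ["goal", "logic", "consistency", "ui", "quality"]

def ge_score(samples):
    """samples: list of dicts mapping each dimension to a 0-5 rating."""
    per_sample = []
    for s in samples:
        pct = [s[d] * 20.0 for d in DIMENSIONS]  # 0-5 -> 0-100
        per_sample.append(sum(pct) / len(pct))
    return sum(per_sample) / len(per_sample)

ratings = [
    {"goal": 5, "logic": 4, "consistency": 5, "ui": 4, "quality": 3},
    {"goal": 2, "logic": 1, "consistency": 4, "ui": 3, "quality": 5},
]
print(ge_score(ratings))  # -> 72.0
```

Note how the second sample's high Visual Quality (5) cannot mask its low Goal and Logic scores; the averaged dimensions keep the failure visible.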
Now the five GE-Score dimensions: 🍞 Hook: You know when a teacher asks, "Did you answer the question?" 🥬 The Concept: Goal Achievement checks if the exact requested change or final goal happened.
- How it works: Inspect the generated screen(s) and confirm the intended result is clearly there.
- Why it matters: If the goal isn't met, the rest doesn't matter. 🍞 Anchor: Instruction says "Open Wi-Fi details." If we see the Wi-Fi detail page, that's good Goal Achievement.
- 🍞 Hook: Pressing an elevator button for Floor 3 should not teleport you to the roof. 🥬 The Concept: Interaction Logic checks that each change fits real app behavior.
- How it works: Compare the action and the result: do they match normal UI patterns?
- Why it matters: Without logic, agents learn fake rules. 🍞 Anchor: Tapping a tab should switch the tab's content, not randomly open a settings dialog.
- 🍞 Hook: When you edit one paragraph, the rest of your document shouldn't scramble. 🥬 The Concept: Consistency checks that unrelated parts stay stable across frames.
- How it works: Look for drift in areas that should be unchanged.
- Why it matters: Unnecessary changes confuse agents and users. 🍞 Anchor: Opening a small dropdown shouldn't shift the whole header bar.
- 🍞 Hook: A movie set should look believable, not like cardboard scenery. 🥬 The Concept: UI Plausibility asks if UI elements are native-looking and structurally correct.
- How it works: Check for proper layering, states, and platform conventions.
- Why it matters: Fake-looking UI breaks trust and function. 🍞 Anchor: A modal should sit on top and dim the background; it shouldn't hide behind the page.
- 🍞 Hook: Reading a sign only works if the letters are clear. 🥬 The Concept: Visual Quality checks text/icon clarity and artifacts.
- How it works: Inspect sharpness and legibility; spot blurs or smears.
- Why it matters: If you can't read it, you can't use it. 🍞 Anchor: A "Search" button that's too blurry to read fails Visual Quality.
Key task types: 🍞 Hook: Flip one page in a flipbook. 🥬 The Concept: Single-step transitions test one precise change from an instruction.
- How it works: Given a screen and one action, generate the next screen.
- Why it matters: It checks fine-grained instruction following. 🍞 Anchor: "Tap the gear icon": the next screen should be Settings.
- 🍞 Hook: Following a recipe over five steps. 🥬 The Concept: Multi-step trajectories test five-step plans with temporal coherence.
- How it works: Start from a goal like "Order a coffee" and show five logical steps.
- Why it matters: Real tasks take multiple steps; errors can snowball. 🍞 Anchor: From home → open app → choose drink → pick size → checkout.
- 🍞 Hook: Build a pretend app from a detailed description. 🥬 The Concept: Fiction-app tests zero-shot creativity while staying plausible.
- How it works: No reference screen, only instructions.
- Why it matters: Tests imagination plus structure. 🍞 Anchor: "A habit tracker with tabs for Today, Week, Month" should look coherent and app-like.
- 🍞 Hook: Taking the less-traveled path. 🥬 The Concept: Real-app (rare trajectories) checks long-tail sequences not seen often.
- How it works: Follow unusual but valid flows.
- Why it matters: Agents must handle edge cases. 🍞 Anchor: "Export app data as CSV, then share" is rarer than "Open home."
- 🍞 Hook: Tap exactly here and watch the right thing happen there. 🥬 The Concept: Grounding point localization tests precise coordinate-based changes.
- How it works: Provide a point (in [0, 1000] normalized coordinates) and expect the correct anchored change.
- Why it matters: Without spatial precision, clicks don't map to the right UI parts. 🍞 Anchor: Clicking [940, 40] should open the top-right menu, not a random popup in the center.
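The [0, 1000] convention can be made concrete with a small helper. A minimal sketch, assuming the benchmark's normalized coordinates scale linearly to the screenshot's pixel grid; the function names and the tolerance check are illustrative, not from the paper:

```python
def to_pixels(point, width, height):
    """Map a [0, 1000]-normalized (x, y) point to pixel coordinates."""
    x, y = point
    return (round(x / 1000 * width), round(y / 1000 * height))

def is_anchored(point, popup_box, width, height, tol=50):
    """Rough check: does a popup's bounding box sit near the clicked point?
    popup_box is (left, top, right, bottom) in pixels; tol is pixel slack."""
    px, py = to_pixels(point, width, height)
    left, top, right, bottom = popup_box
    return (left - tol <= px <= right + tol) and (top - tol <= py <= bottom + tol)

# Clicking [940, 40] on a 1080x2400 phone screen lands near the top-right corner.
print(to_pixels((940, 40), 1080, 2400))  # -> (1015, 96)
```

A judge (human or VLM) is effectively applying a check like `is_anchored`: a menu that opens in the screen center after a top-right click fails it.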
Bonus concept used by the benchmark: 🍞 Hook: Three fair judges scoring a talent show. 🥬 The Concept: VLM-as-a-Judge uses strong vision-language models to grade results.
- How it works: Multiple VLMs score each dimension; results align strongly with humans.
- Why it matters: Scalable and reliable evaluation. 🍞 Anchor: Scores from GPT-4o, Gemini-3, and Qwen3-VL correlate with humans at ~0.99; that's tight agreement.
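Averaging a panel of judges is straightforward to sketch. This is an illustrative stand-in, assuming each judge returns a 0-5 rating per dimension; the `panel` data below is made up, standing in for real responses from models such as GPT-4o, Gemini-3, and Qwen3-VL.

```python
# Sketch of multi-judge aggregation: each judge rates every dimension 0-5,
# and the panel's score per dimension is the mean across judges.

DIMENSIONS = ["goal", "logic", "consistency", "ui", "quality"]

def aggregate_judges(judge_outputs):
    """judge_outputs: list of {dimension: 0-5 rating} dicts, one per judge."""
    return {
        d: sum(j[d] for j in judge_outputs) / len(judge_outputs)
        for d in DIMENSIONS
    }

panel = [
    {"goal": 5, "logic": 4, "consistency": 5, "ui": 4, "quality": 4},  # judge A
    {"goal": 4, "logic": 4, "consistency": 5, "ui": 3, "quality": 4},  # judge B
    {"goal": 5, "logic": 3, "consistency": 4, "ui": 4, "quality": 4},  # judge C
]
print(aggregate_judges(panel)["goal"])
```

Averaging across judges dampens any single model's bias, which is part of why the panel tracks human raters so closely.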
03 Methodology
At a high level: Input (current GUI + instruction or point) → Model generates next screen(s) → VLM judges score five dimensions → Compute GE-Score → Analyze across five task types.
Step-by-step (like a recipe):
- Pick a Task Type
- What happens: Choose one of five suites: Single-step, Multi-step, Fiction-app, Real-app, or Grounding.
- Why it exists: GUIs require different skills (precision, planning, creativity, long-tail reasoning, spatial grounding). Testing all gives a full picture.
- Example: Single-step: "Tap 'Settings'." Multi-step: "Create a recurring bill reminder." Grounding: "Click [100, 80]."
- Provide Inputs
- What happens: The model receives a reference screen and an instruction (except Fiction-app, which has no reference), or a coordinate to click (Grounding).
- Why it exists: Realistic GUI changes depend on both the current state and the user's action.
- Example: Start from a phone settings screen + "Enable Bluetooth." Or start from a desktop + "Open Zoom."
- Generate Next State(s)
- What happens: The image model outputs either the next screen (Single-step, Grounding) or a five-frame sequence (Multi-step, Fiction-app, Real-app).
- Why it exists: We need visual evidence of how the GUI would change.
- What breaks without it: We couldn't test logic or consistency, only text guesses.
- Example: For "Open Calendar, add event," we expect a 5-frame journey ending on an event details page.
- Judge with VLMs (VLM-as-a-Judge)
- What happens: Three strong vision-language models (e.g., GPT-4o, Gemini-3, Qwen3-VL) independently score each of the five dimensions: Goal, Logic, Consistency, UI, Quality.
- Why it exists: Scales human-like judging; reduces single-judge bias; provides rich, rubric-aligned feedback.
- What breaks without it: Manual judging is slow, inconsistent, and unscalable; simple image metrics miss logic and readability.
- Example: A blurry "Search" label lowers Quality; an unrelated popup after a tap lowers Logic and Goal.
- Convert to GE-Score
- What happens: Each dimension gets 0-5, normalized to percentages, then averaged across samples/tasks to produce a holistic score.
- Why it exists: A combined, interpretable score helps compare models while still keeping dimension-level detail for diagnosis.
- Example: A model with 85 in Single-step but 45 in Multi-step clearly struggles with planning.
- Analyze Results Across Task Types and Languages
- What happens: Compare performance on Chinese vs English subsets, and across all five tasks.
- Why it exists: Text rendering and icon semantics may differ by language; task types stress different skills.
- Example: A model might read English text fine but struggle with dense Chinese characters.
Data Construction (how the benchmark was built):
- Raw recording: Collect phone and desktop screen recordings of real interactions.
- Annotation: Label actions (clicks, scrolls), write instructions and goals, and create JSON metadata.
- Quality control: 1) Rule-based filtering (remove noisy samples). 2) Expert verification (ensure action/visual match). 3) Statistical calibration (balance distribution). Final: 700 curated samples across the five task types.
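A single annotated sample might look like the record below. The field names and the filter are invented for illustration; the paper specifies JSON metadata and rule-based filtering but not their exact shapes.

```python
# Hypothetical annotation record for one Single-step sample. Field names
# are illustrative; the benchmark's actual JSON schema may differ.
sample = {
    "task_type": "single_step",
    "language": "en",
    "reference_screen": "screens/settings_home.png",
    "instruction": "Enable Bluetooth",
    "action": {"kind": "click", "point": [312, 540]},  # [0, 1000] normalized
    "expected_outcome": "Bluetooth toggle switches to the on state",
}

def passes_rule_filter(s):
    """Toy rule-based filter: reject records with missing fields or
    out-of-range click coordinates (mirrors the 'noisy sample' cleanup)."""
    required = {"task_type", "reference_screen", "instruction", "action"}
    if not required <= s.keys():
        return False
    x, y = s["action"].get("point", (-1, -1))
    return 0 <= x <= 1000 and 0 <= y <= 1000

print(passes_rule_filter(sample))  # -> True
```

Records that survive this kind of automated pass then go to expert verification and statistical calibration, per the pipeline above.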
Concrete mini-examples per task:
- Single-step: From the App Store's main page, the instruction says "Open app details." The next screen should be that app's detail page.
- Multi-step: "Book an appointment and add it to the calendar." Five frames should logically go from opening the app → choosing an appointment → confirming → opening the calendar → seeing the event added.
- Fiction-app: "Design a study planner with tabs for Today/Week/Month and a + button to add tasks." The screens must be plausible and consistent even though no real app is given.
- Real-app: "Download an area for offline use." The sequence should show correct steps for that rare flow.
- Grounding: "Click [938, 61] to open the top-right menu," and the next screen should reflect a popup anchored near that point.
The Secret Sauce (what makes this method clever):
- Discrete-transition focus: It mirrors how real apps jump between states, not how natural videos flow.
- Five-dimensional rubric: Splits look vs logic vs stability vs plausibility vs clarity, so problems are visible, not hidden.
- VLM-as-a-Judge: Strong correlation (about 0.99) with human scores enables scalable, trustworthy evaluation.
- Diverse tasks: From pinpoint clicks to long plans and make-believe apps, each task stresses a different muscle.
Extra Sandwich recaps for the core components:
🍞 Hook: You know how report cards show several subjects instead of one big grade? 🥬 The Concept: GE-Score's five parts (Goal, Logic, Consistency, UI, Quality) grade different skills.
- How it works: Score each part 0-5; normalize and average.
- Why it matters: A model can't hide poor logic behind pretty visuals. 🍞 Anchor: "A+ in Art" but "D in Science" tells a useful story; same here.
🍞 Hook: Tapping a tiny icon is like hitting a bullseye. 🥬 The Concept: Grounding point localization checks if changes appear right where you clicked.
- How it works: Provide [0-1000] normalized coordinates; expect changes anchored there.
- Why it matters: Agents must trust that clicks map to correct pixels. 🍞 Anchor: Click the dropdown arrow; the dropdown should appear hugging that arrow, not floating elsewhere.
04 Experiments & Results
The Test: Models were asked to generate next screens or five-step sequences across five task types: Single-step, Multi-step, Fiction-app, Real-app, and Grounding. Each output was graded on five dimensions: Goal Achievement, Interaction Logic, Consistency, UI Plausibility, and Visual Quality. Scores were normalized and combined into GE-Score. To ensure fairness, three different VLMs (GPT-4o, Gemini-3, Qwen3-VL) served as judges, and runs were repeated to reduce randomness.
The Competition: 12 models took part: 8 commercial and 4 open-source. Commercial: Google Nano Banana Pro, Google Nano Banana, OpenAI GPT-image-1.5, OpenAI GPT-image-1.0, Seedream 4.5, Seedream 4.0, Wan 2.6, Flux-2-Pro. Open-source: Bagel, UniWorld-V2, Qwen-Image-Edit, Longcat-Image.
The Scoreboard (with context):
- Overall leaders: Google's Nano Banana Pro scored the highest on the Chinese subset (about 69.6 GE-Score). OpenAI's GPT-image-1.5 led the English subset (about 63.2). Think of that as getting a strong B+/A- when many others are at C or below.
- Single-step strength: Top models exceeded 80 in Single-step, like acing a quiz with one hard question. They can follow single instructions well.
- Multi-step drop: Scores often plunged below 60, and sometimes much lower for weaker models, like going from an A on a small quiz to a D on the big test. This shows weak long-horizon planning and error accumulation across frames.
- Grounding struggles: Even the best model achieved only around 23.9% Goal on Grounding. That's like missing the dartboard most of the time. Models knew what to change but not exactly where to put it.
- Open-source gap: Open-source models lagged markedly, especially on complex multi-step logic and precise grounding, suggesting substantial room for improvement.
Surprising or Noteworthy Findings:
- VLM-as-a-Judge validity: Human experts and VLM judges agreed strongly: Pearson correlation r ≈ 0.989 overall (≈ 0.993 for Nano Banana Pro, ≈ 0.983 for GPT-Image-1). That's like two graders giving nearly identical marks.
- Visual vs Functional paradox: Some models produced sharp, pretty screens (high Quality) but with broken logic (low Goal/Logic). It's like a gorgeous bridge that can't hold cars. This proves we must check logic, not just looks.
- Bottlenecks: The big three weaknesses were:
- Text rendering: Especially for dense Chinese text, characters got overlapped or warped, making labels unreadable.
- Icon interpretation: Models sometimes misread what icons mean, leading to wrong screens.
- Localization precision: Popups and menus drifted away from the click point, breaking spatial trust.
- Error snowballing: In Multi-step tasks, small drift or a minor misread early can grow into totally wrong final screens, a classic compounding-error problem.
Concrete mini-cases:
- Single-step win: "Open Zoom" from a cluttered Windows desktop: top models correctly showed the Zoom client window with options like Join/Start.
- Multi-step weakness: "Order a coffee" across five frames: models often lost track after step 2 or 3, misplacing panels or mixing up content.
- Grounding failure: "Click [100, 80]" to open a nearby dropdown: many models popped menus far from the click, indicating poor spatial grounding.
Why these results matter: For training agents, multi-step and grounding are essential: agents must plan multiple actions and rely on precise clicks. Today's best models can do the first step but get lost on the journey or miss the target spot, so agents trained in these worlds may learn brittle behaviors.
Sandwich mini-explanations tied to findings:
🍞 Hook: Like starting a race strong but slowing by lap 3. 🥬 The Concept: Multi-step trajectories demand stable planning across frames.
- How it works: Keep the goal and layout stable while making logical changes each step.
- Why it matters: Real tasks are multi-step; mistakes compound. 🍞 Anchor: Booking an appointment requires 4-5 correct transitions, not just one.
🍞 Hook: Clicking a tiny map pin and expecting the right info bubble. 🥬 The Concept: Grounding point localization ensures changes appear where you tapped.
- How it works: Map the abstract coordinate to the correct pixel region and anchor UI responses there.
- Why it matters: Without it, agents can't trust clicks. 🍞 Anchor: Tapping the top-right menu must open a menu at the top-right, not the center.
05 Discussion & Limitations
Limitations:
- Benchmark scope: 700 samples is carefully curated but not endless; future expansions could include more devices, apps, and accessibility features.
- Image-only outputs: The test checks images, not actual runnable code or interaction backends; some real-world constraints aren't captured.
- Grounding shape: Tasks focus on point-based clicks; other gestures like drag, long-press, or multi-touch are not yet standard in this release.
- Judge dependence: Although three VLM judges strongly align with humans, judges can still carry biases; ongoing calibration is necessary.
- Domain focus: Primarily phone and desktop GUIs; web variations, TV interfaces, or smartwatch UIs could add new challenges.
Required Resources:
- An image generation model capable of instruction following/editing.
- Compute for running models (GPU/TPU or cloud APIs for commercial models).
- Access to VLM judges (APIs or strong open-source VLMs) and storage for generated sequences.
When NOT to Use:
- If you need executable, event-driven simulations with exact timing or physics: GEBench is about visual state changes, not code-level interactions.
- If your task is natural-scene video continuity, not discrete GUI jumps: use a video benchmark instead.
- If you only care about pure aesthetics without functionality: traditional image scores may suffice.
Open Questions:
- How to fuse stronger spatial understanding so clicks map perfectly to pixels? (Architectures integrating explicit coordinate encodings or UI element detectors?)
- How to stabilize multi-step planning to prevent error snowballing? (State memory, layout graphs, or intermediate symbolic plans?)
- How to ensure robust text rendering across languages, especially dense scripts? (Specialized text-rendering modules or vector-text layers?)
- Can we expand beyond point clicks to drags, long-presses, and keyboard shortcuts?
- How to combine image-level generation with structured UI representations to improve plausibility and consistency at once?
06 Conclusion & Future Work
Three-sentence summary: GEBench is a new benchmark that tests whether image generation models can act like real app environments by checking not only how screens look but also how they change after user actions. It introduces GE-Score, a five-part rubric (Goal, Logic, Consistency, UI Plausibility, Visual Quality) across five task types, including precise coordinate grounding and multi-step planning. Results show current models handle simple one-step changes well but stumble on longer plans and precise clicks, especially with text and icon handling.
Main Achievement: GEBench shifts evaluation from "pretty pictures" to "working screens," providing a trusted, human-aligned, multi-dimensional testbed that reveals the true readiness of generative models as GUI simulators.
Future Directions: Improve spatial grounding with explicit coordinate-aware mechanisms; strengthen long-horizon planning with memory and symbolic layout reasoning; integrate robust text renderers; broaden tasks to include gestures and more device types; and combine visual generation with structured UI graphs.
Why Remember This: If we want safe, capable agents that can operate phones and computers for us, they need realistic practice arenas. GEBench tells us which image models already act like real apps and precisely where they fail, guiding the next wave of research toward GUI worlds that are not just beautiful but usable and trustworthy.
Practical Applications
- Evaluate and compare image models for building GUI simulators before training agents.
- Diagnose model weaknesses (e.g., grounding vs planning vs text rendering) using dimension scores.
- Pre-train GUI agents in synthetic environments that behave more like real apps.
- Stress-test multi-step tasks (like booking or checkout flows) to reduce error accumulation.
- Benchmark multilingual UI clarity, especially dense scripts, to improve accessibility.
- Prototype zero-shot app ideas (fiction-app) and quickly check plausibility and consistency.
- Improve icon understanding by curating failure cases where semantics break.
- Guide model architecture changes (coordinate-aware layers, layout memory) with grounding and consistency scores.
- Track progress over time as new model versions fix identified bottlenecks.
- Select the best model-task fit (e.g., single-step editing vs multi-step planning) for a specific product need.