GEBench: Benchmarking Image Generation Models as GUI Environments
Key Summary
- This paper introduces GEBench, a new test to check if image generation models can act like real app screens that change when you click or type.
- Instead of judging only how pretty a picture is, GEBench scores whether the model followed the instruction, kept the logic of the app, stayed consistent, looked like a real UI, and had good visual quality.
- The benchmark has 700 samples across five tasks: single-step changes, multi-step plans, make-believe (fiction) apps, rare real-app journeys, and precise point-click grounding.
- GE-Score is a five-part report card: Goal Achievement, Interaction Logic, Consistency, UI Plausibility, and Visual Quality.
- Models do well on simple, one-step changes but lose track over multi-step sequences and struggle a lot with clicking exact coordinates.
- Common failures include misreading icons, messy text rendering (especially Chinese), and putting changes in the wrong place on the screen.
- A VLM-as-a-Judge system (three strong vision-language models) scores results and agrees closely with human experts (correlation about 0.99).
- Top commercial models lead the board; open-source models lag, especially on longer plans and grounding.
- The big idea is to shift from "pretty pictures" to "working screens" so future AI agents can safely learn in realistic, low-cost GUI worlds.
- GEBench points to clear next steps: better grounding, smarter planning across steps, and stricter text/icon handling.
Why This Research Matters
Apps run our daily lives: paying bills, booking visits, sharing locations. To train helpful AI agents safely, we need realistic, low-cost practice worlds where clicking a button leads to the right next screen. GEBench checks if image generation models can provide those worlds by scoring not just how good screens look, but whether they behave correctly. This reduces the risk of agents learning bad habits from illogical or blurry screens. It also pushes model builders toward better grounding, clearer text, and more stable multi-step planning. In the long run, that means smarter, safer assistants that can truly help people get things done on their devices.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're playing a tablet game where buttons switch screens. You tap "Settings," and instantly you see sliders and toggles. It's not a slow morph like a video; it's a jump to a new screen that still makes sense.
🥬 The Situation (The World Before): For years, AI artists (image generation models) got great at drawing beautiful, detailed pictures from text. Video AIs learned to make smooth, continuous motions, like a cat walking. But apps aren't like that. Apps are made of screens that change in jumps: click a button, and bam, a new screen appears. People started dreaming: what if we could train helpful AI agents inside fake app worlds made by image models? Agents could practice safely and cheaply, without needing every real app installed. The catch: we had no good way to check if these generated app screens behaved like real ones when you click or type.
🥬 The Problem: Existing tests focused on how nice images look or how smoothly videos move. They did not check the special logic of GUIs (Graphical User Interfaces): did the right screen appear after a tap? Did icons mean what they should? Did the text stay readable? Could the model place new elements exactly where a click happened? In short, we were grading paintings and movies, not checking if buttons and screens worked like real apps.
🥬 Failed Attempts: People tried using classic image scores (like FID or CLIP-score) and video benchmarks to judge these GUI changes. But these tools miss app logic. A picture could look lovely and still be a nonsense screen: wrong popup, broken menu, or changes appearing in the wrong place. Some tried full simulators tied to specific operating systems or apps, but those are expensive to build and not very flexible. They also don't test whether image models can learn app behavior from pictures and instructions alone.
🥬 The Gap: What was missing was a GUI-specific benchmark that judged: (1) Did the right change happen? (2) Was the step realistic for an app? (3) Did unchanged parts stay stable? (4) Did the UI look like a real app? (5) Was the image clear and readable? Also, real usage needs different kinds of tasks: simple one-step edits, longer five-step plans, make-believe apps (to test imagination), rare real-app situations, and precise coordinate clicks. No existing test covered all of that.
🥬 Why It Matters (Real Stakes): If we want smart assistants that can do things on your phone or computer, like booking a doctor visit, paying a bill, or turning on a setting, they need safe practice worlds. Generative GUI environments are like flight simulators for app agents. But if those worlds are illogical (wrong popups, wobbly layouts, unreadable text), the agents will learn bad habits. That could waste money, cause mistakes (like paying the wrong bill), or create accessibility problems for users who rely on assistive tech.
🍞 Anchor: Think of learning to drive in a simulator. If the simulator shows pretty roads but the steering wheel sometimes controls the radio, you won't learn real driving. We need tests that make sure the simulator acts like a real car. GEBench is that test, but for app screens.
To make the later ideas simple, let's introduce two prerequisites using the Sandwich pattern.
🍞 Hook: You know how apps have screens with buttons, lists, and text labels? 🥬 The Concept: A Graphical User Interface (GUI) is the screen layout you click and type on: buttons, menus, tabs, and text.
- How it works: 1) The app shows a screen. 2) You take an action (tap, type, scroll). 3) The app switches to the correct next screen or updates parts of the screen.
- Why it matters: Without a GUI, you can't easily tell the app what to do or see results. 🍞 Anchor: On your phone, tapping the Wi-Fi icon opens the Wi-Fi settings screen; that's a GUI doing its job.
🍞 Hook: Imagine describing a picture to a super-artist who draws it from your words. 🥬 The Concept: An image generation model makes pictures from instructions (and sometimes from a reference image too).
- How it works: 1) Read your text. 2) Understand key objects and layout hints. 3) Paint pixels to match the request.
- Why it matters: It lets computers quickly create visual scenes, designs, or edits. 🍞 Anchor: Say "a blue button labeled 'Search' at the top." The model draws a screen with that button.
02 Core Idea
🍞 Hook: Picture a school science fair. If we only judge posters by how colorful they are, we might give top prize to something pretty but scientifically wrong. That wouldn't be fair, or helpful.
🥬 The Aha! Moment (one sentence): Don't just grade how nice a generated GUI looks; grade whether it behaves like a real app when a user acts.
🥬 Multiple Analogies:
- Game Referee: A soccer ref doesn't judge jersey colors; they judge legal moves and goals. GEBench is the ref for GUI behavior, not just style.
- Comic Strip Logic: Each frame in a comic should follow from the last. GEBench checks that GUI frames change logically after actions.
- Treasure Map: It's not enough to have a beautiful map; the X must be where the treasure is. GEBench checks if taps lead to changes in the right spot (grounding).
🥬 Before vs After:
- Before: Models got praise for pretty images but weren't checked for correct app logic. A popup might appear, but not the right one; text might look stylish but be unreadable; icons might be misinterpreted.
- After: With GEBench and GE-Score, we evaluate five must-haves: did you reach the goal, was the interaction logical, did the screen stay consistent, does the UI look real, and is it visually clear?
🥬 Why It Works (intuition without math): GUI interactions are discrete jumps caused by actions (clicks, types). A single score can hide important failures, so GEBench splits the grade into five parts, like checking all the nuts and bolts on a bike, not just the paint. Using strong Vision-Language Models (VLMs) as judges scales up fair scoring and agrees closely with human experts, so we can trust the process.
🥬 Building Blocks (explained with the Sandwich pattern in the best learning order):
- 🍞 Hook: Imagine a playground where you can test if a slide is safe, a swing is sturdy, and the sandbox is clean. 🥬 The Concept: GEBench is a benchmark that tests image generation models as GUI environments: can they produce the correct next screens from actions?
- How it works: 1) Give a current screen and an instruction (or a coordinate click). 2) The model generates the next screen or a 5-step sequence. 3) VLM judges score five dimensions. 4) Scores are combined (GE-Score) and compared across tasks/models.
- Why it matters: Without a GUI-specific test, we can't tell if models behave like real apps. 🍞 Anchor: Start on a home screen; the instruction says "Open Settings." GEBench checks if the next screen really looks like Settings and makes sense.
- 🍞 Hook: Like a report card with subjects: math, reading, science, art, and PE. 🥬 The Concept: GE-Score is the combined five-part score that measures GUI behavior quality.
- How it works: 1) Judge each dimension from 0-5. 2) Normalize to percentages. 3) Average across tasks and samples.
- Why it matters: A single "pretty" score hides logic mistakes; GE-Score reveals them. 🍞 Anchor: A model could get high Visual Quality but low Interaction Logic, warning us it looks good but acts wrong.
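The judge-normalize-average recipe can be sketched in a few lines. This is an illustrative sketch, not the authors' released code; it assumes a plain unweighted mean over the five dimensions, which is what the text describes.

```python
# Minimal GE-Score sketch, assuming an unweighted mean: each dimension is
# rated 0-5, scaled to a 0-100 percentage, then averaged across dimensions
# and across samples.

DIMENSIONS = ["goal", "logic", "consistency", "ui", "quality"]

def ge_score(samples):
    """samples: list of dicts mapping each dimension to a 0-5 rating."""
    per_sample = []
    for s in samples:
        pct = [s[d] * 20.0 for d in DIMENSIONS]  # 0-5 -> 0-100
        per_sample.append(sum(pct) / len(pct))
    return sum(per_sample) / len(per_sample)

ratings = [
    {"goal": 5, "logic": 4, "consistency": 5, "ui": 4, "quality": 3},
    {"goal": 2, "logic": 1, "consistency": 4, "ui": 3, "quality": 5},
]
print(ge_score(ratings))  # -> 72.0
```

Note how the second sample's high Visual Quality (5) cannot mask its low Goal and Logic scores; the averaged dimensions keep the failure visible.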
Now the five GE-Score dimensions: 🍞 Hook: You know when a teacher asks, "Did you answer the question?" 🥬 The Concept: Goal Achievement checks if the exact requested change or final goal happened.
- How it works: Inspect the generated screen(s) and confirm the intended result is clearly there.
- Why it matters: If the goal isn't met, the rest doesn't matter. 🍞 Anchor: Instruction says "Open Wi-Fi details." If we see the Wi-Fi detail page, that's good Goal Achievement.
- 🍞 Hook: Pressing an elevator button for Floor 3 should not teleport you to the roof. 🥬 The Concept: Interaction Logic checks that each change fits real app behavior.
- How it works: Compare the action and the result: do they match normal UI patterns?
- Why it matters: Without logic, agents learn fake rules. 🍞 Anchor: Tapping a tab should switch the tab's content, not randomly open a settings dialog.
- 🍞 Hook: When you edit one paragraph, the rest of your document shouldn't scramble. 🥬 The Concept: Consistency checks that unrelated parts stay stable across frames.
- How it works: Look for drift in areas that should be unchanged.
- Why it matters: Unnecessary changes confuse agents and users. 🍞 Anchor: Opening a small dropdown shouldn't shift the whole header bar.
- 🍞 Hook: A movie set should look believable, not like cardboard scenery. 🥬 The Concept: UI Plausibility asks if UI elements are native-looking and structurally correct.
- How it works: Check for proper layering, states, and platform conventions.
- Why it matters: Fake-looking UI breaks trust and function. 🍞 Anchor: A modal should sit on top and dim the background; it shouldn't hide behind the page.
- 🍞 Hook: Reading a sign only works if the letters are clear. 🥬 The Concept: Visual Quality checks text/icon clarity and artifacts.
- How it works: Inspect sharpness and legibility; spot blurs or smears.
- Why it matters: If you can't read it, you can't use it. 🍞 Anchor: A "Search" button that's too blurry to read fails Visual Quality.
Key task types: 🍞 Hook: Flip one page in a flipbook. 🥬 The Concept: Single-step transitions test one precise change from an instruction.
- How it works: Given a screen and one action, generate the next screen.
- Why it matters: It checks fine-grained instruction following. 🍞 Anchor: "Tap the gear icon": the next screen should be Settings.
- 🍞 Hook: Following a recipe over five steps. 🥬 The Concept: Multi-step trajectories test five-step plans with temporal coherence.
- How it works: Start from a goal like "Order a coffee" and show five logical steps.
- Why it matters: Real tasks take multiple steps; errors can snowball. 🍞 Anchor: From home → open app → choose drink → pick size → checkout.
- 🍞 Hook: Build a pretend app from a detailed description. 🥬 The Concept: Fiction-app tests zero-shot creativity while staying plausible.
- How it works: No reference screen, only instructions.
- Why it matters: Tests imagination plus structure. 🍞 Anchor: "A habit tracker with tabs for Today, Week, Month" should look coherent and app-like.
- 🍞 Hook: Taking the less-traveled path. 🥬 The Concept: Real-app (rare trajectories) checks long-tail sequences not seen often.
- How it works: Follow unusual but valid flows.
- Why it matters: Agents must handle edge cases. 🍞 Anchor: "Export app data as CSV, then share" is rarer than "Open home."
- 🍞 Hook: Tap exactly here and watch the right thing happen there. 🥬 The Concept: Grounding point localization tests precise coordinate-based changes.
- How it works: Provide a point (in [0, 1000] normalized coordinates) and expect the correct anchored change.
- Why it matters: Without spatial precision, clicks don't map to the right UI parts. 🍞 Anchor: Clicking [940, 40] should open the top-right menu, not a random popup in the center.
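The [0, 1000] convention can be made concrete with a small helper. A minimal sketch, assuming the benchmark's normalized coordinates scale linearly to the screenshot's pixel grid; the function names and the tolerance check are illustrative, not from the paper:

```python
def to_pixels(point, width, height):
    """Map a [0, 1000]-normalized (x, y) point to pixel coordinates."""
    x, y = point
    return (round(x / 1000 * width), round(y / 1000 * height))

def is_anchored(point, popup_box, width, height, tol=50):
    """Rough check: does a popup's bounding box sit near the clicked point?
    popup_box is (left, top, right, bottom) in pixels; tol is pixel slack."""
    px, py = to_pixels(point, width, height)
    left, top, right, bottom = popup_box
    return (left - tol <= px <= right + tol) and (top - tol <= py <= bottom + tol)

# Clicking [940, 40] on a 1080x2400 phone screen lands near the top-right corner.
print(to_pixels((940, 40), 1080, 2400))  # -> (1015, 96)
```

A judge (human or VLM) is effectively applying a check like `is_anchored`: a menu that opens in the screen center after a top-right click fails it.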
Bonus concept used by the benchmark: 🍞 Hook: Three fair judges scoring a talent show. 🥬 The Concept: VLM-as-a-Judge uses strong vision-language models to grade results.
- How it works: Multiple VLMs score each dimension; results align strongly with humans.
- Why it matters: Scalable and reliable evaluation. 🍞 Anchor: Scores from GPT-4o, Gemini-3, and Qwen3-VL correlate with humans at ~0.99; that's tight agreement.
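Averaging a panel of judges is straightforward to sketch. This is an illustrative stand-in, assuming each judge returns a 0-5 rating per dimension; the `panel` data below is made up, standing in for real responses from models such as GPT-4o, Gemini-3, and Qwen3-VL.

```python
# Sketch of multi-judge aggregation: each judge rates every dimension 0-5,
# and the panel's score per dimension is the mean across judges.

DIMENSIONS = ["goal", "logic", "consistency", "ui", "quality"]

def aggregate_judges(judge_outputs):
    """judge_outputs: list of {dimension: 0-5 rating} dicts, one per judge."""
    return {
        d: sum(j[d] for j in judge_outputs) / len(judge_outputs)
        for d in DIMENSIONS
    }

panel = [
    {"goal": 5, "logic": 4, "consistency": 5, "ui": 4, "quality": 4},  # judge A
    {"goal": 4, "logic": 4, "consistency": 5, "ui": 3, "quality": 4},  # judge B
    {"goal": 5, "logic": 3, "consistency": 4, "ui": 4, "quality": 4},  # judge C
]
print(aggregate_judges(panel)["goal"])
```

Averaging across judges dampens any single model's bias, which is part of why the panel tracks human raters so closely.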
03 Methodology
At a high level: Input (current GUI + instruction or point) → Model generates next screen(s) → VLM judges score five dimensions → Compute GE-Score → Analyze across five task types.
Step-by-step (like a recipe):
- Pick a Task Type
- What happens: Choose one of five suites: Single-step, Multi-step, Fiction-app, Real-app, or Grounding.
- Why it exists: GUIs require different skills (precision, planning, creativity, long-tail reasoning, spatial grounding). Testing all gives a full picture.
- Example: Single-step: "Tap 'Settings'." Multi-step: "Create a recurring bill reminder." Grounding: "Click [100, 80]."
- Provide Inputs
- What happens: The model receives a reference screen and an instruction (except Fiction-app, which has no reference), or a coordinate to click (Grounding).
- Why it exists: Realistic GUI changes depend on both the current state and the user's action.
- Example: Start from a phone settings screen + "Enable Bluetooth." Or start from a desktop + "Open Zoom."
- Generate Next State(s)
- What happens: The image model outputs either the next screen (Single-step, Grounding) or a five-frame sequence (Multi-step, Fiction-app, Real-app).
- Why it exists: We need visual evidence of how the GUI would change.
- What breaks without it: We couldn't test logic or consistency, only text guesses.
- Example: For "Open Calendar, add event," we expect a 5-frame journey ending on an event details page.
- Judge with VLMs (VLM-as-a-Judge)
- What happens: Three strong vision-language models (e.g., GPT-4o, Gemini-3, Qwen3-VL) independently score each of the five dimensions: Goal, Logic, Consistency, UI, Quality.
- Why it exists: Scales human-like judging; reduces single-judge bias; provides rich, rubric-aligned feedback.
- What breaks without it: Manual judging is slow, inconsistent, and unscalable; simple image metrics miss logic and readability.
- Example: A blurry "Search" label lowers Quality; an unrelated popup after a tap lowers Logic and Goal.
- Convert to GE-Score
- What happens: Each dimension gets 0-5, normalized to percentages, then averaged across samples/tasks to produce a holistic score.
- Why it exists: A combined, interpretable score helps compare models while still keeping dimension-level detail for diagnosis.
- Example: A model with 85 in Single-step but 45 in Multi-step clearly struggles with planning.
- Analyze Results Across Task Types and Languages
- What happens: Compare performance on Chinese vs English subsets, and across all five tasks.
- Why it exists: Text rendering and icon semantics may differ by language; task types stress different skills.
- Example: A model might read English text fine but struggle with dense Chinese characters.
Data Construction (how the benchmark was built):
- Raw recording: Collect phone and desktop screen recordings of real interactions.
- Annotation: Label actions (clicks, scrolls), write instructions and goals, and create JSON metadata.
- Quality control: 1) Rule-based filtering (remove noisy samples). 2) Expert verification (ensure action/visual match). 3) Statistical calibration (balance distribution). Final: 700 curated samples across the five task types.
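A single annotated sample might look like the record below. The field names and the filter are invented for illustration; the paper specifies JSON metadata and rule-based filtering but not their exact shapes.

```python
# Hypothetical annotation record for one Single-step sample. Field names
# are illustrative; the benchmark's actual JSON schema may differ.
sample = {
    "task_type": "single_step",
    "language": "en",
    "reference_screen": "screens/settings_home.png",
    "instruction": "Enable Bluetooth",
    "action": {"kind": "click", "point": [312, 540]},  # [0, 1000] normalized
    "expected_outcome": "Bluetooth toggle switches to the on state",
}

def passes_rule_filter(s):
    """Toy rule-based filter: reject records with missing fields or
    out-of-range click coordinates (mirrors the 'noisy sample' cleanup)."""
    required = {"task_type", "reference_screen", "instruction", "action"}
    if not required <= s.keys():
        return False
    x, y = s["action"].get("point", (-1, -1))
    return 0 <= x <= 1000 and 0 <= y <= 1000

print(passes_rule_filter(sample))  # -> True
```

Records that survive this kind of automated pass then go to expert verification and statistical calibration, per the pipeline above.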
Concrete mini-examples per task:
- Single-step: From the App Store's main page, the instruction says "Open app details." The next screen should be that app's detail page.
- Multi-step: "Book an appointment and add it to the calendar." Five frames should logically go from opening the app → choosing an appointment → confirming → opening the calendar → seeing the event added.
- Fiction-app: "Design a study planner with tabs for Today/Week/Month and a + button to add tasks." The screens must be plausible and consistent even though no real app is given.
- Real-app: "Download an area for offline use." The sequence should show correct steps for that rare flow.
- Grounding: "Click [938, 61] to open the top-right menu," and the next screen should reflect a popup anchored near that point.
The Secret Sauce (what makes this method clever):
- Discrete-transition focus: It mirrors how real apps jump between states, not how natural videos flow.
- Five-dimensional rubric: Splits look vs logic vs stability vs plausibility vs clarity, so problems are visible, not hidden.
- VLM-as-a-Judge: Strong correlation (about 0.99) with human scores enables scalable, trustworthy evaluation.
- Diverse tasks: From pinpoint clicks to long plans and make-believe apps, each task stresses a different muscle.
Extra Sandwich recaps for the core components:
🍞 Hook: You know how report cards show several subjects instead of one big grade? 🥬 The Concept: GE-Score's five parts (Goal, Logic, Consistency, UI, Quality) grade different skills.
- How it works: Score each part 0-5; normalize and average.
- Why it matters: A model can't hide poor logic behind pretty visuals. 🍞 Anchor: "A+ in Art" but "D in Science" tells a useful story; same here.
🍞 Hook: Tapping a tiny icon is like hitting a bullseye. 🥬 The Concept: Grounding point localization checks if changes appear right where you clicked.
- How it works: Provide [0-1000] normalized coordinates; expect changes anchored there.
- Why it matters: Agents must trust that clicks map to correct pixels. 🍞 Anchor: Click the dropdown arrow; the dropdown should appear hugging that arrow, not floating elsewhere.
04 Experiments & Results
The Test: Models were asked to generate next screens or five-step sequences across five task types: Single-step, Multi-step, Fiction-app, Real-app, and Grounding. Each output was graded on five dimensions: Goal Achievement, Interaction Logic, Consistency, UI Plausibility, and Visual Quality. Scores were normalized and combined into GE-Score. To ensure fairness, three different VLMs (GPT-4o, Gemini-3, Qwen3-VL) served as judges, and runs were repeated to reduce randomness.
The Competition: 12 models took part: 8 commercial and 4 open-source. Commercial: Google Nano Banana Pro, Google Nano Banana, OpenAI GPT-image-1.5, OpenAI GPT-image-1.0, Seedream 4.5, Seedream 4.0, Wan 2.6, Flux-2-Pro. Open-source: Bagel, UniWorld-V2, Qwen-Image-Edit, Longcat-Image.
The Scoreboard (with context):
- Overall leaders: Google's Nano Banana Pro scored the highest on the Chinese subset (about 69.6 GE-Score). OpenAI's GPT-image-1.5 led the English subset (about 63.2). Think of that as getting a strong B+/A- when many others are at C or below.
- Single-step strength: Top models exceeded 80 in Single-step, like acing a quiz with one hard question. They can follow single instructions well.
- Multi-step drop: Scores often plunged below 60, and sometimes much lower for weaker models, like going from an A on a small quiz to a D on the big test. This shows weak long-horizon planning and error accumulation across frames.
- Grounding struggles: Even the best model achieved only around 23.9% Goal on Grounding. That's like missing the dartboard most of the time. Models knew what to change but not exactly where to put it.
- Open-source gap: Open-source models lagged markedly, especially on complex multi-step logic and precise grounding, suggesting substantial room for improvement.
Surprising or Noteworthy Findings:
- VLM-as-a-Judge validity: Human experts and VLM judges agreed strongly: Pearson correlation r ≈ 0.989 overall (≈ 0.993 for Nano Banana Pro, ≈ 0.983 for GPT-Image-1). That's like two graders giving nearly identical marks.
- Visual vs Functional paradox: Some models produced sharp, pretty screens (high Quality) but with broken logic (low Goal/Logic). It's like a gorgeous bridge that can't hold cars. This proves we must check logic, not just looks.
- Bottlenecks: The big three weaknesses were:
- Text rendering: Especially for dense Chinese text, characters got overlapped or warped, making labels unreadable.
- Icon interpretation: Models sometimes misread what icons mean, leading to wrong screens.
- Localization precision: Popups and menus drifted away from the click point, breaking spatial trust.
- Error snowballing: In Multi-step tasks, small drift or a minor misread early can grow into totally wrong final screens, a classic compounding-error problem.
Concrete mini-cases:
- Single-step win: "Open Zoom" from a cluttered Windows desktop: top models correctly showed the Zoom client window with options like Join/Start.
- Multi-step weakness: "Order a coffee" across five frames: models often lost track after step 2 or 3, misplacing panels or mixing up content.
- Grounding failure: "Click [100, 80]" to open a nearby dropdown: many models popped menus far from the click, indicating poor spatial grounding.
Why these results matter: For training agents, multi-step and grounding are essential: agents must plan multiple actions and rely on precise clicks. Today's best models can do the first step but get lost on the journey or miss the target spot, so agents trained in these worlds may learn brittle behaviors.
Sandwich mini-explanations tied to findings:
🍞 Hook: Like starting a race strong but slowing by lap 3. 🥬 The Concept: Multi-step trajectories demand stable planning across frames.
- How it works: Keep the goal and layout stable while making logical changes each step.
- Why it matters: Real tasks are multi-step; mistakes compound. 🍞 Anchor: Booking an appointment requires 4-5 correct transitions, not just one.
🍞 Hook: Clicking a tiny map pin and expecting the right info bubble. 🥬 The Concept: Grounding point localization ensures changes appear where you tapped.
- How it works: Map the abstract coordinate to the correct pixel region and anchor UI responses there.
- Why it matters: Without it, agents can't trust clicks. 🍞 Anchor: Tapping the top-right menu must open a menu at the top-right, not the center.
05 Discussion & Limitations
Limitations:
- Benchmark scope: 700 samples is carefully curated but not endless; future expansions could include more devices, apps, and accessibility features.
- Image-only outputs: The test checks images, not actual runnable code or interaction backends; some real-world constraints aren't captured.
- Grounding shape: Tasks focus on point-based clicks; other gestures like drag, long-press, or multi-touch are not yet standard in this release.
- Judge dependence: Although three VLM judges strongly align with humans, judges can still carry biases; ongoing calibration is necessary.
- Domain focus: Primarily phone and desktop GUIs; web variations, TV interfaces, or smartwatch UIs could add new challenges.
Required Resources:
- An image generation model capable of instruction following/editing.
- Compute for running models (GPU/TPU or cloud APIs for commercial models).
- Access to VLM judges (APIs or strong open-source VLMs) and storage for generated sequences.
When NOT to Use:
- If you need executable, event-driven simulations with exact timing or physics: GEBench is about visual state changes, not code-level interactions.
- If your task is natural-scene video continuity, not discrete GUI jumps: use a video benchmark instead.
- If you only care about pure aesthetics without functionality: traditional image scores may suffice.
Open Questions:
- How to fuse stronger spatial understanding so clicks map perfectly to pixels? (Architectures integrating explicit coordinate encodings or UI element detectors?)
- How to stabilize multi-step planning to prevent error snowballing? (State memory, layout graphs, or intermediate symbolic plans?)
- How to ensure robust text rendering across languages, especially dense scripts? (Specialized text-rendering modules or vector-text layers?)
- Can we expand beyond point clicks to drags, long-presses, and keyboard shortcuts?
- How to combine image-level generation with structured UI representations to improve plausibility and consistency at once?
06 Conclusion & Future Work
Three-sentence summary: GEBench is a new benchmark that tests whether image generation models can act like real app environments by checking not only how screens look but also how they change after user actions. It introduces GE-Score, a five-part rubric (Goal, Logic, Consistency, UI Plausibility, Visual Quality) across five task types, including precise coordinate grounding and multi-step planning. Results show current models handle simple one-step changes well but stumble on longer plans and precise clicks, especially with text and icon handling.
Main Achievement: GEBench shifts evaluation from "pretty pictures" to "working screens," providing a trusted, human-aligned, multi-dimensional testbed that reveals the true readiness of generative models as GUI simulators.
Future Directions: Improve spatial grounding with explicit coordinate-aware mechanisms; strengthen long-horizon planning with memory and symbolic layout reasoning; integrate robust text renderers; broaden tasks to include gestures and more device types; and combine visual generation with structured UI graphs.
Why Remember This: If we want safe, capable agents that can operate phones and computers for us, they need realistic practice arenas. GEBench tells us which image models already act like real apps and precisely where they fail, guiding the next wave of research toward GUI worlds that are not just beautiful but usable and trustworthy.
Practical Applications
- Evaluate and compare image models for building GUI simulators before training agents.
- Diagnose model weaknesses (e.g., grounding vs planning vs text rendering) using dimension scores.
- Pre-train GUI agents in synthetic environments that behave more like real apps.
- Stress-test multi-step tasks (like booking or checkout flows) to reduce error accumulation.
- Benchmark multilingual UI clarity, especially dense scripts, to improve accessibility.
- Prototype zero-shot app ideas (fiction-app) and quickly check plausibility and consistency.
- Improve icon understanding by curating failure cases where semantics break.
- Guide model architecture changes (coordinate-aware layers, layout memory) with grounding and consistency scores.
- Track progress over time as new model versions fix identified bottlenecks.
- Select the best model-task fit (e.g., single-step editing vs multi-step planning) for a specific product need.