Generative Visual Code Mobile World Models
Key Summary
- This paper shows a new way to predict what a phone screen will look like after you tap or scroll: generate web code (like HTML/CSS/SVG) and then render it to pixels.
- By predicting code instead of raw images, the model writes perfectly readable text and keeps layouts neat, which image generators often mess up.
- The team built gWorld (8B and 32B), open-weight models that are fast, single models (no slow multi-step pipelines) and very accurate.
- They created a data factory that turns old app-using videos into training pairs and adds helpful reasoning steps that explain the change.
- Across six benchmarks (including two out-of-distribution), gWorld beats much larger open-weight models and has under 1% render failures.
- Average instruction-following accuracy reaches 74.9% (8B) and 79.6% (32B), setting a new accuracy-vs-size Pareto frontier.
- Scaling data from 37K to 240K samples improves results predictably, following a power law, so more data should keep helping.
- Better world models help downstream phone-control agents choose better actions, giving strong accuracy boosts in policy tests.
- Rendering code is fast (about 0.3 seconds per screen) and inference can be parallelized, making rollouts practical and cheap.
- MWMBENCH, a new benchmark, tests visual next-state prediction using real coordinate actions and includes two OOD sets.
Why This Research Matters
Phones are how billions of people access the digital world, so better phone-controlling AIs can help with everyday tasks like messaging, shopping, and navigation. A world model that predicts crisp, correct next screens lets agents plan safely and efficiently before touching a real device. Because the output is code, rendering is fast and robust, making large-scale simulated practice and evaluation practical. This helps build assistants that work across languages and apps, including accessibility tools for people who need extra help navigating screens. Open weights and a new benchmark mean the broader community can improve, adapt, and deploy this approach in real systems.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine playing a mobile game where you have to predict what the screen will look like after you press a button. If you guess wrong, your next move won't make sense. That's what phone-controlling AIs face every step.
The Concept (Mobile GUI Agents): These are AIs that look at a phone screen, decide an action (like tap or scroll), and need to predict the next screen to plan well. How it works: 1) See a screen. 2) Pick an action. 3) Guess the next screen. Why it matters: Without good guesses, the AI wastes time or makes mistakes, like tapping the wrong button or typing in the wrong box.
Anchor: Think of a friend guiding you through an app over a video call: if they can't imagine where a tap takes you, they can't guide you to the goal.
The World Before: For a while, many mobile "world models" (the parts that predict the next screen) talked only in words. They'd describe the next state in text like "The Settings page opens and a toggle appears." That's helpful, but phone screens are about pixels: exact positions, sharp icons, colors, and especially crisp, correct text (like usernames and labels). Text-only world models don't carry the full picture: they miss precise layout, icon styles, or font details. On the other hand, visual models tried to generate the next screen as an image. But image generators struggle with GUIs' most important discrete parts: readable text and strict layouts. They blur words, bend boxes, and often just copy the current screen with tiny edits, instead of truly changing to the next state.
Hook: You know how a comic book panel changes to the next panel to tell the story? If the next panel looked almost the same, the story wouldn't move forward.
The Concept (Visual World Model): A world model predicts what the world (here, a phone screen) becomes after an action. How it works: 1) Take the current screen and the action. 2) Predict the next screen. Why it matters: Without an accurate next panel, the story (your plan) stalls.
Anchor: If you tap "Send," the next screen should actually show your message sent, not the same screen again.
The Problem: One recent system (VIMO) proved that visual prediction helps policies a lot, but it needed a complicated, slow pipeline: OCR to extract text, masking boxes, a big vision-language model to process regions, a custom diffusion image generator, and extra big-model calls to fill in text. It also needed to convert exact coordinate taps into text descriptions using a closed model. This brought heavy computation, complexity, and reproducibility issues.
Failed Attempts: 1) Text-only next states: fast and simple, but lose vital visual details. 2) Image-only next states: keep visuals but struggle with crisp text and correct layout changes, and often just "copy the input." 3) Multi-stage pipelines: can work but are slow, closed, and hard to deploy.
Hook: You know how sheet music can be turned into beautiful sound by any instrument that can read it?
The Concept (Renderable Code): Instead of drawing the next screen pixel-by-pixel, write a clean recipe (HTML/CSS/SVG) that a browser can render into pixels. How it works: 1) Generate code that exactly describes boxes, text, icons, and colors. 2) Let a browser render it. Why it matters: Browsers are great at sharp text and structured layouts.
Anchor: If the AI writes "<button>Send</button>", the browser draws a real, crisp Send button at the right spot.
The Gap: What was missing was a single, open model that predicts the next mobile screen as code, keeping accurate visuals and precise text, without the heavy pipeline.
The New Idea: gWorld predicts code that renders the next mobile screen. It uses a vision-language model's knowledge of language and web code structure to write high-fidelity screens. Because the model outputs text code, it naturally writes perfect text labels, and because code has structure, layouts stay clean.
Real Stakes: Why should you care?
- Accessibility: agents can help people navigate phones reliably.
- Productivity: faster, more accurate automation for repetitive phone tasks.
- Safety: agents can "practice" risky moves (like money transfers) in simulation without touching the real device.
- Speed: rendering code is fast, and models can run many rollouts in parallel.
- Research: open weights make it easier for everyone to build better agents.
Hook: Imagine you're assembling Lego. With clear instructions, you can build the exact model every time.
The Concept (Instruction Accuracy, IAcc.): A score that checks if the predicted next screen matches what should happen when you apply the action. How it works: 1) Judges look at current screen, action, and predicted next screen. 2) They vote pass/fail. 3) Average the judges. Why it matters: Pretty pictures are useless if they don't reflect the action's effect.
Anchor: If the action is "tap the Settings icon," a pass means the next screen actually looks like Settings, not just something similar.
02 Core Idea
Hook: Think about cooking from a recipe instead of copying a photo of a dish. If you follow the recipe, you can remake it anytime, and every ingredient is clear.
The Concept (Aha! Moment): Predict the next phone screen as renderable web code (HTML/CSS/SVG) and let a browser draw the pixels; don't paint pixels directly. How it works (big picture): 1) See the current screen image and the action (like a tap at coordinates). 2) First write a brief reasoning about what should change. 3) Generate clean code that encodes the new layout and text. 4) Render the code into pixels. Why it matters: Text becomes perfectly readable, layouts stay structured, and the model can generalize better because it writes plans (code), not just copies images.
Anchor: If you tap "Inbox," the model writes code for an email list view and the browser renders a crisp list of emails.
Three analogies to cement the idea:
- Blueprint vs. Photo: A blueprint (code) tells builders exactly where beams, walls, and doors go. A photo is just a snapshot. If you want to recreate a house faithfully and fix details, a blueprint wins.
- Sheet Music vs. Recording: Sheet music (code) represents structure (notes, timing, dynamics), so any orchestra can perform it accurately. A recording (pixels) sounds great but is hard to edit or repurpose.
- Lego Instructions vs. Sculpture: Lego instructions (code) give step-by-step parts and positions. A sculpture (pixel image) looks right but doesn't tell you how to rebuild it.
Before vs. After:
- Before: Text-only WMs missed visual specifics; image WMs made blurry text and often barely changed the screen, relying on expensive, multi-model pipelines.
- After: One model (gWorld) writes the next screen as structured code, so text is crisp, layouts are consistent, and rendering is fast and reliable, with open weights and no complex chain.
Why it works (intuition, no equations):
- Language priors: Vision-language models have learned tons of structured web code and natural language. That makes them good at writing UI code with sensible text content and layout rules.
- Discrete structure: GUIs are made of boxes, buttons, and text. Code is naturally discrete and structured, so it matches GUIs better than fuzzy pixels.
- Executable target: The output is runnable. If the code doesn't render, it's obviously wrong, so the system gravitates toward valid, consistent patterns.
- Reason-then-code: Brief reasoning first helps the model plan changes (like "the menu opens" or "the switch toggles"), then code faithfully captures that plan.
Building blocks (the idea broken into parts):
- Input state and action: The current screenshot (image) plus an action in coordinates or text.
- Reasoning trace: A short text explanation of what should change, grounded in the true next screen during training (look-ahead), so the planning is consistent with reality.
- Code generator: The VLM outputs HTML/CSS/SVG that encodes layout, text, icons, and colors for the next screen.
- Renderer: A lightweight browser renders the code into pixels fast (~0.3s), giving a faithful visual state.
- Evaluator: Metrics like Instruction Accuracy (action-consistency) and Similarity (appearance) check quality.
Hook: You know how a calculator gives exact answers because it follows rules, not guesses?
The Concept (Renderable Code Generation): Generating code that a browser can execute to draw the next screen. How it works: 1) Predict valid, complete HTML5 with styling. 2) Use SVG for icons/placeholders. 3) Render in a mobile viewport. Why it matters: It locks in structure and perfect text.
Anchor: Writing "<div class='title'>Your Cart</div>" ensures the exact title appears, sharply, every time.
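To make the rendering step concrete, here is a minimal sketch of how predicted HTML/CSS could be turned into pixels in a mobile-sized viewport. It uses Playwright's headless Chromium as a stand-in renderer; the viewport size and the example HTML are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: render model-predicted web code to a screenshot.
# Assumes `pip install playwright` and `playwright install chromium`;
# the viewport size and example HTML are illustrative, not the paper's exact setup.
from playwright.sync_api import sync_playwright

predicted_code = """
<!DOCTYPE html>
<html><head><style>
  body { margin: 0; font-family: sans-serif; }
  .title { padding: 16px; font-size: 20px; font-weight: bold; }
  button { margin: 16px; padding: 12px 24px; }
</style></head>
<body>
  <div class="title">Your Cart</div>
  <button>Send</button>
</body></html>
"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    # A phone-like viewport so the layout renders at mobile proportions.
    page = browser.new_page(viewport={"width": 412, "height": 915})
    page.set_content(predicted_code)        # load the generated code directly
    page.screenshot(path="next_state.png")  # the predicted next screen, as pixels
    browser.close()
```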
The clever twist: Make the target "code + reasoning" during training. The reasoning is made with look-ahead to the actual next screen so the model learns the correct kinds of changes to make. Then, train it to produce both the explanation and the renderable code. This splits a hard problem into two easier steps: decide the change, then implement it.
Hook: Picture calling the next play in a sports game: first you plan, then you execute.
The Concept (World Model): A predictor of "if I do this, what happens next?" How it works: 1) Take state and action. 2) Predict next state. Why it matters: It lets agents plan several steps ahead without touching the real device.
Anchor: If you scroll down, the next screen should show lower content; if you tap Back, you should return to the previous page.
03 Methodology
High-level pipeline: Input (current screenshot + action) → Reason about change → Generate web code for next state → Render to pixels → Output next screen.
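The whole loop can be read as one function. The sketch below assumes a hypothetical `vlm` client with a `generate` method; the prompt wording and the `<reasoning>`/`<code>` tags are illustrative, not the paper's exact interface.

```python
# Minimal sketch of the inference pipeline: screenshot + action -> reasoning + code.
# `vlm.generate`, the prompt format, and the <reasoning>/<code> tags are hypothetical.
import re

def predict_next_screen(vlm, screenshot_png: bytes, action: str) -> tuple[str, str]:
    prompt = (
        "You are a mobile GUI world model. Given the current screen and the action "
        f"'{action}', first reason briefly about what changes, then output the next "
        "screen as complete HTML/CSS/SVG.\n"
        "Format: <reasoning>...</reasoning><code>...</code>"
    )
    output = vlm.generate(image=screenshot_png, prompt=prompt)
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", output, re.S).group(1).strip()
    code = re.search(r"<code>(.*?)</code>", output, re.S).group(1).strip()
    return reasoning, code

# Usage (hypothetical client):
#   reasoning, code = predict_next_screen(vlm, open("home.png", "rb").read(),
#                                         "tap(540, 1210)")
# then render `code` to pixels with a headless browser, as in the earlier sketch.
```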
Step 0: Set up the training data like a recipe
- What happens: Start with big collections of phone usage (trajectories): sequences of screenshots and actions. Convert each (state S_t, action A_t, next-state S_{t+1}) into a training pair.
- Why this step: We need examples of cause-and-effect (action leads to next screen) to teach the model.
- Example: S_t shows an email inbox; A_t taps "Compose"; S_{t+1} shows a new message screen.
Hook: Like factory workers turning raw ingredients into meal kits.
The Concept (Data Generation Framework): A system that turns old app interactions into training examples with reasoning plus renderable code. How it works: 1) Repurpose trajectories into transitions. 2) Turn the next image into code. 3) Add a reasoning explanation with look-ahead. Why it matters: It creates high-quality, structured data at scale.
Anchor: From a video of someone using a calendar app, you get many (screen, tap) → (reasoning, code) pairs.
Step 1: Repurpose policy trajectories into transitions
- What happens: From each episode of using an app, build pairs of (current screenshot S_t, action A_t) → (next screenshot S_{t+1}).
- Why this step: It matches what a world model needs to learn: predict next from current + action.
- Example: Tap "Settings" on the home screen → next image is the Settings page.
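As a data-structure sketch, turning a logged episode into world-model transitions is just a sliding window over consecutive frames. The field names below are illustrative, not the paper's schema.

```python
# Minimal sketch: slice an episode of (screenshot, action) steps into
# (S_t, A_t, S_{t+1}) transitions. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class Step:
    screenshot: str   # path to the screenshot taken before the action
    action: str       # e.g. "tap(312, 845)" or "scroll(down)"

@dataclass
class Transition:
    state: str        # S_t: current screenshot
    action: str       # A_t: action taken on S_t
    next_state: str   # S_{t+1}: screenshot observed after the action

def episode_to_transitions(steps: list[Step]) -> list[Transition]:
    # Each step's screenshot is the state; the following step's screenshot
    # is the observed next state.
    return [
        Transition(state=steps[i].screenshot,
                   action=steps[i].action,
                   next_state=steps[i + 1].screenshot)
        for i in range(len(steps) - 1)
    ]
```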
Step 2: Cross-modal re-labeling: turn next images into web code
- What happens: Use a strong image-to-code model to translate the next-screen image S_{t+1} into HTML/CSS/SVG (call it S_code_{t+1}).
- Why this step: gWorld outputs text, not pixels. We need code targets the model can learn to produce. Code guarantees crisp text and structured layout.
- Example: The Settings page becomes a structured HTML with headers, toggle rows, and icon placeholders.
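A sketch of what this re-labeling step could look like, assuming a hypothetical `image_to_code` call on a frontier image-to-code model plus a simple render check for validity; the prompt and retry logic are assumptions, not the paper's exact recipe.

```python
# Minimal sketch: re-label the next-state screenshot as renderable web code.
# `frontier_model.image_to_code` is a hypothetical API; the validity check
# retries until the produced HTML looks renderable.
def relabel_next_state(frontier_model, next_screenshot_png: bytes,
                       max_retries: int = 3) -> str:
    prompt = ("Reproduce this mobile screenshot as a single self-contained "
              "HTML5 document with inline CSS, using SVG placeholders for "
              "icons and images. Output only the code.")
    for _ in range(max_retries):
        code = frontier_model.image_to_code(image=next_screenshot_png, prompt=prompt)
        if renders_cleanly(code):
            return code
    raise ValueError("could not produce renderable code for this screenshot")

def renders_cleanly(code: str) -> bool:
    # Placeholder validity check; in practice, render `code` in a headless
    # browser (as in the earlier rendering sketch) and flag load/parse failures.
    return "<html" in code.lower()
```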
Step 3: Synthesize reasoning with free look-ahead
- What happens: Since we know the real next screen during training, we generate a short reasoning trace R_t that describes the change consistent with that next screen.
- Why this step: Reasoning helps the model plan changes before coding, and look-ahead ensures the plan matches reality.
- Example: "Tapping Settings opens the Settings page with a header and a list of options; the top row shows 'Network & Internet'."
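A sketch of the look-ahead step: because the true next screenshot is available offline, the labeling prompt can show it alongside the current screen so the generated reasoning matches what actually happened. The `describe_transition` call and prompt wording are assumptions for illustration.

```python
# Minimal sketch: synthesize a reasoning trace R_t with look-ahead.
# The labeling model sees BOTH the current and the true next screenshot, so its
# explanation is grounded in the real outcome. `labeler.describe_transition` is
# a hypothetical API.
def synthesize_reasoning(labeler, current_png: bytes, action: str,
                         true_next_png: bytes) -> str:
    prompt = (
        f"The user performed: {action}.\n"
        "The first image is the screen before the action; the second image is the "
        "real screen after it. In two or three sentences, explain what changed and "
        "why, as if predicting it before seeing the result."
    )
    return labeler.describe_transition(images=[current_png, true_next_png],
                                       prompt=prompt)
```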
Training objective
- What happens: Train gWorld to output both R_t (reasoning) and S_code_{t+1} (code) from (S_t image + A_t). Only the LLM and projector are tuned; the vision encoder is frozen. Training uses open-weight Qwen3-VL 8B/32B as bases.
- Why this step: Jointly learning to plan and to code tightens the link between action understanding and visual state changes.
- Example: Input = screenshot of home + tap Settings; Output = reasoning + HTML for Settings page.
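A PyTorch-style sketch of the selective fine-tuning described above: freeze the vision encoder, keep the projector and language model trainable, and train on next-token loss over "reasoning + code" targets. The module attribute names and the forward signature follow common conventions and are assumptions, not the actual Qwen3-VL class layout.

```python
# Minimal sketch: freeze the vision encoder, fine-tune the projector + LLM.
# Module attribute names and the forward signature are assumed for illustration;
# check the actual model class before using.
import torch

def set_trainable_parts(model: torch.nn.Module):
    for p in model.parameters():
        p.requires_grad = True               # start with everything trainable
    for p in model.visual.parameters():      # assumed name of the vision encoder
        p.requires_grad = False              # frozen: visual features stay fixed

def training_step(model, batch, optimizer):
    # Standard next-token cross-entropy over the target sequence
    # "reasoning R_t + next-state code S_code_{t+1}".
    out = model(pixel_values=batch["screenshot"],
                input_ids=batch["prompt_and_target_ids"],
                labels=batch["labels"])       # labels mask out the prompt tokens
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```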
Inference-time world modeling
- What happens: At test time, given a new S_t and A_t, gWorld produces reasoning and next-state code, then the browser renders it (~0.3s after warm-up).
- Why this step: Fast, accurate, executable next states enable planning and evaluation for many candidate actions.
- Example: From an inbox, try three candidate taps ("Search," "Compose," "Menu"), render all predicted next screens, and pick the best move.
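A sketch of how a policy could use the world model at test time: propose K candidate actions, imagine each next screen, and keep the one a scoring model rates as making the most progress toward the goal. `propose_actions`, `render_html`, and `score_progress` are hypothetical stand-ins for the agent's own components; `predict_next_screen` is the earlier pipeline sketch.

```python
# Minimal sketch: world-model-guided action selection at test time.
# All helper functions are hypothetical stand-ins for the agent's components.
def choose_action(policy, world_model, scorer, screenshot_png: bytes,
                  goal: str, k: int = 3) -> str:
    candidates = policy.propose_actions(screenshot_png, goal, k=k)  # K candidate actions
    best_action, best_score = None, float("-inf")
    for action in candidates:
        reasoning, code = predict_next_screen(world_model, screenshot_png, action)
        imagined_png = render_html(code)                   # ~0.3 s per render, parallelizable
        score = scorer.score_progress(goal, imagined_png)  # how much closer to the goal?
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```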
Evaluation recipe
- Instruction Accuracy (IAcc.): A panel of three frontier judge models checks if the predicted next screen matches the intended effect of the action (binary pass/fail averaged over judges). A small rule-based filter marks un-renderable code as fail ahead of judging.
- Similarity: Compute visual embedding similarity (DINO v1/v2) between predicted and ground-truth images to measure appearance closeness (but this alone can be fooled by copying).
Hook: Like a referee checking whether the ball actually crossed the goal line, not just if the photo looks nice.
The Concept (Instruction Accuracy): A pass/fail measure of whether the predicted next state correctly follows from the action. How it works: Judges compare current state, action, and predicted next. Why it matters: It scores real understanding of dynamics.
Anchor: Tap on "Wi-Fi" passes only if the next screen shows Wi-Fi settings as expected.
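As a sketch of how this pass/fail protocol could be scored: un-renderable predictions count as automatic failures, each remaining example gets a binary verdict from every judge, and the verdicts are averaged. The judge interface is hypothetical.

```python
# Minimal sketch of Instruction Accuracy scoring.
# `judge.passes(...)` is a hypothetical call returning True/False; the render
# check mirrors the rule-based filter that fails un-renderable code up front.
def instruction_accuracy(examples, judges, renders_ok) -> float:
    total = 0.0
    for current_png, action, predicted_code in examples:
        if not renders_ok(predicted_code):    # render failure -> automatic fail
            continue                          # contributes 0 to the sum
        votes = [judge.passes(current_png, action, predicted_code) for judge in judges]
        total += sum(votes) / len(judges)     # average the binary judge verdicts
    return total / len(examples)
```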
Secret sauce (what makes it clever):
- Renderable code target: Forces structure and perfect text, two chronic weaknesses of image generators on GUIs.
- Look-ahead reasoning: Training signal that aligns planning with the true next screen, splitting "decide change" and "write code."
- Coordinate actions preserved: Actions stay in their real form (taps, scrolls) instead of being translated to text, enabling real-world deployment compatibility.
- Open, single model: No fragile, slow multi-stage pipelines; easy to reproduce, fast to run and parallelize.
Concrete walk-through example:
- Input: Current screen shows a Music app "Now Playing." Action: Tap the queue icon at coordinates (x, y).
- Reasoning: "Tapping the queue icon opens a bottom sheet listing upcoming songs with a title, list items, and play order."
- Code: HTML with a modal/bottom sheet div, a title "Up Next," and list rows for songs (using SVG for album placeholders), styled to match colors.
- Output: Rendered next screen shows a crisp, structured queue: text readable, layout correct.
Hook: Think of drawing with shapes and labels, not smearing paint.
The Concept (Similarity metric): Measures how similar two images look overall. How it works: Compare features from vision encoders. Why it matters: It checks visual closeness but not action correctness.
Anchor: Two nearly identical screens can look similar even if the action should have opened a new panel, so Similarity alone isn't enough.
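A sketch of one way such a similarity score can be computed: embed both screens with an off-the-shelf DINOv2 backbone and take the cosine similarity of the feature vectors. The exact backbone, preprocessing, and aggregation used in the paper are not assumed here.

```python
# Minimal sketch: embedding similarity between predicted and ground-truth screens.
# Uses the public DINOv2 torch.hub entry point; preprocessing and model size
# are illustrative choices, not necessarily the paper's.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return model(x)                      # one global feature vector per image

def similarity(pred_path: str, gt_path: str) -> float:
    return F.cosine_similarity(embed(pred_path), embed(gt_path)).item()
```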
04 Experiments & Results
The Test: What did they measure, and why?
- Instruction Accuracy (IAcc.): Primary metric that asks, "Does the predicted next screen make sense for this action?" It's like getting a correct answer, not just a pretty drawing.
- Render Fail Rate: If the code doesn't render, you instantly lose that example. This checks structural soundness.
- Similarity: Do the predicted and real next screens look alike overall? This adds visual context but doesnât guarantee action correctness.
The Competition: Who/what was this compared against?
- Image-edit/generation models: Qwen-Image-Edit 20B, Emu3.5 34B.
- Vision-language models (VLMs): Llama 4 (109B, 402B), Qwen3 VL (8B, 32B, 235B), GLM-4.6V 106B.
- There was no prior single-model, open-weight, visual world model specialized for mobile GUIs, so these are the strongest open baselines.
Datasets and splits
- In-distribution (ID): AitW, GUIOdyssey, AndroidControl, AMEX.
- Out-of-distribution (OOD): AndroidWorld (AW), KApps (KA; Korean-language apps). These stress-test zero-shot generalization to new domains and languages.
Scoreboard highlights (averages over six benchmarks):
- gWorld 8B: 74.9% IAcc. with under 1% render failures; top-tier similarity.
- gWorld 32B: 79.6% IAcc. with under 1% render failures; best or tied-best similarity.
- The largest baselines (e.g., Llama 4 402B) did not match the new size-accuracy Pareto frontier set by gWorld 8B/32B.
- Image generators achieved decent Similarity but much lower IAcc. (often in the teens to twenties on average), indicating they frequently copy the input rather than model action-driven change.
Context for the numbers
- Think of 79.6% IAcc. as scoring a solid A when many alternatives are closer to C. And it does this with far fewer parameters than some competitors 13-50× larger.
- Render fails under 1% mean the code almost always shows a real, viewable screen, with no broken pages.
Surprising findings
- Bigger isn't automatically better: Some extremely large models didn't reach gWorld's accuracy-size frontier. Specialized training and the code target matter more than sheer size here.
- Image generators cluster near "no change": Analysis showed a strong correlation between how similar the ground-truth next screen is to the current one and the image model's output. In short, they often copy the input (identity mapping) with tiny tweaks: good for Similarity, bad for real dynamics.
- Data scaling follows a power law: Growing the dataset from 37K → 77K → 129K → 240K steadily improved performance with a predictable curve. Since there are up to 3.7M transitions available, there's headroom for more gains.
Ablations (what parts mattered)
- Cross-modal next-state labeling (image → code): Their method achieved 100% renderable code and higher judged accuracy than a naive alternative prompt, so carefully generated code targets matter.
- Look-ahead reasoning: Training with look-ahead-grounded reasoning consistently outperformed reasoning without look-ahead across all tested benchmarks, showing that planning aligned with true outcomes improves learning.
Downstream policy impact (so what?)
- Using gWorld inside a phone-control agent to evaluate candidate actions improves action selection. In tests with two different backbone policies and K=3 action candidates, gWorld-based rollouts delivered the biggest gains versus baselines (including a no-world-model value estimator), and better world modeling correlated with bigger policy improvements (about +0.49 policy points per +1 IAcc. point on average).
Efficiency
- Rendering is fast (~0.3s per screen after a 1s browser warm-up) and parallelizable.
- Inference throughput with vLLM: about 20,000 tokens/s (8B) and 5,000 tokens/s (32B) on H200s, practical for large-scale rollouts.
Bottom line: Across six benchmarks (including two OOD), gWorld 8B and 32B set a new Pareto frontier for accuracy vs. size, drastically reduce structural errors, and translate world-modeling quality into real policy gains.
05 Discussion & Limitations
Limitations (honest view)
- Data scale not maxed: Training used about 260K samples, but up to 3.7M usable transitions exist. Since scaling shows predictable gains, more data should push results higher.
- Photo-realism edge cases: Some screens contain complex natural imagery (e.g., camera previews, video thumbnails). Pure code can approximate layout and placeholders, but not photorealistic content. For UI logic and planning, this is often fine, but it's still a visual limitation.
- Single-frame Markov assumption: The current model predicts next state from just the current state and action. Some tasks need memory (like items added to cart across several pages). A memory-augmented approach could handle these longer dependencies.
Required resources
- GPUs: Fine-tuning the 8B/32B bases and high-throughput inference works well on modern multi-GPU nodes (H200s in the paper), but smaller labs can still run the 8B model more modestly.
- Browser renderer: A lightweight browser instance to render predicted code to pixels (tiny overhead per render and easy to parallelize).
- Dataset preparation: Access to offline trajectories and a strong image-to-code frontier model (used during data synthesis) to bootstrap the code labels and reasoning.
When not to use
- Heavily photo-centric states where precise pixel-level natural images matter more than UI structure (e.g., editing photos, video frames). A hybrid method (code+image) may be better.
- Environments without a browser-like rendering model for the target UI (unusual widgets that can't be approximated via web code).
- Tasks demanding long-term hidden state (e.g., authentication flows with tokens) where single-step predictions miss crucial context.
Open questions
- Hybrid outputs: Can we combine code for structure and a small learned image patch for photo regions to get the best of both worlds?
- Memory and multi-step consistency: How to add working memory so multi-screen flows remain coherent over long horizons without drift?
- Multi-lingual and domain shift: gWorld performed strongly on Korean OOD (KApps). How far does this generalize to more languages and niche apps?
- Judge robustness: While multi-judge IAcc. reduces bias, can we design even fairer, cheaper, and fully open evaluation protocols?
- Test-time planning: How to make policies reliably generate good candidate actions at higher K, unlocking even larger gains from world models?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces gWorld, a visual world model that predicts the next mobile screen as executable web code instead of raw pixels. By leveraging VLMs' strengths in language and structured web code, gWorld achieves crisp text, accurate layouts, and strong action-conditioned predictions, all with a single open-weight model. It sets a new accuracy-size Pareto frontier across six benchmarks and boosts downstream agent performance while being fast and practical to deploy.
Main achievement: Showing that renderable code is a powerful target for mobile GUI world modeling, delivering both visual fidelity and semantic correctness in one simple, open system.
Future directions: Scale training toward the 3.7M available transitions; add working memory for long flows; explore hybrid code+image for photo-heavy regions; further expand multilingual and domain generalization; and refine open, robust evaluation standards.
Why remember this: The switch from "paint the pixels" to "write the code" is a paradigm shift for GUI simulation, like moving from copying pictures to writing blueprints, unlocking accuracy, speed, and generalization that help real agents plan better on our everyday devices.
Practical Applications
- Train phone-control agents safely by simulating risky actions (e.g., financial transfers) without real execution.
- Speed up reinforcement learning with massive, parallel rollouts using fast code rendering instead of slow emulators.
- Improve test-time planning by rolling out multiple candidate actions and choosing the best based on predicted next screens.
- Automate routine mobile tasks (e.g., form filling, settings configuration) with higher reliability thanks to crisp text and structured layouts.
- Build accessibility assistants that can predict and describe upcoming screens clearly before actions are performed.
- Localize and test apps across languages by simulating UI flows on unseen regional apps (e.g., KApps).
- Prototype UI changes by generating code-based next states that render instantly for quick design iterations.
- Create synthetic datasets of mobile interactions by recursively rolling forward from existing screens.
- Evaluate agents with consistent, open metrics using MWMBENCH's visual, coordinate-based tasks.
- Deploy lightweight world models on modest hardware (especially the 8B variant) for on-prem or edge scenarios.