
Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

Intermediate
Haiyang Xu, Xi Zhang, Haowei Liu et al. Ā· 2/15/2026
arXiv

Key Summary

  • This paper builds GUI-Owl-1.5, an AI that can use phones, computers, and web browsers like a careful human helper.
  • It learns from a Hybrid Data Flywheel that mixes real apps, safe cloud sandboxes, and smart simulations to collect strong training data fast.
  • A Unified Thought-Synthesis pipeline teaches the model to think step by step, remember key facts on screen, reflect on mistakes, and decide when to use tools.
  • A new multi-platform RL method called MRPO makes training stable across phones, PCs, and the web, even for very long tasks.
  • The model comes in sizes from 2B to 235B and in Instruct (fast, on-device) and Thinking (strong planning) versions for edge–cloud teamwork.
  • It reaches state-of-the-art scores on 20+ benchmarks, including OSWorld (56.5), AndroidWorld (71.6), ScreenSpot-Pro (80.3 with zoom-in), and GUI Knowledge Bench (75.5).
  • Virtual environments and step-wise thoughts each give big performance boosts, showing that data quality and reasoning matter a lot.
  • The system can click, type, scroll, and also call tools/APIs (MCP), remember information across steps, and collaborate in planner/executor/verifier roles.
  • Special MRPO tricks (online rollout buffer, token-ID transport, alternating platform training) solve common stability problems in agent RL.
  • The models are open-sourced with a cloud sandbox demo, making practical GUI agents easier to test, use, and extend.

Why This Research Matters

Reliable GUI agents can automate everyday digital chores—filling forms, gathering prices, organizing files—saving people and teams lots of time. Small Instruct models can run on-device to protect privacy while larger Thinking models in the cloud handle tough planning, enabling safe edge–cloud collaboration. Accessibility improves because voice-driven users can ask the agent to operate screens they can’t easily manipulate. IT support and RPA workflows get more robust as the agent adapts to small UI changes and uses tools when clicking is brittle. Education and training benefit from simulated practice environments that teach precise computer skills. Research gains a reusable recipe—data flywheel + thoughts + stable multi-platform RL—that can transfer to other agent domains. Overall, this moves AI from answering questions to competently doing multi-step digital work.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine teaching a friend to use your phone, your laptop, and a web browser to do chores—order food, fill forms, or look up prices. You’d want them to see the screen, understand buttons and text, remember what they found, and finish the job without getting lost.

🄬 The Concept (World Before): Before this work, AI systems could read text and look at pictures, but using real apps across different devices (phone + PC + web) was messy. Agents often worked only on one platform, forgot what happened two steps ago, or stumbled when a pop-up appeared. Collecting training data from real apps was slow, expensive, and brittle—CAPTCHAs or app updates could break everything.

Why it matters: Without a solid way to see, think, and act across devices, AI helpers can’t reliably automate normal computer or phone tasks we do daily.

šŸž Anchor: Picture an AI that finds flight prices on a website, copies them into a spreadsheet on your PC, and sends a text on your phone—with no human fixing its steps. That’s the dream.

— New Concept 1 — šŸž Hook: You know how you tap icons and buttons on a tablet? That’s a graphical interface. 🄬 The Concept (GUI): A Graphical User Interface (GUI) is the screen of apps and websites—buttons, menus, text boxes—where users click, type, and scroll. How it works: 1) The app draws visible elements; 2) You interact (click/type); 3) The app changes based on your action. Why it matters: If an AI can’t understand GUIs, it can’t operate devices. šŸž Anchor: When the AI sees a ā€œSearchā€ box and types ā€œweather tomorrow,ā€ it’s using the GUI correctly.

— New Concept 2 — šŸž Hook: Think of a careful helper who follows your instructions on your device. 🄬 The Concept (GUI Agent): A GUI agent is an AI that looks at screenshots, understands the screen, plans what to do, and performs actions like click, type, scroll. How it works: 1) See the screen; 2) Read the user’s goal; 3) Choose an action; 4) Do it; 5) Check results; 6) Repeat. Why it matters: This turns AI into a hands-on assistant, not just a talker. šŸž Anchor: ā€œOpen notes app, write grocery list, save it.ā€ The agent does each step on-screen.

— New Concept 3 — šŸž Hook: When you learn to ride a bike, you try, wobble, and get better from feedback. 🄬 The Concept (Reinforcement Learning, RL): RL teaches an agent by rewarding good outcomes and discouraging bad ones. How it works: 1) Try an action; 2) See result; 3) Get a score (reward); 4) Adjust strategy. Why it matters: Many-step GUI tasks need practice-based learning to be reliable. šŸž Anchor: The agent earns reward for successfully saving a file in the right folder.

The Problem: Researchers faced three big roadblocks. 1) Data collection: Real app trajectories are costly and blocked by CAPTCHAs and breakages. 2) Multi-platform: One agent must work on phones, PCs, and browsers with different layouts and actions. 3) Agent skills: Beyond clicking, the agent must reason, remember, call tools/APIs, and collaborate with other agents.

Failed Attempts: Rule-based scripts break on small UI changes. Single-platform agents don’t generalize. Offline-only datasets miss real pop-ups and edge cases. RL across mixed devices is unstable—training diverges when outcomes collapse (all success or all fail) and when tokenizers disagree between inference and training.

The Gap: We needed (1) a faster, safer data flywheel combining simulations with cloud sandboxes; (2) a unified way to add step-by-step thoughts, memory, and reflections to trajectories; and (3) an RL recipe that stabilizes multi-platform learning.

Real Stakes: Reliable GUI agents can help with accessibility (navigating interfaces by voice), routine office work (spreadsheets, forms), online errands (shopping, bookings), and privacy (small models running on-device). Poor agents waste time and can mis-click something important. Good ones save hours and reduce frustration.

02Core Idea

šŸž Hook: Imagine a sports team that learns faster because practices (data), playbooks (thinking steps), and coaching (training) all reinforce each other every day.

🄬 The Concept (Aha! in one sentence): Combine a Hybrid Data Flywheel, a Unified Thought-Synthesis pipeline, and a stable multi-platform RL method (MRPO) so one agent can reliably use phones, PCs, and the web.

— New Concept 4 — šŸž Hook: Like a snowball that grows as it rolls. 🄬 The Concept (Hybrid Data Flywheel): A system that mixes simulated worlds, cloud sandboxes, and a bit of human data to keep producing better UI grounding and task trajectories. How it works: 1) Synthesize hard grounding data and multi-window scenes; 2) Mine trajectories and tutorials; 3) Generate infeasible negatives; 4) Use virtual environments to mass-produce clean action sequences; 5) Feed them back to train the model, which then collects even better data. Why it matters: Without a steady data engine, agents plateau or overfit. šŸž Anchor: Build a fake spreadsheet app to safely practice drag/scroll 10,000 times, then use those skills on real office software.

— New Concept 5 — šŸž Hook: You know how good students write steps in the margin, remember key numbers, and fix their mistakes? 🄬 The Concept (Unified Thought-Synthesis / CoT): A pipeline that adds step-by-step thoughts, memory updates, reflections, and tool reasoning to each action in a trajectory. How it works: 1) Describe the screen; 2) Extract task-relevant info into memory; 3) Compare expected vs actual change; 4) Reflect/correct; 5) Write a concise conclusion. Why it matters: Without thoughts and memory, agents forget details and repeat errors. šŸž Anchor: ā€œApple price = 255.78; store in memory; next open Sheets; if not found, search again.ā€
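The five-stage thought record above can be sketched as a small function; this is a minimal illustration, and every field name here is invented for the example, not taken from the paper's pipeline.

```python
def synthesize_thought(screen_desc, memory, expected, actual):
    """Build one per-step thought record: observe, remember, reflect, conclude.

    memory is a dict of task-relevant facts extracted so far (illustrative).
    """
    # Stage 3-4: compare expected vs actual screen change and reflect on it
    reflection = ("as expected" if expected == actual
                  else f"expected {expected}, saw {actual}; will retry")
    return {
        "observation": screen_desc,                 # Stage 1: describe the screen
        "memory": memory,                           # Stage 2: task-relevant facts
        "reflection": reflection,                   # Stages 3-4: check + reflect
        # Stage 5: concise conclusion naming the most recent remembered fact
        "conclusion": f"noted {list(memory)[-1]}" if memory else "nothing to note",
    }
```

A record like this is attached to every action in a trajectory, so the model learns to carry facts such as "apple_price = 255.78" forward instead of rediscovering them.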

— New Concept 6 — šŸž Hook: Like a coach training the same athlete to play soccer on grass, turf, and indoor courts. 🄬 The Concept (MRPO RL): A multi-platform reinforcement learning method that keeps training stable across phone/PC/web. How it works: 1) One device-conditioned policy; 2) Online rollout buffer oversamples then selects diverse on-policy groups; 3) Token-ID transport keeps inference/training log-probs aligned; 4) Alternating platform optimization reduces gradient fights. Why it matters: Without stability tricks, long-horizon multi-device RL collapses. šŸž Anchor: The model practices tasks on Android this round, Windows next, Web after, staying calm and improving.

Multiple Analogies: (1) Orchestra: Data (musicians), Thoughts (sheet music), MRPO (conductor) produce a harmonious performance. (2) Map + Compass: Data gives a map of the UI world; thoughts are the compass; MRPO is the hiking plan to reach the goal efficiently. (3) Kitchen: Data are ingredients, thoughts are the recipe, MRPO is the cooking technique that avoids burning dinner.

Before vs After: Before—agents were brittle, single-platform, and forgetful. After—GUI-Owl-1.5 handles diverse devices, remembers key facts, uses tools/APIs, and scores SOTA on 20+ benchmarks (e.g., AndroidWorld 71.6, ScreenSpot-Pro 80.3 with zoom-in).

Why It Works (intuition): The flywheel gives broad, hard, and realistic practice; thoughts inject reasoning and memory; MRPO keeps the learning signal strong and fair across devices, avoiding collapse. Together, they reduce hallucinations, handle long tasks, and adapt to different UIs.

Building Blocks: (1) Grounding data synthesis and mining; (2) DAG-based trajectory generation in real and virtual environments; (3) GUI knowledge + world modeling supervision; (4) Unified CoT synthesis; (5) Multi-agent collaboration roles (planner, executor, verifier, notetaker); (6) MRPO for stable RL scaling.

03Methodology

High-Level Recipe: Input (screenshot + instruction) → Context manager (recent turns + compressed history) → Perception + Thought (screen description, memory, reflection) → Action conclusion + Tool/GUI action → Environment updates screenshot → Loop until goal.
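The loop above can be sketched in a few lines of Python. This is a toy illustration of the perceive-think-act cycle, not the paper's implementation: `AgentLoop`, `ToyEnv`, and all action names are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    observation: str   # stands in for a screenshot
    thought: str       # stands in for the synthesized per-step thought
    action: str

@dataclass
class AgentLoop:
    goal: str
    history: list = field(default_factory=list)

    def run(self, env, max_steps=10):
        obs = env.reset()
        for _ in range(max_steps):
            # Perception + thought: describe what we see relative to the goal
            thought = f"screen shows {obs}; goal is {self.goal}"
            action = env.best_action(obs, self.goal)   # policy stand-in
            self.history.append(Step(obs, thought, action))
            obs, done = env.step(action)               # environment updates screen
            if done:
                return True
        return False

class ToyEnv:
    """Three-screen toy environment: home -> search -> results."""
    FLOW = {"home": "open_search", "search": "type_query", "results": "done"}
    NEXT = {"home": "search", "search": "results"}

    def reset(self):
        self.screen = "home"
        return self.screen

    def best_action(self, obs, goal):
        return self.FLOW[obs]

    def step(self, action):
        if action == "done":
            return self.screen, True
        self.screen = self.NEXT[self.screen]
        return self.screen, False
```

Running `AgentLoop("find price").run(ToyEnv())` walks home, search, results and stops, mirroring the loop-until-goal recipe.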

Step-by-Step

  1. Inputs and Outputs
  • What happens: The agent receives a screenshot (visual observation) and the user’s instruction, then produces (a) an action conclusion in language and (b) a structured tool call (click/type/scroll or an API/MCP call).
  • Why it exists: Clear outputs let the environment execute exact actions and keep logs.
  • Example: ā€œClick the search iconā€ with coordinates, or ā€œfilesystem_read_text_file(path=...)ā€.

— New Concept 7 — šŸž Hook: When you tell a long story, you keep the recent parts detailed and older parts summarized. 🄬 The Concept (Sliding Window Context + Compression): Keep N recent turns fully (images + text), but summarize older steps into short textual notes. How it works: 1) Keep last N full turns; 2) Concatenate earlier conclusions into a summary; 3) Feed both into the model. Why it matters: Without this, memory runs out or the model forgets long-term progress. šŸž Anchor: ā€œWe already found Apple’s price earlier—don’t redo it; now fill the spreadsheet.ā€
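A minimal sketch of this sliding window, assuming each turn is a (screenshot reference, text conclusion) pair; the function name and data shapes are illustrative only.

```python
def build_context(turns, n_recent=2):
    """Keep the last n_recent turns in full; compress older ones to text.

    turns: list of (screenshot_ref, conclusion) tuples, oldest first.
    """
    old, recent = turns[:-n_recent], turns[-n_recent:]
    # Older steps lose their images and survive only as a short text summary
    summary = "; ".join(conclusion for _, conclusion in old)
    context = []
    if summary:
        context.append(("SUMMARY", summary))
    context.extend(recent)   # recent turns keep image + text in full
    return context
```

The token budget stays roughly constant no matter how long the task runs, while key earlier conclusions ("found price 255.78") stay visible.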

  2. Grounding Data Construction (Figure 4)
  • Hard Grounding Synthesis: Generate tricky multi-window, high-resolution interfaces and challenging app screenshots with checks to avoid occlusion or bad labels.
  • Extension at Scale: (a) Mine grounding pairs from trajectories with critic filtering; (b) Extract QA from tutorials/subtitles; (c) Create infeasible query negatives via random pairing and model consensus.
  • Why: Grounding (locating the right UI element) is the foundation for correct actions.
  • Example: ā€œTap the gray gear icon in the top-right settings panel (not the upload arrow).ā€

— New Concept 8 — šŸž Hook: Planning a trip by stringing together small hops. 🄬 The Concept (DAG-based Task Synthesis): Build tasks as paths through a Directed Acyclic Graph where each node is a subtask and edges are valid transitions. How it works: 1) Annotators define subtasks; 2) Sample paths; 3) Compose sub-instructions; 4) Validate checkpoints; 5) Truncate/repair partially-correct runs. Why it matters: Without structure, tasks drift or hallucinate. šŸž Anchor: Path: Open app → Search term → Open result → Copy info → Paste in notes.
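Sampling a task from such a DAG can be sketched as a random walk from a start node to a leaf; the graph contents below are invented for illustration.

```python
import random

# Toy subtask DAG: node -> list of valid next subtasks (illustrative)
DAG = {
    "open_app":       ["search_term", "open_settings"],
    "search_term":    ["open_result"],
    "open_result":    ["copy_info"],
    "copy_info":      ["paste_in_notes"],
    "open_settings":  ["toggle_dark_mode"],
    "paste_in_notes": [],      # leaves end the task
    "toggle_dark_mode": [],
}

def sample_task(dag, start="open_app", seed=0):
    """Sample one root-to-leaf path; each hop follows a valid edge."""
    rng = random.Random(seed)
    path = [start]
    while dag[path[-1]]:                       # stop at a leaf subtask
        path.append(rng.choice(dag[path[-1]]))
    return path
```

Each sampled path is then turned into composed sub-instructions and validated with checkpoints, which is what keeps synthesized tasks from drifting.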

  3. Trajectory Collection (Figure 5)
  • Real Devices + Checkpoints: Run agents on phones/PCs/browsers; verify each subtask with predicates; keep clean prefixes; request human demos for tough cases.
  • Virtual Environments: Web-rendered apps for precise, scalable practice of scroll/drag and common office scenarios; can execute scripted/RPA policies for perfect references.
  • Why: Real devices give realism; virtual worlds give scale and exact feedback.
  • Example: Practice dragging cells in a virtual spreadsheet 1,000 times to learn precise control.

— New Concept 9 — šŸž Hook: A safe driving simulator makes you a better driver before the real road. 🄬 The Concept (Virtual Environments): Browser-rendered apps that give exact subtask feedback and auto-generate clean trajectories. How it works: 1) Render UI; 2) Decompose instructions into atomic ops; 3) Execute and record; 4) Validate with exact predicates. Why it matters: Without sims, data is slow, noisy, and blocked by CAPTCHAs. šŸž Anchor: A virtual word processor where ā€œselect line → bold → saveā€ is validated perfectly.
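The "select line → bold → save" anchor can be sketched as atomic ops over a toy document state with exact predicates; the state layout and op names are invented for the example.

```python
def apply_op(state, op):
    """Execute one atomic op on a toy word-processor state."""
    kind, arg = op
    if kind == "select_line":
        state["selection"] = arg
    elif kind == "bold":
        state["bold_lines"].add(state["selection"])
    elif kind == "save":
        state["saved"] = True
    return state

def validate(state):
    """Exact predicate: line 0 is bold AND the document was saved."""
    return 0 in state["bold_lines"] and state["saved"]
```

Because the environment owns the state, validation is exact, with no fuzzy screenshot comparison, which is why virtual trajectories come out clean.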

  4. Agent Capability Enhancement (Figure 6)
  • GUI Knowledge Injection: Crawl docs/forums, rewrite into QA/VQA; train world modeling to predict next-screen changes after actions.
  • Unified CoT Synthesis: Add per-step observation, memory updates, reflection, and concise conclusions; include tool schemas for tool reasoning.
  • Multi-Agent Collaboration: Train roles—Manager (planner), Worker (executor), Reflector (verifier), Notetaker (memory)—in a closed loop.

— New Concept 10 — šŸž Hook: Guessing what the screen will show after clicking ā€œSaveā€ helps you decide wisely. 🄬 The Concept (World Modeling): Teach the model to predict how the screen changes after an action. How it works: 1) Given screenshot + action, generate a description of the next state; 2) Use it as supervision to build internal dynamics. Why it matters: Without anticipating effects, the agent clicks blindly. šŸž Anchor: ā€œAfter clicking ā€˜Download,’ expect a new file in the Downloads folder and a toast message.ā€
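Building world-modeling supervision pairs can be sketched as below; the toy dynamics table and function name are invented for illustration, standing in for model-generated next-state descriptions.

```python
# Toy next-state descriptions for (screen, action) pairs (illustrative)
DYNAMICS = {
    ("downloads_page", "click Download"): "new file appears in Downloads; toast shown",
    ("editor", "click Save"): "document marked saved; title loses asterisk",
}

def make_training_pair(screen, action):
    """Turn (screenshot, action) into a predict-the-next-screen example."""
    target = DYNAMICS.get((screen, action), "no visible change")
    prompt = f"Screen: {screen}. Action: {action}. Predict the next screen."
    return prompt, target
```

Training on such pairs gives the model internal dynamics, so it can weigh "what will clicking this do?" before acting.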

— New Concept 11 — šŸž Hook: Sometimes you need a calculator or a file reader—don’t poke the screen; just use the tool. 🄬 The Concept (Tool/MCP Calling): The agent can invoke external tools or standardized MCP APIs alongside GUI actions. How it works: 1) Decide if a tool is better than GUI; 2) Call tool with parameters; 3) Blend tool outputs back into the plan. Why it matters: Many tasks are faster/safer via tools than manual clicks. šŸž Anchor: Use filesystem_read_text_file to read code instead of opening an editor by hand.
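The tool-vs-GUI decision can be sketched as a registry lookup; the registry, task format, and fallback string here are invented for the example (only the `read_file` idea follows the article's filesystem anchor).

```python
# Illustrative tool registry: name -> callable (stands in for MCP endpoints)
TOOLS = {"read_file": lambda path: f"<contents of {path}>"}

def choose_action(task):
    """Prefer a matching tool call; fall back to GUI actions otherwise."""
    if task["kind"] in TOOLS:
        return ("tool", TOOLS[task["kind"]](task["path"]))
    # No tool fits: operate the screen directly
    return ("gui", f"click-and-open {task.get('path', task['kind'])}")
```

Reading a file through a tool is one deterministic call; doing it by GUI would mean several fragile clicks, which is the cost asymmetry the agent learns to exploit.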

— New Concept 12 — šŸž Hook: A team works better when each teammate has a role. 🄬 The Concept (Multi-Agent Collaboration): Planner, Executor, Verifier, and Notetaker coordinate to finish tasks. How it works: 1) Planner sets subgoals; 2) Executor acts; 3) Verifier checks transitions; 4) Notetaker stores key info. Why it matters: Separating roles reduces errors in long tasks. šŸž Anchor: Planner says ā€œsearch price,ā€ Executor searches, Verifier confirms, Notetaker stores ā€œ255.78.ā€
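The four-role loop can be sketched as a single function where the roles are injected as callables; all names here are illustrative stand-ins for the model-driven agents.

```python
def run_roles(goal, subgoals, execute, verify):
    """Planner's subgoals -> Worker executes -> Reflector verifies -> Notetaker stores."""
    notes = {}                        # Notetaker's memory
    for sub in subgoals:              # Manager/planner's ordered plan
        result = execute(sub)         # Worker acts on the screen
        if not verify(sub, result):   # Reflector checks the transition
            return False, notes       # stop early instead of compounding errors
        notes[sub] = result           # Notetaker stores key info for later steps
    return True, notes
```

Separating verification from execution is what catches a wrong click at step 3 instead of discovering it at step 30.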

  5. Training Paradigm
  • Pre-training: Mix UI recognition, trajectories, QA/VQA, world modeling, and tool/MCP data.
  • SFT: Align with multi-device trajectories (with CoT), augmented grounding, and browser specifics.
  • RL with MRPO: Stabilize cross-device, long-horizon control.

— New Concept 13 — šŸž Hook: One uniform playbook with notes for phone, PC, and web. 🄬 The Concept (Device-Conditioned Policy): The single policy conditions on the device type to choose valid actions. How it works: 1) Input includes device ID; 2) Policy outputs device-legal actions. Why it matters: Without conditioning, actions mix up (e.g., phone taps on desktop). šŸž Anchor: On mobile, ā€œtap at (x,y)ā€; on web, ā€œclick element CSS selector.ā€
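Device conditioning can be sketched as filtering proposed actions against a per-platform legal set; the action vocabularies below are invented for illustration.

```python
# Illustrative per-platform action vocabularies
LEGAL_ACTIONS = {
    "mobile": {"tap", "swipe", "type"},
    "pc":     {"click", "double_click", "type", "hotkey"},
    "web":    {"click_selector", "type", "scroll"},
}

def conditioned_action(device, proposed):
    """Keep only actions that are legal on the current device."""
    legal = LEGAL_ACTIONS[device]
    return [a for a in proposed if a in legal]
```

In the real policy the device ID is part of the input and the model learns to emit only legal actions; this filter just makes the constraint concrete.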

— New Concept 14 — šŸž Hook: If everyone on your practice team scores the same, you learn nothing new. 🄬 The Concept (Online Rollout Buffer under Outcome Collapse): Oversample on-policy rollouts, then select a diverse subset so groups aren’t all success or all fail. How it works: 1) Sample kn runs; 2) Subsample n; 3) If collapsed, swap one with opposite outcome; 4) Stay on-policy. Why it matters: Without diversity, GRPO updates are uninformative. šŸž Anchor: For a task, keep one success and one failure in the batch so the gradient makes sense.
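The anti-collapse selection step can be sketched as below, with rollouts reduced to (id, success) pairs; only the selection logic is shown, and the function name is invented.

```python
def select_group(rollouts, n):
    """Pick n rollouts from an oversampled on-policy pool; if the group is
    all-success or all-fail, swap one in with the opposite outcome."""
    group = rollouts[:n]
    outcomes = {success for _, success in group}
    if len(outcomes) == 1:                 # collapsed: every rollout agrees
        want = not outcomes.pop()          # the outcome we're missing
        for r in rollouts[n:]:             # search the oversampled remainder
            if r[1] == want:
                group[-1] = r              # still on-policy: same sampling run
                break
    return group
```

With at least one success and one failure in the group, the group-relative advantage in GRPO-style updates is nonzero and the gradient actually teaches something.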

— New Concept 15 — šŸž Hook: If two people use different dictionaries, they might misread the same word. 🄬 The Concept (Token-ID Transport): Send back the exact token IDs used in inference to compute log-probs during training identically. How it works: 1) Environment returns text + token IDs; 2) Trainer uses those IDs for log-probs. Why it matters: Without this, KL and gradients become inconsistent. šŸž Anchor: The agent typed ā€œSearchā€; training scores the same exact token sequence.
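The failure mode and the fix can be sketched with two deliberately mismatched toy tokenizers; everything here (tokenizers, the length-based scorer) is invented to illustrate why shipping token IDs matters.

```python
def infer_tokenize(text):
    # Inference-side toy tokenizer: one token per word
    return [hash(w) % 1000 for w in text.split()]

def train_tokenize(text):
    # Training-side toy tokenizer: splits differently (one token per char)
    return [hash(c) % 1000 for c in text.replace(" ", "")]

def logprob_tokens(token_ids):
    # Stand-in scorer over a token-ID sequence
    return -0.1 * len(token_ids)

def score_without_transport(text):
    # Re-tokenizing the text gives a DIFFERENT sequence than inference sampled
    return logprob_tokens(train_tokenize(text))

def score_with_transport(text, inference_ids):
    # Scoring the transported IDs reproduces inference exactly
    return logprob_tokens(inference_ids)
```

Without transport, the trainer scores a sequence the policy never actually emitted, so KL terms and gradients silently drift.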

— New Concept 16 — šŸž Hook: Don’t study math and history in the same minute if they clash—alternate blocks. 🄬 The Concept (Alternating Multi-Platform Optimization): Train phones, PCs, and web in cycles to reduce gradient interference. How it works: 1) Focus on one platform per stage; 2) Rotate cyclically; 3) Keep a shared backbone. Why it matters: Mixed batches cause tug-of-war gradients. šŸž Anchor: 10 steps mobile, then 10 steps PC, then 10 steps web—stable progress.
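The alternating schedule is just a cyclic block rotation, sketched below; block size and platform names are illustrative.

```python
from itertools import cycle

def alternating_schedule(platforms, steps_per_block, total_steps):
    """Emit a per-step training schedule that rotates platforms in blocks."""
    schedule = []
    for platform in cycle(platforms):        # mobile, pc, web, mobile, ...
        for _ in range(steps_per_block):
            schedule.append(platform)
            if len(schedule) == total_steps:
                return schedule
```

Each block updates the shared backbone on one platform at a time, so gradients from phone taps and desktop hotkeys never fight within a single batch.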

04Experiments & Results

The Tests: The team evaluated four big ability areas. (1) End-to-end automation in live environments (AndroidWorld, OSWorld-Verified, WebArena/VisualWebArena/WebVoyager/Online-Mind2Web), so the only thing that matters is whether the final goal is met. (2) Grounding (ScreenSpot-Pro, OSWorld-G/Refine, ScreenSpot-V2, MMBench-GUI-L2), to check if the agent points to the right UI element. (3) Knowledge and Memory (GUI Knowledge Bench and MemGUI-Bench), to see if it understands widgets/actions and remembers facts across steps. (4) Tool/MCP use (OSWorld-MCP, MobileWorld), blending GUI actions with tool calls.

The Competition: They compared against proprietary systems (e.g., Claude, GPT, Gemini) and open-source GUI agents (UI-TARS, OpenCUA, MAI-UI, EvoCUA, WebStar, DynaWeb, etc.).

The Scoreboard (with context):

  • OSWorld-Verified (PC use): 56.5 (like beating strong varsity teams; prior open agents were often in the 30–50 range).
  • AndroidWorld (mobile use): 71.6 (near the top; UI-TARS-2 at 73.3; many general models trail behind).
  • VisualWebArena (web visual tasks): up to 46.6 for 32B-Thinking and 40.8 for 8B-Thinking, well above most open baselines.
  • ScreenSpot-Pro (hard grounding): 80.3 with zoom-in (two-stage refine), a clear A+, surpassing even Gemini-3-Pro when using the refine tool; base 72.9 without crop is still SOTA among GUI agents.
  • GUI Knowledge Bench: 75.5 (top score, even above proprietary leaders like o3 at 73.3), showing strong understanding of widget functions and action parameters.
  • OSWorld-MCP (tool + GUI): 47.6 and MobileWorld 46.8 (competitive to state-of-the-art), proving the agent can decide when to use tools vs clicks.
  • Memory (MemGUI-Bench Easy): 27.1 (32B) and 22.9 (8B), beating prior native agents (e.g., 18.8).

Surprising Findings:

  • Small-but-smart: Even the 2B model beats much larger baselines on some tasks, showing great parameter efficiency. That’s like a very skilled kid outplaying older teens due to better practice.
  • No-crop strength: On ScreenSpot-Pro, the base (no crop) 72.9 already outperforms many big systems; adding crop zooms to 80.3, showing the value of two-stage refinement.
  • Thinking helps long horizons: Thinking variants consistently outperform Instruct on deeply multi-step tasks (e.g., WebVoyager jumps from 69.9 to 82.1), confirming the CoT/memory pipeline matters.
  • Ablations prove the recipe: Removing virtual-environment trajectories drops PC-Eval from 75.4% to 42.0% and Mobile-Eval from 86.7% to 50.0%. Removing Unified CoT drops OSWorld from 52.9% to 47.4% and AndroidWorld from 71.6% to 65.0%. These are big, clear signals.
  • Stable RL beats mixing: Alternating platform training avoids performance oscillations that appear when mixing phone/PC/web in one batch; the online rollout buffer sharply reduces useless all-success/all-fail groups, making updates informative.

Bottom Line: The trio—data flywheel, thought synthesis, MRPO—turns a multi-platform GUI agent from a fragile demo into a reliable assistant that scores at or near the top across 20+ benchmarks.

05Discussion & Limitations

Limitations:

  • Data hunger and drift: Although the Hybrid Data Flywheel reduces cost and speeds up collection, agents still crave lots of diverse, up-to-date data. Real apps change often; pop-ups, CAPTCHAs, and redesigned menus can reduce accuracy until data refreshes.
  • Tool/API fragility: MCP/tool schemas may change or fail. If the tool environment differs from training, calls can break.
  • Latency vs depth: Thinking models reason better but can be slower. Very long tasks may need time or parallelization.
  • Edge constraints: On-device (Instruct) models are small and fast, but sometimes weaker for complex planning without cloud help.
  • Safety and privacy: GUI agents can click sensitive things. Guardrails, permission prompts, and audit logs are essential in real deployments.

Required Resources:

  • Multi-platform sandboxes and some real devices for collection/evaluation.
  • GPUs/TPUs for pretraining, SFT, and large-scale RL.
  • A data pipeline for simulations, grounding augmentation, and tutorial mining.
  • Engineering for MRPO (rollout buffer service, token-ID transport, alternating schedules) and for tool/MCP integration.

When NOT to Use:

  • High-stakes, time-critical operations (e.g., medical device software) without strict verification and human oversight.
  • Environments dominated by aggressive anti-bot CAPTCHAs or rapidly changing UI skins that block stable operation.
  • Air-gapped or no-UI systems where text APIs or CLI integrations are safer and more direct.

Open Questions:

  • Robustness to UI change: How to adapt on the fly without full retraining (e.g., meta-learning, online adaptation)?
  • Stronger verification: Can we add formal checks or programmatic guards to prevent risky clicks?
  • Faster long-horizon planning: Can we compress thoughts, cache subroutines, or use hierarchical controllers to cut latency?
  • Tool-choice optimality: How to learn a cost-aware policy that balances GUI actions vs tool calls vs web automation?
  • Memory longevity: How to retain the right screen facts across very long sessions without bloating context?

06Conclusion & Future Work

Three-Sentence Summary: GUI-Owl-1.5 is a native multi-platform GUI agent trained by a Hybrid Data Flywheel, a Unified Thought-Synthesis pipeline for reasoning/memory, and a stable multi-platform RL method (MRPO). It clicks, types, scrolls, and calls tools/APIs across phones, PCs, and the web, reaching state-of-the-art scores on 20+ benchmarks. The models range from tiny on-device Instruct to large Thinking variants for cloud collaboration, balancing speed, privacy, and planning power.

Main Achievement: Showing that unifying data, thought, and RL into one tightly integrated system produces a reliable, broadly capable GUI agent that generalizes across devices and long tasks.

Future Directions: Adaptive grounding under UI changes, faster hierarchical planning, richer tool ecosystems (including local privacy-preserving tools), stronger safety verification, and broader simulations to cover edge cases (multi-window, multi-monitor, accessibility UIs). Exploring continual learning and better edge–cloud coordination can further reduce latency while keeping accuracy high.

Why Remember This: It demonstrates a practical recipe—data flywheel + thought synthesis + stable multi-platform RL—that turns GUI agents from fragile prototypes into dependable assistants. This pattern likely generalizes beyond GUIs to other embodied agents where seeing, thinking, and doing must work together across diverse environments.

Practical Applications

  • Personal assistant: Search the web for flight prices, copy results into a spreadsheet, and email a summary.
  • Enterprise RPA: Automate repetitive desktop workflows (report exports, file renames, uploads) across Windows and web dashboards.
  • Customer support: Navigate ticketing GUIs, check knowledge bases, and update records with tool calls for database lookups.
  • Accessibility: Voice-to-action control where users describe goals and the agent operates apps for them.
  • QA testing: Stress-test new UI versions in virtual environments to catch regressions before release.
  • Data entry: Extract values from websites/apps and reliably paste them into forms or sheets, remembering key fields.
  • Developer ops: Use MCP tools to read/edit files, run scripts, and verify logs while blending GUI steps.
  • Education: Provide a training ā€˜simulator’ that teaches students precise computer skills (drag/scroll/edit) safely.
  • Multi-device coordination: Start a task on phone (collect data), finish on PC (format in sheets), and verify on web.
  • IT onboarding: Guided, automated setup of software across devices using verification and memory for each step.
Tags: GUI agent Ā· visual grounding Ā· reinforcement learning Ā· MRPO Ā· chain-of-thought Ā· world modeling Ā· tool calling Ā· MCP Ā· virtual environments Ā· multi-platform automation Ā· trajectory synthesis Ā· edge–cloud collaboration Ā· context compression Ā· token-ID transport Ā· online rollout buffer