
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

Intermediate
Hanzhang Zhou, Xu Zhang, Panrong Tong et al. Ā· 12/26/2025
arXiv Ā· PDF

Key Summary

  • MAI-UI is a family of AI agents that can see, understand, and control phone and computer screens using plain language.
  • It fixes four big real-world problems: unclear user instructions, brittle long UI click-chains, no device–cloud teamwork, and changing, messy app screens.
  • The team created a self-evolving data pipeline so the agent keeps learning from new tasks, user chats, and tool (MCP) calls over time.
  • It adds two superpowers beyond tapping screens: asking users clarifying questions and using MCP tools (APIs) to shortcut long click sequences.
  • A native device–cloud collaboration system runs most steps on your phone for privacy and speed, and only calls a big cloud model when needed.
  • An online reinforcement learning setup trains the agent inside hundreds of live app environments at once, making it tougher and more reliable.
  • MAI-UI sets new records: 73.5% on ScreenSpot-Pro grounding, 76.7% on AndroidWorld navigation, and 41.7% on MobileWorld realistic tasks.
  • Scaling up RL from 32 to 512 parallel environments adds +5.2 points, and longer interaction budgets add up to +4.3 points.
  • Device–cloud collaboration boosts on-device success by 33%, cuts cloud calls by over 40%, and keeps sensitive data private.
  • Models come in sizes from 2B (on-device) to 235B-A22B (cloud), all beating strong baselines at similar scales.

Why This Research Matters

This work makes everyday technology feel more like a helpful teammate and less like a maze of buttons. By asking you clarifying questions, it avoids costly mistakes and saves time. Using MCP tools, it shortcuts long, fragile tap sequences into one or two reliable steps. The device–cloud teamwork keeps private moments on your phone, reduces costs, and still delivers cloud-level power when it truly helps. Training in live, changing app worlds makes it robust to popups, updates, and surprises that frustrate typical agents. Altogether, tasks like travel planning, file management, and work automation become faster, safer, and more dependable.

Detailed Explanation


01Background & Problem Definition

You know how using a new phone app can feel like walking through a maze, and sometimes you wish you could just say what you want and have the phone do it? That’s the dream behind GUI agents: helpers that can look at screens, understand what’s there, and tap, type, and swipe to get things done for you.

The World Before: For a while, AI could chat well, but controlling real apps was hard. Agents often memorized fixed click-paths and broke when a popup changed or a button moved. They also assumed your instructions were perfect, which isn’t true in real life. And they usually lived either fully on your device (small and limited) or fully in the cloud (powerful but slow, costly, and risky for privacy). In short, the promise of ā€œtell your phone what to doā€ kept bumping into messy reality.

The Problem: Researchers faced four roadblocks. First, unclear instructions—people say ā€œSend this to Mikeā€ but forget Mike’s email. Second, long UI-only action chains—10 clicks to do what an API could do in 1 step, so errors snowballed. Third, no practical device–cloud teamwork—either too weak (on-device only) or too risky/expensive (cloud only). Fourth, brittleness—agents trained on static data struggled when apps updated layouts or threw surprise dialogs.

šŸž Top Bread (Hook): Imagine you’re baking cookies and instead of measuring and mixing by hand every time, you could just ask a helper to fetch pre-made dough from the fridge when that’s faster and cleaner. 🄬 Filling (The Actual Concept): MCP tool calls are the agent’s way to use ready-made services (APIs) instead of clicking through screens.

  • What it is: MCP tool calls let the agent perform tasks through structured APIs (like maps or GitHub) instead of many on-screen taps.
  • How it works:
    1. The agent sees your instruction and the screen.
    2. It decides if tapping around is best or if a tool (API) is faster.
    3. If a tool helps, it sends the tool name and arguments.
    4. It reads the tool’s structured result and continues the task.
  • Why it matters: Without MCP, agents need long, fragile click chains; one wrong tap can ruin everything. šŸž Bottom Bread (Anchor): Instead of opening Maps, typing two addresses, and comparing routes, the agent calls a routing API twice and instantly picks the shorter drive.
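
To make the tap-vs-tool trade-off concrete, here is a minimal Python sketch of the route-comparison anchor above. The `maps_route_distance` function and its return format are hypothetical stand-ins for an MCP tool call, not the real Amap MCP interface; the distances come from the paper's walk-through example.

```python
# Minimal sketch of replacing a click chain with MCP tool calls.
# "maps_route_distance" and its return format are hypothetical stand-ins,
# not the actual MAI-UI or Amap MCP interface.

def maps_route_distance(origin: str, destination: str) -> dict:
    """Stand-in for one structured MCP tool call returning a JSON-like result."""
    fake_distances_m = {"Apartment A": 9618, "Apartment B": 9866}  # from the walk-through
    return {"destination": destination, "distance_m": fake_distances_m[destination]}

def pick_nearer(origin: str, candidates: list[str]) -> str:
    """Two tool calls replace opening Maps, typing addresses, and comparing by hand."""
    results = [maps_route_distance(origin, c) for c in candidates]
    return min(results, key=lambda r: r["distance_m"])["destination"]

print(pick_nearer("Home", ["Apartment A", "Apartment B"]))  # -> Apartment A
```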

šŸž Top Bread (Hook): Think of a school project folder that grows better every week because you keep adding corrected drafts, teacher notes, and new references. 🄬 Filling (The Actual Concept): A self-evolving data pipeline is a loop that keeps collecting, checking, and improving training data so the model steadily gets smarter.

  • What it is: A training pipeline that updates itself using new rollouts, human-reviewed fixes, and rejection-sampled examples.
  • How it works:
    1. Start with seed tasks from manuals, experts, and public data.
    2. Generate trajectories via humans and agents; keep good parts of partial successes.
    3. Filter with automated and human checks.
    4. Fine-tune the model, roll it out again, keep only better trajectories (rejection sampling), and repeat.
  • Why it matters: Without it, models stagnate on old data and fail when apps change. šŸž Bottom Bread (Anchor): The agent learns from many tries to send a file—keeping the correct first 6 steps from a failed 10-step attempt—and improves its next run.
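
A rough Python skeleton of that loop, with every stage stubbed out; the function names, data format, and three-round schedule are assumptions for illustration, not the paper's pipeline code. The filtering stage is fleshed out in the rejection-sampling sketch further down.

```python
# Skeleton of the self-evolving loop: roll out, filter, fine-tune, repeat.
# All bodies are stubs; names and signatures are illustrative assumptions.

def rollout(model, tasks):
    """Generate one trajectory per task with the current model (stubbed)."""
    return [{"task": t, "steps": [], "success": False} for t in tasks]

def keep_good(trajectories):
    """Stub for rejection sampling: keep full wins (and, in practice, good prefixes)."""
    return [t["steps"] for t in trajectories if t["success"]]

def fine_tune(model, data):
    return model  # stub: supervised fine-tuning on the filtered trajectories

def self_evolve(model, seed_tasks, rounds=3):
    for _ in range(rounds):
        data = keep_good(rollout(model, seed_tasks))
        model = fine_tune(model, data)
    return model
```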

šŸž Top Bread (Hook): You use a pocket calculator for quick math but ask a computer for big jobs; both work together depending on what you need. 🄬 Filling (The Actual Concept): Device–cloud collaboration lets a small on-device model handle safe, easy steps and a bigger cloud model jump in only when needed.

  • What it is: A native system where phone and cloud models share a memory and hand off tasks based on progress and privacy.
  • How it works:
    1. The local agent acts and monitors progress.
    2. If stuck and no sensitive data is on screen, it sends a concise error summary and history to the cloud.
    3. The cloud recovers the plan; the shared memory keeps both in sync.
  • Why it matters: Without it, you choose between weak but private vs. strong but costly and risky. šŸž Bottom Bread (Anchor): Your phone handles most email sorting; only when a tricky filter breaks, it briefly asks the cloud to fix the rule, then continues locally.
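
The routing rule at the heart of this is tiny. A hedged sketch follows; the two boolean inputs are assumed to come from the trajectory monitor and privacy monitor described later.

```python
# Minimal device-cloud routing rule: escalate to the cloud only when the local
# agent is off track AND nothing sensitive is on screen. Inputs are assumed to
# come from the trajectory monitor and privacy monitor.

def should_hand_off(on_track: bool, sensitive_on_screen: bool) -> bool:
    return (not on_track) and (not sensitive_on_screen)

print(should_hand_off(on_track=False, sensitive_on_screen=False))  # True: brief cloud handoff
print(should_hand_off(on_track=False, sensitive_on_screen=True))   # False: stay on-device
print(should_hand_off(on_track=True,  sensitive_on_screen=False))  # False: local agent continues
```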

šŸž Top Bread (Hook): When you learned to ride a bike, you fell, adjusted, and got better because you practiced for real. 🄬 Filling (The Actual Concept): Online reinforcement learning (RL) trains the agent by letting it try tasks in live app environments and learn from success/failure.

  • What it is: A feedback-driven way to improve policies by interacting with many stateful app emulators.
  • How it works:
    1. Roll out tasks across hundreds of Android emulators.
    2. Score whole trajectories (success/loops).
    3. Update the model; repeat with a smart curriculum.
  • Why it matters: Without online RL, agents overfit to static data and break on popups and updates. šŸž Bottom Bread (Anchor): The agent keeps failing login because of a new permission dialog; after RL, it learns to dismiss the popup and log in smoothly.
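
A toy version of trajectory-level scoring in that spirit: a success bonus plus a penalty for repeated (looping) actions. The weights and the loop definition are illustrative assumptions, not the paper's actual reward.

```python
# Toy trajectory-level reward: success bonus minus a penalty per repeated action.
# Weights are illustrative assumptions, not MAI-UI's actual reward shaping.

def trajectory_reward(actions: list[str], success: bool,
                      success_bonus: float = 1.0, loop_penalty: float = 0.1) -> float:
    repeats = sum(1 for a, b in zip(actions, actions[1:]) if a == b)
    return (success_bonus if success else 0.0) - loop_penalty * repeats

# Example: a successful run that repeated the same tap once
print(trajectory_reward(["open_app", "click_login", "click_login", "type_password"], True))
# -> 0.9
```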

šŸž Top Bread (Hook): Using a map works best when you can spot the exact street you need. 🄬 Filling (The Actual Concept): GUI grounding is the skill of pointing to the exact UI element that matches a natural-language instruction.

  • What it is: Given a screenshot and text, predict the target coordinates.
  • How it works:
    1. Understand the layout, meaning, and relations of on-screen elements.
    2. Reason from different perspectives (appearance, function, location, intent).
    3. Output a precise point; optionally zoom in to refine.
  • Why it matters: Without good grounding, the agent clicks the wrong button and derails the task. šŸž Bottom Bread (Anchor): When told ā€œmake text bold,ā€ the agent finds the bold icon among many toolbar buttons and taps it.
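
Grounding success is easy to verify: the predicted point either lands inside the target element's bounding box or it does not, which is also the kind of simple reward the paper uses for grounding RL. A small sketch, with a made-up bounding box for the bold icon:

```python
# Point-in-box check for a grounding prediction. Box format assumed to be
# (left, top, right, bottom) in pixels; the bold-icon box is invented.

def point_in_box(point: tuple[int, int], box: tuple[int, int, int, int]) -> bool:
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

bold_button_box = (412, 88, 460, 132)             # hypothetical "bold" icon location
print(point_in_box((436, 110), bold_button_box))  # True: correct tap
print(point_in_box((500, 110), bold_button_box))  # False: missed the icon
```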

šŸž Top Bread (Hook): Navigating a mall means reading signs, choosing turns, and sometimes asking for help. 🄬 Filling (The Actual Concept): Mobile navigation is multi-step planning and acting to complete goals across apps.

  • What it is: A sequence of taps, types, swipes, tool calls, and sometimes clarifying questions.
  • How it works:
    1. Read the instruction and screen(s).
    2. Plan the next action.
    3. Execute, observe, and repeat until done.
  • Why it matters: Without solid navigation, the agent stalls or loops. šŸž Bottom Bread (Anchor): ā€œFind last month’s resume in Downloads and email it with subject ā€˜candidates_cvā€™ā€ā€”the agent opens Files, filters by date, picks the file, opens email, fills fields, attaches, and sends.
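
In code, navigation boils down to a perceive-plan-act loop. The skeleton below assumes `env` and `policy` interfaces that are not specified in the paper; it is a sketch of the control flow, not the real agent runtime.

```python
# Skeleton of the perceive -> plan -> act loop behind multi-step navigation.
# `env` and `policy` are assumed interfaces, not the paper's actual runtime.

def navigate(instruction, env, policy, max_steps=50):
    history = []
    for _ in range(max_steps):
        screen = env.screenshot()                      # perceive the current UI
        action = policy(instruction, screen, history)  # plan the next step
        history.append(action)
        if action["type"] == "terminate":              # done (success or give-up)
            break
        env.execute(action)                            # tap / type / swipe / ask_user / mcp_call
    return history
```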

Failed Attempts and the Gap: Earlier systems skipped clarifying questions, avoided external tools, and had no clean device–cloud handoff or robust online RL. MAI-UI fills this gap with a unified approach: ask the user when needed, call MCP tools to shorten fragile click chains, train in dynamic environments with online RL at massive scale, and route work smartly between phone and cloud with a shared memory.

Real Stakes: This matters for everyone who uses phones and computers. It means faster, safer help with chores like travel planning, work tasks like checking GitHub commits, and private steps like handling passwords—done mostly on-device, with cloud help only when truly useful.

02Core Idea

The ā€œAha!ā€ Moment in one sentence: Teach one agent to handle real life by (1) asking humans clarifying questions, (2) using tools (MCP) instead of endless taps, (3) learning by doing in large, live app worlds, and (4) teaming phone and cloud models natively with a shared memory.

Three analogies:

  1. Swiss Army Helper: Sometimes you need the scissors (MCP tool), sometimes a quick question to the camper (ask_user), sometimes you phone a friend (cloud), but you mostly work with the tool in your pocket (device).
  2. GPS + Shortcuts: You can drive street by street (UI taps), but highways (MCP) save time; if traffic looks weird (popup), the system learns from trips (RL) and re-routes; if the onboard map is fuzzy, it asks the control center (cloud) briefly.
  3. Classroom + Library: The student (device agent) tries problems and learns from feedback (RL), asks the teacher for hints (ask_user), looks up references (MCP tools), and only goes to the big library (cloud) when the workbook is not enough.

Before vs After:

  • Before: Agents followed brittle click scripts, guessed when users were vague, and either lived on-device (weak) or in-cloud (risky/expensive), trained mostly on frozen data.
  • After: Agents clarify missing details, skip long click-chains with MCP tools, learn robustness via massive online RL, and collaborate natively between device and cloud with privacy-aware routing.

Why it works (intuition, no equations):

  • Reduce error chains: Replacing 10 taps with 1 API call shrinks places to fail.
  • Close the info gap: Asking users avoids wrong assumptions early.
  • Train where life happens: Practicing in live, changing environments teaches recovery from popups and layout shifts.
  • Right tool, right place: On-device is fast and private; cloud is powerful; a shared memory makes handoff seamless.
  • Evolve the data: Keeping only good or good-prefix trajectories steadily raises quality without wasting effort.

Building Blocks (with kid-friendly ā€œSandwichā€ explanations for new ideas):

šŸž Top Bread (Hook): When you’re not sure what your friend wants for lunch, you ask! 🄬 Filling: Agent–user interaction means the agent pauses and asks for missing info instead of guessing.

  • What it is: A special ask_user action for clarifications or consent.
  • How it works: Detect lack of key details → ask → read answer → continue the plan.
  • Why it matters: Without it, the agent sends the wrong thing to the wrong person. šŸž Bottom Bread (Anchor): ā€œSend this to Mike.ā€ ā€œWhat’s Mike’s email?ā€ ā€œmike@alibaba.com.ā€ Task proceeds correctly.
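
A sketch of the ask_user pattern using the Mike example; the action dictionaries below are a guess at the shape of the agent's output, not the model's real action schema.

```python
# Sketch of the ask_user pattern: detect a missing detail, ask, then continue.
# The action dict format is an illustrative assumption.

def next_action(task: dict) -> dict:
    if not task.get("recipient_email"):                    # key detail missing
        return {"type": "ask_user",
                "question": "What is Mike's email address?"}
    return {"type": "ui_click", "target": "send_button"}   # proceed with the plan

task = {"goal": "Send this file to Mike"}
print(next_action(task))                        # -> ask_user with a clarifying question
task["recipient_email"] = "mike@alibaba.com"    # the user's reply (as in the example above)
print(next_action(task))                        # -> continues toward sending
```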

šŸž Top Bread (Hook): You know how chefs taste as they cook and redo steps if needed? 🄬 Filling: Rejection sampling keeps only the best attempts or best prefixes from many tries.

  • What it is: A filter that throws away poor rollouts but saves correct parts.
  • How it works: Generate many trajectories → judge accuracy step by step → keep complete wins and longest correct prefixes.
  • Why it matters: Without it, the model learns from noisy failures and gets confused. šŸž Bottom Bread (Anchor): If step 7 went wrong, but steps 1–6 were perfect, we keep 1–6 and train on them.
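
Here is what "keep the longest correct prefix" looks like as code; the per-step `correct` flag stands in for the step-level judging the paper describes.

```python
# Keep the longest correct prefix of a trajectory: a failed run still
# contributes its good opening steps. Step format is an assumption.

def longest_correct_prefix(steps: list[dict]) -> list[dict]:
    prefix = []
    for step in steps:
        if not step.get("correct", False):
            break                      # stop at the first judged-incorrect step
        prefix.append(step)
    return prefix

steps = [{"action": f"step_{i}", "correct": i < 6} for i in range(10)]  # step 7 goes wrong
print(len(longest_correct_prefix(steps)))   # -> 6: keep steps 1-6 and train on them
```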

šŸž Top Bread (Hook): If a picture is crowded, you zoom in to tap the right tiny button. 🄬 Filling: Zoom-in strategy refines grounding by cropping around a first guess and predicting again.

  • What it is: Two-pass pointing: coarse, then fine.
  • How it works: Predict rough spot → crop and enlarge → predict precise dot.
  • Why it matters: Without it, small icons on dense screens are easy to miss. šŸž Bottom Bread (Anchor): Finding a tiny ā€œsettingsā€ cog in a complex app toolbar.
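
A sketch of the coarse-then-fine pass: `predict` stands in for the grounding model (returning a point, relative to the crop when one is given), and the crop size is an arbitrary choice for illustration.

```python
# Coarse-then-fine zoom-in grounding. `predict(image, instruction, region=None)`
# is a stand-in for the grounding model and returns (x, y), relative to `region`
# when a region is supplied. Crop size is an arbitrary illustrative choice.

def zoom_in_ground(predict, screenshot, instruction, crop_half=200):
    x0, y0 = predict(screenshot, instruction)                  # pass 1: rough point
    region = (max(0, x0 - crop_half), max(0, y0 - crop_half),
              x0 + crop_half, y0 + crop_half)                  # window around the guess
    dx, dy = predict(screenshot, instruction, region=region)   # pass 2: refine in the crop
    return region[0] + dx, region[1] + dy                      # map back to screen coords
```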

šŸž Top Bread (Hook): A referee watches the game and summarizes fouls so the coach can fix tactics fast. 🄬 Filling: The trajectory monitor checks if actions still follow the goal and writes an error summary.

  • What it is: A built-in checker that detects loops, wrong inputs, and drifts.
  • How it works: Every few steps, test alignment; on deviation, write a short error note and (if safe) switch to cloud.
  • Why it matters: Without it, the agent keeps making the same mistake. šŸž Bottom Bread (Anchor): ā€œLooping on login: missing password field.ā€ The cloud uses that hint to recover.
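
A toy loop detector in that spirit, shown below; the window size and message format are assumptions, not the monitor's actual logic.

```python
# Toy loop detection: if the last few actions are identical, flag a deviation
# and write a short error summary. Window size and message are assumptions.

def detect_loop(actions: list[str], window: int = 3) -> str | None:
    if len(actions) >= window and len(set(actions[-window:])) == 1:
        return f"Looping on '{actions[-1]}' for {window} steps"
    return None

history = ["open_login", "click_submit", "click_submit", "click_submit"]
print(detect_loop(history))   # -> "Looping on 'click_submit' for 3 steps"
```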

šŸž Top Bread (Hook): Sharing a single notebook helps teammates pick up where you left off. 🄬 Filling: Unified Trajectory Memory stores the whole story so either device or cloud can resume instantly.

  • What it is: A synced history of instructions, screenshots, thoughts, and actions.
  • How it works: Record each step; project the same history into the formats both models expect.
  • Why it matters: Without shared memory, handoffs lose context. šŸž Bottom Bread (Anchor): The cloud reads your last 5 steps and takes over exactly from there.
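
A minimal sketch of such a shared log: one append-only trajectory that can be projected into whatever format either model expects. The chat-style projection below is an assumption, not the paper's actual serialization.

```python
# Minimal unified trajectory memory: one shared, append-only log with a
# projection helper. The chat-style format is an illustrative assumption.

class TrajectoryMemory:
    def __init__(self):
        self.steps = []   # each step: screenshot reference, thought, action

    def record(self, screenshot_id: str, thought: str, action: dict):
        self.steps.append({"screenshot": screenshot_id,
                           "thought": thought, "action": action})

    def as_messages(self, last_n: int | None = None):
        """Project the shared history into a format either model can consume."""
        steps = self.steps if last_n is None else self.steps[-last_n:]
        return [{"role": "assistant",
                 "content": f"[{s['screenshot']}] {s['thought']} -> {s['action']}"}
                for s in steps]

memory = TrajectoryMemory()
memory.record("screen_001", "Open the Files app", {"type": "ui_click", "target": "Files"})
print(memory.as_messages(last_n=5))   # the cloud agent can resume from exactly here
```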

Put together, these pieces make an agent that is careful (asks), clever (uses tools), tough (trained in live worlds), and considerate (private-by-default with smart cloud boosts).

03Methodology

At a high level: User instruction + screenshots → (Perceive and Ground) → (Decide: UI action, ask_user, or mcp_call) → (Execute locally or hand off to cloud with shared memory) → (Repeat with monitoring) → Output (task done or answer).

Step A: Build grounding and perception

  • What happens: Collect diverse screenshots from virtualized OS containers and public datasets; generate multi-task perception data (QA, captions, state) and multi-perspective grounding instructions (appearance, function, location, intent). Train with supervised fine-tuning (SFT), then reinforce with RL using simple, reliable rewards (format + point-in-box) and an optional zoom-in inference trick.
  • Why it exists: If the agent can’t point to the right element, everything after falls apart.
  • Example data: ā€œInstruction: ā€˜Jump to apps starting with K.’ Screen: app drawer. Model: predicts the ā€˜K’ scroller index button’s coordinates.ā€

Step B: Create the self-evolving navigation data pipeline

  • What happens: Start with seed tasks from app manuals, expert designs, and filtered open-source data. Expand tasks by tweaking parameters (L1) or swapping core objects (L2). Produce trajectories via human annotation and model rollouts, then judge both overall success and longest correct prefixes. Alternate training and rollout with rejection sampling to keep only high-quality or good-prefix examples.
  • Why it exists: Static, one-path datasets miss real-world diversity and waste partial successes.
  • Example: A 12-step ā€˜send a file’ plan fails at step 9; we still keep steps 1–8 to teach solid early navigation.
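
To illustrate the L1/L2 expansion idea, here is a tiny sketch: an L1 variant keeps the task template and tweaks its parameters, while an L2 variant swaps the core object or action. The template and values are invented for illustration.

```python
# Toy task expansion: L1 = parameter tweaks on a template, L2 = core-object swap.
# Template and values are made up for this sketch.

import itertools

template = "Send {file} from Downloads to {contact} with subject '{subject}'"

l1_variants = [template.format(file=f, contact=c, subject=s)       # L1: parameter tweaks
               for f, c, s in itertools.product(
                   ["resume.pdf", "report.docx"],
                   ["HR_chen@gmail.com", "mike@alibaba.com"],
                   ["candidates_cv", "weekly_report"])]

l2_variant = "Upload resume.pdf from Downloads to the shared 'candidates_cv' folder"  # L2: object swapped

print(len(l1_variants), "L1 variants; first:", l1_variants[0])
print("L2 variant:", l2_variant)
```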

Step C: Teach agent–user interaction and MCP tool use

  • What happens: Mix in tasks where key info is missing, forcing ask_user; add tasks designed for tool benefits (e.g., Amap, GitHub). For MCP, store tool schemas, arguments, and structured results; for ask_user, log Q&A turns and continue.
  • Why it exists: Real users are vague; long click chains waste time when APIs can help.
  • Example: ā€œEmail the latest 3 GitHub commits.ā€ The agent calls a GitHub MCP to fetch authors/messages, then composes the email.

Step D: Online RL in dynamic environments

  • What happens: Run large-scale, stateful Android emulators in Docker, coordinated by an Environment Manager across machines. Asynchronous rollouts keep GPUs busy; hybrid parallelism (TP+PP+CP) trains ultra-long sequences. A curriculum shifts sampling from easy to hard tasks. Rewards combine success detection (rule-based or judge-LLM) and loop penalties; replay keeps learning signals flowing.
  • Why it exists: Only practicing in living, changing apps teaches recovery from popups, permission prompts, and layout shifts.
  • Example: The agent learns to dismiss a new notification permission request that wasn’t present in static data.
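
A toy curriculum sampler that shifts probability from easy to hard tasks as training progresses; the linear schedule and example tasks are illustrative assumptions, not the paper's actual curriculum.

```python
# Toy curriculum: sample mostly easy tasks early, mostly hard tasks later.
# The linear schedule and the task lists are illustrative assumptions.

import random

def sample_task(easy_tasks, hard_tasks, progress: float):
    """progress in [0, 1]: fraction of training completed."""
    p_hard = min(1.0, max(0.0, progress))          # probability of drawing a hard task
    pool = hard_tasks if random.random() < p_hard else easy_tasks
    return random.choice(pool)

easy = ["open the clock app", "turn on Wi-Fi"]
hard = ["book the cheaper of two flights and email the receipt"]
print(sample_task(easy, hard, progress=0.1))   # usually easy early in training
print(sample_task(easy, hard, progress=0.9))   # usually hard near the end
```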

Step E: Device–cloud collaboration (DCC)

  • What happens: The Local Agent both acts and monitors. Every few steps, it checks alignment. If deviating and no sensitive data is present, it sends an error summary and unified history to the Cloud Agent. The Local Unified Trajectory Memory keeps both sides synced. A privacy monitor blocks cloud switch when sensitive content (like passwords) appears.
  • Why it exists: On-device is private and quick; cloud is powerful. Smart routing gets the best of both while protecting privacy and cutting costs.
  • Example: On AndroidWorld, 42.7% of steps run locally and 40.5% of tasks finish entirely on-device, reducing cloud calls >40%.

The Secret Sauce:

  • Multi-perspective grounding (appearance, function, location, intent) injects structure into ā€œhow to thinkā€ before pointing.
  • Rejection sampling + prefix reuse steadily lifts data quality and the model’s pass@1.
  • Massive, asynchronous online RL scales to 500+ concurrent GUI environments, turning rare edge cases into learnable patterns.
  • Native DCC with error summaries makes handoffs brief, accurate, and privacy-aware.

Concrete mini walk-through (with real actions):

  • Instruction: ā€œFind resumes downloaded this month and email to HR with subject ā€˜candidates_cv.ā€™ā€
    1. Perceive: See Downloads; filter by date; select files. (click [411,1297], long_press [419,930])
    2. Missing info: No HR email provided → ask_user: ā€œPlease provide the recipient email and whether to add body text.ā€
    3. User replies: ā€œHR_chen@gmail.com, no body.ā€
    4. Compose Email: type recipient, subject, attach files, send (type HR_chen@gmail.com, click send).
    5. Monitor: Confirms alignment; terminate success.
  • Instruction: ā€œCompare driving time to two apartments and send the nearer address to Mia.ā€
    1. Parse SMS addresses.
    2. Use MCP Amap calls for both destinations; read distances (e.g., 9618m vs 9866m).
    3. Pick the shorter; open Messages; send to Mia.
    4. If stuck, local monitor summarizes error; if safe, brief cloud handoff recovers.

04Experiments & Results

The Test: The team measured two core skills—GUI grounding (pointing to the right on-screen element) and navigation (completing multi-step tasks in real, dynamic environments). They also tested realistic extras: user clarifications and MCP tool use. Why? Because real life is messy, and success means clicking the right thing and finishing the job even when apps change.

The Competition: MAI-UI was compared against strong open-source and proprietary systems, including GUI-Owl, UI-Venus, UI-Tars-2, Seed1.8, Qwen3-VL variants, and Google/Anthropic/OpenAI models in agentic setups where relevant. Evaluations covered standardized benchmarks and the tougher MobileWorld suite.

The Scoreboard with context:

  • ScreenSpot-Pro (grounding, pro high-res screens): 73.5% with zoom-in, surpassing Gemini-3-Pro and Seed1.8. Think: top of the class on crowded toolbars and dense layouts.
  • MMBench GUI L2 (hierarchical instructions): 91.3%. That’s like scoring an A+ on both basic and advanced UI reasoning.
  • OSWorld-G (complex desktop grounding): 70.9% with zoom-in, beating prior bests. Like finding needles in multiple haystacks reliably.
  • UI-Vision (diverse reasoning): 49.2% with zoom-in, +12.4 points over the previous best. Big jump on tricky, varied queries.
  • AndroidWorld (online navigation): 76.7% success. That’s the new state of the art—better than UI-Tars-2 (73.3%) and Gemini-2.5-Pro (69.7%).
  • MobileWorld (realistic, includes user interaction and MCP): 41.7% overall, far ahead of end-to-end baselines (+20.8 over the best), competitive with strong agentic frameworks.
  • Subset wins: On MobileWorld’s User-Interaction tasks, 51.1% (beats best end-to-end baseline by +18.7). On MCP tasks, 37.5% (beats best end-to-end baseline by +32.1).

Scaling and Ablations made meaningful:

  • Parallel environments: Going from 32 to 512 live environments improves AndroidWorld +5.2 points. More practice fields = better play.
  • Longer interaction budgets: Allowing up to 50 steps adds up to +4.3 points over shorter budgets. More time to explore leads to better learning.
  • RL vs just SFT: Online RL adds +3.5 to +6.0 points depending on model size. Practicing live matters.
  • Device–cloud collaboration: +33% on-device performance gain over local-only; >40% fewer cloud calls. That’s like finishing almost half your errands without leaving home.
  • Error summaries: Adding the local monitor’s concise error note at handoff yields +6.9 points vs. no summary—small notes, big payoffs.

Surprising findings:

  • Small can be mighty: The 2B on-device model (with DCC support) outperforms some much larger cloud-only systems on AndroidWorld.
  • Zoom-in matters a lot on dense screens: A simple coarse-then-fine pass lifts several benchmarks by notable margins.
  • Hybrid verification (rules + judge LLM) reaches 83% agreement with humans—reliable enough to scale RL without huge manual bottlenecks.
  • Privacy-first routing still wins: Keeping sensitive steps on-device didn’t just protect users; it also improved efficiency and final success rates.

05Discussion & Limitations

Limitations:

  • Tool reliance coverage: MCP tools help a lot, but only where good APIs exist and are integrated; rare or proprietary tasks may still require long UI paths.
  • Environment scope: While 35+ apps and 500+ parallel emulators are large, the mobile app universe is far bigger, and some niche behaviors remain unseen.
  • Verification noise: Judge-LLM evaluators are strong but not perfect; edge cases can mislabel success or failure.
  • Latency spikes: Cloud handoffs are minimized but can still add delay on poor networks.
  • Maintenance cost: Keeping containers, app versions, and tool schemas updated is ongoing engineering work.

Required Resources:

  • Training: Multi-GPU clusters supporting hybrid parallelism (TP+PP+CP), storage for long trajectories, and orchestration for hundreds of emulators.
  • Serving: On-device runtime for 2B models, a scalable cloud endpoint for 32B/235B variants, and the device–cloud memory/monitor components.
  • Tooling: MCP server(s), tool catalogs, and privacy monitors.

When NOT to Use:

  • Ultra-latency-critical, offline-only scenarios with no tolerance for brief pauses (e.g., emergency control) where even short handoffs are unacceptable.
  • Apps with strict anti-automation or CAPTCHAs that block agent interactions.
  • Tasks demanding human judgment or legal consent beyond clarification (e.g., medical decisions, financial approvals) without a human-in-the-loop.

Open Questions:

  • Generalization across OS updates: How fast can models adapt to big UI overhauls without retraining?
  • Tool reasoning: Can the agent autonomously discover new tools and learn their schemas from docs?
  • Safety and consent: What are best practices for asking permission and handling sensitive actions across cultures and regulations?
  • Long-horizon memory: How to compress and retrieve very long histories even more efficiently on-device?
  • Co-learning with users: Can personalized preferences (with privacy) improve planning without overfitting?

06Conclusion & Future Work

Three-sentence summary: MAI-UI turns GUI agents into practical helpers by adding clarifying dialogue, tool (MCP) shortcuts, massive online RL in live app worlds, and a native device–cloud teamwork system. It sets new records in grounding and mobile navigation and shows strong, privacy-aware performance with far fewer cloud calls. The self-evolving data pipeline keeps the model learning from better and better examples over time.

Main achievement: Unifying ask_user, MCP tools, online RL at scale, and device–cloud collaboration with shared memory into one coherent, real-world-ready agent family—from a fast 2B on-device model to a powerful 235B-A22B cloud model.

Future directions:

  • Expand app/tool coverage and automatic tool discovery; add desktop and cross-device workflows.
  • Sharpen privacy monitors and consent flows; formalize safety guarantees.
  • Compress long histories further for even stronger on-device autonomy.
  • Broaden benchmarks with more multi-user, multi-app, and time-extended tasks.

Why remember this: MAI-UI shows how to make screen-using AI behave like a careful, capable assistant—asking when unsure, taking smart shortcuts, practicing in real conditions, and teaming device and cloud to be both strong and safe.

Practical Applications

  • Plan a commute by comparing live driving routes via MCP and sending the best one to a friend automatically.
  • Organize files by date in Downloads and email the right attachments with the correct subject after asking for missing details.
  • Summarize the latest GitHub commits via MCP and send a formatted report to teammates.
  • Fill out forms across apps (contacts, calendar, email) while handling surprise permission popups gracefully.
  • Manage shopping carts and orders across e-commerce apps, even when UI layouts change between versions.
  • Triage messages: open apps, filter unread items, and draft replies, asking for approval on sensitive messages.
  • Schedule events by checking calendars, proposing times, and confirming details with the user if information is missing.
  • Perform multi-app research workflows (browser + notes + email) with fewer errors and faster tool-assisted steps.
  • Run on-device for routine tasks and hand off tricky steps to the cloud only when needed to preserve privacy and reduce cost.
  • Assist users with accessibility needs by reliably locating small buttons and complex controls using zoom-in grounding.
#GUI agent#GUI grounding#mobile navigation#Model Context Protocol#device–cloud collaboration#online reinforcement learning#self-evolving data pipeline#rejection sampling#trajectory monitor#zoom-in strategy#AndroidWorld#MobileWorld#GRPO#tool-augmented agents#privacy-aware AI