
Step-GUI Technical Report

Intermediate
Haolong Yan, Jia Wang, Xin Huang et al. Ā· 12/17/2025
arXiv Ā· PDF

Key Summary

  • This paper builds Step-GUI, a pair of small-but-strong GUI agent models (4B/8B) that can use phones and computers by looking at screenshots and following instructions.
  • The key invention is CSRS (Calibrated Step Reward System), which turns messy model playthroughs into trustworthy training data by verifying success at the end and then deriving detailed step-level labels from that outcome.
  • CSRS achieves over 90% annotation accuracy while cutting labeling costs by 10–100Ɨ compared to traditional step-by-step human annotation.
  • A three-stage, self-evolving training loop (Mid-Train → Cold-Start → RL with verifiable rewards) steadily boosts performance round after round.
  • Step-GUI-8B sets a new state of the art among open models on AndroidWorld (80.2% pass@3) and leads on several GUI grounding benchmarks such as ScreenSpot-Pro and OSWorld-G.
  • GUI-MCP is a new standard protocol that lets any LLM control devices via a dual-layer toolset (low-level clicks and swipes; high-level task delegation), with a high-privacy mode that keeps images on-device.
  • AndroidDaily is a practical benchmark based on how people actually use their phones every day, with 3,146 static actions and 235 end-to-end tasks.
  • The 4B model is competitive and can run locally on consumer hardware, which is important for privacy and cost.
  • The method preserves general multimodal skills while adding strong GUI abilities, so the model doesn’t become a narrow specialist.
  • Overall, this work connects better data, safer deployment, and realistic evaluation to make GUI agents useful in real life.

Why This Research Matters

Real people want AI that actually finishes tasks on their phones and computers, not just talks about them. This work shows a practical way to train such agents reliably and cheaply by anchoring learning to successful task runs. It adds a standard, privacy-preserving interface (GUI-MCP) so assistants can work across different devices without leaking screenshots. A realistic benchmark (AndroidDaily) keeps the focus on what users really do daily, like rides, shopping, and payments. Small models that run locally are vital for cost, speed, and privacy, and this paper delivers competitive 4B and 8B options. The result is a path toward trustworthy, on-device helpers that save time and protect personal data.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook): Imagine teaching a friend to use your phone just by showing them screenshots and telling them what to tap. At first, they might guess wrong a lot. If you want them to learn fast, you need clear feedback and good examples.

🄬 Filling (The Actual Concept):

  • What it is: This paper is about building AI helpers (GUI agents) that can see a screen, understand what to do, and then tap, type, or swipe to finish tasks on phones and computers.
  • How it works (story of the field):
    1. Before: AI was good at chatting and recognizing images, but using apps step-by-step was much harder because tasks take many steps and each step depends on the last.
    2. The problem: Getting high-quality training data for long, multi-step tasks was too expensive and too subjective; human labelers often disagreed on the exact right step.
    3. Failed attempts: People tried self-critique and filtering that work for single-turn questions, but these struggled for multi-step app use where you need a chain of correct actions.
    4. The gap: We needed a way to cheaply create reliable training signals for entire multi-step journeys (trajectories), not just single actions.
    5. The stakes: If we solve this, we can build real phone/PC helpers for rides, shopping, payments, and more—safely and privately.
  • Why it matters: Without better data and evaluation, GUI agents stay fragile: they click the wrong button, get lost in menus, or reveal private screens to the cloud.

šŸž Bottom Bread (Anchor): Think about ordering a pizza in a delivery app: open app → find restaurant → choose size → checkout → pay. If the AI messes up step 2, everything after breaks. We need data and training that help it get every step right, not just the final answer.

šŸž Top Bread (Hook): You know how a good coach first teaches general skills, then fixes your weak spots, and finally pushes you with tough drills?

🄬 Filling (The Actual Concept) — Multimodal Large Language Models (the starting point):

  • What it is: Multimodal large language models (MLLMs) are smart AIs that can read text, look at images, and reason about both together.
  • How it works:
    1. They take text + images (like screenshots) as input.
    2. They build a shared understanding of what’s on the screen and what the user wants.
    3. They generate step-by-step plans and actions.
  • Why it matters: Without a model that understands both words and pictures, you can’t tell a computer, ā€œFind the subway schedule and share it,ā€ and expect it to understand the app screen.

šŸž Bottom Bread (Anchor): When you ask, ā€œOpen settings and turn on Wi‑Fi,ā€ the model must read the instruction (text) and spot the settings icon (image) to act correctly.

šŸž Top Bread (Hook): Imagine playing a video game level: it’s not enough to press a single button—you need a whole sequence of moves to win.

🄬 Filling (The Actual Concept) — GUI Agents (what we want to build):

  • What it is: A GUI agent is an AI that performs multi-step actions on apps and desktops by looking and thinking.
  • How it works:
    1. Observe the current screen.
    2. Decide the next action (click, type, swipe).
    3. Execute and observe the new screen.
    4. Repeat until the task is done.
  • Why it matters: Without this loop, the agent can’t adapt if a menu looks different or a page takes longer to load.

šŸž Bottom Bread (Anchor): To ā€œbuy movie tickets at 7 pm,ā€ the agent must search, filter showtimes, pick seats, and check out—several dependent steps.

Now, why was training so hard? Because labeling every single step by hand is slow, expensive, and sometimes people disagree about the ā€œbestā€ step. Older self-improvement tricks worked well when a single answer could be checked, but not for long journeys with many branches. On top of this, practical use needs two more things: a standard way for any LLM to talk to devices (so tools plug-and-play), and a way to keep private screens from leaving your phone. Finally, we need a benchmark that looks like real daily life, not just synthetic puzzles.

šŸž Top Bread (Hook): Think of a universal remote that works for TVs, game consoles, and speakers—and a testing course that matches real living rooms.

🄬 Filling (The Actual Concept):

  • What it is: This paper introduces three missing pieces:
    1. A new data pipeline (CSRS) that converts full task runs into reliable training signals at low cost.
    2. GUI-MCP, a standard, privacy-friendly way for LLMs to control different devices.
    3. AndroidDaily, a benchmark based on what people actually do on phones every day.
  • How it works: Train smarter with CSRS, deploy safely with GUI-MCP, and test realistically with AndroidDaily.
  • Why it matters: Together, these move GUI agents from lab demos toward daily assistants.

šŸž Bottom Bread (Anchor): With these, an agent can safely help you hail a ride, pay for coffee, or share a file—without shipping your screenshots to the cloud—and we can measure if it truly works in common apps.

02 Core Idea

šŸž Top Bread (Hook): You know how a teacher can grade your science project by checking if the volcano actually erupts at the end, even if your steps were a bit messy? If it erupts, you did something right—and the teacher can still help you polish your steps later.

🄬 Filling (The Actual Concept) — CSRS (Calibrated Step Reward System):

  • What it is: CSRS is a training pipeline that first checks if a whole task run (trajectory) succeeded, then uses powerful LLMs to write rich, step-by-step explanations and labels anchored to that success.
  • How it works:
    1. Let the model try a full task (a rollout), producing a sequence of screens and actions.
    2. Use a verifier or a human to give a simple final verdict: success or failure.
    3. If success, extract seven kinds of high-quality training signals (like progress tracking, state summaries, effect prediction, reflection, verification, intent execution, action prediction).
    4. If failure, still keep the knowledge parts (but not the wrong actions) to learn from mistakes.
    5. Train the model on this structured data; repeat next round with a stronger model.
  • Why it matters: Without anchoring to the final outcome, step-by-step labels can be noisy or wrong; CSRS ties all learning to a high-confidence, trajectory-level check.

šŸž Bottom Bread (Anchor): Like scoring a maze run: you only mark the path as ā€˜good’ if it reaches the exit. Then you study that path to teach others—while wrong turns from failed runs become lessons, not instructions.

šŸž Top Bread (Hook): Imagine checkpoints on a hiking trail. Even if you don’t record every footstep perfectly, reaching the summit tells you the overall path was good; then you can align your notes to that success.

🄬 Filling (The Actual Concept) — Trajectory-Level Calibration:

  • What it is: It’s the idea of using the final success/failure of an entire run to calibrate all the fine-grained labels inside it.
  • How it works:
    1. Judge the whole attempt first (did we finish the task?).
    2. Use that verdict to filter and trust the steps (keep full labels for success; keep only knowledge for failures).
    3. This reduces noise and keeps learning stable.
  • Why it matters: Without this, models might learn from incorrect step labels and get confused.

šŸž Bottom Bread (Anchor): If a cake tastes great at the end, your recipe notes are probably good; if it flops, you still keep the tips you learned (like ā€˜preheat the oven’), but you don’t copy the wrong baking time.

šŸž Top Bread (Hook): Playing a game that gives you clear points when you win helps you learn faster than vague hints every few seconds.

🄬 Filling (The Actual Concept) — RLVR (Reinforcement Learning with Verifiable Rewards):

  • What it is: A learning method where the model gets solid reward signals from checks we can verify (like correct location clicks, valid text typed, or final task completion), not just guesses.
  • How it works:
    1. Define rewards we can measure (e.g., how close a click is to the target; did the typed text match?).
    2. Add soft, LLM-judge rewards for reasoning quality.
    3. Optimize the model to pick actions that increase these rewards.
  • Why it matters: Without verifiable rewards, the model may ā€˜hack’ the scoring or learn unstable tricks.

šŸž Bottom Bread (Anchor): Like bowling with clear pins and a scoreboard—no arguing about the score, so you focus on improving your throw.

šŸž Top Bread (Hook): Think of a Swiss Army knife that can read signs and follow instructions while also using tools.

🄬 Filling (The Actual Concept) — Multimodal Large Language Models (as the brain):

  • What it is: The model understands both the instructions (text) and the screen (image) and plans actions.
  • How it works:
    1. Fuse text and vision to spot buttons, menus, and relevant regions.
    2. Map instructions to actions (clicks, swipes, typing) over time.
    3. Reflect and adjust as screens change.
  • Why it matters: Without both text and vision, the agent can’t reliably operate apps.

šŸž Bottom Bread (Anchor): Reading ā€œopen messagesā€ and seeing the messages icon lets the model tap the right spot even if the icon moved.

šŸž Top Bread (Hook): Sometimes you use tiny tools (tweezers) and sometimes you hire a helper to do the whole chore.

🄬 Filling (The Actual Concept) — GUI-MCP (Dual-Layer Control & Privacy):

  • What it is: A standard protocol with two layers—low-level tools for atomic actions and a high-level ā€˜execute_task’ that delegates whole tasks to a local GUI specialist model, with options to keep sensitive images on-device.
  • How it works:
    1. Low-level MCP: click, swipe, type, screenshot.
    2. High-level MCP: describe the goal (ā€œBuy a cup of coffeeā€) and let a local model do it.
    3. High-privacy mode: only send text summaries to the cloud; keep images locally.
  • Why it matters: Without a standard and privacy options, deployment is clunky and risky.

šŸž Bottom Bread (Anchor): Your main LLM plans, your local helper taps through screens, and only short text like ā€œCart has 1 latte, $4.50ā€ leaves your device.

šŸž Top Bread (Hook): If you want to test a soccer player, don’t only check if they can juggle; see how they perform in a real match.

🄬 Filling (The Actual Concept) — AndroidDaily (Real-Life Benchmark):

  • What it is: A test based on common daily phone tasks (transport, shopping, social, entertainment, local services), with fast static checks and full end-to-end tasks.
  • How it works:
    1. Static: predict one correct action given screenshots (3,146 actions).
    2. End-to-end: complete full tasks in live environments (235 tasks).
    3. Measure success fairly, including pass@3 to reduce non-model issues.
  • Why it matters: Without realistic tests, we might build agents that ace toy problems but fail in daily life.

šŸž Bottom Bread (Anchor): Can the agent really book a ride, pick a seat at the movies, or favorite the right product—like a real user would?

Before vs. After:

  • Before: Expensive, subjective step labels; fragile skills; no standard device interface; weak real-life evaluations.
  • After: CSRS turns whole runs into trusted training signals cheaply; GUI-MCP standardizes and protects privacy; AndroidDaily validates everyday usefulness.

Why it works (intuition): Success at the end is the strongest truth signal for multi-step tasks. Anchor all learning to that, and use LLMs to richly explain the steps—then repeat in a loop. Add verifiable rewards for precision and a privacy-friendly deployment path. That combination steadily grows reliable skills.

Building blocks:

  • Mid-Train for broad multimodal + grounding skills.
  • Cold-Start to patch knowledge gaps via error-driven VQA.
  • Grounding-cleaning loop to fix noisy labels.
  • CSRS to convert rollouts into structured, high-quality data.
  • RLVR to sharpen precise execution with verifiable metrics.
  • GUI-MCP for efficient, private device control.
  • AndroidDaily to measure what really matters.

03 Methodology

At a high level: Instruction + Screenshot → Model Rollout (actions) → CSRS (verify + extract 7 data types) → Self-Evolving Training (Mid-Train → Cold-Start → RLVR) → Stronger Agent.

Step 1: Model Rollout (Let the agent try)

  • What happens: The current model sees the task (e.g., ā€œRestore last closed tab in Chromeā€) and the screen, then outputs an action like click(x,y) or type("text"). It repeats this step-by-step to attempt the whole task.
  • Why this step exists: You need real attempts to learn from—both the wins and the mistakes.
  • Example: ā€œRestore last closed tabā€ → the model clicks the three dots → clicks ā€œHistoryā€ or ā€œRecently closedā€ → selects the last tab.

šŸž Top Bread (Hook): Like filming your practice run before your coach gives feedback. 🄬 Filling (The Actual Concept) — Trajectory (what we record):

  • What it is: A trajectory is the sequence of screens and actions from start to finish.
  • How it works: Observe → Act → Observe → Act… until done.
  • Why it matters: Without the full sequence, you can’t understand what went wrong or right.

šŸž Bottom Bread (Anchor): The ā€˜movie’ of your app session—every click and screen.
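Here is a minimal observe-act loop as a sketch, assuming a hypothetical agent with a next_action method and an env that executes actions and returns the new screen; these interfaces are illustrative, not the paper's:

```python
def rollout(agent, env, task: str, max_steps: int = 30) -> list[tuple[str, str]]:
    # Record the full trajectory: (screenshot, action) pairs from start to finish.
    trajectory = []
    screenshot = env.reset(task)                      # initial screen
    for _ in range(max_steps):
        action = agent.next_action(task, screenshot)  # e.g. 'click(540, 1210)'
        trajectory.append((screenshot, action))
        if action == "COMPLETE":
            break
        screenshot = env.step(action)                 # execute and observe the new screen
    return trajectory
```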

Step 2: CSRS (Verify once; learn a lot)

  • What happens: A verifier or human checks if the whole attempt succeeded (binary). Then an LLM extracts seven kinds of supervision: progress tracking, state summary, effect prediction, self-reflection, state verification, intent execution, and action prediction.
  • Why this step exists: Final success is the strongest truth signal; anchoring all fine labels to it creates reliable training data at low cost.
  • Example data snippets:
    • Progress track: ā€œWe opened menu; next is History.ā€
    • State summary: ā€œTop-right menu visible; two tabs open.ā€
    • Effect prediction: ā€œAfter click, a dropdown should appear.ā€
    • Reflection: ā€œIf History not found, try ā€˜Recently closed’.ā€
    • Verification: ā€œThis icon likely means ā€˜settings’.ā€
    • Intent execution: ā€œRestore last closed tab.ā€
    • Action prediction: ā€œClick the three-dot menu at top-right.ā€

šŸž Top Bread (Hook): Judge the race by the finish line, then study the steps. 🄬 Filling (The Actual Concept) — CSRS Selective Learning:

  • What it is: Keep all seven data types for successful runs; for failures, keep only the knowledge pieces (not the wrong actions).
  • How it works: Success → full extraction; Failure → knowledge only (categories 1–6).
  • Why it matters: Prevents learning bad habits while still learning from mistakes.

šŸž Bottom Bread (Anchor): If you fell in the maze, you still learn the map, but don’t copy the wrong turn.
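In code form, the keep-rule might look like the following sketch; the type names are paraphrased from the list above and are not an official schema:

```python
# The six knowledge-style signals (categories 1-6) plus action prediction.
KNOWLEDGE_TYPES = [
    "progress_tracking", "state_summary", "effect_prediction",
    "self_reflection", "state_verification", "intent_execution",
]
ACTION_TYPE = "action_prediction"

def kept_types(trajectory_succeeded: bool) -> list[str]:
    # Success: keep all seven signal types.
    # Failure: keep knowledge only, so wrong actions never become imitation targets.
    if trajectory_succeeded:
        return KNOWLEDGE_TYPES + [ACTION_TYPE]
    return KNOWLEDGE_TYPES
```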

Step 3: Two Parallel Data Flows (Generation + Refinement)

  • What happens:
    • Generation flow: The model creates fresh rollouts; CSRS validates and extracts new high-quality knowledge and trajectories for the next round.
    • Refinement flow: Old data goes through self-distillation and rejection sampling to keep solid samples (to reinforce) and route tricky cases to Cold-Start (to fix weaknesses).
  • Why this step exists: You need both new experiences and smart curation of what you already have.
  • Example: A tough checkout sequence that often fails gets routed as knowledge to Cold-Start for targeted practice.

šŸž Top Bread (Hook): Like learning a sport: keep playing new matches and also study past game tapes. 🄬 Filling (The Actual Concept) — Self-Evolving Loop:

  • What it is: A closed loop where better models produce better data, which trains even better models.
  • How it works: Round n model → rollouts → CSRS → data → train → Round n+1 model.
  • Why it matters: Continuous improvement without needing tons of new human labels every time.

šŸž Bottom Bread (Anchor): From 30–40% success to 80%+ after several rounds.
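Put together, the loop is roughly the sketch below, reusing the verify and csrs_samples helpers from the earlier CSRS sketch; model.attempt and model.train_on are hypothetical stand-ins for rollout generation and training:

```python
def self_evolving_loop(model, tasks, rounds: int = 3):
    for _ in range(rounds):
        dataset = []
        for task in tasks:
            traj = model.attempt(task)               # generation flow: fresh rollout
            success = verify(traj)                   # one trajectory-level verdict
            dataset.extend(csrs_samples(traj, success))
        # Refinement flow (self-distillation, rejection sampling of older data)
        # would be merged into `dataset` here as well.
        model = model.train_on(dataset)              # round n+1 model
    return model
```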

Step 4: Three Training Stages (what the model practices)

  • Mid-Train: Broad multimodal understanding + grounding + action formatting. Data mix includes general vision-language, knowledge-rich captions, grounding, action alignment, trajectories, and environment-specific samples.
  • Cold-Start: Error-driven knowledge injection (turn failures into targeted VQA), curated trajectories, and some general multimodal/grounding to preserve breadth.
  • RLVR: Precision polishing via verifiable rewards (spatial accuracy, action validity) plus LLM-as-a-judge for reasoning quality.

šŸž Top Bread (Hook): First, learn fundamentals; next, fix weak spots; finally, sharpen with scorekeeping. 🄬 Filling (The Actual Concept) — RLVR Rewards (intuitive view):

  • What it is: A mix of dense spatial rewards (how close was the click/box), action-type correctness, typed content checks, and soft reasoning quality.
  • How it works:
    1. Spatial: clicks/boxes get higher reward when closer to targets.
    2. Actions: right kind of action (e.g., WAIT vs COMPLETE) gets credit.
    3. Content: typed text is checked; swipes compared by direction.
    4. Soft judge: an LLM scores intent alignment and clarity.
  • Why it matters: The model learns to be both accurate and reasonable.

šŸž Bottom Bread (Anchor): Like archery scoring rings for precision plus a coach’s note on your form.
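A sketch of how such reward components might be combined is shown below; the decay scale, the 0.5 weight on the judge score, and the exact-match checks are illustrative choices, not formulas from the report:

```python
import math

def spatial_reward(pred_xy, target_xy, scale: float = 100.0) -> float:
    # Denser reward the closer the predicted click lands to the target.
    return math.exp(-math.dist(pred_xy, target_xy) / scale)

def action_type_reward(pred_type: str, gold_type: str) -> float:
    return 1.0 if pred_type == gold_type else 0.0

def text_reward(pred_text: str, gold_text: str) -> float:
    return 1.0 if pred_text.strip() == gold_text.strip() else 0.0

def total_reward(pred: dict, gold: dict, judge_score: float) -> float:
    # judge_score: soft 0-1 rating of reasoning quality from an LLM judge.
    r = action_type_reward(pred["type"], gold["type"])
    if pred["type"] == gold["type"] == "click":
        r += spatial_reward(pred["xy"], gold["xy"])
    elif pred["type"] == gold["type"] == "type":
        r += text_reward(pred["text"], gold["text"])
    return r + 0.5 * judge_score

print(total_reward({"type": "click", "xy": (100, 200)},
                   {"type": "click", "xy": (110, 205)}, judge_score=0.8))
```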

Step 5: Grounding-Cleaning Pipeline (make perception robust)

  • What happens: Train on raw grounding data, label pass-rates with complexity tags, do curriculum learning (easy to hard), exclude noisy cases early, then rewrite and reintroduce hard cases later with richer annotations.
  • Why this step exists: Grounding labels can be noisy; cleaning them improves reliability.
  • Example: Icon looks like a gear? Learn its function (ā€œsettingsā€), not just its pixels.
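One way to sketch the curriculum step, with pass-rate thresholds that are illustrative placeholders rather than values from the report:

```python
def bucket_by_pass_rate(samples: list[dict]) -> dict[str, list[dict]]:
    # Each sample carries a 'pass_rate' from probing the current model on it.
    buckets = {"easy": [], "hard": [], "exclude_then_rewrite": []}
    for s in samples:
        if s["pass_rate"] >= 0.7:
            buckets["easy"].append(s)                   # train on these first
        elif s["pass_rate"] >= 0.2:
            buckets["hard"].append(s)                   # introduce later in the curriculum
        else:
            buckets["exclude_then_rewrite"].append(s)   # drop early; rewrite with richer annotations, then reintroduce
    return buckets
```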

Step 6: GUI-MCP for Deployment (control devices safely and efficiently)

  • Low-level MCP: atomic ops (get_screenshot, click, swipe, type…)
  • High-level MCP: execute_task("Buy coffee") delegates to local GUI model.
  • High-privacy mode: keep screenshots local; cloud only sees semantic summaries.

šŸž Top Bread (Hook): Use the right tool for the job, and don’t leak private photos. 🄬 Filling (The Actual Concept) — High Privacy Mode:

  • What it is: A setting where raw images never leave your device.
  • How it works: Local GUI model processes screenshots; only short text summaries go to the cloud LLM.
  • Why it matters: Protects sensitive data while keeping powerful reasoning.

šŸž Bottom Bread (Anchor): ā€œCart: 2 items, $18.90, shipping todayā€ goes to the cloud; the screenshots stay on your phone.
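A minimal sketch of the routing, assuming hypothetical describe and plan methods on the local and cloud models:

```python
def high_privacy_step(local_gui_model, cloud_llm, screenshot: bytes, goal: str) -> str:
    # The local model is the only component that ever sees raw pixels.
    summary = local_gui_model.describe(screenshot)   # e.g. "Cart: 2 items, $18.90, shipping today"
    # Only the short text summary crosses the network to the cloud planner.
    return cloud_llm.plan(goal=goal, observation=summary)
```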

The Secret Sauce (what makes it clever):

  • Trajectory-level calibration anchors all fine-grained learning to a reliable end verdict.
  • Selective learning keeps knowledge from failures but blocks bad actions.
  • LLM-extracted chains of thought provide rich, consistent supervision at low cost.
  • The dual-flow data engine balances exploration (new data) and refinement (smart curation).
  • RLVR adds precise, verifiable rewards to polish execution.
  • GUI-MCP standardizes control and unlocks privacy-preserving deployment.

Result: A compact, efficient agent that steadily improves and works across many apps and platforms.

04 Experiments & Results

The Test (what was measured and why)

  • Grounding: Can the model find the right UI regions and elements on screens (e.g., ScreenSpot-Pro, OSWorld-G, VisualWebBench)? This matters because if you can’t see the right button, you can’t click it.
  • End-to-End: Can the agent complete real tasks in live environments (OSWorld-Verified on desktop; AndroidWorld on mobile)? This shows real interactive skill.
  • Daily-Life Utility: AndroidDaily (Static for quick one-step checks; End-to-End for full tasks) based on common apps people actually use.
  • General Skills: A spread of multimodal benchmarks to ensure the model didn’t become a narrow specialist.

The Competition (who it faced)

  • Strong closed-source agents (e.g., Claude 4.5 Sonnet, OpenAI CUA o3, Gemini-2.5 Computer Use).
  • Specialized GUI agents (UI-TARS series, GUI-Owl, SeedVL, MobileRL, Mobile-Agent-v3).
  • Foundation multimodal backbones (Qwen3-VL instruction-tuned variants, etc.).

The Scoreboard (with context)

  • Grounding:
    • ScreenSpot-Pro: Step-GUI-8B hits 62.6 (best), beating many larger models; Step-GUI-4B is close at 60.0.
    • OSWorld-G: Step-GUI-8B at 70.0, a big jump over strong baselines (e.g., +9 vs Qwen3-30B-A3B at 61.0).
    • MMBench-GUI-L2: Step-GUI-8B tops at 85.6; 4B also strong at 84.0.
    • VisualWebBench: 4B leads at 90.7, surpassing SeedVL-1.5 by +3.4.
    Interpretation: Even the 4B model competes with or beats much larger models—like a nimble runner outpacing heavier sprinters.

  • End-to-End Environments (pass@3):
    • AndroidWorld: Step-GUI-8B reaches 80.2% (state-of-the-art, tying MobileRL-9B), while 4B gets 75.8% (second-best). This is like getting consistent A’s when others are mostly at B-level.
    • OSWorld-Verified: Step-GUI-8B gets 48.5% (second only to Claude-4.5 Sonnet), and far above OpenAI CUA o3 (23.0). Given OSWorld’s instability (VM crashes, CAPTCHAs), pass@3 fairly reduces noise from non-model issues.

šŸž Top Bread (Hook): If a test machine sometimes freezes, you wouldn’t fail the student for that. 🄬 Filling (The Actual Concept) — pass@3:

  • What it is: Allow up to three attempts per task; count success if any pass.

  • How it works: Repeat up to 3 runs; if one finishes, score success.

  • Why it matters: Reduces unfair penalties from random environment glitches. šŸž Bottom Bread (Anchor): Like letting a goalie face three penalty kicks instead of just one to get a fairer measure.
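A few lines of scoring code make the metric concrete (a sketch; the benchmarks implement their own bookkeeping):

```python
def pass_at_3(attempts_per_task: list[list[bool]]) -> float:
    # A task counts as solved if any of its (up to) three attempts succeeded.
    solved = sum(any(attempts[:3]) for attempts in attempts_per_task)
    return solved / len(attempts_per_task)

# Three tasks, three attempts each -> 2/3 solved.
print(pass_at_3([[False, True, False], [False, False, False], [True, True, True]]))
```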

  • AndroidDaily (real-life focus):
    • Static: Step-GUI-8B averages 89.91% (SOTA), 4B at 87.02%, well above UI-TARS-1.5 (67.69%) and general LMMs (e.g., GPT-4o at 17.73%). Especially strong on CLICK/TYPE and complex actions like INFO and WAIT.
    • End-to-End: Step-GUI-8B at 52.50%, close to UI-TARS-1.5 (56.64%), and ahead of 4B (49.06%). By task types, Query is easiest (~64–66%), Analyze hardest (~33–37%). Composite multi-step tasks remain challenging for all models (often 14–20%). Notably, Step-GUI-8B excels on high-ambiguity instructions (61.54%), suggesting robust instruction interpretation.

Surprising/Notable Findings

  • Small can be mighty: The 4B model performs impressively, often rivaling bigger models and topping VisualWebBench.
  • Phase jump: On AndroidWorld, performance leaps between rounds as the CSRS loop starts harvesting many successful long-horizon trajectories.
  • General skills preserved: Step-GUI-8B remains strong on broader multimodal tests (e.g., V*, OCRBench, MathVista), proving it’s not a one-trick pony.
  • Static vs End-to-End gap: Even with top static scores, end-to-end tasks show remaining challenges in planning over many steps, handling delays, and coping with app quirks.

Big Picture: With trajectory-level calibration, selective learning, and verifiable rewards, the agent improves stably across rounds—like a student who gets reliable grades, keeps the right notes, and steadily climbs from B to A performance.

05 Discussion & Limitations

Limitations (honest look)

  • Real apps vary a lot: Layouts change, A/B tests move buttons, and CAPTCHAs or logins can block progress. The agent still struggles on complex composite tasks and nuanced edge cases.
  • Environment instability: VM or emulator crashes and slow loads create friction; pass@3 helps but doesn’t remove all noise.
  • Data bias: Even with AndroidDaily’s real-life focus, coverage isn’t everything—some regions, languages, or niche apps are less represented.
  • Reasoning gaps: Long-horizon plans with contingencies (loops, conditionals) remain tough; the model sometimes hesitates or over-commits.

Required Resources (to use this system)

  • Models: Step-GUI-4B (local-capable) or 8B (stronger), built on a Qwen3-VL backbone.
  • Compute: GPU(s) for training or fine-tuning; consumer hardware can run 4B locally.
  • Tooling: The GUI-MCP server and device connectors (Android emulators, desktop VMs) plus verifiers/harnesses.
  • Data engine: CSRS to validate and extract training signals; optional human-in-the-loop for tricky verifications.

When NOT to Use

  • High-stakes, zero-error contexts (e.g., financial wire transfers) without strict guardrails and human oversight.
  • Highly dynamic or blocked apps (e.g., frequent CAPTCHA walls, paywalls, or strict TOS) where automation is unreliable or inappropriate.
  • Ultra-latency-critical flows on weak hardware if local inference is too slow and privacy mode forbids cloud help.

Open Questions

  • Stronger verification: Can we scale automated verifiers that check success without human review, even for fuzzy goals?
  • Robust planning: How can we better handle multi-branch tasks, recover from dead-ends, and respect app-specific conventions?
  • Safety & privacy: Can we guarantee no sensitive tokens or images ever leave the device under all failure modes?
  • Fairness & coverage: How do we extend to more languages, accessibility modes, and regional UI patterns without exploding data costs?
  • Continual learning: What’s the best schedule for mixing exploration (new rollouts) and refinement (curated hard cases) over time for stable gains?

06 Conclusion & Future Work

Three-Sentence Summary: This work introduces CSRS, a way to turn full task runs into reliable training signals at low cost, enabling steady self-improvement of GUI agents. It delivers Step-GUI models (4B/8B) that reach state-of-the-art on key mobile/desktop benchmarks while preserving general multimodal skills. For deployment and evaluation, it adds GUI-MCP (a dual-layer, privacy-friendly control protocol) and AndroidDaily (a realistic daily-life benchmark).

Main Achievement: Anchoring rich, step-level learning to trajectory-level verification—so the model learns a lot from each run, safely and cheaply—unlocking a self-evolving loop that scales GUI competence.

Future Directions: Automate more verifiers, expand AndroidDaily’s coverage and languages, improve long-horizon planning with better recovery strategies, strengthen privacy guarantees, and broaden GUI-MCP integrations across ecosystems. Also refine RLVR signals (e.g., better semantic checks and robustness to UI shifts) and keep pushing efficient local models.

Why Remember This: It connects the three missing links—better data (CSRS), safer deployment (GUI-MCP), and real-life testing (AndroidDaily)—to move GUI agents from cool demos to everyday helpers that can actually open apps, press the right buttons, and finish your tasks without leaking your screens.

Practical Applications

  • Personal phone assistant that books rides, orders food, and pays—while keeping screenshots on your device.
  • Desktop workflow automation for emails, file management, and calendar tasks using visual understanding.
  • Enterprise robotic process automation (RPA) that handles legacy UIs lacking APIs with higher reliability.
  • Accessibility helpers that navigate complex interfaces on behalf of users with motor or visual impairments.
  • Customer support agents that reproduce user issues by visually operating apps in test environments.
  • QA testing bots that follow visual test scripts, detect UI regressions, and auto-file detailed bug reports.
  • E-commerce assistants that compare products, apply coupons, and complete checkout across different stores.
  • Education/training simulators that teach app usage step-by-step with safe, verifiable practice runs.
  • On-device privacy mode copilots for handling sensitive tasks (e.g., banking, health portals) without image upload.
  • Data flywheel creation for new domains: auto-generate, verify, and learn from trajectories to bootstrap specialized agents.
Tags: GUI automation Ā· multimodal large language models Ā· trajectory-level calibration Ā· reinforcement learning with verifiable rewards Ā· CSRS Ā· GUI-MCP Ā· AndroidDaily benchmark Ā· grounding Ā· privacy-preserving AI Ā· self-evolving training Ā· chain of thought Ā· pass@3 Ā· on-device inference Ā· atomic actions Ā· high-level task delegation