
OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution

Intermediate
Le Zhang, Yixiong Xiao, Xinjiang Lu et al. · 1/28/2026
arXiv · PDF

Key Summary

  • OmegaUse is a new AI that can use phones and computers by looking at screenshots and deciding where to click, type, or scroll—much like a careful human user.
  • It gets great at this by using super-clean training data built with a careful pipeline and by training in two stages: first copying strong examples (SFT) and then improving with feedback (GRPO).
  • OmegaUse uses a Mixture-of-Experts (MoE) brain, which keeps thinking power high while only turning on the expert parts it needs, saving compute.
  • The team builds navigation data in three ways: cleaning open data, bottom-up app exploration, and top-down tasks guided by a smart checklist (taxonomy), plus expert demos.
  • They also made OS-Nav, a pair of tough tests for Chinese Android (ChiM-Nav) and Ubuntu desktop (Ubu-Nav), to check real-world skills across systems.
  • OmegaUse hits state-of-the-art results on key tests: 96.3% on ScreenSpot-V2 (grounding), 79.1% step success on AndroidControl, 74.24% on ChiM-Nav, and 55.9% on Ubu-Nav.
  • Decoupling the model into a precise “grounder” and a smart “navigator,” then training each with tailored rewards, reduces confusion and boosts accuracy.
  • Special rewards (like “inside-the-box” for clicks) teach the model to pick exact spots, not just “close enough” pixels.
  • Compared to giant dense models, OmegaUse competes strongly while being more efficient, thanks to MoE and data quality.
  • This matters because better GUI agents can automate everyday digital chores safely and reliably across apps and operating systems.

Why This Research Matters

OmegaUse moves us closer to computers that can truly help with everyday digital chores across phones and desktops. It saves time by reliably handling repetitive tasks like settings changes, form filling, and file organization. Because it learns from carefully cleaned data and uses safety-minded rewards, it’s more precise and less likely to click the wrong thing. Its efficiency-friendly Mixture-of-Experts design means strong performance without always requiring giant, expensive models. The cross-OS benchmarks ensure it works not only in English or one platform, but also in Chinese Android and Ubuntu desktops. This broader reliability can support accessibility, reduce tech support friction, and let people focus on creative or human-centered work.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: You know how you learn to use a new app by looking, tapping, and fixing mistakes as you go? Computers want to do that too—see the screen, understand it, and act.

🥬 The Concept — GUI Agent: A GUI agent is an AI that looks at screenshots and decides actions like click, type, or scroll to finish tasks. How it works:

  1. See the screen (a picture).
  2. Read the instruction (the goal).
  3. Think of the next step.
  4. Do one action (click/type/scroll).
  5. Repeat until done.

Why it matters: Without this, AI can’t actually “use” apps and computers for us—it only talks about them.

🍞 Anchor: “Open Settings, turn off Bluetooth” becomes: see Bluetooth toggle, tap it, confirm it’s off.
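To make the loop concrete, here is a minimal, runnable Python sketch of that see-think-act cycle. The screen capture, policy, and state update are toy stand-ins, not the paper's code.

```python
# A toy see-think-act loop. Everything here is a stand-in for illustration.

def capture_screen(state):
    return f"screenshot of {state}"          # stand-in for a real screenshot

def toy_policy(instruction, screenshot, history):
    # Pretend the model taps the Bluetooth toggle once, then finishes.
    return ("finish", None) if history else ("click", "Bluetooth toggle")

def run_agent(instruction, max_steps=10):
    state, history = "Settings", []
    for _ in range(max_steps):
        shot = capture_screen(state)                           # 1. see
        kind, target = toy_policy(instruction, shot, history)  # 2-3. think
        if kind == "finish":                                   # 5. stop
            break
        state = f"{state} -> {target}"                         # 4. act
        history.append((kind, target))
    return history

print(run_agent("Open Settings, turn off Bluetooth"))
# [('click', 'Bluetooth toggle')]
```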

Before this paper: Early GUI agents either split everything into many parts (planner here, clicker there) or tried to learn everything at once. They often broke because:

  • Their training data was noisy (wrong boxes around buttons or vague labels).
  • They planned poorly over many steps (too many extra clicks, got lost).
  • Benchmarks didn’t cover real diversity (different languages, mobile vs. desktop).

🍞 Hook: Imagine doing a jigsaw puzzle with blurry pieces—your picture won’t look right.

🥬 The Concept — UI Grounding (finding the right spot): UI grounding means matching words like “Issues” or “Settings” to the exact place on the screen. How it works:

  1. Read the instruction.
  2. Scan the screen for matching text/icons.
  3. Pick the exact region to click.

Why it matters: If grounding is wrong, every next step fails, like pressing the wrong button every time.

🍞 Anchor: If the task says “Click ‘Submit’,” grounding finds the actual ‘Submit’ button, not a random gray box.

🍞 Hook: Think of learning to bake cookies from a messy recipe—too much guessing, too many burnt cookies.

🥬 The Concept — Data Quality Pipeline: A data quality pipeline is a careful process that removes wrong labels and writes clearer instructions so the model learns from good examples. How it works:

  1. Gather lots of data from many apps/sites.
  2. Filter out broken, blurry, or misleading examples.
  3. Have humans in the loop fix boxes and rewrite unclear prompts.
  4. Keep only the reliable, diverse set.

Why it matters: Bad data teaches bad habits; good data builds accurate skills.

🍞 Anchor: If the label says “Click Play” but the box points to “Pause,” the agent learns confusion; fix it, and the agent learns the right target.
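As a toy illustration of the filtering step, the sketch below drops vague prompts and boxes that don't sit inside the screenshot. The field names and rules are assumptions for illustration, not the paper's actual pipeline.

```python
# Toy data-quality filter: drop vague prompts and malformed boxes.

VAGUE = {"click here", "this", "that button"}  # assumed blocklist

def keep(example, width, height):
    x1, y1, x2, y2 = example["box"]
    well_formed = 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height
    clear_prompt = example["prompt"].lower() not in VAGUE
    return well_formed and clear_prompt

data = [
    {"prompt": "Click 'Submit'", "box": (40, 500, 140, 540)},
    {"prompt": "click here",     "box": (40, 500, 140, 540)},  # vague -> drop
    {"prompt": "Open Settings",  "box": (900, -5, 980, 30)},   # bad box -> drop
]
clean = [ex for ex in data if keep(ex, width=1080, height=1920)]
print(len(clean))  # 1
```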

🍞 Hook: You know how different sports need different tests—sprinting isn’t the same as swimming.

🥬 The Concept — Benchmark: A benchmark is a standard test set to measure how well an agent performs on tasks. How it works:

  1. Define tasks (e.g., “Open Wi-Fi and connect”).
  2. Provide gold answers/steps (verified by experts).
  3. Run agents and measure success fairly.

Why it matters: Without good tests, we don’t know if the agent truly works across apps and systems.

🍞 Anchor: OS-Nav (ChiM-Nav for Chinese Android, Ubu-Nav for Ubuntu desktop) checks if an agent can handle real, multi-step tasks.

The problem: Agents failed on tricky UIs, mismatched boxes, and long, multi-step plans. They needed cleaner data, smarter training that separates “seeing exactly” from “planning wisely,” and better tests across devices.

The gap: A unified, efficient agent with a high-quality data pipeline, a decoupled training plan (learn basics first, then improve with feedback), and strong cross-OS evaluation didn’t exist.

Real stakes: This impacts daily life—automating form filling, scheduling, software setup, and support; helping accessibility; and saving time at work and home. If agents click the wrong spot, waste steps, or fail on non-English apps, they’re not helpful. This paper targets all three: data, training, and testing, to make a reliable “do-it-for-me” computer helper.

02 Core Idea

🍞 Hook: Imagine a relay team where one runner specializes in quick starts and another in long-distance pace—they run better when each does what they do best.

🥬 The Concept — Decoupled Two-Stage Agent: OmegaUse splits the job into two specialists (a precise Grounder and a smart Navigator) and trains them in two phases: learn from examples (SFT) and then polish with feedback (GRPO). How it works:

  1. Build clean, diverse data.
  2. Train with SFT to learn correct formats and basic logic.
  3. Improve with GRPO rewards that score precision and step quality.
  4. Use a Mixture-of-Experts brain to stay powerful but efficient.

Why it matters: Mixing “see exactly” and “plan wisely” in one go can cause confusion; splitting them boosts both accuracy and reliability.

🍞 Anchor: The Grounder nails “where to click,” the Navigator nails “what to do next,” and GRPO rewards reinforce both.

Aha! in one sentence: Train a precise clicker and a thoughtful planner separately on ultra-clean, richly built data, then tune them with smart rewards—all inside an efficient Mixture-of-Experts model.

Three analogies:

  • Orchestra: Strings (grounding) handle fine notes; conductor (navigation) guides the music; MoE brings in the right section at the right time.
  • GPS Trip: Camera sees road signs (grounding), navigator plans turns (navigation), and MoE is like lane assist switching features only when needed.
  • Cooking Show: Prep chef (grounding) picks exact ingredients; head chef (navigation) sequences steps; the producer (MoE) calls which expert to feature.

🍞 Hook: Think of one-size-fits-all shoes—they rarely fit perfectly.

🥬 The Concept — Mixture-of-Experts (MoE): MoE is a model that picks a few specialized “experts” for each situation instead of using the whole big brain every time. How it works:

  1. Many experts are available.
  2. A gate chooses which small subset to activate for the current input.
  3. Only those experts compute, saving time and energy.

Why it matters: You get large-model smarts without always paying large-model costs.

🍞 Anchor: Reading a tiny icon? Call the “small object” expert. Planning a 7-step desktop workflow? Call the “sequential reasoning” expert.
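Here is a minimal numpy sketch of top-k gating, the core MoE trick: a gate scores all experts, but only the best k actually compute, and their outputs are mixed by the gate weights. The sizes, expert count, and gating scheme are illustrative, not OmegaUse's actual configuration.

```python
# Minimal top-k MoE gating sketch (illustrative shapes, toy experts).
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2
W_gate = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def moe_layer(x):
    logits = x @ W_gate
    top = np.argsort(logits)[-k:]              # pick the k highest-scoring experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                         # softmax over the chosen k only
    # Only the selected experts run; the rest stay idle (the compute saving).
    return sum(w * (x @ experts[i]) for w, i in zip(gate, top))

y = moe_layer(rng.normal(size=d))
print(y.shape)  # (16,)
```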

Before vs. After:

  • Before: Noisy data + tangled training = missed clicks and messy plans.
  • After: Clean data + split training + GRPO rewards + MoE = sharper clicks, steadier plans, faster inference.

Why it works (intuition):

  • Clean data gives the model truthful patterns.
  • SFT teaches format and fundamentals so answers are well-formed.
  • GRPO doesn’t just say “right/wrong”; it compares rollouts in a group so the model steadily prefers better ones.
  • Tailored rewards (like “inside-the-box” for clicks) aim the model at exact interaction regions.
  • MoE activates only needed skills, avoiding overload and preserving reasoning capacity.

Building blocks:

  • High-quality Grounding Set (manual fixes, strict filtering).
  • Grounder trained with SFT + GRPO using format and inside-the-box rewards.
  • Unified Action Space so desktop, mobile, and web share a common “verb” set.
  • Hierarchical Navigation Data Pipeline: clean open data, bottom-up exploration graphs, top-down taxonomy tasks, expert demos.
  • Navigator trained with SFT + GRPO using multi-part action rewards (type, coordinates, scroll direction, typing content, hotkeys).
  • OS-Nav benchmarks to test cross-OS skills with expert-verified steps.

🍞 Anchor: Like teaching a kid: first show neat examples (SFT), then play practice games with scores (GRPO), and switch coaches depending on the skill needed (MoE).

03 Methodology

At a high level: Instruction + Screenshot + History → Ground precisely (OmegaUse-G) → Plan next step (OmegaUse Navigator) → Output action (click/type/scroll) → Repeat.

🍞 Hook: Imagine building a robot that needs eagle eyes and a wise brain.

🥬 The Concept — Supervised Fine-Tuning (SFT): SFT teaches the model by showing correct examples and having it imitate. How it works:

  1. Feed instruction, screen, and perfect answer.
  2. The model learns to match the correct format and action.
  3. Repeat across many cases to form good habits.

Why it matters: Without SFT, the model may output messy, unparseable answers.

🍞 Anchor: Learning to always answer clicks as box=(x, y) instead of rambling text.
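A tiny worked example of the imitation objective, assuming the standard token-level cross-entropy loss on the gold action string; the tokens and probabilities below are invented for illustration.

```python
# Toy SFT loss: the mean negative log-probability the model assigns to the
# expert's action tokens. Lower loss = closer imitation.
import math

target = ["Click(", "box=(", "120,", "340)", ")"]   # assumed tokenization
p_gold = [0.9, 0.8, 0.6, 0.7, 0.95]                 # invented probabilities

sft_loss = -sum(math.log(p) for p in p_gold) / len(p_gold)
print(f"mean cross-entropy: {sft_loss:.3f}")
```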

🍞 Hook: Think of practice scrimmages where your coach scores each play, not just the game result.

🥬 The Concept — Group Relative Policy Optimization (GRPO): GRPO improves decisions by sampling several candidate answers for the same prompt and nudging the model toward the better ones based on relative rewards. How it works:

  1. For a task, generate multiple rollouts.
  2. Score each with tailored rewards (format + precision + logic).
  3. Compare within the group; reward above-average, downweight below-average.
  4. Gently keep the model near its reference policy to stay stable.

Why it matters: It stabilizes learning and avoids needing a separate critic model, making RL more efficient.

🍞 Anchor: If five attempts try to click a small toggle, the one whose point lands inside the box wins more points and trains the model toward that style.
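The group-relative part fits in a few lines: normalize rewards within the group of rollouts so above-average attempts get positive advantage and below-average ones get negative advantage. A minimal sketch with invented scores:

```python
# GRPO's group-relative advantage: compare rollouts of the SAME task.
import statistics

rewards = [0.2, 1.0, 0.0, 0.8, 0.2]      # scores for 5 rollouts of one task
mean = statistics.mean(rewards)
std = statistics.pstdev(rewards) or 1.0  # guard against a zero-spread group

advantages = [(r - mean) / std for r in rewards]
print([round(a, 2) for a in advantages])
# Positive advantage -> reinforce that rollout; negative -> downweight it.
# (The full objective also keeps the policy near its reference, per step 4.)
```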

Step-by-step recipe:

  1. Grounding Data Pipeline (seeing exactly):
  • Aggregate six grounding datasets (mobile/web/desktop).
  • Filter harshly: drop ambiguous prompts, fix misaligned boxes, remove blurry images.
  • Keep a high-fidelity 111K set so supervision is precise. Example: If the “Settings” box is 10 pixels off, humans fix it so the model learns the true region.
  2. Grounding Training:
  • SFT Phase: Teach the model to output coordinates cleanly and consistently.
  • GRPO Phase: Add two rewards (a toy version is sketched below):
    • Format Reward: gives points only if the output follows the required syntax.
    • Inside-the-Box Reward: gives points if the predicted point lands strictly inside the target box.
  • Why both: Without the format reward, outputs become unparseable; without inside-the-box, the model clicks near edges or off-target. Example: The task says “Click ‘Issues0’.” The best rollout outputs a valid format with a point inside the “Issues0” region.
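Here is a toy version of those two rewards, assuming a Click(box=(x, y)) output syntax; the exact template is an assumption, not the paper's specification.

```python
# Toy grounding rewards: a format check plus a strict inside-the-box test.
import re

PATTERN = re.compile(r"^Click\(box=\((\d+),\s*(\d+)\)\)$")  # assumed syntax

def format_reward(output):
    return 1.0 if PATTERN.match(output) else 0.0

def inside_box_reward(output, box):
    m = PATTERN.match(output)
    if not m:
        return 0.0
    x, y = int(m.group(1)), int(m.group(2))
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

target = (40, 500, 140, 540)                              # gold box for 'Submit'
print(format_reward("Click(box=(90, 520))"))              # 1.0: parseable
print(inside_box_reward("Click(box=(90, 520))", target))  # 1.0: inside
print(inside_box_reward("Click(box=(10, 520))", target))  # 0.0: near-miss
```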

🍞 Hook: Using one set of verbs for phone and PC keeps conversations simple.

🥬 The Concept — Unified Action Space: A shared set of actions (Click, Drag, Scroll, Type, Wait, Finish) plus platform add-ons (Hotkey on desktop, Back/Home on mobile). How it works:

  1. Define core actions that exist everywhere.

  2. Add extra actions only where needed (e.g., LongPress for mobile).

  3. Train the model to produce these actions in a strict template.

Why it matters: Consistency boosts cross-platform generalization and simplifies learning.

🍞 Anchor: “Click(box=(x,y))” means the same thing on Android and Ubuntu.
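One possible encoding of such an action space in Python; the schema and field names follow the list above but are illustrative, not the paper's definition.

```python
# Sketch of a unified action space: shared core verbs plus platform extras.
from dataclasses import dataclass
from typing import Optional, Tuple

CORE = {"click", "drag", "scroll", "type", "wait", "finish"}
EXTRAS = {"desktop": {"hotkey"}, "mobile": {"back", "home", "long_press"}}

@dataclass
class Action:
    kind: str
    point: Optional[Tuple[int, int]] = None  # for click/drag/long_press
    text: Optional[str] = None               # for type/hotkey

def is_valid(action, platform):
    return action.kind in CORE | EXTRAS.get(platform, set())

click = Action(kind="click", point=(120, 340))
print(is_valid(click, "mobile"), is_valid(click, "desktop"))  # True True
# The same "click" verb is valid on Android and Ubuntu alike.
```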

  3. Hierarchical Navigation Data Pipeline (planning wisely):

  • Open-source Curation: Filter out short or looping trajectories; use an auditor model to verify task completion.
  • Automated Synthesis:
    • Bottom-up Exploration (DFS): Explore apps via accessibility trees and collect <pre-state, action, post-state> triples. Build a state transition graph, merging near-duplicate screens with semantic clustering. Extract diverse, loop-free paths and enrich them with natural-language goals and step reasons.
    • Top-down Taxonomy Tasks: Create a hierarchical task list per platform (e.g., System Ops → Network → Wi-Fi). Generate task descriptions, run an expert agent in simulators, and keep only successful, human-verified trajectories.
  • Expert Demonstrations: Human pros execute 5+ step tasks; two independent audits ensure logical and goal correctness. Example: “Record an audio clip,” “Reply to a message,” “Change display brightness,” each with verified step-by-step actions.

🍞 Hook: Wandering a new city bottom-up (explore streets) vs. following a tour plan top-down (checklist of sights).

🥬 The Concept — Bottom-up Exploration: Automatically discovering screens and actions by exploring apps. How it works:

  1. Start from a screen and try valid actions.
  2. Record transitions and screenshots.
  3. Build a graph; remove duplicates; extract paths.

Why it matters: Finds real flows that developers didn’t handwrite.

🍞 Anchor: The agent discovers “Settings → Bluetooth → Toggle” by itself.
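A runnable toy version of this exploration over an invented three-screen app: it records transition triples and extracts loop-free action paths, in the spirit of the DFS described above.

```python
# Toy bottom-up exploration over a tiny, made-up app state graph.

APP = {  # state -> {action: next_state}
    "Home":      {"open Settings": "Settings"},
    "Settings":  {"open Bluetooth": "Bluetooth", "back": "Home"},
    "Bluetooth": {"toggle": "Bluetooth", "back": "Settings"},
}

def explore(state, path=(), seen=None):
    seen = (seen or set()) | {state}
    triples, paths = [], [path] if path else []
    for action, nxt in APP.get(state, {}).items():
        triples.append((state, action, nxt))  # <pre-state, action, post-state>
        if nxt not in seen:                   # skip loops and revisited screens
            t, p = explore(nxt, path + (action,), seen)
            triples += t
            paths += p
    return triples, paths

triples, paths = explore("Home")
print(paths)
# [('open Settings',), ('open Settings', 'open Bluetooth')]
```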

🥬 The Concept — Top-down Taxonomy-guided Generation: Generating tasks from an expert-designed checklist to cover must-have skills. How it works:

  1. Define domains and subskills.

  2. Create tasks per subskill.

  3. Run them; keep only verified successes.

Why it matters: Ensures coverage of important, real-world actions that exploration might miss.

🍞 Anchor: “Browser & Web → Tab Management → Duplicate a tab” becomes a concrete, trained trajectory.
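A minimal sketch of walking a taxonomy to emit concrete tasks; the skill tree below is a made-up fragment in the spirit of the examples above.

```python
# Toy taxonomy walk: one concrete task per leaf skill.

TAXONOMY = {
    "System Ops":    {"Network": ["Connect to Wi-Fi", "Toggle airplane mode"]},
    "Browser & Web": {"Tab Management": ["Duplicate a tab", "Pin a tab"]},
}

def generate_tasks(taxonomy):
    for domain, subskills in taxonomy.items():
        for subskill, leaves in subskills.items():
            for leaf in leaves:
                yield f"[{domain} > {subskill}] {leaf}"

for task in generate_tasks(TAXONOMY):
    print(task)
# Each task is then run by an expert agent in a simulator and kept only if it
# succeeds and passes human verification.
```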

  4. Navigation Training (two-stage):

  • SFT: Teach O → T → A format: Observation (state), Thought (reason), Action (executable code).
  • GRPO with action-wise rewards (a toy version is sketched below):
    • Format: Must follow the template.
    • Type: Correct primitive (Click vs. Scroll).
    • Coordinates: Closer is better; perfect is best.
    • Scroll: Right direction plus accurate region.
    • Typing content: Match the typed text’s tokens well.
    • Hotkey: Exact key combo.
  • Why it matters: Without these, the agent might type when it should click, or scroll the wrong way. Example: “Open a new browser tab, search ‘weather’, copy result title”—checks hotkeys, typing, clicks, and scrolls together.
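To illustrate, here is a toy action-wise reward that picks a scoring rule by action type; the distance scale and exact rules are invented, not the paper's values.

```python
# Toy action-wise reward: 0 for the wrong primitive, otherwise a rule
# matched to the action type (all thresholds here are assumptions).

def action_reward(pred, gold, scale=100.0):
    if pred["kind"] != gold["kind"]:
        return 0.0                                 # wrong primitive
    if pred["kind"] == "click":                    # closer point = better
        dx, dy = pred["x"] - gold["x"], pred["y"] - gold["y"]
        return max(0.0, 1.0 - (dx * dx + dy * dy) ** 0.5 / scale)
    if pred["kind"] == "scroll":                   # direction must match
        return 1.0 if pred["direction"] == gold["direction"] else 0.0
    if pred["kind"] == "type":                     # token overlap of content
        p, g = set(pred["text"].split()), set(gold["text"].split())
        return len(p & g) / max(len(g), 1)
    if pred["kind"] == "hotkey":                   # exact combo or nothing
        return 1.0 if pred["keys"] == gold["keys"] else 0.0
    return 1.0                                     # wait/finish: type match

print(action_reward({"kind": "click", "x": 95, "y": 515},
                    {"kind": "click", "x": 90, "y": 520}))  # ~0.93
```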

Secret sauce:

  • Decoupling reduces interference: precision vision and long-horizon planning don’t fight.
  • Tailored rewards target what truly matters (exact region, correct action type, right content).
  • MoE delivers big-model reasoning with smaller active compute.
  • Data pipeline blends real, discovered, and designed tasks to cover both common and rare cases.

04 Experiments & Results

🍞 Hook: If two students take different tests, you can’t compare them. But if they take the same tests, you can rank fairly.

🥬 The Concept — Benchmarking with Context: The team measured grounding (pointing to the right spot) and navigation (taking the right steps) across multiple public and new benchmarks. How it works:

  1. Choose strong baselines (prior top models).
  2. Use shared metrics (accuracy, step success rate).
  3. Compare apples-to-apples across platforms.

Why it matters: Numbers only make sense when we know what they mean vs. others.

🍞 Anchor: 96.3% on ScreenSpot-V2 is like getting an A+ when most top students were already getting As.
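As a small worked example of the metrics reported below, here is one way action-type accuracy and step success rate can be computed over a handful of steps; the data and the exact success criterion are invented for illustration.

```python
# Toy navigation metrics: type accuracy vs. step success rate.
# A step counts as a success only if the action type AND its arguments
# (coordinates, text, direction) both match the reference (assumed rule).

steps = [  # (predicted_type_ok, full_step_ok) per evaluated step
    (True, True), (True, True), (True, False), (False, False), (True, True),
]
type_acc = sum(t for t, _ in steps) / len(steps)
step_sr  = sum(s for _, s in steps) / len(steps)
print(f"type accuracy: {type_acc:.1%}, step success rate: {step_sr:.1%}")
# type accuracy: 80.0%, step success rate: 60.0%
```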

Grounding (ScreenSpot-V2): OmegaUse-G scores 96.3% average—state-of-the-art. It’s near-perfect on text elements (e.g., 99%+) and very strong on icons/widgets across mobile, desktop, and web. Compared to well-known leaders (UI-Venus-Ground-72B, Seed1.5-VL), OmegaUse edges ahead despite using an MoE-efficient backbone, showing the power of data + decoupled training.

Grounding (ScreenSpot-Pro): This is a tougher, pro-software setting with tiny UI details. OmegaUse-G averages 55.47%. While ultra-large dense models still lead globally (e.g., 61.9%), OmegaUse shines in specific slices like OS-Icon (best-in-class) and ranks runner-up in OS-Text and several domain-text/icon categories. Translation: even against giants, OmegaUse competes tightly where precision icons and system UIs matter most.

Navigation (AndroidControl, offline): OmegaUse reaches 87.6% type accuracy and 79.1% step success rate—new SOTA, surpassing prior top models (including some 72B dense models). This is like solving a multi-step math problem set with fewer mistakes than anyone else.

Navigation (AndroidWorld, online): OmegaUse achieves 55.7% success using only screenshots (no external planners or accessibility trees). It beats some larger models and is close to others that use more inputs. That’s notable because working with fewer input modalities usually makes the task harder.

OS-Nav (ChiM-Nav, Chinese Android): OmegaUse scores 87.78% type accuracy and 74.24% step success, clearly leading open-source baselines. This shows strong robustness on non-English, region-specific app ecosystems with unique layouts.

OS-Nav (Ubu-Nav, Ubuntu desktop): OmegaUse averages 55.9%, topping other strong open baselines. It is especially strong in non-coordinate actions like Hotkey and Type, showing it bridges visual grounding with accurate semantic commands.

Surprises and insights:

  • Clean, human-refined grounding data boosts cross-platform accuracy dramatically.
  • The inside-the-box reward makes a big difference on tiny targets.
  • Even with fewer modalities, a well-trained end-to-end agent remains competitive online.
  • Decoupling plus MoE seems to capture both precision and planning without ballooning compute costs.

🍞 Anchor: Think of a decathlon: OmegaUse doesn’t just sprint fast (grounding); it also nails the pole vault and hurdles (navigation)—and does it on different playing fields (mobile in Chinese, Ubuntu desktop), not just one track.

05 Discussion & Limitations

Limitations:

  • Very complex, unfamiliar apps or rare widgets can still confuse grounding, especially in ScreenSpot-Pro-like micro-UI cases.
  • Online environments change; pop-ups, latency, or A/B-tested layouts can break recorded strategies.
  • Without accessibility trees or extra tools, screenshot-only agents may miss hidden semantics (e.g., unlabeled icons).
  • Long-horizon recovery (undoing mid-trajectory mistakes) remains challenging; partial progress can still derail final success.

Required resources:

  • Training needs substantial GPU/TPU time for SFT+GRPO and MoE routing.
  • Data building requires human-in-the-loop verification; quality costs expert hours.
  • Simulation environments and app farms (Android emulators, Ubuntu VMs) are needed for synthesis and audits.

When not to use:

  • Highly dynamic, security-sensitive UIs (e.g., unpredictable captchas, banking tokens) where automation risks lockouts.
  • Tasks demanding privileged system APIs (firmware updates) rather than GUI sequences.
  • Settings with strict compliance rules forbidding automated input.

Open questions:

  • Can we improve tiny-object grounding further without massive model scaling—perhaps via adaptive zoom or multi-shot glimpses?
  • How to make online recovery smarter—detect drift and re-plan mid-task reliably?
  • Can we auto-detect unsafe or irreversible actions and ask for confirmation dynamically?
  • How to continuously learn from real user feedback while preserving privacy and avoiding catastrophic forgetting?
  • What’s the best mix of bottom-up discovery and top-down coverage to minimize annotation while maximizing robustness?

06 Conclusion & Future Work

Three-sentence summary:

  • OmegaUse is a general-purpose GUI agent that separates precise seeing (grounding) from smart planning (navigation), then trains each with SFT and GRPO on a high-quality, carefully built dataset.
  • Using a Mixture-of-Experts backbone and tailored rewards (like inside-the-box), it achieves state-of-the-art grounding and leading navigation results across mobile and desktop.
  • The new OS-Nav benchmarks (ChiM-Nav and Ubu-Nav) show strong cross-terminal generalization and offer fair tests for future agents.

Main achievement: Demonstrating that decoupled training + MoE efficiency + rigorous data construction and evaluation can beat or match much larger dense models on key GUI tasks.

Future directions:

  • Sharpen tiny-target grounding with adaptive zooming or multi-frame attention.
  • Improve online recovery and self-correction; add safer action guards.
  • Expand OS-Nav to more OSes and languages; incorporate multi-app workflows and tool use.
  • Explore continual learning with privacy-preserving logs.

Why remember this: OmegaUse shows that better data and smarter training structure matter as much as model size. By teaching an agent to both see exactly and plan wisely—and testing it across real systems—we step closer to trustworthy “do-it-for-me” computer help in everyday life.

Practical Applications

  • Automate routine settings tasks (e.g., toggling Wi‑Fi/Bluetooth, adjusting brightness) on phones and desktops.
  • Speed up onboarding by auto-configuring software and system preferences across new machines.
  • Assist with customer support workflows by navigating diagnostic steps and collecting logs via the GUI.
  • Handle office chores like renaming files, compressing folders, or converting documents using desktop apps.
  • Perform multi-step browser tasks (open tabs, search topics, copy info) with consistent hotkeys and scrolls.
  • Support accessibility by following voice instructions to click small buttons or fill forms precisely.
  • Run nightly maintenance: clear caches, update apps, and verify system settings without scripts.
  • Demonstrate product tutorials by executing verified step-by-step GUI actions for training videos.
  • Help QA testing by exploring apps bottom-up to surface hidden screens and record reproducible steps.
  • Assist non-technical teams in data entry and report generation by reliably typing and copying content.
#GUI agent#UI grounding#navigation policy#Mixture-of-Experts#supervised fine-tuning#GRPO#reward shaping#bottom-up exploration#taxonomy-guided generation#OS-Nav#ChiM-Nav#Ubu-Nav#ScreenSpot-V2#AndroidControl#AndroidWorld