
CUA-Skill: Develop Skills for Computer Using Agent

Intermediate
Tianyi Chen, Yinheng Li, Michael Solodko et al. Ā· 1/28/2026
arXiv Ā· PDF

Key Summary

  • This paper builds a big, reusable library of computer skills so an AI can use Windows apps more like a careful human, not a clumsy robot.
  • Each skill is a small, clear action (like 'rename a sheet in Excel') with fill-in-the-blank slots (arguments) and a map of possible steps (an execution graph).
  • A planner model doesn’t memorize all skills; it writes search queries, retrieves likely skills, picks the best one, fills in the blanks, and runs it.
  • If something goes wrong, the agent remembers what failed and tries a different path instead of repeating the same mistake.
  • Skills can run by clicking/typing on the screen or, when safer, by using scripts or hotkeys to avoid brittle mouse clicks.
  • On its own skill executions, the library hits a 76.4% success rate, 1.7×–3.6Ɨ better than strong baselines at generating action trajectories.
  • On the tough WindowsAgentArena benchmark, the CUA-Skill Agent reaches 57.5% best-of-three success and stays efficient (≤30 steps).
  • The method scales: adding more skills doesn’t require retraining the planner; the agent just retrieves new ones at test time.
  • Performance improves with stronger language models, but even modest models get sizable boosts when they use the skill library.
  • This creates a practical path to dependable desktop helpers for tasks like spreadsheets, browsing, file management, and settings.

Why This Research Matters

This work makes desktop AIs more dependable, so they can actually help with everyday tasks like organizing files, editing documents, and browsing safely. By using human-sized skills, it reduces silly mistakes that waste time and cause frustration. It also helps people who struggle with complex UIs by automating careful, step-by-step actions. Businesses can save money by automating routine computer work without writing fragile scripts for every single app. As the skill library grows, the same agent can solve more tasks without retraining. This is a practical path toward trustworthy digital assistants on real, messy desktops.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine teaching a friend to use your computer. You don’t say ā€œMove mouse here, click, type, clickā€ for every tiny step forever. You teach named moves like ā€œOpen Excel,ā€ ā€œRename a sheet,ā€ and ā€œFill a column.ā€ Those named moves are reusable and make work faster and less confusing.

🄬 The Concept: Computer-Using Agents (CUAs) are AIs that try to control a computer like a person would—opening apps, clicking buttons, typing text, and finishing tasks. How it works:

  1. The agent sees the screen and reads your instruction.
  2. It decides which action to do next (click, type, hotkey, etc.).
  3. It keeps going until the task is done or it gets stuck. Why it matters: Without good structure, CUAs often make small mistakes that quickly snowball, especially on long tasks across changing screens.

šŸž Anchor: If you ask, ā€œOpen Edge and go to the home page,ā€ a good CUA knows to start Edge, then navigate home, not randomly click shiny things.

šŸž Hook: You know how building a LEGO castle is easier with labeled pieces (doors, towers, windows) than with a heap of random bricks?

🄬 The Concept: Long-horizon tasks are multi-step jobs like ā€œFind a file, edit it, email it, and archive it.ā€ How it works:

  1. Many steps depend on earlier steps (you can’t attach a file you didn’t open).
  2. Any tiny slip (wrong tab, wrong menu) can break later steps.
  3. Success needs planning, memory, and recovery. Why it matters: Without reliable building blocks and memory, the agent keeps getting lost in long tasks.

šŸž Anchor: Think of doing a school project: research → write → format → share. If you forget where you saved your draft, you waste time retracing steps.

šŸž Hook: Imagine a cookbook where every recipe is just a long list of individual spoon moves. That would be exhausting to follow every single time.

🄬 The Concept: Reusable skills are named, shareable procedures that match how people actually use computers. How it works:

  1. Each skill captures a small, meaningful intent (like ā€œCreate a new folder named Xā€).
  2. It has blanks (arguments) like the folder name to fill in at run time.
  3. It has a step map with guarded options (e.g., use a hotkey if available, else click a button). Why it matters: Without skills, agents treat every task as brand new, re-discovering the same steps again and again, which is slow and brittle.

šŸž Anchor: Instead of ā€œmove cursor, click, type L, type o, type g, type s,ā€ a skill called ā€œCreate folder named Logsā€ does the whole mini-job intelligently.

šŸž Hook: You know how you look at the screen to make sure you’re clicking the right thing? That’s your eyes grounding your actions to the right spot.

🄬 The Concept: GUI grounding means matching words like ā€œthe Save buttonā€ to the actual pixels on the screen. How it works:

  1. The agent views a screenshot.
  2. It finds the target area (button, menu, field).
  3. It clicks or types exactly there. Why it matters: Without grounding, the agent can’t act accurately on changing screens, causing cascading errors.

šŸž Anchor: If the agent can’t locate the ā€œTimerā€ tab in the Clock app, it can’t start a 25-minute ā€œPomodoro Session.ā€

The world before CUA-Skill: Many agents modeled desktop work as flat sequences of clicks and keys. They often stuffed big tool lists into prompts or wrote monolithic plans that quickly broke when the UI changed. Some systems tried memory and knowledge graphs, which helped track facts, but not the ā€œhow-toā€ procedures humans lean on. Others relied on heavy API integrations (like Model Context Protocol tools). That’s great for servers and scripts, but many desktop apps don’t expose friendly APIs or change often, making those approaches hard to maintain.

The problem: We lacked a middle layer that captured human-style, reusable procedures for GUIs—something between your overall goal (ā€œformat this documentā€) and the micro-actions (ā€œmove mouse by 10 pixelsā€).

Failed attempts (and why they stumbled):

  • End-to-end prompting: Powerful but brittle. A single wrong click can derail long tasks.
  • Big static tool menus: Unscalable to hundreds of app-specific tools, hard for the model to pick from reliably.
  • Code-only skills (server-side): Don’t transfer well to Windows apps without solid APIs.
  • Memory-only fixes: Memory helps recall, but doesn’t give reusable step-by-step skills.

The gap: Agents needed a structured skill base—small, parameterized skills with step maps that cover common Windows apps and flex with UI changes.

Real stakes: This matters for everyday life—students formatting assignments, workers managing spreadsheets, families organizing photos, or people with accessibility needs getting reliable digital help. With sturdy skills, agents save time, reduce frustration, and make computers easier and fairer to use.

02 Core Idea

šŸž Hook: Think of a binder full of recipe cards. Each card has a clear title (ā€œBake Chocolate Cupcakesā€), blanks to fill (how many cupcakes? which tray?), and a flow with branches (if no butter, use oil). You grab the right card, fill the blanks, and cook.

🄬 The Concept (Aha! in one sentence): Package human computer-use know-how as reusable, parameterized skills with step graphs, then let an LLM retrieve, fill, chain, and execute them with memory-aware recovery. How it works (high level):

  1. Build a structured skill library of small, app-specific skills.
  2. Give each skill arguments (the blanks) and an execution graph (the step map with safe branches).
  3. Connect skills with a composition graph that shows typical human workflows.
  4. At run time, retrieve likely skills, re-rank them, fill the arguments, and execute by GUI grounding or scripts.
  5. Remember outcomes to avoid repeating mistakes and to try alternatives. Why it matters: This middle layer turns long, fragile click-chains into sturdy, reusable moves that scale across apps and tasks without constant re-training.

šŸž Anchor: ā€œRename an Excel sheet to ā€˜Company Analysisā€™ā€ becomes a single skill with a plan: try hotkeys first, else right-click tab → Rename, type the name, press Enter.

Multiple analogies for the same idea:

  1. LEGO: Each skill is a brick with pegs (arguments). Snap bricks in different orders to build castles (tasks).
  2. Maps: An execution graph is a city map with detours. If a road (button) is closed, follow a signed alternate route (hotkey).
  3. Sports plays: The composition graph is a playbook. Each play (skill) has routes; the quarterback (planner) picks and adapts based on field conditions (UI state).

Before vs. after:

  • Before: Agents clicked at the level of pixels, not procedures. They forgot clever shortcuts and re-invented steps.
  • After: Agents select named, transfer-ready maneuvers. They recover from bumps (dialogs, layout shifts) and keep going.

Why it works (intuition):

  • Human-aligned units: The brain thinks in named chunks (ā€œOpen Downloads,ā€ not ā€œCtrl+L, type path, Enter, ā€¦ā€). Skills mirror those chunks.
  • Parameterization: Blanks let one skill cover many cases (any folder name, any color choice, any file path).
  • Graphs over scripts: Branches absorb UI variability (menu vs. hotkey, dialog vs. no dialog) without starting over.
  • Retrieval: The planner doesn’t juggle hundreds of skills in its short-term memory; it searches just-in-time.
  • Memory: Reflection prunes dead ends and encourages better choices next step.

Building blocks (Sandwich, one by one):

šŸž Hook: You know how a library keeps books labeled by topic so you can find what you need fast?

🄬 The Concept: Structured Skill Library is a catalog of small, reusable computer actions with names, descriptions, arguments, and graphs. How it works:

  1. Write each skill like a recipe card.
  2. Index the cards so the agent can search them.
  3. Keep skills small and app-specific so they stay reliable. Why it matters: Without this library, the agent keeps guessing raw clicks from scratch.

šŸž Anchor: ā€œFileExplorerCreateNewFolder(name=Logs)ā€ beats improvising every click in File Explorer.

šŸž Hook: Imagine a choose-your-own-adventure where choices depend on what’s on the page right now.

🄬 The Concept: Parameterized Execution Graphs are step maps with branches guarded by the current UI. How it works:

  1. Nodes are control states (start, success, alternatives).
  2. Edges are actions (click, hotkey, script) with conditions (e.g., if dialog is open).
  3. Arguments fill in targets (which file name? which cell?). Why it matters: Without graphs, one odd pop-up ruins the whole plan.

šŸž Anchor: If the font menu is hidden, the PowerPoint text-color skill uses Alt+H → F → C instead of hunting pixels.

šŸž Hook: Think of sorting groceries into two bins: fixed choices (apples, bananas) vs. open choices (weight in grams).

🄬 The Concept: Feasible Domains and Argument Generators define what values are valid for each skill blank and how to pick them. How it works:

  1. Finite domains (menu items) are enumerated or read from the UI.
  2. Open domains (file paths, free text) are sampled or chosen with smart rules.
  3. The planner fills only allowed values to keep execution real. Why it matters: Without domains, the agent may ask for a color that doesn’t exist or a path that’s impossible.

šŸž Anchor: For ā€œTimer length,ā€ only time-like values make sense; the generator picks 25 for a Pomodoro session.

šŸž Hook: Building a robot dance from known moves—step, spin, dip—in a sensible order.

🄬 The Concept: Skill Composition Graph shows common chains humans follow across and within apps. How it works:

  1. Nodes are skills; edges are typical next steps.
  2. Paths form workflows (launch Edge → open homepage → search → save page).
  3. It suggests good sequences without hard-coding one single plan. Why it matters: Without composition, the planner might hop randomly between unrelated skills.

šŸž Anchor: Excel open → rename sheet → enter formula → autofill is a tried-and-true path the agent can follow.

03 Methodology

At a high level: Input (your instruction + the current screen) → Query & Retrieve candidate skills → Re-rank and pick one → Fill the blanks (arguments) → Execute via GUI or scripts → Reflect in memory → Repeat until done.
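Before walking through each stage, here is a hedged sketch of that outer loop. Every function below is a trivial placeholder standing in for the components detailed in Steps 1–5; the control flow, not the stubs, is the point:

```python
# A minimal sketch of the retrieve -> rerank -> configure -> execute ->
# reflect loop. All stage functions are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Outcome:
    success: bool
    task_done: bool

def retrieve_skills(goal, screen):                 # Step 1 (stub)
    return ["ExcelRenameSheet"]

def rerank(candidates, screen, memory):            # Step 2 (stub)
    return candidates[0]

def configure_arguments(skill, screen):            # Step 3 (stub)
    return {"newSheetName": "company analysis"}

def execute(skill, args, screen):                  # Step 4 (stub)
    return Outcome(success=True, task_done=True)

def run_agent(goal: str, screen=None, max_steps: int = 30) -> bool:
    memory = []                                    # Step 5: reflections
    for _ in range(max_steps):
        candidates = retrieve_skills(goal, screen)
        skill = rerank(candidates, screen, memory)
        args = configure_arguments(skill, screen)
        outcome = execute(skill, args, screen)
        memory.append((skill, args, outcome.success))
        if outcome.task_done:
            return True
    return False                                   # step budget exhausted

print(run_agent("Rename Sheet1 to company analysis"))  # True
```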

Step 1: Retrieve-Augmented Skill Planner

  • What happens: The planner LLM reads your goal and the screen, then writes a few different search queries (like ā€œopen Edge home,ā€ ā€œlaunch Edge,ā€ ā€œgo to home pageā€) to hunt for matching skills in the library.
  • Why this step exists: The planner shouldn’t memorize every skill. Retrieval narrows hundreds of skills to a tiny, relevant shortlist at test time.
  • Example: For ā€œNext: Open Edge Home Page,ā€ the planner emits several queries so both launch and navigation skills surface.

šŸž Hook: Like searching a bookstore with multiple keywords to find the perfect guide. 🄬 The Concept: Dynamic Skill Retrieval means finding skills on the fly with both keyword and meaning-based search. How it works:

  1. The planner writes multiple queries (ensemble) to cover different phrasings.
  2. A hybrid index (lexical + embedding) returns top candidates.
  3. The shortlist goes to the next stage. Why it matters: Without retrieval, the planner either sees too many tools or too few, causing wrong picks.

šŸž Anchor: Typing ā€œrename sheetā€ and ā€œtab renameā€ both retrieve the ExcelRenameSheet skill.
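As a rough illustration of ensembled, hybrid retrieval, the sketch below blends lexical word overlap with a bag-of-words cosine and keeps each skill's best score across several query phrasings. Both scorers are toy stand-ins for a real lexical index and a learned embedding model:

```python
# A minimal sketch of hybrid (lexical + "embedding") retrieval over an
# ensemble of queries. Skill descriptions are illustrative.
from collections import Counter
import math

skills = {
    "ExcelRenameSheet": "rename a worksheet tab in excel",
    "ExcelOpenExistingWorkbook": "open an existing excel workbook file",
    "EdgeGoHome": "navigate microsoft edge to the home page",
}

def lexical(query: str, doc: str) -> float:
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / max(len(q), 1)

def cosine(query: str, doc: str) -> float:
    a, b = Counter(query.split()), Counter(doc.split())
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(queries: list[str], k: int = 2) -> list[str]:
    """Each skill keeps its best hybrid score across all query phrasings."""
    scores: dict[str, float] = {}
    for q in queries:
        for name, desc in skills.items():
            s = 0.5 * lexical(q, desc) + 0.5 * cosine(q, desc)
            scores[name] = max(scores.get(name, 0.0), s)
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(retrieve(["rename sheet", "tab rename"]))  # ExcelRenameSheet ranks first
```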

Step 2: Skill Re-ranker

  • What happens: The agent compares retrieved skills using the current UI, history, and skill requirements, then selects the best one. It also keeps a small set of basic fallback actions (click, type, hotkey) in case no high-level skill fits.
  • Why this step exists: Retrieval is broad; reranking is precise. It turns a good guess into the best next move.
  • Example: If Edge is already open, ā€œGo to Homeā€ outranks ā€œLaunch Edge.ā€

šŸž Hook: Think of a coach choosing the right play after seeing the defense. 🄬 The Concept: Retrieval-Augmented Skill Selection is picking the single most promising skill from the shortlist. How it works:

  1. Score candidates by goal match, UI fit, and argument feasibility.
  2. Prefer skills that continue the current workflow path.
  3. Fall back to primitives if nothing fits. Why it matters: Without selection, the agent may waste steps or pick incompatible skills.

šŸž Anchor: In File Explorer at Downloads, ā€œCreate new folderā€ beats ā€œGo to Downloadsā€ again.
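A toy version of this selection step might score candidates additively for ā€œcan it run right now?ā€ and ā€œdoes it continue the workflow?ā€, as sketched below. The features and weights are assumptions; in the paper the re-ranker is an LLM judgment, not a hand-written linear model:

```python
# A minimal sketch of skill re-ranking with a fallback to primitives.
# Preconditions, weights, and skill entries are illustrative.
def rerank(candidates, ui_state, last_skill, composition):
    def score(skill) -> float:
        s = 0.0
        if skill["precondition"](ui_state):                   # UI fit
            s += 2.0
        if skill["name"] in composition.get(last_skill, []):  # workflow continuity
            s += 1.0
        return s
    best = max(candidates, key=score)
    # If even the best candidate cannot run, fall back to a basic action.
    return best if score(best) > 0 else {"name": "primitive_click"}

candidates = [
    {"name": "EdgeLaunch", "precondition": lambda ui: not ui["edge_open"]},
    {"name": "EdgeGoHome", "precondition": lambda ui: ui["edge_open"]},
]
ui = {"edge_open": True}
picked = rerank(candidates, ui, "EdgeLaunch", {"EdgeLaunch": ["EdgeGoHome"]})
print(picked["name"])  # EdgeGoHome: Edge is already open, so launching loses
```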

Step 3: Skill Configurator (Arguments within Feasible Domains)

  • What happens: The agent fills in the blanks for the chosen skill—names, paths, cells, colors—using domain rules.
  • Why this step exists: Correct arguments make the skill executable in the real UI.
  • Example: For ā€œCreate Timer,ā€ it sets minutes=25, label=ā€œPomodoro Session.ā€ For Excel rename, it sets newSheetName=ā€œcompany analysis.ā€

šŸž Hook: Like filling a form before printing your ticket—wrong details, no trip. 🄬 The Concept: Domain-aware Argument Instantiation keeps values valid and actionable. How it works:

  1. Read allowable values (e.g., visible menu items, selectable colors).
  2. For open text, use safe formats or heuristics (e.g., legal file names).
  3. Check UI state to ensure targets exist (is the sheet tab visible?). Why it matters: Without this, the agent might request a non-existent menu option.

šŸž Anchor: PowerPoint font color picks from the actual dropdown grid; it won’t ask for ā€œInvisible Unicorn Blue.ā€
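The sketch below shows one possible shape for this filling step: finite domains are read from the UI state, and open text is checked against safe formats before the skill runs. The argument specs, patterns, and dictionary keys are illustrative:

```python
# A minimal sketch of domain-aware argument instantiation. Specs and the
# ui_state keys are assumptions, not the paper's exact interface.
import re

def instantiate(arg_specs: dict, requested: dict, ui_state: dict) -> dict:
    """Fill each blank only with a value its domain allows."""
    filled = {}
    for arg, spec in arg_specs.items():
        value = requested.get(arg)
        if spec["kind"] == "finite":
            choices = ui_state[spec["read_from"]]   # enumerate from the live UI
            if value not in choices:
                raise ValueError(f"{value!r} is not among visible choices")
        elif not re.fullmatch(spec["pattern"], str(value)):
            raise ValueError(f"{value!r} fails the format for {arg!r}")
        filled[arg] = value
    return filled

timer_specs = {
    "minutes": {"kind": "open", "pattern": r"[1-9]\d{0,2}"},
    "label": {"kind": "open", "pattern": r'[^<>:"/\\|?*]{1,64}'},
}
print(instantiate(timer_specs, {"minutes": 25, "label": "Pomodoro Session"}, {}))
# {'minutes': 25, 'label': 'Pomodoro Session'}
```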

Step 4: Executor (GUI Grounding + Scripts)

  • What happens: The agent walks the skill’s execution graph depth-first, executing one primitive at a time. It uses GUI grounding to locate buttons/fields on the screen. When possible, it prefers hotkeys or scripts for reliability.
  • Why this step exists: Accurate, step-by-step control survives UI changes better than a single blind click.
  • Example: In Excel, it may press Ctrl+G to jump to F7, type the formula, press Enter, then Ctrl+D to autofill.

šŸž Hook: Like following a map with optional detours when a road is blocked. 🄬 The Concept: Parameterized Execution Traversal turns the graph into real actions, choosing safe edges as the UI changes. How it works:

  1. Check guarded branches: if dialog present, handle it; else continue main path.
  2. Use hotkeys when available; else click grounded coordinates.
  3. Respect edge weights (preferences) if provided. Why it matters: Without traversal logic, a surprise popup would break the whole run.

šŸž Anchor: If the rename context menu doesn’t appear, the Excel sheet rename path switches to double-clicking the tab and typing.
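Here is a minimal sketch of that traversal, reusing the guarded-graph shape from the Core Idea section: edges are tried in preference order, the first passing guard fires, and a dead-ended branch backtracks to try the next option. Primitive execution is stubbed with a print:

```python
# A minimal sketch of depth-first traversal over a guarded execution graph.
# The print is a stand-in for executing a real primitive (click, hotkey).
def traverse(graph: dict, ui_state: dict, node: str = "start", visited=None) -> bool:
    visited = set() if visited is None else visited
    if node == "success":
        return True
    if node in visited:               # avoid cycling through the same state
        return False
    visited.add(node)
    for edge in graph.get(node, []):  # edges listed in preference order
        if edge["guard"](ui_state):
            print("exec:", edge["action"])
            if traverse(graph, ui_state, edge["next"], visited):
                return True           # this branch reached success
    return False                      # dead end; the caller backtracks

graph = {
    "start": [
        {"guard": lambda ui: ui["hotkeys_ok"],
         "action": ("hotkey", "Ctrl+D"), "next": "success"},
        {"guard": lambda ui: True,
         "action": ("click", "Fill Down"), "next": "success"},
    ],
}
traverse(graph, {"hotkeys_ok": False})  # falls through to the click branch
```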

Step 5: Memory and Reflection

  • What happens: After each skill, the agent saves a short summary: what it tried, what happened, and whether it worked. It consults this memory before the next step to avoid loops.
  • Why this step exists: Remembering failures is how you stop repeating them.
  • Example: If ā€œOpen Edgeā€ failed due to a permission dialog, the planner won’t keep relaunching; it tries handling the dialog or a different route.

šŸž Hook: Like keeping a lab notebook so you don’t redo a failed experiment the same way. 🄬 The Concept: Memory-Aware Failure Recovery uses past outcomes to steer future choices. How it works:

  1. Store compact skill summaries with success/fail signals.
  2. Penalize recent failures in reranking.
  3. Propose alternative branches or different skills. Why it matters: Without memory, the agent can get stuck clicking the same wrong button.

šŸž Anchor: After a misclick in VLC, the agent stops chasing that control and switches to a keyboard shortcut.
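One lightweight way to implement this is a rolling log of step outcomes whose recent failures subtract from a branch's selection score, as sketched below. The penalty scheme and record format are illustrative assumptions:

```python
# A minimal sketch of memory-aware failure recovery: recent failures of a
# skill/branch pair push its score down at the next selection.
from collections import deque

memory = deque(maxlen=20)  # compact rolling log of recent step summaries

def remember(skill: str, branch: str, success: bool) -> None:
    memory.append({"skill": skill, "branch": branch, "success": success})

def penalty(skill: str, branch: str) -> float:
    return sum(1.0 for m in memory
               if m["skill"] == skill and m["branch"] == branch
               and not m["success"])

def pick(options):
    """options: (skill, branch, base_score); failures discount the score."""
    return max(options, key=lambda o: o[2] - penalty(o[0], o[1]))

remember("VLCPlayFile", "click_button", success=False)   # the misclick
options = [("VLCPlayFile", "click_button", 2.0),
           ("VLCPlayFile", "hotkey_space", 1.5)]
print(pick(options))  # the hotkey branch now beats the failed click branch
```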

Concrete mini-walkthroughs:

  • Excel: ā€œOpen betawacc.xlsx, rename Sheet1 to company analysis, compute averages in F7:F10.ā€
    1. Retrieve ExcelOpenExistingWorkbook; configure filePath=betawacc.xlsx; execute via start command or recent files.
    2. Retrieve ExcelRenameSheet; choose right-click→Rename or double-click; type company analysis.
    3. Retrieve ExcelInsertFunctionCall; go to F7; type =AVERAGE(C7:E7); Enter.
    4. Retrieve ExcelAutoFillDown; select F7:F10; Ctrl+D.
  • Clock: ā€œCreate a 25-minute timer called Pomodoro Session.ā€
    1. Switch to Timer tab.
    2. Add timer; set minutes=25; label=Pomodoro Session; Save.

Secret sauce:

  • Human-sized units (skills) + safe branching (graphs) + just-in-time retrieval (planner) + remembering mistakes (memory). Each ingredient is simple; together they make reliable long-horizon execution.

04 Experiments & Results

The test: Can a big, carefully built skill library actually run lots of small tasks reliably, and can an end-to-end agent use it to solve real multi-step desktop problems?

What they measured and why:

  • Execution success rate for individual skills composed into tiny tasks: Do these cards execute as promised?
  • End-to-end success on WindowsAgentArena (WAA): Can the full agent retrieve skills, fill them, and complete natural-language tasks under changing UIs?
  • Efficiency (steps) and coverage (how many distinct skills used): Are we fast and general, not overfit to a few tricks?

The competition:

  • UltraCUA (a strong hybrid-action baseline), OpenAI Operator, Agent-S and AgentS3, UI-TARS variants, NAVI, STEVE, and UFO-2, with human performance reported for context.

Scoreboard with context:

  • Skill executions and synthesized trajectories: 76.4% success. That’s like getting an A when many peers hover at C+/Bāˆ’. In fact, it’s 1.7×–3.6Ɨ higher than UltraCUA and Operator on comparable synthesized tasks. Translation: the recipe cards are sturdy.
  • WindowsAgentArena (end-to-end): 50.26% single-run and 57.5% best-of-three with a GPT-5 planner. That edges out prior state-of-the-art and stays efficient (≤30 steps). Think of it as winning a close, high-stakes tournament while running fewer laps than rivals.
  • Skill usage breadth: Only 117 out of 478 distinct skills were needed to get SOTA on WAA. That suggests good general-purpose skills rather than benchmark-specific hacks.

Per-application flavor:

  • High success: Apps with stable layouts and strong hotkeys (Excel, Settings, Bing/Edge/Chrome tasks) shine. Hotkeys sidestep flaky pixel clicks.
  • Tougher zones: Visually dense or media-heavy apps (PowerPoint, VLC) are harder; tiny controls and dynamic content stress grounding and timing.

Surprising findings:

  • More capable planners help, but skills help everyone. Even smaller models get clear boosts when using the skill base, and bigger models get even bigger jumps (+15.6% for GPT-5 with skills vs. without on WAA).
  • Efficiency without sacrifice: The agent keeps step counts modest while improving success—a rare combo, because extra safety often costs more clicks. Here, safe branching + hotkeys + memory curb both errors and retries.
  • Not all skills are equal: A minority of versatile, transferable skills carry a lot of weight across tasks, hinting at a core set of ā€œdigital literacyā€ moves that generalize widely.

Takeaway: The numbers say the library is not just neat; it is useful. It turns brittle, ad-hoc clicking into dependable mini-procedures, and the agent uses them to beat strong baselines on a respected benchmark.

05 Discussion & Limitations

Limitations (honest and specific):

  • Model dependence: Stronger language models retrieve, rank, and configure skills better. With lighter models, the library still helps, but planning errors rise.
  • UI volatility: Rapidly changing app layouts, transient pop-ups, or animations can still confuse GUI grounding and timing.
  • Coverage gaps: The initial release (hundreds of skills across ~17 apps) won’t cover every niche workflow; unfamiliar apps may force fallback to low-level clicks.
  • OS focus: The work targets Windows. Porting to macOS/Linux requires reauthoring skills, hotkeys, and execution branches.
  • Argument edge cases: Open-domain arguments (file paths, arbitrary text) can still be tricky without strong environment-aware heuristics.

Required resources:

  • A Windows environment (often virtualized) with stable versions of target apps.
  • An LLM planner (quality affects performance), an embedding model for retrieval, and a GUI grounding model for screen localization.
  • Skill authoring time to expand coverage; telemetry to refine domains and branches.

When not to use:

  • Highly specialized software that changes UI often or has complex modal dialogs not yet modeled.
  • Security-sensitive actions (system admin tasks) where any misclick is unacceptable; prefer audited, API-level automations.
  • Time-critical scenarios with strict latency budgets where even short retrieval/planning loops are too slow.

Open questions:

  • Automatic skill discovery: Can we mine new skills and branches from human traces or agent rollouts?
  • Learning edge weights: Can the agent learn which branches are most reliable per app/version and update them online?
  • Cross-OS generalization: How portable are skill schemas and composition graphs across operating systems?
  • Safer grounding: How to make UI perception more robust to theme changes, scaling, and adversarial elements?
  • Meta-reasoning: When should the agent choose scripts vs. GUI steps for the same intent to balance speed and safety?

Bottom line: CUA-Skill is a strong foundation but not the final word. Its structure unlocks scale and reliability, and the next wave is about growing coverage, automating skill authoring, and making perception and planning even sturdier.

06 Conclusion & Future Work

Three-sentence summary: This paper turns human computer know-how into a reusable Windows skill library with fill-in-the-blank arguments and step graphs, then builds an agent that retrieves, configures, and executes those skills with memory-aware recovery. The result is a practical, scalable system that beats strong baselines on both synthesized trajectories (76.4% success) and the WindowsAgentArena benchmark (57.5% best-of-three) while staying efficient. It shows that the right middle layer—skills—can make desktop agents far more dependable.

Main achievement: Establishing a structured, parameterized skill base as the missing procedural layer between high-level goals and low-level GUI actions, and proving that this design scales in the wild.

Future directions:

  • Auto-mining and refining skills from human/agent traces.
  • Learning branch preferences and timing directly from feedback.
  • Porting to other OSes and expanding app coverage through community skill stores.
  • Tighter integration with standardized tool protocols without depending on app-specific APIs.

Why remember this: It reframes computer use for agents around human-sized chunks (skills), with parameters, graphs, retrieval, and memory. That simple recipe turns fragile click streams into reliable digital helpers you can trust for real work.

Extra concept anchors (Sandwich quickies):

  • šŸž Hook: Like baking from a trusted cookbook instead of winging it.
  • 🄬 The Concept: Long-horizon Task Completion is finishing big, multi-step jobs by chaining dependable skills and recovering from bumps. How it works: retrieve → choose → fill → execute → remember → repeat. Why it matters: Without it, the agent keeps getting lost mid-project.
  • šŸž Anchor: From opening a spreadsheet to emailing a polished report—without derailing on a single wrong click.

Practical Applications

  • Automate repetitive office chores: rename files, create folders, move documents, and archive completed work.
  • Speed up spreadsheet work: open specific files, rename sheets, insert formulas, and autofill columns reliably.
  • Standardize document formatting in Word or PowerPoint: set fonts, colors, headings, and export to PDF.
  • Web research assistance: launch a browser, go to a home page, search, and save pages for later reading.
  • System setup: open Settings, toggle options, adjust displays, and configure Wi‑Fi with guarded steps.
  • Education helpers: prepare class materials by collecting web references and organizing them into folders.
  • Customer support workflows: open ticket dashboards, fetch logs via File Explorer, and attach files to responses.
  • Data collection: navigate to sites, download files, and place them into structured directories.
  • Timed productivity: create Clock timers (e.g., 25-minute Pomodoro) and reminders while working in other apps.
  • Team operations: share a common skill pack so multiple agents follow the same reliable procedures across machines.
#computer-using agents Ā· #desktop automation Ā· #skill library Ā· #execution graph Ā· #GUI grounding Ā· #retrieval-augmented planning Ā· #WindowsAgentArena Ā· #argument instantiation Ā· #skill composition Ā· #long-horizon tasks Ā· #memory-aware recovery Ā· #hotkey automation Ā· #agent tooling Ā· #scalable CUAs Ā· #parameterized skills