OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent
Key Summary
- Computer-using agents kept forgetting important visual details over long tasks and could not reliably find up-to-date, step-by-step help for unfamiliar apps.
- OS-Symphony fixes this with a conductor-like Orchestrator, a Reflection-Memory Agent that saves only key milestone screenshots, and versatile Tool Agents (Grounders, a Coder, and a Multimodal Searcher).
- The Reflection-Memory Agent turns long, messy histories into compact visual milestones plus short summaries and flags loops or errors so the agent can self-correct.
- The Multimodal Searcher uses a See-Act strategy to browse the web visually inside a safe sandbox and synthesize tutorials that match what the screen actually looks like.
- Together these parts make the agent more robust on long workflows and better at generalizing to new software and operating systems.
- OS-Symphony sets new state-of-the-art results: 65.84% on OSWorld, 63.5% on WindowsAgentArena, and 46.0% on MacOSArena.
- Ablations show big gains from both milestone memory and visual-centric search compared to last-K memory and text-only retrieval.
- Smaller or cheaper models benefit most because the Searcher fills knowledge gaps; cost-performance is strong with GPT-5-Mini.
- Limits remain: desktop-only testing, slower than humans due to multiple agents, and occasional misses on very subtle visuals.
- The framework is modular and future-proof: better VLMs, grounders, or search engines can plug in without redesigning the whole system.
Why This Research Matters
Many people struggle with software that changes its menus or looks different on another operating system. OS-Symphony helps by remembering only the big, important visual moments and fetching tutorials that match the exact screen you see. This makes long, multi-step tasks like reports, spreadsheets, and system setup far more reliable. It also lowers the cost of automation because smaller models can succeed by calling the visual Searcher when needed. Cross-platform support means help on Linux, Windows, and macOS, which makes the system useful in homes, schools, and workplaces. Over time, this approach can turn computer help from a brittle script into a dependable assistant that adapts as software evolves.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how when you work on a long school project, your desk gets messy, and you start losing track of where you put your notes? And when you try something new, you look up a how-to video that matches what you see on your screen.
🥬 The Situation Before: Computer-using agents (AIs that click, type, and scroll like people) got very good at handling short, simple tasks. But on long, multi-step tasks, they often forgot what mattered on past screens, made the same mistake again, or drifted off course. And when they faced new apps or versions, they relied on text-only search or outdated knowledge that didn’t match what the screen really looked like.
- What it is: This problem space is about making AI steady over many steps and adaptable to new software.
- How it worked before: Agents used short-term memory (last K turns) or text summaries to remember history; for extra knowledge, they tried Retrieval-Augmented Generation (RAG) using text search or curated local docs.
- Why it mattered: Without stable memory and the right tutorials, agents looped, clicked the wrong things, or gave up.
🍞 Example: Imagine an AI trying to change a slide to portrait mode in a presentation app. It clicks a setting, thinks it worked, but the slide is still sideways. Without a clear, visual memory and reflection, it might wrongly say done instead of trying a better path.
🍞 Hook: Think about following a recipe with only written steps but no pictures. If the stove dial looks different from the recipe, you could get stuck.
🥬 The Problem: Two stubborn obstacles held agents back.
- Long-horizon robustness: Agents needed granular control over which past screenshots to keep and how to reflect on them, not just long walls of text. Otherwise, visual context got lost, errors piled up, and loops went unnoticed.
- Generalization to new software: Text-only or static knowledge bases couldn’t keep up with changing interfaces. They missed image-heavy, layout-specific cues (like where a menu actually sits), and local KBs were expensive to build and update.
🍞 Example: An agent tries to make a Chrome desktop shortcut on Linux. Menus are similar but not identical across versions. Text docs may say one path, but the menu label on screen is different. The agent needs a live, visual tutorial that matches this exact UI.
🍞 Hook: Picture a librarian who only keeps the most important pages marked with sticky notes, and a friendly tutor who can browse the web and show screenshots that look like yours.
🥬 Failed Attempts:
- Last-K memory windows: Efficient but forget key earlier visuals; errors like intent drift go undetected.
- Long text summaries: Compress history but drop critical visual semantics, so agents can’t verify if a click did anything visible.
- Text-only RAG: Finds words, not pictures; hard to follow when tutorials are screenshot-heavy. Local knowledge bases require heavy maintenance and still miss new app versions.
🍞 Example: A text summary might say Click Properties, but if the panel looks different or the change didn’t show up on screen, a text-only approach won’t notice.
🍞 Hook: Imagine a coach who pauses the game at key moments, pins those frames on a board, and later compares new moves to spot repeats or mistakes.
🥬 The Gap: Agents needed two things at once: (1) a visual-first, milestone memory so they don’t drown in redundant screenshots yet keep the important ones, and (2) a visual-aware search tool that can browse, see, and build step-by-step tutorials aligned to the current screen.
🍞 Example: Save only the first screen, the target webpage, and each big sub-goal accomplished. When stuck, browse to get a tutorial with matching screenshots, not just text.
🍞 Hook: Think of a team where a conductor leads, one player remembers big moments, another looks up fresh how-tos with pictures, and others handle precise pointing and heavy lifting.
🥬 Why it Matters in Daily Life:
- Doing taxes across multiple apps, formatting a report from a spreadsheet, or configuring settings on a new OS all need steady memory and up-to-date guidance.
- Cross-platform help: Windows, Linux, and Mac look different. A visual-aware searcher can adapt.
- Cost and access: Smaller, cheaper models can perform well if they can fetch the right tutorial and keep the right memories.
🍞 Example: A student needs to copy results from an Excel file into a report, reformat it, and export a PDF. The agent that remembers milestones and pulls a matching tutorial will finish faster and with fewer mistakes.
02 Core Idea
🍞 Hook: You know how a band sounds best when a conductor leads, the drummer keeps time, the lead guitarist adds flair, and someone else handles lyrics? Teamwork makes it work.
🥬 Aha in One Sentence: Keep only the most important visual milestones, reflect on them to self-correct, and when stuck, browse the web visually to build a step-by-step tutorial that matches what’s on-screen—then orchestrate all of this as one team.
🍞 Anchor: When Chrome’s menu doesn’t show Create shortcut where you expect, the agent’s memory flags a loop, the Orchestrator calls the Searcher, the Searcher browses and returns a Linux-specific, screenshot-aligned tutorial, and the task completes.
— New Concept: See-Act Paradigm — 🍞 Hook: Imagine driving: you look, then decide, then act. 🥬 The Concept:
- What it is: See-Act means the agent first sees the screen, then chooses an action based on what it sees.
- How it works: 1) Capture a screenshot; 2) Analyze visible elements; 3) Choose an action like click or type; 4) Observe the new screen and repeat.
- Why it matters: Without looking first, the agent would click blindly and get lost. 🍞 Anchor: The Searcher scrolls a help page only after seeing that the answer is below the fold.
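To make the loop concrete, here is a minimal See-Act skeleton in Python. Every helper (`capture_screenshot`, `choose_action`, `execute`) is a hypothetical stand-in for a real VLM plus desktop-control backend, not the paper's API; only the see-then-act control flow is the point.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str            # "click", "type", "scroll", or "done"
    target: str = ""     # natural-language description of the UI element

def capture_screenshot() -> bytes:
    """Hypothetical stand-in for a real screen-capture backend."""
    return b""

def choose_action(task: str, screenshot: bytes) -> Action:
    """Hypothetical stand-in for a VLM call that reads the screen first."""
    return Action(kind="done")

def execute(action: Action) -> None:
    """Hypothetical stand-in for mouse/keyboard control."""

def see_act_loop(task: str, max_steps: int = 50) -> str:
    for _ in range(max_steps):
        screenshot = capture_screenshot()          # 1) See the current screen
        action = choose_action(task, screenshot)   # 2) Decide from what is visible
        if action.kind == "done":
            return "completed"
        execute(action)                            # 3) Act, then loop back and look again
    return "failed"                                # 4) Step budget exhausted
```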
— New Concept: OS-Symphony — 🍞 Hook: Think of an orchestra where different instruments play different parts to make the song work. 🥬 The Concept:
- What it is: OS-Symphony is a holistic team for computer-use made of an Orchestrator, a Reflection-Memory Agent, and Tool Agents.
- How it works: The Orchestrator plans; the Reflection-Memory Agent curates milestone screenshots and gives reflections; Tool Agents localize UI, write code when needed, and search the web visually.
- Why it matters: Without teamwork, the agent either forgets key visuals or lacks the right how-to guidance. 🍞 Anchor: While automating a multi-app workflow, OS-Symphony saves only pivotal screens and calls the Searcher when UI differences appear.
— New Concept: Orchestrator — 🍞 Hook: Like a conductor keeping the tempo and cueing solos. 🥬 The Concept:
- What it is: The Orchestrator is the decision-maker that reads instructions, current screen, short history, reflections, and tutorials, then picks the next action.
- How it works: 1) Read task and latest reflection; 2) Consider last K steps plus any tutorial; 3) Choose a GUI action or call a Tool Agent; 4) Repeat until done or fail.
- Why it matters: Without it, tools would not coordinate or know when to search or code. 🍞 Anchor: After the RMA says you are looping, the Orchestrator calls the Searcher instead of clicking more.
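A sketch of what the Orchestrator's decision interface could look like, reusing the hypothetical stubs above; `Decision` and `orchestrate` are illustrative names, and the full cycle appears in the Methodology section below.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    kind: str        # a grounded GUI action such as "click", or
                     # "call_code_agent", "call_search_agent", "done", "fail"
    query: str = ""  # how-to query or code request when delegating

def orchestrate(instruction: str, screen: bytes, recent_steps: list,
                reflection, tutorial) -> Decision:
    """Hypothetical VLM call standing in for the Orchestrator: read the
    task, the latest reflection, the last-K steps, and any tutorial,
    then commit to exactly one next move."""
    return Decision(kind="done")   # placeholder choice
```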
— New Concept: Reflection-Memory Agent (RMA) — 🍞 Hook: Think of a coach who pauses video at key plays and writes notes on what went right or wrong. 🥬 The Concept:
- What it is: The RMA saves only milestone screenshots, summarizes steps, detects loops and errors, and sends a short reflection.
- How it works: 1) Summarize last action using before/after screens; 2) Decide if the new step is a milestone; 3) Build a long-term memory of key screenshots plus summaries; 4) Label state as On-track, Completed, Infeasible, or Off-track with error type.
- Why it matters: Without it, the agent loses critical visuals and cannot self-correct. 🍞 Anchor: If clicking a menu changes nothing on screen, the RMA flags GUI Error so the Orchestrator retries differently.
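The four reflection states and error types map naturally onto small enums. A minimal sketch follows, assuming a hypothetical `reflect` call whose toy pixel comparison only mimics the RMA's "did anything visibly change?" judgment:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Status(Enum):                 # the four reflection states
    ON_TRACK = "On-track"
    COMPLETED = "Completed"
    INFEASIBLE = "Infeasible"
    OFF_TRACK = "Off-track"

class ErrorType(Enum):              # error types attached to Off-track
    GUI_ERROR = "GUI Error"
    LACK_OF_TUTORIAL = "Lack of Tutorial"
    CODE_ERROR = "Code Error"
    OTHER = "Other"

@dataclass
class Reflection:
    status: Status
    error: Optional[ErrorType]
    summary: str                    # one-line recap of the last step
    is_milestone: bool              # keep the post-action screenshot?

def reflect(before: bytes, after: bytes, last_action: str) -> Reflection:
    """Hypothetical VLM judgment using before/after screenshots."""
    if before == after:             # the action produced no visible effect
        return Reflection(Status.OFF_TRACK, ErrorType.GUI_ERROR,
                          f"'{last_action}' changed nothing on screen", False)
    # Toy rule: treat any visible change as a milestone; the real RMA
    # is far more selective about which frames it keeps.
    return Reflection(Status.ON_TRACK, None,
                      f"'{last_action}' applied as intended", True)
```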
— New Concept: Long-term Memory (Milestone-driven) — 🍞 Hook: Like keeping only bookmarked pages of a long book. 🥬 The Concept:
- What it is: A compact memory that stores just the key screenshots and summaries.
- How it works: 1) Mark milestones; 2) Keep those screenshots; 3) Use them for future checks; 4) Drop redundant frames.
- Why it matters: Without it, the context gets bloated and the agent misses the big picture. 🍞 Anchor: The agent keeps the first screen, the target webpage, and moments when a subgoal is achieved.
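A minimal sketch of such a store, built on the hypothetical `Reflection` type above. The milestone decision itself lives in the RMA, so the memory only has to keep or drop frames:

```python
from dataclasses import dataclass, field

@dataclass
class Milestone:
    screenshot: bytes               # the pivotal frame itself
    summary: str                    # why this frame was worth keeping

@dataclass
class LongTermMemory:
    milestones: list = field(default_factory=list)

    def update(self, screenshot: bytes, reflection) -> None:
        # Keep the frame only when the RMA marked the step a milestone;
        # everything else is dropped, so the context stays compact.
        if reflection.is_milestone:
            self.milestones.append(Milestone(screenshot, reflection.summary))

    def as_context(self) -> list:
        # Short recap of the key moments for the Orchestrator's prompt.
        return [m.summary for m in self.milestones]
```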
— New Concept: Versatile Tool Agents — 🍞 Hook: Like a Swiss Army knife with different tools for different jobs. 🥬 The Concept:
- What it is: A set of helpers: Grounders (to find UI elements), a Coder (to manipulate files or data directly), and a Multimodal Searcher (to fetch tutorials).
- How it works: The Orchestrator calls a specific tool only when needed; each tool runs in a mini-workflow and returns a concise result.
- Why it matters: Without focused tools, the agent would struggle with precision, speed, or new knowledge. 🍞 Anchor: For a big CSV cleanup, the Orchestrator delegates to the Coder and then visually verifies the results.
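One way to picture the division of labor is a small tool registry whose entries return only concise results. The values below are toy stand-ins; the `{text, id, box}` row for the OCR Grounder matches the shape described in the Methodology section:

```python
# Hypothetical tool registry. Each tool does deep work in its own
# mini-workflow and hands back only a concise result, which keeps the
# Orchestrator's context small.
TOOLS = {
    "general_grounder": lambda query, shot: (412, 88),   # screen coordinates
    "ocr_grounder":     lambda query, shot: {"text": "Save and share",
                                             "id": 7,
                                             "box": (390, 70, 520, 102)},
    "coder":            lambda query, shot: "cleaned data.csv; 312 rows kept",
    "searcher":         lambda query, shot: ["Step 1: open the menu",
                                             "Step 2: pick 'Save and share'"],
}

def call_tool(name: str, query: str, screenshot: bytes):
    return TOOLS[name](query, screenshot)    # concise result, not raw transcripts
```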
— New Concept: Multimodal Searcher — 🍞 Hook: Like a friend who not only reads a guide but also watches a video to match the exact buttons you see. 🥬 The Concept:
- What it is: A web-browsing agent that uses See-Act to navigate pages visually and produce a step-by-step tutorial aligned to the current UI.
- How it works: 1) Get a how-to query plus the current screenshot; 2) Open a browser sandbox; 3) Click, type, and scroll to explore multiple pages; 4) Stop only when a high-fidelity, relevant tutorial is found; 5) Return a structured tutorial.
- Why it matters: Text-only retrieval misses image-based guidance and version differences. 🍞 Anchor: To create a website shortcut on Linux Chrome, it finds a tutorial showing Save and share instead of Create shortcut.
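A sketch of the Searcher's outer loop under the same hypothetical stubs; `open_sandbox`, `browse_step`, and the two-page stopping rule are assumptions about the workflow's shape, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class Tutorial:
    steps: list                               # ordered, screenshot-aligned steps
    sources: list = field(default_factory=list)   # pages cross-checked

def open_sandbox():
    """Hypothetical: launch an isolated browser so searching never
    disturbs the main task's desktop state."""
    return object()

def browse_step(browser, query: str) -> dict:
    """Hypothetical See-Act browsing step: look at the page first, then
    click/type/scroll; returns what was found on this page."""
    return {"url": "https://example.com/howto", "matches_ui": True,
            "steps": ["Open the browser menu", "Choose 'Save and share'"]}

def search_tutorial(query: str, current_screen: bytes, max_pages: int = 8):
    browser = open_sandbox()
    evidence = []
    for _ in range(max_pages):
        page = browse_step(browser, query)
        if page["matches_ui"]:                # keep only guides that look
            evidence.append(page)             # like the user's actual UI
        if len(evidence) >= 2:                # toy stopping rule: steps
            return Tutorial(evidence[-1]["steps"],        # confirmed across
                            [p["url"] for p in evidence]) # multiple pages
    return None                               # fail honestly, don't guess
```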
— New Concept: Visual-Centric Search — 🍞 Hook: When following Lego instructions, pictures beat long paragraphs. 🥬 The Concept:
- What it is: Retrieval that pays attention to images and layout, not just words.
- How it works: The Searcher reads screenshots, page layouts, icons, and text together to ensure steps match the visible UI.
- Why it matters: Without visuals, the agent follows instructions that look right in text but wrong on screen. 🍞 Anchor: The tutorial includes annotated screenshots highlighting the exact menu icon.
— New Concept: Retrieval-Augmented Generation (RAG) — 🍞 Hook: Like writing a report with your notes plus a library book you just looked up. 🥬 The Concept:
- What it is: A way to improve answers by fetching external knowledge at runtime.
- How it works: 1) Detect a knowledge gap; 2) Retrieve up-to-date info; 3) Use it to guide actions; 4) Keep it around for next steps.
- Why it matters: Without retrieval, the agent leans only on memory that may be outdated. 🍞 Anchor: The Searcher retrieves a current Thunderbird guide, and the Orchestrator follows it to change a quoting setting.
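In plain code, the RAG pattern is just retrieve-then-generate. Here `retriever` and `llm` are hypothetical callables; in OS-Symphony the retriever role is played by the Multimodal Searcher rather than a text index.

```python
def rag_act(task: str, retriever, llm) -> str:
    """Generic RAG skeleton: fetch fresh knowledge at runtime and let it
    guide the next action, instead of leaning on stale model memory."""
    docs = retriever(task)                    # detect a gap, retrieve notes
    notes = "\n".join(docs)
    prompt = f"Follow these up-to-date notes:\n{notes}\n\nTask: {task}"
    return llm(prompt)                        # knowledge-guided next action

# Usage with toy stand-ins:
answer = rag_act("change Thunderbird's quoting setting",
                 retriever=lambda q: ["Settings > Composition > ..."],
                 llm=lambda p: "open Settings, then Composition")
```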
Before vs After:
- Before: Agents forgot key visuals, looped, and relied on brittle text search.
- After: Milestone memory preserves essential visuals, and visual-centric tutorials guide tricky, new scenarios.
Why It Works (Intuition):
- Milestones shrink noise and keep signal; reflections catch errors; on-demand, visual tutorials fill gaps precisely when and where needed.
Building Blocks: Orchestrator, RMA with milestone memory, Grounders, Coder, and a Multimodal Searcher operating in a safe sandbox.
03 Methodology
🍞 Hook: Imagine assembling a Lego set: you look at the picture, place a few key pieces, step back to see if it matches, and check a guide when confused.
🥬 High-Level Recipe: Input (user goal + current screenshot) → Orchestrator reads short history, RMA reflection, and any tutorial → Decide: act directly, call Grounder/Coder, or invoke Multimodal Searcher → Execute action → RMA summarizes, updates milestones, and reflects → Repeat until done or fail.
🍞 Anchor: For making a desktop shortcut in Chrome on Linux, the Orchestrator sees repeated wrong clicks, reads the RMA’s Off-track: Lack of Tutorial, calls the Searcher, gets a matching tutorial, and completes the task.
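Before walking through each step, here is how the whole cycle could be wired together, reusing the hypothetical helpers sketched in the Core Idea section (`capture_screenshot`, `execute`, `orchestrate`, `reflect`, `LongTermMemory`, `search_tutorial`) plus one new `run_coder` stub. This is a sketch of the control flow, not the authors' implementation.

```python
def run_coder(request: str) -> str:
    """Hypothetical Coder delegate: locate files, edit, verify, summarize."""
    return "ok"

def run_task(instruction: str, max_steps: int = 50) -> str:
    memory, history = LongTermMemory(), []
    tutorial, reflection = None, None
    for _ in range(max_steps):
        before = capture_screenshot()
        decision = orchestrate(instruction, before, history[-5:],  # last-K turns
                               reflection, tutorial)
        if decision.kind in ("done", "fail"):
            return decision.kind
        if decision.kind == "call_search_agent":
            tutorial = search_tutorial(decision.query, before)
        elif decision.kind == "call_code_agent":
            run_coder(decision.query)          # bulk or precise edits via code
        else:
            execute(decision)                  # grounded GUI action
        after = capture_screenshot()
        reflection = reflect(before, after, decision.kind)
        memory.update(after, reflection)       # keep milestone frames only
        history.append(reflection.summary)
        if reflection.status is Status.INFEASIBLE:
            return "infeasible"
    return "fail"
```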
Step-by-Step Details:
- Orchestrator Decision Loop
- What happens: Reads the instruction, last K steps, current screenshot, RMA reflection, and any tutorial. Chooses the next action: a grounded GUI action, call_code_agent, or call_search_agent.
- Why this step exists: Without a central planner, tools would act out of sync.
- Example: The reflection says last click failed; the Orchestrator retries with a more specific click description instead of blindly clicking again.
- Reflection-Memory Agent (RMA)
- What happens: Uses pre- and post-action screenshots plus the previous output to summarize the step and judge success. It decides whether the step is a milestone, keeps key screenshots only, and sends a reflection: On-track, Completed, Infeasible, or Off-track with error type (GUI Error, Lack of Tutorial, Code Error, Other).
- Why this step exists: It prevents forgetting and catches loops or invisible failures.
- Example: The RMA flags that changing orientation didn’t visibly apply, preventing a premature done.
- Milestone-Driven Long-Term Memory
- What happens: Stores only pivotal screenshots (first state, reaching target page, completing a subgoal) plus summaries.
- Why this step exists: Keeps context compact and meaningful; without it, either the window overflows or key visuals vanish.
- Example: In a research task, it saves the professor’s homepage view and the moment an email is found.
- Tool Agents: Grounders
- What happens: General Grounder combines appearance and function to pick coordinates; OCR Grounder scans words into a {text, id, box} table and selects an ID.
- Why this step exists: Precise clicks and text selections are hard; without grounders, the agent would guess.
- Example: Selecting the exact menu item labeled Save and share on the right menu, not a similar-looking one.
- Tool Agent: Coder
- What happens: For file edits or data transformations, the Orchestrator delegates to the Coder, which locates files, modifies content, verifies, and summarizes.
- Why this step exists: Code is faster and more reliable for bulk or precise edits; without it, many clicks would be needed.
- Example: Cleaning a CSV and then opening it to visually confirm the changes.
- Tool Agent: Multimodal Searcher (Visual-Centric Search)
- What happens: Upon call_search_agent with a how-to query and the current screenshot, the Searcher opens an isolated browser sandbox. Within a small action set (click, type, scroll), it visits multiple pages, cross-checks steps, and stops only when it can build a reliable, aligned tutorial.
- Why this step exists: Text-only retrieval misses crucial visuals; without a sandbox, searches could disturb the main task.
- Example: It finds that, in this Linux build, creating a shortcut is under Save and share rather than More tools.
- Loop Detection and Error Typing
- What happens: A rule-based checker compares recent actions and images to spot repeats (a toy version is sketched just after this list); the RMA then labels Off-track: Lack of Tutorial or other errors.
- Why this step exists: Loops are costly; catching them early triggers the Searcher.
- Example: Repeatedly opening the same submenu without progress triggers a tutorial search.
- Verification and Termination
- What happens: After actions (especially Code), the Orchestrator visually verifies results. The process ends with done, fail, or infeasible when the RMA and Orchestrator agree.
- Why this step exists: Prevents silent failures and premature success.
- Example: After editing a file via code, the agent opens it to confirm the table looks correct.
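The loop checker referenced above could be as simple as the toy rule below; the paper's exact rule is not spelled out here, so treating "same action, frozen screen" as a loop is an assumption.

```python
from collections import Counter

def detect_loop(recent_actions: list, recent_screens: list,
                window: int = 4) -> bool:
    """Toy rule-based loop check: flag a loop when nearly the same action
    repeats while the screen stops changing."""
    actions = recent_actions[-window:]
    screens = recent_screens[-window:]
    if len(actions) < window:
        return False
    same_action = Counter(actions).most_common(1)[0][1] >= window - 1
    screen_frozen = len(set(screens)) == 1    # no visible progress at all
    return same_action and screen_frozen
```

A positive check converts an invisible stall into an explicit Off-track: Lack of Tutorial signal that routes straight to the Searcher.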
Secret Sauce:
- Milestone Memory: Stores the right pictures, not all the pictures.
- Structured Reflection: Clear labels (On-track, Off-track types) guide the planner.
- Visual-Centric Search: Browses like a human and returns step-by-step, screenshot-aligned tutorials.
- Context Folding: Tools do deep work in their own mini-contexts, then return concise summaries so the Orchestrator stays focused.
Concrete Mini-Examples with What Breaks Without Each Step:
- Without RMA: After a failed click, the agent assumes success and ends early.
- Without milestones: Either bloated context leads to confusion or missing critical screenshots hides evidence of failure.
- Without Searcher: In OOD cases, the agent keeps guessing and looping.
- Without Grounders: The agent clicks near the right spot but not exactly on it, causing silent missteps.
- Without Code: Bulk edits take many fragile GUI steps and often break.
Putting It Together: The Orchestrator cycles actions; the RMA guards memory and correctness; tools handle precision and knowledge; the Searcher fills gaps only when needed. The result is a robust, generalist computer-using agent.
04 Experiments & Results
🍞 Hook: Think of a school tournament: different teams (agents) play on different fields (operating systems), and the scoreboard shows who solved the most tasks correctly.
🥬 The Tests: The authors evaluated on three realistic arenas where the agent must operate real desktop apps by seeing and clicking:
- OSWorld (Linux Ubuntu, 361 tasks after filtering): multi-domain, long workflows.
- WindowsAgentArena (Windows, 154 tasks): a mix of Office, system, coding, web, and utilities.
- MacOSArena (macOS, 63 tasks): single-app and multi-app tasks, with tricky UI differences.
The metric in all three arenas is success rate: the fraction of tasks the agent completes within the step limit.
🍞 Anchor: It is like scoring 66 out of 100 tasks right on a tough exam where most competitors are below 60.
The Competition: OS-Symphony was compared to specialist native agents and modular frameworks like Agent S3 and CoAct-1, plus strong vision-language models and grounders (UI-TARS line).
Scoreboard With Context:
- OSWorld: OS-Symphony hits 63.61% (50 steps) and 65.84% (100 steps) with GPT-5, a new state of the art. That is like getting an A- where others got B or C. In the complex Workflow domain (multi-app), it beats the runner-up by about 7 points, showing strong long-horizon stability.
- WindowsAgentArena: 63.5% at 50 steps with GPT-5, topping both 50- and 100-step Agent S3 scores by 9.4 and 6.9 points, respectively. Even with GPT-5-Mini, OS-Symphony outperforms the 100-step Agent S3 using the stronger model.
- MacOSArena: 46.0% with GPT-5-Mini; even with Qwen3-VL-32B-Instruct, it reaches 19.05%, surpassing previous methods that often scored near zero. macOS UI quirks are hard; visual-centric search and milestone memory help bridge the gap.
Surprising and Insightful Findings:
- Small Models Benefit More: With weaker models like GPT-5-Mini or Qwen3-VL, calling the Multimodal Searcher more often fills knowledge gaps, shrinking the performance gap to big models at far lower cost.
- Pass@K Ceiling: Running multiple attempts boosts success up to about 79.4% (Pass@5), indicating high potential but some run-to-run variability.
- Ablations Prove the Ingredients Matter:
- Search: Multimodal (visual) search beats unimodal (text-only) search by around 10% relative in knowledge-heavy domains.
- Reflection and Memory: Milestone-driven long-term memory outperforms last-K reflection or no reflection, especially in multi-app workflows; it also saves steps by catching errors earlier.
- Cost-Effectiveness: GPT-5-Mini variants achieve near top performance with much lower cost, making the system practical.
Why These Results Make Sense:
- The RMA prevents silent failures and loops by watching visual outcomes; that is crucial in long sequences.
- The Searcher gathers tutorials that actually look like the user’s screen, avoiding text-only mismatch.
Concrete Data Points Called Out in the Paper:
- OSWorld SOTA: 65.84% at 100 steps with GPT-5.
- WindowsAgentArena SOTA: 63.5% at 50 steps with GPT-5.
- MacOSArena: 46.03% with GPT-5-Mini; 19.05% with Qwen3-VL-32B-Instruct.
- Ablation snapshot (OSWorld, GPT-5-Mini, 50 steps): w/o Search 53.78% → Unimodal Search 54.81% → Multimodal Search 58.05%. w/o Reflection 54.38% → STM Reflection 54.01% → LTM Reflection (ours) 58.05%.
Bottom Line: The combination of milestone memory, structured reflections, and visual-centric search systematically lifts robustness and generalization across operating systems and model sizes.
05 Discussion & Limitations
🍞 Hook: Even the best teams have limits: sometimes the lighting is dim, the sheet music is complex, or the tempo is hard to hold.
🥬 Honest Assessment:
- Limitations:
- Platform scope: designed and evaluated on desktop OSes only; mobile touch interfaces (Android, iOS) would need an action-space redesign and remain untested.
- Speed and cost: Multiple agents mean more tokens and latency; it is slower than human speed and not yet real-time.
- Visual subtlety: Some fine-grained cues (tiny highlights, overlapping windows) can fool current VLMs, causing false or missed alarms in reflections.
- Instruction and evaluation noise: Ambiguous tasks or overly strict checkers can mark correct intent as wrong (e.g., choosing Arlanda vs generic Stockholm).
- Required Resources:
- A capable VLM, a general grounder (e.g., UI-TARS-1.5-7B), OCR, and a browser sandbox for search.
- Enough context window to hold last-K turns, reflections, and the tutorial.
- Compute budget if using large models; smaller models work well with search.
- When Not to Use:
- Real-time control where split-second timing matters.
- Environments with extremely subtle visual cues beyond current VLM resolution.
- Strictly offline or air-gapped settings where web search is disallowed (unless you preload safe, visual-rich docs).
- Open Questions:
- Can we reduce latency by blending fast and slow thinking or by learning which tool to call earlier?
- How do we improve fine-grained visual perception in reflections without blowing up the context budget?
- What is the best action space for mobile and mixed-reality systems?
- How can we stabilize runs so that single-shot performance approaches Pass@K ceilings?
- Can we automatically rewrite or disambiguate instructions at deployment time without leaking answers or biasing tests?
🍞 Anchor: Think of future upgrades like adding a zoom lens for tiny UI details, a metronome for steadier timing, and a better librarian to fetch the exact picture book you need—faster.
06 Conclusion & Future Work
🍞 Hook: Picture a pit crew where one person keeps notes of the key moments, one fetches the exact illustrated guide for your car model, and the lead calls the next move. That is how this agent works.
🥬 Three-Sentence Summary: OS-Symphony is a team-based framework that makes computer-using agents more reliable over long tasks and better at handling new software. It does this by saving only milestone screenshots, reflecting on progress and errors, and using a visual-centric Searcher to build tutorials that match what is on-screen. The Orchestrator coordinates these parts along with Grounders and a Coder to finish complex tasks across Linux, Windows, and macOS.
Main Achievement: New state-of-the-art performance on three online benchmarks—65.84% on OSWorld, 63.5% on WindowsAgentArena, and 46.0% on MacOSArena—by combining milestone-driven memory and visual-centric search in a unified system.
Future Directions:
- Extend to mobile touch interfaces and hybrid GUI-API agents that reduce textual bottlenecks.
- Sharpen fine-grained visual checks and speed up the multi-agent loop with fast/slow reasoning.
- Improve stability so single attempts approach Pass@K results; integrate stronger commercial search engines as feasible.
🍞 Anchor: Next time an agent must wrangle slides, spreadsheets, browsers, and settings across different OSes, it can remember the key screens, fetch a matching picture guide when needed, and calmly finish the job—like a well-conducted symphony.
Practical Applications
- Automate report building: pull numbers from a spreadsheet, paste into a document, format, and export to PDF with visual verification.
- IT setup tasks: change system settings across Windows, Linux, or macOS with consistent success using milestone checks.
- Education support: guide students through software tutorials that visually match their screens for science projects or presentations.
- Customer support copilots: reproduce a user's UI, fetch a matching tutorial, and perform fixes step-by-step.
- Data wrangling: use the Coder for bulk edits to CSVs or configs, then visually confirm results in GUI apps.
- Web tasks: navigate changing web interfaces by using the Searcher to find current, screenshot-aligned how-tos.
- Office workflows: coordinate between email, calendar, slides, and spreadsheets with fewer loops and better completion rates.
- Cross-version survival: when menu names move in new app versions, the Searcher supplies updated steps without manual retraining.
- Accessibility boosts: smaller, cheaper models can still perform well by relying on on-demand search and milestone memory.
- Quality assurance: detect loops and GUI errors early in long test runs, saving time and compute.