
ShowUI-Ļ€: Flow-based Generative Models as GUI Dexterous Hands

Intermediate
Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou Ā· 12/31/2025
arXiv Ā· PDF

Key Summary

  • GUI agents usually click like a woodpecker, but they struggle to drag smoothly like a human hand; this paper fixes that.
  • ShowUI-Ļ€ is a small (450M) AI hand for computers that can both click and drag in one unified way, using real-time vision.
  • Instead of guessing a start and end point, it draws the whole path step by step, adjusting as the screen changes.
  • It uses a flow-based generator (flow matching) so the mouse glides stably, not jittery like random guesses.
  • A new ScreenDrag dataset and benchmark (20K drags; 505 tests across 5 apps) teaches and tests real dragging, not just endpoints.
  • ShowUI-Ļ€ beats strong baselines on the online test by 4.8 percentage points despite being much smaller (26.98% success vs. 22.18%).
  • Key tricks include unifying clicks-as-tiny-drags, weighting the most important steps (start/end), and a direction-keeper regularizer.
  • It shines at rotations, handwriting, and Captchas that need constant visual feedback, where tokenized ā€œaction-as-textā€ agents fail.
  • Results suggest GUI agents should act like robot hands: continuous, closed-loop, and trajectory-aware.
  • This work is a step toward computer helpers with human-like dexterity in everyday apps.

Why This Research Matters

Real software work often needs smooth mouse control—resizing, rotating, scrubbing timelines, and dragging files—so agents that only click can’t truly help. ShowUI-Ļ€ gives computers a steady ā€œhand,ā€ letting them adjust mid-move as screens change, just like humans do. That boosts productivity for creators and knowledge workers, and it makes assistive technologies more capable for users who can’t easily use a mouse. It also reduces the need for fragile, app-specific scripts by learning general motion patterns. Finally, the dataset and benchmark set a fair test so the community can measure true dexterity, not just clicks.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): You know how you can click a button once, but drawing a circle or rotating a picture needs you to keep your mouse down and move smoothly while watching the screen?

🄬 The Concept: Dragging vs. Clicking

  • What it is: Clicking is a one-and-done poke; dragging is a continuous move that needs constant seeing-and-adjusting.
  • How it works: 1) Press and hold the mouse, 2) Move along a path while watching the screen change, 3) Release at the right moment.
  • Why it matters: If an agent can only click, it can’t rotate objects, scrub timelines, or solve slider Captchas; it freezes on any task that needs a smooth path.

šŸž Bottom Bread (Anchor): Rotating a title in PowerPoint means holding the rotation handle and tracing an arc, not just clicking two points.

šŸž Top Bread (Hook): Imagine telling a friend, ā€œDrag this file into that folder,ā€ and they only tap the file once and stop halfway.

🄬 The Concept: Discrete vs. Continuous Actions

  • What it is: Discrete actions are single steps (like a click); continuous actions are flowing motions (like a drag path).
  • How it works: Discrete = pick one coordinate and act; Continuous = produce many tiny steps while checking the visuals each step.
  • Why it matters: Many real computer tasks (resize, rotate, scrub, draw) are continuous. Reducing them to two points (start, end) misses the whole path.

šŸž Bottom Bread (Anchor): Moving a volume slider to ā€œjust rightā€ needs tiny nudges while watching the level—not only a start and an end.

šŸž Top Bread (Hook): Think of driving a car: you don’t turn the wheel once and hope; you constantly look at the road and adjust.

🄬 The Concept: Closed-loop Control

  • What it is: A way of acting where you continuously observe the result and adjust your next move.
  • How it works: 1) Look at the current screen, 2) Make a small move, 3) Look again, 4) Repeat.
  • Why it matters: Without closed-loop control, small mistakes pile up; you miss targets, overshoot angles, or stop too early.

šŸž Bottom Bread (Anchor): Scrubbing a video timeline in Premiere Pro: you move, see the frame, and fine-tune—looping until it’s perfect.

šŸž Top Bread (Hook): Imagine trying to draw by only describing it in words like ā€œMove right 50, then up 30.ā€ It feels clunky.

🄬 The Concept: Action-as-Language Limitation

  • What it is: Many GUI agents turn actions into text tokens (like words) and predict them with a language model.
  • How it works: They discretize coordinates and issue commands like click(x,y) or drag(start,end) as text.
  • Why it matters: Text tokens are great for planning, but coarse and jumpy for fine motor control; rotations and curves become shaky or impossible.

šŸž Bottom Bread (Anchor): Telling someone ā€œdraw a perfect S with two commandsā€ won’t work; you need many tiny, smooth moves.

šŸž Top Bread (Hook): Robots got better at pouring, folding, and grasping by learning smooth motions, not just single pokes.

🄬 The Concept: Flow-based Generative Control (intuition)

  • What it is: A way to generate a whole smooth path by learning a velocity field that guides motion from start to finish.
  • How it works: 1) Start from a rough guess, 2) Predict small direction-and-speed updates, 3) March along a smooth path, 4) Land precisely.
  • Why it matters: This keeps drags stable and human-like, avoiding jitter and big misses.

šŸž Bottom Bread (Anchor): Like tracing a curve with a steady hand that constantly corrects tiny errors instead of redrawing from scratch.

The world before: Most GUI agents were great at spotting icons and clicking, but they couldn’t rotate a shape, solve a rotate-Captcha, or write on a canvas. People tried representing drags as just start and end points, or turning every action into text, but these approaches break down on curvy, long, or on-the-fly tasks. The gap: we needed a small, efficient system that perceives screens continuously and generates smooth, closed-loop trajectories.

Real stakes: Better digital dexterity saves time (batch file sorting), boosts creativity (precise slide design, video edits), and improves accessibility (assistive tech that truly manipulates software like people do). That’s why this paper builds ShowUI-Ļ€ and the ScreenDrag benchmark—to teach and test real dragging, not toy clicks.

02Core Idea

šŸž Top Bread (Hook): Imagine upgrading a computer cursor from a ā€œtapperā€ to a ā€œskaterā€ that glides smoothly while watching the rink.

🄬 The Concept: One-sentence Aha

  • What it is: Treat clicks as teeny-tiny drags and use a flow-based generator to draw the whole mouse path in real time.
  • How it works: 1) See the screen and read the instruction, 2) Predict a short chunk of tiny moves (a mini-trajectory), 3) Execute and re-check the screen, 4) Repeat until done.
  • Why it matters: This unifies all actions and gives stable, human-like control, especially for rotations, curves, and sliders.

šŸž Bottom Bread (Anchor): Rotating ā€œSummaryā€ by 135°: the model makes small arc steps, checks the angle, and stops exactly on target.

Three analogies:

  • Drawing analogy: Instead of placing two dots and connecting them with a ruler, the model freehands the curve while watching the line form.
  • GPS analogy: Rather than setting only start and end, it follows turn-by-turn directions that adapt to traffic (live screen changes).
  • Music analogy: Not just writing the last note; it plays every note smoothly in tempo, adjusting dynamics as the room acoustics change.

šŸž Top Bread (Hook): You know how a camera and a steady hand together make a great videographer?

🄬 The Concept: Unified Discrete–Continuous Actions

  • What it is: Represent both clicks and drags as sequences of (x, y, m) where m is mouse up/down.
  • How it works: Click = down then up at the same spot; Drag = down, many small moves, then up.
  • Why it matters: One shared model, no switching heads or tools; learning clicks and drags together improves both.

šŸž Bottom Bread (Anchor): Clicking a button is just a tiny ā€œstart-here-stop-hereā€ drag; resizing a textbox is the same recipe but longer.

šŸž Top Bread (Hook): Think of a coach whispering tiny directions: ā€œa bit left, now up, slow downā€¦ā€

🄬 The Concept: Flow-based Action Expert

  • What it is: A lightweight transformer head that predicts small velocity updates (direction + step) from the current screen and instruction.
  • How it works: 1) VLM encodes the screenshot and text, 2) Action expert attends to those features, 3) Outputs a chunk of tiny moves, 4) Executes, sees the new frame, repeats.
  • Why it matters: Predicting increments keeps motion stable and lets the agent correct mid-trajectory.

šŸž Bottom Bread (Anchor): While drawing an S-curve, the expert keeps the stroke smooth instead of jumping in big, clumsy steps.

šŸž Top Bread (Hook): When aiming a dart, the very start and the final landing matter most.

🄬 The Concept: Temporal Reweighting (start/end emphasis)

  • What it is: A training tweak that gives extra importance to the first and last steps of a drag.
  • How it works: Heavier training weight at drag-begin and drag-end; normal weight in the middle.
  • Why it matters: Helps begin correctly (on the right handle) and finish precisely (exact angle/spot), which most affects success.

šŸž Bottom Bread (Anchor): Beginning on the rotation handle and stopping exactly at 135°—not 120° or 150°—wins the task.

šŸž Top Bread (Hook): A compass keeps you moving the right way even if you take small uneven steps.

🄬 The Concept: Directional Regularization

  • What it is: A training helper that nudges predicted steps to align with the true direction.
  • How it works: Penalizes sideways or wobbly moves that don’t point where they should.
  • Why it matters: Reduces jitters and wrong-way slides, crucial for sliders and rotations.

šŸž Bottom Bread (Anchor): Sliding a Captcha: move straight to the notch, not diagonally or in zigzags.

Before vs. After:

  • Before: Agents said actions as text and guessed endpoints; fine control was brittle; rotations and handwriting often failed.
  • After: One compact model outputs smooth mini-trajectories, checks the screen, and adapts—handling curves, angles, and continuous feedback.

Why it works (intuition): Continuous problems need continuous solutions. A velocity field (flow) tells the cursor ā€œwhere to go nextā€ at every moment, so errors don’t snowball. Emphasizing starts/ends and keeping direction clean locks in the parts humans care about most.

Building blocks:

  • Unified (x, y, m) sequences for all actions.
  • Vision-language backbone for perception and instruction understanding.
  • Flow-based action expert for incremental motion.
  • Temporal reweighting and directional regularization for stability and precision.
  • ScreenDrag data and benchmark to teach and prove real drag skill.

03Methodology

At a high level: Input (screenshot + instruction) → Vision-language encoding → Flow-based action expert predicts an action chunk (tiny steps) → Execute steps → New screenshot → Repeat until goal.
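
That loop can be sketched in a few lines. The `env` and `policy` interfaces below (`capture_screen`, `predict_chunk`, `move_mouse`, `goal_reached`) are hypothetical placeholders, since the paper does not publish a code-level API:

```python
def run_episode(policy, env, instruction, max_chunks=50):
    """Closed-loop rollout: look, predict a chunk of tiny moves, act, look again."""
    for _ in range(max_chunks):
        screenshot = env.capture_screen()                       # 1) current screen
        chunk = policy.predict_chunk(screenshot, instruction)   # 2) list of (x, y, m) steps
        for (x, y, m) in chunk:                                 # 3) execute the mini-trajectory
            env.move_mouse(x, y, button_down=bool(m))
        if env.goal_reached():                                  # 4) stop when the goal is met
            return True
    return False
```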

šŸž Top Bread (Hook): Imagine teaching a helper to rotate a title: you show the slide, say the goal, then guide their hand in small moves.

🄬 The Concept: Perception and Conditioning

  • What it is: The model first understands what it sees and what you want.
  • How it works: 1) Screenshot and text go into a small VLM (SmolVLM backbone), 2) It produces visual-text tokens, 3) The previous action state is also fed in, so the model knows where the cursor was.
  • Why it matters: Without good perception, the model can’t find handles, sliders, or targets; without knowing the last move, it can’t stay smooth.

šŸž Bottom Bread (Anchor): If the last step put the cursor near the rotation handle, the next step should start from there, not from a random point.

Step-by-step recipe with concrete examples:

  1. Read the task and see the screen
  • What happens: Encode the instruction, like ā€œRotate top-center Title starting with ā€˜Summary’ counterclockwise by 135 degrees,ā€ and the current screenshot.
  • Why it’s needed: The agent must find the right object and understand the exact transform.
  • Example: It locates the title box and recognizes the circular rotation handle area.
  2. Predict an action chunk (tiny moves)
  • What happens: The action expert attends to vision/text tokens and outputs a short sequence of (x, y, m) steps—mouse down, incremental moves, mouse up.
  • Why it’s needed: Short chunks let the model act smoothly and adjust quickly to changing visuals.
  • Example: It presses down on the handle, then traces a small arc of 3–5 steps.
  3. Execute and re-observe (closed loop)
  • What happens: The environment applies the moves; the model receives the updated screenshot; repeat.
  • Why it’s needed: Prevents drift. If it’s slightly off, the next chunk corrects it.
  • Example: After the first arc, the angle is 40° short; the next chunk completes it to ~135°.
  4. Stop when done
  • What happens: The model lifts the mouse (m = up) when it sees the goal satisfied (within tolerance).
  • Why it’s needed: Avoids overshoot and needless motion.
  • Example: It stops rotating right at 135°, within the allowed tolerance of a few pixels or degrees.

šŸž Top Bread (Hook): Think of writing a word: you push the pen down, draw a few letters, peek, and continue.

🄬 The Concept: Action Chunk

  • What it is: A small batch of steps predicted together (e.g., 10–20 mini-moves) before re-checking the screen.
  • How it works: Predict k steps, execute 1–k steps depending on the setup, then look again.
  • Why it matters: Chunks are the sweet spot: big enough for efficiency, small enough for accuracy.

šŸž Bottom Bread (Anchor): When handwriting ā€œHello,ā€ one chunk may draw ā€œHe,ā€ the next does ā€œll,ā€ then ā€œo.ā€

šŸž Top Bread (Hook): If you care about starting on the right spot and finishing exactly, you practice those parts more.

🄬 The Concept: Temporal Reweighting

  • What it is: During training, the start and end of each drag get extra importance.
  • How it works: Heavier loss weights at first/last steps; normal weights in the middle.
  • Why it matters: Makes grabs cleaner and landings precise—critical to success.

šŸž Bottom Bread (Anchor): Beginning exactly on the slider knob and ending exactly at the notch solves the Captcha; being off by a bit fails it.

šŸž Top Bread (Hook): A ruler helps you keep a straight direction as you draw.

🄬 The Concept: Directional Regularization

  • What it is: A gentle nudge to keep each step pointing the right way.
  • How it works: Penalizes sideways wobble; favors alignment with the true direction.
  • Why it matters: Removes jitters and wrong-way drifts that ruin fine edits.

šŸž Bottom Bread (Anchor): When resizing a box horizontally, it should slide straight sideways, not diagonally.

Data and training pipeline (ScreenDrag):

  • What happens: Collect/synthesize 20K dense drag trajectories across PowerPoint, OS Desktop/File Manager, Handwriting, Premiere Pro, and Captcha; store UI states, frames, and coordinates.
  • Why it’s needed: Real dragging needs many examples of smooth paths and on-the-fly observation.
  • Example: PowerPoint rotation, file sorting, canvas writing, timeline edits, slider/puzzle Captchas.

Evaluation like a game loop:

  • Offline (open-loop): Given the same frames and ground-truth next step, measure how close the predicted path is (trajectory error) and whether the end lands in the right zone (endpoint accuracy). No state updates from mistakes.
  • Online (closed-loop): After each chunk, the environment advances to the nearest recorded state if you’re within tolerance; measure success if you reach the goal. This captures real adjustment behavior.
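
A sketch of the two offline scores described above, in pixel units with an illustrative tolerance:

```python
import numpy as np

def trajectory_error(pred_path, true_path):
    """Mean per-step pixel distance between the predicted and ground-truth paths."""
    return float(np.linalg.norm(pred_path - true_path, axis=-1).mean())

def endpoint_hit(pred_path, true_path, tol_px=20.0):
    """Does the final predicted point land inside the allowed zone around the target?"""
    return bool(np.linalg.norm(pred_path[-1] - true_path[-1]) <= tol_px)

true = np.cumsum(np.full((30, 2), 3.0), axis=0)              # a straight ground-truth drag
pred = true + np.random.normal(scale=5.0, size=true.shape)   # a slightly noisy prediction
print(trajectory_error(pred, true), endpoint_hit(pred, true))
```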

The secret sauce:

  • Unified (x, y, m) for clicks and drags—no tool-switching.
  • Flow-based velocity prediction—deterministic, stable paths.
  • Emphasis on the most important steps—start on the handle, end on target.
  • Direction-keeping regularizer—smooth, purposefully aligned motion.
  • Lightweight backbone—fast enough to act in real time on desktops.

Concrete data example:

  • Instruction: ā€œReduce the image width by 0.4 from its right.ā€
  • Model: Mouse down on right edge; predict steady right-to-left micro-steps; re-check updated size overlays; release when the width shrinks by ~40% within tolerance.
  • What breaks without pieces:
    • Without unified actions: You’d need to pick a special tool; errors picking tools derail tasks.
    • Without flows: Paths get jittery and overshoot.
    • Without reweighting: Starts miss handles; ends stop short.
    • Without direction regularization: Slanted moves break straight resizes.

04Experiments & Results

šŸž Top Bread (Hook): When you test a runner, you don’t just time the finish—you also watch their form at each stride.

🄬 The Concept: What they measured

  • What it is: Three main scores—how close each step is (trajectory error), how accurate the final landing is (endpoint accuracy), and whether the full task succeeds in a live loop (online success rate).
  • How it works: Offline open-loop checks point-by-point closeness and the final point. Online closed-loop simulates acting and re-seeing the screen, then counts success.
  • Why it matters: Endpoints alone can hide sloppy paths; online tests reveal if the agent can really adjust mid-task.

šŸž Bottom Bread (Anchor): A path that ends right but zigzags wildly isn’t good for delicate rotations; the tests catch that.

The competition:

  • Proprietary action-as-language models like Operator and Gemini-2.5-CUA.
  • Open-source action-as-language models like OpenCUA-7B/32B, Qwen3-VL series, UI-TARS.
  • Other continuous policies (diffusion) compared to ShowUI-π’s flow matching.

Scoreboard with context:

  • Online Success Rate (closed-loop): ShowUI-Ļ€ hits 26.98% overall. That’s like getting a solid B in a class where most strong students are at a C+ (Gemini-2.5-CUA at 22.18%) and many others are at C or below.
  • Strength by domain: It does especially well where paths are curved or need constant checking—PowerPoint rotations, handwriting on canvas, and rotate/slider/puzzle Captchas.
  • Offline Endpoint Accuracy: 78.55%—like sinking 4 out of 5 shots exactly where you aimed.
  • Offline Trajectory Error: Lowest among compared methods (159 px on average) when counting full paths—indicating smoother, closer tracking, not just lucky endpoints.

šŸž Top Bread (Hook): If you practice the most important parts of a trick, you land it more often.

🄬 The Concept: Ablations (what mattered most)

  • What it is: Systematic tests removing or changing one piece at a time.
  • How it works: Try smaller or larger action chunks; fewer or more execution steps before re-checking; adjust start/end weighting; remove direction regularizer; split clicks and drags into separate heads; swap flow with diffusion or language modeling.
  • Why it matters: Shows which ingredients make the recipe work.

šŸž Bottom Bread (Anchor): Emphasizing start/end steps and keeping direction straight gave big boosts, especially on Captchas and rotations.

Key findings (plain English):

  • Flow > Diffusion > Language for these tasks: Flow matching delivered the best endpoints and lowest path error. Deterministic velocity fields help land precise drags.
  • Chunk size and re-observation: Predicting around 20 tiny steps but re-checking the screen frequently (every step) struck the best balance—accurate and still efficient.
  • Temporal reweighting: Giving start/end steps 10Ɨ weight raised success notably, especially for Captchas where grabbing the knob and stopping at the notch are everything.
  • Directional regularization: Reduced wobble and wrong-way drift across domains, visibly improving slider and rotation accuracy.
  • Unified head vs. separate heads: One shared head for clicks and drags kept results strong while being simpler and smaller; no need to guess which head to use.

Surprising observations:

  • Some large chatty agents refused Captchas on safety grounds or tried to open a browser tool for handwriting—great for conversation, poor for dexterous action.
  • A smaller open-source baseline (OpenCUA-7B) beat much bigger ones on some actions—size alone doesn’t guarantee better control.
  • Token-based agents often stopped mid-drag or clicked repeatedly instead of committing to a continuous press-and-move, revealing a mismatch between language tokens and motor control.

Takeaway: To act with hands, think like hands. Continuous, closed-loop, direction-aware flow beats text commands for real GUI dexterity.

05Discussion & Limitations

šŸž Top Bread (Hook): Even great skaters wobble on rough ice or when they’re tired.

🄬 The Concept: Limitations

  • What it is: Places where ShowUI-Ļ€ isn’t perfect yet.
  • How it works: It’s a compact 450M model with a limited drag-focused dataset; vision grounding is weaker than huge VLMs; some rare or exotic GUI patterns aren’t covered; the online simulator uses recorded states, not live apps.
  • Why it matters: These limits can reduce generalization to unseen apps, unusual widgets, or long, multi-turn tasks that mix complex planning and dexterous action.

šŸž Bottom Bread (Anchor): On a brand-new video editor with unique handles, it might hesitate or misread the right grip point.

šŸž Top Bread (Hook): A sharper pencil and more practice pages help you write neater.

🄬 The Concept: Required Resources

  • What it is: What you need to train/use this system well.
  • How it works: A GPU box (e.g., a few high-memory GPUs) for training; the ScreenDrag-style data with dense trajectories; a VLM backbone; and the data-driven online evaluation environment.
  • Why it matters: Without dense, high-quality trajectory data and a responsive loop, the model can’t learn smooth control.

šŸž Bottom Bread (Anchor): Recording screen video plus exact mouse paths is like keeping both the sheet music and the performance audio while learning piano.

šŸž Top Bread (Hook): Don’t bring a paintbrush to hammer in a nail.

🄬 The Concept: When NOT to Use

  • What it is: Situations where other tools fit better.
  • How it works: Purely symbolic tasks (rename 1,000 files by rule) are faster with APIs or scripts; tasks that demand web auth or safety-sensitive Captchas may be disallowed; ultra-long sequences with little visual change might be simpler as discrete steps.
  • Why it matters: Pick the right tool: continuous GUI hands for tactile tasks; scripts or APIs for bulk logic tasks.

šŸž Bottom Bread (Anchor): Moving thousands of files based on names is a job for a file script, not a mouse-dragger.

šŸž Top Bread (Hook): Questions are the fuel for the next version.

🄬 The Concept: Open Questions

  • What it is: Things we still want to figure out.
  • How it works: How to blend a planner (text reasoning) with the dexterous hand? How to scale data to more apps and rare widgets? How to move from recorded-state rollouts to robust live OS environments? Can we share skills between robots and GUIs?
  • Why it matters: Solving these will turn today’s smooth drags into tomorrow’s full computer-use assistants.

šŸž Bottom Bread (Anchor): A future agent could read a multi-step editing brief, plan the steps in text, then execute each with ShowUI-π’s steady hand.

06Conclusion & Future Work

Three-sentence summary: ShowUI-Ļ€ treats clicks as tiny drags and uses a flow-based action expert to generate smooth, closed-loop mouse trajectories from live screen views. Trained on 20K dense drag demos and tested on the ScreenDrag benchmark, it outperforms larger token-based baselines on online success, especially for rotations, handwriting, and Captchas. This shows that GUI agents should act with continuous, direction-aware flows rather than only discrete text actions.

Main achievement: A compact, unified model that finally gives computer agents a ā€œdexterous hand,ā€ handling both clicks and real drags with stable, human-like motion.

Future directions: Scale the model and data; integrate a strong text planner for multi-step tasks; expand to more apps and widgets; move from recorded-state online testing to robust live OS environments; explore sharing motion skills between robots and GUIs.

Why remember this: It marks a shift from clicking to controlling—proving that continuous, flow-based action is the missing piece for computer agents to work like people do in real software.

Practical Applications

  • Automate slide design tasks like rotating and resizing elements precisely in PowerPoint.
  • Assist video editors by smoothly scrubbing timelines and applying effects in Premiere Pro.
  • Solve legitimate slider and rotate Captchas in controlled, allowed environments (e.g., internal QA).
  • Support accessibility by handling fine-motor GUI tasks for users with limited mobility.
  • Batch-sort and organize desktop files by dragging them into the correct folders.
  • Perform handwriting or signature placement on canvases or forms that require pen-like strokes.
  • Adjust UI sliders and knobs (volume, brightness, color wheels) with fine-grained control.
  • Demonstrate software tutorials by reproducing human-like drag paths step by step.
  • Prototype GUI testing that needs realistic cursor movement, not only clicks.
  • Aid low-code RPA workflows when no stable API exists, acting through the GUI with dexterity.
#GUI automation #continuous control #flow matching #trajectory generation #closed-loop policy #vision-language-action #drag benchmark #directional regularization #temporal reweighting #unified actions #online evaluation #ScreenDrag dataset #dexterous manipulation #real-time perception