
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Intermediate
Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou et al. · 1/7/2026
arXiv · PDF

Key Summary

  • FOCUSUI makes computer-using AI faster and still accurate by looking only at the important parts of a screen.
  • It learns which screen patches matter for the user’s instruction and ignores big empty or repetitive areas.
  • A new trick called POSPAD keeps the sense of where things are on the screen even after dropping many image tokens.
  • Compared with general visual token pruning, FOCUSUI avoids large accuracy drops on precise UI pointing tasks.
  • On the ScreenSpot-Pro benchmark, FOCUSUI-7B beats GUI-Actor-7B by 3.7 percentage points.
  • Keeping just 30% of the visual tokens only reduces accuracy by about 3.2 points while making inference up to 1.44× faster.
  • Peak GPU memory drops by around 17–18% at high reduction, saving resources.
  • The method works with popular backbones (Qwen2.5-VL, Qwen3-VL) and needs no decoder architecture changes.
  • Dense supervision mixes instruction-aware box overlap with a UI-graph prior to teach the model what to keep.
  • This research pioneers efficient UI grounding: doing more with fewer visual tokens without losing spatial precision.

Why This Research Matters

FOCUSUI helps computer-using AIs feel snappy and trustworthy on real screens, even very large ones. By keeping only what matters and preserving where things are, it stays accurate while saving time and memory. That means better accessibility tools that find buttons reliably for users with motor or visual challenges. It enables cheaper, faster automated testing of apps and websites at scale. It also supports lightweight on-device assistants on laptops or phones, not just giant servers. Overall, it’s a step toward practical, responsive GUI agents that click the right thing the first time.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how, when you use a tablet, your eyes jump straight to the search bar when you want to look something up, and you ignore the big empty wallpaper? Your brain saves effort by focusing on what matters.

🥬 Filling (The Actual Concept): What it is: UI grounding is when an AI sees a screenshot and a written instruction (like “Click the Share button”) and must point to the exact spot on the screen. How it works: 1) Split the screenshot into many small patches (like tiles). 2) Read the instruction text. 3) Match the instruction to the right tiles. 4) Output the target location to click. Why it matters: Without accurate grounding, a computer-using AI clicks the wrong thing, making it unusable.

🍞 Bottom Bread (Anchor): If you say “Open Settings,” the AI should highlight the gear icon, not the profile picture or the blank background.

🍞 Top Bread (Hook): Imagine reading a comic book where each panel is cut into tiny squares, and the AI must look at thousands of squares to answer one question. That feels slow, right?

🥬 Filling (The Actual Concept): What it is: Vision-Language Models (VLMs) are AIs that understand pictures and words together. How it works: 1) Turn an image into many visual tokens (patches turned into vectors). 2) Turn text into tokens too. 3) Mix them in a single sequence. 4) Use attention to connect the right words with the right image parts. Why it matters: VLMs enable agents that can read screens and follow instructions, but they can get overwhelmed by too many visual tokens.

🍞 Bottom Bread (Anchor): When you ask “Where is the Play button?” the VLM tries to link the word “Play” to the exact circle-with-a-triangle icon among thousands of tiny image patches.

🍞 Top Bread (Hook): Think of a 4K desktop screenshot. It’s huge. If you had to inspect every pixel to find a tiny menu icon, you’d waste tons of time.

🥬 Filling (The Actual Concept): What it is: Visual token overload is when a high-resolution screenshot becomes thousands of tokens (e.g., ~4700 at 2K), crowding out the text and slowing computation. How it works: 1) The image is cut into a grid. 2) Each grid cell becomes a token. 3) The model processes all these tokens along with a few text tokens. Why it matters: The model becomes slow and its “attention” is diluted because too many unimportant tokens compete for focus.

🍞 Bottom Bread (Anchor): A page with a giant white background and one little login button produces tons of background tokens that don’t help find “Login.”
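
To see roughly where figures like "~4700 tokens at 2K" come from, here is a minimal back-of-the-envelope sketch, assuming a 28×28-pixel effective patch size (14×14 ViT patches with 2×2 merging, as in the Qwen2.5-VL family); the exact count depends on the backbone's resizing rules:

```python
# Rough visual-token count for a screenshot, assuming a 28x28-pixel effective
# patch size; actual counts depend on the backbone's image-resizing rules.
def visual_token_count(width: int, height: int, patch_px: int = 28) -> int:
    return (width // patch_px) * (height // patch_px)

print(visual_token_count(2560, 1440))  # 4641  -> roughly the ~4,700 tokens at 2K
print(visual_token_count(3840, 2160))  # 10549 -> a 4K desktop more than doubles that
```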

🍞 Top Bread (Hook): Imagine cleaning your room: if you randomly throw away stuff, you might toss your homework by mistake. But if you keep the important things together, you find them later.

🥬 Filling (The Actual Concept): What it is: Visual token pruning removes less useful image tokens to speed up AI. How it works: 1) Score each patch’s importance. 2) Keep high-scoring ones, drop the rest. 3) Run the model on fewer tokens. Why it matters: In regular photos, this often works fine. But in UI grounding, naive pruning can break the sense of where things are, so the AI points slightly off-target.

🍞 Bottom Bread (Anchor): If you drop many tiles from a grid map without telling the AI how the map still lines up, it might point one row too high.

🍞 Top Bread (Hook): Think of a treasure map. If someone rips out chunks and doesn’t mark where they were, you can’t measure distances correctly anymore.

🥬 Filling (The Actual Concept): What it is: Positional continuity is the model’s ability to keep track of where each patch sits on the screen. How it works: 1) The model encodes both ‘what’ (visual content) and ‘where’ (position). 2) If tokens are dropped, gaps cause position jumps. 3) The model then misreads spatial layout. Why it matters: UI grounding needs pixel-precise pointing; losing continuity leads to noticeable mistakes.

🍞 Bottom Bread (Anchor): If the AI thinks the “Search” bar is one tile lower than it really is, it clicks the tab strip instead and fails.

The world before this paper: Researchers built powerful VLMs that could read high-res screens and do grounding. Accuracy got strong, but speed and memory costs ballooned because screenshots dominated the token budget (often over 85% visual tokens). People tried general pruning designed for natural photos to cut cost, but on UI tasks those methods broke positional continuity and accuracy plunged.

The problem: We need to cut lots of redundant visual tokens while keeping the precise spatial sense needed for UI pointing. The key challenge is dropping tokens without scrambling the spatial map in the model’s head.

Failed attempts: 1) Direct drop: simply delete low-importance tokens. Result: Fast, but position jumps ruin precise clicks. 2) Heavy padding: replace every dropped token with a placeholder. Result: Preserves position but doesn’t reduce sequence length much, so it’s not efficient.

The gap: There was no UI-specific, instruction-aware token selection that both: (a) picks the right patches to keep, and (b) preserves positional continuity compactly. We needed a way to “think like people do” (focus where the instruction points) while also “keep the map intact.”

Real stakes: Faster, lighter UI grounding means: assistants that run on cheaper hardware; smoother screen-reading help for accessibility tools; quicker automated testing of apps; and more responsive computer-use agents that don’t lag or misclick. In everyday life, it means an AI helper that opens the right menu the first time, even on a busy 4K desktop.

02Core Idea

🍞 Top Bread (Hook): Imagine you’re hunting for a sticker on a giant poster. You only look at spots that match your clue, and you use sticky notes to mark the holes where you skipped, so your mental map stays aligned.

🥬 Filling (The Actual Concept): The “Aha!” in one sentence: Select only the instruction-relevant visual tokens and keep spatial order intact by inserting one special marker at the end of each dropped chunk. How it works: 1) Learn which patches matter for the instruction (saliency). 2) Keep the top patches. 3) For any long stretch you remove, place a single POSPAD marker at its end to preserve continuity. 4) Feed this compact, position-faithful sequence to the VLM and ground with an action head. Why it matters: This keeps precision high while speeding up inference and lowering memory.

🍞 Bottom Bread (Anchor): For “Click the Play button,” the model keeps patches around the play icon and maybe nearby text, compresses big empty panels into a few POSPADs, and still knows exactly where the icon lives.

Three analogies for the same idea:

  1. Museum tour: You skip empty hallways but pin a note at the end of each skipped corridor so your floor plan still lines up.
  2. Book skimming: You skip dull pages but place a bookmark that says “we skipped up to here,” so page numbers stay meaningful.
  3. City map: You fold boring suburbs into a single fold marker so distances on your city map don’t get distorted.

Before vs. After:

  • Before: The model chewed through every patch, wasting time on large, uniform backgrounds; pruning naively caused spatial confusion.
  • After: The model keeps fewer, smarter tokens and stays spatially calibrated thanks to POSPAD, so it clicks the right pixel faster.

Why it works (no equations, just logic):

  • Attention loves signal over noise: fewer irrelevant tokens mean cleaner attention maps.
  • UI screenshots are structured: big homogeneous regions (e.g., side margins, wallpapers) carry little instruction-specific meaning; dropping them saves compute.
  • Spatial alignment is fragile: removing tokens creates gaps. POSPAD acts like a spacer that says, “We skipped N tiles, and they ended here,” preserving the index layout the positional embedding expects.
  • Instruction-aware selection reduces candidate confusion: when the action head looks for the target, it competes among a smaller, more relevant set.

Building blocks, each like a small tool in a toolkit:

🍞 Top Bread (Hook): You know how a treasure map becomes easier if you color the squares closer to the X brighter? 🥬 Filling (The Actual Concept): Instruction-to-Patch Saliency Score is a heat map over patches that rates how relevant each tile is to the instruction. How it works: 1) Score overlap with the ground-truth box (if supervised). 2) Down-weight giant, look-alike regions via a UI-graph prior. 3) Fuse them. Why it matters: It teaches the selector what to keep and what to skip. 🍞 Bottom Bread (Anchor): For “Open Downloads,” tiles on and around the Downloads button glow brighter.

🍞 Top Bread (Hook): Picture grouping similar LEGO blocks into big clusters to spot unique pieces fast. 🥬 Filling (The Actual Concept): UI-Graph Prior groups visually similar neighboring patches and assigns lower weights to massive, homogeneous areas. How it works: 1) Connect similar neighbors. 2) Measure component sizes. 3) Give big components smaller weights. Why it matters: Suppresses backgrounds and boosts distinct elements so the model focuses on real UI widgets. 🍞 Bottom Bread (Anchor): A huge white panel gets low weight; a unique icon gets higher weight.
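
A minimal sketch of this grouping idea, assuming neighboring patches are linked when their embeddings exceed a cosine-similarity threshold and each patch is down-weighted by the size of its component; the threshold and the 1/size weighting are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def ui_graph_prior(patch_feats: np.ndarray, sim_thresh: float = 0.9) -> np.ndarray:
    """Down-weight large homogeneous regions of a (H, W, D) patch-feature grid.

    Neighboring patches whose cosine similarity exceeds `sim_thresh` are merged
    into one component; each patch's weight shrinks with its component's size,
    so big uniform areas (backgrounds, margins) end up with low weights.
    """
    H, W, _ = patch_feats.shape
    feats = patch_feats / (np.linalg.norm(patch_feats, axis=-1, keepdims=True) + 1e-8)
    parent = list(range(H * W))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i in range(H):
        for j in range(W):
            for di, dj in ((0, 1), (1, 0)):  # right and down neighbors
                ni, nj = i + di, j + dj
                if ni < H and nj < W and feats[i, j] @ feats[ni, nj] > sim_thresh:
                    union(i * W + j, ni * W + nj)

    sizes = np.bincount([find(k) for k in range(H * W)], minlength=H * W)
    comp_size = np.array([sizes[find(k)] for k in range(H * W)], dtype=np.float32)
    return (1.0 / comp_size).reshape(H, W)  # big components -> small weights
```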

🍞 Top Bread (Hook): When you and a friend play “hot or cold,” you use their clues to move toward the target. 🥬 Filling (The Actual Concept): Query-Guided Saliency Scorer learns to predict per-patch relevance from how well patches match the instruction tokens. How it works: 1) Enhance features for vision and text lightly. 2) Compare each patch with instruction tokens. 3) Average similarities into one score per patch. Why it matters: It’s fast, plug-and-play, and accurate for selection. 🍞 Bottom Bread (Anchor): If the instruction is “Search,” patches similar to the word “Search” score higher.

🍞 Top Bread (Hook): Think of replacing a torn-out comic strip with a single sticky note that says “skipped until here,” so the story panels still line up. 🥬 Filling (The Actual Concept): POSPAD compresses every dropped run into one learnable marker at the run’s end index to preserve spatial continuity. How it works: 1) Find contiguous dropped sequences. 2) Keep just the last position and insert one POSPAD token. 3) Proceed with the compact sequence. Why it matters: The model retains the ‘where’ information and stays accurate even with strong pruning. 🍞 Bottom Bread (Anchor): After skipping a long sidebar, a single POSPAD at the sidebar’s end tells the model exactly where that region used to finish.

Together, these pieces let FOCUSUI be picky (keep what matters), tidy (compress what doesn’t), and trustworthy (don’t lose the map).

03Methodology

At a high level: Input (screenshot + instruction) → Build supervision heatmaps (instruction-to-patch saliency) → Train a lightweight Query-Guided Saliency Scorer → Select top-r% visual tokens → Apply POSPAD to preserve positions → Feed compact sequence to the VLM → Ground target with an action head → Output click.

Step 1. Build instruction-to-patch supervision

🍞 Top Bread (Hook): Imagine drawing a bright spotlight over the parts of the screen that match what you asked for and dimming boring backgrounds. 🥬 Filling (The Actual Concept): What it is: A fused saliency map that labels how relevant each patch is, mixing two ideas—box overlap and a UI-graph prior. How it works: 1) Cut the image into patches. 2) Compute a box-overlap score for patches near the ground-truth element (brighter near the center). 3) Build a UI-graph by uniting visually similar neighboring patches; give big components smaller weights. 4) Fuse them with a controllable weight (e.g., 0.8 to box, 0.2 to UI-graph). Why it matters: This gives dense, high-quality teaching signals for what to keep. 🍞 Bottom Bread (Anchor): On a browser screenshot for “Click Address Bar,” patches at the bar light up; the giant white page background dims.

Concrete mini-example: Suppose a 3840×2160 image is divided into 14×14 pixel patches. The ground-truth box covers 4×3 patches; those get high scores (close to 1.0). Elsewhere, a large blank sidebar forms one giant component and gets down-weighted (e.g., ~0.2). Fusing makes the address bar area clearly stand out.
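
Here is a minimal sketch of how such a fused target could be built, assuming a hard 0/1 box-overlap score per patch and the 0.8/0.2 mixing weights from the example above; the paper's overlap scores may be soft and center-weighted rather than binary:

```python
import numpy as np

def box_overlap_map(grid_h: int, grid_w: int, box, patch_px: int = 14) -> np.ndarray:
    """Score 1.0 for patches whose cell intersects the ground-truth box (x0, y0, x1, y1 in pixels)."""
    scores = np.zeros((grid_h, grid_w), dtype=np.float32)
    x0, y0, x1, y1 = box
    for i in range(grid_h):
        for j in range(grid_w):
            px0, py0 = j * patch_px, i * patch_px
            px1, py1 = px0 + patch_px, py0 + patch_px
            if px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0:
                scores[i, j] = 1.0
    return scores

def fuse_supervision(box_map: np.ndarray, graph_prior: np.ndarray,
                     w_box: float = 0.8, w_graph: float = 0.2) -> np.ndarray:
    """Blend box overlap with the UI-graph prior into one dense saliency target."""
    prior = graph_prior / (graph_prior.max() + 1e-8)  # normalize prior to [0, 1]
    return w_box * box_map + w_graph * prior
```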

Step 2. Train the Query-Guided Saliency Scorer

🍞 Top Bread (Hook): Think of a fast helper that glances at both the words and the picture and says, “These tiles look most like what you asked for.” 🥬 Filling (The Actual Concept): What it is: A tiny module that predicts each patch’s relevance to the instruction using similarities. How it works: 1) Take vision encoder patch embeddings and instruction token embeddings. 2) Lightly enhance both with a small self-attention layer. 3) Normalize and compute patch–token similarities. 4) Mean-pool over instruction tokens to get a per-patch score. 5) Train it so its softmax over patches matches the fused supervision’s softmax (minimize KL divergence). Why it matters: It’s lightweight, accurate, and easy to plug into many VLMs. 🍞 Bottom Bread (Anchor): With instruction “Pause music,” the scorer spikes over the pause icon region.

Toy numbers: If a patch’s similarity to the instruction averages to 0.8 and background patches hover around 0.2, the scorer will prefer the 0.8 patch in top-K selection.
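
A minimal PyTorch-style sketch of such a scorer, assuming lightweight single-layer self-attention enhancers and a KL loss against the fused supervision; the module name, layer sizes, and `num_heads=4` below are illustrative, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryGuidedScorer(nn.Module):
    """Predict a relevance score per visual patch, conditioned on the instruction."""

    def __init__(self, dim: int):
        super().__init__()
        # Light "enhancement" layers for vision and text (dim must divide num_heads).
        self.vis_enh = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.txt_enh = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, patch_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # patch_emb: (B, N_patches, D), text_emb: (B, N_text, D)
        v, _ = self.vis_enh(patch_emb, patch_emb, patch_emb)
        t, _ = self.txt_enh(text_emb, text_emb, text_emb)
        v = F.normalize(v, dim=-1)
        t = F.normalize(t, dim=-1)
        sim = torch.einsum("bnd,bmd->bnm", v, t)  # patch-to-instruction-token similarities
        return sim.mean(dim=-1)                   # mean-pool over instruction tokens -> (B, N_patches)

def saliency_loss(pred_scores: torch.Tensor, target_saliency: torch.Tensor) -> torch.Tensor:
    """KL divergence between predicted and supervised patch distributions."""
    log_p = F.log_softmax(pred_scores, dim=-1)
    q = F.softmax(target_saliency, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")
```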

Step 3. Select the top-r% patches (token selection)

🍞 Top Bread (Hook): Like packing a backpack: you keep the essentials (water, map, snacks) and leave behind duplicates. 🥬 Filling (The Actual Concept): What it is: Choose only the highest-scoring patches, given a target retention ratio r (like 30%, 50%, 100%). How it works: 1) Sort patches by saliency. 2) Keep the top K where K = floor(r × total). 3) Mark the rest as dropped. Why it matters: This step shrinks the visual sequence dramatically, speeding up the model. 🍞 Bottom Bread (Anchor): For r = 30%, a 6,400-patch grid keeps about 1,920 high-value patches; the rest are considered for compression.
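
A minimal sketch of the selection step itself, assuming a flat tensor of predicted scores per patch; the `retention` argument mirrors the ratio r in the text:

```python
import torch

def select_top_patches(scores: torch.Tensor, retention: float = 0.3) -> torch.Tensor:
    """Return a boolean keep-mask over patches, keeping the top-r% by saliency.

    scores: (N_patches,) predicted saliency; retention: fraction of patches to keep.
    """
    k = max(1, int(retention * scores.numel()))
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep[scores.topk(k).indices] = True
    return keep  # True = kept, False = candidate for POSPAD compression
```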

Step 4. Preserve positions with POSPAD

🍞 Top Bread (Hook): If you cut a rope, you knot the ends so the length record stays meaningful. 🥬 Filling (The Actual Concept): What it is: POSPAD inserts one special marker at the end of each contiguous run of dropped tokens, compactly remembering their place. How it works: 1) In the flattened visual sequence, find maximal consecutive dropped runs. 2) For each run, keep only its last index and place a <pospad> token there; all earlier positions in the run are removed. 3) Everything else (kept tokens) remains. Why it matters: Spatial continuity is preserved, stabilizing precise grounding, especially at aggressive pruning. 🍞 Bottom Bread (Anchor): A long left sidebar becomes a single <pospad> at the sidebar’s end index; the model still understands where the sidebar ended along the screen.

Concrete numbers: If you drop 300 tokens that form 20 runs, POSPAD inserts 20 markers, so sequence length shrinks by ≈280 while positions stay coherent.
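
A minimal sketch of the POSPAD transformation over a flattened keep/drop mask, assuming the output is simply a list of (original position, is-POSPAD) pairs; the real pipeline would apply this to the token embeddings and position IDs directly:

```python
def apply_pospad(keep_mask):
    """Compress each contiguous run of dropped tokens into one marker at the run's end.

    keep_mask: sequence of booleans over the flattened visual token sequence
               (True = kept, False = dropped).
    Returns a list of (original_position, is_pospad) entries for the compact sequence.
    """
    compact, run_end = [], None
    for pos, keep in enumerate(keep_mask):
        if keep:
            if run_end is not None:              # close the pending dropped run
                compact.append((run_end, True))  # one <pospad> at the run's last index
                run_end = None
            compact.append((pos, False))         # kept visual token
        else:
            run_end = pos                        # extend the current dropped run
    if run_end is not None:
        compact.append((run_end, True))          # trailing dropped run
    return compact

# Example: dropping 300 tokens that form 20 runs yields 20 <pospad> markers,
# so the visual sequence shrinks by about 280 positions.
```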

Step 5. Feed compact sequence to the VLM and ground with an action head

🍞 Top Bread (Hook): After packing the essentials and labeling the gaps, you can travel faster and still navigate perfectly. 🥬 Filling (The Actual Concept): What it is: Standard decoding with an extra action head (coordinate-free) that aligns language to selected patches. How it works: 1) Pass the compact sequence (kept patches + POSPADs + text) into the LM decoder (no architecture change). 2) The action head takes a special hidden state and attends over the selected visual tokens. 3) It outputs a distribution over patches to choose the click location. Why it matters: Fewer, more relevant candidates make the head’s job easier and more accurate. 🍞 Bottom Bread (Anchor): For “Click ‘Share’,” the head peaks sharply on the Share icon tile and outputs the click.
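
A minimal sketch of a coordinate-free action head, assuming a single attention query drawn from a special hidden state and a softmax over the kept visual tokens; this is a simplified stand-in for GUI-Actor-style heads, not the paper's exact module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionHead(nn.Module):
    """Attend from a special action hidden state over the selected visual tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, action_hidden: torch.Tensor, visual_hidden: torch.Tensor) -> torch.Tensor:
        # action_hidden: (B, D) hidden state of the special action token
        # visual_hidden: (B, N_kept, D) decoder states of the kept visual tokens
        q = self.q_proj(action_hidden).unsqueeze(1)              # (B, 1, D)
        k = self.k_proj(visual_hidden)                           # (B, N_kept, D)
        logits = (q @ k.transpose(1, 2)).squeeze(1) / k.size(-1) ** 0.5
        return F.softmax(logits, dim=-1)  # distribution over kept patches = click location
```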

Step 6. Train end-to-end

🍞 Top Bread (Hook): Like practicing with answer keys and hints, so your shortcuts don’t hurt your scores. 🥬 Filling (The Actual Concept): What it is: Jointly optimize the saliency scorer and the grounding head with standard language modeling plus grounding supervision. How it works: 1) Saliency loss: match predicted per-patch saliency to fused supervision (KL). 2) Next-token prediction: train the LM as usual. 3) Attention loss: push the action head’s attention on patches that overlap the ground-truth box. Why it matters: The whole pipeline learns to keep the right tokens and point precisely. 🍞 Bottom Bread (Anchor): Over time, the model stops wasting attention on skies and backgrounds and nails the right icon, even at 30% retention.
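
Putting the three signals together, here is a minimal sketch of a combined training objective; the loss weights `lam_sal` and `lam_attn` and the exact form of the attention term are illustrative assumptions, not values from the paper:

```python
import torch.nn.functional as F

def focusui_loss(lm_logits, lm_targets, pred_saliency, target_saliency,
                 action_attn, gt_patch_mask, lam_sal=1.0, lam_attn=1.0):
    """Combine next-token prediction, saliency supervision, and attention supervision."""
    # 1) Standard language-modeling loss on the decoder outputs: (B, T, V) vs (B, T).
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1), lm_targets.flatten())
    # 2) KL between predicted and supervised patch saliency distributions: (B, N_patches).
    sal_loss = F.kl_div(F.log_softmax(pred_saliency, dim=-1),
                        F.softmax(target_saliency, dim=-1),
                        reduction="batchmean")
    # 3) Encourage the action head's attention mass to land on ground-truth-box patches.
    attn_loss = -(action_attn * gt_patch_mask).sum(dim=-1).clamp_min(1e-8).log().mean()
    return lm_loss + lam_sal * sal_loss + lam_attn * attn_loss
```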

The secret sauce:

  • Instruction-aware selection: chooses what humans would focus on.
  • POSPAD: compresses dropped regions without breaking the positional map.
  • Seamless integration: works with Qwen2.5-VL and Qwen3-VL, no decoder changes needed, compatible with FlashAttention.

Example with actual data: On ScreenSpot-Pro (many high-res desktop apps), FOCUSUI-7B with r = 30% keeps only about a third of patches, adds a modest number of POSPAD tokens, runs up to 1.44× faster, uses ~17% less peak GPU memory, and loses only ~3.2 accuracy points compared to full tokens.

04Experiments & Results

The test: The authors measured UI grounding accuracy (did the model point to the right place?), inference time (how fast?), and peak GPU memory (how much cost?) across four benchmarks: ScreenSpot-V2, ScreenSpot-Pro, OS-World-G, and UI-Vision. These span mobile, web, and high-res desktop apps, where precise clicks matter.

The competition: Strong baselines included GUI-Actor (coordinate-free grounding), Jedi, Qwen2.5-VL, and Qwen3-VL. The team also compared against general visual pruning approaches (Fast-V, HiPrune, Vision-Zip) to test whether generic pruning could keep spatial precision.

Scoreboard with context:

  • ScreenSpot-Pro (hard, high-res desktop): FOCUSUI-7B hits about 48.3% average, improving over GUI-Actor-7B by 3.7 points. In school terms, that’s going from an 84 to about an 88 when everyone else struggles with the tough exam.
  • Token reduction trade-off: With only 30% retention, FOCUSUI-7B drops modestly (to ~45.1%), while speeding up to 1.44× and lowering memory by ~17–18%. That’s like running a race with a lighter backpack and barely losing any accuracy.
  • ScreenSpot-V2 (hybrid devices): FOCUSUI-3B/7B maintain ~91–93% at full tokens and stay remarkably strong even when keeping only 30–50% of tokens.
  • OS-World-G (desktop tasks): FOCUSUI-3B scores around 53.4% at full tokens; even at 30% retention it stays close (~51.8%), showing robustness.
  • UI-Vision (desktop perception and actions): FOCUSUI-7B reaches ~24.9% overall, matching or exceeding prior models, and holds up under pruning.

Against general pruning: When the authors plugged Fast-V, HiPrune, or Vision-Zip into UI grounding models, accuracy often fell sharply at 30% retention. For example, a Qwen2.5-VL-3B baseline at 26.1% on ScreenSpot-Pro could collapse to around 4.8–21.0% with these pruners, showing spatial breakage harms precise pointing. In contrast, FOCUSUI’s 30% setting stayed near its dense baseline (within ~7.3 points on ScreenSpot-Pro, ~0.5 on ScreenSpot-V2, and ~3.0 on OS-World-G), proving position-preserving selection matters.

Efficiency wins: On ScreenSpot-Pro, reducing retention from 100% to 30% gave up to 1.44× faster inference and ~17–18% lower peak GPU memory for both Qwen2.5-VL-7B and Qwen3-VL-2B variants. That means more runs per GPU and faster feedback loops—very practical for agents that must act quickly.

Surprising findings:

  • Position is everything: The team tried placing the POSPAD marker at the start, middle, or end of each dropped sequence. Best results came from placing it at the sequence end, which matches the raster ordering used by the vision encoder and positional embeddings.
  • Selection really understands relevance: A Patch Recall@K% analysis showed FOCUSUI’s scorer captures ground-truth areas very early (e.g., high recall by 10–25%), confirming it “finds the right neighborhoods first.”
  • Even small backbones gain: With Qwen3-VL-2B, FOCUSUI still improved over the vanilla model and kept good accuracy under pruning, showing generality.

Plain-English bottom line: Compared to doing everything the slow, dense way, FOCUSUI learns what to look at and how to skip safely. It keeps the screen’s “map” lined up using POSPAD, so it points accurately even when it sees far fewer tokens. That’s the key to being both smart and speedy.

05Discussion & Limitations

Limitations:

  • Mostly spatial, not temporal: FOCUSUI focuses on one screenshot at a time. Many real tasks involve sequences (e.g., open menu, then pick an item). Temporal token reduction and multi-step memory aren’t addressed here.
  • Supervision dependence: Best training uses instruction-conditioned supervision (box overlaps + UI-graph prior). If supervision is weak or noisy, selection quality may drop.
  • Extreme sparsity edge cases: In UIs where the relevant element is very small, visually similar to the background, or surrounded by many lookalikes, aggressive pruning can still cause minor misses.

Required resources:

  • A VLM backbone (e.g., Qwen2.5-VL or Qwen3-VL) and a GPU environment with FlashAttention support for the best speedups.
  • Training data with reasonable element boxes helps (the authors also use a UI-graph prior to reduce annotation dependence).
  • Some engineering to insert the saliency scorer and POSPAD step before the LM decoder.

When not to use:

  • If you must process every pixel for forensics-level detail, dense processing might be safer.
  • If you run on tiny low-res screens with few tokens already, the gain from pruning will be small.
  • If instructions are extremely ambiguous (e.g., “Click that thing”), selection may need richer context or disambiguation steps first.

Open questions:

  • Temporal efficiency: How to extend position-preserving selection over time (video-like UI streams) so multi-step agents stay fast and precise?
  • Self-supervision and robustness: Can we reduce reliance on box-level labels further with purely self-supervised or synthetic signals?
  • Adaptive placement: Could POSPAD adaptively encode run length or local layout hints to help the LM even more?
  • Joint planning + selection: Can the agent reason about future steps and pre-keep regions likely needed next, improving multi-action tasks?
  • Generalization to other modalities: Would POSPAD-like continuity markers help in document layout analysis, medical images, or code editors with mini-maps?

Overall, FOCUSUI shows that the right selection plus careful position preservation beats generic pruning for UI grounding, but there’s plenty of room to make multi-turn, instruction-following agents even faster and smarter.

06Conclusion & Future Work

Three-sentence summary: FOCUSUI makes UI grounding efficient by selecting only instruction-relevant visual tokens and preserving spatial continuity with a new POSPAD marker. This keeps accuracy high even when keeping as little as 30% of the visual tokens, while speeding up inference and lowering memory use. It outperforms strong baselines on several benchmarks and integrates cleanly with modern VLMs.

Main achievement: The key contribution is a position-preserving selection pipeline—dense, instruction-aware supervision for token importance plus the POSPAD sequence transformation—that avoids the accuracy collapse seen with general visual pruning on precise UI pointing.

Future directions: Extend selection to temporal sequences, explore self-supervised saliency signals, make POSPAD more informative (e.g., encode skipped run length), and co-design selection with planning so agents keep what they’ll need next. Also, test in broader domains like document UI, mobile accessibility, and code IDEs.

Why remember this: It shows you don’t have to choose between speed and precision in UI grounding; by keeping the right tokens and keeping the map intact, you can get both. That design lesson—prune with purpose and preserve structure—applies widely across multimodal AI.

Practical Applications

  • Speed up automated UI testing frameworks by pruning redundant visual tokens while keeping precise click accuracy.
  • Build faster accessibility assistants that can robustly find and click the right controls on high-resolution screens.
  • Deploy lighter-weight desktop or mobile agents that run locally with reduced GPU memory requirements.
  • Improve reliability of customer-support bots that guide users by highlighting or clicking the correct UI elements.
  • Accelerate data labeling tools that require precise selection of UI components in screenshots.
  • Enhance RPA (Robotic Process Automation) systems to operate more accurately on complex enterprise UIs.
  • Power smarter IDE or design-tool assistants that quickly locate panels, icons, and menus via instruction-aware selection.
  • Optimize web automation crawlers to focus on relevant widgets (forms, buttons) and ignore large backgrounds.
  • Enable multi-app computer-use agents to remain responsive when switching among 4K professional applications.
  • Reduce cloud compute costs for large-scale GUI interaction experiments by lowering inference time and memory.
#UI grounding#vision-language models#visual token pruning#positional continuity#instruction-guided selection#saliency scoring#POSPAD#Qwen2.5-VL#Qwen3-VL#FlashAttention#GUI-Actor#ScreenSpot-Pro#OS-World-G#UI-graph prior#bounding-box supervision