POINTS-GUI-G: GUI-Grounding Journey
Key Summary
- This paper teaches a computer to find buttons, text, and icons on screens so it can click and type in the right places, a skill called GUI grounding.
- The team started from a model that was not good at this skill and built it up step by step, rather than relying on a model that was already strong.
- They cleaned and unified many messy datasets into one simple format and added harder, trickier examples to help the model learn.
- They improved training by letting the vision part of the model keep learning (instead of freezing it) and by matching image sizes between training and testing.
- They used reinforcement learning with verifiable rewards so the model gets a clear 'right or wrong' signal when it points to a location.
- Their 8B-parameter model, POINTS-GUI-G-8B, reached state-of-the-art results on several popular benchmarks.
- On ScreenSpot-Pro it scored 59.9, on OSWorld-G 66.0, on ScreenSpot-v2 95.7, and on UI-Vision 49.9.
- Clear, checkable rewards made reinforcement learning especially effective for this perception-heavy task.
- They open-sourced the model and evaluation suite to help others build better screen-using AI agents.
- The result is a stronger foundation for agents that can shop online, book tickets, and automate everyday computer tasks safely and accurately.
Why This Research Matters
AI that can precisely find and click the right things on screens can save people hours on repetitive tasks like filling forms, booking services, or sorting files. This paper shows how to build that precision from the ground up, so teams are not forced to rely on giant, already-perfect models. By cleaning data, matching image sizes, and using simple yes/no rewards, the approach makes these agents more accurate and reliable in real software. That means fewer wrong clicks, fewer failed automations, and safer computer use at scale. It also helps smaller organizations build capable agents with modestly sized models. The open-source release encourages fair, consistent testing and faster progress for everyone.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how you use your eyes to find the 'Play' button on a video or the 'Submit' button on a form? Your brain matches words and shapes to where they are on the screen so your finger clicks the right spot.
🥬 Filling (The Actual Concept): GUI grounding is the computer’s version of that—finding exactly where an interface element (like a button or icon) is so it can act on it.
- What it is: Mapping a natural-language instruction (like 'click Send') to precise screen coordinates.
- How it works: (1) Read the instruction, (2) Look at the screenshot, (3) Match the description to visible parts (text, shapes, icons), (4) Output a point or box at the exact place.
- Why it matters: Without GUI grounding, an AI agent would guess, clicking the wrong thing—like trying to text on a locked phone screen.
🍞 Bottom Bread (Anchor): When an agent sees 'Open Settings,' grounding helps it point to the gear icon at (0.842, 0.112) so a click really opens Settings.
🍞 Top Bread (Hook): Imagine hiring a helper robot that can use your computer like you do—shopping, booking flights, or filling forms. That helper must know where to click.
🥬 Filling (The Actual Concept): GUI agents are AI helpers that control apps by reading screens and interacting with them.
- What it is: An AI that breaks a big goal into steps and takes actions (click, type, drag) on the screen.
- How it works: (1) Understand the task, (2) Ground each step to a UI element, (3) Perform the action, (4) Check progress and repeat.
- Why it matters: Without precise grounding, even a smart plan fails—like knowing you should 'press Start' but pressing empty space.
🍞 Bottom Bread (Anchor): To 'book a flight,' the agent must find 'From', 'To', 'Date', and 'Search' boxes exactly, or the booking won’t work.
🍞 Top Bread (Hook): You know how different maps use different scales and legends? If they don’t match, it’s confusing.
🥬 Filling (The Actual Concept): Dataset heterogeneity made past GUI training messy.
- What it is: Different datasets used different coordinate scales, formats, and task styles.
- How it works (the problem): (1) Some used raw pixels, others used normalized [0,1], (2) Some asked for points, others for boxes, (3) Descriptions varied a lot.
- Why it matters: A model trained on mixed rules gets confused and underperforms.
🍞 Bottom Bread (Anchor): If one dataset says 'the button is at (1200, 900)' and another says '(0.6, 0.45)', the model must learn two systems at once—hard!
🍞 Top Bread (Hook): Think of learning in a quiet classroom versus a noisy playground. Bad noise makes learning harder.
🥬 Filling (The Actual Concept): Noisy labels in GUI data mislead the model.
- What it is: Wrong or sloppy coordinates attached to screenshots.
- How it works: If an annotation points slightly off the real button, the model learns the wrong target.
- Why it matters: The model becomes less precise and makes more mistakes.
🍞 Bottom Bread (Anchor): If the dataset says 'click Login' but the dot sits between 'Login' and 'Signup', the AI keeps missing the real button.
🍞 Top Bread (Hook): You don’t become great at piano by only practicing easy songs.
🥬 Filling (The Actual Concept): Training needs difficulty progression—easy to hard.
- What it is: A curriculum that starts simple and adds cluttered layouts, tiny icons, and distractors.
- How it works: (1) Remove overly simple screens, (2) Add complex, high-resolution, realistic cases, (3) Sort by difficulty to pace learning.
- Why it matters: Without challenge, the model plateaus and fails on real-world complexity.
🍞 Bottom Bread (Anchor): After mastering 'big blue button', the model must handle pro software toolbars where dozens of tiny icons look similar.
🍞 Top Bread (Hook): Photos look sharp on a 4K TV but blurry on an old screen. Size matters for details.
🥬 Filling (The Actual Concept): Resolution consistency is vital.
- What it is: Keeping training and testing image sizes in the same ballpark.
- How it works: (1) Raise the training resolution limit, (2) Cap inference resolution to match training.
- Why it matters: If the model only saw small images during training, it struggles with giant, high-res screenshots later.
🍞 Bottom Bread (Anchor): A model trained on small images misses a tiny gear icon in a huge 3K desktop screenshot.
🍞 Top Bread (Hook): Games are fun because you get points when you do the right thing.
🥬 Filling (The Actual Concept): Reinforcement Learning with Verifiable Rewards (RLVR) gives the model points when it pinpoints the right place.
- What it is: A learning method where the model tries, gets a reward if it’s correct, and improves from feedback.
- How it works: (1) Model outputs a location, (2) If it’s inside the true box, reward=1 else 0, (3) Adjust policy to get more 1s.
- Why it matters: Clear, checkable rewards reduce guesswork and sharpen precision in perception-heavy tasks.
🍞 Bottom Bread (Anchor): If the model clicks inside the 'Search' box, it gets a 'good job!' point; outside gets zero—simple and powerful.
🍞 Top Bread (Hook): Before this paper, many teams started with models that already had good screen sense—like starting a race halfway.
🥬 Filling (The Actual Concept): This work builds grounding from a weaker base model to learn the full pipeline.
- What it is: A ground-up strategy combining data cleanup, tailored training, and RL.
- How it works: (1) Unify data, (2) Train the vision backbone (don’t freeze it), (3) Align resolutions, (4) Add RL with exact rewards.
- Why it matters: We learn what truly moves the needle and achieve SOTA with an 8B model.
🍞 Bottom Bread (Anchor): The final scores—95.7 on ScreenSpot-v2 and 66.0 on OSWorld-G—show the pipeline works across phones, web, and desktops.
02 Core Idea
🍞 Top Bread (Hook): Imagine teaching a friend to find the 'Save' icon in any app—first clean notes, then bigger practice sheets, and finally a quiz that says right or wrong instantly.
🥬 Filling (The Actual Concept): The key insight: Treat GUI grounding as a perception-first problem built on unified data, an actively trained vision encoder, and reinforcement learning with exact, verifiable rewards.
- What it is: A three-pillar pipeline—Refined Data Engineering, Improved Training Strategies, and RL with Verifiable Rewards—to greatly boost pointing accuracy.
- How it works: (1) Make all datasets speak the same language (formats and scales), (2) Let the vision encoder keep learning and keep resolutions consistent, (3) Use RL where a correct click earns a certain reward.
- Why it matters: Each pillar fixes a major failure point: confusion from messy data, blurry vision from frozen encoders/mismatched sizes, and lack of sharp feedback during training.
🍞 Bottom Bread (Anchor): The result is POINTS-GUI-G-8B scoring 59.9 on the tough ScreenSpot-Pro benchmark—like acing a final exam many others struggle with.
Multiple Analogies for the Same Idea:
- Toolbelt analogy: You first sort your tools (data unification), sharpen the lenses on your glasses (vision encoder + resolution), then practice with a scoreboard that lights up green only when you hit the bullseye (RL with verifiable rewards).
- Sports analogy: Standard drills (unified data), strength training targeted at weak muscles (unfrozen vision encoder), and scrimmages with a clear scoreboard (RL) together turn a good player into an MVP.
- Cooking analogy: Use consistent measurements (data standardization), adjust the oven precisely (resolution matching), and taste-test every batch with a yes/no judge (verifiable rewards).
Before vs After:
- Before: Models learned from mixed, noisy datasets; the vision part was often frozen; training and testing images didn’t match sizes; feedback during learning was fuzzy.
- After: Clean, unified data; a vision encoder adapted for GUIs; matched resolutions; crisp RL rewards. Accuracy jumps across varied benchmarks.
🍞 Top Bread (Hook): You know how finding tiny differences in two similar pictures gets easier when your eyes keep practicing on those pictures, not on landscapes?
🥬 Filling (The Actual Concept): Why it works (intuition):
- Focused perception: Unfreezing the vision encoder lets it specialize in GUI shapes, fonts, and tiny icons.
- Scale fairness: Matching resolutions prevents surprises, so learned features transfer reliably.
- Binary truth: Verifiable RL rewards (inside box vs outside) provide crystal-clear signals that push predictions toward precise hits.
- Hard reps: Tougher layouts build robustness, so clutter stops confusing the model.
🍞 Bottom Bread (Anchor): After many exact yes/no reward rounds, the model consistently lands inside the 'Download' button instead of a few pixels off.
Building Blocks (each as a mini sandwich):
- 🍞 Hook: You know how rulers all need the same units to measure correctly? 🥬 Concept: Standardization to [0,1] coordinates and uniform formats. Why it matters: Prevents conflicting rules that confuse learning. 🍞 Anchor: A 'Login' box is always (x0, y0, x1, y1) in [0,1], never raw pixels.
- 🍞 Hook: A friend double-checks your homework. 🥬 Concept: Noise reduction with OmniParser-v2 cross-checking annotations. Why it matters: Removes mislabeled targets. 🍞 Anchor: If the label says 'Search' but OmniParser finds no matching element overlap, the sample is dropped.
- 🍞 Hook: Leveling up in a video game. 🥬 Concept: Complexity enhancement via layout entropy and synthetic hard cases (GUI-CodeGen, GUI-Overlay). Why it matters: Builds skill for real-world, crowded screens. 🍞 Anchor: The model learns to find a tiny '⚙' among many similar icons.
- 🍞 Hook: Training eyes for tiny fonts. 🥬 Concept: Unfreezing the vision encoder and keeping resolution consistent. Why it matters: Extracts finer details and avoids scale shock. 🍞 Anchor: The model correctly boxes a 12px dropdown arrow.
- 🍞 Hook: A buzzer that sounds only for exact hits. 🥬 Concept: RL with verifiable rewards and GRPO. Why it matters: Encourages precise pointing and stable policy updates. 🍞 Anchor: Rewards = 1 only if the predicted point lies inside the ground-truth box.
03 Methodology
High-Level Recipe: Input (instruction + screenshot) → [A: Refined Data Engineering] → [B: Improved Training Strategies] → [C: RL with Verifiable Rewards] → Output (point or box coordinates)
A. Refined Data Engineering
- 🍞 Top Bread (Hook): Imagine assembling a puzzle from boxes made by different companies—pieces won’t fit unless you standardize shapes. 🥬 Filling (Concept: Standardization)
- What it is: Convert all datasets to the same formats: [0,1]-scaled coordinates with three-decimal precision and a unified task style (point or box localization).
- How it works: (i) Normalize all coordinates to [0,1], (ii) Reformat prompts into two main types: center-point and bounding-box prediction, (iii) Store outputs as strict tuples/lists.
- Why it matters: Prevents the model from juggling inconsistent rules that harm learning. 🍞 Bottom Bread (Anchor): A 'Click the Avatar' task always returns (x, y) ∈ [0,1], not pixels.
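To make the standardization concrete, here is a minimal Python sketch (not the authors' code) of converting pixel-space labels into the unified [0,1] format with three-decimal precision; the function names are illustrative.

```python
# A minimal sketch (not the authors' code): convert pixel-space labels to the
# unified [0,1] format with three-decimal precision. Function names are illustrative.
def normalize_box(box_px, img_w, img_h):
    """Convert a pixel-space (x0, y0, x1, y1) box to [0,1] coordinates."""
    x0, y0, x1, y1 = box_px
    return (round(x0 / img_w, 3), round(y0 / img_h, 3),
            round(x1 / img_w, 3), round(y1 / img_h, 3))

def normalize_point(point_px, img_w, img_h):
    """Convert a pixel-space (x, y) click point to [0,1] coordinates."""
    x, y = point_px
    return (round(x / img_w, 3), round(y / img_h, 3))

# Example: a 'Login' box on a 1920x1080 screenshot.
print(normalize_box((1152, 486, 1248, 540), 1920, 1080))  # (0.6, 0.45, 0.65, 0.5)
```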
- 🍞 Top Bread (Hook): You ask two friends to spot the 'Play' icon; if both agree, it’s more likely correct. 🥬 Filling (Concept: Noise Reduction with OmniParser-v2)
- What it is: Filter out wrong labels by checking whether a ground-truth point/box overlaps real UI elements detected by OmniParser-v2.
- How it works: (i) Expand point labels into small squares, (ii) Compute coverage with detected UI boxes, (iii) Keep samples exceeding a reliability threshold.
- Why it matters: Bad labels teach the model to miss; filtering keeps supervision trustworthy. 🍞 Bottom Bread (Anchor): If 'Search' points to empty space (no overlap), the example is discarded.
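A minimal sketch of how such a cross-check could work, assuming hypothetical helper names and a 0.5 coverage threshold; the paper's exact threshold and OmniParser-v2 integration may differ.

```python
# A minimal sketch of the label-quality filter, assuming hypothetical helper
# names and a 0.5 coverage threshold; the real OmniParser-v2 pipeline may differ.
def point_to_square(x, y, half=0.01):
    """Expand a point label into a small square box in [0,1] space."""
    return (x - half, y - half, x + half, y + half)

def coverage(label_box, detected_box):
    """Fraction of the label box covered by a detected UI-element box."""
    lx0, ly0, lx1, ly1 = label_box
    dx0, dy0, dx1, dy1 = detected_box
    iw = max(0.0, min(lx1, dx1) - max(lx0, dx0))
    ih = max(0.0, min(ly1, dy1) - max(ly0, dy0))
    label_area = max(1e-9, (lx1 - lx0) * (ly1 - ly0))
    return (iw * ih) / label_area

def keep_sample(label_box, detected_boxes, threshold=0.5):
    """Keep a sample only if some detected element covers its label well enough."""
    return any(coverage(label_box, box) >= threshold for box in detected_boxes)
```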
- 🍞 Top Bread (Hook): To get stronger, you lift heavier weights over time. 🥬 Filling (Concept: Complexity Enhancement)
- What it is: Make training harder in smart ways: measure layout entropy; add synthetic complex screens; prioritize high-resolution samples.
- How it works: (i) Compute 1D/2D layout entropy from element centers to rate clutter, (ii) Remove overly easy screens, (iii) Synthesize pro-style interfaces (GUI-CodeGen) and multi-window desktops (GUI-Overlay), (iv) Emphasize higher pixel counts.
- Why it matters: The model stops overfitting to simple cases and learns to handle crowded, realistic UIs. 🍞 Bottom Bread (Anchor): In a VS Code-like layout, the model still finds 'Extensions' among dense sidebars.
Mini-concepts inside Complexity Enhancement:
- 🍞 Hook: Sorting worksheets from easy to hard.
🥬 Concept: Layout Entropy
- What: A score of how spread-out and cluttered elements are.
- How: Count how elements distribute across grid cells (2D) and along several directions (1D), then combine.
- Why: Lets us bucket screens into easy/medium/hard for curriculum (a minimal code sketch follows this list). 🍞 Anchor: A clean login page = low entropy; a design app toolbar = high entropy.
- 🍞 Hook: Building a model city from blueprints.
🥬 Concept: GUI-CodeGen
- What: Use an LLM to write HTML for pro-like UIs; render to high-res images; extract clickable elements.
- How: Generate, render, label.
- Why: Creates rich, diverse, tiny-element training data. 🍞 Anchor: A code editor mockup with tabs, sidebars, and icons at 1920×2560.
- 🍞 Hook: Putting windows on top of each other on a desktop.
🥬 Concept: GUI-Overlay
- What: Layer multiple apps over a background to add distractors and occlusions.
- How: Composite screenshots, keep true boxes for targets.
- Why: Trains robustness when real desktops get messy. 🍞 Anchor: The model still finds 'Download' on a browser behind a chat window.
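As referenced in the Layout Entropy item above, here is a minimal sketch of a 2D layout-entropy score over element centers; the grid size and the way 1D and 2D scores are combined are assumptions for illustration.

```python
import math

# A minimal sketch of a 2D layout-entropy score over element centers in [0,1]^2.
# The grid size and how 1D/2D scores are combined are illustrative assumptions.
def layout_entropy_2d(centers, grid=8):
    """Entropy of element centers binned into a grid x grid histogram."""
    counts = [[0] * grid for _ in range(grid)]
    for x, y in centers:
        i = min(int(x * grid), grid - 1)
        j = min(int(y * grid), grid - 1)
        counts[j][i] += 1
    total = len(centers)
    entropy = 0.0
    for row in counts:
        for c in row:
            if c > 0:
                p = c / total
                entropy -= p * math.log(p)
    return entropy  # higher = more spread-out, cluttered layout

# A sparse login page yields a low score; a dense toolbar yields a high one.
```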
B. Improved Training Strategies
- 🍞 Top Bread (Hook): If you never let your eyes practice reading tiny fonts, you’ll keep missing them. 🥬 Filling (Concept: Unfreezing the Vision Encoder)
- What it is: Keep training the vision backbone so it specializes in GUI features (fonts, icons, crisp edges).
- How it works: Jointly fine-tune the vision encoder, projector, and LLM during supervised learning.
- Why it matters: Frozen encoders stay generic; unfrozen encoders learn the fine-grained details GUIs demand. 🍞 Bottom Bread (Anchor): After fine-tuning, the model distinguishes 'O' vs. gear '⚙' at small sizes.
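A minimal PyTorch-style sketch of what "unfreezing" means in practice; the attribute names (vision_encoder, projector, llm) are assumptions about the model layout, not the actual implementation.

```python
# A minimal PyTorch-style sketch of joint fine-tuning; the attribute names
# (vision_encoder, projector, llm) are assumptions about the model layout,
# not the actual implementation.
def unfreeze_for_gui_finetuning(model):
    """Enable gradients for the vision encoder, projector, and LLM together."""
    for module in (model.vision_encoder, model.projector, model.llm):
        for param in module.parameters():
            param.requires_grad = True  # the vision backbone is NOT frozen
    return [p for p in model.parameters() if p.requires_grad]

# trainable = unfreeze_for_gui_finetuning(model)
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)  # illustrative hyperparameters
```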
- 🍞 Top Bread (Hook): Practicing piano on one keyboard size and performing on a much bigger one can throw off your fingers. 🥬 Filling (Concept: Resolution Consistency)
- What it is: Match the scale between training and inference by raising the training cap and constraining test images to similar ranges.
- How it works: (i) Increase max training resolution, (ii) Cap inference resolution to reduce mismatch.
- Why it matters: Prevents performance drops on high-resolution benchmarks like ScreenSpot-Pro. 🍞 Bottom Bread (Anchor): Keeping both around the same size boosts accuracy by double digits on tough high-res tests.
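A minimal sketch of capping inference resolution so it stays in the same range as training; the specific pixel cap is an assumed value for illustration.

```python
# A minimal sketch of capping an image's pixel count so inference stays in the
# same resolution range as training; the exact cap is an assumed value.
def cap_resolution(width, height, max_pixels=2048 * 2048):
    """Downscale (preserving aspect ratio) only if the image exceeds max_pixels."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = (max_pixels / pixels) ** 0.5
    return int(width * scale), int(height * scale)

# A 3840x2160 (4K) screenshot is scaled to roughly 2730x1536 under this cap,
# keeping it in the same ballpark as the training images.
print(cap_resolution(3840, 2160))
```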
C. Reinforcement Learning with Verifiable Rewards (RLVR)
- 🍞 Top Bread (Hook): A dart game gives you points only if you hit the board—simple feedback makes you better fast. 🥬 Filling (Concept: Verifiable Reward for GUI Grounding)
- What it is: Reward = 1 if the predicted point lies inside the ground-truth box; else 0.
- How it works: (i) Sample outputs (rollouts), (ii) Score each with binary rewards, (iii) Update the policy to increase the chance of correct hits.
- Why it matters: Precise, objective feedback directly trains pinpoint accuracy. 🍞 Bottom Bread (Anchor): The model learns to land inside 'Search' instead of just near it.
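The reward itself is simple enough to show directly; a minimal Python sketch, using [0,1] coordinates:

```python
# A minimal sketch of the binary verifiable reward: 1 if the predicted point
# lies inside the ground-truth box, else 0 (all coordinates in [0,1]).
def grounding_reward(pred_point, gt_box):
    x, y = pred_point
    x0, y0, x1, y1 = gt_box
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0

print(grounding_reward((0.62, 0.47), (0.6, 0.45, 0.65, 0.5)))  # 1.0: inside 'Search'
print(grounding_reward((0.70, 0.47), (0.6, 0.45, 0.65, 0.5)))  # 0.0: just outside
```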
- 🍞 Top Bread (Hook): Practicing with a team where everyone compares scores pushes you to improve. 🥬 Filling (Concept: GRPO—Group Relative Policy Optimization)
- What it is: A stable RL method where the model’s advantage is normalized within a group of rollouts.
- How it works: (i) Generate multiple attempts per query, (ii) Compute rewards, (iii) Normalize advantages against the group, (iv) Update with clipped ratios for stability.
- Why it matters: Reduces training swings and speeds learning. 🍞 Bottom Bread (Anchor): With 8 rollouts per sample, the policy steadily increases average reward over time.
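A minimal sketch of the group-relative advantage computation at the heart of GRPO; the clipped-ratio policy update that follows is omitted, and the rollout count of 8 matches the example above.

```python
# A minimal sketch of the group-relative advantage at the heart of GRPO;
# the clipped-ratio policy update that follows is omitted here.
def group_advantages(rewards, eps=1e-6):
    """Normalize binary rewards within one group of rollouts for the same query."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 8 rollouts for one query, 3 of which land inside the target box.
print(group_advantages([1, 0, 1, 0, 0, 1, 0, 0]))
# Hits get positive advantages, misses negative, so the update favors hitting behavior.
```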
Putting It All Together (Example with Data):
- Input: 'Click Report an Invasive Species' + webpage screenshot.
- A: It’s standardized to [0,1]; noisy labels removed; selected as a hard sample due to clutter.
- B: The vision encoder is unfrozen; the screenshot is processed at a resolution like those used in training.
- C: The model outputs a point; if it falls within the true link box, reward=1; GRPO updates the policy.
- Output: A precise coordinate (e.g., (0.915, 0.032)).
Secret Sauce:
- Exact, checkable rewards perfectly fit GUI grounding’s tight output space (points/boxes).
- Unfreezing perception plus resolution alignment recovers a lot of lost detail.
- Curriculum via layout entropy and synthetic hard cases stretches the model’s capability ceiling.
04 Experiments & Results
🍞 Top Bread (Hook): Think of five different school exams—some on phones, some on laptops, some crowded with tiny text. A good student must ace them all.
🥬 Filling (The Actual Concept): The team tested POINTS-GUI-G-8B on five benchmarks to measure how well it grounds UI elements across environments.
- What it is: A thorough comparison across ScreenSpot-v2, ScreenSpot-Pro, OSWorld-G, UI-Vision, and MMBench-GUI-L2.
- How it works: Evaluate accuracy for pointing/boxing targets on mobile, web, and desktop screenshots with text and icons.
- Why it matters: Real users face all kinds of screens; strong performance must generalize.
🍞 Bottom Bread (Anchor): Scoring well on all five means the model can handle your phone’s settings app and a high-res CAD tool window.
The Competition:
- Comparable open-source models like Qwen3-VL (2B/8B/32B), GTA1-7B, GUI-Owl, UI-Venus, MAI-UI (2B/8B), and others.
- Several larger models were included; outperforming some of them shows efficiency, not just size.
Scoreboard with Context:
- ScreenSpot-Pro (high-res, pro apps): 59.9. That’s like getting an A when many others are at B or C; it beats most 7B peers and even some bigger models.
- OSWorld-G (complex desktops): 66.0 overall, topping state-of-the-art among similar sizes—like being first in a tough league.
- ScreenSpot-v2 (general grounding across devices): 95.7 average—think near-perfect homework across mobile/desktop/web.
- UI-Vision (diverse instructions): 49.9 average—well ahead of many peers, especially on basic and functional categories.
- MMBench-GUI-L2: Competitive or leading across platforms (Windows, macOS, Linux, iOS, Android, Web), indicating balanced strength.
Surprising/Notable Findings:
- Reinforcement learning, often used for reasoning, also sharply boosts this perception-centric task when rewards are verifiable.
- Resolution matching (train≈test) yields large gains on high-res benchmarks—an overlooked factor now shown to be crucial.
- Unfreezing the vision encoder, previously kept static in the WePOINTS series, delivered substantial accuracy jumps, confirming GUI perception needs specialization.
RL Training Dynamics (What they watched and why):
- Reward trends upward then plateaus—evidence the policy learns to hit targets more often.
- Entropy (exploration) decreases with fluctuations—showing the model narrows in on better predictions while still exploring a bit.
- Using groups of rollouts (e.g., 8) balances stability and compute, enabling steady improvement.
Ablations / Key Factor Impacts (Plain-English Take):
- Data Engineering alone: big boost by removing confusion and noise.
- Unfreezing the Vision Encoder: another big leap by improving fine-detail reading.
- Resolution Consistency: double-digit gains on high-res tests—don’t skip this.
- RL with Verifiable Rewards: consistent, meaningful gains on top of supervised training.
Bottom Line:
- POINTS-GUI-G-8B ranks at or near the top across multiple benchmarks and device types, proving the three-pillar recipe is broadly effective.
05 Discussion & Limitations
🍞 Top Bread (Hook): Even the best magnifying glass can struggle in fog.
🥬 Filling (The Actual Concept): Limitations
- What it is: Situations where POINTS-GUI-G may stumble.
- How it works: (1) If training data misses certain app styles or languages, the model may not generalize, (2) Extreme UI themes (unusual fonts, low contrast) can still be hard, (3) Fast-changing UI layouts in the wild can outpace dataset updates.
- Why it matters: Knowing the edges helps decide where to trust the model and where to add more data.
🍞 Bottom Bread (Anchor): A rare enterprise app with custom icon fonts might confuse the model until similar samples are added.
🍞 Top Bread (Hook): A strong athlete still needs a gym.
🥬 Filling (The Actual Concept): Required Resources
- What it is: What you need to train/deploy.
- How it works: (1) High-resolution GPU training for the unfrozen vision encoder, (2) Storage and I/O for large, unified datasets, (3) Compute for RL rollouts (groups of attempts).
- Why it matters: Under-provisioned systems may force lower resolutions or fewer rollouts, reducing accuracy.
🍞 Bottom Bread (Anchor): Training on tiny images to save memory will likely cost points on ScreenSpot-Pro-like tasks.
🍞 Top Bread (Hook): Don’t bring a bulldozer to plant a flower.
🥬 Filling (The Actual Concept): When NOT to Use
- What it is: Cases where a lighter tool is better.
- How it works: (1) Fixed-form UIs with stable coordinates can be automated via templates without heavy models, (2) Low-resolution kiosk screens may be handled by simpler detectors.
- Why it matters: Pick the simplest solution that works; save heavy models for varied, evolving UIs.
🍞 Bottom Bread (Anchor): A single-page check-in terminal with two big buttons might not need POINTS-GUI-G.
🍞 Top Bread (Hook): A map tells you where you are, but not where the treasure moved overnight.
🥬 Filling (The Actual Concept): Open Questions
- What it is: What we don’t fully know yet.
- How it works: (1) Best ways to update models as UIs evolve without full retraining, (2) Combining reasoning traces with grounding—can we get the best of both?, (3) Handling dynamic content (pop-ups, animations) during training, (4) Cross-language robustness for non-Latin scripts.
- Why it matters: Solving these expands reliability and reach.
🍞 Bottom Bread (Anchor): Tomorrow’s redesign of a shopping site might add a new banner—can the model adapt quickly without forgetting what it already knows?
06 Conclusion & Future Work
Three-Sentence Summary:
- This paper introduces POINTS-GUI-G-8B, a GUI grounding model built from a weaker base by unifying data, adapting the vision encoder, aligning image resolutions, and adding reinforcement learning with verifiable rewards.
- The approach achieves state-of-the-art or competitive results across ScreenSpot-Pro, OSWorld-G, ScreenSpot-v2, UI-Vision, and MMBench-GUI-L2.
- Clear, checkable rewards plus specialized perception training make a powerful, general solution for screen understanding.
Main Achievement:
- Proving that a three-pillar recipe—refined data engineering, resolution-aware training with an unfrozen vision encoder, and RL with exact rewards—can push an 8B model to SOTA grounding across diverse platforms.
Future Directions:
- Expand multilingual and non-Latin robustness, handle dynamic/animated UIs, mix reasoning traces when helpful, and streamline continual learning to keep up with fast UI changes.
Why Remember This:
- It shows that careful data cleanup, right-sized images, specialized perception, and simple yes/no rewards can transform screen-pointing accuracy—laying a strong foundation for practical, reliable computer-using agents.
Practical Applications
- Automate online shopping checkouts by reliably finding cart, address, and payment fields.
- Speed up business workflows like invoice entry by clicking the correct form fields and buttons.
- Assist IT support by navigating settings panels across different operating systems to change configurations.
- Help accessibility tools by accurately focusing on the intended UI element when given a spoken command.
- Enable testing bots to validate app UIs by locating exact buttons and inputs under different themes/resolutions.
- Assist data entry across web portals, correctly grounding to text boxes even in cluttered pages.
- Guide educational software that demonstrates how to use complex tools (e.g., IDEs) by clicking exact icons.
- Support RPA (robotic process automation) to work on legacy systems without APIs by interacting via the GUI.
- Improve customer support macros that navigate dashboards and ticketing systems reliably.
- Power personal assistants that file downloads, manage tabs, and adjust settings on desktops.