MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
Key Summary
- •MemGUI-Bench is a new test that checks how well phone-controlling AI agents can remember important information both during a task and across different tries.
- •Past benchmarks barely tested memory (only about 5–12% of tasks), so agents looked better than they really were when memory was needed.
- •This benchmark has 128 tasks in 26 apps, and almost 90% of them force the agent to remember details across time or across different apps.
- •A special automatic judge called MemGUI-Eval uses a three-stage 'Progressive Scrutiny' process to grade results accurately and cheaply.
- •New memory-focused metrics measure not just pass/fail, but how much information the agent successfully kept and used (IRR), and how fast it learns from mistakes (FRR).
- •Across 11 leading agents, short-term memory proved essential for agents to function at all, while long-term memory delivered big extra gains (+21.9 percentage points in success over multiple tries).
- •The hardest challenge for agents is moving info across apps (like copying prices from Amazon into notes), which caused 16–40 point performance drops.
- •Giving the agent a longer conversation memory (long context) helped a lot (+18.8 points in single-try success for one agent).
- •However, stronger memory designs can be expensive in tokens and time, so smart, efficient memory use matters.
- •The paper also maps five common failure types and offers five practical design tips to build better, more reliable GUI agents.
Why This Research Matters
Real people use phones to move information across apps all the time—prices to notes, addresses to maps, dates to calendars. For AI assistants to be truly helpful, they must remember details across steps and learn from past mistakes. MemGUI-Bench focuses exactly on those memory skills, so we can see where agents fail and how to fix them. Better memory in agents means fewer do-overs, less copying by hand, and more reliable help with real tasks. It can also boost accessibility, helping users who find app-jumping or detail-tracking difficult. By balancing accuracy and cost, the benchmark points the way to assistants that are both smart and practical for everyday use.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re helping a friend use a phone. You look up a price on Amazon, remember it, switch to a notes app, and type it in. Easy for you—because you remember things while you work and you learn tricks over time. 🥬 The Concept: Before this paper, many tests for phone-controlling AI agents mostly checked if they could tap buttons and follow simple steps, not whether they could remember across steps or across apps. How it works (then): 1) Old benchmarks gave lots of short, simple tasks. 2) They rarely made agents carry exact info (like a number or a name) across screens or apps. 3) They didn’t ask agents to try again and improve using what they learned. Why it matters: Without memory tests, we can’t tell if an agent will succeed on real phone jobs that need remembering. 🍞 Anchor: Think about copying a flight time from a browser into your calendar—if you forget the number mid-way, you can’t finish the job.
🍞 Hook: You know how you remember a one-time code when logging in? That’s short-term memory. And how you remember where the Settings are after using a phone for weeks? That’s long-term memory. 🥬 The Concept: AI phone agents also need two kinds of memory: short-term (within one task) and long-term (across different sessions). How it works: 1) Short-term holds steps, results, and UI changes while solving a problem. 2) Long-term keeps lessons from past successes and failures so the agent can do better next time. 3) Together, they help finish tasks now and get smarter later. What breaks without it: Without short-term memory, agents lose track of details mid-task; without long-term memory, they keep making the same mistakes every new try. 🍞 Anchor: If an agent reads an apartment’s address in one app and must paste it in a map app later, short-term memory makes that possible; if it learns a faster route the next time it sees a similar app, that’s long-term memory.
🍞 Hook: Picture a treasure hunt where clues are spread out over time and in different rooms. You must remember an old clue (time) and carry it to a new room (place). 🥬 The Concept: Real phone tasks often require cross-temporal retention (remembering info over many steps) and cross-spatial retention (remembering across different apps). How it works: 1) The agent finds details (like price or address). 2) It keeps them safe through later steps or app switches. 3) It uses them exactly when needed. Why it matters: If the info slips at any point, the whole plan falls apart. 🍞 Anchor: The benchmark has tasks like: read a rent price in Apartments.com, look up a work address in Bing, check commute time in Citymapper, then record findings in Joplin.
🍞 Hook: Teachers don’t just give one kind of test; they give quizzes, projects, and finals so you can show what you really know. 🥬 The Concept: Old agent tests lacked proper memory challenges and fair judging methods. How it worked before: 1) Too few memory-demanding tasks (only ~5–12%). 2) No multi-try grading to see if agents learn across attempts. 3) Judges that either used rigid rules (hard to scale) or dumped too much info into an LLM at once (overload). Why it matters: Agents looked strong on easy stuff but stumbled on real app-bridging memory tasks—and we couldn’t measure it well. 🍞 Anchor: It’s like claiming someone is great at basketball after only watching them dribble once, never seeing them pass, shoot, or play a full game.
🍞 Hook: Imagine a new game that’s designed to truly test both your short-term ‘remember-now’ and long-term ‘learn-over-time’ skills. 🥬 The Concept: This paper introduces MemGUI-Bench, a benchmark built from the ground up to test memory in realistic phone scenarios. How it works: 1) 128 tasks across 26 apps, with nearly 90% requiring memory across time or apps. 2) A snapshot-based system to reset phones consistently and let agents try again (pass@k). 3) An automated judge (MemGUI-Eval) that checks results carefully, step by step, with memory-centric metrics. Why it matters: Now we can see the true strengths and weaknesses of GUI agents—and help them get better, faster. 🍞 Anchor: When an agent succeeds on the second or third try by using what it learned, the benchmark captures that improvement instead of ignoring it.
02 Core Idea
🍞 Hook: You know how a good backpack has two pockets—one for quick-grab items and one for long-term storage? Phone agents need both kinds of ‘memory pockets.’ 🥬 The Concept: The “aha!” is to benchmark GUI agents in a way that truly measures short-term retention within tasks and long-term learning across attempts, using realistic multi-app tasks and a careful, staged judge. How it works: 1) Create many real tasks that force remembering across time and apps. 2) Let agents retry (pass@k) so we can see learning. 3) Judge results with a progressive, evidence-first pipeline and memory-specific metrics (like IRR and FRR). Why it works: It isolates memory skills from general navigation, uncovers hidden gaps, and shows where longer context or long-term memory really helps. 🍞 Anchor: If the agent forgets a price halfway, IRR drops; if it fixes a mistake on the second try, FRR rises—now we can track both.
Three ways to picture it:
- School analogy: Daily class notes (short-term) + study guides built over weeks (long-term). A fair teacher grades quizzes, re-tests, and uses detailed rubrics (metrics) rather than only a final yes/no.
- Sports analogy: In-game memory (who’s open right now) vs. season memory (what strategies worked last week). The referee doesn’t watch only the last second; they check key plays stepwise.
- Kitchen analogy: Remember steps during a recipe (short-term) and improve next time by adjusting spices (long-term). The taste test checks both following directions and learned improvements.
Before vs After:
- Before: Agents often looked good on simple, single-app tasks with little memory load.
- After: On MemGUI-Bench, big capability gaps appear when tasks require carrying precise info across time/apps, and we can finally measure cross-session learning gains.
Why it works (intuition):
- Realism: Cross-app, multi-step tasks mirror real phone use.
- Retry-aware: pass@k reveals whether agents learn from mistakes.
- Smart judging: Progressive Scrutiny reduces overload and evaluates the exact evidence needed.
- Memory-focused metrics: IRR shows how much info was correctly kept and used; FRR shows how quickly learning turns failure into success.
Building blocks (with sandwiches):
- 🍞 Hook: You know how we sort books by genre? 🥬 Memory Taxonomy: a simple map separating short-term (in-session) and long-term (cross-session) memory; without it, you can’t diagnose which type failed. 🍞 Anchor: If an agent keeps losing codes mid-task, it’s short-term trouble; if it never gets faster over weeks, it’s long-term trouble.
- 🍞 Hook: Imagine bookmarking a page to revisit later. 🥬 Cross-Temporal Retention: remembering specific info over many steps; without it, details evaporate mid-journey. 🍞 Anchor: Keep a rent price in mind while switching screens to compute cost-per-square-foot later.
- 🍞 Hook: Think of carrying a sticky note from the kitchen to the garage. 🥬 Cross-Spatial Retention: holding info when switching apps; without it, app-jumping breaks the task. 🍞 Anchor: Copying a model number from Amazon into a notes app accurately.
- 🍞 Hook: Like getting three attempts on a quiz. 🥬 pass@k: grading whether you succeed within k tries, capturing learning; without it, we miss long-term gains. 🍞 Anchor: If you pass on the 2nd try, your pass@3 counts that win.
- 🍞 Hook: When solving a mystery, you first check the obvious clues, then dig deeper if needed. 🥬 Progressive Scrutiny: a staged judge (Triage → Semantic → Visual) that requests just the right evidence to decide; without it, judging becomes either too rigid or too confused. 🍞 Anchor: If last screenshots already prove success, it stops early; if not, it asks for specific earlier screens.
03 Methodology
At a high level: Input (task + phone emulator) → [Task dispatch and agent action] → [Multi-step interaction across apps] → [Automated judging with Progressive Scrutiny] → Output (scores for memory, learning, and efficiency).
- Snapshot-based, plug-and-play environment
- 🍞 Hook: You know how games have save points so you can retry the hard level? 🥬 The Concept: A snapshot system resets the phone emulator to a clean, consistent state for fair retries and parallel runs; without it, one bad state can break the next test. How it works: 1) Preload apps on an Android image. 2) Launch multiple isolated emulators (by ports) to run many agents. 3) Snapshots let you instantly reset between attempts. Why it matters: Consistency makes results trustworthy and scalable. 🍞 Anchor: If attempt 1 fails at step 17, reset and cleanly try again (attempt 2) with the same starting phone state.
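The save-point idea above can be sketched as thin wrappers around the Android emulator console, which `adb emu` forwards commands to. The serial numbers and snapshot name below are illustrative, not values from the paper:

```python
import subprocess

def snapshot_cmd(serial: str, action: str, name: str) -> list[str]:
    """Build the adb command that saves or loads a named emulator snapshot."""
    assert action in ("save", "load")
    return ["adb", "-s", serial, "emu", "avd", "snapshot", action, name]

def reset_emulator(serial: str, snapshot: str = "clean_state") -> None:
    """Restore one emulator to the pristine snapshot before a new attempt."""
    subprocess.run(snapshot_cmd(serial, "load", snapshot), check=True)

# Each parallel emulator instance listens on its own console port,
# so multiple agents can run side by side without sharing state.
serials = [f"emulator-{5554 + 2 * i}" for i in range(3)]
```

With one snapshot per clean phone image, attempt 2 starts from exactly the same state as attempt 1, which is what makes multi-try comparisons fair.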
- Task suite that stresses memory
- Design: 128 tasks, 26 apps, 89.8% memory-heavy; 78.1% involve cross-app info transfer; balanced difficulties; mirror-task pairs to test transfer learning.
- Example flow: Apartments.com (address) → Bing (work address) → Citymapper (commute time) → Joplin (note the results). The agent must remember across time and apps.
- Multi-attempt protocol (pass@k)
- 🍞 Hook: Think of getting up to three swings to hit a baseball. 🥬 The Concept: pass@k counts success if the agent solves the task within k tries, capturing learning across attempts; without it, you can’t tell if the agent improved or just got lucky. How it works: 1) Attempt 1 runs. 2) If fail or timeout, reset snapshot. 3) Attempt 2, maybe with updated long-term memory. 4) Repeat until k or success. 🍞 Anchor: An agent that fails first but succeeds on attempt 2 shows real learning that gets recorded.
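The attempt loop above can be sketched as follows; `run_attempt` and the toy agent are hypothetical stand-ins for an agent rollout and its long-term memory store, not APIs from the paper:

```python
def pass_at_k(run_attempt, k: int, memory=None) -> tuple[bool, int]:
    """Run up to k attempts; return (succeeded, winning attempt or -1).

    run_attempt(attempt, memory) should reset the snapshot, execute the
    task, update `memory` with lessons, and return True on success.
    """
    for attempt in range(1, k + 1):
        if run_attempt(attempt, memory):
            return True, attempt
    return False, -1

# Toy example: an agent that "learns" and succeeds on its second try.
lessons = []
def toy_agent(attempt, memory):
    memory.append(f"lesson from attempt {attempt}")
    return attempt >= 2

ok, winning_attempt = pass_at_k(toy_agent, k=3, memory=lessons)
# A second-try success still counts toward pass@3.
```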
- Progressive Scrutiny judge (MemGUI-Eval)
- Stage 1: Triage Judge
- 🍞 Hook: Like skimming the last page to quickly confirm the ending. 🥬 What it is: A fast, cautious check using minimal evidence (raw actions + last screenshots); it only says “success” when proof is crystal-clear; otherwise, escalate. Why it exists: Saves cost/time on easy calls; without it, judging is slow and expensive. 🍞 Anchor: If the final screen shows the event created in the calendar exactly as requested, it can stop here.
- Stage 2: Step Descriptor + Semantic Judge
- 🍞 Hook: It’s like turning a messy play-by-play into a clear summary. 🥬 Step Descriptor: turns before/after screens into step-by-step text descriptions to add meaning. Why needed: Without summaries, long trails are hard to reason about. 🍞 Anchor: “Typed ‘RTX 4070 Super’ in Amazon search; opened product page; scrolled to price.”
- 🍞 Hook: Like a teacher checking your reasoning, not just your final answer. 🥬 Semantic Judge: uses the task goal + textual step summaries + last screenshots to decide; computes IRR for partial memory success when needed; asks for specific earlier screenshots if still unsure. Why needed: Without this, subtle memory wins/losses are missed. 🍞 Anchor: “Agent remembered 2/3 required items—IRR = 66.7%.”
- Stage 3: Visual Judge
- 🍞 Hook: When unsure, you pull the exact pages you asked for. 🥬 What it is: The judge receives only the specific earlier screenshots it requested, stitched together, to make a final, high-precision decision; no guessing allowed—missing proof means fail. Why needed: Prevents overload and ensures evidence-based truth. 🍞 Anchor: It checks the screenshot where the price first appeared to confirm the copied value is exact.
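The three-stage escalation can be sketched as a dispatcher that only moves to a costlier stage when the cheaper one cannot decide. The judge callables here are placeholders for the LLM/VLM calls; the `None`-means-escalate convention is our assumption:

```python
from typing import Callable, Optional

Verdict = Optional[bool]  # True/False = confident decision, None = escalate

def progressive_scrutiny(
    triage: Callable[[], Verdict],
    semantic: Callable[[], Verdict],
    visual: Callable[[], bool],
) -> tuple[bool, str]:
    """Stage 1 sees raw actions + final screenshots; Stage 2 adds textual
    step summaries; Stage 3 gets only the requested earlier screenshots
    and must decide -- missing proof means fail."""
    v = triage()
    if v is not None:
        return v, "triage"
    v = semantic()
    if v is not None:
        return v, "semantic"
    return visual(), "visual"

# Easy case: the final screen already proves success, so judging stops early.
result, stage = progressive_scrutiny(lambda: True, lambda: None, lambda: False)
```

Stopping at the cheapest decisive stage is what keeps judging both accurate and affordable.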
- Memory-focused metrics (the ‘rubric’)
- 🍞 Hook: A report card doesn’t just say pass/fail; it shows how well you did in each skill. 🥬 The Concepts:
- Success Rate (SR): percent of tasks finished in a single try; base performance. Without it, we can’t compare overall task ability. Example: 32.8% means about one in three tasks finished.
- Information Retention Rate (IRR): how much required info the agent correctly kept and used. Without it, we can’t see partial memory wins. Example: Remembered 2 of 3 needed prices → IRR 66.7%.
- Memory-Task Proficiency Ratio (MTPR): compares performance on memory-heavy tasks vs. standard tasks. Without it, we can’t isolate memory skill from general ability. Example: If SR is 30% on memory tasks and 60% on standard tasks, MTPR = 0.5.
- pass@k SR: multi-try success within k attempts (long-term learning). Without it, we miss improvement over attempts. Example: If success arrives on try 2 of 3, it counts.
- Failure Recovery Rate (FRR): rewards faster recovery from early failures by weighting earlier successful attempts more heavily. Without it, slow learners look the same as fast learners. Example: recovering on attempt 2 scores more than recovering on attempt 3.
- Efficiency (Step Ratio, Time/Step, Cost/Step): shows how direct, fast, and affordable the agent is when it does succeed. Without it, a win that’s too slow or too costly might be unusable. 🍞 Anchor: A final report might say: “SR 32.8%, IRR 39%, pass@3 49.2%, FRR 21.5%, Step Ratio 1.8×, Time/Step 5.3s, Cost/Step $0.02.”
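A minimal sketch of this rubric in code. The IRR and MTPR formulas follow the worked examples above; the FRR weighting is a hypothetical linear scheme (the paper only specifies that earlier recoveries score higher), and the function names are ours:

```python
def irr(retained: int, required: int) -> float:
    """Information Retention Rate: share of required items kept and used."""
    return retained / required

def mtpr(sr_memory: float, sr_standard: float) -> float:
    """Memory-Task Proficiency Ratio: memory-task SR relative to standard SR."""
    return sr_memory / sr_standard

def frr_weight(success_attempt: int, k: int) -> float:
    """Per-task recovery score (illustrative linear weighting).

    First-try successes and outright failures contribute 0; a recovery
    on an earlier attempt is worth more than one on a later attempt.
    """
    if success_attempt <= 1 or success_attempt > k:
        return 0.0
    return (k - success_attempt + 1) / k

retention = irr(2, 3)            # remembered 2 of 3 items -> ~66.7%
proficiency = mtpr(0.30, 0.60)   # 0.5, matching the example above
```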
- Secret sauce
- Intelligent judging: Progressive Scrutiny minimizes overload and cost, while keeping accuracy high.
- Mirror task pairs: test whether success patterns transfer to related tasks.
- Snapshot reset: fair, fast, repeatable attempts that make long-term learning measurable.
04 Experiments & Results
- The test: What and why
- Measured short-term memory (single try) and long-term learning (multi-try) on 128 tasks across 26 apps.
- Used 7 metrics, especially IRR (how much info was retained) and FRR (how fast failure turned into success).
- Goal: Reveal true memory abilities in realistic, multi-app, multi-step phone use.
- The competition: Who was tested
- 11 state-of-the-art agents from two families:
- Agentic Workflow (framework-based) like Agent-S2, M3A, T3A, Mobile-Agent-E.
- Agent-as-a-Model (end-to-end VLMs) like UI-Venus-7B, GUI-Owl-7B, UI-TARS-1.5-7B, CogAgent.
- All used a shared backbone setup for fairness.
- The scoreboard with context
- Big gap vs. old benchmarks: When tasks actually required memory, performance dropped sharply (e.g., GUI-Owl-7B fell from ~66% to ~6.2%). That’s like getting an A on spelling but a D when asked to write a full story.
- Best single-try (short-term) performance: M3A at 32.8% overall SR—like scoring 33 out of 100 on tough memory quizzes; it shows how hard these tasks are.
- Best multi-try (long-term) performance: Agent-S2 at 49.2% pass@3—like improving from a D to a solid C+ with extra attempts that reflect learning.
- Memory is mandatory: Removing short-term memory made agents basically non-functional (IRR often dropped to 0%).
- Long-term memory helps: Agent-S2 gained +21.9 percentage points with multiple attempts and had high FRR (21.5%), proving it learns from mistakes across tries.
- Long context helps too: Letting M3A use multi-turn, longer context boosted single-try SR by +18.8 percentage points (to 51.6% in that setting).
- Cross-app is the bottleneck: Success dropped 16–40 points from single-app to four-app tasks. Carrying info across apps is especially hard.
- Surprising insights
- Some agents did okay on standard tasks but collapsed on memory-heavy tasks (low MTPR), exposing “hidden weakness” not seen in earlier benchmarks.
- Token limits can collapse fancy memory systems: Under strict token budgets, a powerful long-term memory agent (like Agent-S2) fell to 0% pass@3, showing that practical deployment must balance smarts with cost.
- Lightweight models didn’t benefit from token limits—they already used few tokens—but also didn’t solve memory-heavy tasks; being cheap isn’t enough.
- Failure modes (why things go wrong)
- 🍞 Hook: When you tell a story and forget key parts, people get lost. 🥬 The Concept: Five main failure types were found:
- Partial Memory Hallucination (PMH): remembers some items wrong or missing.
- Process Memory Hallucination (ProcMH): loses the task plan mid-way.
- Output Memory Hallucination (OMH): writes down incorrect remembered info.
- Knowledge Deficiency (KD): doesn’t know facts needed to proceed.
- Intent Misunderstanding (IM): misreads the user’s goal. Why it matters: Over half of non-timeout failures were memory hallucinations—showing memory is the core struggle. 🍞 Anchor: Mixing up “USD to EUR” vs. “USD to GBP” or forgetting which apartment had the balcony.
- What the numbers mean in plain terms
- If 32.8% is like getting a tough pop quiz right only 1 out of 3 times, then +21.9 pp in pass@3 is like raising your grade by retaking the test after actually learning.
- A 16–40 point drop for 4-app tasks is like doing fine in your backyard but getting lost across four neighborhoods when you must carry clues between them.
05 Discussion & Limitations
Limitations
- Coverage: Even with 128 tasks and 26 apps, the mobile world is bigger (logins, personalization, device sensors). Some real-life twists aren’t included to keep resets fast and fair.
- Emulator focus: The snapshot trick depends on emulator-ready apps and guest-mode workflows; certain popular apps that need device-only features aren’t included.
- Judge model choice: MemGUI-Eval uses strong LLM/VLMs with prompts; different model families might shift accuracy/cost slightly.
- Long-term beyond pass@k: Multi-try shows learning across attempts in one evaluation session; true weeks-long, human-like skill building is still an open frontier.
Required resources
- Android emulators with the provided snapshot image;
- Access to LLM/VLM APIs for MemGUI-Eval (e.g., Gemini family, or equivalents);
- GPUs or cloud credits if running heavy agent frameworks at scale;
- Storage for logs, screenshots, and snapshots.
When not to use
- If you only care about single-tap toy demos or one-shot UI clicks;
- If your app ecosystem requires personal logins or hardware-only features that snapshots can’t reset reliably;
- If your budget cannot support any LLM calls—the judge and some agents need them, though costs are controlled by staged scrutiny.
Open questions
- How to compress memory effectively: Can agents keep the ‘right’ bits while staying within small token budgets?
- Better long-term curricula: What task sequences best train durable, reusable skills across apps?
- Robustness in the wild: How do agents handle UI updates, A/B tests, or localization without forgetting?
- Human-in-the-loop memory: Can lightweight hints or corrections dramatically boost IRR and FRR without big costs?
- Safer agents: How can agents avoid risky actions while still retaining just enough info to be helpful?
06 Conclusion & Future Work
Three-sentence summary
- MemGUI-Bench is a new memory-first benchmark for mobile GUI agents that forces real remembering across time and apps, uses multi-try grading, and applies a careful staged judge.
- It reveals large hidden deficits (4–10× gaps), proves short-term memory is required, shows long-term memory gives strong gains (+21.9 pp), and pinpoints cross-app info transfer as the main bottleneck.
- It also balances accuracy and cost through Progressive Scrutiny and offers concrete failure patterns and design tips to guide future agent building.
Main achievement
- Making memory measurable: The paper turns fuzzy “the agent forgot” into precise scores (IRR, FRR, pass@k) and trustworthy judgments that scale across realistic apps and tasks.
Future directions
- Lighter, smarter memory: Tighter representations and selective recall to work within token/time budgets;
- Better long context use: Structured buffers and retrieval that don’t just dump whole histories;
- Richer long-term learning: From pass@k to lifelong improvement across weeks and many apps;
- Stronger robustness: Handling UI changes, locales, and device differences gracefully.
Why remember this
- If we want phone agents that act like capable helpers—not just button clickers—we must test and build their memory. MemGUI-Bench provides the map, the yardstick, and the referee to get there.
Practical Applications
- •Evaluate your GUI agent’s short-term memory using IRR to spot where details get dropped in multi-step tasks.
- •Use pass@k to measure long-term learning, and tune your memory module until FRR shows faster recovery from failures.
- •Adopt a staged judge (like Progressive Scrutiny) to cut evaluation costs while improving decision accuracy.
- •Design tasks with mirror pairs to test whether your agent can transfer strategies across similar, later tasks.
- •Add multi-granularity memory buffers (e.g., slots for prices, addresses, dates) to reduce partial memory hallucination.
- •Introduce hierarchical task plans with persistent goals to prevent process memory hallucination mid-task.
- •Leverage longer context windows strategically (multi-turn summaries, selective retrieval) instead of dumping raw histories.
- •Track Average Step Ratio and Time/Step to ensure memory wins don’t come with unusable latency.
- •Stress-test cross-app info transfer (1→4 apps) to reveal your true memory bottlenecks before deployment.
- •Set token budgets and measure degradation to balance capability (LTM) with real-world cost constraints.