PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records
Key Summary
- The paper tackles a real-life problem: people often give phones short, vague instructions, so agents must guess the missing details using what they know about the user.
- It introduces PersonalAlign, a task that teaches GUI agents to align with two hidden layers of intent: preferences (what you usually choose) and routines (what you typically do at certain times/places).
- To measure progress, the authors build AndroidIntent, a benchmark with long-term user records and clear labels for preferences and routines.
- They design HIM-Agent, a memory system that keeps learning from a user’s past and organizes it into two helpful shelves: Preference Intent Memory and Routine Intent Memory.
- A Streaming Aggregation Module turns many messy daily actions into stable “prototypes” (summaries) so the agent isn’t fooled by one-off moments.
- An Execution-based Preference Filter matches both what you said and what you actually did (action paths) to recover your likely choices in vague commands.
- A State-based Routine Filter uses time and place patterns to know when to proactively suggest help (and when to stay quiet).
- On the AndroidIntent benchmark, HIM-Agent improves execution under vague instructions and gives better proactive suggestions, reducing critical mistakes and false alarms.
- The work highlights remaining challenges like cold-start (not enough history), limited datasets, and privacy-friendly deployment.
- If successful, phones and apps could feel more like attentive helpers that know your style and timing without constant micromanagement.
Why This Research Matters
Phones and computers are part of everyday life, and people don’t want to spell out every tiny detail each time they tap or type. By learning your usual choices and your daily rhythms, agents can reduce friction—fewer taps, fewer corrections, and less time wasted. Better proactive timing means fewer annoying pop-ups and more “just-in-time” help that feels considerate. A clear benchmark and memory design also guide the whole field toward fair testing and practical, privacy-aware solutions. Over time, such agents can make digital interactions feel more like working with a thoughtful assistant than programming a machine.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how your best friend can guess your order at your favorite restaurant even if you just say “the usual”? Phones and apps don’t usually do that—they wait for very exact instructions.
🥬 The Concept: Implicit intent means what you really want, even if you don’t say every detail out loud.
- How it works:
- In real life, we skip details that repeat (e.g., “order a McDonald’s single meal on Meituan”).
- A smart agent should look at your past actions to fill in the blanks.
- It should also notice your routines (like checking email at 9 a.m. at work) and gently help at the right time.
- Why it matters: Without understanding implicit intent, agents follow instructions too literally and make avoidable mistakes.
🍞 Anchor: If you say “Order McDonald’s,” a smart agent infers you mean “from the nearest McDonald’s on Meituan, single meal,” because that’s what you usually do.
The World Before: GUI agents (phone/computer helpers powered by multimodal LLMs) learned to click, type, and navigate when we told them exactly what to do. That worked in lab tests with full, tidy instructions. But daily life isn’t tidy. People say “Play the first song,” “Order my lunch,” or nothing at all—expecting the helper to remember patterns.
The Problem: Real users omit details. Two big kinds of hidden info show up:
- Preference intent: missing details about your usual choices (which app, which store, which option).
- Routine intent: things you often do at certain times/places even without asking (clock in at 8:50 a.m. at school; check weather at 7 a.m. at home). Agents fail when they treat vague instructions like full recipes.
Failed Attempts:
- Pure retrieval: Just fetch the most similar past example. Weak because phone tasks aren’t only about words—they’re also about action paths (what you tapped) and state (time/place).
- Profile summaries (LLM-written personas): Readable, but can be long, stale, and too general; also cost many tokens and miss fine-grained steps.
- Proactive-only systems: Try to help first, but often over-help (false alarms) without a careful model of when and why.
The Gap:
- No benchmark with long-term, user-centric labels showing which pieces are preferences and which are routines.
- No memory that cleanly separates “what you pick” from “when you act,” and that keeps evolving as you live your life.
🍞 Hook: Imagine a bookshelf with two shelves: one for “what I like” and one for “when I do it.” That’s what a great personal agent needs.
🥬 The Concept: Long-term user records are your diary of app interactions—what you asked, when/where you were, and what actions you took.
- How it works:
- Collect many days of instructions, times, scenarios (home/work/school), screens, and action steps.
- Group similar moments into stable summaries (so one-off weird days don’t dominate).
- Split into preferences (choices) vs routines (timed habits) to guide execution and proactive help.
- Why it matters: Without long-term records, the helper can’t learn your style—every day is “first day at school.”
🍞 Anchor: If your diary shows “open Meituan, pick nearest McDonald’s, order single meal” many times, the agent can safely auto-fill those details next time you say “Order McDonald’s.”
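The “diary” described above can be pictured as a list of structured records. The sketch below is illustrative only: the field names (`instruction`, `hour`, `scenario`, `actions`) are assumptions about what a long-term user record might contain, not the paper’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class InteractionRecord:
    """One entry in a long-term user record (field names are illustrative)."""
    instruction: str    # what the user asked
    hour: int           # when it happened (local hour)
    scenario: str       # where: "home", "work", or "school"
    actions: list       # the GUI action path actually taken

diary = [
    InteractionRecord("Order McDonald's", 12, "home",
                      ["open(Meituan)", "tap(McD nearest)", "tap(single meal)", "pay()"]),
    InteractionRecord("Check my email", 9, "work",
                      ["open(EmailA)", "tap(inbox)"]),
]
```

Everything the later modules need, including what was said, when and where, and what was tapped, lives in one record like this.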
Real Stakes: This is about trust and time. A phone that knows your style makes fewer taps, fewer mis-clicks, and offers help at the right moments (not random pop-ups). It can speed up ordering lunch, checking in for work, or navigating home—tiny wins that add up daily.
02 Core Idea
🍞 Hook: Picture a helpful librarian who not only remembers your favorite books (preferences) but also knows you visit every Friday at 4 p.m. (routines), so they set your picks aside before you arrive.
🥬 The Concept: Hierarchical Implicit Intent Alignment means teaching an agent to first fill in your usual choices (preferences) and then anticipate your timed habits (routines) using your long-term record.
- How it works:
- Learn from past interactions to spot stable patterns.
- Organize them into two layers: Preference Intent (fills missing details) and Routine Intent (suggests at the right times/places).
- Use these layers to improve both reactive execution and proactive assistance.
- Why it matters: Without this hierarchy, agents either under-help (miss preferences) or over-help (annoying proactive alerts).
🍞 Anchor: When you say “Send the email,” the agent picks the email app you always use; at 9 a.m. at work, it gently asks, “Open EmailA to check messages?”
The “Aha!” Moment (one sentence): If we separate what you usually pick from when you usually act—and update these from your long-term behavior—agents can handle vague commands and offer timely help.
Three Analogies:
- Chef: Preferences are your spice levels; routines are your meal times. The chef refills spices for any dish and has dinner ready at 6.
- Bus route: Preferences are your favorite seat; routines are the bus schedule. The system directs you to that seat when the 8:00 bus arrives.
- Sports coach: Preferences are your preferred drills; routines are your practice slots. The coach preps the right drill at the right hour.
Before vs After:
- Before: Agents needed exact instructions; vague ones caused fine-step failures. Proactive help was clumsy with too many false alarms.
- After: Agents infer missing details from Preference Memory and time/place habits from Routine Memory, reducing critical mistakes and over-eager nudges.
Why It Works (intuition, no equations):
- Many small days make strong patterns: repeated choices and timed habits rise above noise.
- Splitting choice-patterns from time-patterns makes reasoning simpler and more accurate.
- Using both what you said (semantics) and what you actually did (action paths) keeps the model honest.
🍞 Hook: You know how we tidy a messy closet into labeled boxes so we can find things fast?
🥬 The Concept: HIM-Agent (Hierarchical Intent Memory Agent) is a memory system that keeps evolving and sorts your life into two labeled boxes: preferences and routines.
- How it works:
- Streaming Aggregation turns many raw events into stable “prototypes.”
- Execution-based Preference Filter checks both words and action steps to lock in your usual picks.
- State-based Routine Filter checks time/place stability to decide when to suggest proactively.
- Why it matters: Without neat boxes, the agent confuses one-offs with habits and helps at the wrong times.
🍞 Anchor: After weeks of use, HIM-Agent knows you order “nearest McDonald’s single meal on Meituan” and prompts at noon on weekdays at home if you often do that then.
🍞 Hook: Think of a school exam that tests not just right answers, but how you handle unclear questions and when you should raise your hand to help.
🥬 The Concept: AndroidIntent Benchmark is a test built from real long-term user records that measures two things: fixing vague instructions (preferences) and smartly offering help (routines).
- How it works:
- Gather months of user interactions.
- Filter and verify candidates into preferences vs routines using simple statistics plus human checks.
- Evaluate execution and proactive performance with fair metrics.
- Why it matters: Without a good test, we can’t tell if an agent truly understands users over time.
🍞 Anchor: It’s like a driving test that includes foggy weather (vague commands) and knowing when to turn on headlights automatically (proactive help).
03 Methodology
At a high level: Long-term Records → Streaming Aggregation (build prototypes) → Split into Preference Memory (execution-based filter) + Routine Memory (state-based filter) → Agent uses the right memory to execute vague instructions or to suggest proactively.
🍞 Hook: Imagine turning a messy diary into neat summary cards you can flip through quickly.
🥬 The Concept: Streaming Aggregation Module turns many small, noisy interaction records into stable prototypes that represent your recurring behaviors.
- How it works:
- Read each day’s interactions (instruction, time, scenario, actions).
- Group similar records by both meaning (what you asked) and behavior (how you tapped and typed).
- Update a “center” summary (the prototype) that best represents the group; keep doing this daily so memory evolves.
- Why it matters: Without prototypes, memory drifts with one-off events and becomes bulky and confusing.
🍞 Anchor: Instead of keeping 25 slightly different lunch orders, the module keeps one tidy “nearest McDonald’s single meal on Meituan” card.
Step A: Build Prototypes
- What happens: The module checks if a new record fits an existing prototype by measuring consistency. If yes, it joins; if not, it starts a new prototype.
- Why it exists: Keeps memory compact and stable; makes later matching faster.
- Mini example: Ten days of “Open EmailA to check inbox at 9 a.m. work” become one robust card.
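Step A can be sketched as a small streaming loop. This is a minimal toy, not the paper’s algorithm: word-overlap (Jaccard) similarity stands in for the real semantic-plus-behavior consistency check, and the threshold and center-update rule are assumptions.

```python
from dataclasses import dataclass, field

def similarity(a, b):
    """Word-overlap (Jaccard) similarity, a crude stand-in for embeddings."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

@dataclass
class Prototype:
    center_instruction: str
    members: list = field(default_factory=list)

def aggregate(prototypes, record, threshold=0.5):
    """Attach a new record to its best-matching prototype, or start a new one."""
    best, best_sim = None, 0.0
    for p in prototypes:
        s = similarity(p.center_instruction, record)
        if s > best_sim:
            best, best_sim = p, s
    if best is not None and best_sim >= threshold:
        best.members.append(record)  # consistent: join the existing prototype
    else:
        prototypes.append(Prototype(record, [record]))  # novel: new prototype
    return prototypes

protos = []
for record in ["Order nearest McDonald's single meal on Meituan",
               "Order McDonald's single meal on Meituan",
               "Open EmailA to check inbox"]:
    aggregate(protos, record)
print(len(protos))  # 2 prototypes: one lunch-order card, one email card
```

Running daily, a loop like this keeps memory compact: twenty-five slightly different lunch orders collapse into one card with many members.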
🍞 Hook: When you say “Order McDonald’s,” how does the agent know you meant Meituan, nearest store, single meal?
🥬 The Concept: Execution-based Preference Filter locks in what you usually choose by looking at two clues: your words and your action path.
- How it works:
- Semantic match: Compare today’s instruction to a prototype’s instruction using embeddings and simple word overlap (helps with short, app-heavy phrases).
- Action match: Compare today’s action steps to the prototype’s steps using a path-matching idea (align sequences that are similar even if timings differ).
- Combine both to decide if this is a stable preference and store it in the Preference Intent Memory with its “center” instruction and action.
- Why it matters: Words can be vague; actions reveal what you truly did. Together, they’re much stronger.
🍞 Anchor: If your path is always “open Meituan → choose nearest McDonald’s → single meal,” the filter recognizes this as your go-to choice.
Step B: Secure Preferences
- What happens: The filter says, “Yes, this looks like your usual choice,” and records the best representative instruction and action plan.
- Why it exists: So the agent can auto-fill missing details during vague commands.
- Mini example with data: Instruction: “Order McDonald’s.” Prototype: I_c = “Order nearest McDonald’s single meal on Meituan.” A_c = [open(Meituan), tap(McD nearest), tap(single meal), pay()].
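The two-clue check in Step B can be illustrated with a toy scoring function. The weights, threshold, and matchers here are assumptions: word overlap stands in for embedding similarity, and `difflib.SequenceMatcher` stands in for the paper’s action-path alignment.

```python
from difflib import SequenceMatcher

def semantic_match(inst_a, inst_b):
    # word-overlap stand-in for embedding similarity
    wa, wb = set(inst_a.lower().split()), set(inst_b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def action_match(path_a, path_b):
    # fraction of aligned steps between two action sequences
    return SequenceMatcher(None, path_a, path_b).ratio()

def is_stable_preference(record, prototype, w=0.5, threshold=0.6):
    """Combine what was said (semantics) with what was done (action path)."""
    score = (w * semantic_match(record["instruction"], prototype["instruction"])
             + (1 - w) * action_match(record["actions"], prototype["actions"]))
    return score >= threshold

proto = {"instruction": "Order nearest McDonald's single meal on Meituan",
         "actions": ["open(Meituan)", "tap(McD nearest)", "tap(single meal)", "pay()"]}
today = {"instruction": "Order McDonald's",
         "actions": ["open(Meituan)", "tap(McD nearest)", "tap(single meal)", "pay()"]}
print(is_stable_preference(today, proto))  # True
```

Note how the vague instruction alone scores poorly, but the identical action path pushes the combined score over the bar, which is exactly the “words can be vague; actions reveal what you truly did” intuition.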
🍞 Hook: Think of how your smartwatch learns you jog every weekday at 7 a.m. in the park and asks, “Start workout?”
🥬 The Concept: State-based Routine Filter finds your timed habits by checking how consistent they are in time and place and how often they recur.
- How it works:
- Count how frequently a prototype appears.
- Measure stability of time (hours) and scenario (home, work, school) across its past cases.
- Combine these into a “proactive confidence.” If high, store it in Routine Intent Memory with its center intent and most common state.
- Why it matters: Without this, the agent either nags too much or never offers help at the right moment.
🍞 Anchor: If you often open EmailA to check messages around 9 a.m. at work, the agent suggests it then—and stays quiet otherwise.
Step C: Trigger Proactive Suggestions
- What happens: When the current time and place match a routine with high confidence, the agent offers a single short, faithful suggestion.
- Why it exists: Gentle, well-timed help builds trust; random pop-ups do not.
- Mini example: 12:05 p.m., home → “Do you want me to order your usual lunch on Meituan?”
Putting It Together (Input → Steps → Output)
- Input: Long-term interaction records (instruction, time, scenario, actions) + current state.
- Steps: Aggregate to prototypes → extract preferences (execution-based) → extract routines (state-based).
- Output: For vague commands, fill in missing details; for routine times, propose one precise suggestion.
The Secret Sauce
- Double vision: Look at both what you said and what you did.
- Two shelves: Keep preferences and routines separate to avoid confusion.
- Streamed learning: Update daily so the memory grows with you without forgetting or bloating.
04 Experiments & Results
🍞 Hook: When you take a test, it’s not just right or wrong—the teacher looks at how early a mistake happens and how serious it is.
🥬 The Concept: Type Accuracy, Step-wise Success Rate (SSR), and Cumulative Error Rate (CER) are three lenses to judge agent performance.
- How it works:
- Type Accuracy: Did the agent pick the right kinds of actions overall?
- SSR: How many steps in the action path were completed successfully?
- CER: Weighs mistakes more if they happen early in the process (since early wrong turns ruin everything after).
- Why it matters: Vague instructions often cause early, critical slips (like opening the wrong app), which CER catches.
🍞 Anchor: If the agent opens the wrong app as Step 1, CER punishes that more than a tiny typo at Step 7.
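The three lenses can be made concrete with toy implementations. These are simplified stand-ins for the benchmark’s actual metrics: in particular, the linearly decaying error weights in `cumulative_error_rate` are an assumption that captures the idea of punishing early mistakes more.

```python
def stepwise_success_rate(pred, gold):
    # fraction of positions where predicted and gold actions agree (toy SSR)
    n = len(gold)
    hits = sum(1 for i in range(n) if i < len(pred) and pred[i] == gold[i])
    return hits / n

def cumulative_error_rate(pred, gold):
    # errors weighted by earliness: a wrong step i costs (n - i), normalized
    n = len(gold)
    weights = [n - i for i in range(n)]
    err = sum(w for i, w in enumerate(weights)
              if i >= len(pred) or pred[i] != gold[i])
    return err / sum(weights)

gold = ["open(Meituan)", "tap(McD)", "tap(single meal)", "pay()"]
good = ["open(Meituan)", "tap(McD)", "tap(coupon)", "pay()"]    # late slip
bad  = ["open(Eleme)", "tap(McD)", "tap(single meal)", "pay()"]  # wrong app first
print(cumulative_error_rate(good, gold))  # 0.2: one error at step 3
print(cumulative_error_rate(bad, gold))   # 0.4: one error at step 1
```

Both trajectories miss exactly one step, so their SSR is identical, yet CER doubles for the early mistake: this is why CER exposes the “wrong first app” failures that vague instructions cause.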
The Test (AndroidIntent): Built from about 20k long-term records, with 775 labeled preference cases and 215 routine cases, plus human verification. It stresses two skills: fixing vague commands and offering appropriate proactive help.
The Competition: The authors evaluated several GUI agents (open and closed source). They compared “plain” agents to versions upgraded with different memory strategies: recent-retrieval, LLM-written user profiles, and the proposed HIM-Agent.
The Scoreboard (with context):
- Under vague instructions, many agents had only a small drop in Type Accuracy but much larger drops in SSR and big increases in CER—like getting most question types right but missing key early steps that sink the whole project.
- HIM-Agent improved execution and reduced critical errors under vagueness (higher Type and SSR, lower CER vs baselines). Think of it as moving from a B- to a solid A- on tricky, unclear questions.
- Proactive suggestions: HIM-Agent achieved better balance—good recall while cutting false alarms. Compared to systems that shouted “Can I help?!” too often, HIM-Agent felt more like a thoughtful assistant.
🍞 Hook: You know how a shopkeeper learns your usual order but also knows not to ask you at the wrong time?
🥬 The Concept: Identification Alignment (precision, recall, false-alarm, F1) checks if proactive help appears when it should—and stays quiet when it shouldn’t.
- How it works:
- Precision: Of the times it offered help, how often was that correct?
- Recall: Of the times you needed help, how often did it show up?
- False-Alarm: How often did it suggest help when you didn’t need it?
- F1: A balanced score for precision and recall.
- Why it matters: Proactive systems must be helpful, not pushy.
🍞 Anchor: A good assistant asks about lunch near noon if you usually eat then, not at midnight.
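These four scores are standard detection metrics, and a minimal sketch makes their relationship concrete. The representation of suggestions as parallel boolean lists per time window is an assumption for illustration.

```python
def identification_alignment(suggested, needed):
    """suggested/needed: parallel booleans, one per time window."""
    tp = sum(s and n for s, n in zip(suggested, needed))          # helped when needed
    fp = sum(s and not n for s, n in zip(suggested, needed))      # pushy pop-up
    fn = sum(n and not s for s, n in zip(suggested, needed))      # missed chance
    tn = sum(not s and not n for s, n in zip(suggested, needed))  # rightly quiet
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    false_alarm = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, false_alarm, f1

# the agent suggests in 3 of 5 windows; help was actually needed in 3 of them
suggested = [True, True, True, False, False]
needed    = [True, True, False, True, False]
p, r, fa, f1 = identification_alignment(suggested, needed)
```

In this example precision and recall are both 2/3 while the false-alarm rate is 1/2: the single pushy pop-up costs the agent on two different axes at once, which is why cutting false alarms matters so much for proactive systems.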
Surprising Findings:
- Vague instructions mainly break fine-grained steps, not the high-level goal. That’s why Type Accuracy barely moves, but SSR and CER show big changes.
- Without a strong state filter, proactive systems can become over-eager; adding time/place stability dramatically lowers false alarms.
- Summarizing user profiles with big LLM prompts consumed many extra tokens yet didn’t beat the more structured HIM-Agent approach.
05 Discussion & Limitations
🍞 Hook: Think of training wheels that help you learn to ride but also limit how fast or far you can go.
🥬 The Concept: Limitations are the edges where today’s system still struggles.
- How it works:
- Dataset limits: Long-term, privacy-safe records are rare; current results lean on a single major source.
- Cold-start: With little or no history, the system can’t guess your style well.
- Environment quirks: Real phones, app versions, and devices vary, making online evaluation hard.
- Privacy: Personalization requires care—on-device learning/federated options are important next steps.
- Why it matters: Knowing limits helps researchers improve the right parts next.
🍞 Anchor: On day one with a new phone, the helper can’t know your favorites yet—so it asks more questions before it learns.
Required Resources:
- Access to long-term interaction logs (under consent and privacy safeguards).
- A capable MLLM-based GUI agent backbone.
- Compute/storage for streaming memory updates and matching.
When NOT to Use:
- New users with almost zero history (use conservative defaults first).
- Highly sensitive tasks where proactive suggestions could leak private intent on shared devices.
- Extremely dynamic contexts where habits change daily (old routines may mislead).
Open Questions:
- How to learn faster from fewer examples (few-shot personalization) without overfitting?
- How to combine more signals (sensors, calendar) safely to boost routine detection?
- How to evaluate online at scale on real devices while preserving privacy?
- How to recover gracefully when preferences or routines shift suddenly (concept drift)?
06 Conclusion & Future Work
3-Sentence Summary: People give short, vague commands and expect agents to fill in the blanks; PersonalAlign teaches agents to align with two hidden layers—preferences and routines—by learning from long-term records. AndroidIntent provides a realistic benchmark to test this, and HIM-Agent supplies a memory that separates and updates preferences and routines. Together, they reduce critical execution errors and make proactive help more accurate and timely.
Main Achievement: A clean, practical split between “what you choose” (preference) and “when you act” (routine), backed by a continuously updated memory that makes vague instructions clearer and proactive suggestions smarter.
Future Directions: Scale to more devices and apps; handle cold-start with few-shot learning; add privacy-first training (on-device/federated); and progress toward safe, proactive execution (not just suggestion) with robust online evaluation.
Why Remember This: Because it reframes personal agents from command-followers into thoughtful partners—ones that learn your style, respect your timing, and help without fuss—bringing us closer to tech that feels like a considerate friend rather than a literal robot.
Practical Applications
- Auto-fill missing details for shopping or food delivery based on your usual app, store, and item choices.
- Suggest checking work email at your typical morning time and place, but stay quiet on weekends or vacations.
- Speed up navigation by suggesting your common routes when you leave work or school.
- Offer your regular workout or step-tracking at the times you usually exercise.
- Recommend your preferred news or music sources when you open those apps with vague requests like “Play something.”
- Pre-select common payment methods or delivery options to reduce form-filling.
- Propose calendar or reminder actions tied to consistent times/locations (e.g., clock-in/clock-out).
- Recover from vague commands during multitasking by inferring the right app and detailed steps.
- Personalize multi-step app flows (e.g., coupon selection or membership options) that you frequently choose.
- Reduce on-screen clutter by skipping steps you always skip and jumping straight to your usual choice.