RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies
Key Summary
- •RoboMME is a new, big test playground that checks whether robot brains can remember important things over time, not just what they see right now.
- •It covers four kinds of memory: when things happened (temporal), where things are (spatial), which thing is which (object), and how to do a skill again (procedural).
- •The benchmark has 16 long, tricky tasks built to force real memory use, like counting moves, tracking hidden objects, following language hints, and copying a demonstrated path.
- •The authors also built 14 robot brain variants to compare three memory styles (symbolic, perceptual, recurrent) and three ways to plug memory into the model.
- •No single memory works best for everything: symbolic shines at counting and short plans, while perceptual is key for motion-heavy and time-sensitive jobs.
- •The best non-oracle overall performer combines perceptual memory with a 'modulator' plug-in strategy, scoring about 44.5% success across all tasks.
- •Prior systems and small benchmarks weren’t enough; RoboMME gives a standardized, tougher, and fair way to measure progress on memory for robots.
- •Humans still make mistakes on these tasks too, showing the benchmark is truly challenging and realistic.
- •Results in simulation transfer to real robots: counting favored symbolic memory, while drawing precise patterns favored perceptual memory.
Why This Research Matters
Robots that can remember reliably are safer, more helpful, and more trustworthy in real homes, hospitals, and factories. Many real tasks demand memory: counting repetitions, recalling which item belongs to whom, or following a demo precisely. RoboMME gives a fair and challenging way to measure and improve these abilities, so progress is visible and comparable. The study shows that different jobs need different memory tools, steering designers toward the right match instead of one-size-fits-all solutions. The modulator approach with perceptual memory is a practical sweet spot for performance and efficiency, which matters for deployment on limited hardware. Finally, the trends transfer from simulation to the real world, increasing confidence that what we learn here will help real robots help us.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine playing a long board game. If you forget earlier moves, you’ll make a bad choice now. Robots face the same problem: they must remember what already happened to act smartly next.
🥬 The World Before: Robots have gotten good at reacting to what they see right now—like grabbing a visible block or pressing a button in front of them. But many real tasks stretch over time. A home robot may need to: put every book back exactly where it came from, wipe a table exactly three passes, or copy a move it watched in a video. These are not just see-and-do moments; they require remembering past events. Most past tests let robots succeed without real memory—current pictures often gave enough hints to guess the next step.
🍞 Anchor: If a robot must place a red cube into a bin three times, a single snapshot won’t tell it whether it’s on the first, second, or third placement. It must remember the count.
🍞 Hook: You know how a magician hides a coin under a cup and moves the cups around? If you lose track of the coin, you guess wrong.
🥬 The Problem: We lacked a big, fair, and varied test that truly needs memory across many situations. Past evaluations were small, inconsistent, or solvable without deep remembering. So it was unclear which memory designs were actually best and when.
🍞 Anchor: A robot might succeed on a tiny cup game in one lab but fail when cups move faster or when language clues are added. Without a standardized test, we can’t compare methods well.
🍞 Hook: Think of three ways you remember: words (notes in a notebook), pictures in your head, and a feeling for how to do a skill like riding a bike.
🥬 Failed Attempts: Prior work tried three main memory styles:
- Symbolic memory (notes): high-level words like “next, pick blue cube.” These are clear but can miss fine motion details.
- Perceptual memory (pictures): stacks of visual features from past frames; great for tracking movement but can be heavy.
- Recurrent memory (compressed summaries): a small hidden state updated over time; efficient but tricky to train and can forget specifics.
Unfortunately, different studies used different robot brains and non-uniform tasks. Results didn’t transfer well.
🍞 Anchor: It’s like testing bikes, scooters, and skateboards on different tracks with different rules—hard to tell which is truly best for city travel.
🍞 Hook: Imagine sorting memory into four everyday questions: when did it happen, where is it, what is it, and how do I do it?
🥬 The Gap: We needed a unified benchmark that stresses all four key memory needs:
- Temporal (when): counting events and ordering
- Spatial (where): tracking objects even if hidden or moved
- Object (what): keeping identity across time and clues
- Procedural (how): repeating a demonstrated motion
Plus, we needed enough high-quality demos so models can learn effectively, and a standard way to plug different memory types into the same robot brain.
🍞 Anchor: RoboMME fills this by giving 16 long tasks that force different memory types, like counting repetitions, following language timing hints, tracking masked objects that swap places, and copying demo paths.
🍞 Hook: Think of a school spelling bee. Everyone gets similar words and the same rules, so the winner truly earned it.
🥬 Why It Matters: Without a strong, fair test, we can’t measure progress or pick the right memory tool for the job. Real-world robots in homes, factories, and hospitals must remember instructions, counts, and hidden objects safely and reliably.
🍞 Anchor: A home robot setting the table must recall how many plates it already laid out (temporal), which cup belongs to Dad (object), where the silverware drawer is even if the door briefly blocks it (spatial), and how to fold a napkin the way it was shown (procedural).
02 Core Idea
🍞 Hook: You know how different games test different skills—chess tests planning, basketball tests movement, and memory games test recall? One tournament can’t crown a world champion of all skills unless it truly tests them all.
🥬 The “Aha!” Moment (one sentence): Build a single, large, standardized benchmark plus a controlled set of robot-brain variants to fairly test and understand which kinds of memory help which kinds of long, tricky tasks.
🍞 Anchor: RoboMME is that tournament: 16 tasks across four memory types, with apples-to-apples comparisons.
🍞 Hook: Imagine three ways to remember your day: write a diary (words), replay a slideshow (pictures), or keep a smooth running summary in your head (compressed feeling).
🥬 Multiple Analogies:
- School supplies analogy:
- Symbolic memory = notes in a notebook (clear steps)
- Perceptual memory = photo album (visual details)
- Recurrent memory = tiny cheat-sheet you update (compact but lossy)
- Detective analogy:
- Symbolic = timeline of events
- Perceptual = security camera footage
- Recurrent = detective’s hunch updated each clue
- Cooking analogy:
- Symbolic = recipe lines
- Perceptual = video of the chef
- Recurrent = your practiced muscle memory
🍞 Anchor: Counting cookies? The recipe (symbolic) is enough. Copying a swirl of icing? Watching the video (perceptual) helps more.
🍞 Hook: Imagine plugging a memory pack into a game console in three different ways.
🥬 Before vs After:
- Before: Many tasks could be solved from the current frame; memory methods weren’t compared fairly; results were mixed and hard to generalize.
- After: With RoboMME and a matched family of 14 variants, we learn that no single memory wins at everything; task type decides which memory helps most; perceptual+modulator gives the best balance overall.
🍞 Anchor: It’s like discovering that sprint spikes aren’t best for marathons; you pick shoes based on the race.
🍞 Hook: You know how you scan a class photo differently depending on what you need—names list (words), faces (pixels), or just a sense of who sat where (summary)?
🥬 Why It Works (intuition):
- Symbolic memory shines when tasks boil down to event logic (e.g., count to three), because words are crisp.
- Perceptual memory helps when fine motion timing and longer visual history matter (e.g., repeat that curve), because pictures preserve nuances.
- Recurrent memory can be efficient, but if not deeply integrated and pretrained for long horizons, it can forget or blur details.
- The “modulator” plug-in works well because it nudges the action-decider with memory without scrambling the original pretrained features.
🍞 Anchor: If you’re timing a jump rope (time-sensitive motion), a quick look at visual history and a gentle nudge to your move-decider beats reading a sentence like “jump… now!”
🍞 Hook: Think of building blocks—memory type, memory size, and how to plug it in.
🥬 Building Blocks:
- Memory types: symbolic, perceptual, recurrent
- Memory token budget: fixed size for fairness
- Integration strategies:
- Memory-as-context: stick memory next to inputs
- Memory-as-modulator: use memory to tune the action layers
- Memory-as-expert: give memory its own pathway that the action head can attend to
- Tasks grouped by primary need: counting, permanence (occlusion/tracking), reference (identity across cues), imitation (repeat motions)
🍞 Anchor: Like Lego pieces, you can build different robots: a counter bot (symbolic-leaning), a tracker bot (perceptual-leaning), or a mimic bot (perceptual with modulator).
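The building blocks above can be sketched as a small configuration grid. This is a simplified 3×3 illustration (the paper's 14 variants also include sub-options such as frame sampling vs. token dropping and different recurrent modules); the `MemoryVariant` class and the budget value are hypothetical, not the benchmark's real API.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical names for the design space described above: three memory
# representations crossed with three integration strategies, all under a
# fixed token budget for fairness.
MEMORY_TYPES = ("symbolic", "perceptual", "recurrent")
INTEGRATIONS = ("context", "modulator", "expert")

@dataclass(frozen=True)
class MemoryVariant:
    memory_type: str   # how history is represented
    integration: str   # how memory is plugged into the policy
    token_budget: int  # held fixed across variants for fair comparison

variants = [MemoryVariant(m, i, token_budget=64)
            for m, i in product(MEMORY_TYPES, INTEGRATIONS)]
print(len(variants))  # 9 combinations in this simplified grid
```

Holding the token budget constant is what makes the comparison apples-to-apples: any performance gap comes from the memory design, not from one variant simply remembering more.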
03 Methodology
🍞 Hook: Imagine you’re following a treasure map: first read the instructions, then check the pictures, then remember past clues, and finally choose an action.
🥬 Overview (at a high level): Input (language instruction + current images, sometimes a demo video) → Build/Update Memory (symbolic/perceptual/recurrent) → Plug Memory Into the Robot Brain (context/modulator/expert) → Predict Next Action (move gripper, press button, place object)
🍞 Anchor: For “Put two green cubes in the bin,” the robot reads the goal, recalls how many it already placed, looks at the scene now, then decides whether to pick another cube or stop.
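The loop above can be illustrated with a toy counting task in the spirit of PickXTimes; the environment, action names, and counter are simplified stand-ins, not the benchmark's real interface.

```python
# Toy episode loop for a counting task: the stop decision depends on a
# memory the current frame alone cannot provide.

def run_pick_x_times(target_count):
    memory = {"picks": 0}                 # temporal memory: an event counter
    actions = []
    while memory["picks"] < target_count:
        actions.append("pick_and_place")  # act on the scene
        memory["picks"] += 1              # update memory after each event
    actions.append("stop")                # stop exactly after target_count picks
    return actions

print(run_pick_x_times(3))
# ['pick_and_place', 'pick_and_place', 'pick_and_place', 'stop']
# A memoryless policy sees near-identical frames after each place and has
# no principled way to decide when to emit "stop".
```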
— New Concept: RoboMME Benchmark —
🍞 Hook: You know how a school sports day has sprints, long jumps, relays, and throws to test different skills?
🥬 The Concept: RoboMME is a big set of 16 robot tasks built to force memory use across time. How it works:
- Four suites target four memory types: Counting (temporal), Permanence (spatial under occlusion), Reference (object identity), Imitation (procedural).
- Each task is long and non-Markovian: the current picture alone is not enough.
- 1,600 demos with 770k timesteps provide rich training data; test seeds are different for fairness.
Why it matters: Without tasks that really need memory, robots can cheat by using the latest frame only and still look smart.
🍞 Anchor: In ButtonUnmaskSwap, objects get masked and swap spots; you must remember where the target went even though you can’t see it.
— New Concept: Memory Types —
- Temporal memory 🍞 Hook: Counting jumping jacks. 🥬 The Concept: Remembering event counts/order. How: Track how many repeats happened; know when to stop or switch. Why: Without it, a robot may keep placing cubes forever. 🍞 Anchor: In PickXTimes, stop after exactly X picks.
- Spatial memory 🍞 Hook: The shell game (which cup hides the ball?). 🥬 The Concept: Remember where things are, even when briefly hidden or moved. How: Link earlier locations to later frames despite occlusion or swapping. Why: Without it, the robot uncovers the wrong object. 🍞 Anchor: In VideoUnmaskSwap, follow the correct container through swaps.
- Object memory 🍞 Hook: Remembering which twin wore the red hat earlier. 🥬 The Concept: Keep identity consistent across time and cues (visual, language, action). How: Bind an object to hints (e.g., the one highlighted before pressing a button). Why: Without it, the robot grabs the wrong cube that only looks similar. 🍞 Anchor: In PickHighlight, collect the cubes that were briefly highlighted earlier.
- Procedural memory 🍞 Hook: Replaying a dance you just watched. 🥬 The Concept: Repeat demonstrated motions and strategies. How: Map the video path to your own moves (pick–place, push, circle). Why: Without it, motions become wobbly or off-target. 🍞 Anchor: In PatternLock, trace the same linear path shown in the demo.
— New Concept: Memory Representations —
- Symbolic memory 🍞 Hook: A to-do list. 🥬 The Concept: Summaries written as words (subgoals) that say what to do next. How: A vision-language model writes the next step, e.g., “pick the green cube at (x, y).” The robot reads it along with the main instruction. Why: Without concise steps, the robot may lose track of high-level progress. 🍞 Anchor: “Press the button, then pick the cube that was highlighted” is a clear plan for short sequences.
- Perceptual memory 🍞 Hook: A photo scrapbook. 🥬 The Concept: Keep selected visual tokens from past frames. How:
- Frame sampling: keep evenly spaced frames and pool tokens.
- Token dropping: keep only the most changed patches across time.
Why: Without visual history, the robot can’t time moves or replicate nuanced motion. 🍞 Anchor: In StopCube, seeing the cube’s approach history helps press exactly at the right pass.
- Recurrent memory 🍞 Hook: A tiny suitcase you repack each step. 🥬 The Concept: Compress history into a compact hidden state that updates over time. How: Use a small recurrent module (e.g., RMT or test-time-trained fast weights) to summarize visual tokens. Why: Without careful design and pretraining, it may forget important details. 🍞 Anchor: If the suitcase is too tiny or poorly packed, key clues get lost.
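The two perceptual-memory strategies above can be sketched in a few lines; the shapes, scoring, and selection rules here are simplified assumptions, not the paper's exact implementation.

```python
import numpy as np

# Sketch of two ways to fit a long visual history into a fixed token budget.

def frame_sampling(frames, budget):
    """Keep evenly spaced frames so the whole history fits the budget."""
    idx = np.linspace(0, len(frames) - 1, num=budget).round().astype(int)
    return [frames[i] for i in idx]

def token_dropping(tokens_per_frame, budget):
    """Keep only the (frame, patch) slots that changed most between frames."""
    t = np.asarray(tokens_per_frame, dtype=float)    # (time, patches, dim)
    change = np.abs(np.diff(t, axis=0)).sum(axis=2)  # (time-1, patches)
    keep = np.argsort(change.ravel())[-budget:]      # largest changes win
    times, patches = np.unravel_index(keep, change.shape)
    return list(zip((times + 1).tolist(), patches.tolist()))
```

Frame sampling keeps a holistic, evenly spread view of the episode, while token dropping concentrates the budget on motion; the paper's finding that dropping can hurt when global context matters follows naturally from this trade-off.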
— New Concept: Integration Mechanisms —
- Memory-as-context 🍞 Hook: Adding a study guide to your textbook. 🥬 The Concept: Concatenate memory tokens with the current inputs for joint processing. How: The vision-language expert reads everything together. Why: Simple to add, but may overload the input and blur roles. 🍞 Anchor: For short histories, this can work fine; for long ones, it can be noisy.
- Memory-as-modulator 🍞 Hook: Noise-canceling headphones that tune sound based on the room. 🥬 The Concept: Use memory to gently scale and shift the action layers (AdaLN-style), after the action features attend to memory tokens. How: Extract memory-aware features, turn them into per-layer knobs (scale/shift), and modulate action computation. Why: Keeps the strong pretrained backbone stable while injecting just-enough history. 🍞 Anchor: This is the top performer with perceptual memory (FrameSamp+Modul).
- Memory-as-expert 🍞 Hook: Bringing a specialist onto the team. 🥬 The Concept: A separate memory expert processes memory tokens; the action head attends to both main VLM and memory experts. How: Three experts; action can look at both, while VLM and memory stay separate to avoid interference. Why: Adds capacity, but is heavier and not always better. 🍞 Anchor: Works well in some cases, but modulator often edges it out in efficiency and stability.
— Example With Actual Data —
Task: “Watch the video and then place the cube on the second target after the button press.”
- Input: instruction + a short video (initial) + current frame.
- Perceptual memory (frame sampling): keep tokens from every N frames of the video, pooled to fit the memory budget.
- Integration (modulator): action features attend to these memory tokens; memory outputs become layer-wise scale/shift.
- Output: the predicted action chunk (e.g., move gripper to pick, then place on the correct target) that advances the episode.
Without step 2, the robot may forget the temporal reference (the target after the button press). Without step 3, it may not use history effectively.
— Secret Sauce —
- Standardized, diverse tasks that really force memory
- A controlled family of memory variants on the same backbone and same token budget
- The modulator strategy that injects history with minimal disruption to pretrained features
04 Experiments & Results
🍞 Hook: Think of a science fair where all projects are judged by the same rules, same judges, and on the same day—that’s how you get fair results.
🥬 The Test: The authors evaluated 14 of their memory-augmented variants and 4 prior methods on 16 tasks (50 evaluation episodes each, different seeds from training). The main metric is success rate: did the robot complete the task? This measures real memory use because tasks are non-Markovian—current views often don’t reveal the needed history.
🍞 Anchor: For a counting task, it only counts as success if the robot stops exactly at the correct number.
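The evaluation protocol boils down to a mean of per-task success rates; the sketch below uses made-up episode outcomes, not the paper's numbers.

```python
# Toy scoring: task A succeeds in 25/50 episodes, task B in 10/50.

def success_rate(outcomes):
    """Fraction of episodes marked successful."""
    return sum(outcomes) / len(outcomes)

def benchmark_score(per_task_outcomes):
    """Mean of per-task success rates, as in the 16-task average."""
    rates = [success_rate(o) for o in per_task_outcomes]
    return sum(rates) / len(rates)

tasks = [[True] * 25 + [False] * 25, [True] * 10 + [False] * 40]
print(round(benchmark_score(tasks), 3))  # 0.35 for this toy example
```

Averaging per task (rather than pooling all episodes) keeps each of the 16 tasks equally weighted, so a model can't inflate its score by excelling only on the easy suites.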
🍞 Hook: Who did they compare against? Not just each other—but also prior published ideas.
🥬 The Competition:
- Baseline Pi-0.5 (no memory)
- Pi-0.5 with past actions concatenated (symbolic-lite)
- SAM2Act+ (perceptual memory bank with a motion planner)
- MemER (keeps keyframes and predicts symbolic subgoals via a VLM)
Plus human performance via an oracle planner interface (humans choose high-level actions; a perfect low-level planner executes them).
🍞 Anchor: This sets a spectrum: from no-memory baselines to human decision-making upper bounds.
🍞 Hook: Scoreboard time—but let’s make numbers meaningful, not just raw percentages.
🥬 The Scoreboard (with context):
- Best non-oracle overall: Perceptual memory with Frame Sampling + Modulator averages about 44.5% across all 16 tasks. That’s like getting a solid B in a very hard class where most students fail some parts.
- Symbolic memory can be great with perfect subgoals (oracle): Grounded subgoals with oracle cues score very high on many tasks, showing language is powerful for logic and counting—but it drops on motion-heavy, time-critical tasks.
- MemER does well overall (~42.4%), especially where keeping keyframes helps with dynamic scenes and identity.
- Baselines without memory lag far behind (often under 20%).
- Humans score about 90.5% with the oracle planner—strong but not perfect, proving the benchmark is genuinely challenging even for people.
🍞 Anchor: On StopCube (press exactly when the moving cube reaches the target at a specific occurrence), perceptual memory helps timing more than a written subgoal like “press now.”
🍞 Hook: Did anything surprising pop up?
🥬 Surprising Findings:
- No single memory type or integration strategy dominates across all tasks. This overturns the hope for a one-size-fits-all memory.
- Token dropping can hurt when global context matters; frame sampling preserved better holistic cues for timing and motion.
- Recurrent methods underperformed here, likely needing deeper integration and long-horizon pretraining to shine.
- The modulator strategy consistently balanced performance and efficiency: it injects memory in a way that respects the pretrained backbone’s strengths.
- Efficiency trade-offs: external VLM calls (for symbolic subgoals) cost several times the compute of the base (roughly 3× or more), while perceptual+modulator gives steady gains as the memory budget grows, with modest overhead.
🍞 Anchor: Like choosing gear for a hike: a compact camera (perceptual+modulator) that adds a little weight but a lot of useful history beats calling a tour guide on the phone every few minutes (heavy external VLM).
05 Discussion & Limitations
🍞 Hook: Imagine packing for a trip—you can’t take everything, so you choose wisely and know where each choice works best.
🥬 Limitations:
- Environment focus: tabletop manipulation in a simulator; mobile robots and richer worlds are future work.
- Single backbone: all variants use the Pi-0.5 family; results may differ with other backbones or training recipes.
- Recurrent underuse: shallow recurrent add-ons may be too weak without special pretraining; not the final word on recurrence.
- Data domain shift: prompt-only large VLMs struggled without fine-tuning; real deployments may need adaptation.
🍞 Anchor: A great race track for cars isn’t the same as a mountain trail for bikes; future tests should broaden terrains.
🍞 Hook: What do you need to use this well?
🥬 Required Resources:
- Access to the RoboMME simulator setup and dataset (16 tasks, 1,600 demos, 770k steps)
- GPUs for training (e.g., 3–4 days on several A40s per variant in the study)
- Engineering to implement memory interfaces with fixed token budgets
🍞 Anchor: Think of it like setting up a game tournament with standard gear and referees.
🍞 Hook: When should you not use a given memory style?
🥬 When NOT to Use:
- Symbolic-only memory: Avoid for time-sensitive motion imitation or fine control tasks (e.g., pressing at the exact pass or tracing a curve) where words lack timing/geometry fidelity.
- Perceptual-heavy memory: Avoid if resources are tiny and tasks don’t need visual history (simple one-shot pick/place from a static scene).
- Recurrent-light add-ons: Avoid if you expect long-horizon dependencies without proper pretraining; they may forget or blur details.
🍞 Anchor: Like choosing a notepad vs. a camera vs. a compact summary—pick the tool that matches the job.
🍞 Hook: What’s still unknown?
🥬 Open Questions:
- Can hybrid memory (symbolic + perceptual + recurrent) consistently beat each alone in a unified model?
- How much specialized pretraining improves recurrent long-context retention?
- What are the best strategies for dynamic token budgeting over very long videos?
- How to make language grounding robust in clutter and under occlusion?
- How well do results scale to mobile manipulation and multi-room tasks?
🍞 Anchor: The next step is blending the diaries, photo albums, and compact summaries into one smart, efficient memory system.
06 Conclusion & Future Work
🍞 Hook: Think of RoboMME as the Olympics of robot memory—many events, one fair arena, real challenge.
🥬 3-Sentence Summary: RoboMME introduces a large, standardized benchmark that truly requires memory across four cognitive types: temporal, spatial, object, and procedural. Using 16 long-horizon tasks and 14 matched model variants, the authors reveal that no single memory design wins everywhere—symbolic helps with counting and logic, while perceptual is vital for motion and timing, and the modulator integration best balances performance and efficiency. These insights hold both in simulation and on real robots, establishing a strong foundation for future memory-augmented generalist policies.
🍞 Anchor: Picture a robot that can both keep count like a librarian and trace smooth curves like an artist because it chooses the right memory tool for the task.
🥬 Main Achievement: A comprehensive, fair, and challenging benchmark plus a carefully controlled study that maps which memory styles work best for which job features—and why.
🥬 Future Directions:
- Hybrid memory systems that combine symbolic, perceptual, and recurrent strengths
- Stronger long-horizon pretraining for recurrence
- Extending to mobile manipulation and richer, dynamic environments
- Smarter token budgets and efficient caching for long videos and real-time use
🥬 Why Remember This: RoboMME changes how we measure and reason about memory in robot control—it’s a clear, rigorous step toward trustworthy, long-horizon, real-world robot helpers that can remember what matters.
Practical Applications
- •Home assistance: Remember how many items have been set and who each belongs to (temporal and object memory).
- •Cleaning routines: Track the number of passes over a surface to meet hygiene standards (temporal memory).
- •Healthcare support: Recall patient-specific placement and timing instructions, even with brief occlusions (spatial and temporal memory).
- •Warehouse picking: Keep identity of items across bins and times, even when boxes are moved (object and spatial memory).
- •Manufacturing: Repeat precise tool paths learned from demonstration for assembly or inspection (procedural memory).
- •Education and training: Teach robots tasks via video, then have them reproduce the behavior safely (procedural memory).
- •Human–robot collaboration: Maintain a shared task history so the robot can anticipate the next step (symbolic + perceptual memory).
- •Elderly care: Track sequences like medication dispensing counts and proper object identification (temporal and object memory).
- •Kitchen tasks: Follow language instructions with temporal references like ‘after you press the timer, place the tray on the second shelf’ (object and temporal memory).
- •Field service: Work in cluttered, changing environments by remembering where tools were before occlusion or movement (spatial memory).