MMA: Multimodal Memory Agent | How I Study AI

MMA: Multimodal Memory Agent

Intermediate
Yihao Lu, Wanru Cheng, Zeyu Zhang et al. · 2/18/2026
arXiv

Key Summary

  • Long-horizon AI assistants can grab old, low-quality, or conflicting memories and then answer with too much confidence, which is dangerous.
  • MMA (Multimodal Memory Agent) gives every retrieved memory its own reliability score using three signals: who said it (source credibility), how old it is (temporal decay), and whether related memories agree (conflict-aware consensus).
  • The agent then reweights evidence by these scores and can choose to abstain when trustworthy support is too weak.
  • The authors also build MMA-Bench, a controlled test with text–image conflicts and known speaker reliabilities, and a scoring system that rewards safe abstention.
  • They discover a Visual Placebo Effect: merely seeing an image can make some agents feel falsely certain, even when the image is irrelevant or ambiguous.
  • On FEVER, MMA matches baseline accuracy but is 35.2% more stable across runs and has better selective (abstention-aware) utility.
  • On LoCoMo, a safety-leaning MMA setup reduces wrong answers and slightly improves actionable accuracy compared to the baseline.
  • On MMA-Bench, MMA reaches 41.18% Type-B accuracy (where visuals back an unreliable source) in Vision mode; the baseline collapses to 0.0% under the same rules.
  • Ablations show each component matters: removing source credibility leads to paralysis, removing consensus invites visual overreach, and removing time hurts stability—especially with images.
  • Overall, MMA is a practical recipe for more reliable multimodal memory use and calibrated, safe behavior.

Why This Research Matters

Real assistants must remember over weeks and juggle texts and images without being fooled by old, weak, or conflicting information. MMA makes assistants safer by preferring reliable sources, favoring fresh updates, and cross-checking evidence before acting. This reduces overconfident mistakes in everyday tasks like schedules, shopping, travel, and home automation. In high-stakes areas—health triage, enterprise support, or safety monitoring—its ability to abstain is crucial, because the right choice is sometimes to pause and ask for more information. MMA-Bench gives teams a way to test for this prudence, especially around images that can look convincing but prove nothing. The result is AI that is not just smart, but wisely cautious when it matters most.

Detailed Explanation


01Background & Problem Definition

You know how your class notes from September might not be as helpful in March unless you check who wrote them, when they were written, and whether they match what the teacher said recently? AI assistants have a similar problem. They remember lots of things across days, weeks, or months—texts, images, conversations—and later pull some of those memories to answer new questions. But without a way to judge which memories to trust, they can cling to old or shaky info and speak with too much confidence.

🍞 Top Bread (Hook): Imagine asking three friends if the museum is open today. One friend is super reliable, one often guesses, and one saw an old sign last month. If you treat all three answers as equally true, you'll probably make a bad plan.

🥬 Filling (The Actual Concept): What goes wrong in long-horizon AI agents is that retrieved memories are often treated as equally trustworthy. The agent uses similarity to your question to pick memories and then answers, even if those memories come from iffy sources, are stale, or contradict each other.

  • How it works today: (1) Retrieve by similarity, (2) Stuff memories into context, (3) Generate an answer. No careful trust check.
  • Why it matters: Without reliability checks, small memory mistakes snowball into overconfident wrong answers—especially across long timelines and multiple modalities (text + images).

🍞 Bottom Bread (Anchor): A baseline agent sees a recent chat about “bears” and another about “Bear Café hours.” It grabs the high-similarity “bears” chat by mistake and confidently tells you the café is in hibernation—oops.

Now, the problems in more detail:

  1. The World Before: Memory-augmented agents got good at retrieving lots of context. But similarity-based retrieval can surface off-topic or outdated items, and most systems didn’t explicitly score the trustworthiness of each memory.
  2. The Problem: Agents mix in stale, low-credibility, or conflicting memories and still answer like they’re certain. They rarely say, “I’m not sure,” even when the evidence is thin.
  3. Failed Attempts: People tried better retrieval, longer context windows, or token-level uncertainty (e.g., self-consistency). Helpful, but they don’t fix the core failure: unreliable retrieved memories trigger overconfident commitments.
  4. The Gap: We need memory-level reliability scoring and an evaluation that rewards saying “I don’t know” when the evidence is weak or contradictory.
  5. Real Stakes: In daily life, you might rely on an assistant for schedules, health tips, or home devices. Overconfident mistakes cause missed flights, broken routines, or risky actions. In safety-critical settings, the cost is much higher.

Here come the building blocks, introduced with the Sandwich pattern in a kid-friendly way:

🍞 Source Credibility — Hook: You know how you trust the school nurse about health more than a random rumor? 🥬 Concept: Source credibility is judging trust based on who said it.

  • How: Keep a trust prior per source (e.g., “official website” > “random user”).
  • Why: If you ignore who said it, you treat a rumor like a rule. 🍞 Anchor: When the city website says the museum opens at 10 and a friend guesses 9, you trust the website.
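In code, a source prior can be as simple as a lookup table. A minimal sketch follows; the source types and trust values are illustrative, not the paper's actual priors:

```python
# Hypothetical trust priors per source type (illustrative values, not from the paper).
SOURCE_PRIORS = {
    "official_website": 0.95,
    "verified_news": 0.85,
    "known_friend": 0.60,
    "anonymous_forum": 0.30,
}

def source_score(source_type: str, default: float = 0.5) -> float:
    """Return a trust prior for a memory's source; unknown sources get a neutral default."""
    return SOURCE_PRIORS.get(source_type, default)
```

In practice such a table would be curated per deployment and refined over time, matching the paper's call for updatable source priors.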

🍞 Temporal Decay — Hook: Bread gets stale over time, right? 🥬 Concept: Temporal decay lowers trust as info gets older.

  • How: Newer memories get higher weight; older ones fade.
  • Why: Otherwise, last year’s policy keeps beating this week’s update. 🍞 Anchor: A bus schedule from today beats one from last winter.
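One simple way to implement this fading is exponential decay with a half-life (a sketch; the paper's exact decay function is not specified in this summary):

```python
def temporal_score(age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: a memory loses half its weight every half_life_days."""
    return 0.5 ** (age_days / half_life_days)

# A memory from today keeps full weight; last winter's fades toward zero.
print(temporal_score(0))   # 1.0
print(temporal_score(30))  # 0.5
```

The half-life is a tuning knob: fast-changing settings (bus schedules) want a short one, stable facts (birthdays) a long one.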

🍞 Dynamic Reliability Scoring — Hook: Imagine star ratings that update every time you re-check a place. 🥬 Concept: Dynamic scoring is a changing trust score per memory, combining who said it, how old it is, and what neighbors say.

  • How: Compute a score using source credibility, temporal decay, and cross-memory agreement.
  • Why: Without a per-item trust score, you can’t safely mix evidence. 🍞 Anchor: A photo caption from a newspaper today that many notes agree with gets a high score; a lone, old blog post gets a low one.

🍞 Conflict-Aware Network Consensus — Hook: When friends disagree, you ask others to see which story fits best. 🥬 Concept: Consensus checks whether nearby, related memories support or contradict a memory.

  • How: Look at semantic neighbors; reinforce aligned memories and penalize contradictions.
  • Why: Without this, one loud, off-topic note can sway the answer. 🍞 Anchor: If five field-trip notes say “10 AM open” and one says “closed,” you downweight the odd one unless there’s strong proof.
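A minimal consensus check could average signed agreement over a memory's semantic neighbors (a sketch; the paper's actual formulation may differ):

```python
def consensus_score(alignments: list[float]) -> float:
    """alignments: signed agreement of each semantic neighbor with this memory,
    in [-1, 1] (positive = supports, negative = contradicts).
    Returns a score in [0, 1]; 0.5 means no neighbors, i.e., no signal either way."""
    if not alignments:
        return 0.5
    mean = sum(alignments) / len(alignments)
    return (mean + 1) / 2  # map [-1, 1] onto [0, 1]
```

With five supporting notes and one contradiction, the odd one out barely dents the score, which is exactly the "downweight the outlier" behavior described above.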

These pieces set the stage for MMA, the proposed agent that uses them to answer carefully—or abstain when evidence is weak.

02Core Idea

🍞 Hook: Picture a courtroom where every witness gets a trust badge that fades with time, and the jury checks if other witnesses back each story before deciding. If the evidence is shaky, the judge says, “We’ll wait.”

🥬 The Concept (Aha! in one sentence): MMA treats each retrieved memory like a witness with a live-updating trust score—and if trustworthy support is insufficient, it safely abstains.

How it works (like a recipe):

  1. For each retrieved memory, assign a trust score from three ingredients: source credibility (who), temporal decay (when), and conflict-aware consensus (who agrees/disagrees).
  2. Reweight the evidence in reasoning using these scores so strong, fresh, and well-supported memories lead.
  3. If, after reweighting, the support is too weak or contradictory, the agent abstains rather than guessing.

Why it matters: This stops the “retrieval trap” where high-similarity but unreliable items mislead the model, and it rewards prudent silence over confident errors.

🍞 Anchor: When asked, “Did the art museum open early today?”, MMA trusts the official tweet from this morning supported by multiple notes over last week, and ignores an old forum rumor. If evidence is unclear, it says, “Not enough information.”

Three analogies for the same idea:

  • Courtroom: Each memory is a witness. Badges = trust scores. Jury cross-checks witnesses (consensus). The judge can say “case postponed” (abstain).
  • Library: Books have checkout dates (time), publishers (source), and citations (consensus). Outdated or uncited claims stay on the shelf.
  • Group chat: Friends post photos and messages. You weigh who’s reliable, how recent the posts are, and if others back them up—otherwise you don’t commit.

Before vs After:

  • Before: Agents grabbed high-similarity memories and answered, even if sources were weak or the info was old.
  • After: Agents score each memory, boost the solid ones, damp the risky ones, and can choose to abstain when evidence doesn’t add up.

Why it works (intuition, not math): Errors often come from a few noisy memories sneaking in. By (a) preferring credible sources, (b) fading old info, and (c) cross-checking with neighbors, you sharply lower the chance that a single bad memory dominates. Adding abstention means you avoid flipping a risky coin toss.

Building blocks with mini Sandwiches:

🍞 Source Credibility — Hook: Trusting the school newsletter over a hallway rumor. 🥬 What/How/Why: Tag sources with priors, use them to weight memories, so rumors don’t overrule reliable posts. 🍞 Anchor: City.gov beats a random blog about road closures.

🍞 Temporal Decay — Hook: Milk expires. 🥬 What/How/Why: Older info counts less; this lets new updates overtake outdated policies. 🍞 Anchor: Today’s cafeteria menu beats last month’s.

🍞 Conflict-Aware Network Consensus — Hook: When two friends disagree, check the rest of the team. 🥬 What/How/Why: Boost aligned memories, penalize contradictions; isolates outliers. 🍞 Anchor: If most notes say “field trip is Friday,” the lone “Thursday” gets downweighted unless strong proof appears.

🍞 Dynamic Reliability Scoring — Hook: Live-updating star ratings. 🥬 What/How/Why: Combine source, time, and consensus into one per-memory score used during reasoning. 🍞 Anchor: A recent, widely supported, credible photo-caption pair earns top billing.

🍞 Multimodal Memory Agent (MMA) — Hook: A careful librarian who reads text, looks at pictures, and keeps a trust meter for every card. 🥬 Concept: MMA is a memory agent that scores each retrieved item, reweights evidence, and abstains if support is insufficient.

  • How: Retrieve → score (source, time, consensus) → reason with weights → commit or abstain.
  • Why: Prevents overconfident mistakes driven by one flashy but unreliable memory. 🍞 Anchor: Faced with an ambiguous image and old text, MMA resists the urge to guess and says, “Unknown.”

03Methodology

At a high level: Query → Retrieve candidate memories (text and/or images) → Score each memory (Source, Time, Consensus) → Reweight evidence in the prompt → Decide: Answer or Abstain → Output with confidence.

Step-by-step with Sandwiches for new ideas and concrete examples:

  1. Retrieval of Candidate Memories
  • What happens: Given a question, the agent fetches semantically similar memories (snippets, captions, images) from its store.
  • Why this step exists: You can’t reason about what you can’t see. Retrieval sets the stage, though it may bring in noisy or off-topic items.
  • Example: Question: “Did the Greenfield Art Museum open early today?” The store returns: (A) City’s official tweet from this morning; (B) A blog post from last year; (C) A friend’s message yesterday; (D) An image of the museum door with a small blurry sign.
  2. Per-Memory Scoring: Source Credibility 🍞 Hook: You trust the nurse’s health advice over a random TikTok. 🥬 What: Assign a trust prior to each source (e.g., “official city account” high; “anonymous forum” low).
  • How: Map source types to trust levels; attach to each memory as a score.
  • Why: Without source priors, a rumor can outweigh official info. 🍞 Anchor: The city tweet gets a high source score; the old forum post gets a low one.
  3. Per-Memory Scoring: Temporal Decay 🍞 Hook: Fresh fruit beats leftovers. 🥬 What: Reduce a memory’s weight as it ages.
  • How: Newer timestamps score higher; older ones fade.
  • Why: Prevents last year’s schedule from overruling today’s update. 🍞 Anchor: Today’s tweet outranks last year’s blog even if both are about the museum.
  4. Per-Memory Scoring: Conflict-Aware Network Consensus 🍞 Hook: When teammates disagree, you check how the rest of the team leans. 🥬 What: For each memory, look at its semantic neighbors. If many aligned neighbors agree, boost it; if they contradict, reduce it.
  • How: Build a small neighborhood of related memories; estimate alignment (text-text, image-text via embeddings/captions). Reinforce agreement, penalize contradiction.
  • Why: Stops a lone flashy but wrong memory from steering the answer. 🍞 Anchor: If two more notes echo “opened early” and the blurry image is unclear, the text cluster gets boosted while the image gets treated cautiously.
  5. Combine Scores into a Dynamic Reliability Score 🍞 Hook: Like a report card that averages homework (source), recency (time), and peer review (consensus). 🥬 What: Merge the three signals into one per-memory trust score.
  • How: Normalize each signal and compute a weighted blend tuned to the domain (e.g., more weight on source in safety-critical settings; more on consensus in high-noise ones).
  • Why: A single, clear score per memory makes downstream reasoning robust and simple. 🍞 Anchor: The official tweet (high source, very fresh, supported) lands near the top; the old blog sinks to the bottom.
  6. Reweight Evidence in Reasoning
  • What happens: The agent constructs its working context so that high-score memories are emphasized (more space, stronger prompts), and low-score ones are compressed or excluded.
  • Why: If you give all memories equal stage time, the weak ones can still derail the answer.
  • Example: The prompt includes the tweet verbatim, summarizes the friend’s message, and mentions the image only as “unclear signage.”
  7. Decide to Answer or Abstain (Selective Prediction) 🍞 Hook: If you’re not sure about an address, you’d rather say “Let me check” than send a friend to the wrong place. 🥬 What: The agent checks if reliable support crosses a threshold. If not, it abstains.
  • How: Aggregate the top memory scores and their agreement. If coverage and consistency are strong, answer; otherwise, return “Unknown/Not enough info.”
  • Why: Avoids overconfident mistakes when evidence is thin or conflicting. 🍞 Anchor: If the tweet said “special early opening at 9 AM” and two notes match, answer “Yes, at 9 AM.” If the image is ambiguous and text is old, say “Unknown.”
  8. Multimodal Handling and Visual Caution 🍞 Hook: A shiny picture can be persuasive, but shiny isn’t always true. 🥬 What: Treat images as evidence that must pass the same reliability checks.
  • How: Use captions/vision embeddings; still apply source, time, and consensus. Don’t let an image override consistent, credible text without support.
  • Why: Prevents the Visual Placebo Effect—images that “feel” convincing but don’t truly prove the claim. 🍞 Anchor: A blurry door photo doesn’t beat a fresh, official tweet.
  9. Output with Confidence and Rationale
  • What happens: The agent returns an answer or abstains, along with a short explanation that mentions the strongest evidence and its reliability.
  • Why: Transparency helps users trust the result and know when to double-check.
  • Example: “Yes—based on today’s official city tweet and matching notes. The older blog post was downweighted.”
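The steps above can be sketched as a retrieve → score → reweight → answer-or-abstain loop. The blend weights, half-life, and abstention threshold below are illustrative choices, not the paper's tuned values:

```python
def reliability(source_prior, age_days, alignments,
                weights=(0.4, 0.3, 0.3), half_life_days=30.0):
    """Blend source, time, and consensus into one per-memory trust score in [0, 1]."""
    s = source_prior                            # who said it
    t = 0.5 ** (age_days / half_life_days)      # how old it is
    c = 0.5 if not alignments else (sum(alignments) / len(alignments) + 1) / 2
    w_s, w_t, w_c = weights
    return w_s * s + w_t * t + w_c * c

def answer_or_abstain(memories, threshold=0.7):
    """Answer with the top-scoring memory's claim, or abstain if support is too weak."""
    scored = sorted(((reliability(src, age, al), claim)
                     for claim, src, age, al in memories), reverse=True)
    top_score, top_claim = scored[0]
    return top_claim if top_score >= threshold else "Unknown / not enough information"

# Museum walkthrough: (claim, source prior, age in days, neighbor alignments)
memories = [
    ("Opens early at 9 AM", 0.95, 0.0, [1.0, 0.8]),   # official tweet, fresh, supported
    ("Opens at 10",         0.50, 400, [-1.0]),       # last year's blog, contradicted
    ("Might open early",    0.60, 1.0, [0.8]),        # friend's message, aligned
    ("(blurry door photo)", 0.40, 0.0, [0.0]),        # ambiguous image, no consensus
]
print(answer_or_abstain(memories))  # → Opens early at 9 AM
```

If only the ambiguous photo were retrieved, its score (about 0.61) would fall below the threshold and the agent would abstain instead of guessing from the image.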

The Secret Sauce

  • Memory-item trust is first-class: Each piece of evidence earns its own score, so one bad apple can’t spoil the batch.
  • Consensus filters conflict: The semantic neighborhood keeps the agent from chasing outliers.
  • Abstention is a feature, not a bug: The system is rewarded for saying “I don’t know” when appropriate.
  • Modular knobs: Safety-critical deployments can upweight source credibility; noisy domains can lean more on consensus; fast-changing settings can emphasize time.

Concrete Walkthrough with Data

  • Query: “Is the Greenfield Museum opening early today?”
  • Retrieved: (1) City tweet 8:05 AM: “Early opening at 9 AM due to festival.” (2) Blog post from last year: “Opens at 10.” (3) Friend’s message yesterday: “Heard it might open early.” (4) Photo of entrance from today, sign text unreadable.
  • Scores: (1) High source, very fresh, supported by (3) → high; (2) Low time freshness, average source, contradicted → low; (3) Medium source, fresh, aligns → medium-high; (4) Unknown source, fresh but ambiguous, low consensus → low.
  • Decision: Plenty of aligned, credible, fresh support → Answer: “Yes, 9 AM.” If the tweet were missing and the image unclear, it would abstain.

What breaks without each step

  • No source: You can’t prioritize official updates; rumors win too often.
  • No time: Old info overruns new updates; policies freeze.
  • No consensus: A single flashy item (especially an image) can hijack the decision.
  • No abstention: The agent is forced to guess, making high-stakes mistakes likely.

04Experiments & Results

The Test: The authors measure whether MMA is both accurate and prudent. They look at classic fact-checking (FEVER), long-term dialog QA (LoCoMo), and a new adversarial benchmark (MMA-Bench) that injects text–image conflicts and varying speaker reliabilities. They also use abstention-aware metrics that reward safe “Unknown” when evidence is insufficient.

The Competition: MMA is compared mainly to MIRIX, a strong memory system that retrieves and routes information but doesn’t attach per-memory trust scores the same way.

Scoreboard with Context:

  • FEVER (500 samples, 3 seeds): MMA matches baseline accuracy (59.93% vs. 59.87%) but is 35.2% more stable (±1.62% vs. ±2.50%). That’s like keeping an A- average with far fewer ups and downs across tests. MMA also edges out the baseline on selective utility (α=0.2), meaning it’s better when we reward safe abstention.
  • LoCoMo (long conversations over months): A safety-oriented MMA variant (without consensus) improves actionable accuracy (79.64% vs. 78.96%) and reduces wrong answers (298 vs. 317). In everyday terms, it gets slightly more of the questions it attempts correct and makes fewer risky mistakes.
  • MMA-Bench (new, high-noise, multimodal conflicts): In the tough Type-B setting (visuals support the unreliable speaker), MMA reaches 41.18% accuracy in Vision mode, while the baseline collapses to 0.0%. That’s like at least getting on the board in a tricky game where the opponent tries to fool you with pictures.

Surprising Findings:

  • Visual Placebo Effect (Sandwich): 🍞 Hook: A fancy-looking bottle can make water taste better—even if it’s just water. 🥬 Concept: Simply seeing an image can trick agents into feeling more certain than they should.
    • How: Visual inputs, even when ambiguous, can override caution unless checked by source, time, and consensus.
    • Why it matters: Images can induce overconfident mistakes in high-noise settings. 🍞 Anchor: A blurry sign photo makes the agent say “It opened early!” when text says “Unknown”—unless reliability scoring reins it in.
  • Baseline’s “good” abstention in unknowable cases was often an artifact of retrieval blindness: it failed to fetch the conflicting evidence, so it said “Unknown,” not out of wisdom but out of not seeing the trap.
  • Foundation models show an inherited leaning to trust images; MMA exposes and buffers this, but removing consensus makes the model vulnerable to visual overreach.

Deeper Diagnostics on MMA-Bench:

  • In Type B (reliability inversion), the baseline posted 0%—it couldn’t engage due to noise. MMA, with its reliability filter, engaged and solved 41.18% in Vision mode.
  • In Type D (unknowable), MMA did well in Text mode but dropped in Vision mode if consensus was removed, highlighting how images can push overconfident guesses unless balanced by consensus.

Contextualizing Numbers:

  • Stability gain on FEVER means deployers can expect more predictable performance from run to run.
  • Fewer wrong answers on LoCoMo means better everyday reliability in long chats.
  • Type-B wins on MMA-Bench show real progress in the hardest trust conflicts: when a shiny picture seems to back an unreliable speaker, MMA can still reason its way to safer behavior.

Takeaway: MMA doesn’t just chase higher accuracy; it improves the quality of decision-making—rewarding caution when the world is messy and moving.

05Discussion & Limitations

Limitations:

  1. Retrieval dependence: MMA can only score what it retrieves. If the underlying retriever misses key evidence, MMA can’t magic it back. This explains some baseline-versus-MMA differences on hard cases.
  2. Sparsity–consensus trade-off: In sparse, chatty contexts, semantic neighbors may be thematically related but factually irrelevant. Strict consensus can become too conservative or noisy, so the ‘st’ variant (no consensus) sometimes wins.
  3. Visual bias inheritance: Foundation models often “believe” images more than text. While MMA mitigates this via consensus, a strong visual bias can still seep through when images are ambiguous and context is thin.
  4. Latency and compute: Scoring many memories (especially multimodal) and running consensus checks add overhead. Tight real-time systems may need lighter configurations.

Required Resources:

  • An external memory store with timestamps and source metadata.
  • A retriever that handles text and image embeddings (or captions) to build neighborhoods.
  • A table of source priors (updatable over time) and a way to learn/refine them.
  • Compute to run multi-pass scoring and selective prediction.

When NOT to Use:

  • Single-turn, low-stakes Q&A where retrieval isn’t needed or the knowledge is static and clear.
  • Extremely sparse logs where consensus neighborhoods are mostly off-topic; use the ‘st’ (Source + Time) variant instead.
  • Settings where you cannot track timestamps or sources; without these, scoring becomes fragile.

Open Questions:

  • Can we learn source priors adaptively from user feedback and outcomes, rather than setting them manually?
  • How to auto-tune consensus strictness to context entropy (tight in conflict, loose in sparsity)?
  • Better visual uncertainty: can we detect when images are genuinely uninformative and discount them earlier?
  • Interactive abstention: when the agent abstains, can it ask for the smallest extra piece of info that would flip to an answer?
  • Joint training: could future models learn memory-item reliability end-to-end, reducing dependence on hand-tuned weights?

06Conclusion & Future Work

Three-Sentence Summary: MMA gives every retrieved memory its own live trust score—combining who said it, how old it is, and whether related memories agree—and then answers or safely abstains based on that. This reduces overconfident mistakes, improves stability, and exposes a key weakness in multimodal systems: the Visual Placebo Effect. Across FEVER, LoCoMo, and the new MMA-Bench, MMA matches or improves utility and safety, especially under conflict.

Main Achievement: Turning passive memory retrieval into active epistemic filtering with per-item reliability and principled abstention, restoring agency in noisy, multimodal, long-horizon settings.

Future Directions: Learn source priors from feedback; adapt consensus to context density; build stronger visual uncertainty detectors; and integrate interactive abstention to request just-in-time clarifications.

Why Remember This: When memory is long and the world is messy, being right isn’t only about finding more context—it’s about trusting the right context and admitting when you don’t know. MMA operationalizes that wisdom and shows how to measure it.

Practical Applications

  • Customer support bots that track user history for months but abstain or escalate when evidence is inconsistent.
  • Personal AI organizers that weigh official updates over rumors and prefer today’s information over last year’s.
  • Newsroom fact-check tools that rank claims by source credibility and recency, flagging contradictions across articles and photos.
  • Healthcare intake assistants that summarize patient history while downweighting outdated notes and abstaining on risky advice.
  • Enterprise knowledge assistants that cross-check policy docs and meeting notes, avoiding action when departments disagree.
  • Robotics and IoT controllers that treat sensor images as evidence but require consensus with logs before acting.
  • Education tutors that remember learning progress across semesters yet ask clarifying questions when records conflict.
  • Legal e-discovery helpers that prioritize documents from authoritative sources and recent filings, suppressing noisy chatter.
  • Scientific literature triage that surfaces fresh, peer-supported findings and flags visually appealing but weak evidence.
  • Home automation agents that trust device logs and owner profiles appropriately, ignoring ambiguous camera frames.
#memory-augmented LLMs#multimodal agents#reliability scoring#source credibility#temporal decay#network consensus#selective prediction#abstention#epistemic calibration#MMA-Bench#visual placebo effect#RAG#conflict resolution#belief dynamics#long-horizon dialog