Multimodal Fact-Level Attribution for Verifiable Reasoning

David Wan; Han Wang; Ziyang Wang; Elias Stengel-Eskin; Hyunji Lee; Mohit Bansal

Multimodal Fact-Level Attribution for Verifiable Reasoning

Beginner

David Wan, Han Wang, Ziyang Wang et al.2/12/2026

arXiv

Key Summary

•This paper builds a new test, called MURGAT, to check whether AI models can back up each small fact they say with the right part of a video, audio, or figure.
•Instead of judging only the final answer, the test breaks explanations into tiny, checkable facts and asks for exact citations with modality (audio/visual) and timestamps.
•The evaluation has three steps: find sentences that are verifiable, split them into atomic facts, and check if the cited evidence really proves each fact.
•A single score, MURGAT-SCORE, combines how often models provide citations (coverage) with how correct those citations are (attribution).
•An automatic scoring pipeline closely matches human graders, reaching near-perfect correlation for citation coverage (r = 0.97) and strong overall alignment (r ≈ 0.86).
•Even strong multimodal models often get the answer right but still cite the wrong evidence or miss timestamps, showing a gap between reasoning and verifiable grounding.
•Adding citations can act like a 'reasoning tax' on simple recognition tasks but can help on complex reasoning tasks by scaffolding thinking.
•Program-like, step-by-step grounding boosts faithfulness (about +9.6 MURGAT-S on average) but can lower answer accuracy, revealing a trade-off between being careful and being flexible.

Why This Research Matters

When AI explains a lesson video or a lab demo, we need to know each claim is really in the footage or audio—not just imagined. MURGAT pushes models to provide exact, checkable citations for every observable fact, which helps teachers, students, journalists, and scientists trust and audit explanations. This is crucial for reducing misinformation, avoiding subtle errors in tutorials, and making complex analyses transparent. The automatic scoring pipeline means we can test many models quickly and fairly, speeding progress toward trustworthy AI. As AI takes on more educational and professional roles, per-fact grounding is a foundation for safety, reliability, and learning effectiveness.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

🍞 Hook: Imagine your friend tells a long story about a science video and says, “Trust me!” Wouldn’t you feel safer if they also pointed to the exact second in the video and the exact sentence in the audio that proves each part of their story?

🥬 The Concept: The world before this paper mostly tested whether AI could spot things in pictures or short clips, or answer questions by pointing to one obvious place. That worked fine for simple, “I can see it right there” questions, but not for real-life school or work problems where you must combine clues across video, audio, and graphs—and show your work.

What it is: Models today can write long answers and show some citations, but they often do this in simplified setups (mainly images, short outputs, single-step evidence).
How it worked before:
1. Ask a question about an image or video.
2. The model answers and maybe gives a timestamp or a link.
3. Evaluations mostly check if the answer is correct—not if every factual sentence is truly proven by the cited evidence.
Why it matters: In real life (class lessons, tutorials, labs), answers come from many steps. If one step is wrong or not proven, confidence collapses.

🍞 Anchor: Think of a science video explaining forces with a narrator (audio) and a graph (visual). If the model claims, “The green curve is repulsive,” it should cite the graph’s timestamp for the curve and the audio timestamp where the narrator defines the sign convention—otherwise, we can’t trust that claim.

🍞 Hook: You know how group projects go wrong when one person makes up a part and nobody checks it? That happens in AI, too.

🥬 The Concept: The problem researchers faced is verifying each small claim inside a long explanation, across different modalities, not just the final answer.

What it is: A need to judge fact-by-fact grounding in multi-step reasoning where evidence might live in video frames, audio narration, slides, or plotted data.
How it works (the challenge):
1. Explanations mix reasoning (not directly checkable) and observable facts (checkable).
2. Facts may need multiple pieces of evidence (e.g., audio + visual).
3. Each fact needs exact timestamps and modality labels.
Why it matters: Without this, models can sound smart while citing the wrong parts—or nothing at all—creating false confidence.

🍞 Anchor: If the model says, “At 1:16 the curve is positive,” we must see that exact visual frame. If it adds, “The narrator defines repulsive as positive,” we need the audio segment where that convention is stated.

🍞 Hook: Imagine trying to grade an essay where every sentence must be proven by a specific page in the textbook and a second in a lecture video. Tricky!

🥬 The Concept: Past attempts mostly tested eyesight (visual grounding) or used retrieval tools for text, but didn’t fully handle complex, mixed-modality reasoning with exact, per-fact citations.

What they tried and why it fell short:
1. Image-only grounding or simple video moment retrieval: misses audio and multi-source synthesis.
2. “LLM-as-a-judge” holistic scores: can be vague and don’t check each tiny claim.
3. Provide timestamps in prompts: the model just picks from given hints, not real self-grounding.
Why it matters: These setups can overestimate trustworthiness because they don’t force precise, fact-level proof.

🍞 Anchor: If the question needs combining a graph’s axis label, a narrator’s definition, and a later frame’s value, a one-shot “find the timestamp” task won’t catch whether every small claim was truly grounded.

🍞 Hook: Think of a mystery: you don’t just want the detective’s final answer—you want the clues pinned to exact times and places.

🥬 The Concept: The missing piece was a benchmark and scoring method that separates reasoning from observation and checks, for each observable claim, whether the cited multimodal evidence truly entails it.

What it is: A comprehensive, fact-level, multimodal attribution test.
How it works:
1. Identify which sentences are verifiable observations.
2. Break those into atomic, bite-sized facts.
3. Check whether each fact is fully supported by the cited segments (modality + timestamps).
Why it matters: This exposes when models “sound right” but don’t actually show the proof.

🍞 Anchor: If an answer paragraph says five things you can see or hear, all five must have correct, specific citations; otherwise, trust should drop.

🍞 Hook: Why should anyone care? Because in school, journalism, science, and safety, being right isn’t enough—you must show your sources.

🥬 The Concept: Real stakes include classroom learning, fact-checking, and technical tutorials where wrong or unverified steps mislead people.

What it is: Ensuring that each factual step is grounded protects users from subtle mistakes.
How it works:
1. Demand exact citations for observable claims.
2. Score how completely and precisely those citations prove the claims.
3. Prefer methods that align internal reasoning with external evidence.
Why it matters: It helps build AI that you can audit, trust, and learn from.

🍞 Anchor: When a model explains a physics problem from a lecture video, you want to pause at the cited second and confirm the narrator really said that and the graph really shows that.

02Core Idea

🍞 Hook: You know how a math teacher says, “Show your work,” and not just the final answer? This paper does that for multimodal AI.

🥬 The Concept (Aha! in one sentence): Separate what the model claims you can directly see or hear from how it reasons, then force each observable claim to be proven by exact timestamps and modalities.

What it is: A benchmark and scoring method (MURGAT and MURGAT-SCORE) that test per-fact grounding across video, audio, and figures.
How it works:
1. Find sentences that are verifiable observations (not pure reasoning).
2. Split them into atomic facts—small, stand-alone claims.
3. For each fact, verify that the cited segments (with modality + timestamps) fully support it, checking both sufficiency (recall) and necessity (precision).
Why it matters: It reveals when answers are correct but unproven, and pushes models to be trustworthy explainers.

🍞 Anchor: If the model says, “The label on the plot reads ‘Voltage (V)’,” it must point you to the exact visual timestamp where that label appears.

🍞 Hook: Imagine three different ways to explain the same idea so it really sticks.

🥬 The Concept (Multiple analogies):

Analogy 1—Detective board: The answer is the solved case, but each string on the board must connect to an exact clue (the cited second in the video or audio).
Analogy 2—Recipe with ingredients: The final dish (answer) only counts if every ingredient (fact) came from the right jar (modality) at the right time (timestamp).
Analogy 3—School essay: You get credit for each sentence only if your footnotes point to the exact page and line.

🍞 Anchor: Saying “the speaker defines a term” isn’t enough; you must show the audio moment where the word is defined, just like a footnote.

🍞 Hook: What changes before vs. after this idea?

🥬 The Concept (Before vs. After):

What it is: • Before: Evaluate mostly final answers, sometimes with loose or single citations. • After: Evaluate every checkable sentence as a set of atomic facts with exact multimodal citations.
How it works:
1. Don’t penalize pure reasoning sentences; only check observable claims.
2. Require per-fact evidence and judge both sufficiency and necessity.
3. Combine coverage (did you cite at all?) with attribution quality (were those citations good?) into one score.
Why it matters: Models can’t hide behind fluent language—they must actually show proof for each fact.

🍞 Anchor: Two models might both answer “B,” but only one can jump to the right seconds in the video and audio for each claim—that’s the one you trust.

🍞 Hook: Why does this work intuitively?

🥬 The Concept (Why it works):

What it is: A clean separation between uncheckable thinking and checkable observations avoids scoring confusion.
How it works:
1. Filter to verifiable sentences to reduce noise.
2. Atomize claims so mixed-true/false sentences are scored fairly.
3. Use two-sided checks: Recall (is the evidence enough?) and Precision (is each citation necessary?).
Why it matters: This pinpoints missing evidence vs. citation dumping, and pushes models toward faithful, minimal proofs.

🍞 Anchor: If a fact needs both the narrator’s definition (audio) and a plot region (visual), recall fails unless both are correctly cited; precision drops if the model also throws in random extra timestamps.

🍞 Hook: Let’s name the building blocks so the toolbox is clear.

🥬 The Concept (Building blocks):

Multimodal Grounding: Tie text to exact bits of video, audio, or figures.
Fact-Level Attribution: Map each tiny claim to the exact supporting segments.
Verifiable Claim Identification: Keep only sentences you can actually check.
Atomic Fact Decomposition: Split sentences into smallest checkable facts.
Attribution Quality: Judge sufficiency (recall) and necessity (precision).
MURGAT: The benchmark that wraps all of the above into tasks.
MURGAT-SCORE: Coverage × Attribution—one score that rewards both “I cited” and “I cited well.”

🍞 Anchor: It’s like grading a lab report: first circle the sentences that should be backed by data, then check each data point and whether it’s the only one needed.

03Methodology

🍞 Hook: Imagine grading a science notebook. You first mark which sentences need proof, then break them into checklist items, and finally confirm each item using the exact spot in the data. That’s the recipe here.

🥬 The Concept (High-level flow): Input → Step A (Verifiable Claim Identification) → Step B (Atomic Fact Decomposition + Decontextualization + Citation propagation) → Step C (Attribution Quality: Precision/Recall) → Metrics (Coverage, Attribution F1, MURGAT-SCORE)

What it is: A step-by-step pipeline that turns a long answer into many tiny, checkable facts with citations.
How it works:
1. Step A: Keep only sentences that claim something observable.
2. Step B: Split each kept sentence into atomic facts; resolve pronouns; copy over citations.
3. Step C: For each atomic fact, check whether all cited segments together are enough (recall) and which citations are truly needed (precision).
Why it matters: Without this structure, scoring would be fuzzy and models could get away with hand-wavy evidence.

🍞 Anchor: If the answer says, “The green curve is positive (visual, 1:16), and the narrator defines repulsive as positive (audio, 0:42–0:46),” we expect two atomic facts with those exact citations.

— Step A: Verifiable Claim Identification — 🍞 Hook: You know how some sentences are thoughts (like “Therefore…”) and others are observations (like “The label says X”)? Only observations can be fact-checked.

🥬 The Concept:

What it is: A filter that finds sentences you can check directly in the video/audio/figure.
How it works:
1. For each sentence, ask: can this be seen/heard/read in the sources?
2. Keep it if yes; drop if it’s just reasoning, general knowledge, or chit-chat.
3. Also keep only those with at least one citation (so there’s something to score).
Why it matters: We avoid punishing creative but unobservable reasoning and focus evaluation on claims that must be grounded.

🍞 Anchor: “The narrator says ‘repulsive is positive’ (audio, 0:42–0:46)” is kept; “Therefore, statement B is false” is not checked.

— Step B: Atomic Fact Decomposition (with Decontextualization & Citation Propagation) — 🍞 Hook: It’s easier to grade a multiple-choice quiz than a giant paragraph. So we turn each checkable sentence into a list of tiny checkpoint facts.

🥬 The Concept:

What it is: Break a sentence into minimal, independent facts; resolve pronouns using only earlier context; copy the sentence’s citations to each fact.
How it works:
1. Decontextualize: Replace pronouns (he, it, this) with concrete nouns known at that point.
2. Split: Turn one sentence into small atomic facts (e.g., “curve is green,” “curve is positive”).
3. Propagate citations: Every atomic fact inherits the sentence’s modality+timestamp citations; if the sentence had separate inline citations, assign them to the right fact.
Why it matters: Mixed-true/false sentences become fairly scorable; pronoun clarity stops accidental mismatches.

🍞 Anchor: “It is positive (visual, 1:16)” becomes “The green curve is in the positive region (visual, 1:16).” Now we can check that exact frame.

— Step C: Attribution Quality (Precision + Recall) — 🍞 Hook: Think of a proof: you need enough evidence (recall), but not extra fluff (precision).

🥬 The Concept:

What it is: A two-sided test for each atomic fact’s citations.
How it works:
1. Recall (Sufficiency): Do all cited segments together fully support the fact?
2. Precision (Necessity): Of the cited segments, which are truly needed? Extra, irrelevant timestamps lower precision.
3. Aggregate into an F1 score for attribution quality.
Why it matters: It penalizes both missing evidence and citation-dumping.

🍞 Anchor: If a fact needs audio definition and a visual frame, recall fails without both. If you also add an unrelated timestamp, precision falls.

— Metrics — 🍞 Hook: Imagine two grades: Did you remember to show sources (coverage)? And were those sources the right ones (attribution)?

🥬 The Concept (Coverage, Attribution, MURGAT-SCORE):

What it is: • Coverage: fraction of verifiable sentences that actually have citations. • Attribution: F1 from precision+recall on atomic facts. • MURGAT-SCORE = Coverage × Attribution.
How it works:
1. Compute coverage over sentences.
2. Compute precision/recall/F1 over atomic facts.
3. Multiply to avoid rewarding rare, cherry-picked perfect grounding.
Why it matters: A single great citation isn’t enough if most facts are uncited; and citing everything poorly shouldn’t score high either.

🍞 Anchor: A model that cites 95% of checkable sentences (high coverage) but often cites the wrong moments (low attribution) ends up with a moderate MURGAT-SCORE.

— Automatic Evaluation Pipeline — 🍞 Hook: Grading tons of videos by hand is slow. Can we build a fair robot grader that agrees with people?

🥬 The Concept:

What it is: An LLM-based, stepwise auto-grader tuned to match human annotations.
How it works:
1. Human-annotate a sample for all three subtasks on WorldSense and Video-MMMU.
2. Try different models/prompts per subtask and pick the best: e.g., Gemini-3-Pro for verifiability (JSON prompt), Gemini-3-Flash for decomposition, Gemini-2.5-Flash for entailment.
3. Show strong correlation with humans: near-perfect for coverage (r = 0.97), strong overall (≈ 0.86).
Why it matters: Enables scalable, reliable benchmarking.

🍞 Anchor: The auto-grader’s coverage scores almost perfectly track human grades, so we can evaluate many models quickly.

— Secret Sauce — 🍞 Hook: The magic is not one trick—it’s how the pieces fit together.

🥬 The Concept:

What it is: Disentangle reasoning vs. observation, atomize facts, and verify with two-sided checks—then validate the grader against humans.
How it works:
1. Fact-level granularity exposes subtle hallucinations.
2. Modality+timestamp citations prevent hand-wavy grounding.
3. Precision+recall balance punishes both missing and extra evidence.
Why it matters: It makes faithfulness measurable, not just a vibe.

🍞 Anchor: When a model sounds confident but cites the wrong second, the pipeline catches it and lowers the score appropriately.

04Experiments & Results

🍞 Hook: If two students both get the right answer, but only one shows the exact moments in the video and audio that prove each step, who do you trust more?

🥬 The Concept (The test):

What it is: Evaluate models on two challenging datasets—WorldSense (recognition-heavy video+audio scenes) and Video-MMMU (reasoning over visuals like plots, plus audio)—using Coverage, Attribution (Precision, Recall, F1), MURGAT-SCORE, and Answer Accuracy.
How it works:
1. Prompt models to produce step-by-step answers with citations.
2. Score with the automatic pipeline validated against human judgments.
3. Compare base answers, answers with citations, and post-hoc attribution.
Why it matters: We measure not just correctness, but verifiable proof quality.

🍞 Anchor: Think of scores like report cards: high accuracy is like a good final answer, while high MURGAT-SCORE means each important sentence is properly proven.

— Competition & Models — We tested state-of-the-art MLLMs: Gemini-2.5-Flash, Gemini-3-Flash, Gemini-3-Pro, Qwen-3-Omni-Instruct, Qwen-3-Omni-Thinking, and vision-only baselines (Qwen-3-VL variants, Molmo2).

— Scoreboard with Context —

Human-annotated sample (20 items): Even strong models struggled with attribution. For example, on WorldSense, Gemini-2.5-Flash reached Coverage ≈ 85% and Attribution F1 ≈ 62.6%, yielding MURGAT-S ≈ 59.9, while on Video-MMMU it fell to MURGAT-S ≈ 21.8 despite reasonable QA accuracy.
Automatic large-scale runs: On WorldSense, the best MURGAT-S reached about 69.2 (Gemini-3-Flash with post-hoc attribution). On Video-MMMU, the best MURGAT-S was about 56.9 (Gemini-3-Flash with citations). Interpretation: Coverage is often high (models attach citations), but attribution quality is the bottleneck—citations frequently don’t exactly prove the facts.

— Surprising Findings —

Correct ≠ Trustworthy: Some high-accuracy models still hallucinate citations or point to the wrong seconds. Example: On Video-MMMU, two models can tie on accuracy, but one has much lower MURGAT-S, revealing weaker evidence.
Citations can be a ‘reasoning tax’ on easy tasks: On WorldSense (recognition-heavy), forcing citations slightly drops accuracy for some models, likely due to extra formatting and grounding overhead.
Citations help on hard reasoning: On Video-MMMU, adding citations often improves accuracy by structuring the model’s thinking.
Post-hoc attribution helps recognition, hurts deduction: Adding citations after generation boosts grounding on easy perceptual tasks but can mis-assign evidence on complex reasoning (e.g., “citation salad”).
Vision-only models hallucinate audio: Some vision-language models, which cannot hear, still output audio citations—revealing modality-mismatch hallucinations.
More thinking isn’t always better: Scaling “thinking effort” helps some models align reasoning with evidence (e.g., Gemini-3-Pro on WorldSense) but can hurt others (Gemini-3-Flash shows MURGAT-S drop at high effort), suggesting misalignment between internal chains and external proof.
Program-aided grounding trade-off: Logic- or narrative-style planning with retrieval tools improved MURGAT-S by around +9.6 points on average, but tended to reduce answer accuracy (~−7.4 points), highlighting a precision–flexibility tension.

🍞 Anchor: Two models both say “Option B,” but one can take you to audio 0:42–0:46 for the definition and visual 1:16 for the plot; the other cites the wrong times. Their answer accuracy is the same, yet only one earns your trust.

05Discussion & Limitations

🍞 Hook: Imagine choosing between a careful scientist who proves every step but works slowly, and a fast problem-solver who sometimes waves their hands. Which do you want your AI to be?

🥬 The Concept (Honest assessment):

Limitations (what it can’t do):
1. If evidence is off-screen (e.g., common knowledge) rather than in the inputs, it’s not scored—this benchmark focuses on observable grounding.
2. Precise timestamping remains hard; small temporal slips can lower scores even when the general idea is right.
3. Audio understanding is tricky; models (especially vision-only) may hallucinate audio citations.
4. Program-aided methods can over-constrain reasoning, reducing final accuracy on complex tasks.
Required resources: • Multimodal inputs (video+audio+figures), capable MLLMs, and compute to run multi-step judging. • For best auto-grading fidelity, a combination of specialized “judge” models per subtask.
When not to use: • Purely text-only question answering without observable sources. • Tasks where evidence is not time-localized or cannot be cited (e.g., broad domain knowledge without media).
Open questions:
1. How to align internal chain-of-thought with external citations so added structure boosts both accuracy and faithfulness?
2. Can we train models to prefer minimal, sufficient evidence sets by design (precision-first training)?
3. How to robustly handle audio nuances (overlapping speech, noise) and visually dense plots or diagrams?
4. How to prevent modality-mismatch hallucinations (e.g., audio citations from vision-only models)?

🍞 Anchor: The next generation system we want is like a student who both solves the problem and pins each step to the exact line in the lab log or the video frame—no fluff, no guesses.

06Conclusion & Future Work

🍞 Hook: Picture an AI that not only answers your question but lets you click straight to the proof in the video or audio—every time.

🥬 The Concept (3-sentence summary): This paper introduces MURGAT, a benchmark that checks whether each small, checkable fact in an AI’s explanation is proven by exact multimodal citations. It splits evaluation into finding verifiable sentences, decomposing them into atomic facts, and testing whether the cited segments are sufficient and necessary, then rolls it up into MURGAT-SCORE (coverage × attribution). A validated automatic pipeline strongly matches human judgment, revealing that models often reason correctly but fail to ground their claims.

Main achievement: Making fact-level multimodal attribution measurable and scalable, with high human correlation (near-perfect for coverage).
Future directions: Train models to align internal reasoning with external evidence, improve audio/plot grounding, and design program-aided methods that keep both accuracy and faithfulness high.
Why remember this: It shifts the goal from “sounds right” to “proven right,” moving multimodal AI toward trustworthy, auditable explanations.

🍞 Anchor: The next time an AI explains a physics video, you’ll be able to jump to the exact second where each claim is shown or said—like clickable proof for every sentence.

Practical Applications

•Educational video tutors that cite the exact seconds where each concept is defined or shown.
•Fact-checked lecture summaries with per-sentence visual/audio timestamps for quick verification.
•STEM homework helpers that ground graph readings and definitions to specific frames and narration.
•News verification tools that link claims to exact moments in press briefings or recorded events.
•Corporate training QA that checks if safety procedures cited by the model actually appear in the training video.
•Scientific video abstracts where each reported measurement links to the precise plotted region and narration time.
•Customer support agents that reference exact demo-video segments when explaining how to fix a device.
•Compliance audits that demand per-claim evidence from instructional media before approval.
•Video search interfaces that return not just clips for an answer, but per-fact timestamps inside the clip.
•Research benchmarks for training models to minimize citation hallucinations and modality mismatches.

Version: 1

Key Summary

Why This Research Matters

Detailed Explanation

01Background & Problem Definition

02Core Idea

03Methodology

04Experiments & Results

05Discussion & Limitations

06Conclusion & Future Work

Practical Applications

Notes