MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
Key Summary
- Multimodal AI models can mix up what they see and what they hear, making things up across senses; this is called cross-modal hallucination.
- The paper introduces MAD (Modality-Adaptive Decoding), a training-free way to ask the model which senses matter for a question and to use that answer during decoding.
- MAD turns the model's self-assessed modality choice into weights that guide how strongly to use or suppress vision and audio while generating each token.
- It extends contrastive decoding by computing four branches (both clean, vision-perturbed, audio-perturbed, both-perturbed) and fusing them with adaptive weights.
- On two tough benchmarks (CMM and AVHBench), MAD reduces cross-modal hallucinations and improves overall accuracy without retraining the model.
- Gains are strong across models (e.g., VideoLLaMA2-AV and Qwen2.5-Omni), including up to +12% in dominance categories and around 81% overall accuracy.
- MAD is robust to prompt wording, needs only one extra modality query step, and keeps decoding speed close to other contrastive methods.
- Ablations show that adaptive weighting beats uniform or winner-take-all strategies, and that using three weights (audio, video, both) works best.
- MAD slightly boosts general audio-visual QA too, hinting that it improves grounding even beyond hallucination-focused tests.
Why This Research Matters
Multimodal AI is moving into real-life settings where mixing up senses can lead to serious mistakes. MAD makes models ask, "Which senses matter here?" and then follow that guidance, so answers are grounded in the right evidence. This reduces made-up sounds or visuals, improves trust, and supports safer use in education, accessibility, and safety monitoring. Because MAD is training-free, it can upgrade existing systems without costly retraining. Its robustness to prompt wording and modest compute overhead make it practical. By improving general QA too, MAD suggests better grounding even beyond hallucination tests.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're watching a cartoon with the volume off. You can guess what's happening by looking, but you might still be wrong about the sounds. Now imagine turning on a podcast with your eyes closed. You can hear the story, but you can't see who is speaking. Using both together is powerful, but only if you know when to trust which one.
Concept 1 – Multimodal Input
- What it is: A multimodal input is when an AI uses more than one sense at the same time, like video (seeing), audio (hearing), and text (reading a question).
- How it works:
- The video gets turned into visual tokens (like tiny picture pieces).
- The audio gets turned into audio tokens (like little chunks of sound info).
- The text question becomes text tokens (words turned into numbers).
- The model mixes these tokens to answer the question.
- Why it matters: Without multiple senses, the model misses important clues (like hearing a siren it can't see). Anchor: If you ask, "What is the person saying?" the audio is key. If you ask, "What color is the car?" the video is key.
Concept 2 – Hallucinations
- What it is: A hallucination is when an AI says something that isn't supported by the input, like describing a cat that isn't there.
- How it works:
- The model guesses using patterns it learned.
- If it leans too much on guesses (priors) instead of evidence, it makes things up.
- Weak grounding (not checking the input) leads to wrong tokens in the answer.
- Why it matters: Hallucinations make answers unreliable. Anchor: If asked "What's on the table?" and the model always says "a cup" even when the table is empty, that's a hallucination.
Concept 3 – Cross-Modal Hallucinations
- What it is: Cross-modal hallucination is when one sense (like vision) wrongly influences what the model says about another sense (like audio), or vice versa.
- How it works:
- The model sees something (like a hammer) and assumes a matching sound ("hammer hitting").
- Or it hears cheering and assumes visuals like "crowd waving," even when not shown.
- The problem is poor control over how senses influence each other.
- Why it matters: It causes made-up details that sound plausible but aren't in the input. Anchor: The video shows a silent concert crowd, but the model says, "You can hear loud music." That's vision causing an audio mistake.
The World Before: Multimodal Large Language Models (MLLMs) became good at understanding videos with sound and answering questions about them. But they often mixed signals, especially when one modality was strong and tempting (like vivid visuals). Prior work reduced single-modality hallucinations (mostly visual), often by comparing a normal input with a corrupted one and down-weighting language-only guesses.
The Problem: In real life, tasks vary: some require only hearing, some only seeing, and some both. However, many decoding fixes treated all inputs the same way, using uniform settings regardless of the question. That meant the model couldn't decide, "For this question, trust audio more," or "Ignore the sound here; it's irrelevant."
Failed Attempts:
- Visual Contrastive Decoding (VCD): helped for images/videos by subtracting a vision-corrupted branch, but assumed vision was always the main problem.
- AVCD: extended contrastive ideas to audio+video but still used uniform, non-adaptive distortions, not tuned to the specific question.
- Other decoding tweaks: helpful for general hallucinations, but not designed to control how modalities influence each other per task.
The Gap: Models lacked modality-appropriateness judgment: the skill to decide which senses matter for each question and by how much.
Real Stakes: Think of assistive tech describing videos to someone with low vision; making up sounds or visuals can mislead. In safety videos, wrongly hearing a "gunshot" or seeing a "fire" that isn't there could cause panic. In classrooms, a tutoring system that guesses from text patterns instead of real video/audio might teach the wrong thing. We need models that can honestly say, "I should trust audio more here," or "Use both fairly," to be accurate and safe.
02 Core Idea
Hook: You know how a chef decides which ingredient should shine (chocolate in brownies, tomatoes in pasta) so the dish tastes right? The chef doesn't treat all ingredients the same for every recipe.
Concept 4 – Contrastive Decoding
- What it is: Contrastive decoding compares two versions of the model's view, one with full, clean input and one with a deliberately disturbed version, to highlight what's truly supported by the input.
- How it works:
- Run the model on the clean input.
- Run it again with a modality perturbed (e.g., blur video or mute audio).
- Compare token scores; down-weight tokens that don't change much (likely ungrounded guesses).
- Generate using the "clean minus disturbed" signal to reduce made-up parts.
- Why it matters: It suppresses tokens that are not tied to the evidence, lowering hallucinations. Anchor: If the answer still says "barking dog" after muting audio, it's likely a guess; contrastive decoding will push that guess down.
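To make that comparison concrete, here is a minimal sketch of the standard two-branch contrast (the style popularized by VCD); the toy logits and the strength value gamma are invented for illustration and are not numbers from the paper.

```python
import numpy as np

def contrastive_logits(clean, perturbed, gamma=1.0):
    """Two-branch contrastive decoding: amplify what changes when the input
    is disturbed, and subtract what stays the same (likely ungrounded)."""
    return (1.0 + gamma) * clean - gamma * perturbed

# Toy vocabulary ["Yes", "No"] for "Did you hear a dog barking?" on a silent clip.
clean_logits = np.array([2.0, 1.5])   # full input: the model leans "Yes"
muted_logits = np.array([2.0, 0.5])   # audio muted: "Yes" stays high, so it is a prior, not evidence
print(contrastive_logits(clean_logits, muted_logits))  # roughly [2.0, 2.5]: "No" now wins
```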
Concept 5 – Self-Assessment Mechanism
- What it is: The model briefly asks itself, "Which modality do I need (audio, video, or both) to answer this question?"
- How it works:
- Add a tiny query prompt like: "Which modality is needed (audio, video, both)?"
- The model returns scores for audio, video, and both.
- Convert them into probabilities (weights) that sum to 1.
- Why it matters: Without self-assessment, the model treats senses uniformly, causing cross-modal mix-ups. Anchor: For "What sound is heard?", the model gives audio a high weight; for "What color is the car?", video gets the high weight; for "Do the hands move with the beat?", "both" gets the highest.
Concept 6 – Task-Specific Modality Weights
- What it is: These are the importance numbers (weights) for audio, video, and both, tailored to the current question.
- How it works:
- Take the self-assessed scores for audio, video, both.
- Softmax them into probabilities between 0 and 1.
- Use these as knobs to turn each contrastive branch up or down.
- Why it matters: If the question is audio-only, we shouldn't punish audio; we should suppress irrelevant visual influence instead. Anchor: If w_audio = 0.7, w_video = 0.1, w_both = 0.2, decoding will lean on audio evidence.
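A minimal sketch of the softmax step just described; the three raw scores below are invented for illustration.

```python
import numpy as np

def modality_weights(score_audio, score_video, score_both):
    """Softmax three raw modality scores into weights that sum to 1."""
    scores = np.array([score_audio, score_video, score_both], dtype=float)
    exp = np.exp(scores - scores.max())          # subtract the max for numerical stability
    return exp / exp.sum()

# "What sound is heard?" -> hypothetical scores that favor audio.
w_audio, w_video, w_both = modality_weights(3.1, 0.4, 1.2)
print(w_audio.round(2), w_video.round(2), w_both.round(2))   # about 0.82 0.06 0.12
```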
Concept 7 – Modality-Adaptive Decoding (MAD)
- What it is: MAD is a training-free way to link those task-specific weights to contrastive decoding, so the model emphasizes the right senses and quiets the wrong ones while generating every token.
- How it works:
- Ask the modality question (self-assessment) to get weights for audio, video, and both.
- Compute four contrastive branches: clean AV, video-perturbed, audio-perturbed, both-perturbed.
- Fuse the branches using the weights, so relevant modalities speak louder and irrelevant ones whisper.
- Pick the next token from this weighted, contrast-enhanced mix.
- Why it matters: Without MAD, strong visuals can force fake sounds, or strong audio can invent visuals. MAD puts the right sense in charge for each question. Anchor: The video shows a hammer but there is no sound. With MAD, when asked "Did you hear hammer hits?", the model prioritizes audio, answers "No," and avoids inventing a sound.
The "Aha!" Moment (one sentence): If the model first asks itself which senses matter for this question and then decodes by weighting each sense accordingly, cross-modal hallucinations drop sharply without any retraining.
Three Analogies:
- Volume Mixer: Like adjusting the sliders for music and vocals depending on the song part: turn up vocals for lyrics, turn up instruments for solos.
- Sports Replay: Use slow-motion video for a disputed catch (video weight high), use crowd sound for momentum shifts (audio weight high), or use both for timing plays (both weight high).
- Detective Work: Trust the security camera for what happened (video), the witness's voice for what was said (audio), or both to match lips and words.
Before vs After:
- Before: One-size-fits-all decoding; strong modalities could bully weaker ones, causing made-up details.
- After: Question-aware decoding; the right modality (or both) leads, and irrelevant influence is suppressed.
Why It Works (intuition): Contrastive decoding highlights what is grounded in the input. MAD adds the missing piece: deciding which sense should carry that grounding for the current question. By weighting branches based on the model's own self-assessed needs, MAD reduces wrong cross-signal "bleed-through."
Building Blocks:
- A modality query prompt to get weights (self-assessment).
- A four-branch contrastive setup (clean, audio-perturbed, video-perturbed, both-perturbed).
- A simple formula that multiplies each branch by its task-specific weight, then fuses them for token selection.
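One plausible way to write that last building block as a formula, in notation of my own (ℓ for next-token logits, γ for the shared contrast strength); the paper's exact equation may be arranged differently:

$$
\tilde{\ell}(y) \;=\; \ell_{\text{clean}}(y) \;+\; \gamma\Big[\, w_{\text{audio}}\big(\ell_{\text{clean}}(y)-\ell_{\text{audio-pert}}(y)\big) \;+\; w_{\text{video}}\big(\ell_{\text{clean}}(y)-\ell_{\text{video-pert}}(y)\big) \;+\; w_{\text{both}}\big(\ell_{\text{clean}}(y)-\ell_{\text{both-pert}}(y)\big) \Big]
$$

The next token is then chosen from a softmax over these fused logits, so the heavily weighted modalities contribute the strongest contrast.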
03 Methodology
High-Level Overview: Input (Video, Audio, Question) → Step A: Self-Assess Needed Modalities (weights for audio, video, both) → Step B: Build Four Contrastive Branches (clean AV, video-perturbed, audio-perturbed, both-perturbed) → Step C: Weight-and-Fuse Branches (using the self-assessed weights) → Output (hallucination-reduced answer)
Concept 8 – Modality Query Prompt
- What it is: A tiny extra instruction asking the model which modalities are needed.
- How it works:
- Add a short question like: "To answer this, which modality is needed: audio, video, or both?"
- The model predicts token scores for "audio", "video", "both".
- Turn these scores into probabilities (weights).
- Why it matters: This gives a task-specific compass for decoding. Anchor: Question: "What instrument is playing?" Prompt says: "Pick modality." Model picks "audio" with highest weight.
Step A: Self-Assessment and Weight Extraction
- What happens: We run the model once with the modality query to get three numbers: w_audio, w_video, w_both (they sum to 1). These reflect how important each sense is for this specific question.
- Why it exists: Without it, decoding treats all senses the same, inviting cross-modal interference.
- Example: For "Is the folded paper white?" → w_video ≈ 0.8, w_audio ≈ 0.1, w_both ≈ 0.1. For "Can you hear seagulls?" → w_audio ≈ 0.6.
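A rough sketch of Step A, assuming a HuggingFace-style interface where the audio-visual model exposes next-token logits; `model`, `tokenizer`, the prompt wording, and the `av_inputs` keyword arguments are all placeholders, since real omni models each have their own input pipeline.

```python
import torch

MODALITY_QUERY = ("To answer the question below, which modality is needed: "
                  "audio, video, or both?\nQuestion: {q}\nAnswer:")

def self_assess(model, tokenizer, question, av_inputs):
    """Run the modality query once and softmax the logits of the candidate
    answer words into (w_audio, w_video, w_both)."""
    ids = tokenizer(MODALITY_QUERY.format(q=question), return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = model(input_ids=ids, **av_inputs).logits[0, -1]   # logits at the last position
    # First sub-token id of each candidate word (a simplification for the sketch).
    cand_ids = [tokenizer(w, add_special_tokens=False).input_ids[0]
                for w in ("audio", "video", "both")]
    w = torch.softmax(next_logits[cand_ids], dim=-1)
    return {"audio": w[0].item(), "video": w[1].item(), "both": w[2].item()}
```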
Concept 9 – Perturbation (Degraded Input)
- What it is: A harmless "mess-up" applied to a modality (like blurring video or muting audio) to test whether a token truly depends on that modality.
- How it works:
- Create versions where video is perturbed (e.g., masked/blurred) and/or audio is perturbed (e.g., reduced or masked).
- Run the model on the clean input and each perturbed version.
- Compare their token scores to spot evidence-dependent tokens.
- Why it matters: If a token's score barely changes when a modality is perturbed, it likely wasn't grounded in that modality. Anchor: If the token "engine revving" stays high when audio is muted, it may be a visual guess; contrastive decoding will down-weight it.
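Two simple perturbations, sketched with plain NumPy; the exact corruptions (noise strength, masking scheme) are illustrative choices of mine, not necessarily the ones used in the paper.

```python
import numpy as np

def perturb_video(frames, noise_std=0.5, seed=0):
    """Drown video frames (T, H, W, C), values in [0, 1], in Gaussian noise."""
    rng = np.random.default_rng(seed)
    return np.clip(frames + rng.normal(0.0, noise_std, frames.shape), 0.0, 1.0)

def perturb_audio(waveform, keep_ratio=0.0):
    """Silence (keep_ratio=0) or strongly attenuate the audio waveform."""
    return waveform * keep_ratio
```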
Concept 10 – Logits (Pre-Softmax Scores)
- What it is: Logits are the raw scores the model assigns to each possible next word before turning them into probabilities.
- How it works:
- For each branch (clean and perturbed), the model outputs a vector of logits for the next token.
- We combine these logits using weights to build a final, grounded score.
- The top-scoring token is chosen as the next word.
- Why it matters: Working at the logit level lets us nudge the model toward evidence-backed tokens. Anchor: If "No" has a higher final logit than "Yes," the model answers "No."
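A tiny illustration of picking the next token at the logit level; the numbers are invented.

```python
import numpy as np

fused = {"Yes": 1.3, "No": 2.1}                   # fused logits for two candidate tokens
probs = np.exp(list(fused.values()))
probs /= probs.sum()                               # softmax turns logits into probabilities
print(max(fused, key=fused.get), probs.round(2))   # No [0.31 0.69]
```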
Step B: Build Four Contrastive Branches
- What happens: For each decoding step, compute logits for:
- Clean audio + clean video (the full, normal input).
- Video-perturbed + clean audio (tests reliance on vision).
- Audio-perturbed + clean video (tests reliance on audio).
- Both-perturbed (tests reliance on either when none is reliable).
- Why it exists: Each branch reveals how much a token depends on a particular modality. This exposes ungrounded guesses.
- Example with data: Question: "Did you hear a hammer hitting?"
- Clean AV suggests "Yes" (tempted by seeing a hammer).
- If the answer were truly grounded in sound, the "Yes" score would drop a lot in the audio-perturbed branch.
- If the clean-vs-audio-perturbed gap is small, that "Yes" is likely a vision-driven guess, and the method will suppress it.
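Step B as a sketch, assuming a hypothetical `forward_logits(video, audio, question)` helper that returns the model's next-token logits for one input combination, plus the perturbation helpers sketched earlier.

```python
def four_branch_logits(video, audio, question, forward_logits,
                       perturb_video, perturb_audio):
    """Compute next-token logits for the four contrastive branches."""
    v_bad, a_bad = perturb_video(video), perturb_audio(audio)
    return {
        "clean":   forward_logits(video, audio, question),   # full, normal input
        "v_pert":  forward_logits(v_bad,  audio, question),  # how much does vision matter?
        "a_pert":  forward_logits(video,  a_bad, question),  # how much does audio matter?
        "av_pert": forward_logits(v_bad,  a_bad, question),  # neither modality reliable
    }
```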
Step C: Weight-and-Fuse Branches (MAD)
- What happens:
- Use w_audio, w_video, w_both as dials to scale the contrastive power of each branch.
- Heavier weight → stronger penalty on ungrounded tokens in that modality.
- Fuse the adjusted logits to get the final score for the next token.
- Why it exists: Different questions need different senses. The weights make decoding question-aware.
- Example: For "What instrument is playing?", w_audio is high. The audio-perturbed branch gets more influence, so tokens not supported by audio get pushed down.
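Step C in code, mirroring the formula sketched at the end of the Core Idea section; this is one plausible instantiation of the fusion rather than the paper's verbatim equation, with γ = 2.5 used because the authors report that value working well in general.

```python
def mad_fuse(branches, weights, gamma=2.5):
    """Weight each contrastive branch by its task-specific modality weight,
    then fuse everything back into a single logit vector."""
    clean = branches["clean"]
    contrast = (weights["audio"] * (clean - branches["a_pert"])      # audio grounding
                + weights["video"] * (clean - branches["v_pert"])    # visual grounding
                + weights["both"]  * (clean - branches["av_pert"]))  # joint grounding
    return clean + gamma * contrast
```

With w_audio high, tokens whose scores survive audio perturbation contribute little contrast, so audio-grounded tokens pull ahead of ungrounded guesses.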
Putting It All Together (like a recipe):
- Take inputs: video frames, audio waveform, and the question.
- Ask the modality query to get three weights.
- For each next word:
- Compute logits on clean AV.
- Compute logits with video perturbed.
- Compute logits with audio perturbed.
- Compute logits with both perturbed.
- Weight and fuse them using the three weights and a shared "strength" knob (γ).
- Pick the token with the highest fused score.
- Repeat until the answer is complete.
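Pulling the recipe into one greedy decoding loop; the `steps` bundle stands in for the helpers sketched above, and every interface here is illustrative rather than the paper's actual code (the growing answer prefix would normally be fed back into each forward pass).

```python
def mad_generate(video, audio, question, steps, max_new_tokens=64, gamma=2.5):
    """Greedy MAD decoding sketch. `steps` is assumed to provide:
    self_assess(question) -> {"audio": w_a, "video": w_v, "both": w_b}
    branch_logits(video, audio, question, prefix) -> four next-token logit arrays
    fuse(branches, weights, gamma) -> fused logit array
    decode(ids) -> text, plus an eos_id attribute."""
    weights = steps.self_assess(question)                 # Step A: one extra query
    prefix = []                                           # tokens generated so far
    for _ in range(max_new_tokens):
        branches = steps.branch_logits(video, audio, question, prefix)  # Step B
        fused = steps.fuse(branches, weights, gamma)                    # Step C
        next_id = int(fused.argmax())                                   # pick the top token
        if next_id == steps.eos_id:                                     # stop at end of answer
            break
        prefix.append(next_id)
    return steps.decode(prefix)
```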
The Secret Sauce:
- Not just contrasting, but adapting contrast strength to the task.
- Using a "both" weight (w_both) to handle questions that truly need joint reasoning, so we don't throw away helpful cross-signal cues.
- Training-free: no new data, no model updates, just smarter inference.
Why Step A, B, C are all needed:
- Without Step A (weights), we can't be task-aware.
- Without Step B (four branches), we can't tell what's grounded in each modality.
- Without Step C (weighted fusion), we can't suppress the wrong influence while preserving the right one.
Concrete Walkthrough:
- Input: Video shows a hammer but the clip is silent. Question: "Did you hear hammer hits?"
- Step A: Weights come out as w_audio high, w_video low, w_both moderate.
- Step B: Clean AV leans toward "Yes" (vision tempts the model). Audio-perturbed shows little change (no true audio evidence). The video-perturbed branch gets little weight because audio matters most here.
- Step C: Weighted fusion dampens the visually-driven "Yes," boosting the correct "No." The model answers: "No."
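The same walkthrough with toy numbers (all invented) plugged into the fusion sketch; the vocabulary is just ["Yes", "No"] and γ = 2.5.

```python
import numpy as np

clean   = np.array([2.0, 1.6])   # clean AV: seeing the hammer tempts a "Yes"
a_pert  = np.array([2.0, 1.0])   # audio muted: "Yes" doesn't budge, so it isn't audio-grounded
v_pert  = np.array([0.8, 1.5])   # video blurred: the temptation disappears
av_pert = np.array([0.5, 0.5])   # both perturbed: no preference left
w = {"audio": 0.7, "video": 0.1, "both": 0.2}   # Step A: audio matters most

contrast = (w["audio"] * (clean - a_pert)
            + w["video"] * (clean - v_pert)
            + w["both"]  * (clean - av_pert))
fused = clean + 2.5 * contrast
print(fused)   # about [3.05, 3.23]: "No" now outscores "Yes"
```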
04 Experiments & Results
The Test: The authors used two specialized benchmarks that try to trick models into mixing senses:
- CMM (Curse of Multi-Modalities): Checks whether the model is overly dominated by one sense (visual, audio, or language priors) and measures overall accuracy across categories.
- AVHBench: Directly tests video-driven audio hallucinations and audio-driven video hallucinations. They also checked regular audio-visual QA sets (OmniBench, Worldsense, MUSIC-AVQA) to see if MAD helps beyond hallucination-specific tests.
The Competition (Baselines):
- Base decoding: The model's normal way of generating.
- VCD-Extended: Applies visual-style contrastive decoding to all modalities uniformly.
- AVCD: Contrastive decoding for audio+video but with uniform, non-adaptive settings.
The Scoreboard (with context):
- On CMM:
- VideoLLaMA2-AV with MAD: Overall accuracy 81.3% (vs. 73.5% base). That's like raising a solid C to a strong B+/A-.
- Big category boosts: Visual dominance +9.3%, Language dominance +5.5% for VideoLLaMA2-AV.
- Qwen2.5-Omni with MAD: Overall accuracy 81.4% (vs. 72.7% base). Even larger category jumps: visual dominance +12.3%, audio dominance +12.0%.
- On AVHBench:
- Video-driven audio hallucination improved by about +4.0% for VideoLLaMA2-AV and +5.7% for Qwen2.5-Omni.
- Audio-driven video hallucination also improved (e.g., +3.7% for Qwen2.5-Omni).
- Across both datasets and both models, MAD beat VCD-Extended and AVCD, showing that adaptive, question-aware weighting outperforms uniform approaches.
Surprising/Useful Findings:
- Adaptive beats uniform: An ablation compared three fusions on CMM with VideoLLaMA2-AV:
- Uniform weights (treat all senses equally): Overall ~79.4%.
- Argmax (pick only the single best sense): ~78.7%.
- Weighted (MAD): ~81.3% (best). Softly mixing all branches with the right weights balances evidence without throwing away helpful signals.
- Each weight matters: Removing any one of w_audio, w_video, or w_both hurt performance. Using all three gave the best overall accuracy. Especially, w_both helps when joint reasoning is needed.
- Weights match intuition: On a custom study (100 videos, 300 questions), visual questions got high video weights, audio questions got high audio weights, and audio-visual questions got high "both." This shows the self-assessment step is sensible.
- Prompt robustness: Changing the wording of the modality query barely changed results (tiny standard deviations), so MAD isn't fragile to prompt phrasing.
- General AVQA: On OmniBench, Worldsense, and MUSIC-AVQA, MAD was comparable or slightly better than base (e.g., +1.0% on MUSIC-AVQA), hinting at better grounding overall.
- Efficiency: MAD's decoding speed was similar to other contrastive methods; only a small overhead for the extra query and branches.
Contextualizing the Numbers:
- Think of accuracy like grades: many baselines hover in the 70s (C range). MAD lifts them into the low 80s (B to B+), while cutting down the noisiest mistakes where one sense bullies the other.
- In the dominance categories, jumps of 9–12 percentage points are big: it means the model is much less likely to be tricked by tempting-but-irrelevant signals.
Takeaway: Asking the model which senses matter first, then steering decoding with those weights, consistently outperforms one-size-fits-all contrastive decoding across models and datasets.
05 Discussion & Limitations
Limitations:
- Reliance on self-assessment: If the model answers the modality query poorly, the weights may be off, and decoding could under- or over-suppress a modality.
- Audio+video only: The paper focuses on two modalities; adding more (e.g., depth, thermal, sensors) needs careful extension and more branches.
- Quality of perturbations: If the perturbation doesn't effectively disrupt a modality, contrastive signals get weaker and less informative.
- Extra compute vs. plain decoding: While close to other contrastive methods, MAD is more expensive than vanilla decoding because it runs multiple branches.
Required Resources:
- An audio-visual LLM that can accept a short modality query.
- The ability to run multiple forward passes per token (clean and perturbed branches).
- Modality perturbation tools (e.g., image masking/blurring, audio masking).
- A small hyperparameter search for the contrastive strength γ (the paper found 2.5 works well generally).
When NOT to Use:
- If the question is clearly single-modality and the base model already performs near-perfectly, MAD may add overhead for little gain.
- In ultra-low-latency settings where any extra decoding cost is unacceptable.
- If the model is very weak at following prompts, since poor self-assessment can reduce MAD's benefit.
Open Questions:
- Can we learn a tiny, fast weight-predictor to avoid the self-assessment prompt and speed things up?
- How does MAD generalize to more than two modalities (e.g., text+image+audio+video+sensor data) without exploding compute?
- Can better, principled perturbations boost the contrast signal further?
- Is there a way to calibrate or sanity-check the weights mid-generation and adjust on the fly?
- How do we best detect when the model is overconfident in the wrong modality and auto-correct weights safely?
06 Conclusion & Future Work
3-Sentence Summary: Cross-modal hallucinations happen when one sense (like vision) wrongly influences what the model says about another (like audio). MAD fixes this by first asking the model which senses matter for the current question and then decoding with contrastive branches weighted by that self-assessment. This training-free, question-aware approach reduces hallucinations and improves accuracy across strong audio-visual LLMs and benchmarks.
Main Achievement: Showing that explicit, self-assessed modality weights, plugged into a multi-branch contrastive decoder, are the missing piece to robust, per-question control of modality influence.
Future Directions: Build a lightweight learned module to predict modality weights without a prompt; extend MAD to more modalities (e.g., thermal+RGB, sensors); design smarter perturbations; and adapt weights dynamically during generation.
Why Remember This: It turns a simple idea ("ask which senses matter, then listen to them") into a practical, training-free tool that makes multimodal models more trustworthy. In a world full of mixed signals (sounds, sights, texts), MAD teaches models to choose the right sense at the right time.
Practical Applications
- Video accessibility tools that give accurate audio and visual descriptions without inventing details.
- Customer support agents that analyze product demo videos and avoid guessing unheard sounds or unseen visuals.
- Classroom tutoring systems that explain lab videos while correctly emphasizing the right modality for each question.
- Content moderation that checks claims about a clip's audio or visuals without cross-sense contamination.
- Sports highlight analysis that trusts visuals for plays and audio for crowd or whistle cues appropriately.
- Surveillance review assistants that avoid inventing alarms or hazards when only one modality supports them.
- Medical training videos where the model distinguishes visual findings from auscultation-like audio cues.
- Robotics logs analysis that separates what was seen from what was heard to avoid wrong inferences.
- News verification tools that check whether a claimed sound or sight truly appears in a clip.
- Creative editing assistants that describe footage faithfully, preventing invented background sounds or actions.