MAD: Modality-Adaptive Decoding for Mitigating Cross-Modal Hallucinations in Multimodal Large Language Models
Key Summary
- Multimodal AI models can mix up what they see and what they hear, making things up across senses; this is called cross-modal hallucination.
- The paper introduces MAD (Modality-Adaptive Decoding), a training-free way to ask the model which senses matter for a question and to use that answer during decoding.
- MAD turns the model's self-assessed modality choice into weights that guide how strongly to use or suppress vision and audio while generating each token.
- It extends contrastive decoding by computing four branches (both clean, vision-perturbed, audio-perturbed, both-perturbed) and fusing them with adaptive weights.
- On two tough benchmarks (CMM and AVHBench), MAD reduces cross-modal hallucinations and improves overall accuracy without retraining the model.
- Gains are strong across models (e.g., VideoLLaMA2-AV and Qwen2.5-Omni), including up to +12% in dominance categories and around 81% overall accuracy.
- MAD is robust to prompt wording, needs only one extra modality query step, and keeps decoding speed close to other contrastive methods.
- Ablations show that adaptive weighting beats uniform or winner-take-all strategies, and that using three weights (audio, video, both) works best.
- MAD slightly boosts general audio-visual QA too, hinting that it improves grounding even beyond hallucination-focused tests.
Why This Research Matters
Multimodal AI is moving into real-life settings where mixing up senses can lead to serious mistakes. MAD makes models ask, "Which senses matter here?" and then follow that guidance, so answers are grounded in the right evidence. This reduces made-up sounds or visuals, improves trust, and supports safer use in education, accessibility, and safety monitoring. Because MAD is training-free, it can upgrade existing systems without costly retraining. Its robustness to prompt wording and modest compute overhead make it practical. By improving general QA too, MAD suggests better grounding even beyond hallucination tests.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're watching a cartoon with the volume off. You can guess what's happening by looking, but you might still be wrong about the sounds. Now imagine turning on a podcast with your eyes closed. You can hear the story, but you can't see who is speaking. Using both together is powerful, but only if you know when to trust which one.
Concept 1 – Multimodal Input
- What it is: A multimodal input is when an AI uses more than one sense at the same time, like video (seeing), audio (hearing), and text (reading a question).
- How it works:
- The video gets turned into visual tokens (like tiny picture pieces).
- The audio gets turned into audio tokens (like little chunks of sound info).
- The text question becomes text tokens (words turned into numbers).
- The model mixes these tokens to answer the question.
- Why it matters: Without multiple senses, the model misses important clues (like hearing a siren it can't see). Anchor: If you ask, "What is the person saying?" the audio is key. If you ask, "What color is the car?" the video is key.
Concept 2 – Hallucinations
- What it is: A hallucination is when an AI says something that isn't supported by the input, like describing a cat that isn't there.
- How it works:
- The model guesses using patterns it learned.
- If it leans too much on guesses (priors) instead of evidence, it makes things up.
- Weak grounding (not checking the input) leads to wrong tokens in the answer.
- Why it matters: Hallucinations make answers unreliable. Anchor: If asked "What's on the table?" and the model always says "a cup" even when the table is empty, that's a hallucination.
Concept 3 – Cross-Modal Hallucinations
- What it is: Cross-modal hallucination is when one sense (like vision) wrongly influences what the model says about another sense (like audio), or vice versa.
- How it works:
- The model sees something (like a hammer) and assumes a matching sound ("hammer hitting").
- Or it hears cheering and assumes visuals like "crowd waving," even when not shown.
- The problem is poor control over how senses influence each other.
- Why it matters: It causes made-up details that sound plausible but aren't in the input. Anchor: The video shows a silent concert crowd, but the model says, "You can hear loud music." That's vision causing an audio mistake.
The World Before: Multimodal Large Language Models (MLLMs) became good at understanding videos with sound and answering questions about them. But they often mixed signals, especially when one modality was strong and tempting (like vivid visuals). Prior work reduced single-modality hallucinations (mostly visual), often by comparing a normal input with a corrupted one and down-weighting language-only guesses.
The Problem: In real life, tasks vary: some require only hearing, some only seeing, and some both. However, many decoding fixes treated all inputs the same way, using uniform settings regardless of the question. That meant the model couldn't decide, "For this question, trust audio more," or "Ignore the sound here; it's irrelevant."
Failed Attempts:
- Visual Contrastive Decoding (VCD): helped for images/videos by subtracting a vision-corrupted branch, but assumed vision was always the main problem.
- AVCD: extended contrastive ideas to audio+video but still used uniform, non-adaptive distortions, not tuned to the specific question.
- Other decoding tweaks: helpful for general hallucinations, but not designed to control how modalities influence each other per task.
The Gap: Models lacked modality-appropriateness judgment: the skill to decide which senses matter for each question and by how much.
Real Stakes: Think of assistive tech describing videos to someone with low vision; making up sounds or visuals can mislead. In safety videos, wrongly hearing a "gunshot" or seeing a "fire" that isn't there could cause panic. In classrooms, a tutoring system that guesses from text patterns instead of real video/audio might teach the wrong thing. We need models that can honestly say, "I should trust audio more here," or "Use both fairly," to be accurate and safe.
02 Core Idea
Hook: You know how a chef decides which ingredient should shine (chocolate in brownies, tomatoes in pasta) so the dish tastes right? The chef doesn't treat all ingredients the same for every recipe.
Concept 4 – Contrastive Decoding
- What it is: Contrastive decoding compares two versions of the model's view, one with full, clean input and one with a deliberately disturbed version, to highlight what's truly supported by the input.
- How it works:
- Run the model on the clean input.
- Run it again with a modality perturbed (e.g., blur video or mute audio).
- Compare token scores; down-weight tokens that don't change much (likely ungrounded guesses).
- Generate using the "clean minus disturbed" signal to reduce made-up parts.
- Why it matters: It suppresses tokens that are not tied to the evidence, lowering hallucinations. Anchor: If the answer still says "barking dog" after muting audio, it's likely a guess; contrastive decoding will push that guess down.
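To make that comparison concrete, here is a minimal sketch of the standard two-branch contrast (the style popularized by VCD); the toy logits and the strength value gamma are invented for illustration and are not numbers from the paper.

```python
import numpy as np

def contrastive_logits(clean, perturbed, gamma=1.0):
    """Two-branch contrastive decoding: amplify what changes when the input
    is disturbed, and subtract what stays the same (likely ungrounded)."""
    return (1.0 + gamma) * clean - gamma * perturbed

# Toy vocabulary ["Yes", "No"] for "Did you hear a dog barking?" on a silent clip.
clean_logits = np.array([2.0, 1.5])   # full input: the model leans "Yes"
muted_logits = np.array([2.0, 0.5])   # audio muted: "Yes" stays high, so it is a prior, not evidence
print(contrastive_logits(clean_logits, muted_logits))  # roughly [2.0, 2.5]: "No" now wins
```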
Concept 5 – Self-Assessment Mechanism
- What it is: The model briefly asks itself, "Which modality do I need (audio, video, or both) to answer this question?"
- How it works:
- Add a tiny query prompt like: "Which modality is needed (audio, video, both)?"
- The model returns scores for audio, video, and both.
- Convert them into probabilities (weights) that sum to 1.
- Why it matters: Without self-assessment, the model treats senses uniformly, causing cross-modal mix-ups. Anchor: For "What sound is heard?", the model gives audio a high weight; for "What color is the car?", video gets the high weight; for "Do the hands move with the beat?", "both" gets the highest.
Concept 6 – Task-Specific Modality Weights
- What it is: These are the importance numbers (weights) for audio, video, and both, tailored to the current question.
- How it works:
- Take the self-assessed scores for audio, video, both.
- Softmax them into probabilities between 0 and 1.
- Use these as knobs to turn each contrastive branch up or down.
- Why it matters: If the question is audio-only, we shouldn't punish audio; we should suppress irrelevant visual influence instead. Anchor: If w_audio = 0.7, w_video = 0.1, w_both = 0.2, decoding will lean on audio evidence.
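A minimal sketch of the softmax step just described; the three raw scores below are invented for illustration.

```python
import numpy as np

def modality_weights(score_audio, score_video, score_both):
    """Softmax three raw modality scores into weights that sum to 1."""
    scores = np.array([score_audio, score_video, score_both], dtype=float)
    exp = np.exp(scores - scores.max())          # subtract the max for numerical stability
    return exp / exp.sum()

# "What sound is heard?" -> hypothetical scores that favor audio.
w_audio, w_video, w_both = modality_weights(3.1, 0.4, 1.2)
print(w_audio.round(2), w_video.round(2), w_both.round(2))   # about 0.82 0.06 0.12
```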
Concept 7 – Modality-Adaptive Decoding (MAD)
- What it is: MAD is a training-free way to link those task-specific weights to contrastive decoding, so the model emphasizes the right senses and quiets the wrong ones while generating every token.
- How it works:
- Ask the modality question (self-assessment) to get weights for audio, video, and both.
- Compute four contrastive branches: clean AV, video-perturbed, audio-perturbed, both-perturbed.
- Fuse the branches using the weights, so relevant modalities speak louder and irrelevant ones whisper.
- Pick the next token from this weighted, contrast-enhanced mix.
- Why it matters: Without MAD, strong visuals can force fake sounds, or strong audio can invent visuals. MAD puts the right sense in charge for each question. Anchor: The video shows a hammer but there is no sound. With MAD, when asked "Did you hear hammer hits?", the model prioritizes audio, answers "No," and avoids inventing a sound.
The "Aha!" Moment (one sentence): If the model first asks itself which senses matter for this question and then decodes by weighting each sense accordingly, cross-modal hallucinations drop sharply without any retraining.
Three Analogies:
- Volume Mixer: Like adjusting the sliders for music and vocals depending on the song part: turn up vocals for lyrics, turn up instruments for solos.
- Sports Replay: Use slow-motion video for a disputed catch (video weight high), use crowd sound for momentum shifts (audio weight high), or use both for timing plays (both weight high).
- Detective Work: Trust the security camera for what happened (video), the witness's voice for what was said (audio), or both to match lips and words.
Before vs After:
- Before: One-size-fits-all decoding; strong modalities could bully weaker ones, causing made-up details.
- After: Question-aware decoding; the right modality (or both) leads, and irrelevant influence is suppressed.
Why It Works (intuition): Contrastive decoding highlights what is grounded in the input. MAD adds the missing piece: deciding which sense should carry that grounding for the current question. By weighting branches based on the model's own self-assessed needs, MAD reduces wrong cross-signal "bleed-through."
Building Blocks:
- A modality query prompt to get weights (self-assessment).
- A four-branch contrastive setup (clean, audio-perturbed, video-perturbed, both-perturbed).
- A simple formula that multiplies each branch by its task-specific weight, then fuses them for token selection.
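One plausible way to write that last building block as a formula, in notation of my own (ℓ for next-token logits, γ for the shared contrast strength); the paper's exact equation may be arranged differently:

$$
\tilde{\ell}(y) \;=\; \ell_{\text{clean}}(y) \;+\; \gamma\Big[\, w_{\text{audio}}\big(\ell_{\text{clean}}(y)-\ell_{\text{audio-pert}}(y)\big) \;+\; w_{\text{video}}\big(\ell_{\text{clean}}(y)-\ell_{\text{video-pert}}(y)\big) \;+\; w_{\text{both}}\big(\ell_{\text{clean}}(y)-\ell_{\text{both-pert}}(y)\big) \Big]
$$

The next token is then chosen from a softmax over these fused logits, so the heavily weighted modalities contribute the strongest contrast.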
03 Methodology
High-Level Overview: Input (Video, Audio, Question) → Step A: Self-Assess Needed Modalities (weights for audio, video, both) → Step B: Build Four Contrastive Branches (clean AV, video-perturbed, audio-perturbed, both-perturbed) → Step C: Weight-and-Fuse Branches (using the self-assessed weights) → Output (hallucination-reduced answer)
Concept 8 – Modality Query Prompt
- What it is: A tiny extra instruction asking the model which modalities are needed.
- How it works:
- Add a short question like: "To answer this, which modality is needed: audio, video, or both?"
- The model predicts token scores for "audio", "video", "both".
- Turn these scores into probabilities (weights).
- Why it matters: This gives a task-specific compass for decoding. Anchor: Question: "What instrument is playing?" Prompt says: "Pick modality." Model picks "audio" with highest weight.
Step A: Self-Assessment and Weight Extraction
- What happens: We run the model once with the modality query to get three numbers: w_audio, w_video, w_both (they sum to 1). These reflect how important each sense is for this specific question.
- Why it exists: Without it, decoding treats all senses the same, inviting cross-modal interference.
- Example: For "Is the folded paper white?" → w_video ≈ 0.8, w_audio ≈ 0.1, w_both ≈ 0.1. For "Can you hear seagulls?" → w_audio ≈ 0.6.
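A rough sketch of Step A, assuming a HuggingFace-style interface where the audio-visual model exposes next-token logits; `model`, `tokenizer`, the prompt wording, and the `av_inputs` keyword arguments are all placeholders, since real omni models each have their own input pipeline.

```python
import torch

MODALITY_QUERY = ("To answer the question below, which modality is needed: "
                  "audio, video, or both?\nQuestion: {q}\nAnswer:")

def self_assess(model, tokenizer, question, av_inputs):
    """Run the modality query once and softmax the logits of the candidate
    answer words into (w_audio, w_video, w_both)."""
    ids = tokenizer(MODALITY_QUERY.format(q=question), return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = model(input_ids=ids, **av_inputs).logits[0, -1]   # logits at the last position
    # First sub-token id of each candidate word (a simplification for the sketch).
    cand_ids = [tokenizer(w, add_special_tokens=False).input_ids[0]
                for w in ("audio", "video", "both")]
    w = torch.softmax(next_logits[cand_ids], dim=-1)
    return {"audio": w[0].item(), "video": w[1].item(), "both": w[2].item()}
```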
Concept 9 – Perturbation (Degraded Input)
- What it is: A harmless "mess-up" applied to a modality (like blurring video or muting audio) to test whether a token truly depends on that modality.
- How it works:
- Create versions where video is perturbed (e.g., masked/blurred) and/or audio is perturbed (e.g., reduced or masked).
- Run the model on the clean input and each perturbed version.
- Compare their token scores to spot evidence-dependent tokens.
- Why it matters: If a token's score barely changes when a modality is perturbed, it likely wasn't grounded in that modality. Anchor: If the token "engine revving" stays high when audio is muted, it may be a visual guess; contrastive decoding will down-weight it.
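Two simple perturbations, sketched with plain NumPy; the exact corruptions (noise strength, masking scheme) are illustrative choices of mine, not necessarily the ones used in the paper.

```python
import numpy as np

def perturb_video(frames, noise_std=0.5, seed=0):
    """Drown video frames (T, H, W, C), values in [0, 1], in Gaussian noise."""
    rng = np.random.default_rng(seed)
    return np.clip(frames + rng.normal(0.0, noise_std, frames.shape), 0.0, 1.0)

def perturb_audio(waveform, keep_ratio=0.0):
    """Silence (keep_ratio=0) or strongly attenuate the audio waveform."""
    return waveform * keep_ratio
```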
Concept 10 – Logits (Pre-Softmax Scores)
- What it is: Logits are the raw scores the model assigns to each possible next word before turning them into probabilities.
- How it works:
- For each branch (clean and perturbed), the model outputs a vector of logits for the next token.
- We combine these logits using weights to build a final, grounded score.
- The top-scoring token is chosen as the next word.
- Why it matters: Working at the logit level lets us nudge the model toward evidence-backed tokens. Anchor: If "No" has a higher final logit than "Yes," the model answers "No."
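A tiny illustration of picking the next token at the logit level; the numbers are invented.

```python
import numpy as np

fused = {"Yes": 1.3, "No": 2.1}                   # fused logits for two candidate tokens
probs = np.exp(list(fused.values()))
probs /= probs.sum()                               # softmax turns logits into probabilities
print(max(fused, key=fused.get), probs.round(2))   # No [0.31 0.69]
```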
Step B: Build Four Contrastive Branches
- What happens: For each decoding step, compute logits for:
- Clean audio + clean video (the full, normal input).
- Video-perturbed + clean audio (tests reliance on vision).
- Audio-perturbed + clean video (tests reliance on audio).
- Both-perturbed (tests reliance on either when none is reliable).
- Why it exists: Each branch reveals how much a token depends on a particular modality. This exposes ungrounded guesses.
- Example with data: Question: "Did you hear a hammer hitting?"
- Clean AV suggests "Yes" (tempted by seeing a hammer).
- If the answer were truly grounded in sound, the "Yes" score would drop a lot in the audio-perturbed branch.
- If the clean-vs-audio-perturbed gap is small, that "Yes" is likely a vision-driven guess, and the method will suppress it.
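Step B as a sketch, assuming a hypothetical `forward_logits(video, audio, question)` helper that returns the model's next-token logits for one input combination, plus the perturbation helpers sketched earlier.

```python
def four_branch_logits(video, audio, question, forward_logits,
                       perturb_video, perturb_audio):
    """Compute next-token logits for the four contrastive branches."""
    v_bad, a_bad = perturb_video(video), perturb_audio(audio)
    return {
        "clean":   forward_logits(video, audio, question),   # full, normal input
        "v_pert":  forward_logits(v_bad,  audio, question),  # how much does vision matter?
        "a_pert":  forward_logits(video,  a_bad, question),  # how much does audio matter?
        "av_pert": forward_logits(v_bad,  a_bad, question),  # neither modality reliable
    }
```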
Step C: Weight-and-Fuse Branches (MAD)
- What happens:
- Use w_audio, w_video, w_both as dials to scale the contrastive power of each branch.
- Heavier weight → stronger penalty on ungrounded tokens in that modality.
- Fuse the adjusted logits to get the final score for the next token.
- Why it exists: Different questions need different senses. The weights make decoding question-aware.
- Example: For "What instrument is playing?", w_audio is high. The audio-perturbed branch gets more influence, so tokens not supported by audio get pushed down.
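Step C in code, mirroring the formula sketched at the end of the Core Idea section; this is one plausible instantiation of the fusion rather than the paper's verbatim equation, with γ = 2.5 used because the authors report that value working well in general.

```python
def mad_fuse(branches, weights, gamma=2.5):
    """Weight each contrastive branch by its task-specific modality weight,
    then fuse everything back into a single logit vector."""
    clean = branches["clean"]
    contrast = (weights["audio"] * (clean - branches["a_pert"])      # audio grounding
                + weights["video"] * (clean - branches["v_pert"])    # visual grounding
                + weights["both"]  * (clean - branches["av_pert"]))  # joint grounding
    return clean + gamma * contrast
```

With w_audio high, tokens whose scores survive audio perturbation contribute little contrast, so audio-grounded tokens pull ahead of ungrounded guesses.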
Putting It All Together (like a recipe):
- Take inputs: video frames, audio waveform, and the question.
- Ask the modality query to get three weights.
- For each next word:
- Compute logits on clean AV.
- Compute logits with video perturbed.
- Compute logits with audio perturbed.
- Compute logits with both perturbed.
- Weight and fuse them using the three weights and a shared "strength" knob (γ).
- Pick the token with the highest fused score.
- Repeat until the answer is complete.
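Pulling the recipe into one greedy decoding loop; the `steps` bundle stands in for the helpers sketched above, and every interface here is illustrative rather than the paper's actual code (the growing answer prefix would normally be fed back into each forward pass).

```python
def mad_generate(video, audio, question, steps, max_new_tokens=64, gamma=2.5):
    """Greedy MAD decoding sketch. `steps` is assumed to provide:
    self_assess(question) -> {"audio": w_a, "video": w_v, "both": w_b}
    branch_logits(video, audio, question, prefix) -> four next-token logit arrays
    fuse(branches, weights, gamma) -> fused logit array
    decode(ids) -> text, plus an eos_id attribute."""
    weights = steps.self_assess(question)                 # Step A: one extra query
    prefix = []                                           # tokens generated so far
    for _ in range(max_new_tokens):
        branches = steps.branch_logits(video, audio, question, prefix)  # Step B
        fused = steps.fuse(branches, weights, gamma)                    # Step C
        next_id = int(fused.argmax())                                   # pick the top token
        if next_id == steps.eos_id:                                     # stop at end of answer
            break
        prefix.append(next_id)
    return steps.decode(prefix)
```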
The Secret Sauce:
- Not just contrasting, but adapting contrast strength to the task.
- Using a "both" weight (w_both) to handle questions that truly need joint reasoning, so we don't throw away helpful cross-signal cues.
- Training-free: no new data, no model updates, just smarter inference.
Why Step A, B, C are all needed:
- Without Step A (weights), we can't be task-aware.
- Without Step B (four branches), we can't tell what's grounded in each modality.
- Without Step C (weighted fusion), we can't suppress the wrong influence while preserving the right one.
Concrete Walkthrough:
- Input: Video shows a hammer but the clip is silent. Question: "Did you hear hammer hits?"
- Step A: Weights come out as w_audio high, w_video low, w_both moderate.
- Step B: Clean AV leans toward "Yes" (vision tempts the model). Audio-perturbed shows little change (no true audio evidence). The video-perturbed branch gets little weight because audio matters most here.
- Step C: Weighted fusion dampens the visually-driven "Yes," boosting the correct "No." The model answers: "No."
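The same walkthrough with toy numbers (all invented) plugged into the fusion sketch; the vocabulary is just ["Yes", "No"] and γ = 2.5.

```python
import numpy as np

clean   = np.array([2.0, 1.6])   # clean AV: seeing the hammer tempts a "Yes"
a_pert  = np.array([2.0, 1.0])   # audio muted: "Yes" doesn't budge, so it isn't audio-grounded
v_pert  = np.array([0.8, 1.5])   # video blurred: the temptation disappears
av_pert = np.array([0.5, 0.5])   # both perturbed: no preference left
w = {"audio": 0.7, "video": 0.1, "both": 0.2}   # Step A: audio matters most

contrast = (w["audio"] * (clean - a_pert)
            + w["video"] * (clean - v_pert)
            + w["both"]  * (clean - av_pert))
fused = clean + 2.5 * contrast
print(fused)   # about [3.05, 3.23]: "No" now outscores "Yes"
```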
04 Experiments & Results
The Test: The authors used two specialized benchmarks that try to trick models into mixing senses:
- CMM (Curse of Multi-Modalities): Checks whether the model is overly dominated by one sense (visual, audio, or language priors) and measures overall accuracy across categories.
- AVHBench: Directly tests video-driven audio hallucinations and audio-driven video hallucinations. They also checked regular audio-visual QA sets (OmniBench, Worldsense, MUSIC-AVQA) to see if MAD helps beyond hallucination-specific tests.
The Competition (Baselines):
- Base decoding: The model's normal way of generating.
- VCD-Extended: Applies visual-style contrastive decoding to all modalities uniformly.
- AVCD: Contrastive decoding for audio+video but with uniform, non-adaptive settings.
The Scoreboard (with context):
- On CMM:
- VideoLLaMA2-AV with MAD: Overall accuracy 81.3% (vs. 73.5% base). That's like raising a solid C to a strong B+/A-.
- Big category boosts: Visual dominance +9.3%, Language dominance +5.5% for VideoLLaMA2-AV.
- Qwen2.5-Omni with MAD: Overall accuracy 81.4% (vs. 72.7% base). Even larger category jumps: visual dominance +12.3%, audio dominance +12.0%.
- On AVHBench:
- Video-driven audio hallucination improved by about +4.0% for VideoLLaMA2-AV and +5.7% for Qwen2.5-Omni.
- Audio-driven video hallucination also improved (e.g., +3.7% for Qwen2.5-Omni).
- Across both datasets and both models, MAD beat VCD-Extended and AVCD, showing that adaptive, question-aware weighting outperforms uniform approaches.
Surprising/Useful Findings:
- Adaptive beats uniform: An ablation compared three fusions on CMM with VideoLLaMA2-AV:
- Uniform weights (treat all senses equally): Overall ~79.4%.
- Argmax (pick only the single best sense): ~78.7%.
- Weighted (MAD): ~81.3% (best). Softly mixing all branches with the right weights balances evidence without throwing away helpful signals.
- Each weight matters: Removing any one of w_audio, w_video, or w_both hurt performance. Using all three gave the best overall accuracy. Especially, w_both helps when joint reasoning is needed.
- Weights match intuition: On a custom study (100 videos, 300 questions), visual questions got high video weights, audio questions got high audio weights, and audio-visual questions got high "both." This shows the self-assessment step is sensible.
- Prompt robustness: Changing the wording of the modality query barely changed results (tiny standard deviations), so MAD isn't fragile to prompt phrasing.
- General AVQA: On OmniBench, Worldsense, and MUSIC-AVQA, MAD was comparable or slightly better than base (e.g., +1.0% on MUSIC-AVQA), hinting at better grounding overall.
- Efficiency: MAD's decoding speed was similar to other contrastive methods; only a small overhead for the extra query and branches.
Contextualizing the Numbers:
- Think of accuracy like grades: many baselines hover in the 70s (C range). MAD lifts them into the low 80s (B to B+), while cutting down the noisiest mistakes where one sense bullies the other.
- In the dominance categories, jumps of 9–12 percentage points are big: it means the model is much less likely to be tricked by tempting-but-irrelevant signals.
Takeaway: Asking the model which senses matter first, then steering decoding with those weights, consistently outperforms one-size-fits-all contrastive decoding across models and datasets.
05 Discussion & Limitations
Limitations:
- Reliance on self-assessment: If the model answers the modality query poorly, the weights may be off, and decoding could under- or over-suppress a modality.
- Audio+video only: The paper focuses on two modalities; adding more (e.g., depth, thermal, sensors) needs careful extension and more branches.
- Quality of perturbations: If the perturbation doesn't effectively disrupt a modality, contrastive signals get weaker and less informative.
- Extra compute vs. plain decoding: While close to other contrastive methods, MAD is more expensive than vanilla decoding because it runs multiple branches.
Required Resources:
- An audio-visual LLM that can accept a short modality query.
- The ability to run multiple forward passes per token (clean and perturbed branches).
- Modality perturbation tools (e.g., image masking/blurring, audio masking).
- A small hyperparameter search for the contrastive strength γ (the paper found 2.5 works well generally).
When NOT to Use:
- If the question is clearly single-modality and the base model already performs near-perfectly, MAD may add overhead for little gain.
- In ultra-low-latency settings where any extra decoding cost is unacceptable.
- If the model is very weak at following prompts, since poor self-assessment can reduce MAD's benefit.
Open Questions:
- Can we learn a tiny, fast weight-predictor to avoid the self-assessment prompt and speed things up?
- How does MAD generalize to more than two modalities (e.g., text+image+audio+video+sensor data) without exploding compute?
- Can better, principled perturbations boost the contrast signal further?
- Is there a way to calibrate or sanity-check the weights mid-generation and adjust on the fly?
- How do we best detect when the model is overconfident in the wrong modality and auto-correct weights safely?
06 Conclusion & Future Work
3-Sentence Summary: Cross-modal hallucinations happen when one sense (like vision) wrongly influences what the model says about another (like audio). MAD fixes this by first asking the model which senses matter for the current question and then decoding with contrastive branches weighted by that self-assessment. This training-free, question-aware approach reduces hallucinations and improves accuracy across strong audio-visual LLMs and benchmarks.
Main Achievement: Showing that explicit, self-assessed modality weights, plugged into a multi-branch contrastive decoder, are the missing piece to robust, per-question control of modality influence.
Future Directions: Build a lightweight learned module to predict modality weights without a prompt; extend MAD to more modalities (e.g., thermal+RGB, sensors); design smarter perturbations; and adapt weights dynamically during generation.
Why Remember This: It turns a simple idea ("ask which senses matter, then listen to them") into a practical, training-free tool that makes multimodal models more trustworthy. In a world full of mixed signals (sounds, sights, texts), MAD teaches models to choose the right sense at the right time.
Practical Applications
- Video accessibility tools that give accurate audio and visual descriptions without inventing details.
- Customer support agents that analyze product demo videos and avoid guessing unheard sounds or unseen visuals.
- Classroom tutoring systems that explain lab videos while correctly emphasizing the right modality for each question.
- Content moderation that checks claims about a clip's audio or visuals without cross-sense contamination.
- Sports highlight analysis that trusts visuals for plays and audio for crowd or whistle cues appropriately.
- Surveillance review assistants that avoid inventing alarms or hazards when only one modality supports them.
- Medical training videos where the model distinguishes visual findings from auscultation-like audio cues.
- Robotics logs analysis that separates what was seen from what was heard to avoid wrong inferences.
- News verification tools that check whether a claimed sound or sight truly appears in a clip.
- Creative editing assistants that describe footage faithfully, preventing invented background sounds or actions.