
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

Intermediate
Zhe Huang, Hao Wen, Aiming Hao et al. · 12/30/2025
arXiv · PDF

Key Summary

  • Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting words and common sense more than what the frames really show.
  • This paper builds counterfactual videos (things that look real but break expectations) using controllable, diffusion-based editing to force models to look carefully.
  • The DualityForge pipeline edits real videos into counterfactual versions and embeds structured context so it can auto-generate high-quality captions and question–answer pairs.
  • The dataset, DualityVidQA, contains 144K training QA pairs and a 600-pair human-checked test with real-versus-edited video pairs that share the same question but require different answers.
  • Training uses DNA-Train: first Supervised Fine-Tuning on both real and edited videos, then Reinforcement Learning with a special duality-normalized advantage to balance learning from both sides.
  • The model must answer the same question differently for the original and the edited video, which forces visual grounding instead of guessing from language priors.
  • On the DualityVidQA-Test benchmark, the method achieves 76.8% pairwise accuracy, a 24.0-point improvement over a strong 7B baseline.
  • It also boosts scores on general video understanding benchmarks (like TempCompass and MVBench), showing the gains generalize beyond counterfactuals.
  • Ablations show paired data and the duality-normalized advantage are key to stability and accuracy.
  • The dataset and code will be open-sourced, enabling wider testing and safer video AI.

Why This Research Matters

Many real-world tools—from sports review systems to home safety cameras—need models that look at what actually happens, not what usually happens. This work shows how to build those habits by pairing each normal video with a tricky twin and forcing the model to answer differently when the frames change. Because the approach also lifts performance on regular benchmarks, it improves day-to-day reliability, not just corner cases. The dataset and code will be open, so others can strengthen and test video models more fairly. In the long run, fewer hallucinations mean safer assistants for robots, vehicles, and monitoring systems. It also offers a recipe for other modalities: generate counterfactuals, ask shared questions, and balance learning so the model truly grounds its answers.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine watching a magic show. Your brain expects the coin to fall when dropped, but the magician makes it float. If you only rely on your expectation, you’ll miss the trick happening in front of you.

🥬 The Concept: Multimodal Large Language Models (MLLMs) are AIs that read, watch, and listen, then answer questions or follow instructions. They often rely on language priors—what usually happens in the world and what words commonly go together—more than on the actual pixels in the video. How it works (before this paper):

  1. MLLMs are trained on huge text corpora and much smaller, noisier video data.
  2. When asked about a video, they combine visual cues with learned common sense.
  3. If the video shows something unusual, the model may still answer with the most plausible story from language, not the real visuals. Why it matters: Without stronger visual grounding, models can hallucinate—confidently stating things that do not match the frames—especially when videos break common expectations.

🍞 Anchor: Think of a video of corn pouring into a trailer: a model might answer “arches up and floats into the sky” because it sounds like a dramatic description, even if the frames show it falling into the trailer.

🍞 Hook: You know how your teacher sometimes gives trick questions to check you’re really reading, not just guessing? Video tasks can be like that.

🥬 The Concept: Visual ungrounded hallucinations are answers that sound reasonable but don’t match the actual video. How it works:

  1. Model hears a familiar story in the question (like corn pouring) and fills in typical details.
  2. If the video contradicts those details, the model still picks the usual answer.
  3. The mismatch grows when videos show counterfactual events (things that defy common sense or usual physics). Why it matters: In real life—like safety cameras, sports analysis, or robotics—trusting the scene beats trusting a guess.

🍞 Anchor: If a rock suddenly floats upward in a video, a model that trusts common sense might deny it. A grounded model will say “the rock floats,” because that’s what the frames show.

🍞 Hook: Imagine training for a spelling bee with only stories about words, not the words themselves. You’d get lots of context, but you’d still miss real spellings.

🥬 The Concept: Language priors are the model’s built-in habits about what words and events usually go together. How it works:

  1. Massive text training teaches typical patterns (like corn pours down, not up).
  2. When video is scarce or noisy, those text habits dominate decisions.
  3. The model becomes overconfident in typical answers. Why it matters: If the video is unusual, these habits cause errors that sound polished but are wrong.

🍞 Anchor: Asked “What happens to the lanterns?”, many models answer “They float up” even if the video shows them staying hung on boats.

🍞 Hook: Think of a science experiment where you flip a variable to see what really causes a change.

🥬 The Concept: Counterfactual videos are edited clips that keep most of the scene the same but flip one key detail to create an unusual event (like erasing an object mid-clip). How it works:

  1. Start with a real video.
  2. Edit it to break an expectation (object disappears, motion reverses, light behaves oddly).
  3. Ask the same question on both original and edited videos. Why it matters: If the model answers both the same, it wasn’t looking. If it answers differently and correctly, it used real visual evidence.

🍞 Anchor: Original: girl stands on rocks (rocks unchanged). Edited: rocks suddenly grow bigger. Same question, different correct answers.

🍞 Hook: When older fixes tried to fight hallucinations with more text instructions, it was like adding louder stage directions but not changing the show.

🥬 The Concept: Prior attempts mainly tweaked text (captions/prompts) or used decoding tricks at inference. How it works:

  • Text-only fixes: rebalance language, but don’t add new visual evidence.
  • Contrastive decoding: compare different decoding paths, but it adds runtime cost and doesn’t change the model’s core understanding. Why it matters: These don’t fix the root cause: a shortage of hard visual cases that force true grounding.

🍞 Anchor: If you only rewrite the question, the model may still guess the same typical answer. Changing the video itself forces it to look.

🍞 Hook: Now picture a training program that always pairs a normal clip with a tricky twin and demands the right answer for both.

🥬 The Concept: The gap this paper fills is a scalable way to synthesize counterfactual videos plus shared-question QA, and a training recipe that balances learning from both. How it works:

  1. Generate controlled counterfactuals with diffusion-based editing.
  2. Embed structured context so an MLLM can auto-write dense captions and QA reliably.
  3. Train so the same question requires different answers for original vs edited videos. Why it matters: This breaks the habit of guessing from language priors and builds true visual grounding.

🍞 Anchor: The model is rewarded only if it says “lanterns float” for the real clip (if true) and “lanterns freeze mid-air” for the edited one (if that’s what the frames show).

02Core Idea

🍞 Hook: Imagine a coach who trains you with twins of every drill: one normal, one with a sneaky twist. You only pass if you adapt to what you actually see.

🥬 The Concept: The key insight is to generate paired real-and-counterfactual videos with the same question but different correct answers, then train with a balanced, two-stage method so the model must ground answers in the frames. How it works:

  1. Use controllable, diffusion-based editing to turn real videos into counterfactuals.
  2. Embed structured context (what was changed, when, where) to auto-generate captions and QA.
  3. Build shared-question, contrastive pairs that require different answers for original vs edited videos.
  4. Train with DNA-Train: Supervised Fine-Tuning, then Reinforcement Learning with duality-normalized advantage to balance learning from both video types. Why it matters: This turns unusual visual events into training signals that correct the model’s over-trust in language priors.

🍞 Anchor: If corn actually goes downward in one clip and vanishes mid-air in its edited twin, the model must answer differently for each—no more safe guessing.

Three analogies for the same idea:

  • School test analogy: You get two versions of the same science question, but one lab video has a hidden twist. You must watch closely or you’ll miss the difference.
  • Spot-the-difference game: Two pictures look nearly the same, but one change flips the right answer. Careful observation wins.
  • Traffic cam analogy: Two clips of the same intersection; in one, the light sequence is normal; in the other, it reverses. The correct report depends on the exact frames.

Before vs After:

  • Before: Models leaned on what usually happens (language priors), often ignoring surprising frames.
  • After: Models are trained to switch answers when frames change, proving they saw the visual evidence.

🍞 Hook: You know how a photo-edit app lets you change small parts of an image? Now imagine doing that for videos with precise control.

🥬 The Concept: Diffusion-based controllable video editing is a way to change specific visual elements in a video while keeping the rest realistic. How it works:

  1. Plan the change (object disappears, motion reverses, color warps) with structured context.
  2. Use diffusion tools and video editors (e.g., VACE, FLUX-Kontext) to apply edits consistently over time.
  3. Verify edits using multiple top MLLMs and masks (e.g., GroundingDINO + SAM) to ensure quality. Why it matters: High-quality, targeted edits make strong training examples that clearly test visual grounding.

🍞 Anchor: Original clip: person pours tea into a glass. Edited clip: a small goldfish appears inside the glass. Same question about the glass, different right answers.

🍞 Hook: Think of studying by comparing two examples side-by-side to learn what really matters.

🥬 The Concept: Contrastive training with shared questions makes the model choose different answers for twin videos that differ in one key way. How it works:

  1. Ask the same question of both videos.
  2. Only reward answers that match the specific video.
  3. Over time, the model learns to attend to the exact visual evidence that decides the answer. Why it matters: This directly trains the habit of looking before answering.

🍞 Anchor: Question: What happens to the rocks? Real: they stay put. Edited: they float. The model learns to check motion in the frames, not just guess.

🍞 Hook: Imagine two teammates who are both important: one plays offense (real videos), the other plays defense (counterfactuals). Training must keep them balanced.

🥬 The Concept: Duality-Normalized Advantage Training (DNA-Train) is a two-stage training scheme—SFT then RL—that normalizes learning signals so real and counterfactual videos contribute equally. How it works:

  1. Supervised Fine-Tuning on both real and edited samples with balanced batches.
  2. Reinforcement Learning with verifiable rewards (correct vs incorrect) on shared-question pairs.
  3. Compute advantage (learning signal) within groups and normalize across real vs counterfactual so neither dominates. Why it matters: Without normalization, the easier side could flood training, causing instability or bias.

🍞 Anchor: Early on, real videos might be easier, so their reward signals are bigger. DNA scales them down to match the edited ones, keeping learning fair and stable.

Building blocks (in small pieces):

  • Counterfactual video generation (to create the twin).
  • Structured context (to guide the edits and QA).
  • Contrastive QA pairs (same question, different correct answer).
  • SFT (to establish basic competence on both video types).
  • RL with verifiable rewards (to strongly reinforce grounded answers).
  • Duality-normalized advantage (to balance gradients and stabilize learning).

03Methodology

At a high level: Real video + context → Counterfactual video editing → Dense captions and QA generation → Paired contrastive QA (same Q for both videos) → Supervised Fine-Tuning → Reinforcement Learning with duality-normalized advantage → Grounded model outputs.

Step 1: Define and generate counterfactual contexts 🍞 Hook: You know how comic books sometimes show alternate universes where one small change flips the whole story? 🥬 The Concept: Counterfactual context is a precise plan for what change to introduce into the video (what changes, where, and when) so we can generate an unusual but controlled scenario. How it works:

  1. Choose anomaly type: visual (pixels), semantic (objects), or commonsense (physics/logic).
  2. Attach structured info: which object/region, timestamps, and edit instructions.
  3. Use this as a blueprint for editing and later for QA generation. Why it matters: Clear context ensures edits are targeted and the resulting questions are unambiguous. 🍞 Anchor: Plan: From 1.5s to 2.5s, make the lanterns freeze mid-air above the river while the rest of the scene stays the same.
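
As a concrete illustration, a structured context record might look like the minimal Python sketch below; the field names and values are assumptions made for illustration, not the paper's exact schema.

```python
# Illustrative counterfactual-context record (field names are assumptions,
# not the paper's exact schema).
counterfactual_context = {
    "anomaly_type": "commonsense",        # visual | semantic | commonsense
    "target_object": "lanterns",          # what the edit acts on
    "region_hint": "above the river",     # where in the frame
    "time_window": (1.5, 2.5),            # seconds: when the edit applies
    "edit_instruction": "make the lanterns freeze mid-air "
                        "while the rest of the scene stays the same",
}

# The same record later seeds caption and QA generation, so the question
# can target exactly the detail that was changed.
```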

Step 2: Three editing pipelines A) Visual anomalies (pixel-level) 🍞 Hook: Think of applying a filter to only one part of a video. 🥬 The Concept: Visual anomalies are pixel changes (like abnormal brightness/contrast, blur) that don’t change the scene’s meaning. How it works:

  1. Pick a time window.
  2. Select frame regions (whole frame, a region, or an object mask via GroundingDINO + SAM).
  3. Apply OpenCV operations (e.g., contrast spike) consistently across frames. Why it matters: Tests if the model notices low-level changes that could confuse recognition. 🍞 Anchor: Only the sky region becomes over-saturated midway; the question asks about lighting changes over time.
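
A minimal OpenCV sketch of such a pixel-level edit is shown below. The specific operation (a contrast spike on the upper half of each frame inside a time window) and the parameter values are illustrative choices, not the paper's exact settings.

```python
import cv2

def contrast_spike(src_path, dst_path, t_start, t_end, alpha=2.0, beta=0):
    """Boost contrast inside a time window; a toy stand-in for the
    pixel-level (visual-anomaly) edits described above."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t = idx / fps
        if t_start <= t <= t_end:
            # Contrast spike on the upper half of the frame only
            # (the "region" could also come from an object mask).
            roi = frame[: h // 2]
            frame[: h // 2] = cv2.convertScaleAbs(roi, alpha=alpha, beta=beta)
        out.write(frame)
        idx += 1

    cap.release()
    out.release()
```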

B) Semantic anomalies (object-level) 🍞 Hook: Imagine removing a character from a scene while keeping everything else. 🥬 The Concept: Semantic anomalies alter objects or their presence (disappear, appear, substitute) over time. How it works:

  1. An MLLM suggests a target object.
  2. Generate masks with GroundingDINO + SAM.
  3. Edit with VACE to add/remove/replace the object.
  4. Verify with multiple strong MLLMs via majority vote. Why it matters: Checks whether the model truly tracks objects through time instead of assuming they persist. 🍞 Anchor: A skateboard vanishes midway; the correct answer shifts from “riding a skateboard” to “running.”
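
The sketch below shows the shape of this loop. The callables `detect_mask`, `vace_edit`, and `mllm_verdict` are hypothetical placeholders standing in for GroundingDINO + SAM, VACE, and the verifier MLLMs; they are not real library APIs.

```python
# Sketch of the object-level (semantic) editing loop. The tool callables
# are hypothetical wrappers, passed in so the pipeline shape is explicit.

def make_semantic_counterfactual(video, target_object, instruction,
                                 detect_mask, vace_edit,
                                 verifiers, mllm_verdict):
    # 1) Localize the target object over time (e.g., "skateboard").
    masks = detect_mask(video, prompt=target_object)

    # 2) Add / remove / replace the object inside the masked region.
    edited = vace_edit(video, masks, instruction=instruction)

    # 3) Majority vote by several strong MLLMs: keep the edit only if
    #    most judges agree it is clean and matches the plan.
    votes = [mllm_verdict(judge, edited, instruction) for judge in verifiers]
    return edited if sum(votes) > len(votes) / 2 else None
```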

C) Commonsense anomalies (physics/logic) 🍞 Hook: Picture water splashing up and then suddenly reversing back into the bottle. 🥬 The Concept: Commonsense anomalies break real-world expectations (impossible movement, causal reversal). How it works:

  1. An MLLM proposes a plausible violation (e.g., reverse lava flow).
  2. FLUX-Kontext edits key frames following that instruction.
  3. Multiple MLLMs re-verify the edit.
  4. VACE interpolates frames to make a smooth video. Why it matters: These are the hardest cases where language priors are strongest and most misleading. 🍞 Anchor: Lava briefly freezes mid-air; the best answer acknowledges the freeze rather than the usual churning.

Step 3: Dense captions and QA generation 🍞 Hook: You know how lab reports include a careful description before the questions? Same here. 🥬 The Concept: Dense captions describe the video chronologically with the anomaly context embedded (what changed, where, when), serving as a solid base for consistent QA. How it works:

  1. Provide the model with frames plus metadata (anomaly type, region name, time window).
  2. Produce detailed, time-stamped captions that mention important actions and anomalies.
  3. Use those captions to auto-generate open-ended and multiple-choice questions. Why it matters: The richer the caption, the clearer and more precise the questions and answers become. 🍞 Anchor: Caption notes that from 3.0–3.6s, the lanterns stop moving; the question asks what happens to lantern motion later.
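
One plausible way to fold the anomaly metadata into the captioning prompt is sketched below; the template wording is illustrative (it reuses the `counterfactual_context` fields sketched in Step 1) and is not the paper's actual prompt.

```python
def build_caption_prompt(meta):
    """Assemble a captioning prompt that embeds the structured anomaly
    context; the template wording is illustrative, not the paper's prompt."""
    return (
        "Describe the video chronologically with timestamps. "
        f"Note that a {meta['anomaly_type']} anomaly was introduced: "
        f"{meta['edit_instruction']} "
        f"(region: {meta['region_hint']}, "
        f"time: {meta['time_window'][0]}s-{meta['time_window'][1]}s). "
        "Mention all important actions and the anomaly explicitly."
    )

# Example, using the context record sketched in Step 1:
# prompt = build_caption_prompt(counterfactual_context)
```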

Step 4: Build paired, contrastive QA 🍞 Hook: Think of a spot-the-difference worksheet where the same question applies to both pictures. 🥬 The Concept: Shared-question pairs mean the original and the edited videos get the same question but have different correct answers. How it works:

  1. Ask one question Q for both videos (V_ori, V_edit).
  2. Keep distractor options plausible for both.
  3. Ensure Q targets the changed detail so the correct option flips between the two videos. Why it matters: Forces the model to ground answers in the actual frames. 🍞 Anchor: Q: What happens to the rocks? Real: stay put. Edited: float upward. Options include both outcomes.
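
One way such a shared-question pair could be stored as a training record is sketched below; the field names and file paths are illustrative, not the dataset's actual format.

```python
# One shared-question, contrastive pair (field names are illustrative).
qa_pair = {
    "question": "What happens to the rocks the girl is standing on?",
    "options": ["They stay put", "They float upward",
                "They sink into the water", "They shatter"],
    "real_video":   {"path": "clip_0421_real.mp4", "answer": "They stay put"},
    "edited_video": {"path": "clip_0421_edit.mp4", "answer": "They float upward"},
}
# The options are identical for both videos; only the correct one flips,
# so a model that ignores the frames cannot get both sides right.
```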

Step 5: Dataset organization 🍞 Hook: Like a textbook with training sets and a final exam. 🥬 The Concept: DualityVidQA has three splits: SFT (104K QA from 25K pairs), RL (20K shared-question pairs), and Test (600 pairs, human-curated). How it works:

  • SFT split: mix of real and counterfactual QA for supervised learning.
  • RL split: shared-question pairs with identical options but different correct answers.
  • Test: stricter pairwise accuracy—only count a sample if both real and edited answers are correct. Why it matters: The splits match the training recipe and the final evaluation goal. 🍞 Anchor: A model can’t pass by guessing one side; it must nail both.

Step 6: Supervised Fine-Tuning (SFT) 🍞 Hook: Think of this as learning the rules of the game before playing for points. 🥬 The Concept: SFT teaches the model to read both normal and edited videos without bias. How it works:

  1. Balanced sampling: each batch includes equal real and counterfactual samples.
  2. Cross-entropy loss on answers, training the base MLLM to follow instructions and attend to frames.
  3. Keep the model strong on real videos while exposing it to counterfactuals. Why it matters: Prevents overfitting to either domain. 🍞 Anchor: After SFT, the model describes a typical pouring scene well and can also notice a fish suddenly appearing in the glass.
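
A plausible sketch of the balanced-sampling step is shown below (an illustrative implementation, not the authors' code).

```python
import random

def balanced_batches(real_samples, cf_samples, batch_size):
    """Yield batches with an equal number of real and counterfactual QA
    samples, mirroring the balanced-sampling SFT stage (sketch only)."""
    assert batch_size % 2 == 0
    half = batch_size // 2
    random.shuffle(real_samples)
    random.shuffle(cf_samples)
    n = min(len(real_samples), len(cf_samples))
    for i in range(0, n - half + 1, half):
        batch = real_samples[i:i + half] + cf_samples[i:i + half]
        random.shuffle(batch)  # mix within the batch
        yield batch
```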

Step 7: Reinforcement Learning (RL) with verifiable rewards and duality-normalized advantage 🍞 Hook: Like getting points for correct answers and using those points to learn which habits to keep. 🥬 The Concept: RL with verifiable rewards gives a clear signal: 1 if the chosen option is correct, 0 otherwise; duality-normalized advantage balances learning between real and edited groups. How it works:

  1. For each shared-question pair, sample multiple responses and compute rewards based on correctness and format.
  2. Use a stable RL framework (DAPO-like) to update the policy.
  3. Compute each group’s advantage magnitude and scale them so real vs counterfactual contribute equally to gradients. Why it matters: Balancing avoids training collapse toward the easier side and strengthens true visual grounding. 🍞 Anchor: If real clips are easier at first, DNA scales their learning signal so edited clips still shape the model.
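
The paper's exact normalization formula is not reproduced here; the sketch below shows one plausible reading of the idea: compute GRPO-style within-group advantages for the real and the edited rollouts of a pair, then rescale each group so both contribute comparable gradient magnitude.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage within one group of sampled responses:
    reward minus the group mean, divided by the group std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def duality_normalized_advantages(real_rewards, cf_rewards, eps=1e-6):
    """Sketch of a duality-normalized advantage (one plausible reading of
    DNA-Train, not the paper's exact formula): scale the real and the
    counterfactual groups so their mean |advantage| matches, preventing
    the easier side from dominating the policy update."""
    adv_real = group_advantages(real_rewards, eps)
    adv_cf = group_advantages(cf_rewards, eps)

    mag_real = np.abs(adv_real).mean() + eps
    mag_cf = np.abs(adv_cf).mean() + eps
    target = 0.5 * (mag_real + mag_cf)  # shared scale for both groups

    return adv_real * (target / mag_real), adv_cf * (target / mag_cf)

# Example: real clips are easier (mostly correct), edited clips harder.
real = [1, 1, 1, 0, 1, 1]  # verifiable rewards: 1 = correct option, 0 = wrong
cf = [0, 1, 0, 0, 0, 1]
a_real, a_cf = duality_normalized_advantages(real, cf)
```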

Secret sauce:


  • High-quality, controllable counterfactuals + structured context make precise training data.
  • Shared-question, contrastive QA directly trains the habit of looking.
  • Duality-normalized advantage keeps training fair and stable across both video types.

04Experiments & Results

The test: The authors built DualityVidQA-Test, a 600-pair benchmark with four fine-grained counterfactual categories (counter physical, object/scene deformation, attribute change, causal reversal). Pairwise accuracy counts only if a model gets both the original and the edited video correct for the same question, which directly measures visual grounding.
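
Because the scoring rule is strict, a minimal implementation of pairwise accuracy as described reads as follows (a sketch; the helper name is illustrative).

```python
def pairwise_accuracy(results):
    """results: list of (real_correct, edited_correct) booleans per pair.
    A pair counts only if BOTH the real and the edited answer are right,
    matching the strict scoring described above."""
    hits = sum(1 for real_ok, edit_ok in results if real_ok and edit_ok)
    return hits / len(results) if results else 0.0

# e.g. pairwise_accuracy([(True, True), (True, False), (True, True)]) == 2/3
```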

The competition: The method is compared with popular closed-source MLLMs (GPT-4o, GPT-4.1, Gemini-2.5 Pro/Flash) and strong open-source baselines (Qwen2.5-VL-7B/32B/72B, LLaVA-Next-Video, Video-LLaVA-7B, VideoChat2-HD). The base model for the proposed method is Qwen2.5-VL-7B.

The scoreboard with context:

  • On DualityVidQA-Test, DNA-Train-7B achieves 76.8% pairwise accuracy. Compared to Qwen2.5-VL-7B at 52.8%, that is a 24.0-point improvement—like jumping from a mid B- to a solid A on a very strict test.
  • Across categories, DNA-Train-7B is notably strong on hard cases like counter physical (79.2% edited-side accuracy), where most models stumble.
  • On EventHallusion (another hallucination benchmark), DNA-Train-7B reaches 61.3%, far above typical open-source baselines (e.g., 33.5% for Qwen2.5-VL-7B), showing the gains aren’t limited to one dataset.
  • General-purpose video understanding improves too: TempCompass (73.5%), MVBench (63.8%), TVBench (53.0%), with consistent boosts over the base model. That’s like training for tough trick questions and still doing better on the regular test.

Surprising findings:

  • Training only on real videos or only on counterfactual videos hurts the core pairwise task. The magic comes from pairing them and asking the same question—this contrast is what unlocks visual grounding.
  • Interestingly, counterfactual-only training can still help general understanding, suggesting these unusual cases encourage stronger, more transferable visual features.
  • The duality-normalized advantage in RL outperforms strong baselines (GRPO, DAPO alone) on hallucination tests and also nudges up general benchmarks, confirming the balancing idea works.

Category-level trends:

  • All models show a large drop from real to edited videos—evidence of language priors at work—while DNA-Train reduces that gap substantially.
  • The toughest settings are physical violations and causal reversals; this is exactly where contrastive, shared-question training helps most.

Takeaway: The approach excels where most models fail—videos that break expectations—and it does so without sacrificing performance on everyday video tasks.

05Discussion & Limitations

Limitations:

  • Editing cost and complexity: High-quality, controllable counterfactual video generation at scale requires significant compute (tens of thousands of GPU hours reported) and careful verification.
  • Domain coverage: Although the dataset is large and diverse, some rare real-world edge cases or long, complex narratives may still be underrepresented.
  • Short-clip bias: Most clips are short (2–6 seconds), so performance on much longer videos with multi-scene plots needs further study.
  • Verification via MLLMs: Automated checks rely on current MLLMs; if they share biases, some flawed edits or QA could slip through.

Required resources:

  • Access to diffusion-based editors (e.g., VACE, FLUX-Kontext) and segmentation tools (GroundingDINO, SAM), plus GPU compute for video synthesis.
  • Infrastructure for SFT and RL (DAPO-like) with batched sampling and reward verification.

When not to use:

  • If you cannot synthesize or verify counterfactuals reliably (e.g., extremely domain-specific physics), the method’s advantages shrink.
  • If inference latency or compute must be ultra-low and training upgrades are impossible, simpler decoding tricks might be more practical (though less robust).

Open questions:

  • Long video reasoning: How to extend contrastive, shared-question training to minute-long videos with multiple intertwined anomalies?
  • Richer rewards: Can we create more graded, temporally localized rewards (not just correct/incorrect) to teach finer grounding (e.g., pinpointing the exact frame of change)?
  • Beyond QA: How well does the approach transfer to tasks like temporal localization, dense captioning under trick conditions, or safety-critical anomaly detection?
  • Editing realism: Can future editors produce even more photorealistic, physics-aware counterfactuals that further reduce distribution shift from real-world surprises?

06Conclusion & Future Work

Three-sentence summary: This paper tackles video hallucinations in MLLMs by generating paired real-and-counterfactual videos and asking the same question so the correct answer must flip when the frames change. The DualityForge pipeline creates high-quality counterfactuals with structured context for auto QA, and the DNA-Train recipe (SFT + RL with duality-normalized advantage) balances learning from both sides. The result is a model that looks before it answers, beating strong baselines on counterfactual tests and also improving on general video tasks.

Main achievement: Turning counterfactual video generation into a large-scale, contrastive training signal and stabilizing it with duality-normalized advantage, which together sharply cut hallucinations while preserving broad competence.

Future directions:

  • Extend to long-form, multi-scene videos with stepwise rewards.
  • Improve edit realism and physical consistency to cover more edge cases.
  • Generalize the duality-normalization idea to other modalities (audio, sensors) and tasks (temporal localization, step-by-step reasoning).

Why remember this: It shows that generation can teach understanding—carefully crafted counterfactuals, paired with a balanced training scheme, can retrain models’ instincts from guessing the usual story to actually checking the frames.

Practical Applications

  • Safety monitoring: Detect unusual events (e.g., a person appearing where none should be) with fewer false stories from language priors.
  • Sports analytics: Judge rare plays or rule violations by focusing on the exact frames, not typical play patterns.
  • Robotics: Help robots react to real-time visual surprises (object disappears or moves oddly) by trusting sensors first.
  • Education tools: Create spot-the-difference video quizzes that teach careful observation and evidence-based answers.
  • Content moderation: Flag counterfactual-looking edits or physics-defying scenes more accurately.
  • Medical training videos: Highlight subtle, unusual motions or instrument changes and test grounded understanding.
  • Autonomous driving simulation: Inject controlled counterfactuals (unexpected pedestrian motion) to stress-test perception.
  • Video QA assistants: Offer more reliable answers about security footage, assembly steps, or maintenance procedures.
  • Creative editing QA: Verify whether an intended video edit (like removing a logo) truly appears in the final cut.
  • Scientific visualization: Train models to notice and report rare phenomena in lab recordings without defaulting to common expectations.
#multimodal large language model #video understanding #visual hallucination #language priors #counterfactual video generation #diffusion-based video editing #contrastive training #reinforcement learning #advantage normalization #DualityForge #DualityVidQA #verifiable rewards #GroundingDINO #SAM #VACE
Version: 1