Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?

Intermediate
Jiaqi Wang, Weijia Wu, Yi Zhan et al. · 12/15/2025
arXiv · PDF

Key Summary

  • This paper builds a new test called Video Reality Test to see if AI-made ASMR videos can fool both people and AI video watchers (VLMs).
  • The test uses real ASMR videos with tightly matched sound and visuals, then asks video generators to create fakes and video understanders to spot them.
  • A special peer-review game scores creators by how well they fool reviewers and scores reviewers by how well they catch fakes.
  • Humans are still much better: people average about 89% accuracy, while top AI reviewers range widely and can drop to about 56% on the hardest fakes.
  • Adding audio helps AI reviewers by roughly 5 percentage points because many fakes have sound that doesn’t perfectly match the picture.
  • Watermarks can trick AI reviewers: when a Sora watermark is present, some AIs score as high as 95%, but after removal they drop to near coin-flip.
  • The best video creator in this test (Veo3.1-Fast) fools most AI reviewers, with only about 12.54% of its videos detected as fake.
  • Models tend to assume videos are real (about 71% of the time), which makes them easier to fool.
  • Comparing two videos side-by-side (pick which is real) is easier for AI than judging a single video.
  • This benchmark highlights where today’s AI still struggles: true audio–visual realism and avoiding shortcuts like watermarks.

Why This Research Matters

Realistic AI videos can shape what people believe, from product demos to news-like clips. This benchmark shows exactly where today’s AI reviewers still fall for fakes, especially when sound and sight nearly—but not perfectly—match. It also proves that simple shortcuts, like spotting a watermark, can inflate scores without true understanding. By focusing on ASMR, where tiny timing errors pop out, we get a tough, honest test of perceptual realism. Platforms can use this to improve moderation; researchers can use it to train better detectors; and the public benefits from safer, more trustworthy media. As AI video gets better, tools like Video Reality Test help society keep up.

Detailed Explanation

01 Background & Problem Definition

🍞 Hook: Imagine watching a slime-cutting video where the squish sound matches every slice perfectly. If the sound and sight fit like puzzle pieces, it feels real. But what if a computer made it?

🥬 The Concept (ASMR Videos): ASMR videos are relaxing clips where gentle sights and sounds (like tapping, brushing, or slicing) are tightly matched to trigger a calming feeling. How it works:

  1. A camera captures tiny hand–object actions.
  2. A mic records crisp, synced sounds (snips, taps, pours).
  3. Your brain checks: “Does this sound match that motion—right now?”

Why it matters: If sound and motion drift apart even a little, the magic breaks and you notice something’s off.

🍞 Anchor: Think of chopping a cucumber—if you see the knife hit before the ‘chop’ sound, it feels wrong instantly.

🍞 Hook: You know how cartoons are fun but don’t feel real because the sound effects don’t always match exactly?

🥬 The Concept (Audio–Visual Coupling): Audio–visual coupling is how sight and sound line up in time and texture so your brain believes what it sees. How it works:

  1. Watch the motion frame by frame.
  2. Align each motion beat with the right sound.
  3. Keep timing tight and textures consistent (soft brush, soft sound; hard tap, sharp click).

Why it matters: Break the coupling, and even beautiful videos feel fake.

🍞 Anchor: A door slam that sounds a second late makes the whole scene feel staged.

🍞 Hook: Imagine a talented art robot that can paint any scene you describe.

🥬 The Concept (Video Generation Models, VGMs): VGMs are AIs that create videos from text or images. How it works:

  1. Read a scene description and/or see a start image.
  2. Predict what each next frame should look like.
  3. Sometimes also generate matching audio.

Why it matters: If they get timing, motion, and textures right, their videos can look shockingly real.

🍞 Anchor: Ask a VGM for “hands slicing soap on a wooden board,” and it draws a moving scene that looks filmed.

🍞 Hook: Think of a super-careful movie critic who can pause, zoom in, and explain what’s happening in a video.

🥬 The Concept (Video Language Models, VLMs): VLMs are AIs that watch videos and answer questions about them. How it works:

  1. Look at sampled frames (and sometimes audio).
  2. Reason about objects, actions, and timing.
  3. Decide if something is real-looking or AI-made.

Why it matters: If VLMs can’t tell fake from real, they’re easy to fool.

🍞 Anchor: Show a VLM two soap-cutting clips and ask, “Which is real?”—it must notice tiny tells in light, motion, and sound.

🍞 Hook: Have you ever wondered if a picture online was edited or real?

🥬 The Concept (AI-Generated Content Detection): This is the job of telling human-made media from AI-made media. How it works:

  1. Check for tiny visual or audio glitches.
  2. Look for physics or timing errors.
  3. Combine clues to vote: real or fake.

Why it matters: Without it, anyone can spread convincing fakes.

🍞 Anchor: A video of a “new law announcement” could be AI-made; good detection stops misinformation.

The world before: Video generators (like Sora or Veo) got so good that many clips felt real—smooth motion, nice lighting, and sometimes even sound. Most tests judged visuals only (Was the object right? Did physics look okay?). Fewer looked closely at sound lining up with action. Meanwhile, detectors often focused on faces or simple artifacts, not on the tight sound–sight timing that ASMR depends on.

The problem: Can AI truly make immersive videos—especially ASMR—where sound and sight lock together so well that both people and VLMs get fooled? And can we measure this fairly?

Failed attempts: Older benchmarks:

  • Checked story questions (MCQs) instead of realism.
  • Ignored audio, even when videos were watched with sound.
  • Let models overfit to easy shortcuts like watermarks.
  • Pre-built a static set of fakes, which models could “learn.”

The gap: We needed a benchmark that:

  • Centers on ASMR, where audio–visual coupling is critical.
  • Pits creators (VGMs) against reviewers (VLMs) in a live, adversarial test.
  • Measures both human and AI realism judgments.
  • Controls for shortcuts like watermarks and checks how audio changes results.

Real stakes: If platforms or schools can’t tell when a video is AI-made, scams, fake “how-to’s,” and misleading product demos can spread fast. ASMR is a perfect stress test: even tiny timing errors stand out. If AI can pass here, it’s getting truly convincing—and if detectors fail here, they’ll likely fail elsewhere too.

02 Core Idea

🍞 Hook: Imagine a magic show. One team creates illusions, and another team tries to catch the trick. Who wins tells you how good both teams are.

🥬 The Concept (Peer-Review Evaluation): It’s a two-player game where video generators (creators) try to fool video understanders (reviewers), and reviewers try to spot the fakes. How it works:

  1. Start with real ASMR videos.
  2. Creators (VGMs) generate fakes from a start frame and/or text.
  3. Reviewers (VLMs) judge: real (1) or fake (0), sometimes with audio.
  4. Score creators by how rarely they’re caught; score reviewers by accuracy.

Why it matters: This direct duel measures true realism, not just checklists.

🍞 Anchor: If a creator’s soap-cutting video fools most reviewers, that creator climbs the leaderboard; if a reviewer catches many fakes, that reviewer rises too.
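
To make the duel concrete, here is a minimal Python sketch of the peer-review loop, assuming hypothetical `generate_fake` and `judge_video` style callables that stand in for the actual VGM and VLM APIs (which are not specified here):

```python
# Minimal sketch of the creator-vs-reviewer game. The callables are hypothetical
# stand-ins: a creator turns a real clip into a fake; a reviewer returns 1 (real) or 0 (fake).
from typing import Callable, Dict, List


def run_peer_review(
    real_videos: List[str],
    creators: Dict[str, Callable[[str], str]],   # name -> fn(real_path) -> fake_path
    reviewers: Dict[str, Callable[[str], int]],  # name -> fn(video_path) -> 1 or 0
):
    # Build the pool: every real clip, plus one fake per (creator, real clip).
    pool = [(path, 1, None) for path in real_videos]             # (video, label, creator)
    for c_name, create in creators.items():
        pool += [(create(path), 0, c_name) for path in real_videos]

    # Every reviewer judges every video in the pool.
    reviewer_correct = {r: 0 for r in reviewers}
    creator_caught = {c: 0 for c in creators}
    for video, label, c_name in pool:
        for r_name, judge in reviewers.items():
            pred = judge(video)
            reviewer_correct[r_name] += int(pred == label)
            if label == 0 and pred == 0:      # a fake was caught
                creator_caught[c_name] += 1

    n_videos, n_reviewers = len(pool), len(reviewers)
    reviewer_accuracy = {r: c / n_videos for r, c in reviewer_correct.items()}
    # Fakeness rate: share of a creator's fakes flagged by reviewers (lower is better).
    fakeness_rate = {
        c: caught / (len(real_videos) * n_reviewers)
        for c, caught in creator_caught.items()
    }
    return reviewer_accuracy, fakeness_rate
```

In this sketch a reviewer’s score is its fraction of correct calls over the whole mixed pool, and a creator’s fakeness rate is the share of its fakes that reviewers flag, averaged across reviewers.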

Aha (one sentence): Use tightly audio–visual-coupled ASMR videos and a creator-versus-reviewer game to measure how real AI videos feel to both humans and AI.

Three analogies:

  1. Hide-and-seek: Creators hide artifacts; reviewers seek them—better hiding or better seeking wins.
  2. Cooking contest: Chefs (VGMs) cook dishes (videos); judges (VLMs) taste with both eyes and ears; dishes that fool judges score higher.
  3. Spot-the-difference: One picture is real, one is drawn—if your drawing is perfect or the judge is sloppy, you win.

Before vs After:

  • Before: Tests focused on vision or physics, ignored sound, or used static fake sets models could memorize.
  • After: A living benchmark where fresh fakes keep coming, audio matters, and both sides pressure each other to improve.

Why it works (intuition, no math):

  • ASMR is hypersensitive to timing; small errors reveal fakes.
  • Audio mismatches are tough to get right and easy to notice when they’re wrong.
  • A dynamic, adversarial setup prevents overfitting to one dataset.
  • Measuring both fooling rate and detection accuracy shows the true boundary of current realism.

Building blocks:

  • Data core: 1,400 social media ASMR videos; 149 carefully curated real clips spanning objects, actions, backgrounds, and two difficulty levels (easy/hard).
  • Generation conditions: Use a start frame and/or a text description to guide VGMs.
  • Review protocols: VLMs see 8 sampled frames (and sometimes audio) and vote real/fake.
  • Controls: With vs without audio; with vs without watermarks; direct judgment vs pairwise preference; chain-of-thought vs direct answers; varying number of frames.
  • Scoring: Reviewers get accuracy; creators get fakeness rate (lower is better).
  • Leaderboards: Public scoreboards for best creator and best reviewer keep pressure high.

🍞 Anchor: In practice, Veo3.1-Fast (creator) makes a hand-slicing-soap clip so convincing that Gemini 2.5-Pro (reviewer) drops close to 56%—barely above guessing—while human experts still do much better overall.

03 Methodology

High-level recipe: Real ASMR video → curate and describe → sample and balance → creators make fakes (from start frame and/or text) → reviewers judge with or without audio → score creators and reviewers → analyze audio, watermarks, and biases.

Step 1: Collect and clean real ASMR videos

  • What happens: Gather 1,400 popular ASMR clips from social media; pick 149 high-quality, diverse, human-made videos (49 easy, 100 hard).
  • Why this step: We need rock-solid reals to compare against fakes; ASMR emphasizes tiny timing and texture details.
  • Example: Choose a 12-second peeling video with crisp, rhythmic scraping that matches each motion.

Step 2: Preprocess and segment

  • What happens: Split long compilations into single-theme snippets using color-histogram changes and Bhattacharyya distance (threshold 0.5); remove padded backgrounds and some watermarks; extract the first frame for later conditioning.
  • Why this step: Keep only coherent, single-action clips and a clean start frame so generation is fair and focused.
  • Example: From a 2-minute mashup, auto-cut a 7-second soap-slicing segment and save frame-1.jpg.
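
A minimal sketch of the shot-splitting idea, assuming OpenCV HSV color histograms; only the Bhattacharyya-distance threshold of 0.5 comes from the paper, so the color space and bin counts below are guesses:

```python
import cv2


def find_scene_cuts(video_path: str, threshold: float = 0.5):
    """Return frame indices where the color histogram jumps, i.e. likely scene cuts."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if dist > threshold:          # big jump -> a new single-theme snippet starts here
                cuts.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts
```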

Step 3: Create text storyboards

  • What happens: Use a strong model (e.g., Gemini 2.5-Pro) to describe each video’s scene using 8 evenly spaced frames: objects, actions, background, lighting, and the implied ASMR sounds.
  • Why this step: VGMs need guidance—what to render, how it should move, and what it should sound like.
  • Example: “A gloved hand brushes foam across a wooden board, soft swishing sounds in a dim kitchen.”
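
A small sketch of how 8 evenly spaced frames might be pulled for the storyboard step; the frame sampling is standard OpenCV, while the prompt text is a hypothetical stand-in for whatever Gemini 2.5-Pro was actually given:

```python
import cv2
import numpy as np


def sample_even_frames(video_path: str, n_frames: int = 8):
    """Grab n_frames evenly spaced frames from a clip (returned as BGR arrays)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, n_frames).astype(int)
    frames = []
    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames


# Hypothetical storyboard prompt; the paper's exact wording is not given here.
STORYBOARD_PROMPT = (
    "Describe this ASMR video from the 8 frames provided: the objects, the action, "
    "the background and lighting, and the sounds the action would produce."
)
# These frames plus STORYBOARD_PROMPT would then be sent to the captioning model.
```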

Step 4: Cluster and balance the dataset

  • What happens: Use Qwen3-embedding-4B to cluster descriptions into 8 groups (maximize silhouette), then sample evenly so one type (like ‘cutting’) doesn’t dominate.
  • Why this step: Prevents the benchmark from becoming lopsided and easy to game.
  • Example: If ‘pouring paint’ is rare, ensure it still appears in the final set.
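
A hedged sketch of the cluster-and-balance step using scikit-learn, assuming the description embeddings (e.g., from Qwen3-embedding-4B) are already computed; the candidate k range and per-cluster sample size are illustrative choices, not values from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def cluster_and_balance(embeddings: np.ndarray, k_range=range(2, 13), per_cluster: int = 20):
    """Pick the cluster count with the best silhouette, then sample evenly per cluster."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels

    # Even sampling so no single theme (e.g., 'cutting') dominates the benchmark.
    rng = np.random.default_rng(0)
    balanced_indices = []
    for c in range(best_k):
        members = np.flatnonzero(best_labels == c)
        take = min(per_cluster, len(members))
        balanced_indices.extend(rng.choice(members, take, replace=False))
    return best_k, balanced_indices
```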

Step 5: Generate fakes with different inputs

  • What happens: Ask multiple VGMs (e.g., Veo3.1-Fast, Sora2, Wan2.2, HunyuanVideo, OpenSora-V2, StepVideo) to produce fake videos using: a) start-frame image + detailed text, or b) start-frame image only (“make it as real as possible”), or c) text only.
  • Why this step: Tests how conditioning affects realism and whether models can keep motion and sound aligned.
  • Example: Give a soap-slicing first frame and the description; see how each VGM recreates the action and audio.
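
The three conditioning setups can be written down as a small configuration list; the exact prompt wording below is an assumption except for the quoted "make it as real as possible" instruction:

```python
# Conditioning setups used to drive the video generators (VGMs).
GENERATION_CONDITIONS = [
    {"name": "image+text", "start_frame": True,  "prompt": "<detailed storyboard description>"},
    {"name": "image_only", "start_frame": True,  "prompt": "make it as real as possible"},
    {"name": "text_only",  "start_frame": False, "prompt": "<detailed storyboard description>"},
]
```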

Step 6: Reviewers judge reality

  • What happens: VLMs (e.g., Gemini 3-Pro, Gemini 2.5-Pro/Flash, GPT-5, GPT-4o/mini, Qwen family, GLM-4.5V) watch 8 sampled frames (and optionally audio) and output 1 for real, 0 for fake.
  • Why this step: Measures how well AI can spot tiny audio–visual mismatches and visual artifacts.
  • Example: A VLM hears crackling that stops abruptly while the image still bubbles—calls it fake (0).
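
A hypothetical version of the reviewer prompt and the 1/0 parsing, with the actual VLM call left out; the prompt wording is an assumption based on the protocol described above:

```python
# Hypothetical judging prompt; the real benchmark prompt may differ.
JUDGE_PROMPT = (
    "You are shown 8 frames sampled from a video (audio may be attached). "
    "Decide whether the video is a real recording or AI-generated. "
    "Answer with a single digit: 1 for real, 0 for AI-generated."
)


def parse_verdict(model_reply: str) -> int:
    """Map a free-form reply to the 1 (real) / 0 (fake) label used for scoring."""
    reply = model_reply.strip()
    if "0" in reply and "1" not in reply:
        return 0
    if "1" in reply and "0" not in reply:
        return 1
    # Ambiguous reply: take the first digit found; defaulting to 'real' at the end
    # is an arbitrary fallback choice in this sketch.
    for ch in reply:
        if ch in "01":
            return int(ch)
    return 1
```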

🍞 Hook: You know how in class, projects are graded by both how well they’re made and how well a classmate can explain them?

🥬 The Concept (Scoring: Fakeness rate and Accuracy): Creators are scored by fakeness rate (how rarely reviewers catch them); reviewers are scored by accuracy on real vs fake. How it works:

  1. Reviewer accuracy: percent of correct real/fake calls over all videos.
  2. Creator fakeness rate: percent of its fakes detected as fake (lower is better).

Why it matters: Balanced scoring rewards true realism and true detection.

🍞 Anchor: If Veo3.1-Fast is detected only 12.54% of the time, it’s very good at fooling AI judges.
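
With made-up counts, the two scores reduce to simple ratios:

```python
# Made-up counts, only to make the two formulas concrete.
correct_calls, total_judged = 180, 240
reviewer_accuracy = correct_calls / total_judged        # 0.75 -> 75% accuracy

flagged_fakes, total_fakes = 30, 240
creator_fakeness_rate = flagged_fakes / total_fakes     # 0.125 -> 12.5%, lower is better
```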

Step 7: Control experiments (ablation)

  • With vs without audio: Does sound help VLMs spot fakes?
  • With vs without watermark: Do VLMs rely on logos instead of actual video quality?
  • Direct judgment vs pairwise preference: Is choosing the real video (from two) easier than single-video judgment?
  • Chain-of-thought vs direct answer: Does thinking aloud help strong models?
  • Vary frames: Does seeing more of the timeline help?

Why these steps: They reveal shortcuts, biases, and what signals matter most.

Example: Remove Sora watermarks and watch detection scores drop toward chance—proof of shortcutting.
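
The controls above can be enumerated as a small grid of evaluation configurations; the specific frame counts below are an assumption:

```python
from itertools import product

# Control conditions from the list above; frame counts are illustrative.
ABLATIONS = {
    "audio":     [True, False],
    "watermark": [True, False],
    "protocol":  ["single_judgment", "pairwise_preference"],
    "reasoning": ["chain_of_thought", "direct_answer"],
    "n_frames":  [4, 8, 16],
}


def ablation_grid():
    """Yield one evaluation configuration per combination of control settings."""
    keys = list(ABLATIONS)
    for values in product(*(ABLATIONS[k] for k in keys)):
        yield dict(zip(keys, values))

# e.g. next(ablation_grid()) ->
# {'audio': True, 'watermark': True, 'protocol': 'single_judgment',
#  'reasoning': 'chain_of_thought', 'n_frames': 4}
```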

Secret sauce:

  • ASMR focus: It magnifies audio–visual timing errors that casual viewers or models might miss elsewhere.
  • Adversarial loop: New fakes keep the test fresh; creators and reviewers co-evolve.
  • Shortcut busting: Watermark removal and audio toggles expose what models are really using to decide.

🍞 Hook: Imagine a school science fair where inventors and judges level each other up all year long.

🥬 The Concept (Watermarks): Watermarks are on-screen logos or marks indicating a source (like an AI model’s name). How it works:

  1. They’re placed on generated videos.
  2. Reviewers may latch onto them as an easy tell.
  3. Remove them to see if detection still works.

Why it matters: If scores collapse without watermarks, reviewers weren’t truly understanding the video.

🍞 Anchor: Some VLMs hit about 95% accuracy on watermarked Sora videos but drop to ~52% without the mark.

🍞 Hook: When you flip through a comic, do the panels flow naturally, or do characters jump around oddly?

🥬 The Concept (Temporal Coherence): It’s the natural, continuous flow of motion and changes across frames. How it works:

  1. Each frame follows logically from the previous.
  2. Objects keep identity, lighting, and shape consistent.
  3. Motions don’t warp or skip.

Why it matters: Broken flow feels fake even if a single frame looks great.

🍞 Anchor: In soap-cutting, slices should thin and pile up smoothly—not blink in and out.
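
As a rough illustration (not the paper's metric), one way to surface breaks in temporal coherence is to check how similar consecutive frames are, e.g., with SSIM; objects blinking in and out show up as sudden dips:

```python
import cv2
from skimage.metrics import structural_similarity


def coherence_dips(video_path: str, min_ssim: float = 0.5):
    """Flag frame indices where consecutive frames are structurally very different."""
    cap = cv2.VideoCapture(video_path)
    dips, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            score = structural_similarity(prev_gray, gray)
            if score < min_ssim:          # abrupt appearance change -> possible broken flow
                dips.append(idx)
        prev_gray, idx = gray, idx + 1
    cap.release()
    return dips
```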

04 Experiments & Results

The test: We measured two things—how well VLMs spot fakes (accuracy) and how well VGMs avoid being spotted (low fakeness rate). We also checked how audio and watermarks affect results, whether pairwise comparisons help, and whether reasoning out loud (chain-of-thought) changes outcomes.

The competition:

  • Creators (VGMs): Veo3.1-Fast, Sora2, Wan 2.2 (various sizes), HunyuanVideo, OpenSora-V2, StepVideo.
  • Reviewers (VLMs): Gemini 3-Pro, Gemini 2.5-Pro and Flash, GPT-5, GPT-4o and 4o-mini, Qwen3/2.5 family, GLM-4.5V.
  • Humans: Expert annotators as the gold standard.

Scoreboard with context:

  • Humans are top: about 89.11% average across creator sets—like getting an A when many AIs are at C to B-.
  • Best reviewer average: Gemini 3-Pro hits around 76.27% overall—good, but still well below humans.
  • On the hardest creator (Veo3.1-Fast), even strong reviewers struggle: Gemini 2.5-Pro drops to about 56%—just above a coin flip.
  • Best creator: Veo3.1-Fast has only about 12.54% of its fakes detected—like pulling off a magic trick on most judges.
  • Open-source is catching up: HunyuanVideo and Wan2.2-A14B perform close to proprietary models on some settings.

Surprising findings:

  1. Audio helps reviewers: Adding audio boosts top VLMs by roughly 5 points—because the audio in current fakes often misaligns with the visuals (e.g., wrong texture, abrupt cutoffs, or even human speech in ASMR scenes).
  2. Audio hurts creators when it’s weak: When creators add audio that doesn’t match the visuals, detection rates go up (e.g., Veo3.1-Fast from 12.54% to about 15.05%; Sora2 from around 46.58% to 49.47%).
  3. Watermarks are a shortcut: With Sora logos visible, some VLMs score up to ~95%; remove them and scores can crash to ~52%—barely above guessing.
  4. Real-bias: VLMs tend to call videos real about 71% of the time, which inflates errors on good fakes.
  5. Preference is easier: Picking the real one from a pair beats single-video judgment for many models.
  6. Reasoning helps strong models: Chain-of-thought can improve top VLMs (e.g., Gemini 2.5-Pro), but it may confuse weaker ones (e.g., some Qwen variants) and hurt accuracy.

Examples that make it concrete:

  • Audio alignment catch: A clip shows molten material bubbling, but the crackling sound stops abruptly before the visuals settle. With audio, Gemini flags “fake”; without audio, it might pass.
  • Watermark trap: A Sora2 clip is flagged as fake immediately due to its watermark. After watermark removal, the same clip is often judged real—showing over-reliance on superficial cues.
  • Fooling success: A Veo3.1-Fast hand-motion scene with consistent lighting and realistic object deformation causes both GPT-5 and Gemini 2.5-Pro to label it real.

Big picture: Video Reality Test sets a fair playing field and shows the current boundary of realism. Humans still lead. Top reviewers are improving but remain vulnerable to polished fakes, especially when audio is either missing (they miss cues) or misleading (they get tricked). Creators that master tiny timing details in ASMR can fool many AIs; creators that stumble on audio alignment get caught more often.

05 Discussion & Limitations

Limitations:

  • Domain focus: It’s ASMR-centric. That’s great for testing tight timing, but it’s not the entire world of video (sports, news, vlogs, etc.).
  • Sampling frames: Reviewers often see 8 frames, not the full clip, which may miss subtle transitions (though audio experiments help).
  • Audio support and access: Some open models can’t yet take audio + video together, and some top generators are closed-source with limited controls.
  • Dynamic dataset: Because creators keep generating fresh fakes, exact reproducibility across time can be tricky—though it reflects real-world drift.

Required resources:

  • Access to multiple VGMs and VLMs (often via paid APIs or strong GPUs).
  • Storage and bandwidth for videos and audio.
  • Evaluation harness to standardize prompts, frame sampling, and audio toggles.
  • Optional: watermark-removal pipelines to control for shortcuts.

When not to use:

  • If you only care about face deepfakes or speech lip-sync (different cues dominate there).
  • If your task has no audio at all (then other benchmarks might be simpler and faster).
  • If you need fixed, unchanging datasets for long-term regression tests (the dynamic element here is a feature, not a bug).

Open questions:

  • How can we build reviewers that focus on genuine audio–visual causality instead of shortcuts (logos, text overlays, or trendy artifacts)?
  • What’s the best way to quantify audio–visual alignment so models learn it directly?
  • Can provenance tools (trusted metadata/watermarking) help without being easily exploited by detectors as shortcuts?
  • How do we reduce the “everything is real” bias in VLMs while avoiding false alarms?
  • Can hybrid systems—audio experts + video experts + language reasoning—beat single monolithic models on realism?

06 Conclusion & Future Work

Three-sentence summary: This paper introduces Video Reality Test, a benchmark that pits video generators against video reviewers on realistic ASMR videos where sound and sight must match tightly. It shows that top AIs can be fooled—especially when fakes are polished and when reviewers lean on shortcuts like watermarks—while humans still outperform machines. It also proves that audio matters: adding it both helps detection and exposes weaknesses in today’s generators.

Main achievement: A dynamic, peer-review benchmark that fairly measures perceptual realism under tight audio–visual coupling, with leaderboards for both creators and reviewers and careful controls for shortcuts like watermarks.

Future directions: Expand beyond ASMR to everyday scenes, sports, and news; improve audio-visual alignment metrics; integrate provenance signals robustly; and build reviewers that reason about cause-and-effect in time rather than surface cues.

Why remember this: If an AI video can pass the ASMR test, it’s truly convincing—and if a detector fails the ASMR test, it will likely fail in the wild. Video Reality Test shines a bright light on where our models really stand today and where they must improve tomorrow.

Practical Applications

  • Platform safety audits: Evaluate how well content moderation AIs catch sophisticated AI-made videos before they spread.
  • Detector training curriculum: Use audio-on vs audio-off splits to teach models to rely on real alignment, not shortcuts.
  • Watermark robustness checks: Measure how much detection depends on visible logos and fix over-reliance.
  • Benchmarking contests: Run public leaderboards for both creators and reviewers to drive progress.
  • Model selection: Choose the most reliable reviewer model for a given use case (e.g., news verification vs entertainment).
  • Creator feedback: Help video generators improve audio–visual timing to reduce artifacts that get them caught.
  • Policy testing: Evaluate how proposed provenance standards (invisible watermarks, metadata) affect fairness and reliability.
  • Educational demos: Show students how small audio–visual mismatches reveal fakes and why critical thinking matters.
  • Forensics pipelines: Combine temporal coherence checks with audio cues for stronger real/fake judgments.
  • Product QA: Test AI-made promo videos for subtle mismatches before public release.
#ASMR #audio-visual coupling #AI-generated video detection #video language models #video generation models #peer-review benchmark #temporal coherence #watermark bias #multimodal evaluation #chain-of-thought prompting #preference testing #audio alignment #perceptual realism #Veo #Sora
Version: 1