
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Intermediate
Yifei Li, Wenzhao Zheng, Yanran Zhang et al. · 12/17/2025
arXiv · PDF

Key Summary

  • Skyra is a detective-style AI that spots tiny visual mistakes (artifacts) in videos to tell if they are real or AI-generated, and it explains its decision with times and places in the video.
  • Instead of a yes/no guess, Skyra points to concrete clues, like a hand melting or an object popping in, and tags them with timestamps and bounding boxes.
  • The team built ViF-CoT-4K, the first large, human-annotated dataset of video artifacts with a clear, fine-grained taxonomy and step-by-step explanations.
  • Skyra learns in two stages: first with supervised examples (cold start) and then with reinforcement learning that rewards careful checking and punishes false alarms more strongly.
  • A new benchmark, ViF-Bench, tests detectors fairly by pairing real and fake videos with matched topics, formats, and lengths from more than ten modern video generators.
  • On ViF-Bench, Skyra beats strong binary detectors and general multimodal LLMs by large margins, turning near-random guessing into A-grade performance.
  • Reinforcement learning adds a ā€˜self-check’ habit that helps Skyra find subtle, physics-violating clues while avoiding confusing real-world blur or compression with fakes.
  • Skyra stays robust under common video degradations (compression, noise, lighting changes), keeping top accuracy when other methods falter.
  • The approach is explainable and practical for human reviewers, but it does not judge intent or social harm and may still overconfidently describe uncertain details.
  • This work helps rebuild trust in video by making AI detection both accurate and understandable for journalists, platforms, and everyday users.

Why This Research Matters

Videos shape what we believe about the world, so we need detectors that can explain their decisions, not just guess. Skyra shows its work: it marks the exact times and places where a video breaks physics or common sense, helping people verify claims quickly. This makes it useful for journalists, platforms, and educators who must decide what to publish or flag. By focusing on true, human-visible artifacts, Skyra avoids shortcuts and stays robust under common video glitches like compression or blur. The approach adapts quickly to new generators, a must as AI video tools improve fast. Clear, grounded explanations also help teach audiences what kinds of clues reveal fakes. In short, Skyra turns AI video detection into a transparent, trustworthy process.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): You know how when you watch a magic trick, you try to spot the moment something looks off—like a coin that jumps without a push or a shadow that shouldn’t be there? Those little giveaways help you figure out the trick.

🄬 Filling (The Actual Concept):

  • What it is: This paper tackles how to tell if a video is real or made by AI by hunting for human-noticeable visual mistakes called artifacts.
  • How it works: People and machines look for small clues—like shape warping, objects appearing from nowhere, or impossible motion—that break real-world physics or common sense, then use those clues to decide if the video is fake.
  • Why it matters: Without hunting for artifacts, detectors just guess ā€œreal or fakeā€ and often latch onto the wrong things (like blur or lighting), which makes them brittle and hard to trust.

šŸž Bottom Bread (Anchor): Imagine a bartender lifting a metal shaker that bends like rubber. That subtle impossibility is the kind of clue that gives an AI-generated video away.

šŸž Top Bread (Hook): Imagine your science teacher asks, ā€œIs this plant alive?ā€ and you just say ā€œyesā€ or ā€œnoā€ without pointing to leaves, color, or growth. That answer isn’t very helpful.

🄬 Filling (The Actual Concept: Binary Classifiers):

  • What it is: A binary classifier is a model that returns only ā€œrealā€ or ā€œfake.ā€
  • How it works: It learns patterns from many examples but usually can’t show which parts of the video made it decide.
  • Why it matters: Without explanations, people can’t check the model’s work, and the model can overfit shortcuts (like frame rate or video style) that don’t generalize.

šŸž Bottom Bread (Anchor): If a detector says a news clip is fake but can’t show any glitch or physics error, would you trust it? Likely not.

šŸž Top Bread (Hook): You know how you can understand a comic better because it has both pictures and words together?

🄬 Filling (The Actual Concept: Multimodal Large Language Model — MLLM):

  • What it is: An MLLM is an AI that can read and reason over text and visuals (images or videos) together.
  • How it works: A visual encoder turns frames into features; a language model reasons over those features plus text prompts to produce answers and explanations.
  • Why it matters: Videos are both motion and meaning; you need a model that can see and explain.

šŸž Bottom Bread (Anchor): When you ask, ā€œIs this video real?ā€, an MLLM can look at frames and also explain its thinking in sentences.

The world before: Generative video models got really good—so good that anyone can make realistic clips from text or a single image. Detectors popped up, but many were black boxes. Even general MLLMs, when prompted carefully, often performed near random on this task and confused normal video issues (like motion blur) for fakes.

The problem: Existing datasets had mismatched real vs. fake video lengths and styles, encouraging shortcut learning. Artifact labels were rare or vague. Models described scenes (ā€œa man in a barā€) instead of catching true tell-tale clues (like an object melting or teleporting).

Failed attempts: Binary detectors overfit to superficial cues and struggle on new generators. Prompting off-the-shelf MLLMs with step-by-step instructions still missed subtle spatio-temporal artifacts or hallucinated reasons on real videos.

šŸž Top Bread (Hook): Imagine being a detective without a rulebook for what counts as real evidence.

🄬 Filling (The Actual Concept: Grounded Artifact Reasoning):

  • What it is: A method where the model must find, name, and point to specific visual mistakes (artifacts) tied to exact times and locations in the video.
  • How it works: It uses an artifact taxonomy and outputs tagged evidence—types, timestamps, and bounding boxes—inside a reasoning block.
  • Why it matters: Grounding turns guesses into checkable claims, making the model more accurate and trustworthy.

šŸž Bottom Bread (Anchor): ā€œAt 1.2–1.6s, the cup’s rim melts at [0.35, 0.45, 0.55, 0.65]ā€ is evidence a human can verify frame by frame.

The gap: We needed (1) a clear, fine-grained artifact rulebook, (2) a large, human-annotated dataset of these artifacts with step-by-step reasoning, and (3) a training recipe that teaches a model to actively look for such clues and avoid overcalling fakes.

Real stakes: Misinformation, scams, and erosion of trust can spread fast when fake videos look real. Teachers, journalists, platforms, and families need tools that don’t just decide but show why—so people can judge for themselves.

02Core Idea

šŸž Top Bread (Hook): Imagine watching a replay with a coach who pauses the video, circles the exact spot, and says, ā€œHere’s the foul, at this time, right there.ā€

🄬 Filling (The Actual Concept):

  • What it is: The key insight is to detect AI-generated videos by finding, naming, and grounding human-visible artifacts (with times and boxes) and using them as the reason for the verdict.
  • How it works: Build a precise artifact rulebook; collect many human examples with step-by-step explanations; train a model to follow a strict reasoning template; then refine it with reinforcement learning that rewards careful inspection and punishes false alarms more.
  • Why it matters: Explanations anchored in the video itself make detection robust, fair, and checkable by humans.

šŸž Bottom Bread (Anchor): ā€œThe robot’s leg melts from 0.3–0.9s in this rectangle—so Fake.ā€ That’s both a decision and proof.

Three analogies:

  • Detective: Look for fingerprints (artifacts), note when and where you found them, and explain the crime.
  • Referee: Stop the play, show the exact rule break (physics violation), and mark the spot.
  • Science judge: Ask for evidence (timestamps/boxes) before approving or rejecting a claim (real/fake).

Before vs. After:

  • Before: Models gave yes/no answers and chased shortcuts (like video style), crumbling on new generators.
  • After: The model points to universal, model-agnostic clues (physics/common-sense breaks), transferring better across generators and formats, with human-auditable reasons.

šŸž Top Bread (Hook): You know how a toolbox labels each tool so you grab the right one for the job?

🄬 Filling (The Actual Concept: Artifact Taxonomy):

  • What it is: A three-level catalog of visual mistakes. Its two top-level categories are Low-level Forgery (texture/color/motion oddities) and Violation of Laws (object inconsistency, interaction issues, causality breaks, commonsense violations, unnatural movement), and each mid-level category breaks down into fine-grained L3 types.
  • How it works: Annotators choose exact L3 types (like Shape Distortion, Abnormal Object Appearance) and mark when/where they occur.
  • Why it matters: Clear labels prevent confusion and train the model to look for the right evidence.

šŸž Bottom Bread (Anchor): ā€œText Distortionā€ is different from ā€œColor Over-Saturation,ā€ and the model learns to tell them apart and point to each.

šŸž Top Bread (Hook): Imagine your math homework shows every step so the teacher can see your thinking.

🄬 Filling (The Actual Concept: Chain-of-Thought (CoT) Annotation):

  • What it is: A written, step-by-step reasoning process that includes tagged artifact evidence.
  • How it works: For fake videos, CoT lists found artifacts with <type>, <t>, and <bbox>; for real videos, it inspects likely regions and clears them.
  • Why it matters: Training on CoT teaches the model to inspect methodically instead of guessing.

šŸž Bottom Bread (Anchor): ā€œI checked 0.5–1.5s in this box; the hand stays anatomically correct—so Real.ā€

šŸž Top Bread (Hook): Think of a student who first studies with a tutor, then practices under game rules.

🄬 Filling (The Actual Concept: Two-Stage Training — SFT then RL):

  • What it is: First, supervised fine-tuning (SFT) gives the model a solid start; then reinforcement learning (RL) sharpens its habit of active checking and reduces false alarms.
  • How it works: SFT learns from ViF-CoT-4K examples and strict output templates. RL (with GRPO) uses rewards: positive points for correct verdicts, a bonus for valid check blocks, and a stronger penalty for calling a real video fake.
  • Why it matters: Without SFT, RL gets lost (sparse signals). Without RL, the model explores less and stays too cautious or shortcut-prone.

šŸž Bottom Bread (Anchor): After SFT the model can spot common clues; after RL it catches trickier cases without panicking over motion blur.

šŸž Top Bread (Hook): You know how a fair test gives both teams the same ball, field, and rules?

🄬 Filling (The Actual Concept: ViF-Bench):

  • What it is: A benchmark of matched real/fake videos from many modern generators, aligned in topic and format.
  • How it works: It removes shortcuts like different lengths or styles and asks detectors to find real artifact evidence.
  • Why it matters: It tests true understanding, not hacks.

šŸž Bottom Bread (Anchor): If real and fake both show a bartender scene, only a true artifact (like a warping shaker) separates them.

03Methodology

At a high level: Video → sample 16 frames → Skyra inspects spatio-temporal regions using a CoT template → outputs: <think> grounded evidence </think> + <answer> Real/Fake </answer>.
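As a rough sketch of the frame-sampling step, uniform sampling over the clip might look like the following; the paper's exact sampling strategy may differ.

```python
def sample_frame_indices(num_total_frames: int, num_samples: int = 16) -> list[int]:
    """Spread `num_samples` frame indices evenly across the clip.
    A generic sketch of the '16 sampled frames' step, not the authors' code."""
    if num_total_frames <= num_samples:
        return list(range(num_total_frames))
    step = num_total_frames / num_samples
    return [int(i * step + step / 2) for i in range(num_samples)]

# e.g. a 10-second clip at 24 fps -> 240 frames -> 16 roughly evenly spaced indices
print(sample_frame_indices(240))
```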

Step 1: Build the rulebook and the study guide (Artifact Taxonomy + ViF-CoT-4K)

  • What happens: Researchers design a 3-layer taxonomy and collect thousands of videos (real and AI-generated) from many sources. Expert annotators label artifacts with L3 types and precise times and boxes. For each video, they produce step-by-step Chain-of-Thought that either (a) finds artifacts (fake) or (b) inspects and clears likely regions (real).
  • Why this step exists: Without a clear rulebook and many labeled examples, the model won’t learn what to look for or how to explain itself.
  • Example: A clip shows a metal tool bending like clay at 1.2–1.6s in the lower-left quarter. Labeled as <type>Shape Distortion</type> in that time and box, with a short explanation of why metal shouldn’t deform that way.
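A hypothetical annotation record for that metal-tool example, with made-up field names, could look like this:

```python
# Illustrative annotation record; field names and the exact on-disk format of
# ViF-CoT-4K are assumptions made for this sketch.
annotation = {
    "video_id": "vif_example_0001",           # hypothetical identifier
    "verdict": "Fake",
    "artifacts": [
        {
            "type": "Shape Distortion",        # fine-grained L3 type
            "t": [1.2, 1.6],                   # seconds
            "bbox": [0.0, 0.5, 0.5, 1.0],      # lower-left quarter, normalized
            "explanation": "A rigid metal tool bends like clay; metal should not deform this way.",
        }
    ],
}
```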

šŸž Top Bread (Hook): Imagine learning soccer by first drilling the basics with cones.

🄬 Filling (The Actual Concept: Supervised Fine-Tuning — SFT):

  • What it is: The model is trained to copy good answers and reasoning from the dataset.
  • How it works: Given frames and a prompt, the model generates the <think> CoT </think> and the <answer> verdict, following the strict format, learning to tag artifacts and to clear regions in real videos.
  • Why it matters: It bootstraps the model with solid habits before we let it explore on its own.

šŸž Bottom Bread (Anchor): After SFT, Skyra can already say ā€œFakeā€ and point to the warping shaker with correct tags, or say ā€œRealā€ and show that the text overlay is stable.

Step 2: Teach careful checking and reduce false alarms (Reinforcement Learning — RL)

  • What happens: Skyra practices on more examples and gets scored. Correct verdicts earn points. Calling a real video fake is punished more than missing a fake (asymmetric reward). Extra points are given when the CoT includes well-formed, time-and-box ā€œcheck blocks.ā€ Training uses Group Relative Policy Optimization (GRPO) to adjust the model toward higher-scoring behaviors.
  • Why this step exists: Some videos are very clean; humans and models both can miss tiny clues. RL nudges the model to probe suspicious regions and to avoid confusing natural blur/compression with fakes.
  • Example: A news clip with motion blur around the reporter’s hands. SFT might worry it’s fake; RL encourages the model to check, note consistent anatomy and lighting, and confidently conclude ā€œReal.ā€

šŸž Top Bread (Hook): Think of a game where you get extra credit for showing your work, but heavy penalties for crying ā€œwolf.ā€

🄬 Filling (The Actual Concept: Asymmetric Accuracy Reward + Inspection Reward):

  • What it is: A reward design that (1) punishes false positives on real videos more, and (2) gives bonus points for including valid, grounded checks in the CoT.
  • How it works: Accuracy reward is +1 for correct, 0 for missing a fake, and āˆ’0.2 for falsely accusing a real. Inspection reward grows when the model includes well-formed tagged evidence or clearance checks.
  • Why it matters: Real videos are common; we must avoid wrongly flagging them. Requiring grounded checks makes the reasoning concrete and useful.

šŸž Bottom Bread (Anchor): ā€œI checked the banner text from 0–5s at this box; it’s stable and sharp—Real.ā€ earns inspection credit and builds good habits.

Step 3: Make the test fair (ViF-Bench)

  • What happens: The benchmark pairs real and fake videos with matched topics, styles, and lengths from many modern generators (like Sora-2, Wan2.2, Kling, Pika, etc.).
  • Why this step exists: If real videos are always longer or have a higher FPS, detectors might ā€˜cheat’ using shortcuts. Matching removes these crutches.
  • Example: A city skyline real/fake pair. Only true artifacts—like ā€˜Structure Anomaly’ in windows or unnatural water ā€˜Texture Jittering’—should drive the decision.

Step 4: Inference (Using Skyra on a new video)

  • What happens: Sample frames → prompt Skyra to analyze → Skyra produces a <think> block listing findings in time order with <type>, <t>, and <bbox>, then an <answer> Real/Fake </answer> verdict.
  • Why this step exists: The consistent output lets humans quickly review evidence, not just a label.
  • Example output (fake): ā€œ<type>Abnormal Object Appearance</type> in <t>[1.3, 2.0]</t> at <bbox>[0.4, 0.6, 0.5, 0.8]</bbox> … <answer>Fake</answer>.ā€
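For downstream tools or human reviewers, the structured template is easy to parse. A minimal sketch (assuming the tag format shown in the example above) is:

```python
import re

def parse_skyra_output(text: str) -> dict:
    """Parse the <think>/<answer> template into evidence plus a verdict.
    Minimal sketch; the real output may carry extra fields."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>\s*(Real|Fake)\s*</answer>", text)
    evidence = re.findall(
        r"<type>(.*?)</type>.*?<t>\[(.*?)\]</t>.*?<bbox>\[(.*?)\]</bbox>",
        think.group(1) if think else "", re.DOTALL)
    return {
        "verdict": answer.group(1) if answer else None,
        "evidence": [
            {"type": t.strip(),
             "t": [float(x) for x in ts.split(",")],
             "bbox": [float(x) for x in bb.split(",")]}
            for t, ts, bb in evidence
        ],
    }

sample = ("<think><type>Abnormal Object Appearance</type> in <t>[1.3, 2.0]</t> "
          "at <bbox>[0.4, 0.6, 0.5, 0.8]</bbox> ...</think> <answer>Fake</answer>")
print(parse_skyra_output(sample))
```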

The secret sauce:

  • Balanced real/fake CoT templates so the model learns both to find artifacts and to clear regions.
  • Asymmetric rewards to curb false alarms.
  • Check rewards to encourage grounded, auditable reasoning.
  • A fine-grained taxonomy that teaches the model ā€˜what exactly to look for’ and where.
  • A benchmark that removes shortcuts and forces true artifact-based detection.

04Experiments & Results

The test: The team measured accuracy (overall correctness), recall (how many fakes were caught), and F1 (a balance of precision and recall). They tested on ViF-Bench (modern, well-matched real/fake pairs) and on GenVideo (out-of-domain, older, lower-quality, near-static content). They also stressed the models with common degradations: compression, zoom, noise, and lighting/color shifts.
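For reference, the three reported metrics can be computed as follows, treating ā€œFakeā€ as the positive class; this is a plain-Python sketch rather than the authors' evaluation code.

```python
def detection_metrics(preds, labels, positive="Fake"):
    """Accuracy, recall, and F1 with 'Fake' as the positive class."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": acc, "recall": recall, "f1": f1}

print(detection_metrics(["Fake", "Real", "Fake", "Real"],
                        ["Fake", "Real", "Real", "Real"]))
```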

The competition: Skyra was compared to strong binary detectors (AIGVDet, DeMamba, NSG-VD), off-the-shelf MLLMs (Video-LLaMA-3, Qwen2.5-VL family, InternVL-3, GPT-4.1-mini, Gemini-2.5-flash), and the recent MLLM-based detector BusterX++.

The scoreboard (with context):

  • On ViF-Bench, Skyra-SFT already leaped ahead, and Skyra-RL pushed further, reaching around 97% accuracy and ~97% F1 on average across generators. That’s like getting an A+ when most others hovered around C to B- (many off-the-shelf MLLMs were near coin-flip; even the best binary baselines trailed by more than 25 percentage points in accuracy).
  • On especially tricky image-to-video (I2V) cases, RL boosted recall by about +3–4%, helping catch subtle forgeries.
  • On GenVideo (out-of-domain), Skyra-RL gained +11% accuracy over the best binary detector and a big recall boost over Skyra-SFT, showing better generalization. With just ~1 epoch of RL tuning on a small slice of GenVideo, Skyra-RL-GenVideo jumped even higher (over +19% accuracy and +42% recall gains versus earlier Skyra), showing quick adaptability without new human labels.
  • Robustness: Under compression, noise, zoom, and light/color shifts, Skyra maintained top accuracy and F1, whereas others dipped more. This indicates Skyra focuses on true artifacts rather than fragile cues.

Surprising findings:

  • Off-the-shelf MLLMs, even with careful step-by-step prompts, often described scenes instead of catching intrinsic forgery cues, yielding near-random performance.
  • Pure RL without the SFT cold start failed badly (sparse rewards; no ā€˜artifact sense’), confirming the two-stage design is necessary.
  • Symmetric penalties in RL made the model cry ā€œfakeā€ too often. Asymmetric penalties reduced false positives and improved overall accuracy.
  • Requiring grounded check blocks in the CoT rewarded healthy habits: inspect, point, and explain—improving both correctness and human trust.

05Discussion & Limitations

Limitations:

  • Coverage: Training/benchmark data, while diverse and modern, still can’t include every future style (e.g., ultra-long videos, stylized cartoons). Domain shifts may require quick adaptation.
  • Intent and harm: Skyra only says whether a video looks AI-generated and why. It doesn’t judge malicious intent or social impact.
  • Overconfidence: Natural-language explanations can sometimes overstate certainty or include minor hallucinations; human review remains essential.
  • Long-context reasoning: Sampling 16 frames at 256p may miss rare, brief artifacts in long clips.

Required resources:

  • A 7B-parameter MLLM backbone (Qwen2.5-VL-7B-Instruct), 8Ɨ NVIDIA H200 GPUs for training, and the curated ViF-CoT-4K dataset.
  • Inference typically needs only sampled frames and the standard prompt template.

When NOT to use:

  • Purely stylized animation or deliberately surreal content where ā€˜violations’ are the style.
  • High-stakes legal decisions without human oversight or corroborating evidence (e.g., provenance, watermarking).
  • Ultra-long videos where sampling may miss short-lived artifacts unless adapted.

Open questions:

  • How to scale to hours-long videos with reliable temporal coverage and efficient zoom-in strategies.
  • How to quantify uncertainty and calibrate confidence in explanations.
  • How to continually learn new artifact types as generators improve, perhaps via human-in-the-loop or self-curation.
  • How best to combine artifact reasoning with provenance signals and watermarks for layered defense.

06Conclusion & Future Work

Three-sentence summary: Skyra detects AI-generated videos by finding and grounding human-visible artifacts—naming the mistake, marking its time and place, and using that as evidence for Real or Fake. It learns this skill through a new dataset (ViF-CoT-4K) with a fine-grained artifact taxonomy and step-by-step reasoning, plus a two-stage training pipeline (SFT then RL) that rewards careful checking and prevents false alarms. On a fair, modern benchmark (ViF-Bench), Skyra far outperforms prior detectors and remains robust under common video degradations.

Main achievement: Turning video forgery detection from a black-box yes/no into grounded, checkable, human-auditable reasoning—at high accuracy across many generators.

Future directions: Extend to longer videos with adaptive zoom-and-check; add calibrated uncertainty; fuse artifact reasoning with provenance and watermarking; and keep expanding/updating the taxonomy as generators evolve. This approach can also inspire explainable detection across other media (audio, 3D, streaming).

Why remember this: Skyra shows that demanding evidence—what, where, and when—not only explains decisions but also makes detectors stronger and fairer. It’s a practical path to rebuild trust in video by letting humans see the exact clues behind the verdict.

Practical Applications

  • Newsroom fact-checking: rapidly vet viral clips by highlighting concrete artifacts for editors to review.
  • Platform moderation: auto-flag likely AI videos with grounded evidence for human moderator queues.
  • Education: teach students how to spot artifacts and understand physics/common-sense breaks in video.
  • Public safety alerts: verify suspicious incident footage before it spreads widely.
  • Media forensics: assist investigators with time-stamped, localized clues for case reports.
  • Creator tools: warn video editors when compositing introduces physics-violating errors.
  • Ad integrity: check product ads for unrealistic depictions (e.g., objects teleporting).
  • Watermark complement: combine artifact reasoning with provenance/watermark checks for layered defense.
  • Dataset auditing: screen training sets to remove synthetic clips mislabeled as real.
  • Compliance review: provide explainable evidence for decisions in regulated industries.
#AI-generated video detection #artifact reasoning #multimodal large language model #spatio-temporal grounding #reinforcement learning #chain-of-thought #explainable AI #video forensics #benchmarking #taxonomy of artifacts #GRPO #asymmetric reward #robustness #SFT #ViF-CoT-4K