HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
Key Summary
- HERBench is a new test that checks if video AI models can combine several clues spread across time, not just guess from one frame or language priors.
- Every question in HERBench needs at least three separate pieces of visual evidence, making single-cue shortcuts impossible on purpose.
- The authors introduce MRFS (Minimum Required Frame-Set), a number that tells you how many frames a model must fuse to answer correctly; HERBench averages 5.5, higher than other datasets.
- Across 13 strong Video-LLMs, accuracy is only 31–42%, barely above the 20% random-guess baseline, showing real trouble with multi-evidence reasoning.
- Two big bottlenecks were found: retrieving the right frames (retrieval deficit) and combining them properly even when retrieved (fusion deficit).
- HERBench has 26,806 multiple-choice questions across 12 tasks covering ordering, tracking, verification, and counting.
- A rigorous pipeline builds questions by separating appearance from behavior, splitting videos into shots, and filtering out items solvable by language alone.
- Even when given oracle (ground-truth) frames, models still often fail because they overweight one frame and underuse others.
- HERBench gives a clear, measurable target to improve Video-LLMs for real-world, long-horizon reasoning.
- This benchmark helps the community move beyond "look at one frame and guess" toward true compositional video understanding.
Why This Research Matters
Videos tell stories across time. To understand them, an AI must link multiple moments, not just guess from a single image or a language habit. HERBench forces and measures this kind of real reasoning, so we can trust models in practical tasks like safety checks, sports summaries, or robot planning. By exposing two core weaknesses (finding the right frames and fusing them), HERBench gives researchers a clear to-do list for building better video AIs. It also provides a fair way to compare progress across datasets using MRFS, a concrete measure of evidence demand. This helps move the field from "lucky frame" success to reliable, story-aware understanding.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how when you tell a friend about a long movie, you don't just show them one screenshot; you piece together lots of scenes to explain what really happened?
🥬 Filling (The Actual Concept):
- What it is: Video Question Answering (VideoQA) is when an AI watches a video and answers questions about it.
- How it works (simple):
- The AI looks at selected frames from a video.
- It reads a question like "Who left the room last?"
- It tries to find and combine the right clues across time to pick the best answer (a minimal sketch of this loop appears after this block).
- Why it matters: If the AI relies on a single snapshot or common language patterns (like "people often enter on the left"), it can get answers right for the wrong reason and then fail in real life when it must actually combine clues.
🍞 Bottom Bread (Anchor): Imagine asking "Which person stayed in the scene the longest?" You can't answer from one frame. You need to check when each person entered and left, then compare.
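To make that loop concrete, here is a minimal Python sketch of how a multiple-choice VideoQA query could be posed over a fixed frame budget. The `video_llm` callable, the sampler, and the prompt wording are illustrative placeholders, not HERBench's actual interface.

```python
from typing import Callable, List

def uniform_sample(num_frames_in_video: int, budget: int = 16) -> List[int]:
    """Pick `budget` frame indices spread roughly evenly across the video."""
    if num_frames_in_video <= budget:
        return list(range(num_frames_in_video))
    step = num_frames_in_video / budget
    return [int(i * step) for i in range(budget)]

def answer_multiple_choice(frames: List, question: str, options: List[str],
                           video_llm: Callable[[List, str], str]) -> str:
    """Build a multiple-choice prompt over the sampled frames and return the model's letter."""
    letters = "ABCDE"
    option_text = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    prompt = f"Question: {question}\n{option_text}\nAnswer with the letter of the best option."
    return video_llm(frames, prompt)  # hypothetical model call, e.g. returns "C"
```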
The World Before: For years, Video-LLMs (video versions of big AI models) were scoring high on popular benchmarks. But researchers noticed a problem: many questions could be answered from a single salient frame or from language priors ("in kitchens, people often open fridges") instead of real multi-step reasoning across time. That means models might look smart on paper but stumble on tasks that truly require linking moments, like verifying an event never happened, or ordering scenes.
🍞 Top Bread (Hook): Imagine you're a detective solving a case. One fingerprint isn't enough; you need multiple clues from different times and places.
🥬 Filling (The Actual Concept):
- What it is: Multi-Evidence Integration means combining several, non-overlapping clues from different times to answer a question.
- How it works:
- Find each needed clue (e.g., Person A enters; later, Person B leaves; later, a door closes).
- Keep track of who is who across the timeline.
- Fuse the clues to reach the correct answer.
- Why it matters: Without integrating multiple clues, the AI might guess from a single frame and miss the true story.
🍞 Bottom Bread (Anchor): To answer "Who left after the person in the red hat?" the AI must (a) find the red-hat person, (b) follow them, (c) compare exit times with others.
Failed Attempts: Benchmarks got longer (some with minute-to-hour videos), but "long" did not equal "many clues required." Models could still cherry-pick one helpful frame. Researchers suspected we needed tests where multiple clues are unavoidable and measurable.
🍞 Top Bread (Hook): Imagine a teacher who grades not just the final answer but how many parts of the textbook you used to solve the problem.
🥬 Filling (The Actual Concept):
- What it is: Evidential Requirement (ER) is how many distinct, non-redundant pieces of evidence a question truly needs.
- How it works:
- Design questions that require at least three separate clues from different times.
- Make distractors look plausible, so guessing from priors doesnât work.
- Confirm that no single frame is enough.
- Why it matters: If ER is high, success must come from real reasoning, not shortcuts.
🍞 Bottom Bread (Anchor): A shot-ordering question listing four scenes forces the model to find each scene and place them in order; no single frame solves it.
The Gap: We needed a benchmark where multi-clue reasoning is both (a) required and (b) quantifiable. That's where MRFS comes in.
🍞 Top Bread (Hook): Picture a treasure hunt where you must collect a minimum number of clues before you can unlock the chest.
🥬 Filling (The Actual Concept):
- What it is: Minimum Required Frame-Set (MRFS) is the smallest number of frames the model must fuse to get the answer right.
- How it works:
- Rank frames likely to matter for the question.
- Feed the model the top 1, then top 2, etc., until it answers correctly.
- The smallest k that works is the MRFS (a code sketch of this loop appears after this block).
- Why it matters: MRFS turns "how much evidence is needed?" into a concrete number, showing whether a benchmark truly demands multi-frame reasoning.
🍞 Bottom Bread (Anchor): If a model only answers correctly after seeing 6 key frames, then MRFS=6 for that question, proving it can't be solved with a single snapshot.
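Written as code, the MRFS procedure is a short loop. This is a sketch of the idea rather than the authors' implementation: `ranked_frames` is assumed to be the frames already sorted by relevance to the question, and `model_is_correct` is a hypothetical callable that runs the Video-LLM on a frame subset and checks its answer against the ground truth.

```python
from typing import Callable, List, Optional

def minimum_required_frame_set(ranked_frames: List,
                               model_is_correct: Callable[[List], bool],
                               budget: int = 16) -> Optional[int]:
    """Return the smallest k such that the model answers correctly from the top-k frames."""
    for k in range(1, min(budget, len(ranked_frames)) + 1):
        if model_is_correct(ranked_frames[:k]):
            return k      # this question's MRFS
    return None           # not solved within the frame budget
```

Averaging this per-question value over the whole dataset gives the mean MRFS that HERBench reports (about 5.5).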
Real Stakes: In the real world (sports analytics, safety audits, robotics, education), systems must track who did what, when, and with whom, often verifying something didn't happen. HERBench makes such multi-evidence skills unavoidable and measurable so we build AIs that can truly follow a story, not just glance at a picture and guess.
02 Core Idea
🍞 Top Bread (Hook): Imagine building a puzzle where you can't finish unless you place several key pieces together; no single piece shows the whole picture.
🥬 Filling (The Actual Concept):
- What it is: The core idea of HERBench is to make multi-clue reasoning in videos both unavoidable (by design) and measurable (with MRFS).
- How it works:
- Build questions that require at least three distinct, non-overlapping visual clues spread across time.
- Remove text-only shortcuts by filtering out questions answerable without video.
- Use MRFS to quantify how many frames must be fused to get the answer correct.
- Why it matters: This reveals whether models truly integrate evidence or rely on lucky frames and language habits.
🍞 Bottom Bread (Anchor): A counting question like "How many times did someone close the tap?" forces scanning the whole video and summing multiple moments; one frame won't do.
The "Aha!" Moment (one sentence): If we control and measure how much cross-time evidence is required, we can cleanly separate true multi-evidence reasoning from single-cue guessing.
Multiple Analogies:
- Mystery Novel: You can't solve the crime from one paragraph; you must connect clues from Chapter 1, 7, and 12.
- Cooking Recipe: Success needs several steps (mix, bake, cool), not just one action; missing a step ruins the dish.
- History Timeline: To understand a war's outcome, you must piece together causes, battles, and treaties across years.
Before vs After:
- Before: Video benchmarks often allowed single-frame shortcuts; high scores could hide shallow reasoning.
- After: HERBench enforces multi-clue questions and reports MRFS, exposing whether models really fuse dispersed evidence.
🍞 Top Bread (Hook): You know how teachers give multi-part problems to ensure you understand the whole idea, not just one trick?
🥬 Filling (The Actual Concept):
- What it is: Task Taxonomy in HERBench organizes 12 tasks into 4 families that each stress different kinds of multi-evidence reasoning.
- How it works:
- Temporal Reasoning & Chronology (order and durations),
- Referring & Tracking (follow a described person over time),
- Global Consistency & Verification (verify presence/absence across the video),
- Multi-Entity Aggregation & Numeracy (who appears, how many times, where).
- Why it matters: By spreading evidence across time and people, the tasks require k ≥ 3 clues, blocking single-frame answers.
🍞 Bottom Bread (Anchor): A shot-ordering task asks you to place four scene descriptions in the correct timeline; each description is a separate clue you must locate and order.
Why It Works (intuition, not equations):
- Forcing at least three separated clues defeats "lucky" frames and language priors.
- Disentangling appearance (who) from behavior (what/when) stops description-matching tricks.
- MRFS quantifies the minimum visual proof needed, making progress measurable.
- Removing text-only solvable items ensures vision is necessary, not optional.
Building Blocks:
- High-ER question design: Each item structurally needs ≥ 3 distinct moments.
- A/B cards: Separate appearance (A-card) from behavior/trajectory (B-card) to require cross-time binding.
- Shot segmentation: Turn long videos into scene cards, enabling ordering and verification tasks.
- Text-only filtering: Drop any item that blind LLMs can answer without video.
- MRFS protocol: Standardize model, selector, and frame budget to make fair cross-benchmark comparisons.
🍞 Top Bread (Hook): Think of a teacher who won't let you pass by guessing one multiple-choice letter; they check that you actually used the needed pages in the book.
🥬 Filling (The Actual Concept):
- What it is: MRFS (Minimum Required Frame-Set) is the score for "how many frames must be fused."
- How it works:
- Rank frames by relevance to the question.
- Feed frames one by one in that order.
- Stop when the model becomes correct; the count is MRFS.
- Why it matters: If a dataset's average MRFS is high (HERBench: 5.5), it proves single-frame shortcuts won't work.
🍞 Bottom Bread (Anchor): On HERBench, a model often needs about five or six frames combined to answer right, which is like needing several puzzle pieces before seeing the picture.
03 Methodology
At a high level: Video + Question → Build evidence scaffold (tracks, shots, ground truth) → Compose 12 high-ER tasks → Filter text-only shortcuts → Evaluate with MRFS and accuracy.
Step 1: Object Tracking & Trajectory Analysis (Appearance/Behavior disentanglement)
- What happens: The video is processed to track people over time (RF-DETR + DeepSORT). For each person, the pipeline creates two separate descriptions: an A-card (appearance: clothing, colors, accessories) from start/end frames and a B-card (behavior/trajectory: where they go, who they meet) from middle frames.
- Why this step exists: It forces the model to bind "who" (from A-card) with "what/where/when" (from B-card) across time. Without it, a model might cheat by matching text phrases instead of tracking the person.
- Example: A-card says "Person with a red jacket and glasses." B-card later mentions "This person exits through the right edge after passing a seated man." To answer, you must connect the described person to their later path (a rough sketch of this card construction follows below).
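A rough sketch of the A-card/B-card split for one tracked person, assuming the detector and tracker (RF-DETR and DeepSORT in the paper) have already produced that person's frame indices; `describe_appearance` and `describe_behavior` are hypothetical captioning stand-ins, not the paper's code.

```python
from typing import Callable, List, Tuple

def build_cards(track_frame_ids: List[int],
                describe_appearance: Callable[[List[int]], str],
                describe_behavior: Callable[[List[int]], str]) -> Tuple[str, str]:
    """Split one person's track into an appearance card (A) and a behavior card (B)."""
    frames = sorted(track_frame_ids)
    a_frames = [frames[0], frames[-1]]      # start/end frames describe who the person is
    b_frames = frames[1:-1]                 # middle frames describe where they go and whom they meet
    a_card = describe_appearance(a_frames)  # e.g. "person with a red jacket and glasses"
    b_card = describe_behavior(b_frames)    # e.g. "exits right after passing a seated man"
    return a_card, b_card
```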
Step 2: Shot Segmentation (Scene cards for global reasoning)
- What happens: The video is split into shots and each shot is summarized into a concise scene card. Some cards are then perturbed (small but plausible changes) to create tricky negatives.
- Why this step exists: It provides macro-level building blocks for ordering and verification tasks. Without it, models might rely on a single flashy frame rather than reasoning about whole scenes.
- Example: Four scene cards describe different rooms or actions. The task asks for their correct chronological order.
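As an illustration of how such an ordering item could be assembled (an assumption about the pipeline, not its actual code): shuffle the chronologically ordered scene cards for presentation and keep the true order as the answer key.

```python
import random
from typing import Dict, List

def make_ordering_item(scene_cards: List[str], seed: int = 0) -> Dict:
    """Build a Temporal Shot Ordering question from chronologically ordered scene cards."""
    rng = random.Random(seed)
    indexed = list(enumerate(scene_cards))  # (true_position, card)
    rng.shuffle(indexed)
    options = [card for _, card in indexed]
    # Correct answer: option positions listed back in true chronological order.
    answer = sorted(range(len(indexed)), key=lambda i: indexed[i][0])
    return {"question": "Arrange the following scenes in chronological order.",
            "options": options,
            "answer": answer}
```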
Step 3: Ground-Truth Integration (Human-verified events)
- What happens: Human-verified event logs supply trusted timelines and counts (e.g., how many times an action happened).
- Why this step exists: It anchors counting, sequence integrity, and absence verification. Without this, answers could be ambiguous.
- Example: "How many times did 'close tap' occur?" Ground-truth says 5; distractors are plausible (e.g., 3, 4, 6) but wrong.
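A tiny sketch of how plausible numeric distractors could be generated around the human-verified count; the benchmark's actual distractor strategy may differ.

```python
from typing import List

def counting_options(true_count: int) -> List[int]:
    """Return the correct count plus nearby plausible distractors, e.g. 5 -> [3, 4, 5, 6]."""
    candidates = {true_count - 2, true_count - 1, true_count, true_count + 1}
    return sorted(c for c in candidates if c >= 0)
```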
Step 4: Oriented Task Programming (Compose 12 tasks with k ≥ 3 evidence)
- What happens: The pipeline programmatically assembles questions across four families: Temporal Reasoning & Chronology, Referring & Tracking, Global Consistency & Verification, Multi-Entity Aggregation & Numeracy.
- Why this step exists: It ensures every question needs at least three non-overlapping cues. Without this rule, a single frame could sneak through.
- Example: Multi-Person Duration Reasoning compares who stayed longest; you must check multiple intervals and compare.
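To see why such an item needs k ≥ 3 clues, here is an illustrative sketch with made-up presence intervals: answering "who stayed longest" requires entry and exit evidence for every person before any comparison is possible.

```python
from typing import Dict, Tuple

def longest_stayer(presence: Dict[str, Tuple[float, float]]) -> str:
    """Return the person whose (entry_time, exit_time) interval is longest."""
    return max(presence, key=lambda person: presence[person][1] - presence[person][0])

# Each interval endpoint is a separate piece of visual evidence the model must retrieve.
print(longest_stayer({"red-hat": (3.0, 40.0), "blue-coat": (10.0, 55.0)}))  # blue-coat
```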
Step 5: Text-Only Filtering (Remove language priors)
- What happens: If 3 of 4 blind LLMs can answer a question without seeing any frames, the question is discarded.
- Why this step exists: It guarantees that visual evidence is necessary. Without it, models might exploit common-sense guesses.
- Example: A question answerable just from wording gets removed before it reaches the final set.
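A minimal sketch of this filter, assuming the four blind LLMs' answers for an item have already been collected; the "at most 2 of 4" threshold follows the rule stated above.

```python
from typing import List

def keep_question(blind_answers: List[str], correct: str, max_blind_correct: int = 2) -> bool:
    """Keep an item only if fewer than 3 of the 4 blind LLMs answer it correctly without frames."""
    n_correct = sum(ans == correct for ans in blind_answers)
    return n_correct <= max_blind_correct

# Example: three of four blind models guess "B" correctly, so the item is discarded.
print(keep_question(["B", "B", "A", "B"], correct="B"))  # False
```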
Step 6: Human Verification (Quality control)
- What happens: Experts sample-check items to confirm: (a) at least three frames are needed, (b) there's a unique, objective ground truth, and (c) no leakage between A- and B-cards.
- Why this step exists: It keeps the dataset honest and hard in the right way. Without it, some items might be solvable with one frame or be ambiguous.
- Example: If reviewers find a question solvable by a single snapshot, itâs rejected.
Step 7: Evaluation with MRFS + Accuracy
- What happens: Standardize the evaluator (e.g., Qwen2.5-VL), selector (AKS), and frame budget (16). Compute MRFS by feeding the top-1, top-2, ... frames until correct, and report accuracy.
- Why this step exists: It makes benchmarks comparable and the required evidence measurable. Without standardization, results would be apples-to-oranges.
- Example: HERBench has a mean MRFS of 5.49, much higher than other benchmarks (2.61–4.07), showing higher evidential demand (a sketch of this protocol follows below).
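A rough sketch of how the standardized numbers could be aggregated over a dataset: accuracy under the fixed 16-frame budget plus the mean of per-question MRFS values (as sketched earlier). `mrfs_of` and `is_correct_at_budget` are hypothetical callables that wrap the fixed evaluator, selector, and budget.

```python
from statistics import mean
from typing import Callable, Iterable, Optional

def benchmark_stats(questions: Iterable,
                    mrfs_of: Callable[[object], Optional[int]],
                    is_correct_at_budget: Callable[[object], bool]) -> dict:
    """Report accuracy at the fixed frame budget and mean MRFS over the items solved within it."""
    items = list(questions)
    accuracy = mean(is_correct_at_budget(q) for q in items)
    solved = [m for m in (mrfs_of(q) for q in items) if m is not None]
    return {"accuracy": accuracy, "mean_mrfs": mean(solved) if solved else None}
```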
The 12 Task Families in Action:
- Temporal Reasoning & Chronology (e.g., Temporal Shot Ordering, Multi-Person Duration Reasoning, Action Sequence Integrity): Require gathering multiple clues about when things happen and in what order.
- Referring & Tracking (e.g., Appearance-Grounded Attribute Recognition, Behavior Interactions, Localization Trajectory): Require binding a described person across time to read off context, partners, and path.
- Global Consistency & Verification (e.g., False Action/Object Memory, Scene Verification & Arrangement): Require confirming what did and did not occur, and arranging true scenes.
- Multi-Entity Aggregation & Numeracy (e.g., Multi-Entities Grounding & Localization, Action Counting, Region-Localized People Counting): Require deduplicating people and summing events, often with spatial constraints.
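For reference, the same grouping written as a simple lookup table; the task names follow the examples listed above (marked "e.g." in the text), so treat the grouping as illustrative rather than the benchmark's official enumeration.

```python
# Families and example tasks as named in the list above (illustrative grouping).
TASK_FAMILIES = {
    "Temporal Reasoning & Chronology": [
        "Temporal Shot Ordering", "Multi-Person Duration Reasoning", "Action Sequence Integrity"],
    "Referring & Tracking": [
        "Appearance-Grounded Attribute Recognition", "Behavior Interactions", "Localization Trajectory"],
    "Global Consistency & Verification": [
        "False Action/Object Memory", "Scene Verification & Arrangement"],
    "Multi-Entity Aggregation & Numeracy": [
        "Multi-Entities Grounding & Localization", "Action Counting", "Region-Localized People Counting"],
}
```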
The Secret Sauce (what makes it clever):
- Disentangled A/B cards force identity binding across time.
- Shot-level cards and perturbed variants test fine-grained verification beyond gist.
- Text-only filtering removes language shortcuts.
- MRFS transforms "needs many clues" from a vibe into a number.
- Standardization separates dataset difficulty from model/selector quirks.
- Retrieval vs Fusion disentanglement: By testing different frame selectors and even oracle frames, HERBench pinpoints whether failure comes from not finding clues (retrieval) or not combining them (fusion).
04 Experiments & Results
The Test: Researchers evaluated 13 state-of-the-art Video-LLMs on HERBench with a fixed 16-frame budget (uniform sampling) to focus on multi-evidence integration. They also compared frame selection methods (Uniform, Vanilla-BLIP, BOLT-ITS, AKS) and an oracle setup (ground-truth key frames) on subsets.
The Competition: Models included closed-source systems (e.g., GPT-4.1, Gemini-2.5-Flash) and open-source models (e.g., Ovis-2.5-9B, InternVL3.5-14B, Qwen2.5-VL variants, LLaVA-OneVision, etc.). HERBench was contrasted with earlier datasets like NExT-QA, MVBench, and LongVideoBench using the MRFS lens.
The Scoreboard (with context):
- Overall accuracy across 13 models averaged 38.2% (random guess is 20%).
- Best model (Ovis-2.5-9B) reached 42.1%; lowest (LLaMA-4-Scout-17B) scored 31.4%.
- This is like most students scoring barely above chance on a tough multi-step exam, even though they did great on easier quizzes.
Task-level patterns:
- Stronger on simpler single-entity tracking/attribute tasks (e.g., AGBI, AGAR), often >70% for top models.
- Much weaker on high-integration tasks: Action Counting (~23% mean), Multi-Entities Grounding & Localization (~23% mean), and Temporal Shot Ordering (sometimes near 0% for certain models), showing clear difficulty with aggregating many clues.
Retrieval vs Fusion (the two bottlenecks):
- Frame Selection (Retrieval): Learned selectors (AKS, BOLT-ITS) sometimes beat Uniform but still lag behind Oracle Frames (manually curated evidence). So models often don't even get all the right clues.
- Fusion: Even with Oracle Frames, accuracies typically stayed below 50%. A leave-one-out analysis showed errors often came from over-weighting a single frame (Top-1 share ≈ 0.8 when wrong) rather than balancing multiple frames (≈ 0.5 when correct). This means models struggle to integrate clues even when those clues are handed to them (a rough sketch of this analysis appears below).
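One plausible way to operationalize that Top-1 share with a leave-one-out probe is sketched below; the paper's exact definition may differ. `score_with_frames` is a hypothetical callable returning the model's confidence in the ground-truth option for a given frame subset.

```python
from typing import Callable, List

def top1_share(frames: List, score_with_frames: Callable[[List], float]) -> float:
    """Fraction of total leave-one-out importance carried by the single most influential frame."""
    full_score = score_with_frames(frames)
    drops = []
    for i in range(len(frames)):
        without_i = frames[:i] + frames[i + 1:]          # remove one frame at a time
        drops.append(max(full_score - score_with_frames(without_i), 0.0))
    total = sum(drops)
    return max(drops) / total if total > 0 else 0.0
```

Under this reading, a value near 1.0 means almost all of the answer's support comes from one frame, while a lower value means the frames contribute more evenly, matching the reported contrast between wrong (≈ 0.8) and correct (≈ 0.5) answers.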
Surprising Findings:
- Giving models the exact right frames helped, but not enough; fusion remained a major failure point.
- Some tasks with sparse evidence (e.g., SVA, TSO) benefited more from better retrieval; but counting and multi-entity tasks still lagged, highlighting fusion as a core weakness.
MRFS across benchmarks:
- With standardized evaluation, HERBench's mean MRFS is 5.49 vs. NExT-QA 2.61, MVBench 3.52, LongVideoBench 4.07.
- As MRFS goes up, accuracy goes down: evidence that current models are not yet good at multi-clue fusion.
Bottom line: Today's Video-LLMs can track one person fairly well, but when the story needs several pieces merged across time, they stumble. HERBench makes that clear, measurable, and hard to ignore.
05 Discussion & Limitations
Limitations:
- HERBench focuses on multi-evidence visual reasoning in a multiple-choice format; it doesn't cover all video skills (e.g., open-ended generation, fine-grained physics simulation, or audio reasoning).
- MRFS is defined relative to a particular evaluator, selector, and frame budget; different settings may shift MRFS a bit, though the standardized protocol helps comparability.
- Multiple-choice can simplify some answer spaces; future work could extend HERBench principles to free-form outputs.
Required Resources:
- Running strong Video-LLMs on long videos with 16-frame contexts requires decent GPUs and time.
- Building models that improve both retrieval (better frame selection) and fusion (better multi-clue aggregation) may need architecture changes and training data emphasizing compositionality.
When NOT to Use:
- If your goal is single-shot attribute recognition or very short clips where one frame truly suffices, HERBench is overkill.
- If you need audio-heavy tasks (e.g., speech understanding) or purely text reasoning, other benchmarks fit better.
Open Questions:
- How to design fusion modules that distribute importance across all required frames rather than over-concentrating on one?
- What training curricula best build multi-evidence skills (e.g., synthetic multi-hop chains, supervised attention over evidence)?
- Can retrieval be jointly learned with fusion so that selection anticipates reasoning needs (not just relevance)?
- How to extend MRFS-like measures to dense video tokens (clips), audio, and interactive tasks?
- Can we create data-efficient methods that learn robust multi-clue reasoning without massive compute budgets?
06 Conclusion & Future Work
Three-Sentence Summary:
- HERBench is a new VideoQA benchmark that forces models to combine at least three separate visual clues spread across time, and it measures this demand with MRFS.
- Across 13 leading models, accuracy hovers only slightly above chance, revealing two core weaknesses: missing key frames (retrieval) and failing to combine them (fusion) even when retrieved.
- By making cross-time evidence unavoidable and quantifiable, HERBench provides a clean target for improving genuine, compositional video understanding.
Main Achievement:
- Turning âneeds multi-evidenceâ from a vague idea into a concrete, enforced design with a measurable number (MRFS), so we can tell if a model truly reasons across time.
Future Directions:
- Architectures and training strategies that explicitly balance attention across multiple frames, not just the most salient one.
- Smarter frame selectors that cover all necessary evidence and coordinate with fusion modules.
- Extending HERBench principles to free-form answers, audio-visual tasks, and interactive video agents.
Why Remember This:
- HERBench changes the goalpost from "did the AI get it right?" to "did the AI use enough evidence to be right?"
- That shift is essential for building AIs that can follow real stories in real videos, where answers come from many moments, not one lucky snapshot.
Practical Applications
- Audit smart cameras to verify that required safety steps happened in the right order (e.g., put on helmet, then enter site).
- Sports analysis that counts actions (shots, passes) and orders key plays to build accurate game summaries.
- Body-cam and dash-cam review that confirms or disproves claimed events by scanning the whole timeline.
- Home robotics that track who/what moved where and when to complete chores in the right sequence.
- Video search engines that answer multi-part queries like "Show all times person A met person B before sunset."
- Manufacturing QA that verifies action sequences (assemble → test → label) and counts occurrences.
- Education tools that quiz students on event order and evidence across documentary videos.
- Retail analytics that count region-specific entries/exits and verify co-occurring events.
- Content moderation that checks multi-step behaviors (e.g., identify, pursue, interact) across long videos.
- Healthcare or eldercare monitoring that verifies multi-stage routines (medication taken, water glass filled, device turned off).