RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval
Key Summary
- RANKVIDEO is a video-native reasoning reranker that helps search engines find the right videos for a text query by directly looking at the video's visuals and audio, not just text captions.
- It trains in two steps: first to see and describe what's in videos (perception), then to decide which videos best match a query (reranking).
- The model scores each query-video pair using a simple yes-vs-no signal (a logit margin), which is fast and stable without generating long reasoning texts.
- Its training blends three ideas: pointwise calibration (is this relevant?), pairwise ranking (is this one better than that one?), and teacher distillation (learn confidence from a larger model).
- A data synthesis pipeline creates tough, reasoning-heavy queries by mixing captions, speech transcripts, on-screen text (OCR), and metadata, then filters them for quality.
- On the large MULTIVENT 2.0 benchmark (about 110k videos), RANKVIDEO lifts nDCG@10 by an average of 31% across first-stage retrievers, beating text-only and vision-language baselines.
- It adapts how much it "thinks": it reasons more only when the query is hard, which saves time compared to always generating long explanations.
- It consistently improves weak and strong first-stage retrievers alike, with the biggest gains on weaker ones.
- In multimodal RAG for WIKIVIDEO, it boosts claim coverage and article factuality, showing downstream benefits beyond standard retrieval metrics.
Why This Research Matters
Video libraries are massive, and people expect the right clip to appear in the first few results. RANKVIDEO makes that happen more reliably by actually looking at the visuals and sounds, not just text. This helps students learn faster, journalists verify facts more accurately, and creators find reference footage without wasting time. Because it improves even weaker first-stage search engines, it can cut infrastructure costs by letting you use a fast, simple first stage and depend on RANKVIDEO to clean up the top results. Its efficiency (no long reasoning texts at inference) makes it practical, and its gains in multimodal RAG show better retrieval leads to better, more factual generation. In short, it turns huge, messy video piles into useful answers people can trust.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how searching for the right clip in a giant video library can feel like finding a needle in a haystack? Even if you type a great query, you still need a smart helper to put the best videos at the top.
The Concept: Information Retrieval (IR)
- What it is: IR is the science of finding the most relevant items (like videos) for your question.
- How it works:
- You ask a question (query).
- The system looks through a big collection (index).
- It ranks items from most to least useful.
- Why it matters: Without ranking, the best answers might get buried on page 100. Anchor: When you search "how to tie a tie," IR decides which video shows the clearest steps and puts it first.
The Concept: Video Retrieval
- What it is: Video retrieval is IR focused on matching your text question to the right video.
- How it works:
- Turn the query into a vector (a numeric summary).
- Turn videos into vectors using frames, audio, and other signals.
- Compare vectors to find the closest matches and rank them.
- Why it matters: Without matching text to visuals/audio, you miss great videos that don't have perfect captions. Anchor: If you ask "footage of SpaceX rocket landing," a good system finds videos where a rocket actually lands, even without the exact words in the title.
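A minimal sketch of the vector-comparison step above, using NumPy. The random arrays stand in for real query and video embeddings; the encoder calls named in the comments are hypothetical placeholders, not a specific library's API.

```python
import numpy as np

def rank_by_cosine(query_vec, video_vecs):
    # Normalize, then compare: higher cosine similarity means a closer match.
    q = query_vec / np.linalg.norm(query_vec)
    v = video_vecs / np.linalg.norm(video_vecs, axis=1, keepdims=True)
    scores = v @ q
    return np.argsort(-scores), scores          # best-matching videos first

# Stand-ins for real embeddings (e.g., embed_text(query), embed_video(v)).
query_vec = np.random.rand(512)
video_vecs = np.random.rand(1000, 512)
order, scores = rank_by_cosine(query_vec, video_vecs)
print(order[:10])                               # indices of the top-10 candidates
```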
The Concept: Multimodal Representation
- What it is: A multimodal representation mixes different types of information (images, sounds, and text) into one understanding.
- How it works:
- Extract features from frames (vision), audio (speech/sounds), and on-screen text (OCR).
- Fuse these features so the model can see how they relate.
- Use the fused features to compare with the query.
- Why it matters: If you only use text, you miss silent visuals or important sounds not written down. Anchor: A marching band video might have no useful title, but the visuals (instruments) and sounds (music) still prove it's relevant.
The Concept: Reranking (Two-Stage Retrieval)
- What it is: Reranking is a second pass that refines an initial list of candidates from a fast first-stage search.
- How it works:
- A speedy first-stage retriever gathers the top ~1000 likely videos.
- A smarter, slower reranker closely inspects these candidates.
- It reshuffles them to put the truly best results at the top.
- Why it matters: Without a reranker, many "almost right" videos outrank the perfect one. Anchor: First, you skim book covers (fast). Next, you read key pages of the best few to pick the winner (reranking).
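A minimal sketch of the two-stage pattern, assuming a hypothetical `index.search` method and `reranker.score` function; the depths (1,000 candidates from the first stage, top 100 reranked) mirror the setup described in the Methodology section.

```python
def two_stage_retrieval(query, index, reranker, k_first=1000, k_rerank=100):
    # Stage 1: fast, approximate search over the whole collection.
    candidates = index.search(query, top_k=k_first)        # hypothetical index API

    # Stage 2: slow, careful scoring of only the head of the list.
    head, tail = candidates[:k_rerank], candidates[k_rerank:]
    head = sorted(head, key=lambda video: reranker.score(query, video), reverse=True)

    # Reranked head followed by the untouched tail.
    return head + tail
```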
The Concept: nDCG (Normalized Discounted Cumulative Gain)
- What it is: nDCG measures how well the top of the list is ordered, rewarding correct items placed higher.
- How it works:
- Give points for relevant results.
- Discount points more for lower ranks.
- Normalize so scores range from 0 to 1 for fair comparison.
- Why it matters: A system that finds the right video but puts it at rank 50 is less helpful than one that puts it at rank 2. Anchor: If your favorite song is ranked #2, that gets a high nDCG; if it's #70, not so much.
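A small sketch of nDCG@K under binary relevance (1 = relevant, 0 = not), which is enough to see how rank position drives the score; graded relevance works the same way with larger gain values.

```python
import math

def dcg_at_k(relevances, k):
    # Sum gains, each discounted by log2 of (rank + 1), with ranks starting at 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# The same relevant video at rank 2 vs. rank 7: nDCG rewards the earlier placement.
print(ndcg_at_k([0, 1, 0, 0, 0, 0, 0, 0, 0, 0], k=10))  # ~0.63
print(ndcg_at_k([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], k=10))  # ~0.33
```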
The World Before: As video libraries exploded (education, entertainment, social media), teams built fast first-stage retrievers (like CLIP or OmniEmbed) to scan huge collections. But the second step, video-native reranking, lagged behind. Rerankers from text search were adapted by turning videos into text (captions, transcripts), then using big text reasoning models to judge relevance. This helped sometimes but missed crucial signals that were only in the raw visuals or audio, and creating those transcripts/captions was slow and expensive.
The Problem: We needed a reranker that can reason directly over video (not just text) and scale to large collections. It must be good at telling apart near-miss "hard negatives" in the top 1000: clips that look or sound very similar to the right answer but aren't.
Failed Attempts:
- Text-only rerankers (using captions and transcripts) often miss silent visuals (like a logo) or non-speech audio cues (like sirens).
- General-purpose vision-language models (VLMs) in zero-shot settings frequently overclaimed relevance (low precision), pushing wrong videos to the top.
- Models trained on small, contrived datasets (caption-to-video) didn't generalize to the real-world, event-centric queries in large benchmarks.
The Gap: What was missing is a video-native reasoning reranker that:
- Reads frames and audio directly (multimodal),
- Learns to see first (perception), then judge (reason),
- Calibrates confidence and separates positives from hard negatives,
- Scores fast without long reasoning text at inference.
Real Stakes: Better video search improves learning ("show me the 2019 Cairo International Film Festival red carpet"), news verification (find clips that truly support a claim), and creative work (quickly assemble accurate reference footage). It saves time, avoids misinformation, and lets even lightweight first-stage search be viable if a strong reranker cleans up the top of the list.
02 Core Idea
Hook: Imagine a talent show with hundreds of acts. A quick screener picks the top 100, then a sharp judge watches closely and decides the final order. The judge doesn't write an essay each time; they just say "yes" or "no" and keep moving.
The Concept: Reasoning Reranking (video-native)
- What it is: A reranker that directly watches the video (frames and audio) and decides how well it matches your text query, using reasoning only as needed.
- How it works:
- Take a query and a candidate video.
- Ask the model a structured yes-or-no question: "Is this video relevant?"
- Compute a simple relevance score: the model's logit for "yes" minus its logit for "no."
- Rank candidates by this score; no long explanations required.
- Why it matters: Without video-native reasoning, rerankers can be blind to visual/audio clues or too slow from generating long chains of thought. Anchor: When you ask "Korean Blockchain Week 2022," the model looks for visuals like event banners, stages, and language cues in the video, not just the title, to decide if it's a match.
The "Aha!" Moment (one sentence): Teach the model to first see the video well (perception), then rerank with a fast, stable yes-vs-no margin score, trained to separate positives from hard negatives using combined pointwise, pairwise, and teacher-distilled signals.
Multiple Analogies:
- Librarian: First, learn to summarize books (perception). Then, when someone asks a question, say yes/no for each candidate and order them by confidence (reranking margin).
- Detective: First, learn to notice clues (objects, text on screen, sounds). Then, judge suspects (videos) with a clear thumbs-up/down signal, only building a long case when the clues are tricky.
- Sports Judge: Instead of writing paragraphs after every routine, a judge gives a clean score difference that clearly separates the winners.
Before vs After:
- Before: Rerankers leaned on text (captions/transcripts), which could miss visual/audio info and add preprocessing delays. Some vision-language models over-claimed relevance.
- After: A video-native reranker grounds itself in frames and audio, produces a quick yes-vs-no margin, and uses smarter training (pairwise + pointwise + distillation) to push true matches ahead of hard negatives.
Why It Works (intuition):
- Perception-first avoids guessing: learning to caption videos grounds the model in whatâs actually there.
- Logit-margin scoring is like a calibrated confidence gap: big positive margins mean strong matches, big negatives mean clear mismatches.
- Pairwise training teaches the model to separate lookalikes, which is exactly the hard part of reranking.
- Pointwise calibration keeps decisions steady across mixed data.
- Teacher distillation passes down smooth, well-calibrated confidence, reducing over/under-confidence.
Building Blocks (each as a mini concept):
The Concept: Perception-Grounded SFT
- What it is: First train the model to generate accurate captions for videos so it learns to see.
- How it works: Feed a video and train it to produce teacher captions with key objects, actions, and context.
- Why it matters: Without strong perception, the model can't make reliable relevance judgments. Anchor: The model learns to say "A SpaceX rocket lifts off at night" before it later decides if that matches "Sixth SpaceX operational mission."
The Concept: Hard Negative Mining
- What it is: Choosing tricky non-relevant videos that look or sound similar to the right answer.
- How it works: A teacher model flags trusted negatives (clearly wrong), suspected positives (likely mislabeled), and hard negatives (ambiguous). Keep the first and third; drop suspected positives.
- Why it matters: Without hard negatives, the model aces easy cases but fails at the exact edge cases that break rankings. Anchor: For "Notre-Dame fire emergency response," another cathedral fire video might look close; training against that hardness is vital.
The Concept: Pairwise Loss
- What it is: A training signal that pushes the correct video above the incorrect ones within the same query group.
- How it works: Compare the positiveâs score to negatives via a softmax over the group and increase the positiveâs chance to be on top.
- Why it matters: Without pairwise pressure, the model may not learn to separate near-misses. Anchor: Teach the judge to rank the true champion above very similar runners-up.
The Concept: Pointwise Calibration
- What it is: A signal to predict yes/no correctly and keep scores well-behaved across queries.
- How it works: Train with softened labels (e.g., negatives at 0.1) and weights to handle noise and class imbalance.
- Why it matters: Without calibration, scores can drift, making rankings unstable. Anchor: Like grading with partial credit to prevent over-penalizing borderline answers.
The Concept: Teacher Confidence Distillation
- What it is: Learn from a larger teacher's probability (confidence), not just a binary label.
- How it works: Match your score distribution to the teacherâs soft targets.
- Why it matters: Without this, the student can be overconfident on shaky cases. Anchor: A junior chef learns not only the recipe but also when a dish is "almost" right.
The Concept: Logit-Margin Scoring
- What it is: Score = logit(yes) - logit(no) for the final decision token.
- How it works: Ask the model to answer <answer>yes/no</answer>, then take the difference of its internal yes/no strengths.
- Why it matters: Without generating long thoughts, you get a fast, monotonic, and robust ranking signal. Anchor: Think of it as the gap between two team scores; bigger gaps are clearer wins.
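A hedged sketch of logit-margin scoring with PyTorch and a Hugging Face-style tokenizer; the prompt handling, the assumption that "yes" and "no" are single tokens, and the answer format are simplifications, not RANKVIDEO's exact implementation.

```python
import torch

def yes_no_margin(model, tokenizer, inputs):
    """Relevance score = logit('yes') - logit('no') at the answer position.

    `inputs` is assumed to already encode the query, the sampled video frames,
    and a prompt that ends right where the model should emit 'yes' or 'no'
    inside <answer>...</answer> tags (the prompt template is an assumption here).
    """
    with torch.no_grad():
        logits = model(**inputs).logits          # (batch, seq_len, vocab_size)
    next_token = logits[0, -1]                   # logits for the next token

    # Assumes "yes" and "no" are single tokens in this tokenizer's vocabulary.
    yes_id = tokenizer.convert_tokens_to_ids("yes")
    no_id = tokenizer.convert_tokens_to_ids("no")
    return (next_token[yes_id] - next_token[no_id]).item()

# Ranking then sorts candidates by this margin, largest first:
# ranked = sorted(candidates, key=lambda c: yes_no_margin(model, tok, c), reverse=True)
```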
Put together, RANKVIDEO's recipe first teaches good seeing (perception), then sharp separating (pairwise), steady deciding (pointwise), and wise confidence (distillation), all summarized by a fast yes-vs-no margin. That's the core idea.
03 Methodology
Hook: Imagine sorting 100 lookalike puzzle pieces to find the exact one that fits your spot. First you learn to recognize the shapes, then you compare the best candidates very carefully.
High-Level Overview: Input (Query + Candidate Videos from First Stage) → Stage 1: Perception-Grounded SFT → Stage 2: Reranking with Margin Scoring (pointwise + pairwise + distillation) → Output: Reordered top candidates.
Step 0: Candidate Pool from First Stage
- What happens: A fast retriever (e.g., OmniEmbed) gathers the top ~1000 candidate videos for the query.
- Why it exists: You canât run a slow, careful model on every video in a giant library.
- Example: For "2019 Cairo International Film Festival," pull the 1000 most likely event videos.
The Concept: Data Synthesis Pipeline
- What it is: A factory that builds tough, reasoning-heavy training examples.
- How it works:
- Generate/collect text signals per video: teacher captions (Qwen3-Omni), speech transcripts (Whisper), OCR (on-screen text), and metadata.
- Use a text reasoning model (Qwen3-32B) to craft queries using different combos: caption-only, audio-only, OCR-only, metadata-only, and all.
- Filter for quality: drop queries whose true positive isn't retrieved in the top 1000; drop cases where a top negative is wildly favored over the positive; drop pairs that a strong text reranker misjudges even when given the supporting evidence.
- Why it matters: Without hard, diverse queries, the reranker won't learn to handle real-world complexity. Anchor: Build a query like "Emergency response Notre-Dame fire" that relies on visual and audio clues, not just the title.
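A sketch of the quality filters listed above, assuming a per-query dictionary of reranker scores; the dominance threshold and the exact decision rules are illustrative guesses, not the paper's values.

```python
def keep_synthetic_query(positive_id, first_stage_ranking, reranker_scores,
                         dominance_margin=5.0):
    """Return True if a synthesized query passes the three quality filters.

    first_stage_ranking : list of video ids returned by the first stage (top-1000)
    reranker_scores     : dict of video_id -> score from a strong text reranker
    """
    # 1. The true positive must be retrievable by the first stage at all.
    if positive_id not in first_stage_ranking:
        return False

    # 2. Drop queries where some negative is wildly favored over the positive.
    pos_score = reranker_scores.get(positive_id, float("-inf"))
    best_negative = max((s for vid, s in reranker_scores.items() if vid != positive_id),
                        default=float("-inf"))
    if best_negative - pos_score > dominance_margin:
        return False

    # 3. Drop pairs the strong text reranker misjudges despite the evidence
    #    (approximated here as: it scores the labeled positive as not relevant).
    if pos_score <= 0.0:
        return False
    return True
```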
Stage 1: Perception-Grounded Supervised Fine-Tuning (SFT)
- What happens: Teach the model to caption videos accurately (objects, actions, and context).
- Why it exists: This anchors the model in what is truly in the video, reducing hallucinations later.
- Example: From 32 sampled frames at 2 FPS, predict a caption like "A rocket lifts off at night, bright plume visible," matching a teacher caption.
- Training setup: One unique video per query to cover many events; learning rate 1e-5, batch size 16, max 32 frames.
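A small sketch of the frame-sampling arithmetic implied by these settings (sample at 2 FPS, cap at a maximum frame count); how frames are spaced once the cap is hit is an assumption, shown here as uniform spreading.

```python
def sample_frame_indices(total_frames, native_fps, target_fps=2.0, max_frames=32):
    """Pick frame indices at roughly target_fps, capped at max_frames."""
    duration_s = total_frames / native_fps
    wanted = max(1, int(duration_s * target_fps))
    if wanted <= max_frames:
        step = native_fps / target_fps                  # e.g., every 15th frame at 30 fps
        return [round(i * step) for i in range(wanted)]
    # Over the cap: spread max_frames indices uniformly across the whole clip.
    return [round(i * (total_frames - 1) / (max_frames - 1)) for i in range(max_frames)]

print(len(sample_frame_indices(total_frames=1800, native_fps=30)))  # 60 s clip -> 32 frames
print(len(sample_frame_indices(total_frames=300, native_fps=30)))   # 10 s clip -> 20 frames
```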
Scoring Mechanism: Logit-Margin without Decoding Long Reasoning
- What happens: For each (query, video), prompt the model to output <answer>yes</answer> or <answer>no</answer>.
- Why it exists: This gives a clean, fast relevance score without generating a long chain-of-thought.
- Example with numbers: If logit(yes)=3.1 and logit(no)=1.2, score = 1.9 (strongly relevant). If logit(yes)=0.8 and logit(no)=2.4, score = -1.6 (irrelevant).
Stage 2: Ranking Fine-Tuning with Three Losses
- Query-Grouped Batches: For each query, sample 1 positive and K=2 negatives from the candidate pool.
- Pairwise Ranking Loss (separate the lookalikes)
- What happens: Turn the scores in the batch into a softmax and push the positive to the top; use temperature τ_pair=10 to avoid early saturation.
- Why it exists: Hard negatives dominate reranking errors; pairwise pressure teaches fine separation.
- Example: For "KAMAZ Autonomous Mining Dump Truck," rank the true dump truck video above a similar-looking heavy machinery clip.
- Pointwise Calibration Loss (stay stable)
- What happens: Binary yes/no training with softened labels (positives=1.0, negatives=0.1) and weights (pos=1.0, neg=0.5).
- Why it exists: Handles class imbalance and noisy negatives; keeps scores well-behaved.
- Example: If a negative is almost on-topic, label smoothing prevents overconfidence.
- Teacher Probability Distillation (learn confidence)
- What happens: Match your score to a teacher's soft relevance probability using a temperature-scaled BCE; weight this term by λ_teacher=5 (and the pointwise term by λ_pt=0.5).
- Why it exists: Transfers calibrated confidence beyond binary labels.
- Example: If the teacher thinks a clip is 70% relevant, the student learns a nuanced boundary.
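A PyTorch sketch of how the three terms might be combined for one query group (1 positive plus K negatives), using the hyperparameters quoted above (τ_pair = 10, soft labels 1.0/0.1, weights 1.0/0.5, λ_pt = 0.5, λ_teacher = 5). The unit weight on the pairwise term and the omission of a separate distillation temperature are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rankvideo_style_loss(margins, teacher_probs,
                         tau_pair=10.0, lambda_pt=0.5, lambda_teacher=5.0):
    """margins: (1+K,) yes-minus-no logit margins, index 0 = the positive.
    teacher_probs: (1+K,) teacher relevance probabilities in [0, 1]."""
    # Pairwise: softmax over the group; cross-entropy pushes the positive to the top.
    target = torch.zeros(1, dtype=torch.long)
    l_pair = F.cross_entropy((margins / tau_pair).unsqueeze(0), target)

    # Pointwise: BCE with softened labels (pos=1.0, neg=0.1) and weights (1.0 / 0.5).
    soft_labels = torch.tensor([1.0] + [0.1] * (len(margins) - 1))
    weights = torch.tensor([1.0] + [0.5] * (len(margins) - 1))
    l_point = F.binary_cross_entropy_with_logits(margins, soft_labels, weight=weights)

    # Distillation: match the student's probabilities to the teacher's soft targets.
    l_teach = F.binary_cross_entropy_with_logits(margins, teacher_probs)

    return l_pair + lambda_pt * l_point + lambda_teacher * l_teach

loss = rankvideo_style_loss(torch.tensor([2.3, 0.4, -1.1]),   # positive + 2 negatives
                            torch.tensor([0.9, 0.3, 0.05]))
```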
Hard Negative Mining Details
- Teacher (REASONRANK) provides a label and a margin over yes/no.
- Partition candidates into: trusted negatives (clearly wrong), suspected positives (likely mislabeled; drop these), and hard negatives (ambiguous; keep).
- Use thresholds (e.g., margins around -6 and -8) to make cuts; ensure at least one trusted negative per query.
- Why it exists: Training on the same error regime as real reranking (hard near-misses) is crucial.
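A sketch of the partition rule just described: the margin cutoffs (around -6 and -8) come from the text above, but which side of each cutoff maps to which bucket is an illustrative guess.

```python
def partition_negatives(teacher_margins, hi=-6.0, lo=-8.0):
    """Split labeled negatives by the teacher's yes-minus-no margin.

    Assumed mapping: margins above `hi` look relevant (suspected mislabels, dropped),
    margins at or below `lo` are clearly irrelevant (trusted negatives), and the
    band in between is kept as hard negatives.
    """
    trusted, hard = [], []
    for video_id, margin in teacher_margins.items():
        if margin <= lo:
            trusted.append(video_id)
        elif margin <= hi:
            hard.append(video_id)
        # else: suspected positive, dropped from training
    return trusted, hard

trusted, hard = partition_negatives({"v1": -9.5, "v2": -7.0, "v3": -2.1})
print(trusted, hard)   # ['v1'] ['v2']  (v3 is dropped as a suspected positive)
```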
Dynamic Reasoning (Think when you need to)
- What happens: The model often answers yes/no directly; it "reasons" more only for difficult cases internally.
- Why it exists: Saves time versus always generating long explanations.
- Example: For a clear match (SpaceX rocket on-screen with logo), answer is quick; for ambiguous sports footage, it allocates more attention.
Implementation Summary
- Video sampling: 2 FPS; max 32 frames (Stage 1), 24 frames (Stage 2).
- First-stage depth: 1000; rerank top 100.
- Data sizes: Stage 1 uses 9267 videos; Stage 2 uses 7995 queries and 23985 videos (1 positive + 2 negatives each), mixing 1361 human queries with 7906 synthetic.
- Base model: Qwen3-VL-8B-Instruct.
The Secret Sauce
- Perception-before-reasoning training stabilizes learning and grounds decisions.
- Margin scoring is fast, monotonic, and robust, with no chain-of-thought decoding needed.
- The trio of losses (pairwise + pointwise + distillation) shapes both separation (who's on top) and calibration (how confident), especially against hard negatives.
- A careful data synthesis+filtering pipeline ensures the model practices on the tough, realistic cases it will meet at test time.
04 Experiments & Results
Hook: Think of a spelling bee where everyone is pretty good: the winner is decided by tiny differences. In video search, the top 1000 results already look similar; small improvements in the top few ranks matter a lot.
The Test
- Dataset: MULTIVENT 2.0 (about 109,800 videos), event-centric and reasoning-heavy.
- Setup: For each query, retrieve the top 1000 with a first-stage retriever, then rerank the top 100.
- Metrics: Recall@K (did the right video appear in the top K?) and nDCG@K (did we put the right ones near the top?).
- Why: These reflect real user experience; finding the right video quickly is key.
The Competition
- First-stage baseline: OmniEmbed (OE), plus other retrievers (CLIP, MMMORRF, LanguageBind, Video-ColBERT).
- Reranking baselines:
- REASONRANK (text-based), which uses captions and transcripts.
- QWEN3-VL-8B-Instruct (QVL-I), a strong video model.
- QWEN3-VL-8B-Thinking (QVL-T), a reasoning-heavy variant.
- QWEN3-VL-Reranker-8B (QVL-R), a reranking variant.
The Scoreboard (with context)
- On OmniEmbed as first stage, RANKVIDEO Stage 2 lifts nDCG@10 from 0.495 to 0.566 (about +14.3% relative). That's like moving from a solid B to a strong A- at the very top of the list.
- Averaged across first-stage retrievers, RANKVIDEO boosts nDCG@10 by about 31%, with the biggest jumps on weaker first stages:
- CLIP: nDCG@10 from 0.306 to 0.478 (about +56%).
- LanguageBind: 0.326 → 0.487 (about +49%).
- Video-ColBERT: 0.422 → 0.535 (about +27%).
- MMMORRF (already strong): 0.585 → 0.639 (about +8%).
- Compared to text-only REASONRANK, RANKVIDEO wins across metrics, showing the value of video-native signals.
- QVL-I/T/R models struggled to improve real-world reranking quality; some even hurt performance, indicating zero-shot VLMs can be overconfident and miscalibrated.
What Changed Under the Hood
- Score Distribution Shift: After Stage 2, relevant pairs get larger positive margins; non-relevant ones get pushed negative, reducing overlap. This directly improves early ranks where it matters most.
- Stability Across First Stages: Gains appear consistently, confirming the rerankerâs generality; you can pair it with fast but rough first-stage search and still get strong final rankings.
Efficiency and Dynamic Reasoning
- Latency: RANKVIDEO Stage 2 runs in about 1.02s per query-video pair (batch size 1), within roughly 0.15s of text-based ReasonRank (about 0.87s including amortized preprocessing), and roughly 2.67s faster than QVL-T, which writes long reasoning text every time.
- Inference Inputs: Only frames are needed at test time (no captions/transcripts), avoiding expensive preprocessing.
Downstream RAG (WIKIVIDEO)
- Task: Retrieve top-10 videos to support article generation.
- Results: With RANKVIDEO, claim coverage (α-nDCG, nDCG, StRecall) improves across first stages, and MIRAGE Info Precision (factuality) rises (e.g., OmniEmbed InfoP from 88.0 to 91.2 with RV). This shows better retrieval translates into more factual generated articles.
Surprising/Notable Findings
- Zero-shot reasoning VLMs removed some easy negatives but failed to place the truly relevant videos at the top (nDCG didn't improve), revealing calibration issues.
- Even Stage 1 alone (perception SFT) brings gains, confirming "see well first" is a strong prior.
- RANKVIDEO's scores have low video-only bias (it's not just favoring certain videos), indicating true query-video interaction rather than shortcuts.
Bottom line: On a realistic, large-scale benchmark, RANKVIDEO improves the top of the list substantially and efficiently, beating both text-only rerankers and zero-shot VLM rerankers while staying fast enough for practical use.
05 Discussion & Limitations
Hook: If your judge is great at spotting obvious winners but stumbles on lookalikes, your final ranking will wobble where it matters most: the top few spots.
Limitations
- Compute for Listwise Training: True listwise ranking (many videos at once) often outperforms pairwise/pointwise but is compute-heavy for video. The paper didn't explore listwise due to multi-video memory limits, even with 8×80GB GPUs.
- Generic or Ambiguous Visuals: Some event types (e.g., certain disasters) share similar visuals (e.g., rushing water), making fine-grained separation hard.
- Short/Vague Queries: Underspecified queries leave little to anchor on, hurting precision.
- Data Needs: Stage 1 benefits from teacher captions; Stage 2 uses teacher signals and curated negatives. While inference avoids captions, training does need these resources.
Required Resources
- Models: A base video-language model (Qwen3-VL-8B-Instruct), plus teacher models for captions (Qwen3-Omni), transcripts (Whisper), and OCR.
- Compute: Multi-GPU training; careful batching (e.g., max 24-32 frames at 2 FPS).
- Retrieval Stack: A first-stage retriever to supply the top 1000 candidates, then rerank top 100.
When NOT to Use
- Purely text-only collections where high-quality metadata already perfectly captures content.
- Ultra-low-latency, on-device scenarios with tight compute constraints.
- Domains where visuals/audio add little beyond text (the extra modality may not pay off).
Open Questions
- Listwise or Groupwise Objectives: Can we make multi-video reasoning feasible (e.g., memory-efficient attention, distillation tricks) to unlock further gains?
- Better Dynamic Reasoning: How to control when and how long the model "thinks" to balance speed and accuracy?
- Closing the Accuracy-Rank Gap: How to align training signals so higher classifier accuracy more reliably boosts nDCG@K in hard-negative regimes?
- Robustness Across Languages/Modalities: How to further reduce performance dips for certain languages or weakly visual events?
Overall, RANKVIDEO solves a real and stubborn piece of the pipeline (top-of-list quality under realistic, hard-negative pressure) while leaving room for future efficiency and listwise innovations.
06 Conclusion & Future Work
Three-Sentence Summary
- RANKVIDEO is a video-native reasoning reranker that first learns to see (via captioning) and then learns to rank using a fast yes-vs-no margin score.
- Its training blends pairwise separation, pointwise calibration, and teacher confidence distillation, all focused on hard negatives from realistic candidate pools.
- On a large, event-centric benchmark, it consistently lifts top-of-list quality (about 31% nDCG@10 gains across first stages) and runs efficiently without generating long reasoning text.
Main Achievement
- Showing that "perception before reasoning," combined with margin-based scoring and a three-part training objective, reliably beats text-only and zero-shot VLM rerankers on real-world video retrieval while remaining fast.
Future Directions
- Make listwise/groupwise video reranking practical (e.g., memory-efficient architectures, curriculum over candidate set sizes).
- Smarter dynamic reasoning to adapt compute per item.
- Improved synthesis and filtering for multilingual, weakly visual, or highly ambiguous queries.
- Tighter alignment between classification signals and ranking objectives.
Why Remember This
- It's a blueprint for video search at scale: ground in real pixels and sounds, train against hard negatives, use a clean yes-vs-no margin for speed and stability, and you'll rank better where it counts, in the top results people actually click.
Practical Applications
- Educational search: Quickly find the most accurate lab demo or historical event footage for classroom questions.
- News verification: Retrieve clips that truly support or refute claims for fact-checking workflows.
- Content creation: Locate precise B-roll (e.g., a specific rocket launch) without sifting through hours of footage.
- Enterprise archives: Improve internal video search (trainings, meetings) where metadata is sparse or noisy.
- Sports analytics: Pull the correct plays or player moments from large game libraries for review.
- Emergency operations: Surface the most relevant situational clips (e.g., specific location cues) during crises.
- Multimodal RAG: Feed higher-quality top-10 videos into article or script generation to boost factuality.
- Streaming platforms: Enhance recommendation freshness by accurately matching trend queries to new uploads.
- Museum/media digitization: Help curators find visually anchored scenes across large, unlabeled archives.
- Search quality tuning: Pair a lightweight first stage with RANKVIDEO to save compute while improving top results.