
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Intermediate
Jun Zhang, Teng Wang, Yuying Ge et al. · 12/16/2025
arXiv · PDF

Key Summary

  • TimeLens studies how to teach AI not just what happens in a video, but exactly when it happens, a task called video temporal grounding (VTG).
  • The authors discovered that many famous VTG datasets contain lots of mistakes—unclear questions, wrong times, duplicates, or events that never happen—making past model rankings unreliable.
  • They fixed three popular benchmarks by re-annotating them carefully, creating TimeLens-Bench, which reshuffled which models look best and offered a fairer test.
  • They also built TimeLens-100K, a large, cleaner training set made with an automated re-annotation pipeline, which boosted model accuracy.
  • A simple trick—interleaving textual timestamps with video tokens—worked better for representing time than fancy position embeddings or drawing timestamps onto frames.
  • For training, a “thinking-free” reinforcement learning with verifiable rewards (RLVR) approach beat supervised fine-tuning and “think-then-answer” RL, while being faster and simpler.
  • Two training recipes mattered a lot: stop early when rewards stop improving, and sample harder training examples for RL to learn more.
  • TimeLens models (7B and 8B) set new open-source records on the refined benchmarks and even beat some proprietary models like GPT-5 and Gemini-2.5-Flash in VTG.
  • All code, data, and models will be released to help others build stronger time-aware video AIs.

Why This Research Matters

Videos are how we learn, work, and relax—but finding the exact moment you need is still hard. TimeLens shows how to make AI reliably answer “when,” not just “what,” so you can jump to the right scene in seconds. This helps students learn from experiments, coaches analyze sports, customers locate steps in tutorials, and safety teams find critical moments fast. By fixing the data and simplifying training, it turns a tricky research problem into a practical recipe others can adopt. That means faster progress for tools we all use—search, highlights, summaries—across education, entertainment, and safety.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how when you watch a long movie and someone asks, “When did the dragon show up?”, it’s not enough to say what happened—you must find the exact time it happened.

🥬 The Concept (Video Temporal Grounding, VTG): It’s teaching AI to find the precise start and end times in a video for the event you ask about.

  • How it works (recipe):
    1. Give the AI a video and a question like “When does the kid start riding the bike?”
    2. The AI scans the video frames, paying attention to what changes over time.
    3. It picks the exact time range where that event happens.
  • Why it matters: Without VTG, AI can tell you what is in the video but not when, which makes search, highlights, and instructions much less useful.

🍞 Anchor: “Show me when the cat jumps onto the couch.” VTG answers, “The event happens in 12.3–14.8 seconds.”
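
To make the task's input and output concrete, here is a tiny Python sketch; the class and field names are illustrative, not a schema from the paper.

```python
from dataclasses import dataclass

@dataclass
class VTGExample:
    """One video temporal grounding example: a video, a query, and the answer span."""
    video_path: str   # the video to search
    query: str        # which event to find
    start_s: float    # ground-truth start time, in seconds
    end_s: float      # ground-truth end time, in seconds

example = VTGExample(
    video_path="living_room.mp4",
    query="Show me when the cat jumps onto the couch.",
    start_s=12.3,
    end_s=14.8,
)
print(f"The event happens in {example.start_s}-{example.end_s} seconds.")
```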

The World Before: Multimodal large language models (MLLMs) got good at describing scenes and answering questions about images and short videos—basically, they could say what was there. But they struggled with when things happened, especially in long videos. They mixed events up, missed key moments, or gave fuzzy timing.

🍞 Hook: Imagine a librarian who knows every book’s story but can’t tell you which chapter a scene is in.

🥬 The Concept (MLLMs): These are AI models that can understand both language and visuals (images/videos) and answer questions about them.

  • How it works:
    1. The video becomes a sequence of visual tokens (like words but for images).
    2. The text question becomes text tokens.
    3. The model learns connections between the text and video tokens to answer.
  • Why it matters: Many real tasks—sports review, safety monitoring, education—need understanding across text and video together.

🍞 Anchor: Ask, “What color is the car?” while showing a clip—the model reads the question and watches the video to reply, “Blue.”

The Problem: Benchmarks for VTG had serious quality issues. Many questions were vague (“the game continues”), duplicated (“a person sits in a chair” twice), leaked the answer (“ending credits”), or the labeled time ranges were simply wrong. This made leaderboards misleading—some models looked great only because they guessed shortcuts, while others that genuinely grounded events were under-scored.

🍞 Hook: Imagine a math test where half the answers in the answer key are wrong and some questions basically tell you the answer.

🥬 The Concept (Data Quality & Benchmark Reliability): Data must have clear queries and accurate time labels to fairly test models.

  • How it works:
    1. Check if the event actually exists in the video (event existence).
    2. Make each query unique per video (no duplicates).
    3. Ensure queries are clear and don’t leak time info.
    4. Mark precise start/end times.
  • Why it matters: If the test is broken, models learn the wrong lessons and researchers chase the wrong goals.

🍞 Anchor: If a quiz says “What’s 2+2? (Hint: It’s 4 at the end.)”, you’re not measuring math skill—you’re measuring hint following.
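
As a rough illustration of the text-side checks above (duplicates, leaked time hints, invalid spans), here is a minimal Python sketch. The field names and leaky-phrase list are hypothetical, and the first criterion—whether the event actually exists—still requires watching the video, so it is not covered here.

```python
LEAKY_PHRASES = ("beginning of the video", "ending credits", "at the very end")

def audit_annotation(ann: dict, video_duration_s: float, seen_queries: set) -> list[str]:
    """Return the quality issues found for a single (query, time span) annotation."""
    issues = []
    query = ann["query"].strip().lower()
    if query in seen_queries:
        issues.append("duplicate query for this video")
    seen_queries.add(query)
    if any(phrase in query for phrase in LEAKY_PHRASES):
        issues.append("query leaks its temporal position")
    start, end = ann["start_s"], ann["end_s"]
    if not (0.0 <= start < end <= video_duration_s):
        issues.append("time span is empty or falls outside the video")
    return issues

print(audit_annotation(
    {"query": "Ending credits.", "start_s": 170.0, "end_s": 180.0},
    video_duration_s=175.0,
    seen_queries=set(),
))  # -> reports both a leakage issue and a time-span issue
```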

Failed Attempts: People tried complex timestamp encodings, heavy architectural changes, and different training strategies (like supervised fine-tuning or RL with long “thinking” chains). But because the data was noisy and evaluations inconsistent, it was hard to know what truly worked. Some methods looked strong only on flawed tests.

The Gap: We needed two things: (1) a trustworthy evaluation suite, and (2) simple, tested best practices for modeling time and training, so researchers could compare apples to apples and actually build better VTG models.

Real Stakes:

  • Video search: “Show the moment the package is delivered.”
  • Education: “When does the science experiment change color?”
  • Safety: “Locate when a person falls.”
  • Sports: “Find the play where the pass is intercepted.”
  • Entertainment: “Jump to when the hero finds the key.”

If AI can’t reliably say when, users waste time and miss critical moments.

02Core Idea

🍞 Hook: Imagine you’re organizing a long movie with sticky notes that say exactly when each scene starts—suddenly finding any moment is easy.

🥬 The Concept (The Aha!): Fix the data, keep time representation simple, and train with verifiable rewards—this trio makes VTG models both stronger and simpler.

  • How it works:
    1. Clean the tests (TimeLens-Bench) so scores are trustworthy.
    2. Clean the training (TimeLens-100K) so learning signals are clear.
    3. Represent time by interleaving plain text timestamps with video tokens.
    4. Train with “thinking-free” RLVR that rewards correct time ranges.
    5. Use two recipes: early stop when reward plateaus; sample harder examples.
  • Why it matters: Without clean data and simple, verifiable training, models overfit shortcuts and fail on real tasks.

🍞 Anchor: With tidy bookshelves (clean data), page numbers on sticky notes (timestamps), and a quick point system (rewards), the librarian instantly finds the right chapter.

Three Analogies:

  1. Detective Timeline: Instead of guessing, the detective pins exact times of clues on a timeline; the case becomes solvable.
  2. GPS for Movies: Put clear mile markers (timestamps) along the route (video). The navigator (model) arrives at the right exit (event time).
  3. School Quiz Fix: Replace a messy test with a fair one, grade answers with a clear rubric (reward), and stop studying when scores stop improving.

Before vs. After:

  • Before: Noisy benchmarks, unclear training data, complicated time encodings, and training strategies that didn’t fairly compare.
  • After: TimeLens-Bench (trustworthy testing), TimeLens-100K (clean training), simple interleaved textual time encoding, and thinking-free RLVR with two practical recipes—leading to state-of-the-art open-source VTG.

Why It Works (intuition):

  • Cleaner data means the model learns real grounding, not shortcuts.
  • Interleaving timestamps as text lets the language backbone naturally read time as if it were words.
  • Verifiable rewards (IoU of time segments) give direct, honest feedback every step.
  • Early stopping avoids over-optimizing on a plateau; hard-example sampling stretches the model’s skills faster.

Building Blocks (with mini “sandwich” explanations):

🍞 Hook: You know how a fair race needs a proper track and referee.

🥬 The Concept (TimeLens-Bench): A re-annotated, cross-validated benchmark suite for VTG.

  • How: audit errors, fix queries/times, cross-validate, and enforce strict criteria.
  • Why: Bad tests produce bad rankings; good tests reveal true skill.

🍞 Anchor: After fixing the test, some models jumped down the leaderboard while careful ones rose.

🍞 Hook: Imagine studying from a clean, well-edited textbook instead of scribbled notes.

🥬 The Concept (TimeLens-100K): A large, high-quality training set from automated re-annotation.

  • How: Use a strong MLLM to propose events, generate queries, assign times, and verify quality.
  • Why: Better training data → better learning signals → better models.

🍞 Anchor: Models trained on TimeLens-100K scored much higher on the refined benchmarks.

🍞 Hook: Think of placing the timestamp right before each photo in a scrapbook.

🥬 The Concept (Interleaved Textual Encoding): Put timestamps as text tokens before each frame’s tokens.

  • How: Convert times like “10.2s” into text and interleave with the video tokens.
  • Why: The LLM can “read” time directly, aligning language and visuals simply and effectively.

🍞 Anchor: This method beat position-embedding tweaks and image-overlay clocks in tests.

🍞 Hook: Picture a game that gives you points only when you hit the right target.

🥬 The Concept (RLVR – Reinforcement Learning with Verifiable Rewards): Train by rewarding how close the predicted time range is to the truth.

  • How: Sample outputs, compute IoU with ground truth, increase chances of higher-reward answers.
  • Why: Direct, verifiable signals speed up learning and reduce guesswork.

🍞 Anchor: The model keeps what works—accurate time ranges—and drops what doesn’t.

🍞 Hook: If your test score stops improving after more practice, you stop and switch tactics.

🥬 The Concept (Early Stopping in RLVR): Halt training when rewards plateau.

  • How: Track rewards and their spread; stop when they flatten.
  • Why: Saves compute and avoids overfitting.

🍞 Anchor: In experiments, going past the plateau hurt performance, so stopping early was best.

🍞 Hook: Climbing a slightly higher step each time makes you stronger than stepping on the same low step forever.

🥬 The Concept (Difficulty-Based Sampling): Prefer harder training examples.

  • How: Estimate difficulty by 1–IoU; sample from higher-difficulty ranges.
  • Why: Challenging cases teach more and faster—until a practical limit.

🍞 Anchor: As average difficulty rose, scores improved and then leveled off at high difficulty.

03Methodology

At a high level: Input (videos + queries) → Diagnose & fix benchmarks → Build clean training data → Encode time with interleaved text → Train with thinking-free RLVR (+ early stop, hard sampling) → Evaluate on TimeLens-Bench → Output (TimeLens models).

Step 1: Diagnose and Refine Benchmarks

  • What happens: Human annotators review existing datasets (Charades-STA, ActivityNet Captions, QVHighlights) for five common issues: unclear queries, duplicates, no event occurrence, multiple occurrences, and inaccurate times. They fix or rewrite queries and re-label precise start/end times. Cross-validation ensures consistency.
  • Why it exists: If the test is noisy, models learn and are judged by the wrong signals.
  • Example: Original query “A man is running down a track by a field” might match multiple spots. Rewritten to “A man kneels on the ground with both knees,” which occurs once, with corrected times.

Step 2: Curate High-Quality Training Data (TimeLens-100K)

  • What happens: An automated pipeline uses a strong MLLM to propose distinct events throughout a video, write clear queries, assign time spans, and self-check quality. This scales to ~100K annotations.
  • Why it exists: Training on clean labels teaches real grounding, not guessy shortcuts.
  • Example: From a cooking video, the pipeline creates queries like “When does the chef crack the egg?” with a single exact time range.

🍞 Hook: Like revising homework after a teacher’s notes.

🥬 The Concept (Re-annotation Pipeline): A system to improve labels by doing them again, better.

  • How: Detect issues → propose fixes → verify → accept.
  • Why: Clean labels are the foundation of honest learning.

🍞 Anchor: After re-annotation, models trained on the new data performed much better.
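
Below is a high-level sketch of what such a pipeline's control flow could look like. The three helpers stand in for calls to a strong MLLM and return canned data so the loop runs end to end; this is an assumption-laden illustration, not the authors' implementation.

```python
def propose_events(video_path: str) -> list[dict]:
    # Placeholder: in the real pipeline, a strong MLLM proposes distinct events
    # across the whole video, each with a rough time span.
    return [{"description": "the chef cracks an egg into the bowl", "start_s": 34.0, "end_s": 37.5}]

def write_query(event: dict) -> str:
    # Placeholder: the MLLM rewrites each event as a clear, single-occurrence query.
    return event["description"].strip().capitalize() + "."

def verify(video_path: str, query: str, start_s: float, end_s: float) -> bool:
    # Placeholder: the MLLM self-checks that the event exists and the span is accurate.
    return end_s > start_s

def reannotate_video(video_path: str) -> list[dict]:
    """Propose events, write queries, assign times, verify, and keep only clean labels."""
    annotations, seen = [], set()
    for event in propose_events(video_path):
        query = write_query(event)
        if query.lower() in seen:
            continue  # drop duplicate queries for the same video
        if not verify(video_path, query, event["start_s"], event["end_s"]):
            continue  # drop annotations that fail the self-check
        annotations.append({"query": query, "start_s": event["start_s"], "end_s": event["end_s"]})
        seen.add(query.lower())
    return annotations

print(reannotate_video("cooking_tutorial.mp4"))
```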

Step 3: Represent Time with Interleaved Text

  • What happens: Convert each frame’s sampling time (e.g., “10.2s”) into text tokens and insert them just before that frame’s visual tokens. The LLM now “reads” when each frame happens.
  • Why it exists: It’s simple, requires no architecture hacks, and empirically outperforms fancier alternatives.
  • Example: A 3-frame clip becomes: [“1.0s” + frame1 tokens], [“2.0s” + frame2 tokens], [“3.0s” + frame3 tokens].

🍞 Hook: Label the photos in your album with the time they were taken.

🥬 The Concept (Interleaved Textual Encoding): Write timestamps as text, right before each frame.

  • How: Textify times → interleave with frame tokens → feed to the LLM.
  • Why: The model can align language, vision, and time naturally.

🍞 Anchor: In head-to-head tests, this beat position-embedding methods and overlaid timestamps.
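
A minimal sketch of the idea, using placeholder visual tokens rather than a real model's vocabulary:

```python
def interleave_timestamps(frame_tokens: list[list[str]], frame_times_s: list[float]) -> list[str]:
    """Place each frame's sampling time, written as plain text, right before its visual tokens."""
    sequence = []
    for t, tokens in zip(frame_times_s, frame_tokens):
        sequence.append(f"{t:.1f}s")   # the timestamp becomes ordinary text
        sequence.extend(tokens)        # followed by that frame's visual tokens
    return sequence

frames = [["<v_0>", "<v_1>"], ["<v_2>", "<v_3>"], ["<v_4>", "<v_5>"]]
print(interleave_timestamps(frames, frame_times_s=[1.0, 2.0, 3.0]))
# ['1.0s', '<v_0>', '<v_1>', '2.0s', '<v_2>', '<v_3>', '3.0s', '<v_4>', '<v_5>']
```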

Step 4: Train with Thinking-Free RLVR

  • What happens: The model directly outputs a time range. A reward computes IoU (overlap) with ground truth. GRPO-style updates increase the chance of higher-reward outputs.
  • Why it exists: VTG is perception-heavy. Extra “think-then-answer” text adds cost without gains here.
  • Example: Model predicts “12.0–17.5s.” If the true segment is “12.8–17.0s,” the IoU reward is high, so the model reinforces that behavior.

🍞 Hook: A game that adds points the closer you land to the bullseye.

🥬 The Concept (Thinking-Free RLVR): Train by rewarding only accurate time spans; no extra “thinking” text.

  • How: Sample outputs → compute IoU reward → policy update.
  • Why: Direct signal, faster learning, better results.

🍞 Anchor: Compared to SFT and think-then-answer RLVR, this was simpler and stronger.
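
The sketch below shows the verifiable reward (temporal IoU) together with a GRPO-style group-relative advantage; it is a simplified illustration of the training signal, not the authors' training code.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Reward for one sampled answer: overlap over union of predicted and true time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style: compare each sampled answer to its group's mean instead of using a critic."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]

# The example from the text: the ground-truth segment is 12.8-17.0 s.
sampled_answers = [(12.0, 17.5), (5.0, 9.0), (13.0, 16.5)]
rewards = [temporal_iou(ans, (12.8, 17.0)) for ans in sampled_answers]
print([round(r, 2) for r in rewards])                    # high reward for close answers
print([round(a, 2) for a in group_advantages(rewards)])  # better-than-group answers get reinforced
```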

Step 5: Early Stopping & Difficulty-Based Sampling

  • Early Stopping
    • What: Monitor the average reward and the within-group reward spread; stop when they plateau.
    • Why: Extra training past the plateau degraded performance.
    • Example: Rewards flattened around ~300 steps; stopping there saved compute and preserved peak accuracy.
  • Difficulty-Based Sampling
    • What: Estimate difficulty as 1–IoU from offline inference; sample more from higher-difficulty ranges via a Gaussian selector.
    • Why: Harder examples teach more; performance rose with difficulty until leveling off.
    • Example: Raising the average difficulty (e.g., >0.75) improved results and then plateaued.

🍞 Hook: Quit practicing when you’re no longer improving, and pick tougher drills to get stronger.

🥬 The Concept (Early Stopping & Hard Sampling): Two training recipes that save time and boost learning.

  • How: Watch rewards to stop; select more challenging samples.
  • Why: Efficient training that targets what the model needs.

🍞 Anchor: These tweaks delivered clear, cumulative gains in the ablations.
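
Here is a rough sketch of both recipes; the window size, target difficulty, and Gaussian width are illustrative placeholders, not values from the paper.

```python
import math
import random

def reward_plateaued(reward_history: list[float], window: int = 50, tol: float = 1e-3) -> bool:
    """Early-stopping signal: the mean reward over the last two windows has barely moved."""
    if len(reward_history) < 2 * window:
        return False
    recent = sum(reward_history[-window:]) / window
    previous = sum(reward_history[-2 * window:-window]) / window
    return abs(recent - previous) < tol

def sample_by_difficulty(examples: list[dict], k: int, target: float = 0.75, width: float = 0.15) -> list[dict]:
    """Draw k training examples, weighted by a Gaussian around the target difficulty (1 - offline IoU)."""
    weights = [math.exp(-((ex["difficulty"] - target) ** 2) / (2 * width ** 2)) for ex in examples]
    return random.choices(examples, weights=weights, k=k)

pool = [{"id": i, "difficulty": d} for i, d in enumerate([0.1, 0.4, 0.7, 0.8, 0.95])]
print(sample_by_difficulty(pool, k=3))   # tends to pick the harder examples
print(reward_plateaued([0.5] * 120))     # True: the reward curve has flattened
```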

The Secret Sauce

  • Clean, cross-validated benchmarks (fair test).
  • Clean, large training set (good study material).
  • Interleaved textual time encoding (simple, effective signals).
  • Thinking-free RLVR with two practical recipes (fast, strong learning). Together, these choices created TimeLens models that achieved state-of-the-art VTG among open-source systems and even surpassed some proprietary models.

04Experiments & Results

The Test: The team evaluated on three refined benchmarks—Charades-TimeLens (daily life), ActivityNet-TimeLens (activities), and QVHighlights-TimeLens (mixed)—to measure how well models localize events in time.

🍞 Hook: Think of grading archers by how close their arrows land to the center.

🥬 The Concept (mIoU – mean Intersection over Union): Average overlap between the predicted time range and the true time range, across all samples.

  • How: For each sample, compute overlap/union of two time segments; average over the test set.
  • Why: It rewards both precision and coverage; higher is better.

🍞 Anchor: Predicting 12–18s when the truth is 13–17s scores a high IoU; predicting 1–50s scores low.
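
A minimal sketch of the metric, following the standard temporal-IoU definition:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Overlap over union of two time segments, each given as (start_s, end_s)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds: list[tuple[float, float]], gts: list[tuple[float, float]]) -> float:
    """mIoU: average temporal IoU over the whole test set."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# The anchor example: a tight prediction scores high, a sprawling one scores low.
print(round(temporal_iou((12.0, 18.0), (13.0, 17.0)), 2))  # 0.67
print(round(temporal_iou((1.0, 50.0), (13.0, 17.0)), 2))   # 0.08
```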

🍞 Hook: A pass/fail threshold—did your arrow land close enough to count?

🥬 The Concept (R1@m): The percentage of samples where the top prediction’s IoU beats a threshold m (like 0.3, 0.5, 0.7).

  • How: For each sample, check if IoU ≥ m; count successes over all samples.
  • Why: It shows performance at different strictness levels.

🍞 Anchor: R1@0.7 is a harder test than R1@0.3, like needing a bullseye instead of just hitting the target.
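
And a correspondingly small sketch of R1@m, assuming each sample's top-prediction IoU has already been computed:

```python
def r1_at_m(top_pred_ious: list[float], m: float) -> float:
    """R1@m: fraction of samples whose top prediction's IoU reaches the threshold m."""
    return sum(iou >= m for iou in top_pred_ious) / len(top_pred_ious)

ious = [0.82, 0.45, 0.66, 0.10]   # illustrative per-sample IoUs
for m in (0.3, 0.5, 0.7):
    print(f"R1@{m}: {r1_at_m(ious, m):.2f}")   # 0.75, 0.50, 0.25
```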

The Competition: They compared TimeLens models (7B, 8B) with strong proprietary systems (e.g., GPT-5, Gemini-2.5-Flash/Pro) and popular open-source baselines (Qwen2.5-VL-7B, Qwen3-VL-8B, MiMo-VL-7B, Time-R1-7B, VideoChat variants).

Scoreboard with Context:

  • On Charades-TimeLens, TimeLens-7B reached R1@0.3/0.5/0.7 = 70.5/55.6/28.4 with mIoU 48.8—big gains over its base (Qwen2.5-VL-7B: 59.7/37.8/16.6, mIoU 39.3). TimeLens-8B did even better (76.6/63.0/35.2, mIoU 55.2).
  • On ActivityNet-TimeLens, TimeLens-7B improved to 62.8/51.0/32.6, mIoU 46.2 versus its base 44.1/31.0/16.1, mIoU 31.4; TimeLens-8B further improved to 68.9/58.4/40.6, mIoU 53.2.
  • On QVHighlights-TimeLens, TimeLens-7B achieved 74.1/62.7/43.1, mIoU 56.0; TimeLens-8B reached 80.2/71.6/55.5, mIoU 65.5.
  • Big picture: These are like moving from a class average of B- to solid A-/A scores across tests; notably, the 8B model set a new open-source state of the art and even beat some proprietary systems (e.g., GPT-5, Gemini-2.5-Flash) on VTG.

Surprising Findings:

  • Benchmark Quality Reversal: On old, noisy benchmarks, some open-source models looked better than proprietary ones. On the refined TimeLens-Bench, rankings flipped—showing the old tests were misleading.
  • Timestamp Encoding Ablation: Interleaved textual timestamps with raw times (“10.2s”) beat position-embedding changes and visually overlaying timestamps on frames, across all three datasets.
  • Training Paradigms: Thinking-free RLVR outperformed supervised fine-tuning and think-then-answer RLVR while being more efficient. Adding an SFT stage before RLVR showed no meaningful gain in this VTG setup.
  • Early Stopping: As soon as the reward plateaued, performance peaked; training further reduced scores.
  • Difficulty Sampling: As average example difficulty increased, performance rose and then leveled off—hard examples were key, but only up to a point.

General Video Skills: On the Video-MME benchmark, TimeLens-7B kept its base model’s general video understanding strength while significantly improving VTG, showing the VTG boosts didn’t break other abilities.

Takeaway: Clean data + simple time encoding + thinking-free RLVR + two practical recipes delivered consistent, cumulative gains that stood up across datasets and model sizes.

05Discussion & Limitations

Limitations:

  • Data Scope: Even with careful re-annotation, TimeLens-Bench covers three major datasets but not every domain (e.g., very long documentaries, first-person wearable videos with unusual motions, or specialized medical footage).
  • Visual-Only: Audio was removed for fair model comparisons; some real tasks benefit from sound (e.g., “When does the whistle blow?”). TimeLens focuses on vision-only VTG.
  • Ultra-Long Videos: Token budgets and frame sampling still limit very long video coverage; smart sampling helps but doesn’t solve everything.
  • RL Stability: RLVR requires careful monitoring; overtraining after the reward plateau can hurt results, and rollout diversity matters.
  • Automated Re-annotation: Using a strong model to label training data can pass along its own biases; quality checks help but can’t erase all bias.

Required Resources:

  • Compute: The reported RLVR recipe ran efficiently (e.g., ~4h10m at a 1.0× baseline on 8× H20 GPUs), but full-scale training and ablations still need multi-GPU time and storage.
  • Data: Access to the refined benchmarks (TimeLens-Bench) and TimeLens-100K training set.
  • Tooling: Annotation interfaces, cross-validation workflows, and evaluation scripts.

When NOT to Use:

  • Audio-Critical Tasks: If timing depends on sound (whistles, alarms), a vision-only VTG setup may miss the mark.
  • Micro-Clips: For super short clips where the event fills the whole video, VTG adds little value.
  • Heavy Logical Reasoning: If solving requires multi-step reasoning chains (beyond perception), think-then-answer RL might be revisited; TimeLens targets perception-heavy VTG.

Open Questions:

  • Audio-Visual VTG: How much would aligned audio cues boost grounding accuracy and robustness?
  • Longer Contexts: Can we scale interleaved time encoding and RLVR to hours-long videos without losing precision?
  • Reasoning-Intensive VTG: For tasks that truly need multi-step logic, what’s the right balance of explicit reasoning and perception?
  • Data Generation: How to further reduce bias in automated re-annotation and cover underrepresented scenarios?
  • Unified Training: Can one schedule blend SFT, RLVR, curriculum, and augmentation to work across many video tasks, not just VTG?

06Conclusion & Future Work

Three-Sentence Summary: TimeLens shows that getting time right in video AI starts with fixing the data: repair benchmarks (TimeLens-Bench) and build clean training at scale (TimeLens-100K). With that foundation, a simple time representation (interleaved textual timestamps) and thinking-free RLVR—plus two practical training recipes (early stopping and difficulty sampling)—deliver big, reliable gains. The resulting TimeLens models set new open-source records for VTG and even surpass some proprietary systems.

Main Achievement: Turning VTG progress from a maze of noisy tests and unproven tricks into a clear, reproducible recipe that pairs trustworthy evaluation with simple, effective modeling and training.

Future Directions: Extend VTG to audio-visual grounding; scale to much longer videos; explore reasoning-intensive VTG where chain-of-thought may help; broaden data re-annotation to more domains; and unify training schedules to cover diverse video tasks.

Why Remember This: TimeLens reminds us that in AI, clean data and clear rewards often beat complexity. By rethinking when (not just what), it unlocks precise video search, safer monitoring, sharper sports analysis, and smoother learning tools—pushing video understanding closer to how humans navigate time in stories.

Practical Applications

  • Smart video search: Jump straight to the exact moment a requested event happens in long videos.
  • Sports analysis: Quickly locate key plays (e.g., goals, turnovers) with precise timestamps.
  • Education platforms: Highlight moments in experiment videos where important changes occur.
  • Customer support: Pinpoint when an installation or troubleshooting step is performed in tutorials.
  • Safety monitoring: Find the precise time a fall, intrusion, or hazard appears in surveillance footage.
  • Content creation: Auto-create highlight reels by grounding requested moments with clean start/end times.
  • Media indexing: Tag large video libraries with exact time ranges for searchable events.
  • News and fact-checking: Locate the exact time a quoted event happens in broadcast footage.
  • E-commerce and reviews: Jump to when a product feature is demonstrated in a video.
  • User-generated content apps: Let users ask “when” questions and get instant time-stamped answers.
#video temporal grounding · #multimodal large language models · #benchmark re-annotation · #TimeLens-Bench · #TimeLens-100K · #interleaved textual encoding · #reinforcement learning with verifiable rewards · #thinking-free RL · #early stopping · #difficulty-based sampling · #mIoU · #R1@m · #timestamp encoding · #video understanding · #evaluation reliability