Factorized Learning for Temporally Grounded Video-Language Models
Key Summary
- This paper teaches video-language models to first find when the proof happens in a video and then answer with that proof, instead of mixing both steps together.
- It introduces evidence tokens (<evi>) that not only point to timestamps but also carry the visual meaning of the event, like a smart bookmark with a summary.
- The method uses a two-stage recipe: pure grounding (find events) and then interleaved text-and-evidence answering to stay consistent with what was found.
- A new training method, Factorized Preference Optimization (FPO), learns preferences for both correct grounding and correct text, not just text.
- FPO models grounding probability directly from frame-<evi> similarities, turning similarities into a likelihood that an event is correctly localized.
- A synthetic data pipeline creates controlled, event-level mistakes (in time and text) so the model can learn what not to do without expensive labeling.
- Across multiple benchmarks (E.T. Bench, Charades-STA, YouCook2), the 3.8B model beats many larger state-of-the-art models, especially on temporal grounding.
- Decoupling the tasks and adding explicit visual semantics to <evi> tokens make training more stable and answers more faithful to the video.
- The approach remains lightweight, uses LoRA fine-tuning, and adds negligible per-token runtime overhead for the similarity calculations.
- Limitations include still-low scores on some tasks and a need to generate more diverse positive preference samples in the future.
Why This Research Matters
People want answers they can trust, especially when videos are long and details matter. By grounding first and answering with explicit references, this method shows exactly where the proof is and what it means. That helps in sports analysis, cooking instructions, product demos, safety reviews, and classroom videos. It also makes smaller models competitive, which can reduce costs and widen access. The synthetic preference pipeline lowers the need for expensive labeling. Overall, it pushes AI toward clearer, checkable, and fairer explanations of what really happens on screen.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how, when you watch a movie with a friend, one of you keeps track of when the key scenes happen ("The surprise reveal is at 52 minutes!") while the other explains what those scenes mean? Mixing those two jobs can get confusing.
🥬 The Concept (Video-Language Models, VLMs): A video-language model is a computer program that watches videos and answers questions about them in text.
- How it works: (1) It turns video frames into numbers (features). (2) It reads your question as text. (3) It generates an answer token by token. (4) Some models also try to tell you when (timestamps) the important proof happens in the video.
- Why it matters: Without VLMs, you'd have to watch long videos yourself just to find moments like "when the dog jumps" or "when the chef adds salt."
🍞 Anchor: Imagine asking, "When does the soccer goal happen, and who scores?" A good VLM should point to the exact seconds and explain what happened.
🍞 Hook: Imagine you're a detective with a timeline on the wall. If you can't pin evidence to the right minute, your whole case falls apart.
🥬 The Concept (Temporal Grounding): Temporal grounding means finding the exact time intervals in a video where the important events (evidence) happen.
- How it works: (1) Look through frames. (2) Decide which frames belong to the event. (3) Group them into one or more intervals. (4) Use those intervals as proof.
- Why it matters: If grounding is wrong, your answer will likely be wrong, like blaming the wrong suspect in a mystery.
🍞 Anchor: For "When does the cat knock over the glass?", temporal grounding should return something like 10.6s–12.6s.
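To make the "decide which frames belong" and "group them into intervals" steps concrete, here is a minimal Python sketch (not from the paper) that turns a per-frame relevance mask into time intervals, assuming frames are sampled at 1 FPS so frame indices map directly to seconds:

```python
def frames_to_intervals(salient, fps=1.0):
    """Group a 0/1 per-frame mask into (start_s, end_s) intervals."""
    intervals, start = [], None
    for i, flag in enumerate(salient):
        if flag and start is None:
            start = i                      # an event interval opens at this frame
        elif not flag and start is not None:
            intervals.append((start / fps, i / fps))
            start = None
    if start is not None:                  # the last event runs to the final frame
        intervals.append((start / fps, len(salient) / fps))
    return intervals

# Frames 10-12 flagged as relevant -> [(10.0, 13.0)]
print(frames_to_intervals([0] * 10 + [1] * 3 + [0] * 5))
```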
🍞 Hook: Think of telling a friend about a game highlight: first you jump to the clip, then you describe it. If you just describe without the clip, they may not believe you.
🥬 The Concept (Textual Response): Textual response is the explanation in words, based on the grounded video evidence.
- How it works: (1) Use the grounded moments. (2) Read visual details. (3) Convert them into a sentence that answers the question.
- Why it matters: Without a clear description, people don't know what the evidence means.
🍞 Anchor: "The player in the red jersey scores at 55–58s." That's a textual response tied to time.
🍞 Hook: Picture doing math homework. If you try to solve and explain at the exact same time, you might mix steps. Solving first, then explaining, is easier.
🥬 The Concept (Coupled vs. Decoupled Learning): Many past methods mixed grounding and answering into one tangled process.
- How it works (before): Generate text and timestamps together, with special time tokens sprinkled in the sentence.
- Why it breaks: The model can get confused about what to do next, and special tokens often act like numbers (timestamps) without capturing the event's meaning.
🍞 Anchor: It's like trying to measure, cut, and paint wood at once; you'll probably spill paint on the ruler.
🍞 Hook: Imagine a team with two clear jobs: one finds the clips, the other writes the story, but they coordinate closely.
🥬 The Concept (Factorized Learning): Factorized learning means breaking a big task into parts, training each with a clear objective, and keeping their connection strong.
- How it works: (1) First, do pure grounding. (2) Then, answer by referencing what you grounded. (3) Use special signals to keep both stages consistent.
- Why it matters: Clear jobs lead to cleaner learning and better results.
🍞 Anchor: First mark where a basketball dunk happens, then describe the dunk while pointing back to the marked clip.
The world before: Video-language models could answer questions and sometimes output timestamps with special tokens. But two big issues kept showing up. First, the model often learned a single, mixed objective: it tried to localize events and write text in the same breath. That muddled its training signal. Second, grounding tokens mostly acted like timestamp indices; they didn't carry the event's visual meaning. So even if a model guessed the right time, it didn't always understand what happened there.
The problem: We need a model that (a) precisely finds event intervals, (b) deeply understands the visual semantics within those intervals, and (c) writes answers that explicitly reference those grounded events.
Failed attempts: Prior work added more and more special tokens, or even extra decoders, but still learned grounding and text together. Models got better at printing times, but not at capturing the event-level meaning that helps the next words be right.
The gap: A missing structure that teaches "ground first, then answer," while also forcing the answer to match the grounded evidence. And a missing way to train preferences not just for better text, but for better grounding too.
Real stakes: In everyday life, this matters for: (1) Sports highlights: find and explain the goal, not just guess the minute. (2) Cooking or DIY: point to each step and describe it correctly. (3) Safety reviews: show precisely when a near-miss happened, then clearly explain it. (4) Education: locate and explain key moments in lecture videos. Precise, trustworthy answers save time and build confidence.
02 Core Idea
🍞 Hook: You know how a good YouTube recap first shows the exact moment of the big play, then explains what made it special? That order makes it easy to trust.
🥬 The Concept (Aha!): The key insight is to factor the task: first ground the evidence in time, then answer while explicitly referencing that evidence using special evidence tokens that also carry visual meaning, and train preferences for both parts with FPO.
- How it works: (1) Stage 1, Pure Grounding: generate <evi> tokens that latch onto the right frames and absorb their visual semantics. (2) Stage 2, Interleaved Answering: write text and re-generate matching <evi> tokens to reference the same events. (3) Enforce a consistency constraint so the second-stage <evi> tokens align with the first-stage ones. (4) Train with Factorized Preference Optimization (FPO) that rewards both good grounding and good text.
- Why it matters: Without factoring, the model mixes goals and confuses itself; without visual semantics in <evi>, text lacks solid context; without FPO, alignment ignores grounding quality.
🍞 Anchor: "Where do I put the glassware?" The model first grounds the clip where you place it (e.g., 10.6–12.6s), then answers "In the dishwasher," while referencing the same <evi> interval.
Three analogies for the same idea:
- Librarian analogy: First, find the right chapter (grounding). Then, explain the plot while pointing to the passages (answering with evidence). The sticky note (<evi>) doesn't just mark a page; it summarizes the key scene.
- Detective analogy: Pin the suspect's timeline to the board (grounding), then present the case citing those exact timestamps (answering). The pin (<evi>) includes a brief note of what happened there.
- Cooking show analogy: Bookmark each step when it happens (grounding), then narrate the recipe, reusing those bookmarks to remind viewers exactly where each step occurred (answering with referencing).
Before vs. After:
- Before: Time tokens were interleaved with text, objectives were coupled, and timestamp tokens lacked event meaning. Answers could drift from the true evidence.
- After: Two clear stages, <evi> tokens that carry event-level visual semantics, explicit consistency between stages, and FPO that optimizes both grounding and text.
Why it works (intuition):
- Separating the jobs simplifies learning: the model knows when to find evidence and when to explain it.
- Packing visual semantics into <evi> gives the language model a rich, local context right where it needs it: at the next token.
- Evidence referencing keeps the answer honest: reusing the same <evi> ties the text back to the original grounded moments.
- FPO closes the loop: the model is rewarded not only for nice-sounding text but also for precise grounding, measured via frame-<evi> similarity.
Building blocks (each introduced with a sandwich):
🍞 Hook: Imagine a smart bookmark that also stores a tiny summary of the scene it marks. 🥬 The Concept (Evidence Token, <evi>): An <evi> token is a special token that grounds an event in time and absorbs the event's visual meaning.
- How it works: (1) Generate <evi>. (2) Compute similarity between <evi> and each frame feature. (3) Pick salient frames with high similarity. (4) Aggregate their features into <evi> (e.g., average and add). (5) Use frame indices of salient frames to form timestamps.
- Why it matters: If <evi> only carried a number, the next words might drift. With visual semantics inside, the text stays tied to what actually happened. 🍞 Anchor: For "When does the dog fetch the ball?", <evi> pulls in frames of the fetch and helps the model say, "The dog fetches the ball at 23.5–26.1s."
🍞 Hook: Think of switching from scouting to reporting: first you scout locations, then you write the story. 🥬 The Concept (Grounding → Answering with Referencing): The response is generated in two stages: pure grounding, then interleaved text + evidence referencing.
- How it works: (1) Stage 1: Emit a sequence of <evi> tokens that localize each event. (2) Emit </evi> to signal the switch. (3) Stage 2: Write text and re-emit matching <evi> tokens so timestamps come from the same grounded evidence.
- Why it matters: Answers stay consistent with the original evidence; no drifting. 🍞 Anchor: "Step 1 at <evi>, Step 2 at <evi>…" Each <evi> in the answer matches the ones found earlier.
🍞 Hook: If your second report disagrees with your first notes, something's off. 🥬 The Concept (Consistency Constraint): Force <evi> tokens in the answer stage to align with those from the grounding stage.
- How it works: Compare their features and minimize the difference.
- Why it matters: Prevents the answer from quietly changing the evidence. 🍞 Anchor: If grounding first said 12–15s, the answer should reference the same interval via a matching <evi>.
🍞 Hook: Imagine grading both the map and the essay, not just the essay. 🥬 The Concept (Factorized Preference Optimization, FPO): A training rule that increases the model's preference for responses that have both better grounding and better text.
- How it works: (1) Compute the usual text log-prob. (2) Compute grounding log-prob from frame-<evi> similarities (product across the interval). (3) Compare preferred vs. dispreferred pairs and push up the better one.
- Why it matters: Text alone isn't enough; we must also prefer precise evidence. 🍞 Anchor: If two answers sound okay but only one precisely marks the goal clip, FPO favors the precise one.
03 Methodology
High-level pipeline: Input (Video + Question) → Visual/Text Encoding → Stage 1: Pure Grounding with <evi> → Stage Switch (</evi>) → Stage 2: Interleaved Text + <evi> Answering → Output (Text + Timestamps from <evi>)
Step 0: Encoders
- What happens: Frames are sampled (e.g., 1 FPS), resized (e.g., 224×224), and encoded by a vision backbone (e.g., ViT-G/14 via EVA-CLIP) and a feature compressor. The question is tokenized by the language model. Both streams flow into a video-aware LLM decoder.
- Why this step exists: The LLM needs compact, aligned tokens to reason about video and language together.
- Example: A 100-second video becomes 100 frame tokens; the question "Where did I put the glassware?" becomes text tokens.
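A minimal sketch of the frame-sampling step described above, using OpenCV as an illustrative stand-in; the paper's exact preprocessing, vision backbone, and compressor are not shown here and the fallback frame rate is an assumption:

```python
import cv2
import numpy as np

def sample_frames(video_path, target_fps=1.0, size=224):
    """Sample roughly target_fps frames per second and resize each to size x size."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if metadata is missing
    step = max(int(round(native_fps / target_fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.resize(frame, (size, size)))
        idx += 1
    cap.release()
    return np.stack(frames)  # (num_frames, size, size, 3)
```

Each sampled frame would then pass through the vision backbone and compressor to become one frame token for the LLM.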
🍞 Hook: Think of a special bookmark that both points to a moment and stores a mini description of it. 🥬 The Concept (Stage 1: Pure Grounding with <evi>): The model first emits <evi> tokens to localize events without writing the final answer yet.
- How it works: (1) Generate an <evi>. (2) Compute similarity with each frame token. (3) Mark salient frames whose similarity exceeds a threshold (e.g., 60% of the max similarity). (4) Aggregate their features into the <evi> (average and add). (5) The time interval comes from indices of salient frames. (6) Repeat for multiple events in order.
- Why it matters: This step "locks in" where evidence lives and loads <evi> with the event's visual semantics. 🍞 Anchor: For "Find all 'tying something' actions," the model emits <evi> for 2.8–11.4s and another for 19.2–32.5s.
Details of visual semantic capture
- What happens: Before similarity, <evi> is projected by a small MLP to better serve two roles: (a) a generation token (via LM head) and (b) a query for similarity and aggregation.
- Why this step exists: It cleanly separates token classification from similarity-based grounding.
- Example: The projected <evi> aligns better with frame features when finding salient frames.
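As a concrete illustration of the similarity, thresholding, and aggregation steps above (including the small projection MLP), here is a hedged PyTorch sketch; the cosine similarity, the GELU MLP, and the module name are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidenceGrounder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # small MLP that turns the <evi> hidden state into a grounding query
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, evi_hidden, frame_feats, ratio=0.6):
        """evi_hidden: (dim,) hidden state of one <evi>; frame_feats: (num_frames, dim)."""
        query = self.proj(evi_hidden)
        sim = F.cosine_similarity(query.unsqueeze(0), frame_feats, dim=-1)
        sim = sim.clamp(min=0)                       # treat negative similarity as irrelevant
        salient = sim >= ratio * sim.max()           # frames above 60% of the peak similarity
        enriched = evi_hidden + frame_feats[salient].mean(dim=0)  # "average and add"
        return enriched, salient, sim

# The indices where `salient` is True give the event's time interval (at 1 FPS),
# and `enriched` is the semantics-carrying <evi> representation used downstream.
```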
Task transition signal
- What happens: The model emits </evi> to signal it's done grounding and ready to answer.
- Why this step exists: Clear stage boundaries reduce confusion during generation.
- Example: After three <evi> tokens, a </evi> says, "switch to answering now."
🍞 Hook: Imagine writing a report and, whenever you claim something, you insert a little tag that points to the exact clip that proves it. 🥬 The Concept (Stage 2: Interleaved Text + Evidence Referencing): The model now writes the answer but reuses <evi> tokens to reference the same grounded events.
- How it works: (1) Generate a piece of text. (2) Insert an <evi> that should match the earlier one. (3) Continue text, then another <evi> if needed. (4) Timestamps are read from frameā<evi> similarity again.
- Why it matters: This keeps the answer faithful to the grounded evidence. 🍞 Anchor: "In the dishwasher. The relevant event happens in <evi>." That <evi> decodes to the same 10.6–12.6s interval.
🍞 Hook: If your second note disagrees with your first note, you fix it. 🥬 The Concept (Consistency Constraint): Encourage the answer-stage <evi> tokens to match the grounding-stage <evi> tokens.
- How it works: Minimize the difference between their features, aligned event by event.
- Why it matters: Prevents quiet drift between stages. 🍞 Anchor: If the first-stage <evi> focused on frames 75–83, the answer-stage <evi> should too.
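One plausible way to write this constraint as a loss, assuming the answer-stage and grounding-stage <evi> features are paired event by event (a sketch; the exact distance and whether the grounding-stage side is detached are assumptions, not the paper's stated choice):

```python
import torch
import torch.nn.functional as F

def consistency_loss(stage1_evi, stage2_evi):
    """stage1_evi, stage2_evi: (num_events, dim) <evi> features from the two stages,
    aligned so row i of each tensor refers to the same event."""
    # Pull the answer-stage features toward the grounding-stage ones;
    # detaching stage 1 is one possible design choice.
    return F.mse_loss(stage2_evi, stage1_evi.detach())
```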
Supervision losses
- What happens: The total loss is L = L_sft + L_gnd + L_cons, where L_sft is the usual token-prediction loss, L_gnd is a frame-wise binary cross-entropy that compares frame-<evi> similarities to ground-truth foreground/background labels, and L_cons is the consistency term between <evi> pairs.
- Why this step exists: Each loss teaches a different, essential behavior: fluent text, accurate grounding, and cross-stage consistency.
- Example: For a cooking step at 44–57s, frames inside are positive (1), others negative (0) for L_gnd.
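A compact sketch of how the three terms could be combined, following the description above; the clamping of similarities into [0, 1] and the equal weighting of the terms are assumptions:

```python
import torch
import torch.nn.functional as F

def total_loss(text_logits, text_targets, frame_sims, frame_labels, stage1_evi, stage2_evi):
    """text_logits: (num_tokens, vocab); text_targets: (num_tokens,) token ids;
    frame_sims: (num_frames,) frame-<evi> similarities; frame_labels: (num_frames,) 0/1;
    stage1_evi, stage2_evi: (num_events, dim) paired <evi> features."""
    # L_sft: usual next-token prediction over the response
    l_sft = F.cross_entropy(text_logits, text_targets)
    # L_gnd: frame-wise BCE; frames inside the ground-truth interval are 1, others 0
    probs = frame_sims.clamp(1e-6, 1 - 1e-6)
    l_gnd = F.binary_cross_entropy(probs, frame_labels.float())
    # L_cons: keep answer-stage <evi> features close to the grounding-stage ones
    l_cons = F.mse_loss(stage2_evi, stage1_evi.detach())
    return l_sft + l_gnd + l_cons
```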
🍞 Hook: Don't just grade the essay; grade the map that the essay points to. 🥬 The Concept (Factorized Preference Optimization, FPO): Preference learning that scores both text and grounding quality.
- What it is: A DPO-like objective that compares a preferred response to a dispreferred one and pushes up the joint probability of the better one.
- How it works (step by step):
- For each response, compute text log-prob via the LLM (as usual).
- For each <evi>, compute grounding probability for its interval by multiplying per-frame similarities inside (and 1 − similarity outside), then take log and add across <evi>.
- Sum text and grounding log-probs to get the joint log-prob.
- Use a logistic loss to prefer the better response.
- Why it matters: The model learns to value precise evidence, not just pretty sentences. 🍞 Anchor: Two captions sound similar, but only one nails the moments; FPO rewards that one.
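The steps above can be sketched as follows. This is an illustrative, simplified reading of FPO: the grounding term multiplies per-frame similarities inside each <evi>'s interval and (1 − similarity) outside, and a DPO-style logistic loss compares the joint log-probabilities of the preferred and dispreferred responses. The β scale and the omission of a reference model are assumptions of this sketch, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def grounding_logprob(frame_sims, inside_mask, eps=1e-6):
    """frame_sims: (num_frames,) similarities for one <evi>, assumed in [0, 1];
    inside_mask: (num_frames,) bool, True for frames inside the predicted interval."""
    p = frame_sims.clamp(eps, 1 - eps)
    logp = torch.where(inside_mask, p.log(), (1 - p).log())
    return logp.sum()          # the caller sums this across all <evi> tokens in a response

def fpo_loss(text_logp_pref, gnd_logp_pref, text_logp_rej, gnd_logp_rej, beta=0.1):
    """All inputs are scalar tensors; push up the response whose joint
    (text + grounding) log-probability is higher."""
    joint_pref = text_logp_pref + gnd_logp_pref
    joint_rej = text_logp_rej + gnd_logp_rej
    return -F.logsigmoid(beta * (joint_pref - joint_rej))
```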
🍞 Hook: To learn what "wrong" looks like, you sometimes practice with realistic mistakes. 🥬 The Concept (Factorized Data Synthesis): A pipeline that creates controlled, event-level perturbations for both time and text.
- What it is: Synthetic pairs with known causes of error (temporal shifts/merges/adds/deletes and textual distortions/repeats) built at the sub-video event level.
- How it works: (1) Start from a correct response. (2) Randomly pick factors (grounding or text) and sub-factors (e.g., shift interval, distort key info). (3) Apply to chosen events only. (4) Form preferred vs. dispreferred pairs with known reasons.
- Why it matters: Reliable, scalable preference data without manual labels, and perfectly matched to FPO. 🍞 Anchor: In dense captioning, you might shift "add sugar" from 37–44s to 33–48s or replace "sugar" with "salt" to create a precise negative example.
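An illustrative sketch of this kind of event-level perturbation; the event fields, shift range, and word swap are hypothetical examples rather than the paper's exact factors and sub-factors:

```python
import copy
import random

def perturb_event(event, factor):
    """event: {'start': float, 'end': float, 'caption': str} for one grounded step."""
    bad = copy.deepcopy(event)
    if factor == "shift_interval":                 # temporal error: move/stretch the interval
        bad["start"] = max(0.0, bad["start"] + random.uniform(-4.0, 4.0))
        bad["end"] = max(bad["start"] + 0.5, bad["end"] + random.uniform(-4.0, 4.0))
    elif factor == "distort_text":                 # textual error: swap a key detail
        bad["caption"] = bad["caption"].replace("sugar", "salt")
    return bad

def make_preference_pair(events):
    """Perturb one randomly chosen event; the untouched response stays preferred."""
    rejected = copy.deepcopy(events)
    idx = random.randrange(len(rejected))
    factor = random.choice(["shift_interval", "distort_text"])
    rejected[idx] = perturb_event(rejected[idx], factor)
    return events, rejected, factor                # preferred, dispreferred, known error cause

# Example: a dense-captioning step like the sugar/salt case in the anchor above
steps = [{"start": 37.0, "end": 44.0, "caption": "add sugar to the bowl"}]
print(make_preference_pair(steps))
```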
Computational note
- Frame-wise similarity adds minimal overhead (~0.4 ms per token on a 3090 GPU), about 1.4% of the total forward time, keeping inference practical.
Putting it all together (example)
- Input: "Localize each step and briefly describe it."
- Stage 1: Emit three <evi> that find [75–83], [120–128], [126–128].
- Stage 2: "<evi>, remove the skin… <evi>, cut and dice… <evi>, enjoy your mangoes!" Each <evi> re-references the grounded intervals, and timestamps are read from similarity.
04 Experiments & Results
The test: The authors evaluate whether the model can (1) find the right times (temporal grounding) and (2) give good descriptions or answers (text quality). They measure this on diverse tasks so the model must handle single events, multiple events, long videos, and step-by-step procedures.
The competition: They compare against strong video LLMs, including larger ones (7B–13B), like Video-LLaVA, TimeChat, LITA, VTG-LLM, TRACE, E.T. Chat, and others, many tailored for temporal reasoning. Their model is only 3.8B, so winning shows that the method, not just size, matters.
The scoreboard with context:
- E.T. Bench Grounding (5 sub-tasks: moment retrieval, episodic memory, action localization, extractive summarization, highlight detection): D_VLM-3.8B raises the average F1 by at least 7.0% over recent SOTA. Think of this like jumping from a solid B to a clear A on a tough, multi-part exam.
- E.T. Bench Dense Captioning (dense video captioning, step localization & captioning): It leads in both grounding (at least +5.3% F1) and text similarity (at least +3.6% Sim). That's like being best both at finding the right clips and telling a good story about them.
- Charades-STA (moment retrieval): D_VLM-3.8B hits 50.3% R@1@0.5 and 26.0% R@1@0.7, beating prior 7B models and improving 4.4% over a closely related 3.8B baseline. Think of R@1@0.5 as "found the right moment with acceptable overlap" and @0.7 as "found it with tight overlap" (the metric is sketched in code right after this list).
- YouCook2 (dense captioning): F1 improves by at least 4%, with CIDEr and SODAc up by at least 2.5 and 1.0, respectively. This shows better event detection and richer, more accurate descriptions of cooking steps.
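For readers unfamiliar with the metric mentioned above, R@1@τ counts a prediction as a hit when the temporal IoU (overlap over union in time) between the top-1 predicted interval and the ground truth is at least τ; a minimal sketch:

```python
def temporal_iou(pred, gt):
    """pred, gt: (start_s, end_s) time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, tau=0.5):
    """Fraction of queries whose top-1 prediction overlaps the ground truth by at least tau."""
    hits = sum(temporal_iou(p, g) >= tau for p, g in zip(preds, gts))
    return hits / len(gts)

print(recall_at_1([(10.0, 14.0)], [(11.0, 15.0)], tau=0.5))  # IoU = 3/5 = 0.6 -> 1.0
```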
Surprising findings and insights:
- Event-level visual semantic capture matters a lot: Removing it drops text quality noticeably, especially for dense captioning. This confirms that <evi> should be more than a timestamp; it should carry event meaning.
- Decoupling helps both halves: Switching from a coupled objective to "ground first, then answer with referencing" gives big jumps in both grounding and text metrics. That's strong evidence that clean stage separation simplifies learning.
- Interleaved answering with <evi> referencing beats plain text answering. In other words, explicit referencing keeps the answer faithful and boosts performance further.
- The consistency constraint adds a final polish, improving coherence between stages.
- FPO gives an extra lift, especially on grounding tasks: when training explicitly prefers better grounding, the model localizes events more accurately.
Ablations in plain language:
- Just decouple (ground then answer): big gains.
- Add interleaved text + <evi> referencing: more gains.
- Add consistency constraint: steady boost.
- Remove event-level modeling: big drop; interval-level reasoning is crucial.
- Remove visual semantic capture: noticeable drop; text needs those semantics.
- Add FPO: consistent improvements, strongest for grounding.
General QA benchmarks:
- On MVBench and Video-MME, the method matches or beats some general models trained on more generic data and clearly beats grounding-focused baselines, despite no huge pretraining. This suggests the approach is versatile, and scaling data/model size could push it even further.
Bottom line: Even at 3.8B, the method consistently outperforms many larger SOTAs across grounding-heavy tasks and dense captioning. The recipe (decouple, pack semantics into <evi>, reference evidence in answers, and optimize preferences for both text and time) pays off.
05 Discussion & Limitations
Limitations:
- Some scores are still low on tough tasks like episodic memory and YouCook2 dense captioning, meaning complex multi-event reasoning with long contexts remains challenging.
- The preference data synthesis focuses primarily on generating controlled negatives (dispreferred) rather than a broad palette of positive alternatives, which could limit diversity in what the model prefers.
- While frame-wise similarity adds little overhead, it still introduces an extra step; extremely long videos at high FPS could increase compute.
- The framework assumes events can be captured well at the sampling rate (e.g., 1 FPS). Very short, fleeting actions might need denser sampling or multi-scale features.
Required resources:
- A vision backbone (e.g., ViT-G/14 from EVA-CLIP) plus a Q-Former-like compressor and a compact LLM (e.g., Phi-3-Mini-3.8B). The authors fine-tune with LoRA, making training feasible on 4× H100 in about a day on the dataset used.
- A preference synthesis pipeline (provided) to create controlled perturbations at the event level.
When not to use:
- Tasks without meaningful temporal events (pure image QA) don't benefit from temporal grounding machinery.
- If you only need a fluent summary without caring about exact evidence timing, simpler text-only models may suffice.
- Ultra-long videos where precise, dense, frame-level alignment is unnecessary (e.g., hour-long vibe summaries) might not need <evi>-style referencing.
Open questions:
- How far can this approach scale with larger base models and broader, generic pretraining? Could we keep the strong grounding while boosting general reasoning?
- Can we design positive preference data synthesis (not just negatives) to teach the model multiple good ways to answer faithfully?
- Could multi-scale or hierarchical <evi> tokens capture both short micro-events and long macro-events elegantly?
- How might audio or sensor streams be folded into <evi> so grounding uses more than just visuals?
- Can we make the similarity-to-probability mapping more robust or uncertainty-aware, especially for ambiguous or overlapping events?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that video-language models work better when they first ground events in time and then answer while explicitly referencing those grounded events. It introduces evidence tokens that carry both timestamps and event-level visual semantics, plus a preference-learning method (FPO) that rewards both good grounding and good text. With this design, a compact 3.8B model surpasses many larger state-of-the-art systems on temporal grounding and dense captioning benchmarks.
Main achievement: Decoupling "when" from "what" with an evidence-referencing design, powered by <evi> tokens that absorb visual semantics, and optimizing both halves with FPO, leading to clear, consistent gains across tasks.
Future directions: Scale up model and data to combine strong grounding with broader general reasoning; enrich preference data with diverse positive samples; explore multi-scale <evi> for both short and long events; integrate audio; and refine similarity-based probabilities with uncertainty modeling.
Why remember this: It reframes video QA as "ground first, then answer with proof," turning timestamps from bare numbers into semantic anchors. That small shift (packing meaning into evidence tokens and training preferences for both text and time) makes answers more trustworthy and models more teachable.
Practical Applications
- Sports highlight editors that automatically find goals, fouls, or aces and summarize them with precise timestamps.
- Cooking assistants that mark each step's interval and describe the action clearly for easy follow-along.
- Customer support tools that pinpoint when a device error occurs in a troubleshooting video and explain the fix.
- Education platforms that locate key moments in lectures (definitions, proofs, demos) with concise notes.
- Workplace safety systems that localize near-misses in surveillance footage and generate incident summaries.
- Video search engines that return exact moments for multi-event queries (e.g., "all tying knots" in a tutorial).
- Content moderation that grounds policy-violating moments and explains the decision with referenced evidence.
- Fitness apps that detect and timestamp reps or form mistakes, then provide grounded coaching tips.
- Smart video editing that auto-selects highlights and captions them with event-level references.
- Legal discovery tools that locate relevant scenes in hours of footage and produce grounded, explainable snippets.