Factorized Learning for Temporally Grounded Video-Language Models
Key Summary
- This paper teaches video-language models to first find when the proof happens in a video and then answer with that proof, instead of mixing both steps together.
- It introduces evidence tokens (<evi>) that not only point to timestamps but also carry the visual meaning of the event, like a smart bookmark with a summary.
- The method uses a two-stage recipe: pure grounding (find events) and then interleaved text-and-evidence answering to stay consistent with what was found.
- A new training method, Factorized Preference Optimization (FPO), learns preferences for both correct grounding and correct text, not just text.
- FPO models grounding probability directly from frame-<evi> similarities, turning similarities into a likelihood that an event is correctly localized.
- A synthetic data pipeline creates controlled, event-level mistakes (in time and text) so the model can learn what not to do without expensive labeling.
- Across multiple benchmarks (E.T. Bench, Charades-STA, YouCook2), the 3.8B model beats many larger state-of-the-art models, especially on temporal grounding.
- Decoupling the tasks and adding explicit visual semantics to <evi> tokens make training more stable and answers more faithful to the video.
- The approach remains lightweight, uses LoRA fine-tuning, and adds negligible per-token runtime overhead for the similarity calculations.
- Limitations include still-low scores on some tasks and a need to generate more diverse positive preference samples in the future.
Why This Research Matters
People want answers they can trust, especially when videos are long and details matter. By grounding first and answering with explicit references, this method shows exactly where the proof is and what it means. That helps in sports analysis, cooking instructions, product demos, safety reviews, and classroom videos. It also makes smaller models competitive, which can reduce costs and widen access. The synthetic preference pipeline lowers the need for expensive labeling. Overall, it pushes AI toward clearer, checkable, and fairer explanations of what really happens on screen.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how, when you watch a movie with a friend, one of you keeps track of when the key scenes happen ("The surprise reveal is at 52 minutes!") while the other explains what those scenes mean? Mixing those two jobs can get confusing.
🥬 The Concept (Video-Language Models, VLMs): A video-language model is a computer program that watches videos and answers questions about them in text.
- How it works: (1) It turns video frames into numbers (features). (2) It reads your question as text. (3) It generates an answer token by token. (4) Some models also try to tell you when (timestamps) the important proof happens in the video.
- Why it matters: Without VLMs, you'd have to watch long videos yourself just to find moments like "when the dog jumps" or "when the chef adds salt."
🍞 Anchor: Imagine asking, "When does the soccer goal happen, and who scores?" A good VLM should point to the exact seconds and explain what happened.
🍞 Hook: Imagine you're a detective with a timeline on the wall. If you can't pin evidence to the right minute, your whole case falls apart.
🥬 The Concept (Temporal Grounding): Temporal grounding means finding the exact time intervals in a video where the important events (evidence) happen.
- How it works: (1) Look through frames. (2) Decide which frames belong to the event. (3) Group them into one or more intervals. (4) Use those intervals as proof.
- Why it matters: If grounding is wrong, your answer will likely be wrong, like blaming the wrong suspect in a mystery.
🍞 Anchor: For "When does the cat knock over the glass?", temporal grounding should return something like 10.6s–12.6s.
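To make the "decide which frames belong" and "group them into intervals" steps concrete, here is a minimal Python sketch (not from the paper) that turns a per-frame relevance mask into time intervals, assuming frames are sampled at 1 FPS so frame indices map directly to seconds:

```python
def frames_to_intervals(salient, fps=1.0):
    """Group a 0/1 per-frame mask into (start_s, end_s) intervals."""
    intervals, start = [], None
    for i, flag in enumerate(salient):
        if flag and start is None:
            start = i                      # an event interval opens at this frame
        elif not flag and start is not None:
            intervals.append((start / fps, i / fps))
            start = None
    if start is not None:                  # the last event runs to the final frame
        intervals.append((start / fps, len(salient) / fps))
    return intervals

# Frames 10-12 flagged as relevant -> [(10.0, 13.0)]
print(frames_to_intervals([0] * 10 + [1] * 3 + [0] * 5))
```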
🍞 Hook: Think of telling a friend about a game highlight: first you jump to the clip, then you describe it. If you just describe without the clip, they may not believe you.
🥬 The Concept (Textual Response): Textual response is the explanation in words, based on the grounded video evidence.
- How it works: (1) Use the grounded moments. (2) Read visual details. (3) Convert them into a sentence that answers the question.
- Why it matters: Without a clear description, people don't know what the evidence means.
🍞 Anchor: "The player in the red jersey scores at 55–58s." That's a textual response tied to time.
🍞 Hook: Picture doing math homework. If you try to solve and explain at the exact same time, you might mix steps. Solving first, then explaining, is easier.
🥬 The Concept (Coupled vs. Decoupled Learning): Many past methods mixed grounding and answering into one tangled process.
- How it works (before): Generate text and timestamps together, with special time tokens sprinkled in the sentence.
- Why it breaks: The model can get confused about what to do next, and special tokens often act like numbers (timestamps) without capturing the event's meaning.
🍞 Anchor: It's like trying to measure, cut, and paint wood at once; you'll probably spill paint on the ruler.
🍞 Hook: Imagine a team with two clear jobs: one finds the clips, the other writes the story, but they coordinate closely.
🥬 The Concept (Factorized Learning): Factorized learning means breaking a big task into parts, training each with a clear objective, and keeping their connection strong.
- How it works: (1) First, do pure grounding. (2) Then, answer by referencing what you grounded. (3) Use special signals to keep both stages consistent.
- Why it matters: Clear jobs lead to cleaner learning and better results.
🍞 Anchor: First mark where a basketball dunk happens, then describe the dunk while pointing back to the marked clip.
The world before: Video-language models could answer questions and sometimes output timestamps with special tokens. But two big issues kept showing up. First, the model often learned a single, mixed objective: it tried to localize events and write text in the same breath. That muddled its training signal. Second, grounding tokens mostly acted like timestamp indices; they didn't carry the event's visual meaning. So even if a model guessed the right time, it didn't always understand what happened there.
The problem: We need a model that (a) precisely finds event intervals, (b) deeply understands the visual semantics within those intervals, and (c) writes answers that explicitly reference those grounded events.
Failed attempts: Prior work added more and more special tokens, or even extra decoders, but still learned grounding and text together. Models got better at printing times, but not at capturing the event-level meaning that helps the next words be right.
The gap: A missing structure that teaches "ground first, then answer," while also forcing the answer to match the grounded evidence. And a missing way to train preferences not just for better text, but for better grounding too.
Real stakes: In everyday life, this matters for: (1) Sports highlights: find and explain the goal, not just guess the minute. (2) Cooking or DIY: point to each step and describe it correctly. (3) Safety reviews: show precisely when a near-miss happened, then clearly explain it. (4) Education: locate and explain key moments in lecture videos. Precise, trustworthy answers save time and build confidence.
02 Core Idea
🍞 Hook: You know how a good YouTube recap first shows the exact moment of the big play, then explains what made it special? That order makes it easy to trust.
🥬 The Concept (Aha!): The key insight is to factor the task: first ground the evidence in time, then answer while explicitly referencing that evidence using special evidence tokens that also carry visual meaning, and train preferences for both parts with FPO.
- How it works: (1) Stage 1, Pure Grounding: generate <evi> tokens that latch onto the right frames and absorb their visual semantics. (2) Stage 2, Interleaved Answering: write text and re-generate matching <evi> tokens to reference the same events. (3) Enforce a consistency constraint so the second-stage <evi> tokens align with the first-stage ones. (4) Train with Factorized Preference Optimization (FPO) that rewards both good grounding and good text.
- Why it matters: Without factoring, the model mixes goals and confuses itself; without visual semantics in <evi>, text lacks solid context; without FPO, alignment ignores grounding quality.
🍞 Anchor: "Where do I put the glassware?" The model first grounds the clip where you place it (e.g., 10.6–12.6s), then answers "In the dishwasher," while referencing the same <evi> interval.
Three analogies for the same idea:
- Librarian analogy: First, find the right chapter (grounding). Then, explain the plot while pointing to the passages (answering with evidence). The sticky note (<evi>) doesn't just mark a page; it summarizes the key scene.
- Detective analogy: Pin the suspect's timeline to the board (grounding), then present the case citing those exact timestamps (answering). The pin (<evi>) includes a brief note of what happened there.
- Cooking show analogy: Bookmark each step when it happens (grounding), then narrate the recipe, reusing those bookmarks to remind viewers exactly where each step occurred (answering with referencing).
Before vs. After:
- Before: Time tokens were interleaved with text, objectives were coupled, and timestamp tokens lacked event meaning. Answers could drift from the true evidence.
- After: Two clear stages, <evi> tokens that carry event-level visual semantics, explicit consistency between stages, and FPO that optimizes both grounding and text.
Why it works (intuition):
- Separating the jobs simplifies learning: the model knows when to find evidence and when to explain it.
- Packing visual semantics into <evi> gives the language model a rich, local context right where it needs it: at the next token.
- Evidence referencing keeps the answer honest: reusing the same <evi> ties the text back to the original grounded moments.
- FPO closes the loop: the model is rewarded not only for nice-sounding text but also for precise grounding, measured via frame-<evi> similarity.
Building blocks (each introduced with a sandwich):
🍞 Hook: Imagine a smart bookmark that also stores a tiny summary of the scene it marks. 🥬 The Concept (Evidence Token, <evi>): An <evi> token is a special token that grounds an event in time and absorbs the event's visual meaning.
- How it works: (1) Generate <evi>. (2) Compute similarity between <evi> and each frame feature. (3) Pick salient frames with high similarity. (4) Aggregate their features into <evi> (e.g., average and add). (5) Use frame indices of salient frames to form timestamps.
- Why it matters: If <evi> only carried a number, the next words might drift. With visual semantics inside, the text stays tied to what actually happened. 🍞 Anchor: For "When does the dog fetch the ball?", <evi> pulls in frames of the fetch and helps the model say, "The dog fetches the ball at 23.5–26.1s."
🍞 Hook: Think of switching from scouting to reporting: first you scout locations, then you write the story. 🥬 The Concept (Grounding → Answering with Referencing): The response is generated in two stages: pure grounding, then interleaved text + evidence referencing.
- How it works: (1) Stage 1: Emit a sequence of <evi> tokens that localize each event. (2) Emit </evi> to signal the switch. (3) Stage 2: Write text and re-emit matching <evi> tokens so timestamps come from the same grounded evidence.
- Why it matters: Answers stay consistent with the original evidence; no drifting. 🍞 Anchor: "Step 1 at <evi>, Step 2 at <evi>…" Each <evi> in the answer matches the ones found earlier.
🍞 Hook: If your second report disagrees with your first notes, something's off. 🥬 The Concept (Consistency Constraint): Force <evi> tokens in the answer stage to align with those from the grounding stage.
- How it works: Compare their features and minimize the difference.
- Why it matters: Prevents the answer from quietly changing the evidence. 🍞 Anchor: If grounding first said 12–15s, the answer should reference the same interval via a matching <evi>.
🍞 Hook: Imagine grading both the map and the essay, not just the essay. 🥬 The Concept (Factorized Preference Optimization, FPO): A training rule that increases the model's preference for responses that have both better grounding and better text.
- How it works: (1) Compute the usual text log-prob. (2) Compute grounding log-prob from frame-<evi> similarities (product across the interval). (3) Compare preferred vs. dispreferred pairs and push up the better one.
- Why it matters: Text alone isn't enough; we must also prefer precise evidence. 🍞 Anchor: If two answers sound okay but only one precisely marks the goal clip, FPO favors the precise one.
03 Methodology
High-level pipeline: Input (Video + Question) → Visual/Text Encoding → Stage 1: Pure Grounding with <evi> → Stage Switch (</evi>) → Stage 2: Interleaved Text + <evi> Answering → Output (Text + Timestamps from <evi>)
Step 0: Encoders
- What happens: Frames are sampled (e.g., 1 FPS), resized (e.g., 224×224), and encoded by a vision backbone (e.g., ViT-G/14 via EVA-CLIP) and a feature compressor. The question is tokenized by the language model. Both streams flow into a video-aware LLM decoder.
- Why this step exists: The LLM needs compact, aligned tokens to reason about video and language together.
- Example: A 100-second video becomes 100 frame tokens; the question "Where did I put the glassware?" becomes text tokens.
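A minimal sketch of the frame-sampling step described above, using OpenCV as an illustrative stand-in; the paper's exact preprocessing, vision backbone, and compressor are not shown here and the fallback frame rate is an assumption:

```python
import cv2
import numpy as np

def sample_frames(video_path, target_fps=1.0, size=224):
    """Sample roughly target_fps frames per second and resize each to size x size."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if metadata is missing
    step = max(int(round(native_fps / target_fps)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.resize(frame, (size, size)))
        idx += 1
    cap.release()
    return np.stack(frames)  # (num_frames, size, size, 3)
```

Each sampled frame would then pass through the vision backbone and compressor to become one frame token for the LLM.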
🍞 Hook: Think of a special bookmark that both points to a moment and stores a mini description of it. 🥬 The Concept (Stage 1: Pure Grounding with <evi>): The model first emits <evi> tokens to localize events without writing the final answer yet.
- How it works: (1) Generate an <evi>. (2) Compute similarity with each frame token. (3) Mark salient frames whose similarity exceeds a threshold (e.g., 60% of the max similarity). (4) Aggregate their features into the <evi> (average and add). (5) The time interval comes from indices of salient frames. (6) Repeat for multiple events in order.
- Why it matters: This step "locks in" where evidence lives and loads <evi> with the event's visual semantics. 🍞 Anchor: For "Find all 'tying something' actions," the model emits <evi> for 2.8–11.4s and another for 19.2–32.5s.
Details of visual semantic capture
- What happens: Before similarity, <evi> is projected by a small MLP to better serve two roles: (a) a generation token (via LM head) and (b) a query for similarity and aggregation.
- Why this step exists: It cleanly separates token classification from similarity-based grounding.
- Example: The projected <evi> aligns better with frame features when finding salient frames.
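As a concrete illustration of the similarity, thresholding, and aggregation steps above (including the small projection MLP), here is a hedged PyTorch sketch; the cosine similarity, the GELU MLP, and the module name are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidenceGrounder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # small MLP that turns the <evi> hidden state into a grounding query
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, evi_hidden, frame_feats, ratio=0.6):
        """evi_hidden: (dim,) hidden state of one <evi>; frame_feats: (num_frames, dim)."""
        query = self.proj(evi_hidden)
        sim = F.cosine_similarity(query.unsqueeze(0), frame_feats, dim=-1)
        sim = sim.clamp(min=0)                       # treat negative similarity as irrelevant
        salient = sim >= ratio * sim.max()           # frames above 60% of the peak similarity
        enriched = evi_hidden + frame_feats[salient].mean(dim=0)  # "average and add"
        return enriched, salient, sim

# The indices where `salient` is True give the event's time interval (at 1 FPS),
# and `enriched` is the semantics-carrying <evi> representation used downstream.
```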
Task transition signal
- What happens: The model emits </evi> to signal it's done grounding and ready to answer.
- Why this step exists: Clear stage boundaries reduce confusion during generation.
- Example: After three <evi> tokens, a </evi> says, "switch to answering now."
🍞 Hook: Imagine writing a report and, whenever you claim something, you insert a little tag that points to the exact clip that proves it. 🥬 The Concept (Stage 2: Interleaved Text + Evidence Referencing): The model now writes the answer but reuses <evi> tokens to reference the same grounded events.
- How it works: (1) Generate a piece of text. (2) Insert an <evi> that should match the earlier one. (3) Continue text, then another <evi> if needed. (4) Timestamps are read from frameā<evi> similarity again.
- Why it matters: This keeps the answer faithful to the grounded evidence. 🍞 Anchor: "In the dishwasher. The relevant event happens in <evi>." That <evi> decodes to the same 10.6–12.6s interval.
🍞 Hook: If your second note disagrees with your first note, you fix it. 🥬 The Concept (Consistency Constraint): Encourage the answer-stage <evi> tokens to match the grounding-stage <evi> tokens.
- How it works: Minimize the difference between their features, aligned event by event.
- Why it matters: Prevents quiet drift between stages. 🍞 Anchor: If the first-stage <evi> focused on frames 75–83, the answer-stage <evi> should too.
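One plausible way to write this constraint as a loss, assuming the answer-stage and grounding-stage <evi> features are paired event by event (a sketch; the exact distance and whether the grounding-stage side is detached are assumptions, not the paper's stated choice):

```python
import torch
import torch.nn.functional as F

def consistency_loss(stage1_evi, stage2_evi):
    """stage1_evi, stage2_evi: (num_events, dim) <evi> features from the two stages,
    aligned so row i of each tensor refers to the same event."""
    # Pull the answer-stage features toward the grounding-stage ones;
    # detaching stage 1 is one possible design choice.
    return F.mse_loss(stage2_evi, stage1_evi.detach())
```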
Supervision losses
- What happens: The total loss is L = L_sft + L_gnd + L_cons, where L_sft is the usual token-prediction loss, L_gnd is a frame-wise binary cross-entropy that compares frame-<evi> similarities to ground-truth foreground/background labels, and L_cons is the consistency term between <evi> pairs.
- Why this step exists: Each loss teaches a different, essential behavior: fluent text, accurate grounding, and cross-stage consistency.
- Example: For a cooking step at 44–57s, frames inside are positive (1), others negative (0) for L_gnd.
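A compact sketch of how the three terms could be combined, following the description above; the clamping of similarities into [0, 1] and the equal weighting of the terms are assumptions:

```python
import torch
import torch.nn.functional as F

def total_loss(text_logits, text_targets, frame_sims, frame_labels, stage1_evi, stage2_evi):
    """text_logits: (num_tokens, vocab); text_targets: (num_tokens,) token ids;
    frame_sims: (num_frames,) frame-<evi> similarities; frame_labels: (num_frames,) 0/1;
    stage1_evi, stage2_evi: (num_events, dim) paired <evi> features."""
    # L_sft: usual next-token prediction over the response
    l_sft = F.cross_entropy(text_logits, text_targets)
    # L_gnd: frame-wise BCE; frames inside the ground-truth interval are 1, others 0
    probs = frame_sims.clamp(1e-6, 1 - 1e-6)
    l_gnd = F.binary_cross_entropy(probs, frame_labels.float())
    # L_cons: keep answer-stage <evi> features close to the grounding-stage ones
    l_cons = F.mse_loss(stage2_evi, stage1_evi.detach())
    return l_sft + l_gnd + l_cons
```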
🍞 Hook: Don't just grade the essay; grade the map that the essay points to. 🥬 The Concept (Factorized Preference Optimization, FPO): Preference learning that scores both text and grounding quality.
- What it is: A DPO-like objective that compares a preferred response to a dispreferred one and pushes up the joint probability of the better one.
- How it works (step by step):
- For each response, compute text log-prob via the LLM (as usual).
- For each <evi>, compute grounding probability for its interval by multiplying per-frame similarities inside (and 1 − similarity outside), then take log and add across <evi>.
- Sum text and grounding log-probs to get the joint log-prob.
- Use a logistic loss to prefer the better response.
- Why it matters: The model learns to value precise evidence, not just pretty sentences. 🍞 Anchor: Two captions sound similar, but only one nails the moments; FPO rewards that one.
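The steps above can be sketched as follows. This is an illustrative, simplified reading of FPO: the grounding term multiplies per-frame similarities inside each <evi>'s interval and (1 − similarity) outside, and a DPO-style logistic loss compares the joint log-probabilities of the preferred and dispreferred responses. The β scale and the omission of a reference model are assumptions of this sketch, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def grounding_logprob(frame_sims, inside_mask, eps=1e-6):
    """frame_sims: (num_frames,) similarities for one <evi>, assumed in [0, 1];
    inside_mask: (num_frames,) bool, True for frames inside the predicted interval."""
    p = frame_sims.clamp(eps, 1 - eps)
    logp = torch.where(inside_mask, p.log(), (1 - p).log())
    return logp.sum()          # the caller sums this across all <evi> tokens in a response

def fpo_loss(text_logp_pref, gnd_logp_pref, text_logp_rej, gnd_logp_rej, beta=0.1):
    """All inputs are scalar tensors; push up the response whose joint
    (text + grounding) log-probability is higher."""
    joint_pref = text_logp_pref + gnd_logp_pref
    joint_rej = text_logp_rej + gnd_logp_rej
    return -F.logsigmoid(beta * (joint_pref - joint_rej))
```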
🍞 Hook: To learn what "wrong" looks like, you sometimes practice with realistic mistakes. 🥬 The Concept (Factorized Data Synthesis): A pipeline that creates controlled, event-level perturbations for both time and text.
- What it is: Synthetic pairs with known causes of error (temporal shifts/merges/adds/deletes and textual distortions/repeats) built at the sub-video event level.
- How it works: (1) Start from a correct response. (2) Randomly pick factors (grounding or text) and sub-factors (e.g., shift interval, distort key info). (3) Apply to chosen events only. (4) Form preferred vs. dispreferred pairs with known reasons.
- Why it matters: Reliable, scalable preference data without manual labels, and perfectly matched to FPO. 🍞 Anchor: In dense captioning, you might shift "add sugar" from 37–44s to 33–48s or replace "sugar" with "salt" to create a precise negative example.
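An illustrative sketch of this kind of event-level perturbation; the event fields, shift range, and word swap are hypothetical examples rather than the paper's exact factors and sub-factors:

```python
import copy
import random

def perturb_event(event, factor):
    """event: {'start': float, 'end': float, 'caption': str} for one grounded step."""
    bad = copy.deepcopy(event)
    if factor == "shift_interval":                 # temporal error: move/stretch the interval
        bad["start"] = max(0.0, bad["start"] + random.uniform(-4.0, 4.0))
        bad["end"] = max(bad["start"] + 0.5, bad["end"] + random.uniform(-4.0, 4.0))
    elif factor == "distort_text":                 # textual error: swap a key detail
        bad["caption"] = bad["caption"].replace("sugar", "salt")
    return bad

def make_preference_pair(events):
    """Perturb one randomly chosen event; the untouched response stays preferred."""
    rejected = copy.deepcopy(events)
    idx = random.randrange(len(rejected))
    factor = random.choice(["shift_interval", "distort_text"])
    rejected[idx] = perturb_event(rejected[idx], factor)
    return events, rejected, factor                # preferred, dispreferred, known error cause

# Example: a dense-captioning step like the sugar/salt case in the anchor above
steps = [{"start": 37.0, "end": 44.0, "caption": "add sugar to the bowl"}]
print(make_preference_pair(steps))
```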
Computational note
- Frame-wise similarity adds minimal overhead (~0.4 ms per token on a 3090 GPU), about 1.4% of the total forward time, keeping inference practical.
Putting it all together (example)
- Input: "Localize each step and briefly describe it."
- Stage 1: Emit three <evi> that find [75–83], [120–128], [126–128].
- Stage 2: "<evi>, remove the skin… <evi>, cut and dice… <evi>, enjoy your mangoes!" Each <evi> re-references the grounded intervals, and timestamps are read from similarity.
04 Experiments & Results
The test: The authors evaluate whether the model can (1) find the right times (temporal grounding) and (2) give good descriptions or answers (text quality). They measure this on diverse tasks so the model must handle single events, multiple events, long videos, and step-by-step procedures.
The competition: They compare against strong video LLMs, including larger ones (7B–13B), like Video-LLaVA, TimeChat, LITA, VTG-LLM, TRACE, E.T. Chat, and others, many tailored for temporal reasoning. Their model is only 3.8B, so winning shows that the method, not just size, matters.
The scoreboard with context:
- E.T. Bench Grounding (5 sub-tasks: moment retrieval, episodic memory, action localization, extractive summarization, highlight detection): D_VLM-3.8B raises the average F1 by at least 7.0% over recent SOTA. Think of this like jumping from a solid B to a clear A on a tough, multi-part exam.
- E.T. Bench Dense Captioning (dense video captioning, step localization & captioning): It leads in both grounding (at least +5.3% F1) and text similarity (at least +3.6% Sim). That's like being best both at finding the right clips and telling a good story about them.
- Charades-STA (moment retrieval): D_VLM-3.8B hits 50.3% R@1@0.5 and 26.0% R@1@0.7, beating prior 7B models and improving 4.4% over a closely related 3.8B baseline. Think of R@1@0.5 as "found the right moment with acceptable overlap" and @0.7 as "found it with tight overlap" (the metric is sketched in code right after this list).
- YouCook2 (dense captioning): F1 improves by at least 4%, with CIDEr and SODAc up by at least 2.5 and 1.0, respectively. This shows better event detection and richer, more accurate descriptions of cooking steps.
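For readers unfamiliar with the metric mentioned above, R@1@τ counts a prediction as a hit when the temporal IoU (overlap over union in time) between the top-1 predicted interval and the ground truth is at least τ; a minimal sketch:

```python
def temporal_iou(pred, gt):
    """pred, gt: (start_s, end_s) time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, tau=0.5):
    """Fraction of queries whose top-1 prediction overlaps the ground truth by at least tau."""
    hits = sum(temporal_iou(p, g) >= tau for p, g in zip(preds, gts))
    return hits / len(gts)

print(recall_at_1([(10.0, 14.0)], [(11.0, 15.0)], tau=0.5))  # IoU = 3/5 = 0.6 -> 1.0
```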
Surprising findings and insights:
- Event-level visual semantic capture matters a lot: Removing it drops text quality noticeably, especially for dense captioning. This confirms that <evi> should be more than a timestamp; it should carry event meaning.
- Decoupling helps both halves: Switching from a coupled objective to "ground first, then answer with referencing" gives big jumps in both grounding and text metrics. That's strong evidence that clean stage separation simplifies learning.
- Interleaved answering with <evi> referencing beats plain text answering. In other words, explicit referencing keeps the answer faithful and boosts performance further.
- The consistency constraint adds a final polish, improving coherence between stages.
- FPO gives an extra lift, especially on grounding tasks: when training explicitly prefers better grounding, the model localizes events more accurately.
Ablations in plain language:
- Just decouple (ground then answer): big gains.
- Add interleaved text + <evi> referencing: more gains.
- Add consistency constraint: steady boost.
- Remove event-level modeling: big drop; interval-level reasoning is crucial.
- Remove visual semantic capture: noticeable drop; text needs those semantics.
- Add FPO: consistent improvements, strongest for grounding.
General QA benchmarks:
- On MVBench and Video-MME, the method matches or beats some general models trained on more generic data and clearly beats grounding-focused baselines, despite no huge pretraining. This suggests the approach is versatile, and scaling data/model size could push it even further.
Bottom line: Even at 3.8B, the method consistently outperforms many larger SOTAs across grounding-heavy tasks and dense captioning. The recipe (decouple, pack semantics into <evi>, reference evidence in answers, and optimize preferences for both text and time) pays off.
05 Discussion & Limitations
Limitations:
- Some scores are still low on tough tasks like episodic memory and YouCook2 dense captioning, meaning complex multi-event reasoning with long contexts remains challenging.
- The preference data synthesis focuses primarily on generating controlled negatives (dispreferred) rather than a broad palette of positive alternatives, which could limit diversity in what the model prefers.
- While frame-wise similarity adds little overhead, it still introduces an extra step; extremely long videos at high FPS could increase compute.
- The framework assumes events can be captured well at the sampling rate (e.g., 1 FPS). Very short, fleeting actions might need denser sampling or multi-scale features.
Required resources:
- A vision backbone (e.g., ViT-G/14 from EVA-CLIP) plus a Q-Former-like compressor and a compact LLM (e.g., Phi-3-Mini-3.8B). The authors fine-tune with LoRA, making training feasible on 4× H100 in about a day on the dataset used.
- A preference synthesis pipeline (provided) to create controlled perturbations at the event level.
When not to use:
- Tasks without meaningful temporal events (pure image QA) don't benefit from temporal grounding machinery.
- If you only need a fluent summary without caring about exact evidence timing, simpler text-only models may suffice.
- Ultra-long videos where precise, dense, frame-level alignment is unnecessary (e.g., hour-long vibe summaries) might not need <evi>-style referencing.
Open questions:
- How far can this approach scale with larger base models and broader, generic pretraining? Could we keep the strong grounding while boosting general reasoning?
- Can we design positive preference data synthesis (not just negatives) to teach the model multiple good ways to answer faithfully?
- Could multi-scale or hierarchical <evi> tokens capture both short micro-events and long macro-events elegantly?
- How might audio or sensor streams be folded into <evi> so grounding uses more than just visuals?
- Can we make the similarity-to-probability mapping more robust or uncertainty-aware, especially for ambiguous or overlapping events?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that video-language models work better when they first ground events in time and then answer while explicitly referencing those grounded events. It introduces evidence tokens that carry both timestamps and event-level visual semantics, plus a preference-learning method (FPO) that rewards both good grounding and good text. With this design, a compact 3.8B model surpasses many larger state-of-the-art systems on temporal grounding and dense captioning benchmarks.
Main achievement: Decoupling "when" from "what" with an evidence-referencing design, powered by <evi> tokens that absorb visual semantics, and optimizing both halves with FPO, leading to clear, consistent gains across tasks.
Future directions: Scale up model and data to combine strong grounding with broader general reasoning; enrich preference data with diverse positive samples; explore multi-scale <evi> for both short and long events; integrate audio; and refine similarity-based probabilities with uncertainty modeling.
Why remember this: It reframes video QA as "ground first, then answer with proof," turning timestamps from bare numbers into semantic anchors. That small shift (packing meaning into evidence tokens and training preferences for both text and time) makes answers more trustworthy and models more teachable.
Practical Applications
- Sports highlight editors that automatically find goals, fouls, or aces and summarize them with precise timestamps.
- Cooking assistants that mark each step's interval and describe the action clearly for easy follow-along.
- Customer support tools that pinpoint when a device error occurs in a troubleshooting video and explain the fix.
- Education platforms that locate key moments in lectures (definitions, proofs, demos) with concise notes.
- Workplace safety systems that localize near-misses in surveillance footage and generate incident summaries.
- Video search engines that return exact moments for multi-event queries (e.g., "all tying knots" in a tutorial).
- Content moderation that grounds policy-violating moments and explains the decision with referenced evidence.
- Fitness apps that detect and timestamp reps or form mistakes, then provide grounded coaching tips.
- Smart video editing that auto-selects highlights and captions them with event-level references.
- Legal discovery tools that locate relevant scenes in hours of footage and produce grounded, explainable snippets.