VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding

Intermediate
Jiapeng Shi, Junke Wang, Zuyao You et al. · 1/12/2026
arXiv · PDF

Key Summary

  • VideoLoom is a single AI model that can tell both when something happens in a video and where it happens, at the pixel level.
  • It uses two kinds of visual tokens—fast for many low-detail frames (time) and slow for a few high-detail frames (space)—so it sees both motion and details.
  • A new dataset, LoomData-8.7k, teaches the model using videos that have both exact timestamps and precise person masks.
  • A clever bridge token, [SEG], lets the language model hand off its target to a video segmenter (SAM2) to paint the object exactly.
  • On standard tests, VideoLoom hits state-of-the-art or very strong results in both temporal grounding and video object segmentation.
  • The team built LoomBench to test questions that mix space and time (Combined), not just one or the other.
  • VideoLoom beats a strong two-step baseline on LoomBench Combined questions by large margins in both time overlap and mask quality.
  • The method scales with better base models and benefits from the new dataset across spatial, temporal, and general multimodal tasks.
  • It still struggles with very fine sub-actions and depends on a multi-stage annotation pipeline, but points a clear path forward.
  • This matters for things like smart video search, safety monitoring, sports highlights, and assistive tech that needs both timing and exact locations.

Why This Research Matters

Videos are everywhere—phones, classrooms, sports fields, and factories—so tools that understand both when and where are game-changers. VideoLoom can jump to the exact moment and highlight the exact pixels, making search, editing, and analytics much faster and more reliable. This helps safety monitoring (“alert me when and where a person crosses a line”), sports breakdowns (precise player actions at precise times), and education (pinpointing key steps in lab or cooking demos). It also boosts accessibility, by aligning narration with exact on-screen regions for people who benefit from guided visuals. And because the method scales with stronger base models and better data, its impact should keep growing as AI foundations improve.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine watching a long school play video and asking, “When does the kid in the red hat jump?” and also “Where is the kid exactly when he jumps?” You want the answer to tell you the time and also outline the kid pixel-by-pixel.

🥬 The Concept (Video understanding before this paper):

  • What it is: Many AI systems for videos could describe scenes or guess actions, but most were good at either timing (when) or location (where), not both together.
  • How it works (before):
    1. Temporal models: sample lots of frames to find the right moment (start and end times).
    2. Spatial models: use high-resolution images to segment objects very precisely.
    3. People tried to mash them together by training on separate datasets with different labels.
  • Why it matters: Without a single brain that links time and space, the AI can say “the action happens around 14–20s” but not show you the exact pixels of the person then, or it can draw the person well but not know the correct moment.

🍞 Anchor: Think of an AI that can tell you, “He starts jumping at 16.8s and ends at 24.0s, and here is the exact person mask in those frames.” That’s the dream.

🍞 Hook: You know how a calendar and a map tell different things—time vs place? Videos need both.

🥬 The Concept (The problem researchers faced):

  • What it is: A lack of unified training data that ties exact times to exact places in the same video examples, plus compute tradeoffs (many frames vs high resolution).
  • How it works:
    1. Existing datasets usually had either timestamps (temporal) or masks/boxes (spatial), not both together.
    2. Training one model on a mix of mismatched datasets caused confusion and unstable learning.
    3. Spatial tasks want big, clear frames; temporal tasks want many frames. Under fixed compute, balancing both is hard.
  • Why it matters: Without consistent, paired supervision and a compute-savvy design, models can’t form strong time–space links.

🍞 Anchor: It’s like learning to read music without ever hearing the song, or hearing the song without seeing the sheet music—you won’t play the piano well without both.

🍞 Hook: Imagine you try to solve this by just doing two steps: first find the time with one model, then draw the mask with another.

🥬 The Concept (Failed attempts):

  • What it is: Two-step or separate training paths.
  • How it works:
    1. Temporal model finds the clip window.
    2. A separate segmenter draws masks using text.
    3. No shared memory across time and space; errors in step one hurt step two.
  • Why it matters: The pieces don’t fully talk to each other, losing context and making joint questions (when+where at once) unreliable.

🍞 Anchor: Like a relay race where the first runner drops the baton a bit—the second runner starts in the wrong place.

🍞 Hook: Think of a cookbook where every recipe lists both the cooking time and the exact portions. That’s what training a joint model needs.

🥬 The Concept (The gap this paper fills):

  • What it is: A consistent dataset with both timestamps and masks and a single model that balances frame coverage and image detail.
  • How it works:
    1. Build LoomData-8.7k with aligned time and pixel masks around main people.
    2. Design a model that uses many low-detail frames (for time) and a few high-detail frames (for space) together.
    3. Create LoomBench to test temporal, spatial, and combined questions fairly.
  • Why it matters: Now the model can truly learn and be tested on the real thing we want—joint space–time understanding.

🍞 Anchor: It’s like giving a detective both the timeline and high-resolution photos of the suspects, then scoring the detective on solving full cases, not just half of them.

🍞 Hook: You know how you use a timeline slider to scrub through a video and then zoom in to see tiny details? That’s exactly why this research matters in daily life.

🥬 The Concept (Real stakes):

  • What it is: Solving when+where in videos changes how we search, edit, and automate tasks with video.
  • How it works:
    1. Faster video search: “Find when the kid scores and show exactly where he is.”
    2. Safety and assistance: “Alert me when the person in a yellow vest crosses this line—and show me their mask.”
    3. Education and sports analytics: precise highlights at exact times with the right player outlined.
  • Why it matters: We watch and record lots of video—phones, classrooms, sports, factories. Better tools save time, boost safety, and unlock insights.

🍞 Anchor: Your future video app could jump straight to the moment your dog catches the frisbee and draw a perfect outline around your dog right then.

02 Core Idea

🍞 Hook: Imagine braiding two strings—one for time and one for space—into one strong rope. That rope can pull heavy questions like “when and where” together.

🥬 The Concept (Aha! in one sentence):

  • What it is: Use two kinds of visual tokens (SlowFast) plus a bridge token ([SEG]) to let a single video language model answer both when and where at once.
  • How it works:
    1. Fast tokens: many low-res frames for motion over time.
    2. Slow tokens: a few high-res keyframes for fine spatial detail.
    3. The language model reasons over both, outputs timestamps (by frame IDs) and a special [SEG] embedding.
    4. A video segmenter (SAM2) uses that [SEG] to paint the exact object masks, which are then propagated across the video.
  • Why it matters: Without both token types and the [SEG] bridge, the model either misses details (blurry masks) or misses events (wrong times).

🍞 Anchor: Ask, “When does the swimmer dive, and where is she exactly then?” The model replies with times and paints her silhouette in those frames.

Multiple analogies:

  1. Orchestra analogy: The fast tokens keep the beat across the whole piece (tempo over time), while the slow tokens are like the soloist’s crisp notes (fine details). The conductor ([SEG]) cues the exact instrument (the segmenter) to play the right part (mask) at the right moment.
  2. Detective analogy: Fast tokens scan security footage all day (broad timing), slow tokens zoom into the crystal-clear snapshots (faces, clothes), and [SEG] hands the sketch artist the exact target to draw.
  3. News reporter analogy: The reporter (LLM) takes notes with timestamps (frame IDs) and calls the camera crew ([SEG]→SAM2) to capture high-res shots of the right person in the right scene.

Before vs After:

  • Before: Separate models or steps for time and space; brittle handoffs; inconsistent data; missed details or missed events.
  • After: One brain sees both the movie timeline and sharp stills; answers when+where in one go; trained on consistent data, tested on a joint benchmark.

Why it works (intuition):

  • Long videos need many views to catch the right moment—fast tokens do this cheaply.
  • Precise masks need big, clear frames—slow tokens supply this detail.
  • By interleaving frame IDs into language, the LLM talks about time naturally (“from 70 to 100”).
  • The [SEG] token carries the LLM’s chosen target straight into SAM2, so the segmenter paints exactly what the language model meant.

Building blocks (first-time concept intros):

🍞 Hook: You know how a movie trailer shows many quick clips (to tell the story) and a few close-ups (to show details)?

🥬 The Concept (SlowFast Visual Tokens):

  • What it is: Two token streams—fast (many low-res frames) and slow (few high-res keyframes).
  • How it works:
    1. Sample up to 128 frames for fast tokens; encode at lower spatial density (16 tokens per frame).
    2. Pick 5 keyframes for slow tokens; encode with many tokens (256 per frame) for rich detail.
    3. Feed both into the visual encoder and then the LLM.
  • Why it matters: Without fast tokens, you miss the right time; without slow tokens, masks get fuzzy.

🍞 Anchor: Finding “when the skateboarder flips” (fast tokens) and outlining the skateboard precisely (slow tokens).
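
Below is a minimal sketch of how the two streams might be produced, assuming a generic vision encoder that maps a batch of frames to patch tokens. The token counts (128 fast frames × 16 tokens, 5 keyframes × 256 tokens) follow the description above; the `encoder` argument, the pooling choice, and the resize size are illustrative stand-ins, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(video, encoder, num_fast=128, num_slow=5,
                    fast_tokens=16, slow_tokens=256):
    """Build the two visual token streams from one video.

    video:   float tensor of shape [T, 3, H, W]
    encoder: any module mapping [N, 3, 448, 448] -> [N, num_patches, dim]
             (a stand-in for the MLLM's vision backbone)
    """
    T = video.shape[0]

    def encode(indices, n_tokens):
        frames = F.interpolate(video[indices], size=(448, 448),
                               mode="bilinear", align_corners=False)
        feats = encoder(frames)                                  # [N, P, D]
        # Pool the patch dimension down to n_tokens per frame.
        pooled = F.adaptive_avg_pool1d(feats.transpose(1, 2), n_tokens)
        return pooled.transpose(1, 2)                            # [N, n_tokens, D]

    # Fast stream: many uniformly sampled frames, few tokens each (motion over time).
    fast_idx = torch.linspace(0, T - 1, steps=min(num_fast, T)).long()
    fast = encode(fast_idx, fast_tokens)

    # Slow stream: a handful of keyframes, many tokens each (fine spatial detail).
    slow_idx = torch.linspace(0, T - 1, steps=num_slow).long()
    slow = encode(slow_idx, slow_tokens)

    return fast_idx, fast, slow_idx, slow
```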

🍞 Hook: Imagine the LLM saying, “Here’s the moment and the target—go paint it!”

🥬 The Concept (MLLM-SAM2 Architecture):

  • What it is: A language-vision model (InternVL3) plus a video segmenter/tracker (SAM2) connected by a [SEG] token.
  • How it works:
    1. The MLLM reads SlowFast tokens and the question.
    2. It outputs text with frame IDs (timestamps) and produces a [SEG] embedding.
    3. SAM2 takes [SEG] and keyframe features to generate masks, then propagates masks across frames.
  • Why it matters: Without the [SEG] bridge, the segmenter wouldn’t know exactly which object the LLM meant.

🍞 Anchor: The user asks, “Where is the man in white on the monkey bars?” The LLM finds the time and passes [SEG]; SAM2 paints the man right then.
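
A minimal sketch of the [SEG] handoff, assuming the bridge is a learned linear projection from the LLM's hidden size to the segmenter's prompt-embedding size. The class name, dimensions, and single-linear-layer design are assumptions for illustration; only the overall pattern (read out the [SEG] hidden state, feed it to SAM2 as a prompt) comes from the description above.

```python
import torch
import torch.nn as nn

class SegBridge(nn.Module):
    """Turn the LLM's [SEG] hidden state into a prompt embedding for the mask decoder."""

    def __init__(self, llm_dim=4096, prompt_dim=256):
        super().__init__()
        self.proj = nn.Linear(llm_dim, prompt_dim)  # assumed: a single learned projection

    def forward(self, hidden_states, output_ids, seg_token_id):
        # hidden_states: [B, L, llm_dim], last-layer states for the generated sequence
        # output_ids:    [B, L], the generated token ids
        seg_positions = output_ids == seg_token_id      # boolean mask over [SEG] tokens
        seg_hidden = hidden_states[seg_positions]       # [num_seg, llm_dim]
        return self.proj(seg_hidden)                    # [num_seg, prompt_dim] -> SAM2 prompt


# Usage idea: the returned embedding plays the role of SAM2's prompt, combined with
# high-res keyframe features to decode masks, which SAM2 then propagates over the video.
```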

🍞 Hook: Think of calling out “Frame 26!” while scrubbing a video with friends so everyone knows the exact moment.

🥬 The Concept (Frame IDs for time):

  • What it is: Use simple text frame numbers to mark temporal order.
  • How it works:
    1. After each fast frame’s tokens, insert a plain-text tag like “This sampled frame id is 26.”
    2. The LLM answers with start and end frame IDs for time.
  • Why it matters: Without clear frame markers, the model’s time talk becomes vague.

🍞 Anchor: “The action happens from 48 to 63” is an exact, checkable answer.
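
A small sketch of the interleaving itself. The template sentence matches the one quoted above; the `<image>` placeholder standing in for a frame's visual tokens is an assumption about how the prompt is assembled.

```python
def build_interleaved_prompt(frame_ids, question, image_placeholder="<image>"):
    """Weave plain-text frame IDs between per-frame visual token placeholders."""
    parts = []
    for fid in frame_ids:
        parts.append(image_placeholder)                    # stands in for that frame's fast tokens
        parts.append(f"This sampled frame id is {fid}.")   # makes time "speakable" for the LLM
    parts.append(question)
    return " ".join(parts)


# Example: the model can then answer in the same vocabulary,
# e.g. "The action happens from 48 to 63."
print(build_interleaved_prompt([0, 26, 52], "When does the batter swing?"))
```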

🍞 Hook: Picture a coach who scores both timing (did we find the right seconds?) and drawing (did we paint the right pixels?).

🥬 The Concept (Joint training objective):

  • What it is: A loss for text answers (timestamps) plus a loss for masks.
  • How it works:
    1. Text cross-entropy trains the LLM to output the right frame IDs and words.
    2. Mask BCE + Dice trains the segmenter to paint the correct pixels.
  • Why it matters: Without both losses, the model might learn when but not where (or vice versa).

🍞 Anchor: The model learns to say “13.5–21.12s” and also outline the diver at those times.
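
A minimal sketch of the combined objective under the equal weighting described above: cross-entropy on the answer tokens plus BCE and Dice on the predicted masks. Shapes and reductions are illustrative; the released training code may differ.

```python
import torch
import torch.nn.functional as F

def dice_loss(mask_logits, mask_targets, eps=1.0):
    """Soft Dice loss over per-pixel mask logits."""
    probs = mask_logits.sigmoid().flatten(1)       # [B, H*W]
    targets = mask_targets.flatten(1)
    inter = (probs * targets).sum(-1)
    union = probs.sum(-1) + targets.sum(-1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def joint_loss(text_logits, text_labels, mask_logits, mask_targets,
               w_text=1.0, w_mask=1.0):
    """Text loss teaches 'when to say'; mask loss teaches 'where to draw'."""
    l_text = F.cross_entropy(text_logits.flatten(0, 1),    # [B*L, vocab]
                             text_labels.flatten(),        # [B*L]
                             ignore_index=-100)
    l_mask = (F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
              + dice_loss(mask_logits, mask_targets))
    return w_text * l_text + w_mask * l_mask
```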

03 Methodology

At a high level: Video + Question → (SlowFast tokenization + frame IDs) → LLM reasoning → [SEG] target → SAM2 masks + time → Answer text + pixel masks.
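
The same flow as code-shaped pseudocode. Every object and method here (the `mllm` and `sam2` wrappers, `segment_and_propagate`, and so on) is hypothetical glue used only to make the data flow concrete; `slowfast_tokens` and `build_interleaved_prompt` refer to the sketches earlier on this page.

```python
def answer_when_and_where(video, question, mllm, sam2, seg_token_id):
    """High-level data flow: video + question -> answer text + pixel masks.

    All components are hypothetical stand-ins for the paper's modules.
    """
    # 1) SlowFast tokenization: many low-detail frames plus a few high-detail keyframes.
    fast_idx, fast_tokens, slow_idx, slow_tokens = slowfast_tokens(video, mllm.vision_encoder)

    # 2) Interleave frame IDs so the LLM can talk about time in plain text.
    prompt = build_interleaved_prompt(fast_idx.tolist(), question)

    # 3) LLM reasoning: answer text with start/end frame IDs, plus a [SEG] hidden state.
    text, hidden_states, output_ids = mllm.generate(prompt, fast_tokens, slow_tokens)
    seg_embedding = mllm.seg_bridge(hidden_states, output_ids, seg_token_id)

    # 4) SAM2 decodes masks on the keyframes from the [SEG] prompt and propagates them.
    masks = sam2.segment_and_propagate(video, slow_idx, seg_embedding)

    return text, masks
```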

Step-by-step recipe:

  1. Input preparation (seeing both time and detail)
  • What happens: The system samples up to 128 frames across the whole video (fast) and picks 5 high-res keyframes (slow). It resizes frames to 448×448 for the LLM’s visual encoder and 1024×1024 for SAM2.
  • Why this step exists: Temporal tasks need many frames; spatial tasks need high resolution. Skipping either weakens either timing or mask accuracy.
  • Example: For a 2-minute cooking clip, select 128 evenly spaced frames (fast) and 5 keyframes around important moments (slow).
  2. SlowFast tokenization (turning pixels into tokens)
  • What happens: The visual encoder turns frames into tokens: 16 fast tokens per frame (downsampled) and 256 slow tokens per keyframe (rich detail).
  • Why this step exists: Tokens are the bite-sized pieces the LLM can read. Without proper token counts, you either blow up compute or starve the model of information.
  • Example: A single keyframe becomes a dense grid of 256 slow tokens so the model can see clothing textures or small props.
  3. Interleave frame IDs (making time speakable)
  • What happens: After each fast frame’s tokens, the system inserts a plain-text tag like “This sampled frame id is 26.” The full input becomes [fast tokens, ID, fast tokens, ID, …, slow tokens …].
  • Why this step exists: Using text numbers lets the LLM talk about time naturally. Without IDs, the LLM’s temporal answers would be fuzzy or inconsistent.
  • Example: Asked “When does the batter swing?”, the model can reply “from 70 to 100” (frame IDs) rather than vague words.
  4. LLM reasoning (deciding when and who/what)
  • What happens: The MLLM (InternVL3-8B) reads the question and all tokens. It outputs answer text for timestamps and produces one special [SEG] token’s hidden state that encodes the target object.
  • Why this step exists: The LLM is the planner and explainer. Without it, we wouldn’t connect the right time span to the right visual target.
  • Example: For “Where is the woman rinsing the cloth?”, it chooses the frames and encodes the woman as [SEG].
  5. Segmentation via SAM2 (painting the pixels)
  • What happens: SAM2 takes high-res keyframe features plus the [SEG]-derived target embedding to generate masks on keyframes, then propagates them across the video using its memory mechanism.
  • Why this step exists: High-quality masks need a specialized, pixel-accurate decoder. Without SAM2, masks would be coarse or unstable over time.
  • Example: In a diving video, SAM2 outlines the diver across keyframes and fills in between frames smoothly.
  6. Training objective (teaching both talking and painting)
  • What happens: The model is trained end-to-end with a text loss (cross-entropy for answers) and a mask loss (BCE + Dice). Hyperparameters set both to weight 1.
  • Why this step exists: We must jointly teach “when to say” and “where to draw.” Removing one loss unbalances learning.
  • Example data: Mix classic VQA/segmentation datasets with LoomData-8.7k so supervision includes consistent time+mask pairs.
  7. Efficient finetuning (practical setup; see the sketch after this recipe)
  • What happens: The visual encoder is frozen; we finetune the LLM with LoRA and also train the SAM2 mask decoder. One epoch, batch size 64, LR 4e-5, on 8×NVIDIA H20 (96 GB) GPUs.
  • Why this step exists: Freezing big vision backbones keeps compute manageable and stable while adapting the language head and segmentation.
  • Example: Training completes in a single pass thanks to good initialization and carefully curated data.
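
As referenced in step 7, here is a sketch of the freeze-and-adapt setup using the Hugging Face peft library as an illustrative choice (the authors use the XTuner codebase). The LoRA rank, the target module names, and the way `model` exposes its submodules are assumptions; only the overall recipe (frozen vision encoder, LoRA on the LLM, trainable SAM2 mask decoder, LR 4e-5) follows the description above.

```python
import torch
from peft import LoraConfig, get_peft_model  # illustrative; the paper trains with XTuner

def prepare_trainable(model, lora_rank=8):
    """Freeze the vision backbone, LoRA-adapt the LLM, and train SAM2's mask decoder.

    `model` is assumed to expose .vision_encoder, .llm, and .sam2 submodules.
    """
    # Keep the large vision backbone frozen for stability and manageable compute.
    for p in model.vision_encoder.parameters():
        p.requires_grad = False

    # Wrap the language model with LoRA adapters (rank and targets are assumptions).
    lora_cfg = LoraConfig(r=lora_rank, lora_alpha=16, lora_dropout=0.05,
                          target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
    model.llm = get_peft_model(model.llm, lora_cfg)

    # Freeze SAM2 except its mask decoder (assumed attribute name).
    for p in model.sam2.parameters():
        p.requires_grad = False
    for p in model.sam2.mask_decoder.parameters():
        p.requires_grad = True

    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=4e-5)  # learning rate from the setup above
```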

Concrete walk-through with data:

  • Input question: “Where is the swimmer when she emerges and catches her breath?”
  • Fast tokens: many low-res glimpses across the clip detect the emerging moment (IDs 140–170).
  • Slow tokens: five 1024×1024 keyframes show water droplets and the swimmer’s face clearly.
  • LLM output: “The query happens from 23.12s to 28.46s” (converted from IDs) and emits [SEG].
  • SAM2 output: A tight mask around the swimmer during those frames, propagated across in-between frames.
  • Final answer: The time span plus the masks.
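
The walk-through above reports times in seconds that are converted from frame IDs. The exact conversion isn't spelled out here, so the sketch below shows one plausible mapping under the assumption of uniform frame sampling; the numbers in the example are purely illustrative.

```python
def frame_id_to_seconds(frame_id, num_sampled, duration_s):
    """Map a sampled-frame ID back to a timestamp, assuming uniform sampling.

    One plausible convention (not necessarily the paper's exact formula):
    ID 0 maps to 0 s and the last sampled ID maps to the end of the clip.
    """
    if num_sampled <= 1:
        return 0.0
    return frame_id / (num_sampled - 1) * duration_s


# Purely illustrative: with 128 sampled frames over a 40-second clip,
# IDs 74 and 91 land at roughly 23.3 s and 28.7 s.
for fid in (74, 91):
    print(round(frame_id_to_seconds(fid, 128, 40.0), 1))
```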

Secret sauce:

  • SlowFast tokens balance temporal coverage with spatial precision, squeezing more out of the same compute.
  • The [SEG] bridge lets the language brain aim the paintbrush exactly.
  • Frame IDs transform “temporal understanding” into ordinary language prediction, which LLMs are great at.
  • LoomData-8.7k supplies consistent time+mask labels so learning the joint skill is natural.

04 Experiments & Results

The test: Does one model handle both timing and pixel-precise location?

  • Temporal grounding: Charades-STA, YouCook2 (dense captioning), QVHighlights (highlights).
  • Spatial segmentation: RefYTVOS, MeVIS, ReVOS (referring video object segmentation).
  • Joint evaluation: LoomBench with When, Where, and Combined (when+where at once) questions.

The competition: Strong video LLMs and task-specific systems like TimeChat, TRACE, TimeSuite, HawkEye (temporal), and Sa2VA, VRS-HQ (spatial), plus a pipeline baseline (TimeSuite → Sa2VA) for Combined questions.

Scoreboard with context:

  • Temporal (find the right time): On Charades-STA, VideoLoom gets R1@0.7 = 48.3, a top-tier score among unified models; on QVHighlights, HIT@1 = 63.3, which is like spotting the best moment on your very first guess more than six times out of ten; on YouCook2 it reaches 7.3 SODA_c and strong dense captioning metrics (e.g., F1 and mAP).
  • Spatial (paint the object): On MeVIS J&F = 51.7, RefYTVOS J&F = 71.3, ReVOS J&F = 63.1—state-of-the-art or highly competitive, even against tracking-focused models. That’s like outlining the right person crisply while they move.
  • Joint (do both at once): On LoomBench Combined, VideoLoom beats the two-step baseline by +16.2 tIoU and +15.4 on the new Bidirectional Foreground J&F metric. That’s the difference between a solid B and a convincing A.
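
For reference, tIoU in the scores above is the standard temporal intersection-over-union between a predicted and a ground-truth time span (R1@0.7, for instance, counts predictions whose tIoU is at least 0.7). The paper's Bidirectional Foreground J&F is its own mask metric and is not reproduced here; the sketch below covers only the standard tIoU.

```python
def temporal_iou(pred, gt):
    """Standard temporal IoU between two (start, end) spans, in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0


# Example: predicting 16.0-24.0 s against a ground-truth span of 16.8-24.0 s
print(round(temporal_iou((16.0, 24.0), (16.8, 24.0)), 2))  # 0.9
```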

Why these results matter:

  • One brain > two-step: A single model trained to think across time and space outperforms a pipeline that glues two separate experts.
  • Stable across tasks: Gains show up not only in temporal or spatial tests alone but also in the hardest combined setting.

Surprising findings:

  • Unified beats specialists: Despite being general, VideoLoom matches or surpasses models specialized for tracking or grounding.
  • Better data lifts all boats: Adding LoomData-8.7k bumps not only combined performance (+5.0 on J&F_bi-fore) but also standard temporal/spatial and general multimodal benchmarks.
  • Fairer metric needed: Standard J&F can be inflated by background frames in Combined questions. The new Bidirectional Foreground J&F gives more stable, honest scores across different segment lengths.

Extra insights:

  • SlowFast ablations: Using only slow or only fast tokens hurts the other side (space or time). Using both together produces consistent improvements across nearly all datasets.
  • Scaling helps: Swapping in stronger base MLLMs (e.g., InternVL3-8B) improves joint understanding, showing the approach is future-proof.
  • Generalization: Though LoomData is human-centric, segmentation quality for non-human classes also benefits, hinting that richer language descriptions help broader categories.

05 Discussion & Limitations

Limitations:

  • Fine-grained temporal reasoning: The model can still miss very short sub-actions or the nth repetition within long videos.
  • Annotation pipeline complexity: Multi-stage automatic labeling (detection, tracking, captioning, merging) still needs manual verification for best quality.
  • Person-centric bias: LoomData focuses on main people; other categories benefit but are not the primary target.
  • Compute and memory: Training uses 8×H20 GPUs with large memory; edge devices or real-time constraints remain challenging.

Required resources:

  • Hardware: Multi-GPU setup (e.g., 8×NVIDIA H20, 96 GB each) for efficient finetuning.
  • Software: XTuner codebase, InternVL3 MLLM, SAM2 segmenter.
  • Data: Standard video-language datasets plus LoomData-8.7k.

When NOT to use:

  • Real-time on tiny devices or strict latency applications without adaptation.
  • Videos with dozens of similar, tiny objects where a single main target is unclear.
  • Domains with actions very far from training data (e.g., specialized medical procedures without fine-tuned data).
  • Queries requiring precise audio cues (no explicit audio modeling here).

Open questions:

  • Can we further automate the pipeline—both generation and verification—using stronger agents to reduce manual checks?
  • How to better capture micro-actions and repetitions without exploding compute (e.g., smarter keyframe selection, hierarchical time modeling)?
  • Can we unify learning so the segmenter and the LLM share more representations end-to-end?
  • How far does this scale with audio, 3D cues, or multi-person interactions where multiple targets must be coordinated jointly?
  • What’s the best universal, fair metric for combined space–time tasks beyond LoomBench’s setting?

06 Conclusion & Future Work

Three-sentence summary: VideoLoom is a single video large language model that answers both when and where by combining fast tokens for time, slow tokens for space, and a [SEG] bridge to a powerful segmenter. A new dataset (LoomData-8.7k) provides consistent timestamp-plus-mask supervision, and a new benchmark (LoomBench) fairly tests temporal, spatial, and combined questions. The model achieves state-of-the-art or highly competitive results across diverse tasks and strongly outperforms a two-step pipeline on combined evaluations.

Main achievement: Proving that one unified architecture can jointly localize events in time and paint objects in space—reliably and efficiently—backed by a matching training dataset and a fair, comprehensive benchmark.

Future directions: Automate the annotation pipeline with stronger agents; improve sensitivity to micro-actions and repeated events; integrate audio and multi-person modeling; push end-to-end learning between the language head and the segmenter; scale with newer, stronger base MLLMs.

Why remember this: It turns the long-standing either-or (time vs space) into a both-and solution. As videos continue to explode in our lives, tools that can jump to the right moment and show the right pixels will reshape search, editing, safety, analytics, and assistive experiences.

Practical Applications

  • Smart video search: Ask natural questions and jump to the exact time while seeing the person/object mask.
  • Sports analytics: Auto-clip key plays and outline the involved player or equipment at the right frames.
  • Safety and compliance: Detect when and where a person enters restricted zones and visualize it precisely.
  • Video editing: Quickly locate segments and isolate subjects for compositing, b-roll, or effects.
  • Education and tutorials: Highlight the exact tool or ingredient at each step, synced to timestamps.
  • Retail and stores: Track when and where a shopper interacts with a product display for heatmaps and insights.
  • Robotics and inspection: Mark the exact component a robot should manipulate at the correct step in a sequence.
  • Assistive tech: Provide synchronized time+mask cues for users who need guided attention in complex scenes.
  • Content moderation: Flag and localize prohibited actions with exact time spans and object masks.
  • AR/VR experiences: Anchor virtual overlays to the correct person or object precisely when events occur.
#Video Large Language Model #Temporal Grounding #Referring Video Object Segmentation #SlowFast Tokens #SAM2 #InternVL3 #Segmentation #Frame IDs #LoomData-8.7k #LoomBench #Bidirectional Foreground J&F #tIoU #Video Understanding #Multimodal LLM #Mask Propagation
Version: 1