From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
Key Summary
- This paper builds TAD, a brand-new test that checks if AI can understand what happens over time in real driving videos.
- TAD uses 150 short city-driving videos and 5,861 question-answer pairs that ask about both small moments (segments) and whole trips (scenes).
- Many top vision-language models (VLMs) struggle, especially with fine-grained motions like slow lane changes or when the ego car is not visible in the camera view.
- The authors add two training-free helpers: Scene-CoT (step-by-step reasoning descriptions per segment) and TCogMap (a simple, ego-centric motion map over time).
- Across models, TCogMap boosts accuracy the most, raising some systems by up to 17.72% on average.
- Human performance (about 74.7% average) still beats the best models (about 65.7% average), showing there is room to grow.
- Scene-CoT helps smaller models more; big models already do some internal reasoning and gain less from extra text descriptions.
- A surprising result: sometimes the ego-motion map alone helps more than raw images alone, but combining both works best.
- The benchmark and code are released so others can test and improve temporal understanding for safer autonomous driving.
Why This Research Matters
Driving is about timing as much as seeing: knowing who moved first, how long they waited, and when they turned is essential for safe choices. This work gives the field a fair, realistic test (TAD) for temporal understanding in ego-centric driving videos. It also shows that simple, training-free tools, namely step-by-step notes (Scene-CoT) and a short ego-motion timeline (TCogMap), can make today's models much better without retraining. The gains shrink the gap to human performance, suggesting safer, more explainable vehicle perception is within reach. Because the code and data are public, researchers and engineers can build on it right away. Ultimately, these advances help AI understand the flow of traffic the way people do, which is vital for real-world autonomy.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how when you watch a short clip from a soccer game, you need to remember what happened a few seconds ago to know why the goalie dived? The past matters for making sense of the present.
The Concept (Temporal Understanding): It is the skill of tracking what happened first, next, and last, and how actions connect across time.
- How it works: (1) Notice events, (2) place them in order, (3) connect causes and effects, (4) use this timeline to answer questions.
- Why it matters: Without it, an AI treats a video like shuffled photos, missing who started, stopped, turned, or changed lanes, and why.
Anchor: If you ask, "Did the car stop before turning right?", temporal understanding lets the AI check the timeline instead of guessing; a toy sketch of this timeline check follows below.
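To make the idea concrete, here is a tiny, self-contained Python sketch of timeline-based ordering. The event names and list format are illustrative only; they are not part of TAD.

```python
# Toy timeline reasoning: given (frame_index, event) pairs in video order,
# check whether one event happened before another. Event names are illustrative.
timeline = [(3, "straight"), (9, "stop"), (14, "turn_right")]

def happened_before(timeline, first_event, second_event):
    # Map each event to the frame where it occurs.
    frame_of = {event: frame for frame, event in timeline}
    return (first_event in frame_of and second_event in frame_of
            and frame_of[first_event] < frame_of[second_event])

print(happened_before(timeline, "stop", "turn_right"))  # True
```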
Hook: Imagine recording a bike ride with a front camera. You can see the road and other riders, but you never see yourself in the frames; you only feel your motion.
The Concept (Ego-Centric View): It means the camera is on the ego vehicle, so we see the world from its perspective, not the car itself.
- How it works: (1) The camera moves with the car, (2) background shifts reveal your own motion, (3) other agents appear and disappear at different times.
- Why it matters: Without understanding ego motion, AI gets confused: a slow curve can look like others drifting, and you never directly see the ego car to label its action.
Anchor: On a left turn, buildings slide right in the video. That "slide" signals your ego car is turning, even though you never see the car itself.
Hook: Think of a student who can both look at a diagram and read a paragraph, then explain the science lab results.
The Concept (Vision-Language Models, VLMs): They are AIs that learn from images/videos plus text to answer questions about what they see.
- How it works: (1) Turn frames into visual tokens, (2) read the question as text tokens, (3) reason over both, (4) produce an answer.
- Why it matters: Without connecting sight and words, the AI can't follow instructions like "Which action lasted longest?"
Anchor: Show a VLM a driving clip and ask, "When did the bus first appear?" It searches the video and replies with the right frames.
The world before this paper: Video benchmarks existed, but they mostly focused on sports, cooking, and movies. Cars are different. They move on roads with rules, have subtle maneuvers (like gentle lane changes), and the camera is anchored to the ego car. That makes ordering and timing trickier. Models that were good at general videos often stumbled in driving scenes, especially on fine-grained motion.
The problem: There was no benchmark laser-focused on the special temporal challenges of ego-centric autonomous driving (AD):
- Varying temporal scales: blinks-and-you-miss-it lane changes vs. longer stops.
- Ego-centric view: the ego car's action must be inferred indirectly.
- Fine-grained actions: "starting" vs. "stopped," "turn left" vs. "change lane left."
Failed attempts and gaps: General video QA benchmarks didn't test the unique AD needs; AD QA sets existed, but few demanded precise event timing across both short segments and whole scenes. Models often guessed or mixed up event order, and struggled to localize exactly when things happened.
Hook: Imagine grading a spelling test without words that use silent letters; the tricky part wouldn't be tested.
The Concept (TAD Benchmark): It's a test tailored to AD temporal reasoning, built from real driving videos with questions that probe both small moments and long scenes.
- How it works: (1) Use 150 NuScenes videos (~20s each), (2) divide into overlapping 5-second segments, (3) annotate fine-grained vehicle actions, (4) generate 5,861 QA pairs across 7 tasks, (5) evaluate models on timing, order, and actions (an illustrative QA record is sketched after this block).
- Why it matters: Without a focused test, we can't tell if models truly understand driving timelines.
Anchor: A TAD question might ask, "Which happened earlier for the ego car: Starting or Stopping?" The model must reason over the whole video to choose.
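As a rough illustration, a single TAD-style entry could be stored as a small record like the one below. The field names and values are my own guesses at a plausible schema, not the benchmark's actual file format.

```python
# Hypothetical QA record; the schema is illustrative, not TAD's released format.
qa_record = {
    "video_id": "scene-0103",        # placeholder NuScenes-style scene id
    "level": "scene",                # "segment" (5 s window) or "scene" (whole clip)
    "task": "temporal_ordering",     # one of the 7 task types
    "question": "Which happened earlier for the ego car: Starting or Stopping?",
    "options": ["Starting", "Stopping"],
    "answer": "Starting",
}
print(qa_record["question"])
```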
Real stakes: In daily life, cars need to know not just what is in front of them, but what happened a moment ago and what is likely next. Misreading whether a bus is stopped or just pausing can change how a car reacts. TAD exposes where AI fails now, so safer systems can be built next.
02 Core Idea
Hook: You know how teachers sometimes ask you to show your work in math? Writing the steps helps you get the right answer, not just guess it.
The Concept (Chain-of-Thought, CoT): It's a step-by-step explanation the AI creates before answering.
- How it works: (1) Break a video into bite-sized segments, (2) describe the scene, (3) infer the ego car's motion, (4) list nearby vehicles' motions, (5) summarize cleanly, (6) answer the question using all these steps in order.
- Why it matters: Without steps, AI may miss subtle motions or mix up event order.
Anchor: To answer, "When did the car first stop?", the AI's segment-by-segment notes make it easy to locate the stop point.
Hook: Imagine a travel diary where you jot down what you (not someone else) did at each point of the day.
The Concept (Ego-Centric Temporal Cognitive Map, TCogMap): It's a simple timeline of the ego car's actions across segments, computed from its trajectory.
- How it works: (1) Read ego poses (position, rotation, timestamps), (2) compute speeds and yaw changes, (3) detect turning vs. lane changing vs. starting/stopping with thresholds, (4) produce a short motion label per segment.
- Why it matters: Without a clear ego timeline, the model confuses its own motion with others', hurting temporal reasoning.
Anchor: A TCogMap might say: "Frames 1-8: straight; 9-15: turn left; 16-22: straight." That context helps the model answer timeline questions.
One-sentence "Aha!": Testing and helping AI think over time in driving needs both the right exam (TAD) and simple, training-free hints (Scene-CoT and TCogMap) that make timelines clear.
Three analogies:
- Movie script: TAD asks about the plot timeline; Scene-CoT writes scene-by-scene summaries; TCogMap is the main character's diary.
- Cooking show: TAD checks the order of steps; Scene-CoT is the recipe card; TCogMap says exactly when the chef stirred or turned the heat down.
- Sports replay: TAD quizzes when moves happened; Scene-CoT is the commentator's play-by-play; TCogMap is the player's GPS track.
Before vs After:
- Before: VLMs saw many frames at once and tried to answer, often mixing up timing or subtle actions.
- After: With Scene-CoT and TCogMap, the same VLMs get structured time-cues and ego motion summaries, improving accuracy by up to 17.72% without extra training.
Why it works (intuition):
- Bottleneck 1 (Memory): Long videos overflow working memory. Segmenting + summaries reduce clutter.
- Bottleneck 2 (Ambiguity): Ego motion is invisible in the image. TCogMap reveals it directly from trajectory.
- Bottleneck 3 (Reasoning steps): CoT scaffolds the thinking process so the model doesn't skip or mash steps.
Building blocks:
- TAD: 5,861 QAs, 7 task types, segment and scene levels.
- Segmenting videos: 5-second windows with overlap capture micro-actions.
- Actions: eight fine-grained labels (straight, stopped, starting, stopping, turn left/right, change lane left/right).
- Scene-CoT: four-step segment descriptions (scene, ego action, others' actions, JSON summary), then QA.
- TCogMap: compute speeds/yaw/lateral motion from ego poses, classify action per segment, inject as time-aligned text.
- Evaluation: accuracy for multiple choice/exact match; temporal mIoU for frame-list localization.
03 Methodology
High level recipe: Input video and question → split into segments → build time-aware summaries (either Scene-CoT or TCogMap) → feed frames + summaries + question into a VLM/LLM → output the answer.
Hook: Imagine cutting a long movie into short scenes to study who did what, when.
The Concept (Video Partitioning): Divide the ~20s video into overlapping 5-second segments to catch micro-motions.
- How it works: (1) Uniformly split into l segments with ~50% overlap, (2) sample a few frames (e.g., four) per segment for description, (3) keep nearby vehicles within 50m for labeling.
- Why it matters: Without segments, quick actions blur into longer ones; subtle cues get lost.
Anchor: A lane change often completes within 5 seconds, a perfect fit for a segment window (a small windowing sketch follows below).
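Here is a minimal sketch of overlapping 5-second windowing. The frame rate, function name, and parameters are assumptions for illustration; the paper's exact partitioning code may differ.

```python
# Overlapping-window sketch: split a clip's frame indices into ~5 s segments
# with ~50% overlap. fps and parameter names are illustrative assumptions.

def partition_into_segments(num_frames, fps=2.0, window_s=5.0, overlap=0.5):
    """Return (start_frame, end_frame) index pairs for overlapping windows."""
    window = max(1, int(round(window_s * fps)))              # frames per segment
    stride = max(1, int(round(window * (1.0 - overlap))))    # ~50% overlap
    segments, start = [], 0
    while start < num_frames:
        end = min(start + window, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start += stride
    return segments

# Example: a 20 s clip sampled at 2 frames/s gives 40 frames.
print(partition_into_segments(40))
# [(0, 10), (5, 15), (10, 20), (15, 25), (20, 30), (25, 35), (30, 40)]
```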
Scene-CoT (training-free CoT reasoning):
- Inputs: sampled frames per segment.
- Steps per segment:
- Scene description: Write a concise high-level caption.
- Ego motion: Infer the ego car's action from background shifts.
- Nearby vehicles: Describe each distinct agent's motion.
- Summary formatting: Produce a clean JSON-like motion summary.
- Why each step?
- Scene description: primes context; without it, later steps miss setting.
- Ego motion: anchors the timeline; without it, others' actions are ambiguous.
- Nearby vehicles: ties interactions; without it, multi-agent questions fail.
- Summary: standardizes info; without it, the QA step sifts through messy text.
- Example: For segment j, the summary might say "ego: straight; nearby: blue taxi straight; bus stopping." Later, the LLM strings together all segments' summaries to answer, "Which happened earlier: stopping or starting?"
- Secret sauce: a lightweight, human-like outline that makes event order explicit, especially helpful for smaller models with less internal reasoning (an illustrative per-segment prompt is sketched below).
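To make the four-step outline tangible, here is one way a per-segment Scene-CoT prompt could be phrased. The wording, step labels, and JSON schema are my own illustrative choices, not the authors' released prompt.

```python
# Illustrative Scene-CoT prompt builder for a single segment. The phrasing and
# JSON schema are assumptions, not the paper's actual prompt.

def scene_cot_prompt(segment_index, num_frames=4):
    return (
        f"You are given {num_frames} sampled frames from segment {segment_index} "
        "of a front-camera driving video.\n"
        "Step 1 - Scene description: one concise sentence about the setting.\n"
        "Step 2 - Ego motion: infer the ego vehicle's action from how the background shifts.\n"
        "Step 3 - Nearby vehicles: describe the motion of each distinct vehicle.\n"
        "Step 4 - Summary: return JSON of the form "
        '{"ego": "<action>", "nearby": [{"vehicle": "<description>", "action": "<action>"}]}'
    )

print(scene_cot_prompt(segment_index=3))
```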
Hook: Think of a fitness tracker that logs your path and speed so you later recall exactly where you sped up or turned.
The Concept (TCogMap Ego Motion Classification): Turn ego poses into per-segment motion labels.
- What it is: A rule-based decoder from raw pose/velocity to action (straight, stopping, starting, turn left/right, lane change left/right, stopped).
- How it works (step-by-step):
- Compute frame-to-frame velocities and speeds from pose differences.
- If speeds are low most of the time, label "stopped."
- Compute total yaw change across the segment: large positive → left turn; large negative → right turn.
- Transform velocities to the ego's local frame; if lateral and forward speeds cross thresholds together, label lane change (sign gives left vs right).
- Compare starting vs ending speeds to detect "starting" or "stopping." Otherwise, "straight."
- Why it matters: Without this, the VLM must infer ego motion from pixels alone, which is hard in an ego-centric view.
Anchor: If yaw changes by ~15° with steady speed, it's likely a turn; if lateral speed is non-zero with forward speed, it's a lane change (a rule-based sketch of these checks follows below).
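Below is a minimal Python sketch of a rule-based classifier in the spirit of TCogMap. The thresholds, decision order, and sign conventions are illustrative assumptions; the paper's actual rules and values may differ.

```python
import numpy as np

# TCogMap-style rule-based ego-motion classifier for one segment.
# Thresholds, decision order, and sign conventions are illustrative assumptions.

def classify_segment(positions, yaws, timestamps,
                     stop_speed=0.5,      # m/s below which the car counts as stopped
                     turn_yaw_deg=15.0,   # accumulated heading change for a turn
                     lateral_speed=0.8,   # m/s of sideways motion for a lane change
                     accel_ratio=1.5):    # speed ratio for detecting starting/stopping
    """positions: (N, 2) xy in meters; yaws: (N,) radians; timestamps: (N,) seconds."""
    positions, yaws, timestamps = map(np.asarray, (positions, yaws, timestamps))
    dt = np.diff(timestamps)
    vel = np.diff(positions, axis=0) / dt[:, None]   # world-frame velocity per step
    speed = np.linalg.norm(vel, axis=1)

    # Mostly stationary -> "stopped".
    if np.mean(speed < stop_speed) > 0.8:
        return "stopped"

    # Large accumulated heading change -> turn; sign gives the direction.
    total_yaw = np.degrees(yaws[-1] - yaws[0])
    if total_yaw > turn_yaw_deg:
        return "turn left"
    if total_yaw < -turn_yaw_deg:
        return "turn right"

    # Rotate velocity into the ego's local frame; sustained sideways motion
    # while still moving forward suggests a lane change.
    yaw_step = yaws[:-1]
    lateral = -np.sin(yaw_step) * vel[:, 0] + np.cos(yaw_step) * vel[:, 1]
    if np.max(np.abs(lateral)) > lateral_speed and np.median(speed) > stop_speed:
        return "change lane left" if lateral.mean() > 0 else "change lane right"

    # Clearly speeding up vs. slowing down across the segment.
    start_speed, end_speed = speed[:3].mean(), speed[-3:].mean()
    if end_speed > accel_ratio * max(start_speed, stop_speed):
        return "starting"
    if start_speed > accel_ratio * max(end_speed, stop_speed):
        return "stopping"
    return "straight"

# Example: a car driving straight along x at ~5 m/s for 5 s, sampled at 2 Hz.
t = np.arange(0, 5.5, 0.5)
pos = np.stack([5.0 * t, np.zeros_like(t)], axis=1)
print(classify_segment(pos, np.zeros_like(t), t))  # "straight"
```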
VLM/LLM QA stage:
- Scene-CoT path: The LLM reads (a) the question and (b) the ordered segment summaries, then answers.
- TCogMap path: The VLM receives (a) the frames, (b) the question, and (c) time-stamped ego-motion summaries like "Frames 1-7: straight; 8-12: stopped."
- Why it matters: Without aligned time hints, the model may misplace events or conflate agents (a prompt-assembly sketch follows below).
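The following sketch shows one way the time-stamped ego-motion labels could be folded into a VLM prompt. The phrasing and frame-range format are assumptions for illustration, not the released implementation.

```python
# Illustrative TCogMap prompt assembly. The phrasing and frame-range format
# are assumptions, not the paper's released code.

def tcogmap_prompt(question, segment_labels):
    """segment_labels: list of (start_frame, end_frame, action) tuples."""
    timeline = "; ".join(
        f"Frames {start}-{end}: {action}" for start, end, action in segment_labels
    )
    return (
        "Ego-motion timeline (computed from vehicle poses): " + timeline + "\n"
        "Using the video frames and the timeline above, answer the question.\n"
        "Question: " + question
    )

labels = [(1, 7, "straight"), (8, 12, "stopped"), (13, 20, "turn left")]
print(tcogmap_prompt("Which came first for the ego car: Starting or Stopping?", labels))
```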
Example with actual data:
- Task: Temporal Ordering (ego). Question: "Which came first: Starting (A) or Stopping (B)?"
- Scene-CoT: finds segment summaries where starting occurs before stopping; answers "A."
- TCogMap: uses the ego-motion labels to see "starting" appears in segment 2 and "stopping" in segment 7; answers "A."
Metrics and evaluation:
- Multiple-choice/exact-answer tasks: accuracy.
- Frame-list localization tasks: temporal mean IoU (mIoU) between predicted and ground-truth frame sets (a scoring sketch follows after this list).
- Baselines: frames + question; plus a "textual ego pose" variant; then add Scene-CoT or TCogMap.
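For reference, temporal IoU over frame sets can be computed roughly as below; TAD's exact scoring conventions (for example, how empty predictions are handled) may differ from this sketch.

```python
# Sketch of temporal IoU over frame sets and its mean over a batch of questions.
# TAD's exact scoring conventions may differ; this only shows the basic idea.

def temporal_iou(pred_frames, gt_frames):
    pred, gt = set(pred_frames), set(gt_frames)
    if not pred and not gt:
        return 1.0
    return len(pred & gt) / len(pred | gt)

def temporal_miou(predictions, ground_truths):
    scores = [temporal_iou(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)

# Example: predicted frames 10-15 vs. ground-truth frames 8-14 share 5 of 8 frames.
print(temporal_iou(range(10, 16), range(8, 15)))  # 0.625
```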
Secret sauce of the whole method:
- Keep it training-free and modular: drop-in helpers for any VLM.
- Give models just enough structure to follow the timeline without overwhelming them.
- Use ego poses (a free signal in AD datasets) to disambiguate what the camera is doing versus what others are doing.
04 Experiments & Results
The test: TAD probes seven abilities, comprising two segment-level action recognition tasks and five scene-level temporal tasks: Action Duration, Temporal Ordering, Temporal Action Localization, Relative Temporal Action Localization, and Temporal Object Localization. Most are multiple choice; localization tasks expect frame lists scored by temporal mIoU.
The competition: 9 models across 30 configurations, spanning open-source generalists (Qwen2.5-VL, InternVL3 families), closed-source generalists (GPT-5-mini, Gemini-2.5-Flash), and AD specialists (RoboTron, Cosmos-Reason). Each tested as (a) baseline, (b) baseline + textual ego pose, (c) + Scene-CoT, (d) + TCogMap.
Scoreboard with context:
- Human level (Avg): about 74.72%, like scoring an A when the test is tricky.
- Chance level (Avg): about 34.37%, i.e., random guessing across mixed tasks.
- Best model configuration (Avg): InternVL3-38B + TCogMap at about 65.66%, a strong B when humans get an A.
- Closed-source (Avg): GPT-5-mini ~52.04%; Gemini-2.5-Flash ~52.10%, roughly mid-pack among the open models.
- Gains: TCogMap brings the biggest boosts, improving averages by up to 17.72% over corresponding baselines. Scene-CoT helps smaller models more than larger ones.
Meaning behind numbers:
- There's still a gap to humans (~9 percentage points), showing temporal reasoning in AD remains hard.
- TCogMap vs textual ego pose: Giving raw pose text alone isn't enough; converting it into an ego-motion timeline is what unlocks performance.
- Scene-CoT: For compact models (e.g., Qwen2.5-VL-7B), adding the CoT summaries lifts average accuracy notably. For large models, gains are smaller (they already do more internal reasoning).
Surprising findings:
- Ego vs non-ego: Baselines did better on ego questions (ego motion leaves strong camera-motion clues). Scene-CoT helped non-ego questions more; TCogMap supercharged ego questions and also modestly lifted non-ego ones by providing context.
- Blind test: With question-only input, models hovered near chance, proof that TAD isn't answerable by pattern-guessing. Using TCogMap alone sometimes beat images alone on temporal tasks, but combining both was best, confirming that time-structure and visuals complement each other.
- AD specialists: In-domain training didn't guarantee better temporal reasoning; without explicit time scaffolding, they still stumbled on ordering/localization.
Efficiency note:
- TCogMap runtime is about the same as baseline (tiny overhead to compute ego motion). Scene-CoT is slower due to many reasoning calls (about 40 per video), but can be sped up with batching/optimization.
05 Discussion & Limitations
Limitations:
- Scope: TAD zeroes in on temporal understanding; it doesn't cover every AD need (e.g., long-horizon planning, rare corner-case intent, or multi-sensor fusion beyond front-camera frames in evaluation).
- Granularity: Actions are eight coarse categories; some very subtle maneuvers or combined behaviors may still be ambiguous.
- Ego-focus in TCogMap: Only ego motion is mapped; non-ego timelines are inferred indirectly.
- Generalization: Built on NuScenes validation split; different cities/sensors or nighttime/rain extremes might shift results.
Required resources:
- A VLM that can take sequences of frames (~40 for scene tasks; ~10 for segment tasks) and handle text prompts.
- For TCogMap: access to ego poses (position, rotation, timestamps), which common AD datasets provide.
- For Scene-CoT: an LLM to generate per-segment descriptions; more compute for multi-call reasoning.
When NOT to use:
- If you lack ego pose data, TCogMap can't be built as-is (though approximate odometry could help).
- Ultra low-latency settings may avoid Scene-CoT unless optimized, due to multiple reasoning passes.
- If your model already has strong built-in temporal modules and training data tailored to AD, the extra scaffolding may offer smaller gains.
Open questions:
- Can we extend cognitive maps to non-ego agents without overwhelming the model with clutter?
- What's the best way to compress long temporal contexts (hours) without losing crucial micro-actions?
- How can we fuse multi-sensor temporal signals (LiDAR, radar, maps) into similarly compact, helpful summaries?
- Can we train models directly on CoT-like supervision or motion-map labels to replace inference-time scaffolding?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces TAD, a focused benchmark to test whether models can truly understand the timing and order of events in ego-centric driving videos. It also proposes two training-free helpers, Scene-CoT and TCogMap, that add step-by-step segment reasoning and an ego-motion timeline, significantly boosting performance. Results show large accuracy gains (up to 17.72%), but there's still a gap to human-level understanding, pointing to exciting future work.
Main achievement: Establishing the first comprehensive AD temporal benchmark covering both segment- and scene-level questions and demonstrating that simple, structured, training-free time cues (especially TCogMap) meaningfully close the gap for many VLMs.
Future directions:
- Expand cognitive maps to include non-ego agents with smart filtering to avoid clutter.
- Build training datasets that teach models to produce/use temporal summaries natively.
- Optimize Scene-CoT for speed via batching, pruning, and quantization.
- Explore multi-sensor temporal maps that remain compact and informative.
Why remember this: It shows that in driving, time is as important as pixels, and that giving models the right kind of timelines and step-by-step notes can make them much better learners without retraining, bringing safer, more reliable perception closer to reality.
Practical Applications
- Benchmark your VLM's temporal skills on AD videos using TAD's 7 tasks before deployment.
- Plug TCogMap into your perception stack to add ego-motion context without retraining your model.
- Use Scene-CoT to generate concise per-segment summaries that improve small-model performance on timeline questions.
- Adopt TAD's 5-second overlapping segments to capture micro-maneuvers like lane changes in your own datasets.
- Train lightweight classifiers on ego poses to auto-label turning, starting, stopping, and lane-changing events for QA or supervision.
- Add time-stamped ego-motion hints to prompts when asking long-video questions to reduce ordering errors.
- Use temporal mIoU in your evaluation to fairly score frame-range predictions for actions or object visibility.
- Run blind vs image vs TCogMap ablations to diagnose whether your model lacks visual or temporal structure.
- Optimize inference by batching Scene-CoT segment calls or using quantized models to reduce runtime.
- Leverage TAD's segment-level action annotations (4,481 labels) to pretrain or validate fine-grained action recognizers.