4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Key Summary
- This paper teaches a video-understanding AI to think in 3D plus time (4D) so it can answer questions about specific objects moving in videos.
- The key trick is Perceptual 4D Distillation (P4D), which lets a fast language+vision model learn depth, motion, and timing from a frozen expert without slowing down at test time.
- Two kinds of lessons are used: latent distillation (matching abstract hidden features) and explicit distillation (matching clear signals like depth and optical flow).
- A special Timestamp Positional Encoding (TPE) gives the model a built-in sense of when each frame happens.
- They also built R4D-Bench, a new test where questions point to exact regions in dynamic videos (like 'How fast is ⟨R1⟩?').
- 4D-RGPT improves over strong baselines on six existing 3D/4D benchmarks by about 5.3% on average.
- On the new R4D-Bench, 4D-RGPT scores 4.3% higher than the baseline and does especially well on dynamic tasks like speed and displacement.
- The method adds no extra cost during inference because all the heavy 4D learning happens only during training.
- Results show depth and optical flow supervision matter most, and explicit time cues (TPE) fix common timing mistakes.
- This approach helps real-world tasks like safer robots, better driver assistance, and smarter video analytics that must track specific moving things.
Why This Research Matters
Real-life systems need to answer precise, time-sensitive questions about specific things that move. This work makes AI better at tracking the exact object you point to and measuring what it does over time, like speed, direction, and distance. Because the 4D learning happens only during training, the final model stays fast enough for real applications. Safer driver assistance can monitor the right car and estimate its approach rate. Robots can check if the gripper is moving toward the correct part and at a safe speed. Industrial inspectors can measure how a component shifts over time to catch problems early.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine watching a sports replay. You don't just see where players are; you also track who moves where, how fast, and when the key play happens. Your brain does 3D plus time (that's 4D) automatically.
The Concept (Multimodal Large Language Models, MLLMs):
- What it is: MLLMs are AI systems that understand both words and visuals together.
- How it works: (1) Read images or video frames. (2) Turn them into features the language model can understand. (3) Read the question in text. (4) Combine vision + text features. (5) Generate an answer token by token. (A generic sketch follows this concept.)
- Why it matters: Without combining vision and language, an AI might spot objects but won't answer precise questions like "How far did this car move?" Anchor: When you ask, "Which direction is the red car going?" an MLLM uses the video plus the question to decide.
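A highly simplified sketch of that five-step flow, in generic Python. The function and method names (vision_encoder, projector, llm.tokenize, llm.generate) are placeholders for illustration, not any specific model's API:

```python
def answer(video_frames, question, vision_encoder, projector, llm):
    """Generic MLLM flow: encode frames, align them to the text space, then generate."""
    visual_feats = [vision_encoder(f) for f in video_frames]   # (1) read frames
    visual_tokens = [projector(v) for v in visual_feats]       # (2) map into the LLM's token space
    prompt_tokens = llm.tokenize(question)                     # (3) read the question
    inputs = visual_tokens + [prompt_tokens]                   # (4) combine vision + text
    return llm.generate(inputs)                                # (5) answer token by token
```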
Hook: You know how a flipbook shows pictures that move when you flip the pages? That's time added to 3D space: 4D.
The Concept (4D Understanding):
- What it is: Knowing where things are in 3D and how they change over time.
- How it works: (1) Sense depth (near/far). (2) Sense motion (which way/how fast). (3) Track objects frame to frame. (4) Use timing to measure speed/acceleration. (5) Answer questions using all of the above. (See the worked sketch after this list.)
- Why it matters: Without 4D, the model can't tell if something moved closer, how fast it went, or when it happened. Anchor: "At 2 seconds, what's the acceleration of ⟨R1⟩?" needs 3D movement and the exact time.
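To make step (4) concrete, here is a minimal sketch (made-up positions and times, not data from the paper) of how 3D positions plus timestamps turn into speed and acceleration via finite differences:

```python
import numpy as np

# Made-up 3D positions (meters) of a tracked region at three timestamps (seconds).
positions = np.array([[0.0, 0.0, 5.0],
                      [0.5, 0.0, 4.6],
                      [1.2, 0.0, 4.0]])
times = np.array([0.0, 0.5, 1.0])

# Velocity between consecutive frames: change in position / change in time.
velocities = np.diff(positions, axis=0) / np.diff(times)[:, None]
speeds = np.linalg.norm(velocities, axis=1)        # scalar speed per interval (m/s)

# Acceleration: change in velocity over the gap between interval midpoints.
dt_mid = 0.5 * (times[2] - times[0])
acceleration = np.linalg.norm(velocities[1] - velocities[0]) / dt_mid

print(speeds, acceleration)
```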
Hook: When you say, "That player," you usually point to them so nobody gets confused.
The Concept (Region-level Understanding):
- What it is: A way to tell the AI exactly which part of the image/video to focus on (like a highlighted box ⟨R1⟩).
- How it works: (1) Mark a region. (2) Bind the question to that region. (3) Track that same region across frames. (4) Answer only about that region.
- Why it matters: Without regions, the AI may guess the wrong person or object, especially if there are many similar ones. Anchor: "How many times did ⟨R1⟩ hit ⟨R2⟩ upward?" only counts hits by those marked regions.
Hook: Hold your hand close to your face. It looks big; far away, it looks small. That feeling is depth.
The Concept (Depth Perception):
- What it is: Estimating how near or far each pixel is.
- How it works: (1) Read visual cues (size, texture, shading). (2) Predict a depth value per pixel. (3) Compare depth across frames. (4) Use it for distances and speeds.
- Why it matters: Without depth, "closer vs. farther" questions fail, and speed can't be computed correctly. Anchor: "Is ⟨R1⟩ moving toward the camera?" depends on depth changing over time (sketched below).
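As a hedged illustration of that anchor question, here is a minimal sketch (toy depth values and a toy region mask, not the paper's code) that checks whether a marked region is approaching by comparing its mean depth across frames:

```python
import numpy as np

# Assumed inputs: per-frame depth maps (meters) and a boolean mask for region <R1>.
depth_t0 = np.full((4, 4), 5.0)    # toy 4x4 depth map at t = 0
depth_t1 = np.full((4, 4), 4.2)    # toy depth map at t = 1: everything is nearer
region_mask = np.zeros((4, 4), dtype=bool)
region_mask[1:3, 1:3] = True       # the marked region occupies the center

mean_d0 = depth_t0[region_mask].mean()
mean_d1 = depth_t1[region_mask].mean()

# Decreasing mean depth means the region is moving toward the camera.
print("approaching" if mean_d1 < mean_d0 else "receding or static")
```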
Hook: If you look at two photos taken a moment apart, you can draw little arrows showing where pixels moved. That's optical flow.
The Concept (Optical Flow):
- What it is: A per-pixel map of how things shift between frames.
- How it works: (1) Compare frame A and frame B. (2) Estimate arrows (vectors) for each pixel. (3) Summarize direction and speed of motion. (4) Use it to track and measure movement.
- Why it matters: Without optical flow, the AI struggles to tell subtle motion or compute path length. Anchor: Counting how many times ⟨R1⟩ dribbles needs precise motion cues between frames (see the sketch below).
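To ground the "arrows per pixel" idea, here is a small sketch (toy values, not the paper's implementation) that averages a flow field inside a region mask to get a motion direction and a pixels-per-frame speed:

```python
import numpy as np

# Assumed input: a dense flow field of shape (H, W, 2) holding (dx, dy) per pixel.
flow = np.zeros((4, 4, 2))
flow[..., 0] = 2.0                 # toy flow: every pixel shifts 2 px right
flow[..., 1] = -1.0                # ... and 1 px up (negative y, assuming image y points down)
region_mask = np.zeros((4, 4), dtype=bool)
region_mask[1:3, 1:3] = True       # the marked region occupies the center

mean_flow = flow[region_mask].mean(axis=0)         # average (dx, dy) inside the region
speed_px = np.linalg.norm(mean_flow)               # pixels moved per frame
angle_deg = np.degrees(np.arctan2(mean_flow[1], mean_flow[0]))
print(mean_flow, speed_px, angle_deg)
```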
The world before: MLLMs were great at captions and general Q&A, but they often stumbled on 4D tasks, like computing speed or tracking a specific object over time, especially when scenes were dynamic (both camera and objects moving). Training methods like supervised fine-tuning or reinforcement learning mostly adjusted the final text output, not the model's inner sense of depth and motion. Other attempts plugged in external 3D modules during inference, which helped with static scenes but made systems slower and didn't fully solve 4D timing.
The problem: We need a model that can truly perceive depth and motion over time, and answer about a marked region, without becoming slow or clunky at test time.
Failed attempts: (1) Just fine-tuning on text answers couldn't teach low-level 4D perception. (2) Reinforcement learning improved some skills but lacked ground-truth 4D signals like depth. (3) Attaching external 3D/4D modules at inference added cost and mostly helped static videos.
The gap: A way to inject true 4D perceptual knowledge (depth, flow, motion) into the model's own features during training only, so test-time stays fast.
Real stakes: In driver assistance, you must ask about the exact car ahead: how far and how fast it's approaching. In robotics, you need to check if the robot's gripper (not the box) is moving toward the mug, and how quickly. In industrial inspection, you track the same part over time and measure distances. Without region-level 4D understanding, the AI guesses or generalizes to the wrong object, which can be unsafe.
02 Core Idea
Hook: Think of a swim coach standing at the poolside. They don't jump in every race, but their training helps swimmers go faster without extra help later.
The Concept (Perceptual 4D Distillation, P4D):
- What it is: A way to teach a language+vision model low-level 4D skills (depth, flow, motion) by learning from a frozen 4D expert during training only.
- How it works: (1) Run a frozen teacher that already knows 4D perception. (2) Ask the student model to match the teacher's hidden 4D features (latent distillation). (3) Also match the teacher's clear signals like depth/flow maps (explicit distillation). (4) Train end-to-end with both these losses plus normal text supervision. (5) Throw away the training-only parts, so there is no extra cost at test time. (These steps are written as a single objective below.)
- Why it matters: Without P4D, the model can talk but doesn't truly perceive 4D; with P4D, it learns to feel motion and depth inside its own features. Anchor: After P4D, when asked "What's the average speed of ⟨R1⟩?", the model uses learned depth and timing to compute the right answer.
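Reading those five steps as one training objective, a plausible shorthand (our notation, not the paper's exact formulation) is a weighted sum of the usual text loss and the two distillation terms, where the distillation terms and the teacher exist only during training:

```latex
\mathcal{L}_{\mathrm{train}}
  = \mathcal{L}_{\mathrm{text}}
  + \lambda_{\mathrm{lat}}\,\mathcal{L}_{\mathrm{latent}}\!\big(z^{S},\, z^{T}\big)
  + \lambda_{\mathrm{exp}}\,\mathcal{L}_{\mathrm{explicit}}\!\big(\hat{D},\hat{F},\hat{M},\hat{R}\,;\, D^{T},F^{T},M^{T},R^{T}\big)
```

Here z^S and z^T are the student's and teacher's latent 4D features, D/F/M/R stand for depth, flow, motion mask, and camera rays (student predictions vs. teacher targets), and the lambda weights are assumed balancing factors. At inference, only the text branch runs.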
Hook: When you watch a movie, timestamps (like 1:05, 1:06) help you say exactly when something happened.
The Concept (Timestamp Positional Encoding, TPE):
- What it is: A time code added to each frame's features so the model knows when that frame occurred.
- How it works: (1) Convert each frame's timestamp into a sine/cosine vector (see the sketch below). (2) Add it to the visual features before fusion with text. (3) Let the model use time differences to compute speeds/acceleration. (4) Improve consistency across different frame rates.
- Why it matters: Without TPE, the model often misjudges duration and timing, breaking speed/acceleration answers. Anchor: With TPE, "At 2.0 sec, what's the acceleration of ⟨R1⟩?" becomes answerable because the model knows what "2.0 sec" means inside its features.
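As referenced above, here is a minimal sketch of a sinusoidal timestamp encoding. It follows the standard Transformer sine/cosine recipe as an assumption; the paper's exact frequencies and scaling may differ:

```python
import torch

def timestamp_positional_encoding(timestamps: torch.Tensor, dim: int) -> torch.Tensor:
    """Map per-frame timestamps (seconds) to sine/cosine vectors of size `dim` (assumed even)."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32)
                      * (torch.log(torch.tensor(10000.0)) / half))
    angles = timestamps[:, None] * freqs[None, :]                      # (num_frames, half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # (num_frames, dim)

# Usage: add the time code to each frame's visual features before fusing with text.
timestamps = torch.tensor([0.0, 0.5, 1.0, 1.5])            # 4 frames sampled at 2 fps
frame_feats = torch.randn(4, 16, 256)                      # (frames, tokens_per_frame, feature_dim)
tpe = timestamp_positional_encoding(timestamps, dim=256)   # (frames, feature_dim)
frame_feats = frame_feats + tpe[:, None, :]                # broadcast the time code over tokens
```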
Hook: Imagine a student (the talker) learning from a science lab assistant (the measurer). The student learns to describe and to measure.
The Concept (4D-RGPT, the Student Model):
- What it is: A specialized multimodal LLM that learns 4D perception internally via P4D and uses TPE for time.
- How it works: (1) Take video frames + question. (2) Encode visuals and text. (3) Use a training-only 4D decoder to form latent 4D features. (4) Predict explicit depth/flow/motion during training to match the teacher. (5) Answer questions at test time without extra modules.
- Why it matters: Without a student that internalizes 4D, answers stay fuzzy; with 4D-RGPT, the model becomes precise on region-level 4D tasks. Anchor: When asked "How many times did ⟨R1⟩ hit ⟨R2⟩ upward?", 4D-RGPT follows the marked regions across time and counts correctly.
Three analogies for the same idea:
- Coach and athlete: The 4D teacher coaches 4D-RGPT in practice (training), so the athlete competes alone on race day (inference) just as fast.
- Training wheels: The teacher's depth/flow maps are training wheels that teach balance (4D feel). They're removed once the rider is steady.
- Subtitles for time: TPE is like on-screen timestamps; the story (video) becomes easier to follow and measure.
Before vs. After:
- Before: The model could describe scenes but mixed up timing and struggled with exact regions and speeds.
- After: It tracks the right region, understands near/far and motion, and answers with correct numbers.
Why it works (intuition):
- Matching latent features shapes the student's internal sense of 4D structure (a good "feel").
- Matching explicit signals (depth/flow/motion) gives clear, per-pixel targets (a good "ruler").
- TPE gives stable timing, turning displacement into speed/acceleration.
- All of this happens during training, so test-time stays lightweight.
Building blocks:
- Student core (4D-RGPT) with training-only 4D perception heads.
- P4D's two branches: latent distillation (abstract alignment) + explicit distillation (signal alignment).
- TPE to tie frames to real time.
- Region binding so questions act on exactly ⟨R1⟩, ⟨R2⟩.
Together, these pieces convert a chatty model into a careful measurer of 4D reality without slowing it down.
03 Methodology
High-level recipe: Input video + question → (A) Visual encoding with timestamps → (B) Training-only 4D perception inside the student → (C) Distill from a frozen 4D teacher (latent + explicit) → (D) Language model answers → Output.
Step A: Add time and encode the video.
- What happens: Each frame gets a Timestamp Positional Encoding (TPE) added to its visual features, then a projector aligns visuals with text.
- Why it exists: Without explicit time, the model confuses durations and ruins speed/acceleration calculations.
- Example: Frames at 0.0s, 0.5s, 1.0s each carry a distinct time code so "At 2.0s..." questions map to the right moment.
Step B: Build training-only 4D perception inside the student.
- What happens: A small decoder on top of the model's hidden states forms latent 4D features (an internal 4D summary). Heads predict explicit maps: depth (near/far per pixel), optical flow (per-pixel motion), motion mask (moving vs. static), and camera rays. (A structural sketch follows this step.)
- Why it exists: Learning to measure during training makes the language modelās insides sensitive to 4D cues.
- Example: Watching a runner, the depth map shows distance to the camera, and flow arrows show direction; the model learns both.
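The structure described in Step B could look roughly like the sketch below: a small decoder over the LLM's hidden states plus lightweight heads for depth, flow, motion mask, and camera rays. All names, sizes, and layer choices here are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class TrainingOnly4DHeads(nn.Module):
    """Training-only decoder + prediction heads; dropped entirely at inference time."""

    def __init__(self, llm_dim: int = 4096, feat_dim: int = 256):
        super().__init__()
        self.decoder = nn.Sequential(                 # latent 4D features from LLM hidden states
            nn.Linear(llm_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, feat_dim))
        self.depth_head = nn.Linear(feat_dim, 1)      # near/far value per visual token
        self.flow_head = nn.Linear(feat_dim, 2)       # (dx, dy) motion per visual token
        self.motion_head = nn.Linear(feat_dim, 1)     # moving-vs-static logit per visual token
        self.ray_head = nn.Linear(feat_dim, 3)        # camera ray direction per visual token

    def forward(self, hidden_states: torch.Tensor) -> dict:
        # hidden_states: (batch, frames, tokens, llm_dim) at the visual positions of the LLM.
        z = self.decoder(hidden_states)               # latent 4D features for latent distillation
        return {
            "latent": z,
            "depth": self.depth_head(z).squeeze(-1),
            "flow": self.flow_head(z),
            "motion": self.motion_head(z).squeeze(-1),
            "rays": self.ray_head(z),
        }
```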
Hook: You know how a science teacher shows you the right answer key, and you compare your work line by line?
The Concept (Teacher-Student Distillation):
- What it is: A frozen expert 4D model (teacher) supervises the student during training.
- How it works: (1) The teacher produces internal 4D features and explicit maps from the same video. (2) The student predicts its own versions. (3) Losses pull student outputs toward the teacher's. (4) After training, only the student is used.
- Why it matters: Without a teacher, the student guesses; with a teacher, the student learns precise 4D perception. Anchor: The teacher's depth map for frame 10 becomes the target that shapes the student's predicted depth for frame 10.
Step C: Distill knowledge in two ways.
- Latent distillation (abstract):
  - What: Match the student's latent 4D features to the teacher's intermediate 4D embeddings.
  - Why: Gives broad, structural guidance beyond pixel-by-pixel details.
  - Example: Even if the exact depth number is a bit off, the student learns the scene's overall 3D layout.
- Explicit distillation (concrete):
  - What: Match predicted depth, flow, motion, and camera-ray maps to the teacher's.
  - Why: Gives clear, interpretable supervision for near/far and movement.
  - Example: The student aligns its flow arrows with the teacher's to capture precise motion. (Both losses are sketched in code below.)
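Putting Step C into code, a minimal sketch of the two losses (our simplification: cosine alignment for the latent branch and L1 for the explicit maps; the paper's exact loss terms and weights may differ) could look like this:

```python
import torch
import torch.nn.functional as F

def p4d_distillation_losses(student: dict, teacher: dict) -> torch.Tensor:
    """Combine latent and explicit distillation; the teacher is frozen, so its outputs are detached."""
    # Latent distillation: align the student's latent 4D features with the
    # teacher's intermediate embeddings (here via cosine similarity).
    latent_loss = 1.0 - F.cosine_similarity(
        student["latent"], teacher["latent"].detach(), dim=-1).mean()

    # Explicit distillation: match the clear per-pixel/per-token signals.
    explicit_loss = sum(
        F.l1_loss(student[k], teacher[k].detach())
        for k in ("depth", "flow", "motion", "rays"))

    return latent_loss + explicit_loss
```

The total training loss then adds the standard next-token language loss from Step D; at test time the teacher and these heads are removed.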
Step D: Train end-to-end with text answers.
- What happens: Alongside distillation, the usual language loss teaches the model to produce the correct answer choice.
- Why it exists: We want a measurer who can also explain, so the answer text stays grounded in the learned 4D perception.
- Example: For "How many upward hits?", the student's flow helps count; the language head picks the right option.
Hook: When you label things on a picture with small tags like 1, 2, 3, everyone knows who's who.
The Concept (Set-of-Marks, SoM):
- What it is: A simple way to display tiny numbered markers on the first frame so questions can refer to regions as ⟨R1⟩, ⟨R2⟩.
- How it works: (1) Detect or segment objects. (2) Place markers and names. (3) Replace nouns in questions with region tokens (sketched below). (4) Verify the match.
- Why it matters: Without clear markers, region-level questions are ambiguous. Anchor: "How far is ⟨R1⟩ from ⟨R2⟩ at 7s?" means the model compares the exact two marked regions.
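As a toy illustration of steps (2) and (3), here is a sketch (made-up detections and question, not the benchmark's actual pipeline) that maps detected objects to numbered markers and rewrites a question to use region tokens; "<R1>" stands in for the ⟨R1⟩ marker token:

```python
# Assumed detections on the first frame: object name -> marker index.
detections = {"red car": 1, "white van": 2}

def to_region_token(idx: int) -> str:
    return f"<R{idx}>"          # the token tied to the small numbered marker on the frame

question = "How far is the red car from the white van at 7s?"
for name, idx in detections.items():
    question = question.replace(name, to_region_token(idx))

print(question)  # "How far is <R1> from <R2> at 7s?"
```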
Putting it together with an example:
- Input: A 16-frame clip of two cars; question: "At 3.0 sec, what is the instantaneous speed of ⟨R1⟩?"
- Process: Frames get TPE; the student's 4D heads predict depth/flow while the teacher supplies targets; latent and explicit losses shape the student; the language head chooses the right option.
- Output: The correct speed choice (like 3.74 m/s).
The secret sauce:
- Dual distillation: Latent (shape the inner sense) + explicit (teach exact measuring) makes the student robust and accurate.
- TPE: A low-cost, always-on time sense that fixes common mistakes about durations.
- No inference overhead: All special 4D predictors are training-only, so test-time speed is like a normal MLLM.
What breaks without each step:
- No TPE → wrong durations → wrong speeds/accelerations.
- No latent distillation → weak global 4D structure → brittle answers.
- No explicit distillation → fuzzy depth/flow → bad numbers.
- No region markers → answers about the wrong object.
Training data and practice:
- The model is trained on mixed 3D/4D Q&A from several sources (robotics, driving, scene understanding) so it learns many motion/depth patterns.
- Ablations show depth and flow signals help the most; combining latent+explicit is best; and TPE beats text or burned-in timestamp hacks.
04 Experiments & Results
Hook: Think of a spelling bee. To prove you're good, you spell lots of words on many stages against strong opponents.
The Concept (R4D-Bench, a New Stage):
- What it is: A benchmark of region-level 4D questions on dynamic videos with clear region markers and multiple-choice answers.
- How it works: (1) Start from existing 4D datasets. (2) Detect/segment objects. (3) Mark regions and rewrite questions with ⟨R1⟩, ⟨R2⟩. (4) Verify by humans. (5) Cover tasks like translational/rotational motion, counting, false positives, 3D grounding, dimensions, path length, speed/acceleration.
- Why it matters: Without a test that targets exact regions in motion, we can't measure whether models truly do region-level 4D. Anchor: "How many times did ⟨R1⟩ hit ⟨R2⟩ upward?" or "At 2.0 sec, what's the acceleration of ⟨R1⟩?"
The tests:
- Non-region 3D/4D benchmarks: STI-Bench, VLM4D-real, OmniSpatial, MMSI-Bench, SAT, and VSTI-Bench.
- New region-level benchmark: R4D-Bench (1,517 questions; static and dynamic tasks).
The competition:
- Strong proprietary and open models: GPT-4o, GPT-5, Gemini 2.5-Pro, Qwen2.5-VL, SpaceR, SpatialReasoner, ViLaSR, and NVILA-Lite baselines.
Scoreboard with context:
- On six existing 3D/4D tests, 4D-RGPT lifts average accuracy by about +5.3% over its baseline. Think moving from a solid B to a B+/A- across many classes.
- On R4D-Bench, 4D-RGPT scores +4.3% on average above its baseline and shines especially on dynamic tasks like Speed & Acceleration and Displacement & Path Length, like being the only kid in class who both explains and measures motion correctly.
- Against similar-sized open models, 4D-RGPT is state-of-the-art; it even challenges larger proprietary systems in some categories.
Concrete examples:
- Where others guessed "left" or "right," 4D-RGPT correctly answered "not moving" by using learned depth+flow.
- For "At 2.0 sec, what's the acceleration?", 4D-RGPT's TPE and explicit motion signals led it to the right numeric choice.
Surprising findings:
- A model tuned for non-region VQA (SpaceR) fell behind on R4D-Bench, showing that region binding is a special (and necessary) skill.
- Depth and optical flow supervision mattered most; adding motion masks and camera rays helped but less.
- Explicit time cues via TPE beat text prompts or timestamps burned into the video frames, without distracting the visual reasoning.
Why these results matter:
- They show that teaching 4D perception inside the model (during training) pays off across many datasets without slowing inference.
- The dual distillation is complementary: latent shapes the inner map; explicit sharpens the measuring stick.
- Region-level tests reveal gaps that non-region tests miss; this is exactly the kind of precision real-world apps need.
05 Discussion & Limitations
Limitations:
- Teacher dependence: If the frozen 4D teacher has blind spots, the student can inherit them.
- Domain shift: Training data covers many scenes, but totally new environments (e.g., underwater footage, thermal cameras) may require adaptation.
- Region annotations: R4D-Bench uses careful pipelines and human checks, yet extremely crowded scenes can still be tricky.
- Fine-grained physics: Very subtle accelerations or complex rotations beyond typical camera/video noise can remain challenging.
Required resources:
- A capable 4D teacher (like L4P) and compute for joint training with distillation.
- Datasets that include varied motion/depth patterns.
- For best results, tuning the projector and language model while keeping the vision encoder frozen (as found in ablations).
When not to use:
- If you only need static image captions or generic scene descriptions, this 4D specialization may be overkill.
- If you must run on ultra-tiny devices with no training budget: P4D's benefits require a training phase with a teacher.
- If your task relies on modalities unseen by the teacher (like LiDAR-only or non-visual sensors) without corresponding training.
Open questions:
- Can we make the student learn from multiple teachers (e.g., different 4D experts) for even broader 4D skills?
- How far can we push region-level 4D beyond bounding masks, e.g., to articulated parts or deformable objects?
- Can we extend TPE to handle very long videos with variable frame rates and dropped frames, while staying robust?
- What's the best curriculum to balance latent vs. explicit distillation across tasks and data scales?
- Can the student eventually teach itself (self-distillation) to reduce reliance on external teachers?
06 Conclusion & Future Work
Three-sentence summary:
- This work makes a language+vision model truly 4D-aware by distilling depth and motion perception from a frozen expert while adding timestamps so it knows when things happen.
- The student (4D-RGPT) learns with two lessons, latent and explicit distillation, so it both feels and measures 4D, then answers region-level questions quickly at test time.
- A new benchmark (R4D-Bench) proves the method's strength on dynamic, region-targeted tasks that mirror real-world needs.
Main achievement:
- Showing that training-only perceptual distillation plus timestamp encoding can turn a chatty MLLM into a precise 4D reasoner with no extra inference cost.
Future directions:
- Broaden teachers and data to cover more motions and sensors; explore longer, messier videos; and refine region grounding for crowded, real-world scenes.
Why remember this:
- It's a clear recipe for giving models an internal sense of space and time, focused on the exact object you care about, just like how we watch, track, and measure plays in a game. This shift unlocks safer robots, sharper analytics, and smarter assistants that don't just see scenes; they understand how they unfold.
Practical Applications
- Advanced driver assistance: Ask about the exact car ahead (distance now, speed, and whether it's approaching).
- Robotics safety: Verify that the robot's gripper (⟨R1⟩) moves toward the target part (⟨R2⟩) at a safe rate.
- Sports analytics: Count how many upward hits a player made and measure ball speed changes over time.
- Industrial inspection: Track a specific part's displacement and orientation changes across frames.
- AR/VR guidance: Anchor instructions to marked objects and adapt to their motion in real time.
- Surveillance triage: Focus on the marked person and estimate path length or acceleration for alerts.
- Education tools: Let students ask region-linked questions to learn motion, velocity, and distance concepts interactively.
- Video editing and VFX: Precisely follow a selected object's path and speed to align effects or overlays.
- Human-robot collaboration: Region-tag tools and parts; ask time-based safety-check questions before actions.
- Medical training videos: Track a marked instrument's motion and measure timing during procedures (with proper approvals).