TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels
Key Summary
- TrackingWorld turns a regular single-camera video into a map of where almost every pixel moves in 3D space over time.
- It first grows a few tracked dots into dense tracks for all frames, so even new objects that appear later are followed.
- Then it carefully figures out how the camera itself moved, separate from how things in the scene moved.
- It uses an "as-static-as-possible" trick to avoid being fooled by moving stuff in the background when estimating the camera path.
- After that, it lifts all 2D tracks into world-centered 3D paths using depth and the recovered camera poses.
- It runs a smart optimization (like fine-tuning a puzzle) so the 3D tracks line up with what the video shows and with the depth.
- Across several datasets, it beats previous methods on camera pose accuracy, 3D depth along tracks, and sparse 3D tracking quality.
- The method works with different depth and motion-mask tools and still stays strong, showing it is robust.
- The upsampler gives dense, accurate 2D tracks faster than directly predicting dense flow, making the whole system practical.
- The final result is long-term, world-centric 3D motion for almost all pixels, which is great for video editing, AR, robotics, and motion analysis.
Why This Research Matters
World-centric, dense 3D tracking from a single camera lets phones, drones, and robots truly understand how everything moves in a scene. Video editors can place effects that never drift, AR apps can anchor virtual objects that don't jitter, and sports analysts can measure precise player and ball trajectories. It also improves safety and navigation for robots or autonomous devices by separating camera shake from real object motion. Because it works with common videos, not special rigs, this tech reaches everyday creators and consumers. In short, it upgrades the motion "sense" of any system that watches the world through a single lens.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine filming a school play with your phone. Later, you want to know exactly where every costume, prop, and actor moved in 3D, not just in the picture. That would help you make special effects that stick perfectly to people and objects.
Concept 1 – Monocular 3D Tracking
- What it is: Monocular 3D tracking means following points through time in 3D using only one camera.
- How it works:
- Track where pixels move between frames in 2D.
- Use depth hints to turn those 2D paths into 3D paths.
- Keep them consistent over many frames so the motion makes sense.
- Repeat for lots of pixels.
- Why it matters: Without it, effects slide around, robots get confused, and videos look wobbly when edited.
- Anchor: If you place a virtual hat on someone's head in a video, 3D tracking keeps the hat glued in the right place, even when they move.
The world before: For years, computers got good at 2D tracking (following dots on the screen), and some methods started doing 3D, but usually in the camera's own view (camera-centric). Many assumed the camera stayed still. That's a big problem in real life, because phone videos almost always have camera motion.
Concept 2 – Camera Motion Estimation
- What it is: Figuring out how the camera moved (its position and angle) for every frame.
- How it works:
- Find parts of the scene that don't move (the background).
- Use depth to lift those points into 3D.
- Adjust the camera pose so those background points reproject correctly in each frame.
- Do this across the video so the camera path is smooth and accurate.
- Why it matters: If you mix up camera motion with object motion, you think the world is moving when it's really just the camera. (A toy sketch follows below.)
- Anchor: When riding a bus, trees look like they're sliding by, but really it's you moving. Camera motion estimation learns "you're on the bus."
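To make the reprojection step concrete, here is a minimal sketch, not the paper's implementation: static background points are lifted to 3D, and a single camera pose (axis-angle rotation plus translation) is fitted so their projections land on the tracked pixels. The intrinsics `K` and the toy data are made up for illustration.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])  # assumed intrinsics

def project(pts_world, rvec, tvec):
    """Project world-frame 3D points into the image under pose (rvec, tvec)."""
    pts_cam = Rotation.from_rotvec(rvec).apply(pts_world) + tvec
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]

def residuals(params, pts_world, uv_obs):
    return (project(pts_world, params[:3], params[3:]) - uv_obs).ravel()

# Toy static background: 3D points and their observed pixels under a small motion.
rng = np.random.default_rng(0)
pts = rng.uniform([-1, -1, 4], [1, 1, 8], size=(50, 3))
true_r, true_t = np.array([0.02, -0.01, 0.0]), np.array([0.1, 0.0, 0.05])
uv = project(pts, true_r, true_t) + rng.normal(0, 0.2, (50, 2))  # pixel noise

fit = least_squares(residuals, x0=np.zeros(6), args=(pts, uv))
print("recovered rotation/translation:", fit.x.round(3))
```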
Concept 3 – Dynamic Object Tracking
- What it is: Following the 3D motion of things that actually move, like people, cars, or a bouncing ball.
- How it works:
- Use masks to guess which pixels are moving.
- Track those pixels over time.
- Turn their 2D paths into 3D using the camera poses and depth.
- Keep the motion realistic and smooth.
- Why it matters: Without it, moving stuff becomes blurry and misaligned in 3D.
- Anchor: If someone waves their hand, dynamic tracking tells you the hand moved, not the whole world.
The problem: Past 3D tracking often (1) didn't separate camera motion from object motion (they pretended the camera stayed still), and (2) could only track a few pixels chosen in the first frame. New objects entering later in the video weren't handled well, and dense tracking everywhere was too slow.
Failed attempts: Feed-forward methods that predict 3D motion directly from features were fast but struggled with long videos (they drift) or couldn't disentangle camera motion from object motion. Optimization-based ones could be more accurate but didn't scale to dense tracks across all frames, or still mixed motions together.
Concept 4 – World-centric Coordinate System
- What it is: A single, shared 3D map for the whole scene, independent of the camera.
- How it works:
- Recover camera poses through time.
- Keep one fixed world frame (like a stage) where everything's measured.
- Express all tracks and motions in this world frame.
- Separate camera motion (how the viewer moves) from object motion (how actors move on stage).
- Why it matters: Without a world frame, camera shakes look like objects wiggling.
- Anchor: In sports replay, the field is the same for every camera angle. That's world-centric thinking.
The gap: We needed a method that could (a) track almost all pixels in every frame (even new objects), and (b) cleanly split camera motion from object motion in a world frame. That's what this paper delivers.
Concept 5 – Tracking Upsampler
- What it is: A tool that turns a few tracked points into dense tracking over nearly all pixels.
- How it works:
- Start with sparse 2D tracks and their features.
- Predict weights that say how each dense pixel should follow nearby sparse tracks.
- Combine them to produce dense tracks across the frame.
- Repeat for each frame.
- Why it matters: Without upsampling, dense tracking would be too slow or memory-heavy.
- Anchor: Like asking a few classmates where to sit and then using their advice to quickly assign seats to everyone else.
Real stakes: Better world-centric 3D tracking powers steadier video edits, AR that doesn't jitter, robots that navigate safely, and sports/game analytics that understand motion precisely. It makes everyday phone videos ready for pro-level effects.
02 Core Idea
Hook: You know how a good choreographer watches a whole stage: the dancers, the props, and even the moving spotlight, all at once? They need to know who moved, who stayed, and where the stage is.
Aha in one sentence: Track almost every pixel in every frame, remove redundant ones, then jointly estimate the camera path and lift all tracks into one world-centered 3D stage, using an "as-static-as-possible" trick so camera motion and object motion are cleanly separated.
Three analogies:
- Theater stage: The world is the fixed stage, dancers are dynamic objects, and the roaming cameraman is the camera. We first follow everyone (dense tracking), then figure out the cameraman's path, and finally map every dancer's moves on the stage.
- Map and travelers: First, list where everyone went (dense 2D routes). Next, correct for the moving bus (camera) so the routes sit on the city map (world). This way, you know the real paths, not bus-shifted ones.
- Sticker book: You put stickers (tracks) on everything in every page (frame). Then you straighten the book's tilt (camera pose) so the sticker motion shows real object movement, not page wobble.
Concept 6 – Dense 2D Tracking (for every frame)
- What it is: Following nearly all pixels, not just a few, across the whole video.
- How it works:
- Start with sparse tracks.
- Upsample to dense tracks per frame.
- Repeat for each frame to catch new objects.
- Filter out overlapping regions to avoid duplicates.
- Why it matters: Without tracking every frame, new objects and surfaces aren't captured.
- Anchor: A crowd enters the scene in frame 10; dense tracking means you still catch them.
Before vs After:
- Before: Methods often tracked only from the first frame, assumed a static camera, and mixed up camera and object motion.
- After: We grow tracks in all frames, filter redundancy, explicitly solve for camera motion, and place everything in a world frame. Result: cleaner, longer, denser, and more reliable 3D motion.
Why it works (intuition):
- When you track densely and everywhere, you have more geometric clues.
- If you treat the background as "as-static-as-possible," true movers in the background can be detected and won't mislead the camera estimation.
- Bundle adjusting camera poses and tracks together aligns what the video shows with what the 3D model believes.
Building blocks (key pieces):
- Upsampler to go from sparse to dense.
- Track every frame and remove redundant overlaps.
- Initial camera pose from coarse static regions + depth.
- Dynamic background refinement with an "as-static-as-possible" offset so background motion doesn't break pose estimation.
- Lift 2D tracks to 3D using recovered camera poses and depths.
- Optimize with geometric constraints (reprojection, depth consistency, rigidity, and temporal smoothness) to keep motion realistic.
Concept 7 – Redundancy Filtering
- What it is: Throwing away dense tracks in areas already well covered.
- How it works:
- Check if a pixel lies close to an older track's path.
- If yes, skip adding a new track there.
- Keep only new or poorly covered regions.
- Remove tiny isolated bits (components) that are probably noise.
- Why it matters: Without filtering, the system gets slow and memory-heavy with little accuracy gain. (Sketched below.)
- Anchor: If three friends already measured a wall, you don't need ten more measuring the same spot.
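Below is a minimal sketch of the coverage test, with a made-up `radius` threshold and without the tiny-component cleanup: new tracks are only started at pixels farther than the radius from every existing track point in the current frame.

```python
import numpy as np
from scipy.spatial import cKDTree

def filter_new_queries(candidate_px, existing_px, radius=4.0):
    """Keep only candidate pixels not already covered by existing tracks."""
    if len(existing_px) == 0:
        return candidate_px
    dists, _ = cKDTree(existing_px).query(candidate_px)  # nearest existing track
    return candidate_px[dists > radius]                  # keep uncovered pixels

# Candidates on a coarse grid; existing tracks cover the left half of the frame.
ys, xs = np.mgrid[0:480:8, 0:640:8]
candidates = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
rng = np.random.default_rng(0)
existing = np.stack([rng.uniform(0, 320, 2000), rng.uniform(0, 480, 2000)], axis=1)
kept = filter_new_queries(candidates, existing)
print(f"kept {len(kept)} of {len(candidates)} candidate pixels")
```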
Concept 8 – World-centric Lifting (Back-projection + Reprojection)
- What it is: Turning 2D points into 3D world points using depth and camera poses.
- How it works:
- Unproject 2D+depth into 3D.
- Place 3D points in the world frame via the camera pose.
- Reproject to check against image positions.
- Adjust camera and 3D points to reduce error.
- Why it matters: Without lifting, you only know where pixels move on the screen, not in the world. (See the round-trip sketch below.)
- Anchor: A dot at (x, y) with depth becomes a little 3D pebble you can place on the stage.
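Here is a minimal sketch of the lift-and-check round trip, assuming intrinsics `K` and a toy world-to-camera pose `(R, t)`: a pixel plus its depth is unprojected into the world frame, then reprojected to confirm it lands back on the same pixel.

```python
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])  # assumed intrinsics
R = np.eye(3)                       # toy world->camera rotation
t = np.array([-0.2, 0.0, 0.0])      # toy world->camera translation

def unproject_to_world(uv, depth):
    """Pixel + depth -> 3D point in the world frame."""
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    pt_cam = ray * depth             # 3D point in the camera frame
    return R.T @ (pt_cam - t)        # invert the world->camera pose

def reproject(pt_world):
    """World point -> pixel, used to measure reprojection error."""
    uvw = K @ (R @ pt_world + t)
    return uvw[:2] / uvw[2]

pebble = unproject_to_world((400.0, 260.0), depth=3.5)
print("round-trip pixel:", reproject(pebble))  # ~ (400, 260) again
```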
Bottom line: The clever combo of dense-everywhere tracking, pose recovery with an "as-static-as-possible" guardrail, and world-frame optimization turns messy 2D motion into clean, reliable, world-centric 3D motion for almost all pixels.
03 Methodology
High-level recipe: Video → (Sparse tracks, depth, dynamic masks) → Upsample to dense 2D tracks for all frames + filter overlaps → Initial camera poses from "static" regions → Dynamic background refinement with "as-static-as-possible" offsets → Lift to world 3D and optimize static + dynamic tracks → Output camera poses and dense world-centric 3D trajectories.
Inputs: A single video; foundation-model estimates of sparse 2D tracks (e.g., DELTA/CoTrackerV3), monocular depth (e.g., UniDepth), and coarse dynamic masks (e.g., VLM+Grounded-SAM or Segment Any Motion).
Step A ā Preprocessing with foundation models
- What happens: Run a point tracker to get sparse tracks, a depth estimator for per-frame depth, and a dynamic-mask tool to mark likely moving regions.
- Why it exists: We need initial motion, depth, and motion-segmentation cues to start; they don't have to be perfect.
- Example: If a ball rolls in at frame 8, the mask hints it's dynamic, and the depth gives its distance in each frame.
Concept 9 – Sparse-to-Dense Upsampling
- What it is: Expanding a small set of 2D tracks into dense tracks across the image.
- How it works:
- Take sparse tracks and their features.
- Predict weights linking each dense pixel to nearby sparse tracks.
- Combine neighbors with those weights to produce the dense track per pixel.
- Do this per frame.
- Why it matters: It's far faster than predicting dense tracks from scratch, and just as accurate. (A sketch follows below.)
- Anchor: Like repainting a mural by using a few guide dots and smoothly filling in the rest.
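A minimal sketch of the idea: each dense pixel blends the motion of its k nearest sparse tracks. The distance-based softmax weights below are a stand-in for illustration; the actual upsampler predicts the weights from track features.

```python
import numpy as np
from scipy.spatial import cKDTree

def upsample_tracks(sparse_xy, sparse_disp, dense_xy, k=4, tau=10.0):
    """Blend the displacements of the k nearest sparse tracks per dense pixel."""
    dists, idx = cKDTree(sparse_xy).query(dense_xy, k=k)   # (N, k) neighbors
    w = np.exp(-dists / tau)                               # stand-in weights
    w /= w.sum(axis=1, keepdims=True)
    return (w[..., None] * sparse_disp[idx]).sum(axis=1)   # (N, 2) dense motion

rng = np.random.default_rng(0)
sparse_xy = rng.uniform(0, 640, (200, 2))
sparse_disp = np.tile([5.0, -2.0], (200, 1))   # all sparse tracks move the same way
ys, xs = np.mgrid[0:480:16, 0:640:16]
dense_xy = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
print(upsample_tracks(sparse_xy, sparse_disp, dense_xy).mean(axis=0))  # ~ [5, -2]
```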
Step B ā Track every frame + filter redundancy
- What happens: We apply the upsampler to each frame (not just the first) so new objects later are tracked. Then we discard dense pixels that lie too close to earlier tracks, and remove tiny isolated components.
- Why it exists: Capturing late arrivals and new surfaces is essential; filtering keeps compute practical.
- Example: If a door opens at frame 12, we keep the door's new interior region; we skip re-tracking the unchanged wall.
Concept 10 – Initial Pose Estimation (Clip-to-Global)
- What it is: Estimating the camera path in small chunks (clips), then stitching them globally.
- How it works:
- Use coarse dynamic masks to pick likely static tracks per clip.
- Lift those tracks with depth to 3D.
- Optimize camera poses so projections match observed 2D.
- Merge clip poses into one global trajectory.
- Why it matters: Solving globally all at once can be heavy; clips make it faster and more stable. (Sketched below.)
- Anchor: Like mapping a town by first mapping neighborhoods, then aligning the neighborhood maps.
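A minimal sketch of the stitching step, under the simplifying assumption that one frame is shared between consecutive clips: the shared frame fixes the rigid transform that maps the second clip's 4×4 camera-to-world poses into the first clip's frame.

```python
import numpy as np

def stitch(poses_a, poses_b, overlap_a, overlap_b):
    """Remap clip B's poses into clip A's world frame via one shared frame."""
    T = poses_a[overlap_a] @ np.linalg.inv(poses_b[overlap_b])  # B-world -> A-world
    return [T @ P for P in poses_b]

def pose(x):
    """Toy camera-to-world pose: a camera translated by x along the x-axis."""
    P = np.eye(4)
    P[0, 3] = x
    return P

clip_a = np.array([pose(0.0), pose(0.1), pose(0.2)])  # frames 0..2, clip A's frame
clip_b = np.array([pose(0.0), pose(0.1)])             # frames 2..3, clip B's frame
global_b = stitch(clip_a, clip_b, overlap_a=2, overlap_b=0)
print(global_b[1][0, 3])                              # frame 3 sits at x = 0.3
```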
Concept 11 – Bundle Adjustment (BA)
- What it is: Jointly refining camera poses and 3D points to minimize reprojection errors.
- How it works:
- Start with initial poses and 3D guesses.
- Project points back to each frame.
- Measure how far off they are from tracked 2D positions.
- Adjust both poses and 3D points to reduce the error across all frames.
- Why it matters: Without BA, small errors snowball into drift and shaky 3D. (A toy sketch follows below.)
- Anchor: Like retying a net so every knot lines up with its mark, keeping the whole mesh taut and even.
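Here is a toy sketch of bundle adjustment (the paper's optimizer is more elaborate): per-frame poses and the shared 3D points are refined jointly so every projection matches its tracked 2D observation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
F, N = 4, 30                                      # frames, points

def residuals(params, uv_obs):
    poses = params[:F * 6].reshape(F, 6)          # per-frame axis-angle + translation
    pts = params[F * 6:].reshape(N, 3)            # shared 3D points
    res = []
    for f in range(F):
        cam = Rotation.from_rotvec(poses[f, :3]).apply(pts) + poses[f, 3:]
        uv = (K @ cam.T).T
        res.append(uv[:, :2] / uv[:, 2:3] - uv_obs[f])
    return np.concatenate(res).ravel()

rng = np.random.default_rng(1)
pts_true = rng.uniform([-1, -1, 4], [1, 1, 8], (N, 3))
uv_obs = np.stack([(K @ (pts_true + [0.05 * f, 0, 0]).T).T for f in range(F)])
uv_obs = uv_obs[..., :2] / uv_obs[..., 2:3]       # toy cameras slide along x

x0 = np.concatenate([np.zeros(F * 6), (pts_true + rng.normal(0, 0.05, (N, 3))).ravel()])
fit = least_squares(residuals, x0, args=(uv_obs,))
print("mean reprojection error (px):", np.abs(fit.fun).mean().round(4))
```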
Concept 12 – As-Static-As-Possible (ASAP) Constraint
- What it is: A rule that prefers zero motion for background points unless the data strongly says they move.
- How it works:
- Give each "static" point a tiny motion allowance (offset).
- Penalize offsets so they stay near zero unless needed.
- Points that truly move will take non-zero offsets.
- Camera poses are then estimated without being fooled by those movers.
- Why it matters: Noisy masks often miss moving stuff in the background; ASAP prevents those from corrupting the pose. (See the sketch below.)
- Anchor: If a tree branch actually sways, the method allows it, but only if the video insists.
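A toy sketch of the idea, with an assumed L1 penalty on the offsets: observed pixel shifts are modeled as one shared camera shift plus a per-point offset that is pulled toward zero, so true movers absorb their motion in non-zero offsets instead of dragging the camera estimate.

```python
import torch

torch.manual_seed(0)
N = 200
obs = torch.zeros(N, 2) + torch.tensor([3.0, 1.0])  # camera shift moves every pixel
obs[:20] += torch.tensor([8.0, -5.0])               # 20 sneaky movers the mask missed

cam_shift = torch.zeros(2, requires_grad=True)
offsets = torch.zeros(N, 2, requires_grad=True)     # per-point motion allowance
opt = torch.optim.Adam([cam_shift, offsets], lr=0.1)

for _ in range(500):
    opt.zero_grad()
    data = ((cam_shift + offsets - obs) ** 2).sum(dim=1).mean()  # fit observed motion
    asap = offsets.abs().sum(dim=1).mean()                       # prefer zero offsets
    (data + 0.5 * asap).backward()
    opt.step()

print(cam_shift.detach())                  # ~ (3, 1): clean despite the movers
print(offsets[:20].detach().abs().mean())  # large offsets flag the true movers
```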
Concept 13 – Depth Consistency
- What it is: Keeping the 3D depth along tracks consistent with the monocular depth estimates.
- How it works:
- Compute depth from the current 3D point and pose.
- Compare to the predicted monocular depth at that pixel.
- Penalize large mismatches.
- Repeat to stabilize scale and geometry.
- Why it matters: Without this, the system might invent odd shapes that still reproject well. (Sketched below.)
- Anchor: If a lamp is predicted 2 m away, but the track says 5 m, something's off; consistency fixes that.
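A minimal sketch of such a term; comparing log-depths is an assumption made here to keep the penalty scale-robust:

```python
import torch

def depth_consistency(pts_cam, mono_depth):
    """pts_cam: (N, 3) track points in the camera frame; mono_depth: (N,)
    monocular depth sampled at each point's projected pixel."""
    z = pts_cam[:, 2].clamp(min=1e-6)       # depth implied by the 3D track
    return (torch.log(z) - torch.log(mono_depth)).abs().mean()

pts_cam = torch.tensor([[0.0, 0.0, 2.0], [0.5, 0.1, 5.0]], requires_grad=True)
mono = torch.tensor([2.1, 4.8])             # depth-map samples
loss = depth_consistency(pts_cam, mono)
loss.backward()                             # gradient pulls z toward the mono depth
print(loss.item())
```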
Step C ā Dynamic Object Tracking in 3D
- What happens: For dynamic regions (including background points flagged as moving by non-zero ASAP offsets), we directly optimize their time-varying 3D paths using camera poses and constraints.
- Why it exists: Moving things need their own evolving 3D paths in the world frame.
- Example: A cyclist rides across the scene; each of their tracked points gets a 3D path that flows smoothly through time.
Concept 14 – As-Rigid-As-Possible (ARAP)
- What it is: A gentle rule that discourages unrealistic stretching between neighboring dynamic points over time.
- How it works:
- For each point, find nearby points.
- Keep their relative distances similar between frames.
- Allow bending but resist rubbery distortions.
- Balance this with other data terms.
- Why it matters: Without ARAP, shapes can wobble or melt. (A sketch follows below.)
- Anchor: A bicycle frame can move, but it shouldn't stretch like taffy.
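A minimal sketch of one common ARAP-style penalty; the k-nearest-neighbor graph here is an assumption about how neighborhoods are built:

```python
import torch

def arap_loss(pts_t, pts_t1, nbr_idx):
    """pts_t, pts_t1: (N, 3) points at consecutive frames; nbr_idx: (N, k)."""
    d_t = (pts_t[:, None, :] - pts_t[nbr_idx]).norm(dim=-1)    # (N, k) distances
    d_t1 = (pts_t1[:, None, :] - pts_t1[nbr_idx]).norm(dim=-1)
    return ((d_t1 - d_t) ** 2).mean()                          # resist stretching

N, k = 50, 4
pts_t = torch.randn(N, 3)
# k nearest neighbors per point (index 0 is the point itself, so drop it).
nbr_idx = torch.cdist(pts_t, pts_t).topk(k + 1, largest=False).indices[:, 1:]
pts_t1 = (pts_t + torch.tensor([0.0, 0.0, 0.3])).requires_grad_()  # rigid shift
print(arap_loss(pts_t, pts_t1, nbr_idx).item())                    # ~0: no stretching
```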
Concept 15 – Temporal Smoothness
- What it is: A preference for gentle changes from one frame to the next.
- How it works:
- Penalize sudden jumps in a pointās 3D position.
- Encourage steady motion unless evidence shows a sharp move.
- Combine with other losses.
- Keep tracks human-plausible.
- Why it matters: Without it, small tracking noise becomes jittery 3D. (See the sketch below.)
- Anchor: A walking person's head path is smooth, not zigzagging left and right every frame.
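A minimal sketch of a smoothness term; penalizing the second difference (acceleration) of each track is one standard choice, assumed here:

```python
import torch

def smoothness_loss(traj):
    """traj: (T, N, 3) world-frame track positions over T frames."""
    accel = traj[2:] - 2 * traj[1:-1] + traj[:-2]   # second difference per track
    return (accel ** 2).sum(dim=-1).mean()

T, N = 10, 5
t = torch.linspace(0, 1, T)[:, None, None]
smooth = (t * torch.tensor([1.0, 0.0, 0.0])).expand(T, N, 3)  # constant velocity
jittery = smooth + 0.05 * torch.randn(T, N, 3)
print(smoothness_loss(smooth).item())    # ~0 for steady motion
print(smoothness_loss(jittery).item())   # larger for jittery tracks
```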
Outputs: World-frame camera poses; dense static tracks (with tiny ASAP offsets if needed); dynamic tracks for everything that moved.
Secret sauce:
- Track every frame (catch newcomers), but filter overlaps (stay efficient).
- Estimate camera poses while protecting them from sneaky moving background with ASAP.
- Jointly optimize with depth consistency, BA, ARAP, and temporal smoothness for stable, realistic 3D.
04 Experiments & Results
The test: The authors checked four things: (1) camera pose accuracy, (2) depth along dense tracks, (3) sparse 3D tracking quality, and (4) dense 2D tracking quality. They used standard datasets: Sintel (synthetic but challenging), Bonn and TUM-D (real-world RGB-D), plus ADT and Panoptic Studio for sparse 3D tracking, and CVO for long-range optical flow (2D).
Competition: They compared with strong baselines, including classical and modern methods: DROID-SLAM, DPVO, COLMAP, Robust-CVD, DUSt3R, MonST3R, Align3R, and Uni4D for camera/depth/pose; SpatialTracker, DELTA, OmniTrackFast for tracking; RAFT and CoTracker for 2D flow.
Scoreboard with context:
- Camera pose (ATE↓, RTE↓, RRE↓): On Sintel, TrackingWorld reaches ATE ≈ 0.088 (DELTA-init) and ≈ 0.103 (CoTracker-init). Think of this as straight-A performance when others are mostly at A or B+; it's the most stable and accurate camera path across datasets.
- Dense 3D track depth (Abs Rel↓, δ<1.25↑): With UniDepth priors, TrackingWorld cuts depth errors dramatically (e.g., Sintel Abs Rel ≈ 0.218 vs. DELTA's ≈ 0.636 with UniDepth), which is like shrinking the ruler error from a foot to just a few inches.
- Sparse 3D tracking (AJ↑, APD3D↑, OA↑): On moving-camera scenes (ADT), explicitly separating camera and object motion boosts both geometric accuracy and track quality; on static-camera scenes (PStudio), gains are smaller but still solid.
- Dense 2D tracking (EPE↓, IoU↑ on CVO): The upsampler plugged into CoTrackerV3 matches or beats dense baselines while being about 12× faster, showing that upsampling is both accurate and efficient.
Surprising findings:
- The 2D upsampler generalizes: It doesn't just work for DELTA; it also boosts CoTrackerV3, with better accuracy and big runtime savings.
- ASAP is crucial: Treating the background as "as-static-as-possible" filtered out sneaky movers (like swaying leaves), which made camera pose estimation clearly better.
- Robust to different depth and mask tools: Swapping depth backbones (ZoeDepth, Depth Pro, UniDepth) or mask generators still yields strong results, so the pipeline isn't fragile.
Takeaway: Across pose estimation, depth along tracks, sparse 3D, and dense 2D, TrackingWorld lands at or near the top. Its biggest wins come from (1) tracking all frames to catch new content, (2) disentangling camera/object motion in a world frame, and (3) optimizing everything together with geometric constraints.
05 Discussion & Limitations
Limitations:
- Dependence on auxiliary models: It needs a tracker, depth estimator, and motion masks. Poor inputs can hurt results, and running them adds compute time.
- Heavy optimization: While made efficient (clip-to-global, filtering), it's still more expensive than some feed-forward models.
- Extreme dynamics: Very fast, blurry motion or heavy occlusion can still cause drift or missing tracks.
- Thin or reflective objects: Depth priors can struggle, which can ripple into 3D track errors.
Required resources:
- A GPU (e.g., the RTX 4090 used in the paper) for reasonable turnaround (about 20 minutes for 30 frames).
- Pretrained models (tracker, depth, masks) and video frames.
- Enough memory to hold dense tracks and optimization variables, though redundancy filtering helps a lot.
When NOT to use:
- Real-time constraints: If you need instant results on-device, a fully feed-forward approach might be preferable.
- Perfectly static scenes with strong multi-view data: Classical SfM/SLAM might be simpler and sufficient.
- Videos with no reliable depth cues (e.g., textureless walls in dim light): Monocular depth priors may be too uncertain.
Open questions:
- Can we merge tracking, depth, masks, and pose into one unified, feed-forward model without losing accuracy?
- How do we best handle extreme motion blur or rolling shutter effects?
- Can learned priors replace parts of the optimization while keeping world-centric accuracy?
- How to robustly scale to hundreds or thousands of frames without any global drift?
- Can we better reason about visibility and occlusion to further improve long-term consistency?
06 Conclusion & Future Work
Three-sentence summary: TrackingWorld is a system that tracks almost every pixel in a single-camera video, figures out the camera's path, and lifts those pixel paths into a shared world-coordinate 3D space. It does this by upsampling sparse tracks to dense ones in every frame, filtering duplicates, and then performing an optimization that separates camera motion from object motion using an "as-static-as-possible" constraint. The result is accurate camera poses and dense, world-centric 3D trajectories that outperform prior methods across several benchmarks.
Main achievement: Cleanly disentangling camera motion from object motion while achieving dense, world-centric 3D tracking for nearly all pixels, even for objects that appear in later frames.
Future directions: Build a unified, feed-forward model covering tracking, depth, masks, and pose; improve robustness under severe blur and occlusion; scale to very long videos with real-time or near-real-time performance; tighten occlusion/visibility modeling; and enhance temporal depth consistency.
Why remember this: It turns messy 2D motion into clean 3D world motion at pixel scale, unlocking rock-solid editing, AR anchoring, robotics navigation, and scientific motion analysis from everyday videos. By tracking every frame and guarding camera estimation from background movers, it makes world-centric 3D tracking practical and reliable.
Practical Applications
- AR stickers and graphics that stay perfectly glued to people and objects as they move.
- Video editing and VFX with rock-solid motion anchoring and camera matchmoving.
- Sports analytics that extract accurate 3D player and ball trajectories from broadcast video.
- Robotics navigation that separates camera motion from object motion for safer path planning.
- Cinematic stabilization and re-framing using accurate world-centric camera paths.
- 3D scene understanding for autonomous drones using only onboard monocular cameras.
- Medical or lab video analysis that tracks subtle 3D motions of instruments or specimens.
- Education and science demos showing true 3D motion from simple classroom videos.
- Cultural heritage digitization where hand-shot videos are turned into motion-aware 3D.
- Pre-visualization for film and game production with quick world-centric motion drafts.