
3AM: 3egment Anything with Geometric Consistency in Videos

Intermediate
Yang-Che Sun, Cheng Sun, Chin-Yang Lin et al. · 1/13/2026
arXiv · PDF

Key Summary

  • 3AM is a new way to track and segment the same object across a whole video, even when the camera view changes a lot.
  • It mixes two kinds of clues: how things look (2D appearance from SAM2) and where they are in space (3D-like geometry from MUSt3R).
  • A small Feature Merger learns to blend multi-level MUSt3R features with SAM2 features so the model recognizes objects by both looks and location.
  • A field-of-view–aware training sampler only pairs frames that actually see overlapping parts of the object in 3D, making the learning signal clear instead of confusing.
  • At test time, 3AM needs only normal RGB video and a simple user prompt (mask, point, or box); it does not need camera poses, depth, or 3D preprocessing.
  • On tough datasets with big viewpoint changes (ScanNet++ and Replica), 3AM beats strong SAM2-based methods by large margins.
  • 3AM’s geometry-aware tracking also lifts nicely to 3D instance segmentation without heavy 3D merging, showing strong online performance.
  • It stays promptable like SAM2, so users can interactively pick what to track in videos or casual photo collections.
  • Ablations show 3AM works well with default memory selection and especially benefits from MUSt3R as the 3D backbone.
  • Bottom line: adding geometric consistency to 2D tracking brings big, practical gains without adding burdens at inference.

Why This Research Matters

3AM makes video object tracking much more reliable when the camera moves a lot, which is common in phones, drones, AR headsets, and robots. It stays simple at test time—needing only normal RGB video and a quick user prompt—so it fits real apps without special sensors or costly preprocessing. Editors can keep masks steady during dramatic camera moves; AR stickers can cling to the correct item as you walk around; and robots can re-find the same tool from any side. It also lets casual photo collections act like mini multi-view videos, enabling consistent cross-photo selection and tracking. Finally, because its 2D tracks are geometry-aware, you can lift them into 3D more robustly without heavy 3D pipelines, saving both time and compute.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how you can recognize your backpack whether you see it from the front, the side, or the top? Your brain uses both how it looks and where it is in space.

🥬 The Concept (Video Object Segmentation – VOS): What it is: VOS is teaching a computer to find and outline the same object in every frame of a video. How it works: (1) You pick an object with a point, box, or mask. (2) The model remembers its appearance. (3) It searches frame-by-frame to draw the object’s outline (mask). Why it matters: Without VOS, editing videos, guiding robots, or AR effects would often fail when the object moves or the camera turns.

🍞 Anchor: Like highlighting your dog in each frame of a backyard video so an AR hat always sticks to the right head.

🍞 Hook: Imagine building a LEGO room. To find the sofa in any photo, it helps to know where it sits in 3D, not just its color.

🥬 The Concept (3D Instance Segmentation): What it is: It separates different objects in a 3D scene, not just in flat pictures. How it works: (1) Gather views of a scene. (2) Reconstruct or reason about 3D. (3) Assign each 3D point to an object. Why it matters: Without 3D awareness, the same chair from two angles can look so different that a 2D tracker gets confused.

🍞 Anchor: In a virtual house tour, 3D instance segmentation keeps the same chair labeled as “chair” no matter where the camera stands.

The world before: Recent VOS systems like SAM2 are fast, promptable, and powerful at following objects. They store memories of the object’s features and look back into those memories when seeing new frames. That works great when the camera view is similar from frame to frame. But when the camera walks around a room (wide-baseline motion), appearance changes drastically: the same lamp from the back looks nothing like from the front. Purely 2D appearance signals struggle, causing drifts (mask slides off), identity switches (wrong object), or lost tracks.

🍞 Hook: Think of a school play filmed from many seats. If you only match costumes by color from one seat’s view, you might lose your friend when they turn or move behind someone.

🥬 The Concept (Cross-View Consistency): What it is: Keeping an object’s identity stable across very different viewpoints. How it works: (1) Learn signals that tie pixels to the same 3D place. (2) Prefer matches that agree in space, not only in color/texture. (3) Carry that agreement across time. Why it matters: Without it, the tracker treats “left side of couch” and “right side of couch” as different unknowns and gets lost.

🍞 Anchor: A character in a movie wears the same costume in all shots; cross-view consistency is the continuity checker making sure it really is the same person.

People tried two paths and both had gaps: (1) Strong 2D VOS with clever memories (SAM2Long, DAM4SAM) helps, but still leans on appearance, so big viewpoint jumps remain hard. (2) 3D pipelines (e.g., lifting 2D masks into point clouds and merging) can stay consistent, but they demand camera poses, depth maps, SfM, and heavy preprocessing; errors in reconstruction can cascade, and it’s not great for streaming or casual photo sets.

🍞 Hook: It’s like choosing between a light backpack with no map (quick but you get lost) and a heavy suitcase with a full atlas (accurate but slow and clunky).

🥬 The Concept (Geometry-Aware Tracking): What it is: Tracking that uses hints of 3D shape and position to keep identity steady across views. How it works: (1) Learn features that agree when two frames see the same 3D spot. (2) Fuse them with normal 2D appearance cues. (3) Use memory to connect frames over time. Why it matters: Without geometry-aware cues, tracking breaks when the look changes; with them, the tracker says, “This patch is the same place in space,” even if the lighting or angle changes.

🍞 Anchor: A soccer camera can still follow the ball after a long pass because it predicts where the ball is going in space, not just because it looks round.

The missing piece (the gap): Could we make a SAM2-style, promptable tracker learn geometric consistency during training, then run at test time with only RGB (no camera poses, no depth), staying light and interactive? That’s exactly what 3AM answers.

Real stakes: This matters for AR try-ons that must stick to the same item as you turn around, for home robots that must recognize objects from any side, for video editors who need stable masks when the camera circles a subject, and for casual multi-view photos where you want the same object tracked across your album without setting up a 3D scanner.

In short, 3AM trains a 2D tracker to think a bit like 3D—enough to keep the identity right—while keeping test-time simple and fast.

02Core Idea

Aha! The key idea in one sentence: Teach a 2D video segmenter (SAM2) to borrow 3D-savvy features (from MUSt3R) during training and fuse them with a small Feature Merger so, at test time, it reliably tracks the same object across big viewpoint changes using only RGB and a user prompt.

Three analogies:

  1. Two kinds of glasses: One pair shows sharp colors and textures (2D appearance), the other shows depth and shape (3D hints). 3AM wears both to see the same object clearly from any seat in the theater.
  2. Treasure map + photo: A photo tells you what the treasure looks like; a map tells you where it is. 3AM uses both so you don’t lose the treasure when lighting or angle changes.
  3. Choir + conductor: SAM2’s appearance voices sing melody; MUSt3R’s geometry voices sing harmony. The Feature Merger is the conductor blending them into a stable song that stays on key from any microphone angle.

Before vs. after:

  • Before: Strong, promptable 2D VOS often forgot who’s who when the camera moved a lot; 3D pipelines were accurate but heavy, slow, and needed poses/depth.
  • After: 3AM keeps SAM2’s speed and promptability but learns to stay consistent via geometry-aware features from MUSt3R; no 3D inputs are needed at inference.

Why it works (intuition, no equations): MUSt3R is trained for multi-view consistency, so its features light up similarly when two different frames see the same physical point. SAM2’s features are great at fine object boundaries and semantics. The Feature Merger first gathers MUSt3R signals from multiple depths (shallow = more semantic, deep = more geometric), aligns them with attention, then mixes them with SAM2’s high-resolution map. During training, a field-of-view–aware sampler only pairs frames that truly overlap on the object in 3D, avoiding the “left side vs. right side” confusion for large objects. The model learns, “Match in space first, then refine by appearance,” which stays robust when views change.

Building blocks, explained with the sandwich pattern when they appear:

🍞 Hook: Imagine using SAM2 like a smart coloring tool that can stay inside the lines as you move from page to page. 🥬 The Concept (SAM2 Framework): What it is: A promptable image/video segmenter with a memory that helps track objects over time. How it works: (1) Encode each frame to get 2D appearance features. (2) Use a memory bank to retrieve past object information. (3) Decode a mask guided by the user’s prompt (point/box/mask). Why it matters: Without SAM2’s strong backbone and promptability, interactive tracking would be clunky and less accurate. 🍞 Anchor: You click on a cat once, and SAM2 keeps outlining the same cat as it walks across the room.
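To make “promptable” concrete, here is a minimal sketch of how a SAM2-style tracker is driven: prompt once, then propagate. The `PromptableVideoTracker` class and its method names are hypothetical stand-ins for a SAM2-style promptable video segmenter, not the actual sam2 package API.

```python
# Illustrative only: the class and its methods are hypothetical placeholders.
import numpy as np

class PromptableVideoTracker:
    """Prompt once on one frame, then propagate the mask through the video
    using a memory bank of past (frame, mask) pairs."""

    def __init__(self):
        self.memory = []  # past (frame, mask) entries

    def add_point_prompt(self, frame: np.ndarray, xy: tuple) -> np.ndarray:
        mask = self._segment(frame, prompt=xy)            # first mask from the click
        self.memory.append((frame, mask))                 # seed the memory bank
        return mask

    def propagate(self, frame: np.ndarray) -> np.ndarray:
        mask = self._segment(frame, memory=self.memory)   # match against the remembered object
        self.memory.append((frame, mask))                 # keep the memory current
        return mask

    def _segment(self, frame, prompt=None, memory=None):
        raise NotImplementedError("backed by a real encoder/decoder in practice")

# Usage: click the target once in frame 0, then let the tracker follow it.
# tracker = PromptableVideoTracker()
# mask0 = tracker.add_point_prompt(frames[0], xy=(412, 305))
# masks = [tracker.propagate(f) for f in frames[1:]]
```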

🍞 Hook: Think of MUSt3R as a friend who’s great at noticing when two photos are looking at the same corner of a room. 🥬 The Concept (MUSt3R Features): What it is: Features learned from many views that become similar when two images see the same 3D spot. How it works: (1) MUSt3R looks across multiple frames. (2) It learns correspondences tied to physical geometry. (3) Deeper layers become more 3D-structure-aware. Why it matters: Without MUSt3R, the model can’t easily tell that two different-looking patches are actually the same place in space. 🍞 Anchor: Two vacation photos from different angles still match the same painting on the wall because MUSt3R recognizes the shared 3D point.
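A quick way to see what “features that become similar” means in code: compare one location’s descriptor in view A against every location in view B with cosine similarity. The tensor shapes and the random inputs in the usage comment are assumptions for illustration; the paper does not prescribe this exact probe.

```python
import torch
import torch.nn.functional as F

def cross_view_similarity(feats_a: torch.Tensor,
                          feats_b: torch.Tensor,
                          query_yx: tuple) -> torch.Tensor:
    """feats_a, feats_b: (C, H, W) feature maps from two views of the same scene.
    Returns an (H, W) heatmap of cosine similarity between the query location in
    view A and every location in view B. Geometry-aware features should peak
    where view B sees the same physical 3D point."""
    c, h, w = feats_b.shape
    q = F.normalize(feats_a[:, query_yx[0], query_yx[1]], dim=0)   # (C,) query descriptor
    k = F.normalize(feats_b.reshape(c, -1), dim=0)                 # (C, H*W) all locations in view B
    return (q @ k).reshape(h, w)

# Example with random tensors standing in for real backbone outputs:
# heat = cross_view_similarity(torch.randn(256, 64, 64), torch.randn(256, 64, 64), (30, 40))
```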

🍞 Hook: Mixing frosting flavors gives you a better cake; mixing feature types gives you a better tracker. 🥬 The Concept (Feature Merger): What it is: A small module that fuses multi-level MUSt3R features with SAM2 features. How it works: (1) Start with a shallow MUSt3R layer (more semantic). (2) Add deeper MUSt3R layers one by one using cross-attention (more geometric). (3) Convolve with SAM2’s 2D map to restore sharp spatial detail. Why it matters: Without the Merger, you either keep detail but lose 3D consistency, or gain 3D hints but lose fine boundaries and semantics. 🍞 Anchor: The Merger lets the model say, “This is the same mug because it’s in the same place in space—and yes, its rim lines up crisply.”
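The article describes the Merger only at a high level, so the module below is a rough PyTorch sketch of that recipe (self-attention on the shallow MUSt3R layer, cross-attention to fold in deeper layers, then convolutional fusion with SAM2’s map). Channel sizes, head counts, and the single fusion convolution are assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn

class FeatureMergerSketch(nn.Module):
    """Rough sketch of the Feature Merger; sizes and attention settings are guesses."""

    def __init__(self, dim: int = 256, num_heads: int = 8, num_deep_layers: int = 2):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_deep_layers)]
        )
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1)

    def forward(self, must3r_layers, sam2_map):
        """must3r_layers: list of (B, H*W, C) token maps, shallow first.
        sam2_map: (B, C, H, W) SAM2 feature map at the same resolution."""
        x = must3r_layers[0]
        x, _ = self.self_attn(x, x, x)                        # stabilize the shallow (semantic) layer
        for attn, deep in zip(self.cross_attns, must3r_layers[1:]):
            x, _ = attn(x, deep, deep)                        # fold in deeper (geometric) structure
        b, c, h, w = sam2_map.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)             # tokens back to a 2D map
        return self.fuse(torch.cat([x, sam2_map], dim=1))     # restore fine spatial detail

# merged = FeatureMergerSketch()([torch.randn(1, 64 * 64, 256) for _ in range(3)],
#                                torch.randn(1, 256, 64, 64))
```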

🍞 Hook: When taking two photos of a long couch, you need both shots to overlap the same cushion to compare them fairly. 🥬 The Concept (Field-of-View–Aware Sampling): What it is: A training rule that only pairs frames that actually see overlapping 3D parts of the object. How it works: (1) Pick a reference frame. (2) Back-project candidate masks to 3D and reproject into the reference. (3) Keep only candidates whose points fall inside the reference view enough of the time. Why it matters: Without this, the model is told that far-apart parts (left arm vs. right arm) are “the same,” which confuses learning. 🍞 Anchor: Comparing the same couch cushion (overlap) teaches good matching; comparing left arm to right arm (no overlap) teaches the wrong lesson.
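A simplified version of the overlap test could look like the sketch below, assuming the training dataset supplies depth maps and camera matrices (as ScanNet++ does). The helper name, the projection bookkeeping, and the 0.5 threshold in the usage comment are illustrative, not the paper’s exact values.

```python
import numpy as np

def visible_fraction(mask, depth, K, cam_to_world, world_to_ref_cam, K_ref, ref_hw):
    """Back-project a candidate frame's object mask to 3D, reproject it into the
    reference camera, and return the fraction of points that land inside the
    reference image. A training pair is kept only if this fraction is high enough."""
    ys, xs = np.nonzero(mask)
    z = depth[ys, xs]
    valid = z > 1e-3
    ys, xs, z = ys[valid], xs[valid], z[valid]
    if len(z) == 0:
        return 0.0
    pts_cam = np.linalg.inv(K) @ np.stack([xs * z, ys * z, z])            # pixels -> candidate camera
    pts_world = cam_to_world[:3, :3] @ pts_cam + cam_to_world[:3, 3:4]    # camera -> world
    pts_ref = world_to_ref_cam[:3, :3] @ pts_world + world_to_ref_cam[:3, 3:4]
    in_front = pts_ref[2] > 1e-3                                          # points behind the camera are never visible
    proj = K_ref @ pts_ref[:, in_front]
    u, v = proj[0] / proj[2], proj[1] / proj[2]
    h, w = ref_hw
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return inside.sum() / len(z)

# keep_pair = visible_fraction(mask, depth, K, T_cam2world, T_world2ref, K_ref, (480, 640)) > 0.5
```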

Put together, 3AM is promptable like SAM2, geometry-aware like a 3D system, but light at test time—no poses, no depth—because it learned the 3D-like consistency during training.

03Methodology

High-level recipe: Input (RGB frames + user prompt) → extract SAM2 appearance features → extract MUSt3R multi-level geometry-aware features → Feature Merger fuses them → memory attention carries identity over time → mask decoder outputs segmentation → update memory for the next frames.
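As a rough per-frame sketch of that recipe (the `sam2`, `must3r`, `merger`, and `memory` objects below are placeholders standing in for the components described next, not real APIs):

```python
def track_video(frames, prompt, sam2, must3r, merger, memory):
    """Illustrative 3AM-style loop following the recipe above. The component
    objects are placeholders; the real system reuses SAM2's architecture with
    the Feature Merger inserted before memory attention."""
    masks = []
    for t, frame in enumerate(frames):
        app_feat = sam2.encode_image(frame)              # 2D appearance features
        geo_feats = must3r.encode_multilevel(frame)      # multi-level geometry-aware features
        fused = merger(geo_feats, app_feat)              # Feature Merger output
        cond = memory.attend(fused)                      # condition on past object memories
        mask = sam2.decode_mask(cond, prompt if t == 0 else None)
        memory.update(features=fused, mask=mask)         # store the result for future frames
        masks.append(mask)
    return masks
```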

Step-by-step with purpose, what breaks without it, and a concrete feel:

  1. Input and prompting
  • What happens: A user selects an object in a frame with a point, box, or mask. The video frames stream in.
  • Why this step exists: Prompting tells the model which object matters in a busy scene.
  • What breaks without it: The model might track the wrong thing (e.g., the other red backpack).
  • Example: You click on the blue mug in frame 10; that becomes the target across the sequence.
  2. SAM2 feature extraction (2D appearance)
  • What happens: SAM2’s image encoder produces a high-resolution 2D feature map capturing textures, colors, and boundaries.
  • Why: Sharp masks depend on rich 2D details; these features power the promptable mask decoder.
  • What breaks without it: Masks get blobby and lose fine edges (e.g., mug handle holes).
  • Example: The shiny mug rim and handle silhouette are cleanly represented in these features.
  3. MUSt3R feature extraction (geometry-aware multi-view signals)
  • What happens: In training, frames are fed to MUSt3R, which builds a multi-view memory internally and outputs features that correlate when two frames see the same 3D points. We tap multiple layers—shallow (semantic) and deeper (geometric).
  • Why: Viewpoint-robust correspondence needs geometry; MUSt3R learned it from multi-view consistency.
  • What breaks without it: From the back view, the mug looks different; pure 2D similarity may match a different object instead.
  • Example: A point on the mug’s logo corresponds to the same 3D spot even when the camera walks around; MUSt3R features keep that match stable.
  4. Feature Merger (cross-attention + convolutional refinement)
  • What happens: The Merger starts with the shallow MUSt3R layer (semantically rich), runs self-attention to stabilize it, then progressively integrates deeper MUSt3R layers with cross-attention (adding geometric structure). Finally, it fuses with SAM2’s 2D map via convolutions to restore detail.
  • Why: Different MUSt3R layers carry complementary strengths; cross-attention aligns them; convolution with SAM2 recovers crisp spatial layout.
  • What breaks without it: Using only shallow MUSt3R misses geometry; using only deep MUSt3R loses semantics; skipping the final 2D fusion blurs edges.
  • Example: A cosine-similarity heatmap from vanilla SAM2 misfires after a big angle change; the merged feature zeros in on the correct mug spot across views.
  5. Memory attention (temporal linking)
  • What happens: The merged feature queries past memories (key frames and masks) to keep the same identity over time.
  • Why: Videos have occlusions, reappearances, and distractors; memory reduces identity switches.
  • What breaks without it: After the mug goes behind a book and comes back, the tracker may jump to a different mug.
  • Example: The model re-identifies the same blue mug after a full camera pan because memory plus geometry-aware features agree.
  6. Mask decoding and memory update
  • What happens: The mask decoder uses the prompt token and the merged features to output the segmentation. The encoded result is stored into the memory bank for future frames.
  • Why: Each new correct mask strengthens future tracking.
  • What breaks without it: The model can’t accumulate trustworthy context; errors repeat.
  • Example: The decoder picks the best of several mask candidates using a predicted IoU score, then saves it for the next steps.
  7. Training-time field-of-view–aware sampling (FoV-aware)
  • What happens: To teach proper cross-view matching, we choose pairs of frames that actually overlap on the object in 3D. We back-project candidate masks using depth/poses (on datasets that have them), reproject into the reference, and only keep frames with enough overlap. We also mix in normal continuous sampling to preserve short-range matching skills.
  • Why: If you pair far-apart object parts (e.g., couch left arm vs. right arm), the model gets a contradictory lesson.
  • What breaks without it: The model is over-regularized or confused, degrading both cross-view and within-view matching.
  • Example: For ScanNet++, about 80% of sampled training pairs use FoV-aware filtering with a threshold to ensure shared 3D coverage; MOSE (without geometry) keeps continuous sampling.
  8. Training details (light touch)
  • What happens: Only a few modules are trained (the memory attention, the mask decoder, and the Feature Merger); the rest follows SAM2 defaults (e.g., 8 memory slots). Losses are standard (focal + dice for masks, etc.; a minimal loss sketch follows this list). MUSt3R features are precomputed for some datasets to speed up sampling.
  • Why: This keeps training efficient and stable, preserving SAM2’s strengths while injecting geometry awareness.
  • What breaks without it: Training everything from scratch would be slow and prone to overfitting; not training the Merger would block fusion benefits.
  • Example: Learning rates are small (e.g., 5e-6 for memory/decoder, 1e-5 for Merger) over 1M steps.
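For reference, here is a minimal sketch of the standard focal and dice mask losses mentioned in step 8. The focal hyperparameters and the equal weighting of the two terms in the usage comment are assumptions, not values reported by the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss on per-pixel mask logits (standard formulation)."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)          # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)    # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps=1.0):
    """Soft Dice loss: penalizes poor overlap between predicted and true masks."""
    prob = torch.sigmoid(logits).flatten(1)
    targets = targets.flatten(1)
    inter = (prob * targets).sum(-1)
    union = prob.sum(-1) + targets.sum(-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

# total = focal_loss(pred_logits, gt_masks.float()) + dice_loss(pred_logits, gt_masks.float())
```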

Secret sauce (what makes 3AM clever):

  • It does not require camera poses or depth at test time; all 3D-like consistency is learned during training.
  • The Feature Merger knows which MUSt3R layers to trust for semantics vs. geometry and blends them with SAM2’s detailed 2D features.
  • The FoV-aware sampler turns multi-view geometry into a clean teacher: only compare apples to apples (overlapping object parts), so the model learns the right rule—match in space, refine by appearance.

04Experiments & Results

🍞 Hook: Think of a school race where runners start in different lanes and face hurdles; we want our runner to win not just by being fast, but by staying on track when the hurdles get tricky.

🥬 The Concept (IoU family of metrics): What it is: Measures of how well the predicted mask overlaps the ground-truth mask. How it works: (1) IoU: overlap across all frames, including when the object is absent. (2) Positive IoU: only when the object is present. (3) Successful IoU: only when the prediction overlaps at all (non-zero IoU) on visible frames. Why it matters: Without these, scores don’t tell us if we’re accurate when it counts (visible frames) or robust when objects disappear and return.

🍞 Anchor: It’s like grading a drawing contest: overall neatness (IoU), accuracy when the object is there (Positive IoU), and score when you at least found the object (Successful IoU).
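To make the three definitions concrete, here is one way they could be computed from per-frame boolean masks. The convention for frames where both masks are empty and the simple averaging are assumptions; the benchmark’s official evaluation script may differ.

```python
import numpy as np

def frame_iou(pred, gt):
    """IoU of two boolean masks for one frame; counted as 1.0 when both are empty
    (assumed convention for frames where the object is absent and not predicted)."""
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def iou_family(preds, gts):
    """preds, gts: lists of per-frame boolean masks.
    Returns (IoU over all frames,
             Positive IoU over frames where the object is visible,
             Successful IoU over visible frames with any overlap at all)."""
    per_frame = [frame_iou(p, g) for p, g in zip(preds, gts)]
    visible = [s for s, g in zip(per_frame, gts) if g.any()]
    successful = [s for s in visible if s > 0]
    mean = lambda xs: float(np.mean(xs)) if xs else 0.0
    return mean(per_frame), mean(visible), mean(successful)
```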

Test setup and competitors: The authors test on ScanNet++ and Replica—datasets with big viewpoint changes—because typical VOS sets often have steadier cameras. Baselines include SAM2 (strong promptable VOS), SAM2Long (memory trees for long videos), and DAM4SAM (distractor-aware memory). 3AM keeps SAM2’s default memory selection in reported results to isolate the benefit of geometry-aware fusion.

Main scoreboard (ScanNet++):

  • Whole Set: 3AM gets IoU 0.8898, Positive IoU 0.5630, Successful IoU 0.7155—beating SAM2 (IoU 0.4392) and improving over SAM2Long/DAM4SAM (~0.82 IoU) by a notable margin.
  • Selected Subset (focused on reappearance under large viewpoint change): 3AM hits IoU 0.9061 and Positive IoU 0.7168, outperforming SAM2Long (0.7474 / 0.4133) and DAM4SAM (0.7648 / 0.4356). Successful IoU is best at 0.7737, showing strong identity preservation across disappear–reappear events.
  • An interesting note: Finetuning SAM2 alone performed worse here, likely because, without MUSt3R grounding, memory attention learned confusing cues from the geometry-heavy training mix.

Two-view matching comparison: Even when forced into a two-view protocol (no multi-frame memory) against SegMASt3R, 3AM scores higher (IoU 0.8915 vs. 0.6800; Positive IoU 0.5115 vs. 0.3628; Successful IoU 0.6405 vs. 0.4053). This shows the fused features themselves are strong at wide-baseline matching.

Replica results: 3AM again leads with IoU 0.8119, Positive IoU 0.6381, and Successful IoU 0.6793, surpassing SAM2Long and DAM4SAM. The biggest jumps are in Positive IoU and Successful IoU, highlighting better accuracy when the object is visible and better localization when it’s found.

3D instance segmentation (class-agnostic) via projection: By projecting 3AM’s 2D tracks into 3D and doing lightweight merging, 3AM achieves the best online AP (47.3) among listed online methods on ScanNet200 and strong AP scores at common thresholds (59.7 and 75.3), without relying on 3D ground-truth merging. Message: robust 3D instances can emerge from geometry-aware 2D tracking.
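To make “projecting 2D tracks into 3D” concrete, here is a minimal unprojection sketch. It assumes depth maps and camera poses are available for the benchmark scenes (ScanNet provides them) and does not reproduce the paper’s lightweight merging step.

```python
import numpy as np

def lift_mask_to_points(mask, depth, K, cam_to_world, obj_id):
    """Unproject one frame's object mask into labeled world-space 3D points.
    Concatenating the outputs over a whole track gives a simple per-object point
    cloud; merging across objects and frames is left out of this sketch."""
    ys, xs = np.nonzero(mask)
    z = depth[ys, xs]
    keep = z > 0
    ys, xs, z = ys[keep], xs[keep], z[keep]
    pts_cam = np.linalg.inv(K) @ np.stack([xs * z, ys * z, z])            # (3, N) camera coordinates
    pts_world = cam_to_world[:3, :3] @ pts_cam + cam_to_world[:3, 3:4]    # (3, N) world coordinates
    labels = np.full(pts_world.shape[1], obj_id)
    return pts_world.T, labels                                            # (N, 3) points, (N,) instance ids
```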

Ablations and insights:

  • Memory selection: Even with plain SAM2 memory selection, 3AM is strong (IoU 0.8898). Adding DAM4SAM or SAM2Long-style selection yields small gains (e.g., IoU ≈ 0.9004 with SAM2Long-3AM), suggesting the big boost comes from geometry-aware fusion, not just memory policy.
  • 3D backbone choice: MUSt3R stands out in online settings for stable object alignment across viewpoints. CUT3R, despite being online, offers weaker instance-level alignment (Positive IoU 0.2751 on the tough subset), while offline backbones like VGGT or π don’t suit streaming. MUSt3R’s consistency is crucial for reliable 2D cross-view matching.

Surprising findings:

  • Pure SAM2 finetuning on this training mix can degrade performance, underscoring that appearance-only memory attention needs geometric anchors to generalize under wide baselines.
  • When constrained to two-view matching, 3AM still shines, indicating the fused features themselves encode robust correspondence, not just the temporal memory.

Takeaway: On scenes with wide-angle moves and reappearances, 3AM turns painful failure cases for 2D trackers into solid successes while staying promptable and light at inference.

05Discussion & Limitations

Limitations:

  • Training-time dependencies: To build geometry awareness, 3AM trains with MUSt3R and uses camera poses/depth for field-of-view filtering on datasets that have them. While inference is RGB-only, assembling the training mix and precomputing MUSt3R features take effort and storage.
  • Domain shifts and dynamics: Extremely dynamic scenes where object geometry changes rapidly (deformation) or where overlap with the reference is minimal may still challenge correspondence.
  • Sparse or tiny objects: Very small, low-texture, or barely visible objects could be hard to align across large viewpoint changes, even with geometry-aware cues.
  • Memory budget: With 8 memory slots (default), extremely long videos may benefit from more advanced memory selection to maximize gains.

Required resources:

  • A GPU setup capable of running SAM2 and MUSt3R during training; storage for precomputed MUSt3R features on large datasets; typical training time for ~1M iterations with small learning rates.
  • At inference: only RGB frames and prompts; compute roughly similar to SAM2 with a lightweight fusion overhead.

When not to use:

  • If you already require accurate online 3D reconstructions (with reliable poses/depth), a full 3D pipeline might be preferable for certain 3D tasks.
  • If viewpoint changes are tiny and speed is the only priority, plain SAM2 may be “good enough.”
  • For purely deformable, fast-changing targets (e.g., cloth flapping with little 3D overlap), geometry-based consistency may offer limited extra help.

Open questions:

  • Can we learn similar geometry-aware cues without MUSt3R, e.g., via self-supervised geometric constraints in-the-loop?
  • How to design memory selection specifically for 3AM’s fused features to push long-range identity further?
  • Can we extend the approach to moving cameras and moving objects with learned dynamic geometry models while staying online?
  • How far can we scale promptability (e.g., open-vocabulary, multi-object) while retaining robust cross-view consistency?

Overall, 3AM demonstrates that weaving geometry-aware signals into a promptable 2D tracker produces large, practical wins—while leaving room for even smarter memory and backbone choices in the future.

06Conclusion & Future Work

Three-sentence summary: 3AM teaches a strong 2D video segmenter (SAM2) to think with 3D-like consistency by fusing MUSt3R’s multi-view features through a small Feature Merger and training with a field-of-view–aware sampler. This makes object identity stick across big viewpoint changes while keeping inference simple—just RGB and a prompt, no poses or depth. Experiments on wide-baseline datasets show big gains over top SAM2 variants, and the geometry-aware tracks even lift well to 3D instance segmentation online.

Main achievement: Showing that geometry-consistent recognition can be learned at training time and delivered at test time without any explicit 3D inputs—significantly strengthening promptable VOS under tough viewpoint shifts.

Future directions: Tailored memory selection for fused features; exploring alternate or lighter 3D backbones; extending to dynamic geometry; scaling to richer prompts and multi-object, open-vocabulary setups; and pushing performance on outdoor, high-motion videos.

Why remember this: 3AM proves you don’t need heavy 3D machinery at inference to enjoy 3D-style consistency—teaching a 2D tracker good geometric habits early yields stable, practical gains for video editing, robotics, AR, and beyond.

Practical Applications

  • Stable video editing masks for cinematic camera moves, pans, and orbits.
  • AR effects that stay stuck to the right object as users walk around it.
  • Home robots that re-identify the same item from different sides to fetch or place it.
  • Sports analytics that follow a player or ball across wide camera angles.
  • E-commerce try-on or product demos that keep segmenting the right garment or object as the user turns.
  • Casual photo set organization: select an object once, propagate consistent selections across the album.
  • On-the-fly 3D instance segmentation by projecting robust 2D tracks into 3D without heavy merging.
  • Video surveillance scenarios where the target reappears after occlusion or from a new angle.
  • Education and science demos that highlight the same specimen across rotating views.
  • Pre-visualization in filmmaking: consistent object masks across varied shots without 3D rigs.
Tags: video object segmentation, SAM2, geometry-aware tracking, multi-view consistency, MUSt3R, feature fusion, cross-view correspondence, field-of-view sampling, promptable segmentation, 3D instance segmentation, wide-baseline tracking, memory attention, ScanNet++, Replica dataset, IoU metrics