SAMTok turns any object’s mask in an image into just two special “words,” so language models can handle pixels the same way they handle text.
OpenVoxel is a training-free way to understand 3D scenes by grouping tiny 3D blocks (voxels) into objects and giving each object a clear caption.
3AM is a new way to track and segment the same object across a whole video, even when the camera view changes a lot.
VideoLoom is a single AI model that can tell both when something happens in a video and where it happens, at the pixel level.
This paper teaches a video generator to move objects realistically by distilling motion knowledge from a strong video tracker.
OpenSubject is a giant video-based dataset (2.5M samples, 4.35M images) built to help AI generate images in which each person or object keeps a consistent appearance, even in busy scenes.
ReVSeg teaches an AI to segment objects in videos by thinking step-by-step instead of guessing everything at once.