Vision Transformers (ViTs) are great at recognizing what is in a whole image but often blur the tiny details needed to label each pixel (segmentation).
This paper teaches an AI to segment any object you name (open-vocabulary) much better by adding a few example pictures with pixel labels and smart retrieval.
This paper builds a medical image segmentation system that uses both pictures (like X-rays) and words (short clinical text) at the same time.
OpenVoxel is a training-free way to understand 3D scenes by grouping tiny 3D blocks (voxels) into objects and giving each object a clear caption.
Co2S is a new way to train segmentation models with very few labels by letting two different students (CLIP and DINOv3) learn together and correct each other.
TimeLens studies how to teach AI not just what happens in a video, but exactly when it happens, which is called video temporal grounding (VTG).