ReVSeg teaches an AI to segment objects in videos by thinking step-by-step instead of guessing everything at once.
Before this work, big vision-language models (VLMs) were great at understanding pictures and words together but not at making new pictures.