This paper makes an AI much better at segmenting any object you name (open-vocabulary segmentation) by giving it a few example images with pixel-level labels, picked through smart retrieval.
People often pick CLIP-like models for image labeling, but this paper shows that large multimodal models (LMMs) can be just as good—or even better—when you give them a few examples in the prompt (in-context learning).
Most people on Earth speak more than one language and often switch languages within the same chat, but AI tools aren't tested well on this real-world behavior.
IC-Effect is a new way to add special effects to existing videos by following a text instruction while keeping everything else unchanged.