This paper introduces Causal-JEPA (C-JEPA), a world model that learns by hiding entire objects in its memory and forcing itself to predict them from other objects.
Real people often ask vague questions with pictures, and todayβs vision-language models (VLMs) struggle with them.