This paper introduces Causal-JEPA (C-JEPA), a world model that learns by masking entire objects in its memory and predicting each masked object from the remaining ones.
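
To make the masking idea concrete, here is a minimal sketch of predicting a hidden object from the others. It assumes each object is stored as a fixed-size embedding vector; the function names, the zero mask token, and the toy predictor (a linear map of the mean of the visible objects) are all hypothetical stand-ins for illustration, not the architecture or loss used in the paper.

```python
import numpy as np

def mask_objects(slots, masked_idx, mask_token):
    """Return a copy of `slots` with the chosen object rows replaced by `mask_token`."""
    masked = slots.copy()
    masked[masked_idx] = mask_token
    return masked

def predict_masked(masked_slots, masked_idx, weight):
    """Toy predictor (assumption): linear map of the mean of the visible objects."""
    visible = np.ones(masked_slots.shape[0], dtype=bool)
    visible[masked_idx] = False
    context = masked_slots[visible].mean(axis=0)  # shape (D,)
    return context @ weight                       # shape (D,)

def masked_object_loss(slots, masked_idx, mask_token, weight):
    """Mean squared error between the hidden objects and the prediction."""
    masked = mask_objects(slots, masked_idx, mask_token)
    pred = predict_masked(masked, masked_idx, weight)
    target = slots[masked_idx]                    # shape (k, D)
    return float(np.mean((target - pred) ** 2))

# Four objects with 8-dimensional embeddings; hide object 2.
rng = np.random.default_rng(0)
slots = rng.normal(size=(4, 8))
mask_token = np.zeros(8)   # hypothetical mask token
weight = np.eye(8)         # untrained placeholder weights
loss = masked_object_loss(slots, [2], mask_token, weight)
```

In a trained model, `weight` would be replaced by a learned predictor and the loss minimized over many scenes; the sketch only shows that the whole object is hidden and that the prediction uses the other objects alone.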
This paper shows that strong image-understanding features alone are not enough to generate high-quality images; strong pixel-level detail is also needed.