DreamWorld is a new way to make videos that not only look real but also follow common-sense rules about motion, space, and meaning.
WoG (World Guidance) teaches a robot to imagine just the right bits of the near future and use those bits to pick better actions.
The paper turns image editing from a one-step “before → after” trick into a mini physics simulation that follows real-world rules.
VidEoMT shows that a single, well-trained Vision Transformer (ViT) can segment and track objects in videos without extra tracking gadgets.
This paper introduces Causal-JEPA (C-JEPA), a world model that learns by hiding entire objects from its internal representation and forcing itself to predict them from the remaining objects.
PixelGen is a new image generator that works directly with pixels and uses a perceptual loss (guidance based on what looks good to people) to improve quality.
This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.
This paper shows that the best VAEs for image generation are the ones whose latents neatly separate object attributes, a property called semantic disentanglement.
This paper teaches an AI model to understand both which way an object is facing (orientation) and how it turns between views (rotation), all in one system.
DiffProxy turns tricky multi-camera photos of a person into a clean 3D body and hands by first painting a precise 'map' on each pixel and then fitting a standard body model to that map.
MorphAny3D is a training-free way to smoothly change one 3D object into another, even if they are totally different (like a bee into a biplane).
This paper shows that great image-understanding features alone are not enough for making great images; you also need strong pixel-level detail.