Utonia is a single brain (encoder) that learns from many kinds of 3D point clouds, like indoor rooms, outdoor streets, tiny toys, and even city maps.
This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.