This paper shows you can train a large text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a variational autoencoder (VAE).
Training a neural network is like finding the lowest spot in a giant, bumpy landscape called the loss landscape.
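To make the landscape picture concrete, here is a minimal sketch of gradient descent on a toy one-dimensional landscape. The function `loss`, its derivative `grad`, and the helper `gradient_descent` are illustrative assumptions, not anything from the paper. A real network has millions of parameters rather than one, but the idea of repeatedly stepping downhill is the same.

```python
import math

def loss(x):
    # Hypothetical toy landscape: a bowl (x**2) with bumps added by the sine term.
    return x**2 + 0.5 * math.sin(5 * x)

def grad(x):
    # Analytic derivative of the toy loss with respect to x.
    return 2 * x + 2.5 * math.cos(5 * x)

def gradient_descent(start, lr=0.01, steps=1000):
    # Repeatedly take a small step in the downhill direction.
    x = start
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Two different starting points on the same landscape.
x_a = gradient_descent(-1.0)
x_b = gradient_descent(2.0)

print(f"start -1.0 -> x = {x_a:.3f}, loss = {loss(x_a):.3f}")
print(f"start  2.0 -> x = {x_b:.3f}, loss = {loss(x_b):.3f}")
```

In this toy setup the two runs settle in different valleys, and the run started at 2.0 ends with a higher loss than the one started at -1.0. That is the "bumpy" part of the analogy: plain downhill steps find a low spot near where they start, not necessarily the lowest one.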