This paper shows how to get strong text embeddings from decoder-only language models without any training.
This paper shows that a large text-to-image diffusion model can be trained directly on the features of a vision foundation model (such as DINOv3), without a VAE.