The paper shows that using information from many layers of a language model (not just one) helps text-to-image diffusion transformers follow prompts much better.
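To make the idea of "many layers, not just one" concrete, here is a minimal sketch. It is not the paper's method, which this summary does not spell out. It assumes the simplest possible way to combine layers: a weighted average of the per-token features from a few chosen layers. The function names (`softmax`, `fuse_layers`), the choice of layers, and the toy numbers are all hypothetical.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(hidden_states, layer_ids, layer_scores):
    """Blend per-token features from several layers into one feature per token.

    hidden_states[layer][token] is a feature vector (list of floats).
    layer_ids picks which layers to use (an assumption, not from the paper).
    layer_scores holds one score per chosen layer; a real model would learn these.
    """
    weights = softmax(layer_scores)
    num_tokens = len(hidden_states[layer_ids[0]])
    dim = len(hidden_states[layer_ids[0]][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for weight, layer in zip(weights, layer_ids):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += weight * hidden_states[layer][t][d]
    return fused

# Toy stand-in for a language model: 4 layers, 2 tokens, 3 features each.
toy_states = [
    [[float(layer + token + d) for d in range(3)] for token in range(2)]
    for layer in range(4)
]

# Common baseline: condition the image model on the last layer only.
single_layer = toy_states[-1]

# Multi-layer variant: blend layers 1, 2 and 3 with equal weights.
multi_layer = fuse_layers(toy_states, [1, 2, 3], [0.0, 0.0, 0.0])
```

Either `single_layer` or `multi_layer` would then be passed to the image model as its text conditioning. The fused version has the same shape as the baseline, so in this sketch it could be swapped in without changing anything downstream.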
Big AI models used to get better by getting wider or reading longer texts, but the gains from those tricks are slowing down.