The paper shows that giving text-to-image diffusion transformers information from many layers of a language model, rather than just one, markedly improves how well they follow prompts.
Robots usually reason in words and pictures, but their hands need precise motions, leaving a gap between understanding and doing.