Robots learn better when they predict short, meaningful summaries of future images instead of drawing every pixel of the future scene.
The paper teaches small AI models to make high‑quality text embeddings by first copying a big expert model (distillation) and then practicing four jobs (retrieval, similarity, clustering, and classification) with special mini‑modules (LoRA adapters).
ArcFlow is a new way to make text-to-image models draw great pictures in only 2 steps instead of 50, giving about a 40× speed boost.