Before this work, most text-to-image models relied on VAEs (which squish images into small latent codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.
OpenVision 3 is a single vision encoder that learns one set of image tokens that works well for both understanding images (like answering questions) and generating images (like making new pictures).
UniX is a new medical AI model that both understands chest X-rays (writing accurate reports) and generates chest X-ray images (with high visual quality) without making the two jobs fight each other.
Reasoning Palette gives a language or vision-language model a tiny hidden “mood” (a latent code) before it starts answering, so it chooses a smarter plan rather than just rolling dice on each next word.
FlashPortrait makes talking-portrait videos that keep a person’s identity steady for as long as you want—minutes or even hours.
The method in this paper turns one flat picture into a neat stack of see‑through layers, so you can edit one thing without messing up the rest.
SS4D is a new AI model that turns a short single-camera video into a full 3D object that moves over time (that’s 4D), and it does this in about 2 minutes.
UniUGP is a single system that learns to understand road scenes, explain its thinking, plan safe paths, and even imagine future video frames.
This paper shows that we can turn big, smart vision features into a small, easy-to-use code for image generation with just one attention layer.
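To make the "one attention layer" idea in the last summary concrete, here is a minimal sketch of one plausible reading: a single cross-attention layer in which a few query vectors pool many large feature tokens into a small code. The summary does not describe the paper's actual architecture, so every name, shape, and the query-pooling design itself (`compress`, `queries`, `w_key`, `w_value`, the sizes) is an assumption made for illustration, with random weights standing in for trained ones.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def random_matrix(rows, cols, rng):
    """Random Gaussian matrix; stands in for trained weights (assumption)."""
    return [[rng.gauss(0.0, 1.0 / math.sqrt(cols)) for _ in range(cols)]
            for _ in range(rows)]


def compress(features, queries, w_key, w_value):
    """One cross-attention layer: each query attends over all feature tokens.

    features: n_tokens vectors of size d_in (outputs of a vision encoder)
    queries:  n_latents vectors of size d_latent (hypothetical learned queries)
    w_key, w_value: (d_latent x d_in) projection matrices
    Returns n_latents vectors of size d_latent, i.e. the small code.
    """
    keys = [matvec(w_key, f) for f in features]
    values = [matvec(w_value, f) for f in features]
    d_latent = len(queries[0])
    scale = 1.0 / math.sqrt(d_latent)
    latents = []
    for q in queries:
        # Attention weights of this query over every feature token.
        weights = softmax([dot(q, k) * scale for k in keys])
        # Weighted average of the value vectors, one output channel at a time.
        latents.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(d_latent)])
    return latents


# Illustrative sizes only: 16 tokens of width 32 become 8 latents of width 4.
rng = random.Random(0)
D_IN, D_LATENT, N_TOKENS, N_LATENTS = 32, 4, 16, 8
features = random_matrix(N_TOKENS, D_IN, rng)
queries = random_matrix(N_LATENTS, D_LATENT, rng)
w_key = random_matrix(D_LATENT, D_IN, rng)
w_value = random_matrix(D_LATENT, D_IN, rng)

latents = compress(features, queries, w_key, w_value)
print(len(latents), len(latents[0]))  # 8 4
```

In this sketch the compression comes from two places: the queries reduce the number of tokens, and the key/value projections reduce the width of each token. A real system would train these weights and would likely use a tensor library rather than Python lists.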