This paper shows that many AI models built to both understand and generate images are not truly unified inside: they often understand well but fail to generate, or the other way around.
This paper shows that giving an AI a safe, tiny virtual computer (a sandbox) lets it solve many kinds of problems better, not just coding ones.
TAG-MoE is a new way to steer Mixture-of-Experts (MoE) models using clear task hints, so the right “mini-experts” handle the right parts of an image task.
Image-to-Video models often keep the picture looking right but ignore parts of the text instructions.
This paper fixes a common problem in multimodal AI: models can understand pictures and words well but stumble when asked to create matching images.
Traditional self-driving systems used separate boxes for seeing, thinking, and acting, but tiny mistakes in the early boxes could snowball into big problems later.