The paper shows that using information from many layers of a language model (not just one) helps text-to-image diffusion transformers follow prompts much better.
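A minimal sketch of that idea, with all names, layer choices, and dimensions as illustrative assumptions rather than the paper's actual architecture: instead of conditioning the diffusion transformer on the text encoder's last hidden state alone, collect hidden states from several layers and merge them with learned weights.

```python
import torch
import torch.nn as nn

class MultiLayerTextCondition(nn.Module):
    """Merge hidden states from several LLM layers into one conditioning sequence.

    Hypothetical sketch: the layer selection, softmax mixing, and linear projection
    are assumptions, not the paper's exact design.
    """

    def __init__(self, hidden_dim: int, num_layers_used: int, cond_dim: int):
        super().__init__()
        # One learnable mixing weight per selected layer.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers_used))
        self.proj = nn.Linear(hidden_dim, cond_dim)

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: list of [batch, seq_len, hidden_dim], one per selected layer.
        stacked = torch.stack(hidden_states, dim=0)            # [L, B, T, H]
        weights = torch.softmax(self.layer_weights, dim=0)     # [L]
        mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(0)   # [B, T, H]
        return self.proj(mixed)                                # [B, T, cond_dim]

# Usage idea: with a Hugging Face text encoder, request all hidden states,
# pick a few layers, and feed the mixed result to the diffusion transformer.
# outputs = text_encoder(input_ids, output_hidden_states=True)
# selected = [outputs.hidden_states[i] for i in (8, 16, 24, -1)]
# cond = MultiLayerTextCondition(hidden_dim=4096, num_layers_used=4, cond_dim=1152)(selected)
```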
Green-VLA is a step-by-step training recipe that teaches one model to see, understand language, and move many kinds of robots safely and efficiently.
LingBot-VLA is a robot brain that listens to language, looks at the world, and plans smooth actions to get tasks done.
Before this work, most text-to-image models used VAEs (which squeeze images into small, compressed latent codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.
Qwen3-TTS is a family of text-to-speech models that can talk in 10+ languages, clone a new voice from just 3 seconds, and follow detailed style instructions in real time.
Robots usually think in words and pictures, but their hands need exact motions, so there is a gap between understanding and doing.
This paper shows how to make powerful image-generating Transformers run fast on phones without needing the cloud.
TAG-MoE is a new way to steer Mixture-of-Experts (MoE) models using clear task hints, so the right “mini-experts” handle the right parts of an image job.
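A hedged sketch of task-guided routing, assuming the "task hint" enters the router as an extra bias on the expert logits; the class names and the exact way the hint is injected are my assumptions, not TAG-MoE's published design.

```python
import torch
import torch.nn as nn

class TaskGuidedRouter(nn.Module):
    """Route tokens to experts using both token content and an explicit task hint.

    Illustrative sketch only: the real method may inject the task signal differently.
    """

    def __init__(self, dim: int, num_experts: int, num_tasks: int, top_k: int = 2):
        super().__init__()
        self.token_gate = nn.Linear(dim, num_experts)           # content-based logits
        self.task_bias = nn.Embedding(num_tasks, num_experts)   # task-based bias
        self.top_k = top_k

    def forward(self, tokens: torch.Tensor, task_id: torch.Tensor):
        # tokens: [batch, seq, dim]; task_id: [batch] integer task labels.
        logits = self.token_gate(tokens) + self.task_bias(task_id)[:, None, :]
        weights, experts = torch.topk(torch.softmax(logits, dim=-1), self.top_k, dim=-1)
        return weights, experts  # which experts handle each token, and how strongly
```

The design point the summary hints at: the task hint biases every token toward the experts specialized for that kind of image job, instead of leaving routing entirely to content-based gating.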
This paper makes video editing easier by teaching an AI to spread changes from the first frame across the whole video smoothly and accurately.
This paper shows that when teaching image generators with reinforcement learning, only a few early, very noisy steps actually help the model learn what people like.
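A hedged sketch of that takeaway: apply the reward-driven update only at the high-noise (early) denoising steps and skip the rest. The cutoff fraction, tensor shapes, and REINFORCE-style loss form are illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def reward_weighted_loss(log_probs, rewards, timesteps, num_steps, noisy_fraction=0.3):
    """Policy-gradient-style loss that only counts the earliest, noisiest steps.

    log_probs:  [batch, num_steps] log-prob of each denoising action
    rewards:    [batch] preference/human reward for the final image
    timesteps:  [batch, num_steps] step index, 0 = most noisy, num_steps-1 = least
    Assumption: only the first `noisy_fraction` of steps contribute to learning.
    """
    cutoff = int(num_steps * noisy_fraction)
    mask = (timesteps < cutoff).float()                  # keep only high-noise steps
    per_step = -(rewards[:, None] * log_probs) * mask    # REINFORCE-style term
    return per_step.sum() / mask.sum().clamp(min=1.0)
```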
AniX is a system that lets you place any character into any 3D world and control them with plain language, like “run forward” or “play a guitar.”
RecTok is a new visual tokenizer that makes the whole training path of a diffusion model (the forward flow) carry image meaning, not just the starting latent features.
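A hedged sketch of one way to read "teaching the forward flow": add a semantic alignment loss on the noisy interpolated latents, not only on the clean tokenizer output. The interpolation convention, the frozen semantic encoder, and the `align_head` module are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def flow_alignment_loss(latent, noise, t, semantic_feat, align_head):
    """Align noisy points along the forward flow with frozen semantic features.

    latent:        [B, D] clean tokenizer latent
    noise:         [B, D] Gaussian noise
    t:             [B, 1] interpolation time in [0, 1]
    semantic_feat: [B, D_sem] frozen features (e.g. from a pretrained image encoder)
    align_head:    small module mapping D -> D_sem (hypothetical helper)
    """
    # Rectified-flow-style interpolation between the clean latent and pure noise.
    x_t = (1.0 - t) * latent + t * noise
    pred = align_head(x_t)
    # Cosine alignment so points all along the path, not just x_0, stay semantically meaningful.
    return 1.0 - F.cosine_similarity(pred, semantic_feat, dim=-1).mean()
```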