This paper reframes a popular guidance trick for image generation (Classifier-Free Guidance) as a feedback-control problem, much like keeping a car steady in its lane.
GigaBrain-0.5M* is a robot brain that sees, reads, and acts, and it gets smarter by imagining the future before moving.
Text-to-image models trained with GRPO used to give every step the same final reward, which is like giving the whole team the same grade no matter who did what.
The paper shows that using information from many layers of a language model (not just one) helps text-to-image diffusion transformers follow prompts much better.
Green-VLA is a step-by-step training recipe that teaches one model to see, understand language, and move many kinds of robots safely and efficiently.
LingBot-VLA is a robot brain that listens to language, looks at the world, and decides smooth actions to get tasks done.
Before this work, most text-to-image models relied on VAEs, which squish images into small codes, and struggled with slow training and overfitting on high-quality fine-tuning sets.
Qwen3-TTS is a family of text-to-speech models that can talk in 10+ languages, clone a new voice from just 3 seconds, and follow detailed style instructions in real time.
Robots usually think in words and pictures, but their hands need exact motions, so there is a gap between understanding and doing.
This paper shows how to make powerful image-generating Transformers run fast on phones without needing the cloud.
TAG-MoE is a new way to steer Mixture-of-Experts (MoE) models using clear task hints, so the right “mini-experts” handle the right parts of an image job.
This paper makes video editing easier by teaching an AI to spread changes from the first frame across the whole video smoothly and accurately.