Qwen3-TTS is a family of text-to-speech models that can talk in 10+ languages, clone a new voice from just 3 seconds, and follow detailed style instructions in real time.
The paper shows how to control accents in text-to-speech (TTS) by mixing simple, linguistics-based sound-change rules with speaker embeddings.
Digital humans used to just copy motions; this paper makes them think, speak, and move in sync like real people.