Qwen3-TTS Technical Report
Key Summary
- Qwen3-TTS is a family of text-to-speech models that can talk in 10+ languages, clone a new voice from just 3 seconds, and follow detailed style instructions in real time.
- It comes with two special speech tokenizers: a 25 Hz one that mixes meaning and sound for very natural speech, and a 12.5 Hz one that is ultra-fast for streaming with first audio in about 97–101 ms.
- A dual-track architecture lets the model read text and immediately predict the next bit of speech tokens, so audio starts playing almost right away.
- The 12 Hz models achieve state-of-the-art accuracy on zero-shot voice cloning (e.g., WER 1.24 on English Seed-TTS), and beat strong commercial systems on multilingual speaker similarity.
- For cross-language voice transfer (like a Chinese voice speaking Korean), Qwen3-TTS cuts errors dramatically (about 66% lower than a top baseline in zh→ko).
- Instruction-following is strong: the Voice Design model sets new open-source records on the InstructTTSEval benchmark, even rivaling commercial tools.
- For long speeches (10+ minutes), the 25 Hz model is extra stable and keeps prosody smooth without chunk boundaries.
- Training uses over 5 million hours of speech, long-context pretraining, and human-feedback alignment (DPO/GSPO) to make speech both natural and reliable.
- All models and tokenizers are released under Apache 2.0, encouraging community use and research.
- Bottom line: Qwen3-TTS makes high-quality, controllable, multilingual, and low-latency speech generation practical at scale.
Why This Research Matters
Real-time, natural-sounding speech makes conversations with computers feel human, reducing friction in assistants, customer support, and education. With strong multilingual ability, the same trusted voice can serve global users, improving brand consistency and accessibility. Fine-grained control over tone, speed, and style helps creators, teachers, and companies tailor audio exactly to the moment. Ultra-low latency enables interactive scenarios like live tutoring, on-the-fly translation, and responsive gaming NPCs. Accurate voice cloning from just seconds reduces production costs and expands creative options while highlighting the need for consent and safety practices. Stable long-form narration unlocks podcasts, audiobooks, and training content generation without artifacts. Open-source licensing accelerates adoption and innovation across startups, research labs, and accessibility tech.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how an audiobook narrator reads words on a page and brings them to life with the right tone and rhythm?
🥬 Filling (The Actual Concept): Text-to-Speech (TTS) is technology that turns written words into spoken speech.
- What it is: A system that reads text out loud like a digital narrator.
- How it works: 1) Read the text, 2) Plan how it should sound (pronunciation, pauses, emotion), 3) Turn that plan into audio.
- Why it matters: Without TTS, computers can't speak to us naturally, making assistants, accessibility tools, and language learning much harder.
🍞 Bottom Bread (Anchor): When your map app says "Turn left in 200 meters," that's TTS doing the talking.
The World Before: Early TTS could be robotic and monotone. Newer neural TTS sounded better but still struggled with three big realities: real-time streaming (starting audio instantly, not after a long wait), fine-grained control (e.g., "speak like a calm teacher with a warm smile"), and multilingual consistency (keeping the same voice across different languages). Voice cloning existed, but often needed longer samples, and changing style or emotion precisely was tricky.
🍞 Top Bread (Hook): Imagine building with LEGO bricks: you need the right-sized blocks to make strong models fast.
🥬 Filling (The Actual Concept): Discrete Speech Tokenization breaks speech into small building-block tokens for models to handle.
- What it is: A way to turn continuous audio into sequences of symbols (tokens), like letters for sound.
- How it works: 1) Analyze audio, 2) Compress it into tokens that keep meaning and sound details, 3) Let a model predict these tokens, then 4) Turn tokens back to audio.
- Why it matters: Without tokens, the model has to juggle raw waveforms: too big and too detailed for fast, stable generation.
🍞 Bottom Bread (Anchor): It's like zipping a huge video into a smaller file so you can stream it smoothly and then unzip it to watch clearly.
The Problem: Make TTS that is at once multilingual, controllable, robust for long passages, and ultra-low-latency for streamingâplus clone a voice from just 3 seconds. Teams found trade-offs: fast systems could sound less expressive; expressive systems could be slow; multilingual systems might change voice timbre or accent too much.
🍞 Top Bread (Hook): Picture trying to pour water as someone keeps handing you new cups: you want the water to start flowing immediately, not after you finish stacking all the cups.
🥬 Filling (The Actual Concept): Streaming Synthesis means the model starts speaking while it's still reading the text.
- What it is: Audio begins almost right away and keeps flowing as more text arrives.
- How it works: 1) Read a bit of text, 2) Predict the next chunk of speech tokens, 3) Decode and play them, 4) Repeat.
- Why it matters: Without streaming, users wait for the whole sentence or paragraph before hearing anything.
🍞 Bottom Bread (Anchor): Voice assistants that answer mid-sentence instead of making you wait are using streaming.
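The four-step loop above can be made concrete with a minimal Python sketch. It is illustrative only: `predict_speech_tokens` and `decode_to_audio` are hypothetical stand-ins for the language model and the token-to-waveform decoder, not the Qwen3-TTS API.

```python
from typing import Iterable, Iterator, List


def stream_tts(text_chunks: Iterable[str],
               predict_speech_tokens,
               decode_to_audio,
               tokens_per_packet: int = 4) -> Iterator[bytes]:
    """Minimal streaming-synthesis loop: start speaking while text still arrives.

    `predict_speech_tokens` and `decode_to_audio` are hypothetical callables
    standing in for the language model and the token-to-waveform decoder.
    """
    pending: List[int] = []
    for chunk in text_chunks:                             # 1) read a bit of text
        pending.extend(predict_speech_tokens(chunk))      # 2) predict speech tokens
        while len(pending) >= tokens_per_packet:          # 3) decode a packet...
            packet, pending = pending[:tokens_per_packet], pending[tokens_per_packet:]
            yield decode_to_audio(packet)                 # ...and play it right away
    if pending:                                           # 4) flush the tail
        yield decode_to_audio(pending)
```

Because packets leave as soon as enough tokens exist, the listener hears audio long before the full text has been processed.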
Failed Attempts: Purely semantic tokenizers (great for the "what" being said) often lacked expressive detail. Purely acoustic tokenizers (great for the "how" it's said) could overload the language model with fine-grained bits, causing drift and errors over long speech. Diffusion decoders produced high quality but added delay; some autoregressive systems were quick but fragile for long or multilingual speech.
🍞 Top Bread (Hook): Think of a friend who can copy voices after hearing a short clip from a cartoon: their brain grabs what makes that voice unique.
🥬 Filling (The Actual Concept): Voice Cloning copies a person's voice from a short sample.
- What it is: Making a new digital speaker that sounds like the target.
- How it works: 1) Listen to a reference clip, 2) Extract speaker traits (timbre, pitch habits), 3) Use those traits to guide new speech.
- Why it matters: Without cloning, every voice would sound generic or need hours of training.
🍞 Bottom Bread (Anchor): Giving your game character your own voice after the model hears a 3-second clip is voice cloning.
The Gap: We needed a design that kept meaning stable, carried expressive detail, started speaking in under ~100 ms, and held the same voice across languages and minutes of speech. We also needed natural language control ("brighter tone, slightly slower, hint of excitement") that the model truly follows.
🍞 Top Bread (Hook): Imagine the same actor performing in English, Chinese, and French, but always sounding like themselves.
🥬 Filling (The Actual Concept): Multilingual Capability lets the same digital voice speak many languages consistently.
- What it is: One model that reads multiple languages without changing identity.
- How it works: 1) Train on many languages, 2) Learn shared patterns of speech and pronunciation, 3) Keep speaker traits steady across languages.
- Why it matters: Without it, your "voice" becomes a different person when switching languages.
🍞 Bottom Bread (Anchor): A cloned "Aiden" voice speaking both English and Korean while sounding like the same person is multilingual capability at work.
Real Stakes: Faster, clearer TTS improves accessibility (screen readers), customer support, education, entertainment, and voice assistants. Low latency makes conversations feel natural. Controllability and cloning let creators and companies craft distinct, brand-safe voices without big recording sessions. Multilingual consistency bridges global audiences with the same trusted voice.
02 Core Idea
🍞 Top Bread (Hook): Imagine a factory with two conveyor belts: one brings in words, the other instantly sends out the matching sounds.
🥬 Filling (The Actual Concept): The key insight is to pair a dual-track language model that reads text and immediately predicts speech tokens with two complementary speech tokenizers (one rich, one ultra-fast) so you get high quality and ultra-low latency together.
- What it is: A unified system where text tokens and speech tokens flow side-by-side, decoded by tokenizers tuned for either expressivity (25 Hz) or speed (12.5 Hz).
- How it works: 1) Convert text to tokens, 2) Dual-track model predicts the next speech tokens as soon as text arrives, 3) Decode tokens to audio with the appropriate tokenizer's decoder, 4) Stream it out, instantly.
- Why it matters: Without this pairing, you usually pick either fast or expressive; here you get both, plus strong control.
🍞 Bottom Bread (Anchor): You type "Read in a friendly, calm tone," and within about a tenth of a second the voice starts speaking that way.
Multiple Analogies:
- Camera lenses: The 25 Hz tokenizer is like a portrait lens (expressive detail), while the 12.5 Hz tokenizer is a fast-action lens (quick capture). The body (dual-track LM) swaps lenses depending on the shot.
- Highways: One lane carries "meaning" traffic; another carries "sound style" traffic. Smart merging keeps flow smooth without jams.
- Orchestra: Text is the sheet music; tokenizers are instrument sections (strings for warmth, percussion for timing); the dual-track conductor coordinates them in real time for a seamless performance.
Before vs After:
- Before: You waited longer for first audio, had weaker control, or lost consistency in long/multilingual passages.
- After: First audio in ~97–101 ms (12 Hz variants), precise style control via instructions, strong multilingual identity, and stable long-form speech.
🍞 Top Bread (Hook): Think of solving a puzzle by placing several pieces at once instead of one-by-one.
🥬 Filling (The Actual Concept): Multi-Token Prediction (MTP) predicts multiple codebook tokens per time step.
- What it is: A module that outputs several speech tokens together (semantic layer first, then residual acoustic layers).
- How it works: 1) Predict the main semantic token, 2) Predict all residual acoustic tokens in parallel, 3) Combine them to form a rich frame.
- Why it matters: Without MTP, generation is slower and may miss fine details or add delay.
🍞 Bottom Bread (Anchor): It's like guessing not just the next letter, but the next letter plus the accents and flourishes in one go: faster and prettier.
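A rough sketch of the MTP idea, assuming PyTorch; the hidden width and vocabulary sizes below are illustrative assumptions, and only the count of 15 residual codebooks comes from the 12.5 Hz tokenizer described later.

```python
import torch
import torch.nn as nn


class MultiTokenPredictionHead(nn.Module):
    """Sketch of multi-token prediction: one semantic token plus all residual
    acoustic tokens for the same frame. Sizes are illustrative, not the paper's."""

    def __init__(self, hidden: int = 1024, semantic_vocab: int = 8192,
                 acoustic_vocab: int = 1024, num_residual: int = 15):
        super().__init__()
        self.semantic_head = nn.Linear(hidden, semantic_vocab)
        # One classifier per residual acoustic codebook, evaluated in parallel.
        self.acoustic_heads = nn.ModuleList(
            [nn.Linear(hidden, acoustic_vocab) for _ in range(num_residual)]
        )

    def forward(self, frame_state: torch.Tensor):
        semantic_logits = self.semantic_head(frame_state)                # the "what"
        acoustic_logits = [head(frame_state) for head in self.acoustic_heads]  # the "how"
        return semantic_logits, acoustic_logits


head = MultiTokenPredictionHead()
sem, acoustics = head(torch.randn(1, 1024))   # one frame's hidden state
print(sem.shape, len(acoustics))              # torch.Size([1, 8192]) 15
```

The point is that one forward pass yields a whole frame's worth of tokens instead of one token at a time.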
🍞 Top Bread (Hook): You know how some backpacks have one big compartment (simple but cramped), while others have many pockets (organized and fast to access)?
🥬 Filling (The Actual Concept): Two Tokenizers (25 Hz single-codebook and 12.5 Hz multi-codebook) cover different needs.
- What it is: 25 Hz blends semantic and acoustic cues with a diffusion transformer for high-fidelity; 12.5 Hz separates them across many small codebooks and uses a lightweight causal ConvNet for decoding.
- How it works: 25 Hz: single track → mel via flow matching (DiT) → BigVGAN vocoder. 12.5 Hz: semantic head + 15-layer RVQ for acoustics → causal ConvNet decodes directly (a minimal RVQ sketch follows this sandwich).
- Why it matters: Without both, you'd be stuck either with higher latency (25 Hz only) or less general expressivity/control (12.5 Hz only). Together, you choose the best tool per use case.
🍞 Bottom Bread (Anchor): A live chatbot picks the 12 Hz path to talk instantly; a studio-quality narration can pick 25 Hz for lush expressivity.
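The "15-layer RVQ for acoustics" above is residual vector quantization. The NumPy sketch below shows the mechanism with random, made-up codebooks (the real codebooks are learned and their sizes are not taken from the report): each layer quantizes whatever the previous layers left over.

```python
import numpy as np


def rvq_encode(frame: np.ndarray, codebooks: list) -> list:
    """Residual vector quantization: each codebook encodes the residual left by
    the previous one, so later layers add progressively finer acoustic detail."""
    residual = frame.copy()
    indices = []
    for codebook in codebooks:                            # e.g. 15 acoustic layers
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(dists.argmin())                         # nearest code vector
        indices.append(idx)
        residual = residual - codebook[idx]               # pass on what's left
    return indices


# Toy usage with random codebooks (shapes are illustrative only).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(15)]
frame = rng.normal(size=64)
print(rvq_encode(frame, codebooks))                       # 15 token indices per frame
```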
Why It Works (intuition):
- Discrete tokens simplify what the LM must learn, avoiding long-horizon waveform errors.
- Separating semantic and acoustic layers (especially at 12.5 Hz) keeps meaning stable while letting style be refined.
- The dual-track arrangement reduces "thinking delay," mapping text to sound continuously.
- Post-training with human preferences (DPO/GSPO) tunes outputs toward what listeners like.
Building Blocks (with sandwiches as introduced):
- Dual-track LM Architecture. 🍞 Hook: Imagine two runners passing a baton in sync: text leads, audio follows immediately. 🥬 Filling: A model that ingests text tokens and emits aligned speech tokens right away; keeps speaker identity via a learned speaker encoder. Why it matters: Without it, you'd wait for whole sentences and lose real-time feel. 🍞 Anchor: While you type a question, the voice already begins answering.
- Qwen-TTS-Tokenizer-25Hz. 🍞 Hook: Like a high-res camera that captures both face and background texture. 🥬 Filling: Single-codebook at 25 Hz; tokens converted to mel with a Flow-Matching DiT, then BigVGAN to waveform; supports streaming with block-wise attention. Why it matters: Without it, you lose some expressive richness. 🍞 Anchor: Rich, human-like storytelling with smooth prosody.
- Qwen-TTS-Tokenizer-12Hz. 🍞 Hook: A sprint shoe, built for fast starts and efficient strides. 🥬 Filling: 12.5 Hz, multi-codebook (semantic + 15-layer RVQ acoustics), fully causal, decoded by a lightweight ConvNet; emits audio immediately. Why it matters: Without it, you can't hit ~100 ms first audio in streaming. 🍞 Anchor: A customer support bot that answers with near-zero delay.
- Multi-Token Prediction (MTP): see the sandwich above.
- Instruction Following and Voice Cloning. 🍞 Hook: Like giving a director's note ("more cheerful, a bit slower") and the actor adjusts. 🥬 Filling: Prompts include style controls; cloning from a 3 s reference or from a text→speech example; post-training aligns behavior to human preferences. Why it matters: Without this, you can't reliably steer tone or identity. 🍞 Anchor: "Use Aiden's voice, slower pace, soft enthusiasm."
03 Methodology
At a high level: Input (text + optional voice reference + style instructions) → Tokenization (text tokens + speech tokens for references) → Dual-track LM predicts next speech tokens as text arrives → Code2Wav decodes speech tokens to audio → Streamed Output.
Step 1: Prepare Inputs (ChatML formatting)
- What happens: The system receives text plus optional controls: a 3-second voice clip for cloning, a chosen preset voice, or a descriptive instruction like "warm, slower, news anchor style." All are wrapped in a consistent ChatML format so the model reads them like a conversation.
- Why it exists: Without a unified format, the model would struggle to parse instructions and references consistently.
- Example: User: "Please read this paragraph in Aiden's voice, 10% slower, gentle and friendly." + a 3-second Aiden clip.
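As a rough illustration of how such a request might be laid out, here is a ChatML-style string. The role tags follow the generic ChatML pattern, but the specific control fields and wording are assumptions for illustration, not the exact format the released models require.

```python
# Illustrative only: role tags follow the generic ChatML pattern, while the
# "Voice"/"Style" control fields are assumed for this sketch, not official.
request = (
    "<|im_start|>system\n"
    "You are a text-to-speech model. Voice: Aiden (3 s reference attached). "
    "Style: gentle and friendly, 10% slower.<|im_end|>\n"
    "<|im_start|>user\n"
    "Please read this paragraph aloud.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(request)
```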
Step 2: Tokenize Text and (if any) Reference Audio
- What happens: Text → standard Qwen tokenizer. If cloning from a reference clip, the learnable speaker encoder extracts a speaker embedding (sketched below). If using in-context learning (text→speech pair), the speech is also encoded as tokens for the model to imitate prosody.
- Why it exists: Models need compact, structured inputs (tokens/embeddings) to reason efficiently.
- Example: "Bonjour à tous !" becomes text tokens; a 3-second sample becomes a speaker vector.
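The speaker-embedding step can be pictured with the toy sketch below. The real system uses a learned neural speaker encoder, so this NumPy stand-in only shows the interface (a roughly 3-second clip in, a fixed-size vector out); the pooling logic and the 192-dimensional output are assumptions.

```python
import numpy as np


def speaker_embedding(reference_audio: np.ndarray, sample_rate: int = 16000,
                      dim: int = 192) -> np.ndarray:
    """Toy stand-in for a learned speaker encoder: reduce a ~3 s reference clip
    to a fixed-size vector meant to capture who is speaking, not what is said."""
    clip = reference_audio[: sample_rate * 3]
    clip = clip[: len(clip) - len(clip) % 160]        # trim to whole 10 ms frames
    frames = clip.reshape(-1, 160)
    # A trained encoder would go here; mean/std pooling just fixes the shape.
    stats = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
    return np.resize(stats, dim)                      # shape: (dim,)


voice_clip = np.random.default_rng(0).normal(size=48000)   # pretend 3 s at 16 kHz
print(speaker_embedding(voice_clip).shape)                  # (192,)
```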
Step 3: Choose a Tokenizer Path (25 Hz vs 12.5 Hz)
- What happens: The system can deploy either: a) 25 Hz Single-Codebook Path (expressivity): Tokens later pass through a Flow-Matching Diffusion Transformer (DiT) to mel-spectrograms, then BigVGAN to waveform. Streaming is supported by chunking with a sliding-window attention mask (current block + 3-lookback + 1-lookahead), preserving context while keeping latency manageable. b) 12.5 Hz Multi-Codebook Path (ultra-low latency): A semantic codebook + 15-layer RVQ acoustic codebooks run fully causally. A lightweight causal ConvNet reconstructs waveforms incrementally with no look-ahead.
- Why it exists: Different applications need different trade-offs: lush narration vs instant replies.
- Example: Live voice chat uses 12.5 Hz; audiobook recording might prefer 25 Hz.
Step 4: Dual-Track Autoregressive Prediction
- What happens: The LM processes incoming text tokens and immediately predicts the aligned speech tokens frame-by-frame. For 12.5 Hz, it first predicts the semantic codebook, then the Multi-Token Prediction (MTP) module produces all residual acoustic codebooks at once.
- Why it exists: Without dual-track prediction, you'd wait for larger text chunks, increasing latency and risking instability.
- Example: As the phrase "the capital of France" is read, the next few speech frames are already predicted and sent to the decoder.
Step 5: Streaming Decoding to Waveform (Code2Wav)
- 25 Hz path: DiT with Flow Matching converts codes to mel in chunks using block-wise attention; BigVGAN turns mel into audio. Because the DiT needs a small look-ahead, the first packet requires buffering enough tokens (e.g., the LM must output 16 tokens before the first 8-token mel chunk can be decoded; at 25 Hz that chunk is 320 ms of mel, plus ~130 ms of vocoder right-context). Steady packets then roll every 8 tokens.
- 12.5 Hz path: Pure left-context decoding with a causal ConvNet; one token corresponds to 80 ms. To reduce overhead, packets are 4 tokens (≈320 ms) but can start as soon as tokens arrive, with no look-ahead required (the packet arithmetic is sketched after this step).
- Why it exists: Efficient decoding is crucial for first-audio latency and scalability with many users.
- Example: The 12.5 Hz 0.6B variant hits ≈97 ms first packet end-to-end in tests.
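As noted above, the packet arithmetic can be checked with a tiny helper. Note the distinction: the ~97–101 ms first-packet figures are wall-clock measurements of model compute, while the numbers below are the amount of speech each packet carries.

```python
def packet_duration_ms(frame_rate_hz: float, tokens_per_packet: int) -> float:
    """Duration of speech carried by one packet of discrete speech tokens."""
    return tokens_per_packet * 1000.0 / frame_rate_hz


# 25 Hz path: 8-token mel chunks carry 320 ms of audio, and the first packet
# additionally waits for 16 LM tokens plus ~130 ms of vocoder right-context.
print(packet_duration_ms(25.0, 8))    # 320.0

# 12.5 Hz path: 4-token packets also carry 320 ms (80 ms per token), but with no
# look-ahead they can be decoded as soon as the 4 tokens exist, so first-audio
# wall-clock latency is set by generation speed, not by packet duration.
print(packet_duration_ms(12.5, 4))    # 320.0
```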
Step 6: Training Strategy for Quality and Stability
- Pre-training S1 (General): Train on 5M+ hours of multilingual audio to map text → speech robustly.
- Pre-training S2 (High-Quality CPT): Emphasize high-quality data to reduce hallucinations and improve naturalness.
- Pre-training S3 (Long-Context): Raise max tokens from 8,192 to 32,768 and upsample long clips for stability over 10+ minutes.
- Post-training: a) Direct Preference Optimization (DPO) aligns outputs with human preferences via preference pairs (a generic DPO loss sketch follows this step). b) GSPO + rule-based rewards improve intelligibility, style following, and stability. c) Lightweight speaker fine-tuning enables strong target-voice adaptation while staying multilingual and stable.
- Why it exists: Without staged training and alignment, models either sound bland, wander off-topic, or break in long passages.
- Example: After S3, the model can narrate a 12-minute article without odd resets or repeated words.
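For reference, the DPO objective mentioned above has a standard published form; the PyTorch sketch below is that generic loss over preference pairs, not Qwen3-TTS's actual training code, and the example log-probabilities are made up.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_chosen: torch.Tensor, policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Generic DPO loss: push the policy to prefer the 'chosen' speech-token
    sequence over the 'rejected' one, relative to a frozen reference model.
    Inputs are summed sequence log-probabilities."""
    margin = (policy_logp_chosen - ref_logp_chosen) - (policy_logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()


# Toy usage with made-up log-probabilities for a batch of 2 preference pairs.
loss = dpo_loss(torch.tensor([-50.0, -42.0]), torch.tensor([-55.0, -41.0]),
                torch.tensor([-52.0, -43.0]), torch.tensor([-54.0, -42.0]))
print(float(loss))
```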
Step 7: Controllability and Thinking Pattern
- What happens: Style/control instructions are prepended to inputs. A probabilistically activated "thinking pattern" during training helps the model interpret complex descriptions step-by-step before speaking.
- Why it exists: Without explicit instruction handling, the model may miss nuanced requests (e.g., "slower but brighter, slight smile").
- Example: "Use a confident yet empathetic tone; keep pauses after commas; medium volume."
Secret Sauce:
- Dual-track + MTP: Immediate alignment between text and rich speech tokens, with multiple codebooks predicted per step.
- Two tokenizers, two strengths: 25 Hz for expressivity and long-form smoothness; 12.5 Hz for near-instant streaming.
- Causal, batch-friendly decoders: Especially the 12.5 Hz ConvNet keeps latency low even with high concurrency.
- Human-feedback alignment: DPO/GSPO ensures the outputs feel natural to listeners, not just score well on ASR.
Concrete Data Walkthrough:
- Input: "Read: 'The aurora danced across the polar sky.' Voice: Aiden; 10% slower; calm, warm."
- Tokenization: Text → tokens; Aiden 3 s clip → speaker vector.
- Dual-track LM (12.5 Hz): Predict the semantic token for frame t, then MTP predicts 15 residual acoustic tokens; repeat.
- Decode: 4 tokens (≈320 ms) form the first packet; the causal ConvNet outputs audio; the user hears speech ≈100 ms after start.
- Outcome: Natural, warm, slightly slower voice consistent with Aiden across an entire paragraph.
04 Experiments & Results
The Test: Researchers measured how accurate, natural, controllable, and fast Qwen3-TTS is.
- Accuracy: Word Error Rate (WER) on Seed-TTS and multilingual sets; speech reconstruction quality (PESQ/STOI/UTMOS) and speaker similarity (SIM) for tokenizers. (A minimal WER sketch follows this list.)
- Control: InstructTTSEval for following detailed style/voice instructions.
- Cross-lingual: Keeping the same voice across languages on CV3-Eval.
- Long-form: Stability over 10+ minutes.
- Speed: First-packet latency and real-time factor (RTF) under concurrency.
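As promised above, a minimal WER sketch: word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words, and benchmarks usually report it as a percentage, so WER 1.24 means 1.24%. The example sentence is made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed here with plain Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(word_error_rate("the aurora danced across the sky",
                      "the aurora dance across sky"))   # 2 errors / 6 words ≈ 0.33
```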
The Competition: Compared with open and commercial systems like CosyVoice 3, MiniMax, ElevenLabs Multilingual v2, GPT-4o-mini-tts, Seed-TTS baselines, and others (Higgs-Audio-v2, VibeVoice, VoxCPM).
Scoreboard (with context):
- Zero-shot cloning (Seed-TTS): Qwen3-TTS-12Hz-1.7B reached WER 0.77 (zh) | 1.24 (en), which is like scoring an A+ when top classmates are at A or B (e.g., CosyVoice 3 at 0.71 | 1.45; Seed-TTS baseline at 1.12 | 2.25). The 12 Hz models consistently beat 25 Hz on WER, showing better long-range stability.
- Multilingual (10 languages): Qwen3-TTS gets best WER in 6/10 languages (Chinese, English, Italian, French, Korean, Russian) and best speaker similarity in all 10, topping MiniMax and ElevenLabs. That's like winning most academic subjects and also taking gold in the voice "look-alike" contest.
- Cross-lingual (voice transfer): New SOTA in tough cases like zh→ko: error rate drops ≈66% vs CosyVoice 3 (4.82 vs 14.4). Voice identity stays consistent across languages while content stays clear.
- Controllability (InstructTTSEval): The 12Hz-1.7B Voice Design model sets a new open-source high score (better DSD/RP than specialized systems), and Qwen3-TTS beats GPT-4o-mini-tts by large margins in target-speaker editing while remaining competitive with Gemini models.
- Target-speaker SFT: After fine-tuning on one voice (monolingual), Qwen3-TTS generalizes that identity to 10 languages and beats GPT-4o-Audio-Preview in 7/10 languages on WER.
- Long-form: The 25 Hz 1.7B variant shines here (lowest WER among baselines), suggesting the semantic-rich path can be extra stable over 10+ minutes, whereas some chunk-based systems show boundaries or drift.
Speed and Scaling:
- First-Packet Latency: ≈97 ms (12Hz-0.6B), ≈101 ms (12Hz-1.7B) end-to-end. That's faster than a blink.
- Under concurrency, the 12.5 Hz causal decoder keeps the real-time factor (RTF, synthesis time divided by audio duration) low and scales well, making it practical for live services.
Surprising Findings:
- 12.5 Hz wins on content accuracy (WER) in many settings, likely because coarser frames simplify long-horizon prediction.
- 25 Hz can be more stable for very long passages; semantic emphasis may help resist drift.
- A single SFT voice transfers well to unseen languages: strong cross-lingual generalization with limited fine-tuning.
Tokenizer Benchmarks:
- 25 Hz Stage 1 matches or beats S3 tokenizers on ASR WER; Stage 2 trades a slight ASR drop for richer acoustics, useful for TTS quality.
- 12 Hz tokenizer sets new SOTA on reconstruction (PESQ_WB/NB, STOI, UTMOS, SIM) at 12.5 Hz, balancing quality and bitrate exceptionally well.
05 Discussion & Limitations
Limitations:
- Language and dialect coverage: Strong across 10+ languages, but not exhaustive; rare dialects or code-switching nuances may still trip the model.
- Extreme styles: Singing, shouting, or highly theatrical prosody may require specialized training.
- Latency trade-offs: The 25 Hz path needs limited look-ahead (DiT + vocoder), so it won't be as snappy as 12.5 Hz in first audio.
- Safety and identity: Voice cloning requires careful consent and watermarking policies beyond the model itself.
Required Resources:
- Training scale: 5M+ hours and multi-stage post-training, typically demanding large compute clusters.
- Serving: Low-latency inference stack (vLLM-like engine, CUDA Graphs, torch.compile), fast storage, and GPUs provisioned for the number of concurrent users at scale.
- Data curation: High-quality pipelines for continual pretraining (CPT) and long-context upsampling.
When NOT to Use:
- Ultra-precise lip-sync for dubbing without alignment tools: needs dedicated timing alignment.
- Singing synthesis or music generation: different modeling targets.
- Very low-resource or unseen languages/dialects without fine-tuning: quality may vary.
Open Questions:
- Fairness and accent bias: How evenly does quality hold across accents and dialects not seen often in training?
- Finer control knobs: Can we expose interpretable sliders (speed, brightness, breathiness) that are guaranteed and disentangled?
- Robustness: How to further reduce rare long-form drift and improve code-switching?
- Safety: Built-in watermarking or speaker-consent checks at the model level.
- Efficiency: Can we keep 25 Hz expressivity but shave down look-ahead even more?
06 Conclusion & Future Work
Three-Sentence Summary: Qwen3-TTS unifies high-quality, controllable, multilingual, and streaming text-to-speech by pairing a dual-track language model with two complementary tokenizers (25 Hz expressive and 12.5 Hz ultra-fast). It achieves state-of-the-art results in zero-shot cloning, cross-lingual transfer, instruction following, and long-form stability, while delivering ≈100 ms first-audio latency. Open-sourced under Apache 2.0, it's practical for real-time assistants, content creation, and accessibility.
Main Achievement: Showing that you don't have to choose between speed, control, and quality: dual-track prediction plus carefully designed tokenizers and post-training alignment make them work together at scale.
Future Directions: Expand language coverage and dialect depth, add more granular and disentangled style controls, further cut 25 Hz latency, explore watermarking/consent flows, and generalize to broader audio generation tasks beyond TTS.
Why Remember This: Qwen3-TTS demonstrates that instant, natural, multilingual speech with fine control is now feasible and open, setting a blueprint for next-generation, voice-first human–computer interaction.
Practical Applications
- Live voice assistants that start speaking within ~100 ms and adapt tone on demand.
- Customer support bots that keep a consistent brand voice across languages.
- Audiobook and podcast generation with long-form stability and expressive narration.
- Language learning apps that read examples in the same friendly voice in multiple languages.
- Game NPCs that react instantly with controllable emotion and keep the same identity across quests.
- Accessible screen readers that can match a preferred reading style and tempo.
- Rapid voice previews for creators: type a style instruction and hear it immediately.
- Cross-lingual dubbing prototypes that keep a speaker's identity when changing languages (with alignment tools).
- Voice A/B testing in product teams using DPO/GSPO-aligned styles to pick the most engaging delivery.
- On-device or edge-accelerated TTS for kiosks and robots needing low-latency responses.