Qwen3-ASR Technical Report
Key Summary
- Qwen3-ASR is a family of speech models that hear, understand, and write down speech in 52 languages and dialects, and can also tell you when each word was spoken.
- There are two all-in-one recognizers (1.7B and 0.6B parameters) and a new non-autoregressive forced aligner that predicts timestamps quickly and accurately in 11 languages.
- The models are built on the Qwen3-Omni foundation and use a special AuT audio encoder with dynamic attention windows for both streaming and long-form audio.
- Training happens in four stages (AuT pretraining, Omni pretraining, supervised ASR finetuning, and RL with GSPO) to boost accuracy, stability, and noise robustness.
- On public and internal tests, Qwen3-ASR-1.7B hits state-of-the-art accuracy among open-source models and competes closely with top commercial APIs.
- The lightweight Qwen3-ASR-0.6B is very fast (as low as 92 ms time-to-first-token) and can process up to 2,000 seconds of audio per second at high concurrency.
- The Qwen3-ForcedAligner-0.6B reduces timestamp shift by about 67%-77% versus strong baselines and works on long utterances and cross-lingual audio.
- The models handle noisy audio, accents, children's and elderly speech, and even singing and full songs with background music.
- Language identification is built in, outperforming Whisper-large-v3 on several multilingual sets while keeping recognition accurate.
- Everything is open-sourced (Apache 2.0) with an inference and finetuning toolkit to speed up community research and deployment.
Why This Research Matters
Better ASR and timestamps make videos, meetings, classes, and podcasts more accessible and searchable for everyone, including people who are deaf or hard of hearing. Multilingual and dialect support means global users get accurate captions and voice control in their own speech style. High speed and efficiency reduce cloud costs and enable low-latency experiences like live captions and real-time translation. Robustness to noise, accents, kids' and elderly voices, and even singing unlocks new media and music applications. Open-sourcing under Apache 2.0 lets startups, schools, and researchers build on strong models without licensing barriers.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how in movies the subtitles need to pop up at the right time so you can follow along? And voice assistants need to understand you even if the TV is loud or you have an accent?
The Concept: Automatic Speech Recognition (ASR) is how computers turn speech into text. How it works (big picture):
- Listen to the audio waves.
- Turn sound into features the computer can compare.
- Use a language brain to guess the most likely words.
Why it matters: Without ASR, captions, voice search, and smart assistants wouldn't understand us, especially in real-world noise.
Anchor: When you say "Set a timer for five minutes," ASR is what writes those words inside your phone so it can act.
The World Before:
- Traditional end-to-end ASR (like Transducer or Listen-Attend-Spell) was great on clean, short speech and single languages, but it could stumble on long talks, noisy cafés, heavy accents, and many dialects. It also usually needed extra tools for timestamps (the "when" of each word).
- Getting timestamps was often done afterward with add-ons like CTC or CIF. This worked, but could be finicky, slower, and language-by-language.
Hook: Imagine you're taking notes during a very long class while a band practices next door and students speak with many accents.
The Concept: Large Audio-Language Models (LALMs) are ASR models with a big language brain attached. How it works:
- An audio encoder builds a high-level understanding of the sound.
- A large language model (LLM) uses world knowledge and context to choose the right words.
Why it matters: Without the language brain, models might mishear names, places, or long sentences, and break down in noisy or long recordings.
Anchor: When someone sings "Let it be," a LALM knows it's likely the Beatles lyric, not "Let it pea."
The Problem:
- Real life is messy: background music, overlapping speakers, long podcasts, kids and elderly voices, dialects, code-switching, and the need for accurate timestamps for subtitles.
- Benchmarks on clean test sets started to level off, so many models looked similar on paper, but in the wild they could still differ a lot.
Failed Attempts:
- Purely acoustic, bottom-up recognizers struggled to keep context over long audio and to recognize names or world facts.
- Post-processing timestampers added latency and complexity, and often needed language-specific recipes.
Hook: Imagine packing one suitcase that has separate pouches for every item, so you don't juggle many bags.
The Concept: An all-in-one multilingual ASR with built-in language ID and robust timestamping. How it works:
- One model detects the language, transcribes it, and can provide timestamps.
- It uses a strong audio encoder plus a knowledgeable LLM.
Why it matters: Without a unified model, you'd chain multiple tools, risking errors, delays, and limited language coverage.
Anchor: You upload a long song or a noisy chat in Cantonese, and the same system outputs the right text, language tag, and precise word times.
The Gap:
- The field lacked a single, open model family that: (a) does strong multilingual ASR (including many dialects), (b) handles long and streaming audio, (c) nails singing and music-mixed audio, and (d) produces accurate timestamps quickly in multiple languages.
Real Stakes (Why care?):
- Accessibility: Better live captions help people who are deaf or hard of hearing, across many languages.
- Education and media: Fast, accurate transcripts and timestamps make lectures, podcasts, and videos easy to search and subtitle.
- Customer support and safety: Call centers and moderation tools can understand varied accents and noisy environments.
- Creativity: Lyrics transcription and music-mixed speech recognition enable new music tech and karaoke experiences.
- On-device speed and privacy: Smaller, efficient models reduce costs and can run locally, protecting user data.
Enter Qwen3-ASR:
- Two recognizers (1.7B and 0.6B) cover 30 languages and 22 Chinese dialects with strong accuracy, plus streaming and long-form support.
- A novel, lightweight, non-autoregressive forced aligner predicts timestamps at word/character/sentence levels in 11 languages without slow token-by-token decoding.
- Extensive internal tests reveal differences missed by standard benchmarks, proving real-world gains.
- Open-sourced weights and toolkit under Apache 2.0 help everyone build better speech apps.
02 Core Idea
Hook: Think of a pro sports team: a speedy scout hears the game (audio encoder), a smart coach understands strategy (LLM), and a timekeeper marks exactly when plays happen (forced aligner). When they work together, the team wins.
The Concept (Aha! in one sentence): Combine a strong audio encoder with a powerful language model and a new slot-filling, non-autoregressive timestamp head to build one multilingual system that is accurate, fast, robust, and time-aware. How it works:
- AuT encoder turns audio into meaningful features with dynamic attention windows.
- Qwen3-Omni language model reasons about words, entities, and context.
- Supervised finetuning reshapes the model into a focused ASR tool, with built-in language ID.
- RL (GSPO) further boosts stability in noise and tricky cases.
- A separate forced aligner uses slot-filling to predict all timestamps at once (NAR), making it fast and accurate.
Why it matters: Without this combo, you either get models that are accurate but slow, fast but brittle, or multilingual but weak on timing; this design aims to give you all three: accuracy, speed, and timestamps.
Anchor: Upload a 10-minute bilingual podcast with street noise; the model outputs the right language tags, clean text, and word times in one go.
Analogy 1 (Studio Mixer): The AuT encoder is the sound engineer cleaning and structuring the audio tracks; the LLM is the producer choosing the right takes and lyrics; the forced aligner is the timestamped track list.
Analogy 2 (Detective Team): One detective catalogs clues (features), another connects them into a story (LLM), and the clerk records when each clue appeared (timestamps); case solved.
Analogy 3 (Assembly Line): The first station shapes raw materials (audio frames), the second assembles them into a product (sentences), and the last stamps the time code on every piece (aligner) for shipping (subtitles).
Before vs After:
- Before: Separate tools for recognition, language ID, and timestamps; uneven performance across languages; long-form and singing were brittle.
- After: One family handles 30 languages and many dialects, streams or processes long audio, and aligns timestamps in 11 languages with fewer errors and lower latency.
Hook: Imagine reading a long book with a built-in table of contents that updates itself as you read.
The Concept: Dynamic attention windows let the model slide between short chunks (streaming) and long context (offline) without retraining. How it works:
- Use small windows for live streaming.
- Expand to larger windows for long recordings.
- Keep information flowing so the story (context) stays clear.
Why it matters: Without dynamic windows, streaming would lag or long audio would lose context.
Anchor: Live captions for a meeting (small windows) and full meeting minutes later (big windows) come from the same model.
Why It Works (intuition, no equations):
- The encoder lowers the audio speed to manageable steps and highlights what matters.
- The language model brings world knowledge for names, places, and tricky words.
- Finetuning narrows the model's job to "just ASR," avoiding instruction distractions.
- RL nudges the model to be steadier in real-world chaos.
- The NAR slot-filling aligner uses the entire context to guess all timestamps at once, cutting delay and error accumulation.
Building Blocks (mini-ideas with sandwiches):
- Hook: You know how radios compress sound to make it easier to send?
The Concept: AuT encoder downsampling turns 100 tiny audio steps into about 12.5 steps per second, keeping meaning while reducing length.
How: Convolutions + attention; output features at 12.5 Hz.
Why: Without downsampling, decoding would be slow and memory-heavy.
Anchor: A 2-minute clip becomes a compact feature sequence the LLM can handle quickly.
- Hook: Picking the right plug to connect two gadgets.
The Concept: A projector maps audio features into the LLM's language space.
How: A small neural layer reshapes feature size.
Why: Without it, the LLM can't "speak audio."
Anchor: The projector is the adapter that lets the audio brain talk to the language brain.
- Hook: Dressing for one job instead of every job.
The Concept: Supervised ASR finetuning (SFT) trains the model to output only ASR responses in a strict format with language tags.
How: Feed diverse ASR data, stream-style data, and context-bias examples.
Why: Without SFT, the model might follow random instructions or drift off-task.
Anchor: Prompts like "language English<asr_text> …" lead to clean transcripts, not chit-chat.
- Hook: Practicing with a coach who gives targeted drills.
The Concept: RL with GSPO improves noise robustness and stability.
How: Reward better transcripts on hard, noisy, multilingual cases.
Why: Without RL, the model could wobble on tough audio.
Anchor: After RL, tongue-twisters in traffic get fewer mistakes.
- Hook: Filling blanks in a worksheet all at once.
The Concept: Non-autoregressive (NAR) slot-filling forced alignment predicts all timestamp slots simultaneously.
How: Replace timestamps with [time] tokens and predict indices for each slot together.
Why: Without NAR, word-by-word timing is slower and error can snowball.
Anchor: A page of lyrics gets every word's time in one pass.
03 Methodology
High-Level Recipe: Input audio → AuT encoder (downsample + features) → projector → Qwen3-Omni LLM (reason about words, entities) → ASR text with language tag. For timestamps, a sister model (ForcedAligner) takes audio + transcript with [time] slots → predicts all time indices at once.
Step 1: Audio Features with AuT encoder
- What happens: The raw waveform becomes log-mel filterbank (FBank) features at ~100 frames/sec, then the AuT attention-encoder-decoder downsamples 8× to ~12.5 frames/sec and builds contextual audio embeddings.
- Why this step exists: It compacts long audio and highlights speech patterns so the LLM doesn't drown in too many frames.
- Example: A 60-second clip with ~6,000 frames becomes ~750 high-quality steps, easier to process (see the quick check below).
Hook: Like summarizing each minute of a lecture into a few key sentences.
The Concept: Downsampling keeps the gist while shrinking length.
How: Convolutions + attention pool frames; dynamic attention windows adjust from ~1 s to ~8 s.
Why: Without it, memory and speed would break for long or many concurrent audios.
Anchor: Turning a 20-minute meeting into a compact feature movie the LLM can watch quickly.
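To make the frame math concrete, here is a tiny sketch (assuming 100 FBank frames per second and the 8× downsampling described above; the exact framing inside Qwen3-ASR may differ slightly):

```python
# Back-of-the-envelope check of the frame counts quoted in Step 1.
def encoder_frames(duration_sec: float, fbank_rate: int = 100, downsample: int = 8) -> int:
    """Number of encoder output frames for a clip of the given duration."""
    fbank_frames = int(duration_sec * fbank_rate)   # e.g. 60 s -> 6,000 FBank frames
    return fbank_frames // downsample               # 8x reduction -> ~12.5 frames/sec

print(encoder_frames(60))        # ~750 encoder steps for a 1-minute clip
print(encoder_frames(20 * 60))   # ~15,000 steps for a 20-minute meeting
```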
Step 2: Projector bridges audio and language
- What happens: A small neural layer reshapes the audio features to the LLM's token space.
- Why: The LLM expects inputs in its own dimension; this is the adapter.
- Example: 896/1024-dim audio features → LLM hidden size.
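As a toy illustration of the projector idea, a single linear layer is enough to show the shape change; the 1024 → 2048 dimensions below are placeholders, not the exact Qwen3-ASR sizes:

```python
# Minimal sketch of a projector: map encoder features into the LLM's hidden size.
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    def __init__(self, audio_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)   # the "adapter" between the two brains

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, num_frames, audio_dim) at ~12.5 frames/sec
        return self.proj(audio_feats)               # (batch, num_frames, llm_dim)

feats = torch.randn(1, 750, 1024)                   # roughly 60 s of encoder output
print(AudioProjector()(feats).shape)                # torch.Size([1, 750, 2048])
```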
Step 3: Qwen3-Omni language model decodes words
- What happens: The LLM uses context and world knowledge to pick the best sequence of words.
- Why: Pure acoustics confuse rare names or long-range grammar; the LLM resolves ambiguity.
- Example: Deciding between "Paris" and "pairs" by context: "What is the capital of France?" → "Paris."
Hook: Reading a mystery and focusing on important clues.
The Concept: Attention in the LLM scores which audio/text parts matter now.
How: Cross-attention looks from text tokens to audio features; self-attention keeps language flow.
Why: Without attention, the model treats "the" and "explosion" the same.
Anchor: "What's the capital of France?" → attention favors "capital" and "France," outputs "Paris."
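Here is a toy scaled-dot-product attention sketch (generic attention math, not Qwen3-ASR's actual layers) showing how a text-side query scores audio frames and mixes them by relevance:

```python
# Toy cross-attention: one text query attends over a sequence of audio features.
import numpy as np

def cross_attention(query, keys, values):
    scores = query @ keys.T / np.sqrt(keys.shape[-1])   # how relevant is each audio frame?
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over audio frames
    return weights @ values, weights                    # weighted mix + the attention map

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(750, 64))   # ~60 s of encoder output (toy dimensions)
text_query = rng.normal(size=(64,))        # the token being decoded right now
context, weights = cross_attention(text_query, audio_feats, audio_feats)
print(context.shape, weights.argmax())     # mixed audio context and the most-attended frame
```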
Step 4: Output style and language identification (LID)
- What happens: The model first emits a language tag, then the transcript; or outputs None if no speech.
- Why: LID prevents wrong alphabets or tokenizers and chooses the right vocabulary.
- Example: "language English<asr_text>Today we release…" vs "language None<asr_text>".
Hook: Hearing someone say "bonjour" and instantly knowing it's French.
The Concept: Language Identification tells which language is spoken.
How: The model scores language candidates using audio cues and context.
Why: Without LID, it might transcribe French with English spellings.
Anchor: A clip starts with "gracias" → "language Spanish<asr_text>…"
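A small sketch of consuming this output style; the tag layout follows the examples above, so treat it as illustrative rather than an official parser:

```python
# Split a "language X<asr_text>..." response into (language, transcript).
def parse_asr_output(output: str) -> tuple[str | None, str]:
    head, _, transcript = output.partition("<asr_text>")
    language = head.removeprefix("language ").strip()
    if language == "None":          # no speech detected
        return None, ""
    return language, transcript

print(parse_asr_output("language English<asr_text>Today we release Qwen3-ASR."))
print(parse_asr_output("language None<asr_text>"))
```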
Step 5: Training pipeline (four stages)
A) AuT pretraining
- What: Train the audio encoder on ~40M hours of pseudo-labeled speech, mostly Chinese and English.
- Why: Build a general, stable encoder across window sizes.
- Example: After this, the encoder knows common sounds, background noise, and timing patterns.
B) Omni pretraining
- What: Qwen3-Omni learns across audio, vision, and text with ~3T tokens.
- Why: Give the LLM strong world and multimodal understanding to inform ASR decisions.
- Example: Recognizing "Bach" is a name in music contexts.
C) ASR Supervised Finetuning (SFT)
- What: Train on curated multilingual ASR data plus streaming, non-speech, and context-biasing data to output strict ASR format only.
- Why: Make a focused ASR that ignores chit-chat prompts and resists instruction injection.
- Example: System prompt can include custom vocabulary (like product names) to bias results.
D) ASR Reinforcement Learning (RL) with GSPO
- What: Reward better transcripts for hard, noisy, multilingual, and functional cases (~50k utterances).
- Why: Improve robustness, stability, and handling of difficult audio.
- Example: Fewer hesitations and hallucinations when two speakers overlap.
Hook: Practicing tongue-twisters makes you clearer when speaking fast.
The Concept: RL polishing teaches the model to be steady under stress.
How: Compare outputs, assign rewards, update policies (GSPO).
Why: Without RL, the model may degrade in the wild.
Anchor: After RL, "She sells seashells" in a windy park gets fewer mistakes.
Step 6: Streaming and long-form with dynamic attention
- What: Use shorter windows for live captions and larger windows for full recordings.
- Why: One model serves both use cases.
- Example: 2-second chunk streaming vs 20-minute podcast offline.
Hook: Zooming your camera in for details, out for the big picture.
The Concept: Dynamic attention windows switch context size on the fly.
How: The model changes its attention span between 1 and 8 seconds per chunk.
Why: Fixed windows would either lag (if too big) or miss context (if too small).
Anchor: Live transcribe a meeting now; summarize the whole meeting later with the same model.
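A minimal sketch of feeding audio in 2-second chunks with a short look-back, assuming 16 kHz mono audio; the chunk and look-back sizes here are illustrative, not the official serving defaults:

```python
# Yield (look-back context, new chunk) pairs for incremental decoding.
import numpy as np

def stream_chunks(waveform: np.ndarray, sr: int = 16000,
                  chunk_sec: float = 2.0, lookback_sec: float = 4.0):
    chunk, lookback = int(chunk_sec * sr), int(lookback_sec * sr)
    for start in range(0, len(waveform), chunk):
        context = waveform[max(0, start - lookback):start]   # recent audio the model may re-attend to
        yield context, waveform[start:start + chunk]

audio = np.zeros(16000 * 7)   # 7 seconds of silence as a stand-in
for i, (ctx, new) in enumerate(stream_chunks(audio)):
    print(f"chunk {i}: {len(ctx)/16000:.1f}s look-back + {len(new)/16000:.1f}s new audio")
```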
Step 7: Forced alignment with NAR slot-filling
- What happens: Prepare the transcript with [time] tokens after words/characters; the model predicts all time indices simultaneously (non-autoregressive). Multiply by 80 ms to get times.
- Why: Faster and more consistent than word-by-word generation; language-agnostic.
- Example: "Hello [time][time] world [time][time]" → start/end times for "Hello" and "world" (sketched in code below).
Hook: Filling all blanks on a worksheet at once instead of one per minute.
The Concept: Non-autoregressive (NAR) timestamping.
How: The aligner sees the whole sentence and outputs times for every slot together.
Why: Without NAR, time predictions are slower and can drift.
Anchor: A 1-minute news clip gets every word's time in a single pass.
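Putting the slot-filling recipe into a tiny sketch: wrap each word with [time] slots, then turn predicted frame indices into seconds (index × 80 ms). The indices below are made up for illustration; the real aligner predicts them from the audio:

```python
FRAME_SEC = 0.080  # 80 ms per frame, as described above

def build_template(words: list[str]) -> str:
    # one start slot and one end slot per word
    return " ".join(f"{w} [time][time]" for w in words)

def slots_to_spans(words: list[str], frame_indices: list[int]) -> list[tuple[str, float, float]]:
    spans = []
    for i, w in enumerate(words):
        start, end = frame_indices[2 * i], frame_indices[2 * i + 1]
        spans.append((w, round(start * FRAME_SEC, 3), round(end * FRAME_SEC, 3)))
    return spans

words = ["Hello", "world"]
print(build_template(words))                  # "Hello [time][time] world [time][time]"
print(slots_to_spans(words, [3, 9, 11, 17]))  # [('Hello', 0.24, 0.72), ('world', 0.88, 1.36)]
```

Because every slot is filled in the same pass, the cost per word stays flat instead of growing with the length of the transcript.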
Secret Sauce (what's clever):
- Tight combo of a robust, downsampling audio encoder and a knowledgeable LLM.
- Dynamic attention unifies streaming and long-form.
- SFT locks output to ASR style and uses context bias.
- RL (GSPO) specifically targets noise and tricky cases.
- NAR slot-filling aligner avoids language-specific phonemes or dictionaries and stays fast and stable on long audio.
04 Experiments & Results
The Test: The team measured recognition accuracy (WER/CER), language ID accuracy, timing accuracy (AAS shift), and speed (time-to-first-token and real-time factor). They tested on public English/Chinese sets, a tough internal robustness suite (accents, dialects, noise, kids/elderly, tongue-twisters), broad multilingual benchmarks, singing voice and full songs with background music, streaming vs offline modes, and forced alignment.
The Competition: Proprietary APIs (GPT-4o-Transcribe, Gemini-2.5-Pro, Doubao-ASR) and open models (Whisper-large-v3, FunASR-MLT-Nano, GLM-ASR-Nano) formed strong baselines.
Scoreboard with context:
- English & Chinese: Qwen3-ASR-1.7B consistently ranked at or near the top, often beating open-source baselines and staying competitive with commercial APIs. On diverse, real-world English data, it shined beyond "clean" audiobooks. On Mandarin (including big, noisy sets like WenetSpeech), it held a clear edge.
- Dialects: Across Cantonese and 22 Chinese dialects, Qwen3-ASR stayed strong, even for long utterances, showing it handles pronunciation and word variation without per-dialect tuning.
- Internal robustness: In accented English across 16 accents, Qwen3-ASR achieved the lowest errors among all systems; for Mandarin challenges (elderly/kids, extreme noise, tongue-twisters, dialogs), 1.7B led, with 0.6B close behind, which is evidence of real-world toughness.
- Multilingual: On MLS, Common Voice, and MLC-SLM, 1.7B outperformed Whisper-large-v3 overall; on Fleurs, it led on the 12- and 20-language subsets but slipped on the full 30-language set, showing room to grow on long-tail languages. Still, 1.7B > 0.6B, showing scale helps.
- Language ID: Both Qwen3-ASR models beat Whisper-large-v3 on LID accuracy across multiple multilingual sets; most remaining errors were between very similar languages (Malay vs Indonesian).
- Singing & songs with BGM: On singing-only sets (M4Singer, MIR-1k-vocal, Popcs), Qwen3-ASR-1.7B was best or near-best; on Opencpop it placed a close second. For long songs with music, it dramatically outperformed open baselines and was competitive with top APIs, a big win for music-mixed audio.
- Streaming vs offline: Using 2-second chunks and mild look-back, the streaming mode kept accuracy close to offline for both sizes, enabling one model for live captions and full transcripts.
- Speed: Qwen3-ASR-0.6B achieved a time-to-first-token as low as 92 ms and, at concurrency 128, processed ~2,000 seconds of audio per second (RTF ≈ 0.064). Translation: it's fast enough for large deployments and snappy user experiences (see the quick arithmetic after this list).
- Forced alignment (timestamps): The Qwen3-ForcedAligner-0.6B cut accumulated average shift by roughly 67%-77% compared to strong baselines, stayed accurate on long (up to 300 s) and cross-lingual audio, and ran extremely fast (RTF near 0.001 at high concurrency). Despite being trained on MFA pseudo-labels, it generalized well to human-labeled sets.
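One consistent way to read those speed numbers, assuming the reported real-time factor is measured per stream (an assumption; the measurement setup is not restated in this summary):

```python
# 128 concurrent streams sharing ~2,000 seconds of audio processed per wall-clock second
# works out to about 0.064 seconds of compute per second of audio for each stream.
concurrency = 128
audio_seconds_per_second = 2000
print(concurrency / audio_seconds_per_second)  # 0.064, matching the reported RTF
```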
Hook: Getting a report card where A's appear not just in math but also in art, sports, and science.
The Concept: Balanced excellence: accuracy, speed, multilingual breadth, robustness, and timing.
How: Careful pretraining, focused finetuning, RL polishing, dynamic attention, and NAR timestamping.
Why it matters: Without balance, a system might ace one test (clean English) but fail in real life (noisy, long, multilingual).
Anchor: A call center with many accents, a karaoke app with full songs, and a lecture recorder all perform well using the same family.
Surprises and insights:
- Internal evaluations revealed bigger gaps than public benchmarks suggested; real-world stress tests matter.
- The NAR aligner, trained on noisy pseudo-labels, still beat baselines on human-labeled data, showing robust label distillation.
- Model scaling (0.6B → 1.7B) reliably boosted multilingual and tough-case performance, though the 0.6B remained the speed champion.
05 Discussion & Limitations
Limitations (honest view):
- Long-tail languages: On the largest Fleurs set (30 languages), accuracy dipped versus Whisper-large-v3, suggesting more work is needed for rarer languages and diverse scripts.
- Timestamp supervision: The aligner learns from MFA pseudo-labels that contain noise; while it improves over them, niche phonetic edge cases may still show small shifts.
- Length caps: ASR is designed for single audios up to ~20 minutes, and the aligner up to ~300 seconds; very long audios may need chunking or post-merging (a minimal chunking sketch follows this list).
- Similar-language LID confusions: Pairs like Malay vs Indonesian can still trip the system.
- Resources: Best speed uses GPUs (bfloat16, CUDA Graphs, FlashAttention, vLLM for ASR). On tiny devices, you may need to trade speed or accuracy.
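For the length caps above, a minimal chunking sketch; the ~20-minute cap comes from this section, while the 5-second overlap is an illustrative choice, and the resulting pieces would still need transcript post-merging:

```python
# Compute (start, end) times, in seconds, for <=20-minute chunks with a small overlap.
def chunk_bounds(total_sec: float, max_sec: float = 20 * 60, overlap_sec: float = 5.0):
    step = max_sec - overlap_sec
    bounds, start = [], 0.0
    while start < total_sec:
        bounds.append((start, min(start + max_sec, total_sec)))
        start += step
    return bounds

for start, end in chunk_bounds(3 * 3600):   # a 3-hour recording
    print(f"{start/60:6.1f} - {end/60:6.1f} min")
```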
Required resources:
- For production ASR: A GPU server with vLLM (for batching and async serving), bfloat16 inference, and sufficient VRAM for 0.6B or 1.7B models.
- For forced alignment: PyTorch inference with FlashAttention; CPU-only is possible but slower.
- Data considerations: Context biasing works best with well-curated prompts; streaming accuracy benefits from careful chunk and fallback settings.
When NOT to use:
- Ultra-low-resource on-device scenarios needing offline, always-on ASR with minimal compute: consider the 0.6B with optimization, or even smaller specialized models.
- Extreme long-tail languages or code systems not in the current coverage: accuracy may lag.
- Ultra-long unsegmented audio (multi-hour) without a segmentation pipeline.
Open questions:
- How to push long-tail and low-resource languages without exploding training cost? (Active learning, synthetic data, multilingual lexicon prompts.)
- Can we add speaker diarization and overlap-aware decoding natively?
- How to further cut latency while keeping accuracy (speculative decoding, better streaming chunk policies)?
- Can timestamp resolution go finer than 80 ms frames while staying fast and stable?
- How to robustly reduce LID confusions among very similar languages?
Hook: Like a great car that still needs better mileage and more charging stations.
The Concept: Strong today, but room to improve breadth, efficiency, and ultra-long workflows.
How: Smarter data, better streaming, tighter diarization and overlap handling, and finer timing.
Why it matters: Without continued upgrades, edge cases remain hard.
Anchor: Future versions could seamlessly handle a 3-hour multilingual debate with perfect per-speaker, per-word timestamps.
06 Conclusion & Future Work
Three-sentence summary:
- Qwen3-ASR brings two all-in-one ASR models and a fast, accurate, multilingual forced aligner that together deliver strong recognition, built-in language ID, and precise timestamps.
- By pairing a robust AuT encoder with the Qwen3-Omni language model, then refining via SFT and RL, the family excels on noisy, accented, long-form, and even singing audio across many languages and dialects.
- The non-autoregressive slot-filling aligner reduces timestamp errors dramatically, and all models are open-sourced with a practical serving and finetuning toolkit.
Main achievement:
- A unified, open, multilingual speech stack that is accurate (SOTA among open models), fast (low TTFT, high throughput), robust (noise/dialects/singing), and time-aware (NAR aligner) in real-world conditions.
Future directions:
- Broaden long-tail language coverage, reduce LID confusions, push even faster streaming, extend maximum lengths, and add native speaker/overlap awareness and finer time resolution.
Why remember this:
- It shows how combining a strong audio encoder, a capable language model, focused finetuning, RL polishing, and a clever NAR timestamp head can turn messy real-world speech, even songs, into readable, time-stamped text quickly and accurately, all in one open toolbox.
Practical Applications
- Live multilingual captions for meetings, classrooms, and events with low latency.
- Auto-subtitle generation with precise word- or sentence-level timestamps for video platforms.
- Quickly creating searchable transcripts of podcasts, lectures, and customer calls.
- On-device or edge ASR for kiosks, cars, and wearables using the 0.6B model for privacy and speed.
- Karaoke and music apps that transcribe lyrics and align them to the song, even with background music.
- Customer-support analytics across many accents and dialects, with entity-aware transcripts.
- Media production workflows: rough-cut editing by clicking on transcript words linked to audio timestamps.
- Language-learning apps that show exactly when each word is spoken and track pronunciation.
- Compliance and safety tools that monitor spoken content accurately in noisy, multilingual environments.
- Rapid dataset labeling: aligning large audio-text corpora using the NAR forced aligner.