
VIBEVOICE-ASR Technical Report

Beginner
Zhiliang Peng, Jianwei Yu, Yaoyao Chang et al. · 1/26/2026
arXiv · PDF

Key Summary

  • VIBEVOICE-ASR is a single-pass system that listens to up to 60 minutes of audio at once and outputs who spoke, when they spoke, and what they said in one stream.
  • It fixes the common long-audio problem where chopping the sound into tiny pieces makes the model forget the bigger story (context fragmentation).
  • It combines three jobs—speech-to-text (ASR), who-is-speaking (diarization), and timing (timestamps)—into a unified, end-to-end generation task.
  • A dual audio tokenizer shrinks sound into about 7.5 tokens per second, so one hour fits inside a modern language model’s window (~27k tokens).
  • You can add helpful hints (prompt-based context) like names, product terms, or acronyms to boost accuracy on tricky words and code-switching.
  • On five public multi-speaker benchmarks, it beats strong closed-source baselines (Gemini-2.5/3-Pro) on who-when-what accuracy, especially tcpWER and DER.
  • The model supports 50+ languages without needing a manual language setting and handles code-switching within and across sentences.
  • A careful training recipe mixes real data, refined long-form transcripts, synthetic context-rich dialogues, and music robustness data.
  • Limitations include weaker performance on low-resource languages after fine-tuning and missing information during overlapping speech.
  • Microsoft Research is open-sourcing model weights and pipelines to help the community adapt it to more languages and uses.

Why This Research Matters

Long meetings, classes, and podcasts are hard to skim, search, and trust if transcripts mix up who spoke or when things happened. VIBEVOICE-ASR keeps the entire hour in memory and writes a neatly tagged script, so teams can jump straight to decisions, quotes, or to-dos. It supports many languages and handles code-switching, which helps global companies, schools, and communities. Domain prompts let hospitals, law firms, and engineers get tough terms right without building custom models. Non-speech tags reduce hallucinations, making captions and summaries more reliable. Open-sourcing the model and pipelines invites the community to adapt it for low-resource languages and new use cases.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re trying to write down everything that happens during a one-hour class—who asked each question, when they asked it, and the exact words they said. If you only listen to 30 seconds at a time, you’ll keep losing the big picture and mix up who said what.

🥬 Filling (The Actual Concept: Automatic Speech Recognition, Speaker Diarization, and Timestamping, plus the problems they face)

🍞🍞 New Concept 1: Automatic Speech Recognition (ASR) 🍞 Hook: You know how a friend writes notes while you talk so they don’t forget? That’s like a talking-to-writing machine. 🥬 The Concept: ASR turns speech into text.

  • How it works:
    1. Listen to sound waves.
    2. Recognize speech patterns and map them to words.
    3. Type out the words in order.
  • Why it matters: Without ASR, we only have sounds, not searchable, editable text. 🍞 Anchor: When you say ‘Set a timer for ten minutes,’ ASR is what makes the assistant write ‘set a timer for 10 minutes’ so it can act.
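
To make ASR concrete, here is a minimal sketch using the open-source openai-whisper package (the same model family the paper's data pipeline uses for transcription); the audio path is a placeholder, and this is an illustrative sketch rather than anything specific to VIBEVOICE-ASR.

```python
# Minimal ASR sketch with the open-source openai-whisper package (pip install openai-whisper).
# "lecture.wav" is a placeholder path; any speech recording will do.
import whisper

model = whisper.load_model("base")        # small general-purpose checkpoint
result = model.transcribe("lecture.wav")  # returns a dict; "text" holds the transcript
print(result["text"])                     # the speech, now searchable, editable text
```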

🍞🍞 New Concept 2: Speaker Diarization 🍞 Hook: Imagine a group photo where you label each face with a name—now you know who is who. 🥬 The Concept: Diarization figures out who is speaking at different times.

  • How it works:
    1. Listen for voice fingerprints (how each person’s voice sounds).
    2. Split the audio by who is talking.
    3. Tag each speech part with a speaker ID.
  • Why it matters: Without diarization, quotes get misattributed, and meeting minutes become confusing. 🍞 Anchor: In a podcast with three hosts, diarization separates Host A’s jokes from Host B’s facts and Host C’s questions.

🍞🍞 New Concept 3: Timestamping 🍞 Hook: Think of a sports scoreboard that shows exactly when each goal happens. 🥬 The Concept: Timestamping marks the start and end times of words or sentences.

  • How it works:
    1. Align pieces of sound with recognized words.
    2. Record exact times for when speech starts and stops.
    3. Keep the timings with the transcript.
  • Why it matters: Without timestamps, you can’t jump to the exact moment in a recording. 🍞 Anchor: If your teacher says ‘Pop quiz next Friday’ at 12:35, timestamping lets you skip straight to 12:35 later.
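
A tiny sketch of what timestamped transcript data could look like, and how a player would jump to a moment; the dictionary layout here is an illustrative assumption, not the paper's output format.

```python
# Illustrative timestamped transcript: each segment stores start/end seconds plus text.
segments = [
    {"start": 750.0, "end": 755.2, "text": "Pop quiz next Friday"},
    {"start": 755.2, "end": 762.0, "text": "Please review chapter three"},
]

def segment_at(segments, t_seconds):
    """Return the segment playing at a given time, so a player can jump straight to it."""
    for seg in segments:
        if seg["start"] <= t_seconds < seg["end"]:
            return seg
    return None

print(segment_at(segments, 12 * 60 + 35))  # 12:35 -> the 'Pop quiz next Friday' segment
```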

🍞🍞 New Concept 4: Context Fragmentation 🍞 Hook: Imagine reading a novel one paragraph per day, out of order—you’d lose the plot. 🥬 The Concept: Context fragmentation is when chopping audio into short clips makes the system forget long-range meaning.

  • How it works:
    1. The audio is split into tiny chunks.
    2. Each chunk is processed alone, without the rest of the story.
    3. Important clues across chunks get lost.
  • Why it matters: Without full context, homophones (like ‘pair’ vs ‘pear’) and references (‘she’ meaning who?) get mixed up. 🍞 Anchor: In a long meeting, ‘API’ vs ‘AP eye’ or ‘the budget’ vs ‘the gadget’ can be confused if earlier context is missing.

🍞🍞 New Concept 5: Multi-speaker Complexity 🍞 Hook: Imagine trying to follow three friends telling stories at the same time in a noisy cafeteria. 🥬 The Concept: Multi-speaker complexity is the challenge of tracking multiple voices, turn-taking, and interruptions.

  • How it works:
    1. Voices overlap or switch quickly.
    2. The model must detect changes fast and consistently.
    3. Small diarization mistakes snowball into wrong attributions.
  • Why it matters: Without handling this, meeting transcripts become a jumble that no one trusts. 🍞 Anchor: During a heated debate, knowing exactly who interrupted whom matters for fair notes.

The world before: Many systems handled long audio by chopping it into small pieces (under 30 seconds), running separate tools for ASR, speaker diarization, and timestamps, then stitching results together. This pipeline approach seemed practical, but two big things went wrong: the model lost the big story when segments were isolated (context fragmentation), and mistakes in one stage (like wrong speaker clusters) ruined later stages (error propagation). People tried smarter chunking, better alignment tools, and complex heuristics to re-merge speakers. But the more rules they added, the more brittle the system became, and it still couldn’t ‘remember’ an entire hour at once.

The missing ingredient: a way to fit the whole hour into a model’s head so it could understand everything together—who, when, and what—without gluing many separate parts. That’s what VIBEVOICE-ASR brings: a unified, single-pass model that listens once, keeps the global story, and writes a rich transcript in one go.

Why this matters to daily life: It affects how we record classes, summarize long meetings, caption podcasts, and search important moments in lectures. If the system mixes up speakers or times, it can affect decisions, grades, and teamwork. Getting who-when-what right saves time, reduces confusion, and builds trust in AI notes people actually use.

02 Core Idea

🍞 Top Bread (Hook): Imagine a super note-taker who can listen to an entire one-hour show at once, remember the whole story, and type out exactly who said what and when—without stopping.

🥬 Filling (The Actual Concept: What the paper’s ‘aha!’ is and how it changes things)

The ‘Aha!’ in one sentence: Compress the audio so an hour fits inside a language model’s context window, then generate a single, structured stream that interleaves who, when, and what—no chopped pieces, no fragile pipelines.

🍞🍞 New Concept 6: Rich Transcription 🍞 Hook: Think of a movie script that includes dialogue plus who’s speaking and stage directions. 🥬 The Concept: Rich transcription is a transcript that explicitly includes speaker IDs (who), timestamps (when), and words (what) in one sequence.

  • How it works:
    1. The model outputs tags like [Speaker 2, 10.3–33.33] followed by the spoken text.
    2. It keeps producing these tagged blocks in time order.
    3. Non-speech events (like [Music], [Silence]) can also be included to avoid hallucinations.
  • Why it matters: Without rich structure, you can’t search by speaker, jump by time, or trust who-when alignment. 🍞 Anchor: In meeting notes, you can filter ‘Show only Speaker 1 between 12:00 and 12:10’ to find an action item instantly.
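
As a quick illustration of why the structure is useful, here is a minimal Python sketch that parses lines in an assumed '[Speaker N, start–end] text' layout (modeled on the examples above; the model's exact output formatting may differ) and filters them by speaker and time window.

```python
# Minimal sketch: filter a rich transcript by speaker and time window.
# The "[Speaker N, start-end] text" line layout is assumed for illustration;
# VIBEVOICE-ASR's actual output formatting may differ.
import re

LINE = re.compile(r"\[Speaker (\d+), ([\d.]+)[-–]([\d.]+)\]\s*(.*)")

def parse(transcript):
    """Yield (speaker, start_sec, end_sec, text) for every tagged speech line."""
    for line in transcript.splitlines():
        m = LINE.match(line.strip())
        if m:
            yield int(m.group(1)), float(m.group(2)), float(m.group(3)), m.group(4)

def filter_segments(transcript, speaker, t_start, t_end):
    """Keep only segments from one speaker that overlap the [t_start, t_end] window."""
    return [seg for seg in parse(transcript)
            if seg[0] == speaker and seg[1] < t_end and seg[2] > t_start]

demo = """[Speaker 1, 0.0-10.25] Welcome to Vibe...
[Speaker 2, 10.3-33.33] Thanks for joining...
[Music]
[Speaker 1, 722.4-731.0] Action item: ship the draft by Friday."""

# 'Show only Speaker 1 between 12:00 and 12:10'
print(filter_segments(demo, speaker=1, t_start=12 * 60, t_end=12 * 60 + 10))
```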

🍞🍞 New Concept 7: Single-pass Processing 🍞 Hook: You know how it’s easier to understand a story when you read it in one sitting instead of tiny chunks? 🥬 The Concept: Single-pass processing means the model listens to the whole long audio one time and outputs everything in one go.

  • How it works:
    1. Feed the entire compressed audio into the model’s context.
    2. The model attends across the full hour for global clues.
    3. It generates a continuous, coherent transcript with who-when-what.
  • Why it matters: Without single-pass, information gets split; references and speaker consistency suffer. 🍞 Anchor: A one-hour class stays coherent, so ‘she’ clearly refers to the professor introduced 25 minutes earlier.

🍞🍞 New Concept 8: Dual-tokenizers 🍞 Hook: Picture using two different kinds of scissors—one for fabric (sound quality), one for paper (word meaning). 🥬 The Concept: Dual-tokenizers turn raw sound into two compact streams: acoustic tokens for sound fidelity and semantic tokens for language meaning.

  • How it works:
    1. The Acoustic Tokenizer hugely downsamples the waveform to about 7.5 tokens per second.
    2. The Semantic Tokenizer extracts features aligned with words and grammar.
    3. The two are fused as continuous audio embeddings.
  • Why it matters: Without this compression, an hour wouldn’t fit; without semantic cues, words would be less accurate. 🍞 Anchor: One hour becomes ~27,000 tokens (3600 × 7.5), which modern language models can handle in one pass.
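
The arithmetic behind that anchor is easy to check; the sketch below just multiplies the figures reported in the paper (24 kHz audio, roughly 3200× downsampling, 7.5 tokens per second).

```python
# Back-of-the-envelope token budget for one hour of audio.
SAMPLE_RATE_HZ = 24_000      # waveform sample rate used by the tokenizer (per the paper)
DOWNSAMPLE_FACTOR = 3_200    # acoustic tokenizer compression (~3200x)

tokens_per_second = SAMPLE_RATE_HZ / DOWNSAMPLE_FACTOR   # 7.5
audio_tokens_per_hour = 3_600 * tokens_per_second         # 27,000

print(tokens_per_second, audio_tokens_per_hour)           # 7.5 27000.0
# ~27k audio tokens leave room for the generated transcript inside a ~65k-token context window.
```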

🍞🍞 New Concept 9: Large Language Models (LLMs) 🍞 Hook: Imagine a super-smart librarian who knows patterns in language and can write fluently. 🥬 The Concept: An LLM reads the compressed audio embeddings and generates the rich transcript token by token.

  • How it works:
    1. Treat long-form speech understanding as language modeling.
    2. Autoregressively produce tokens that describe speaker, time, and content.
    3. Use the full context window to stay consistent.
  • Why it matters: Without an LLM backbone, it’s hard to fuse who-when-what and keep long-range coherence. 🍞 Anchor: The model uses what it ‘heard’ at minute 3 to correctly interpret a callback joke at minute 53.

🍞🍞 New Concept 10: Prompt-based Context Injection 🍞 Hook: Like getting a cheat sheet with hard names before a spelling bee. 🥬 The Concept: Users can prepend helpful text hints (hotwords, acronyms, background) to guide recognition.

  • How it works:
    1. Add a short prompt with domain terms (‘HLA-B27’, ‘GraphQL’, ‘X Æ A-12’).
    2. The model uses the prompt to bias towards the right spellings and meanings.
    3. It especially helps with niche jargon and code-switching.
  • Why it matters: Without helpful hints, rare names and technical words often come out wrong. 🍞 Anchor: A hospital adds ‘metoprolol’ and ‘echocardiogram’ to the prompt; the notes stop misspelling them.
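
A minimal sketch of how such a context prompt might be assembled before inference; the helper name, field names, and wording are illustrative assumptions, not a documented interface of the released model.

```python
# Hypothetical helper that turns domain hints into a context prompt string.
# The field names and phrasing are assumptions for illustration only.
def build_context_prompt(speakers=None, hotwords=None, background=None):
    parts = []
    if speakers:
        parts.append("Attendees: " + ", ".join(speakers) + ".")
    if hotwords:
        parts.append("Expect these terms: " + ", ".join(hotwords) + ".")
    if background:
        parts.append("Background: " + background)
    return " ".join(parts)

print(build_context_prompt(
    speakers=["Dr. Lee", "Prof. Gómez"],
    hotwords=["GraphQL", "HLA-B27", "metoprolol", "echocardiogram"],
    background="Hospital engineering sync about the records API.",
))
```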

Before vs. after: Before, we had three separate tools plus glue code; errors cascaded, and the big picture vanished. After, one model attends to the whole hour and writes an organized transcript, keeping speakers straight and timings tight.

Why it works (intuition): The ultra-low frame rate keeps enough acoustic detail while making the sequence short enough for the LLM to ‘remember everything at once.’ The LLM then treats the job like writing a carefully tagged story—if the whole story fits in memory, pronouns, speaker identities, and timing stay consistent. Context prompts act like signposts that nudge the model toward correct, domain-specific choices.

Building blocks:

  • Ultra-low-rate acoustic tokenizer (about 3200× downsampling at 24 kHz).
  • Semantic tokenizer for language-aligned features.
  • Decoder-only LLM (e.g., Qwen 2.5) with large context window and curriculum training from 8k to 65k tokens.
  • Structured output format: [Who], [When], [What], plus non-speech tags to prevent hallucinations.
  • Optional context prompts for domain terms and background.

03 Methodology

At a high level: Audio (up to 60 minutes) → Dual-tokenizers produce continuous audio embeddings (+ optional text prompt) → Decoder-only LLM attends across the full context → Generates Rich Transcription stream with [Who], [When], [What] (and non-speech tags).
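
In code form, that flow could be sketched like this; every function below is a toy stand-in that only mirrors the data flow, not the released components or their real interfaces.

```python
# Structural sketch of the single-pass flow. All functions are toy stand-ins.
def acoustic_tokenizer(waveform):       # stand-in: ~7.5 "sound fidelity" tokens per second
    return [f"a{i}" for i in range(len(waveform) // 3200)]

def semantic_tokenizer(waveform):       # stand-in: language-aligned features
    return [f"s{i}" for i in range(len(waveform) // 3200)]

def fuse(acoustic, semantic):           # stand-in: one continuous embedding sequence
    return list(zip(acoustic, semantic))

def llm_generate(inputs):               # stand-in: rich who-when-what output
    return "[Speaker 1, 0.0-10.25] Welcome to Vibe..."

def transcribe_long_audio(waveform, context_prompt=None):
    audio_embeddings = fuse(acoustic_tokenizer(waveform), semantic_tokenizer(waveform))
    prefix = [context_prompt] if context_prompt else []
    return llm_generate(prefix + [audio_embeddings])

one_minute = [0.0] * (24_000 * 60)      # fake 24 kHz waveform (~450 audio tokens)
print(transcribe_long_audio(one_minute, "Attendees: Dr. Lee, Prof. Gómez."))
```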

Step-by-step, like a recipe:

  1. Prepare the inputs
  • What happens: The system takes long-form audio and sends it through two tokenizers: an Acoustic Tokenizer that massively compresses the waveform to ~7.5 tokens/sec, and a Semantic Tokenizer that extracts meaning-related features. Optional: prepend a short text prompt containing names, jargon, or background.
  • Why it exists: Without compression, an hour of raw audio would be too long for a language model to ‘remember.’ Without semantic cues, word choices would drift. Without prompts, niche terms get mangled.
  • Example: A 60-minute meeting becomes about 27,000 tokens. Prompt: ‘Attendees: Dr. Lee, Prof. Gómez. Topics: GraphQL, HLA-B27.’
  2. Fuse into a single sequence
  • What happens: The acoustic and semantic embeddings are combined into a continuous stream the LLM can read.
  • Why it exists: The LLM expects a single timeline to attend over; separate streams would complicate alignment.
  • Example: Minute 12’s soft laugh (acoustic) helps the model not mistake it for the word ‘ha,’ while the semantic features keep ‘cache’ vs ‘cash’ straight.
  3. Let the LLM attend across the full hour
  • What happens: A decoder-only LLM (e.g., Qwen 2.5) autoregressively generates tokens. Because the whole hour fits, it can use early clues to interpret later speech.
  • Why it exists: Long-range attention prevents context fragmentation; the model keeps track of speakers, topics, and references.
  • Example: If ‘API’ was defined at minute 2, when someone says ‘call it’ at minute 45, the LLM knows they likely mean calling the API.
  4. Generate Rich Transcription (Who–When–What)
  • What happens: The model emits structured tags, e.g., ‘Speaker 2, 10.3–33.33: let’s review the roadmap…’ and inserts non-speech tags like [Silence], [Music], [Noise] when needed.
  • Why it exists: Merging diarization, timestamps, and content avoids glue code and error chains.
  • Example: ‘Speaker 1, 0.0–10.25: Welcome to Vibe…’ then ‘[Noise]’ then ‘Speaker 2, 10.3–33.33: Thanks for joining…’
  5. Handle non-speech intervals
  • What happens: The model learns to label non-speech (e.g., [Unintelligible Speech], [Music], [Environmental Sounds]).
  • Why it exists: Without explicit non-speech labels, models often hallucinate words during silence or background noise.
  • Example: Soft piano in the background becomes [Music], not ‘muse’ or random words.
  6. Training data pipeline (pre-training)
  • What happens: Long recordings are first segmented by voice activity detection, transcribed (e.g., with Whisper large-v3 turbo), diarized (WeSpeaker embeddings + HDBSCAN clustering), refined (merge clusters if cosine similarity > 0.67; a simplified sketch appears after this list), then filtered (drop noisy samples).
  • Why it exists: Pre-training needs lots of reasonably labeled data to teach the model what speech, speakers, and timings look like.
  • Example: A messy town-hall gets cleaned into reliable speaker turns and word-level timestamps.
  7. Supervised fine-tuning (SFT)
  • What happens: Use high-quality multi-speaker datasets (e.g., MLC-SLM, Fisher) and a music set (Muse) to improve base skills. Build synthetic, context-rich dialogues (via a strong text model) paired with prompts, then synthesize multi-speaker audio, verify transcripts, and keep only good samples. Restore long-form coherence by asking a text model to rewrite and merge chunked transcripts into globally consistent long texts; label non-speech with audio-tagging tools.
  • Why it exists: SFT aligns the model with instructions (who-when-what format), boosts domain and code-switch performance, and teaches it to behave well on long recordings.
  • Example: A bilingual tech talk is synthesized with a prompt listing ‘GraphQL, Kubernetes, 张三,’ so the model learns to get tricky spellings and switches right.
  8. Curriculum for long contexts
  • What happens: The LLM is trained with gradually increasing input lengths (from ~8k to ~65k tokens).
  • Why it exists: Jumping straight to very long sequences is hard; curriculum helps stabilize learning.
  • Example: Like practicing a 5-minute speech before a 60-minute lecture.
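
Referring back to step 6, here is a simplified sketch of the cluster-refinement idea: merge two speaker clusters when their average embeddings are more similar than the 0.67 cosine threshold. The real pipeline uses WeSpeaker embeddings and HDBSCAN clustering; the vectors below are toy values for illustration.

```python
# Simplified stand-in for the speaker-cluster refinement in the pre-training data pipeline.
# Real pipeline: WeSpeaker embeddings + HDBSCAN clustering; here we use toy 2-D centroids.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def merge_similar_clusters(centroids, threshold=0.67):
    """Fold cluster j into cluster i whenever their centroids' cosine similarity > threshold."""
    labels = list(range(len(centroids)))
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if cosine(centroids[i], centroids[j]) > threshold:
                labels[j] = labels[i]
    return labels

toy_centroids = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([-0.2, 1.0])]
print(merge_similar_clusters(toy_centroids))  # [0, 0, 2]: the first two "speakers" merge
```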

The secret sauce:

  • Ultra-low frame rate audio tokenizer lets an hour fit comfortably in context so the model can reason globally in a single pass.
  • Unified generation of who-when-what avoids brittle pipelines and error cascades.
  • Prompt-based context injection gives the model a steer toward rare names, domain jargon, and robust code-switching.
  • Non-speech tagging reduces hallucinations during silence and background sounds.

What breaks without each step:

  • No compression: the model can’t see the whole hour; context is lost.
  • No semantic features: more homophone errors and weaker grammar alignment.
  • No prompts: rare terms and names are misheard.
  • No unified output: reconciling separate modules reintroduces fragmentation and drift.
  • No non-speech tags: the model invents words in quiet parts.

04 Experiments & Results

🍞🍞 New Concept 11: Word Error Rate (WER) 🍞 Hook: Think of grading a spelling test—how many words did you misspell? 🥬 The Concept: WER measures how many words in the transcript are wrong compared to the reference.

  • How it works:
    1. Compare the model’s words to the correct words.
    2. Count substitutions, deletions, and insertions.
    3. Divide by the total number of correct words.
  • Why it matters: Without WER, you can’t tell how accurate the plain text part is. 🍞 Anchor: If the model writes ‘their’ instead of ‘there’ many times, WER goes up.
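
Written as a formula (the standard definition implied by the steps above, with S substitutions, D deletions, I insertions, and N words in the reference):

$$\mathrm{WER} = \frac{S + D + I}{N}$$

So 5 wrong words against a 100-word reference gives a WER of 5%.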

🍞🍞 New Concept 12: Diarization Error Rate (DER) 🍞 Hook: Imagine a quiz where you match quotes to the right friend; each mismatch is a point off. 🥬 The Concept: DER measures errors in who is speaking (missed speech, false alarms, and speaker confusions).

  • How it works:
    1. Align predicted speaker turns with the true ones.
    2. Count where the system assigned the wrong person or missed a segment.
    3. Compute error as a percentage of total time.
  • Why it matters: Without low DER, meeting notes misattribute decisions. 🍞 Anchor: If it says ‘Teacher’ when the student was talking, DER increases.
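
As a formula (the standard definition matching the steps above, measured over time rather than words):

$$\mathrm{DER} = \frac{T_{\text{missed}} + T_{\text{false alarm}} + T_{\text{speaker confusion}}}{T_{\text{total speech}}}$$

For example, 3 minutes of missed or misattributed speech over 60 minutes of reference speech gives a DER of 5%.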

🍞🍞 New Concept 13: Concatenated minimum-Permutation WER (cpWER) 🍞 Hook: Suppose you group everything each person said into one big paragraph, then check mistakes per person. 🥬 The Concept: cpWER checks content accuracy while allowing the model to permute speaker labels to best match the truth.

  • How it works:
    1. Concatenate all utterances per predicted speaker.
    2. Try all permutations of speaker mapping.
    3. Pick the mapping with the lowest WER.
  • Why it matters: It reflects both transcription and speaker consistency, without punishing tiny timing slips. 🍞 Anchor: If the model kept mixing up Speaker A and B labels but consistently separated their words, cpWER still gives fair credit.
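
A minimal, self-contained sketch of the idea (not the official MeetEval scorer); it assumes the predicted and reference transcripts have the same number of speakers, each given as one concatenated string per speaker.

```python
# Minimal cpWER sketch: try every speaker mapping, keep the lowest word error rate.
# Not the official MeetEval implementation; assumes equal speaker counts on both sides.
from itertools import permutations

def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def cp_wer(reference, hypothesis):
    """reference/hypothesis: dicts mapping a speaker label to that speaker's concatenated text."""
    ref_texts = [text.split() for text in reference.values()]
    total_ref_words = sum(len(words) for words in ref_texts)
    best = float("inf")
    for perm in permutations(hypothesis.values()):
        errors = sum(edit_distance(ref, hyp.split()) for ref, hyp in zip(ref_texts, perm))
        best = min(best, errors / max(total_ref_words, 1))
    return best

reference = {"A": "let us review the roadmap", "B": "thanks for joining"}
hypothesis = {"spk1": "thanks for joining", "spk2": "let us review the road map"}
print(cp_wer(reference, hypothesis))  # 0.25: labels differ, but the best mapping is used
```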

🍞🍞 New Concept 14: Time-Constrained minimum-Permutation WER (tcpWER) 🍞 Hook: Like cpWER, but you only get points if the words also show up at about the right times. 🥬 The Concept: tcpWER measures who-what-when together—words must match within a time window.

  • How it works:
    1. Do cpWER matching.
    2. Enforce a temporal collar so matches count only if times align.
    3. Compute WER with this timing rule.
  • Why it matters: Without tcpWER, a model could get words right but place them at the wrong moment. 🍞 Anchor: If it moves a sentence from minute 5 to minute 15, tcpWER penalizes that.

The test: The team evaluated on multi-speaker datasets across languages: AISHELL-4 (Chinese), AMI (English, IHM/SDM), AliMeeting (Chinese), and MLC Challenge (many languages). They followed MeetEval to report four views of quality: WER (what), DER (who), cpWER (speaker-aware content), and tcpWER (who+what+when).

The competition: They compared against strong, closed-source multimodal models (Gemini-2.5-Pro and Gemini-3-Pro). To be fair and stable for baselines, they fed Gemini 240-second chunks; VIBEVOICE-ASR processed each full recording in a single pass.

The scoreboard with context:

  • Across all datasets, VIBEVOICE-ASR achieved the best DER and tcpWER, meaning it most reliably knew who spoke when and aligned words to the right times. Think of this like getting an A+ in ‘who-when’ while others got B’s.
  • Example numbers: On AISHELL-4 (Chinese), DER dropped to 6.77 versus 15.32 and 22.03 for Gemini models—a big win in speaker correctness. On the same set, tcpWER improved to 25.35 from 31.60 and 38.75—better timing-accurate content.
  • On AMI and AliMeeting (English and Chinese), VIBEVOICE-ASR similarly beat baselines in DER and tcpWER, showing stronger diarization and time alignment in busy, multi-speaker rooms.
  • In cpWER (speaker-aware content), the model led in most languages of the MLC Challenge (11/16 cases), signaling better speaker consistency.
  • In plain WER (what), it was best in about half the settings and close on the rest—so it balances raw word accuracy with superior who-when structure.

Surprising findings:

  • Closed-source baselines showed timestamp drift and occasional hallucinations on long audio; single-pass global attention likely kept VIBEVOICE-ASR more grounded.
  • Prompt-based context gave notable boosts on rare terms and code-switching, acting like a friendly compass during tricky passages.
  • Music robustness data helped the model avoid turning background melodies into fake words.

Bottom line: If your top priorities are ‘who said what and when’ in long, multi-speaker audio, VIBEVOICE-ASR sets a new state-of-the-art on public tests and handles 50+ languages without a manual language toggle.

05 Discussion & Limitations

Limitations (honest and specific):

  • Multilingual forgetting during SFT: Although pre-training covered 50+ languages, fine-tuning emphasized English, Chinese, and code-switching. This can reduce performance on low-resource languages not seen in SFT. Community fine-tuning with the released tools can help close this gap.
  • Overlapping speech: The model writes a serialized stream and does not explicitly separate simultaneous talkers. In ‘cocktail party’ moments, it tends to follow the dominant speaker, missing secondary speech.
  • Prompt dependence for jargon: While context prompts are powerful, performance on rare terms may dip if prompts are missing or incomplete.
  • Long-context compute needs: Single-pass processing over ~27k tokens requires sufficient memory and optimized inference (e.g., vLLM), which may be heavy for edge devices.
  • Data pipeline biases: Even with careful filtering and synthetic data, certain accents, domains, or background noises may be underrepresented.

Required resources:

  • A GPU or accelerator that can handle long-context LLM inference (tens of thousands of tokens) efficiently.
  • Access to the open-source model weights and tokenizers, plus the preprocessing pipeline (VAD, transcription, diarization) for additional training.
  • Optional but useful: domain prompts (hotwords, acronyms), especially for specialized fields.

When not to use:

  • Highly overlapped multi-speaker scenarios where simultaneous transcription of all voices is essential (e.g., courtroom crosstalk). A separation-aware system may be better.
  • Ultra-low-latency streaming needs under a few hundred milliseconds; the model targets single-pass long-form rather than frame-by-frame streaming.
  • Settings with strict on-device constraints and no access to long-context inference.

Open questions:

  • Can we integrate separation-aware modeling so overlapping speakers are all captured without losing single-pass simplicity?
  • How far can multilingual performance go with community SFT on low-resource languages using the released tools?
  • Can we further shrink compute via smarter compression or sparse attention while keeping accuracy?
  • What are the best practices for writing prompts that maximally boost domain terms without biasing content?
  • How should we evaluate richer outputs (non-speech tags, code-switch fidelity) beyond WER/DER/tcpWER to reflect real user needs?

06 Conclusion & Future Work

Three-sentence summary: VIBEVOICE-ASR listens to long audio (up to 60 minutes) in a single pass and writes a rich transcript that interleaves who, when, and what. It gets rid of context fragmentation by compressing audio to fit inside a large language model’s window and unifying ASR, diarization, and timestamps into one end-to-end generation task. On public benchmarks across languages, it beats strong baselines in speaker attribution and timing-accurate transcription while staying competitive on plain word accuracy.

Main achievement: Turning long-form, multi-speaker transcription into a single, coherent language modeling problem—powered by ultra-low-rate audio tokenization, a long-context LLM, and a structured output format—so the system maintains global context and avoids error-prone pipelines.

Future directions: Add separation-aware capability for overlapping speech, expand instruction-tuning to more low-resource languages, reduce compute with smarter attention and compression, and standardize richer evaluations (like non-speech tagging and code-switch quality). Prompts can also be studied as a ‘context API’—how small, well-crafted hints can reliably steer recognition in difficult domains.

Why remember this: It’s a blueprint for how to make AI listen like a great human note-taker—hold the whole story in mind, keep speakers straight, place words at the right times, and use hints wisely. That design unlocks trustworthy meeting notes, better captions, faster search in long talks, and smoother multilingual experiences.

Practical Applications

  • Auto-generated meeting minutes that accurately attribute action items to the right speaker with time markers.
  • Lecture transcripts where students can click timestamps to replay key explanations or questions.
  • Podcast and webinar search: jump to the exact minute a guest mentions a product, person, or statistic.
  • Court or council session archives with precise who-when-what tagging for public records.
  • Medical dictation with prompts for drug names and procedures to improve term accuracy.
  • Customer support call analytics that separate agents from customers and flag important moments.
  • Bilingual classroom captions that handle code-switching smoothly without manual language settings.
  • Media production pipelines that label [Music], [Noise], and [Silence] to help editors cut cleaner scenes.
  • Market research focus group transcripts that stay consistent across long, multi-speaker discussions.
  • Enterprise knowledge bases built from hour-long briefings with trustworthy speaker and time anchors.
#long-form ASR #speaker diarization #timestamping #rich transcription #single-pass processing #dual-tokenizers #LLM audio understanding #multilingual ASR #code-switching #DER #tcpWER #cpWER #context injection #non-speech tagging #VibeVoice-ASR