Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition
Key Summary
- Big models like Whisper are great for accuracy but too slow for live captions; this paper builds a smaller, faster Thai speech recognizer for real-time use.
- The new model uses a FastConformer-Transducer design with only 115 million parameters, making it about 45× cheaper to run than Whisper Large-v3 while staying nearly as accurate.
- A strict text normalization pipeline cleans up tricky Thai cases (like numbers and the repetition mark ๆ) so the model learns consistent targets and avoids hallucinations.
- They create 11,000 hours of training data using a consensus voting system from three teacher models, with humans only checking the hard cases.
- A two-stage “curriculum” adapts the model to the Isan dialect without forgetting Central Thai, first tuning sounds, then words and grammar.
- On clean speech (Gigaspeech2-Typhoon), the streaming model gets 6.81% CER; on the tough TVSpeech set, it gets 9.99% CER, close to giant offline models.
- Their data pipeline alone boosts accuracy when used with Whisper too, proving data quality matters as much as model size.
- They release a standardized Thai ASR benchmark so everyone can measure performance fairly and consistently.
- Limits include strict Thai transliteration of English words, weaker code-switching, and less world knowledge than giant foundation models.
- Future work targets inverse text normalization, domain biasing, multi-speaker handling, more dialects, and on-device deployment.
Why This Research Matters
Real-time, reliable Thai speech recognition makes classrooms, meetings, and broadcasts more accessible by providing instant captions that don’t lag. Banks and call centers can transcribe customer calls quickly and consistently, especially for numbers and names that usually cause errors. Healthcare providers can document spoken notes faster, reducing paperwork time while keeping sensitive data on-device in the future. Small businesses and apps can afford high-quality Thai ASR because the model is compact and efficient to run. Regional inclusion improves as dialects like Isan get strong support, bringing modern voice tools to more communities. A standardized benchmark lets everyone measure progress fairly, speeding up research and real-world impact.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re watching a live TV show with Thai captions that appear instantly. If the captions arrive late or guess wrong about numbers and names, the show becomes confusing.
🥬 The Concept: Automatic Speech Recognition (ASR) is how computers turn speech into text in real time or after the fact. How it works (simple):
- Listen to the audio and turn it into sound features.
- Match those features to likely sounds and words.
- Write down the best text the sounds could mean. Why it matters: Without reliable ASR, live captions, call center notes, and voice assistants either lag or make too many mistakes.
🍞 Anchor: When you say “สิบหก” (sixteen), good ASR writes “สิบหก” instead of mixing digits and words like “16” or hearing it as “สิบเอก.”
The World Before: In recent years, huge encoder–decoder models (like Whisper) became super good at offline transcription in many languages. They listen to long chunks, use both past and future context, and deliver very accurate text—if you can wait. For Thai, these models made a big leap in quality because lots of pre-trained checkpoints were freely available.
The Problem: Live apps—like meeting captions, phone agents, or TV subtitles—can’t wait. Offline models decode word-by-word in a way that adds unpredictable delays. They also tend to hallucinate, especially with Thai numbers, dates, and special symbols. Thai adds extra hurdles: no spaces between words, tones, dialects, and multiple ways to speak the same digits depending on context (postal code vs. quantity). Standard Word Error Rate doesn’t fit Thai well, so measuring accuracy fairly is tough.
🍞 Hook: You know how different teachers might grade the same essay differently if there are no rules? That’s what happened with Thai transcripts—everyone wrote things a different way.
🥬 The Concept: Text normalization is the rulebook that turns many valid spellings into one agreed “canonical” form. How it works:
- Decide consistent rules (e.g., how to write numbers, how to handle the repetition mark ๆ).
- Convert all training and test texts to match these rules.
- Train and evaluate so models aren’t punished for style, only for sound-based mistakes. Why it matters: Without normalization, models learn noisy targets and get graded unfairly.
🍞 Anchor: If the audio says the postal code “10150,” the rule may say: always write digits as spoken digits—“หนึ่งศูนย์หนึ่งห้าศูนย์.” Everyone trains and tests on that same version.
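To make the rule concrete, here is a minimal sketch of one such normalization step in Python: rewriting runs of Arabic digits as spoken Thai digits, matching the postal-code rule above. It is illustrative only, not the paper's actual pipeline; the function name and rule scope are assumptions.

```python
# Minimal sketch of one normalization rule (illustrative, not the paper's code):
# rewrite runs of Arabic digits as spoken Thai digits, as in the postal-code rule.
import re

THAI_SPOKEN_DIGITS = {
    "0": "ศูนย์", "1": "หนึ่ง", "2": "สอง", "3": "สาม", "4": "สี่",
    "5": "ห้า", "6": "หก", "7": "เจ็ด", "8": "แปด", "9": "เก้า",
}

def digits_to_spoken(text: str) -> str:
    """Replace every run of Arabic digits with its digit-by-digit spoken form."""
    return re.sub(r"[0-9]+", lambda m: "".join(THAI_SPOKEN_DIGITS[d] for d in m.group()), text)

print(digits_to_spoken("รหัสไปรษณีย์ 10150"))
# -> รหัสไปรษณีย์ หนึ่งศูนย์หนึ่งห้าศูนย์
```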
Failed Attempts: Many teams tried to just scale up models or add more data, assuming bigger is better. But bigger offline models stayed slow and still hallucinated Thai-specific tricky bits like ๆ and number ranges (6–7 as range vs. minus). Others trained on mixed, messy transcripts, so even good models learned inconsistent outputs.
The Gap: What was missing was a fast, streaming-friendly architecture paired with squeaky-clean, consistent training targets—and a fair, shared test to measure progress.
🍞 Hook: Imagine three referees watching a game. If two agree on a call, you accept it; if not, a head referee decides.
🥬 The Concept: Consensus-based pseudo-labeling uses multiple teacher models to label audio, picking the majority vote and sending only hard cases to humans. How it works:
- Run three strong Thai Whisper models on the same audio.
- If at least two match, keep that transcript.
- If they don’t match, fall back to the best-known teacher.
- Flag tricky lines (digits, punctuation) for a quick human check, using strict normalization rules. Why it matters: It scales to thousands of hours while keeping label quality high.
🍞 Anchor: For an audio clip with a phone number, if two teachers agree exactly, that version is accepted; if they differ, a human reviews it to enforce the canonical form.
Real Stakes: Live captions in classrooms, hospital calls, and banking hotlines need low delay, low hallucination, and reliable handling of numbers and names. Thailand’s rich dialects (like Isan) deserve technology that understands them too. This paper shows you can get both speed and trustworthiness by combining the right architecture with careful data rules—and it gives the community a fair benchmark so progress is measurable and reproducible.
02 Core Idea
🍞 Hook: Think of a speedy cashier who’s fast and accurate because the price tags are clean and consistent. They don’t need to be a supercomputer; they just need good labels and a checkout lane built for speed.
🥬 The Concept: The “aha!” is that clean, consistent data plus a streaming-ready model beats simply making the model huge for Thai ASR. How it works:
- Use a FastConformer-Transducer designed for live, low-latency decoding.
- Build a strict text normalization pipeline so the targets are consistent and unambiguous.
- Create big training sets with consensus voting among teacher models, with humans checking tricky lines.
- Adapt to the Isan dialect with a two-stage curriculum so you learn new speech patterns without forgetting Central Thai. Why it matters: Without this combo, you either get fast but sloppy captions or slow but accurate ones. This yields fast and accurate.
🍞 Anchor: A 115M-parameter streaming model gets within reach of giant offline models while cutting compute roughly 45-fold compared to Whisper Large-v3.
Multiple Analogies:
- Library analogy: If all books follow the same shelf rules (normalization), even a smaller librarian team (compact model) finds the right book fast.
- Sports analogy: A well-practiced playbook (curriculum) lets a lean team (115M params) outplay heavier squads (1.55B params) by timing and coordination.
- Kitchen analogy: Prepped ingredients (clean labels) and a line-cook station (streaming architecture) beat a gourmet kitchen that’s slow to plate (offline decoding).
Before vs After:
- Before: Thai ASR leaned on huge offline models. Results were good but slow, often inconsistent with numbers and the ๆ mark, and hard to compare fairly.
- After: A streaming FastConformer-Transducer, trained on strictly normalized, consensus-checked data, reaches similar accuracy with far less compute and stable real-time behavior. A common benchmark makes comparisons trustworthy.
🍞 Hook: You know how cutting a long movie into small, important clips makes it quicker to review?
🥬 The Concept: FastConformer-Transducer is a speech model with an encoder that compresses audio frames early (downsampling) and a decoder (RNN-T) that emits text as audio arrives. How it works:
- Early subsampling shrinks the timeline so attention looks over fewer frames.
- Local attention and convolutions capture nearby patterns without heavy global compute.
- The RNN-T decoder predicts characters as the sound streams in, no need to wait. Why it matters: Without early subsampling and streaming decoding, latency and cost shoot up.
🍞 Anchor: Instead of processing padded 30-second chunks like Whisper, the model listens and types continuously, like a fast stenographer.
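The saving from early subsampling can be sanity-checked with simple arithmetic. The sketch below assumes a typical 10 ms feature hop (100 frames per second) and an illustrative local-attention window; these numbers are assumptions, not figures from the paper.

```python
# Back-of-the-envelope arithmetic for 8x subsampling on 30 s of audio.
# Assumes 100 feature frames per second (10 ms hop); illustrative only.
FRAMES_PER_SEC = 100
AUDIO_SECONDS = 30
SUBSAMPLING = 8

frames_in = FRAMES_PER_SEC * AUDIO_SECONDS      # 3000 frames enter the encoder
frames_out = frames_in // SUBSAMPLING           # 375 frames reach the attention layers

# Full self-attention scales with the square of the sequence length,
# so 8x fewer frames is roughly a 64x reduction per attention layer.
full_attention_saving = (frames_in / frames_out) ** 2

# Local attention scales linearly: each frame only attends within a fixed window.
LOCAL_WINDOW = 128                              # assumed window size for illustration
local_attention_ops = frames_out * LOCAL_WINDOW

print(frames_in, frames_out, full_attention_saving, local_attention_ops)
# 3000 375 64.0 48000
```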
Why It Works (intuition):
- Subsampling = fewer steps per second of audio, so less compute and lower delay.
- Local attention = enough context to be accurate, without expensive full-sequence attention.
- RNN-T = output tokens as soon as evidence appears, avoiding the bottleneck of peeking far ahead.
- Normalization = the model learns one way to write a sound; no confusion, fewer hallucinations.
- Curriculum = first tune ears (acoustics), then tune words (lexicon), preventing forgetting.
Building Blocks:
- FastConformer encoder: 8× depthwise conv subsampling, local attention for stability.
- Transducer (RNN-T) decoder and joint network: frame-synchronous streaming predictions.
- Consensus-based pseudo-labeling: three teacher votes + human checks for tricky cases.
- Text normalization: canonical rules for numbers, repetition mark ๆ, dashes, and loanwords.
- Two-stage dialect curriculum: Stage 1 low-LR full-model acoustic adaptation; Stage 2 freeze encoder, higher-LR decoder/joint for Isan vocabulary.
🍞 Anchor: Put together, it’s like a relay team: the encoder runs fast with short steps (subsampling), the decoder grabs the baton as soon as it’s close enough to the finish (streaming), and the coach gives one clear playbook (normalization).
03 Methodology
At a high level: Audio → Consensus pseudo-labels + strict normalization → Large general training → Streaming model fine-tuning → Two-stage dialect curriculum → Real-time Thai ASR.
Step 1: Consensus-Based Transcription 🍞 Hook: Imagine three classmates take notes from the same lecture; if two notes match, you trust them. If not, a teacher double-checks.
🥬 The Concept: Consensus pseudo-labeling builds a big, high-quality training set cheaply. What happens:
- Run three Thai Whisper-Large models on each audio file.
- If two or more outputs match exactly, accept that as the label.
- If no match, fall back to the strongest teacher (Pathumma-Whisper Large-v3).
- Auto-check for tricky patterns (digits, special punctuation). Flagged lines go to human reviewers. Why it exists: To scale to ~11,000 hours without drowning in manual labeling. Example: For an audio “ติดต่อ 02-123-4567,” two teachers agree on the same transcription; it’s accepted. If they disagree on spacing or digits, a human enforces the canonical form.
🍞 Anchor: Like majority vote in a club—simple, fair, fast—and a supervisor only steps in for tough calls.
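A hedged sketch of this voting rule is below. The teacher names, the regex used to flag tricky lines, and the function signature are illustrative assumptions, not the authors' actual code.

```python
# Illustrative sketch of the consensus rule (not the authors' actual code):
# keep a transcript when at least two of three teachers agree exactly, otherwise
# fall back to the strongest teacher, and flag digit/punctuation lines for review.
import re
from collections import Counter

TRICKY = re.compile(r"[0-9๐-๙\-.,/]")   # Arabic or Thai digits and common punctuation

def consensus_label(hypotheses: dict, fallback: str = "pathumma") -> tuple:
    """hypotheses maps teacher name -> transcript; returns (label, needs_human_review)."""
    text, votes = Counter(hypotheses.values()).most_common(1)[0]
    if votes < 2:                        # no majority: trust the designated best teacher
        text = hypotheses[fallback]
    return text, bool(TRICKY.search(text))

label, needs_review = consensus_label({
    "pathumma":  "ติดต่อ ศูนย์สองหนึ่งสองสาม",
    "teacher_b": "ติดต่อ ศูนย์สองหนึ่งสองสาม",
    "teacher_c": "ติดต่อ 02-123",
})
print(label, needs_review)   # majority version accepted; no digits remain, so not flagged
```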
Step 2: Text Normalization Pipeline 🍞 Hook: You know how a class agrees on one way to format homework so the grader isn’t confused? Same idea here.
🥬 The Concept: Normalize all transcripts to one canonical Thai form that mirrors speech. What happens:
- Numbers: choose spoken-digit or spoken-number rules consistently (e.g., postal codes as spoken digits: 10150 → “หนึ่งศูนย์หนึ่งห้าศูนย์”).
- Repetition mark ๆ: split into explicit repeats with spaces where needed (“เก่งๆ” → “เก่ง เก่ง”; fix phrase-level repeats like “เป็น อย่าง อย่าง”).
- Dashes and ranges: disambiguate “6-7” into “หกถึงเจ็ด” (range), “หกลบเจ็ด” (minus), or “หกขีดเจ็ด” (separator) using context.
- Loanwords: prefer stable Thai transliterations (e.g., “เว็บไซต์”). Why it exists: Inconsistent labels cause the model to learn the wrong patterns or hallucinate. Example: “ตี34นาที” (ambiguous) → “ตีสามสิบสี่นาที.” Mixed formats like “ไข่เป็ดเบอร์ 04 ฟอง” → “ไข่เป็ดเบอร์ศูนย์สี่ฟอง.”
🍞 Anchor: Everyone hands in homework in the same neat format, so the model learns faster and makes fewer silly mistakes.
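Two more rules can be sketched in the same spirit: expanding the repetition mark ๆ and spelling out a digit-digit dash as a range. Real disambiguation of range vs. minus vs. separator needs context; the function names and the truncated number map here are illustrative assumptions.

```python
# Illustrative sketches of two normalization rules (not the paper's implementation).
import re

def expand_mai_yamok(text: str) -> str:
    """เก่งๆ -> เก่ง เก่ง: repeat the token before the ๆ mark explicitly."""
    return re.sub(r"(\S+?)\s*ๆ", r"\1 \1", text)

THAI_NUMBER = {"6": "หก", "7": "เจ็ด"}   # truncated mapping, for the example only

def dash_as_range(text: str) -> str:
    """6-7 -> หกถึงเจ็ด, assuming the context says 'range' rather than minus or separator."""
    return re.sub(
        r"([0-9])-([0-9])",
        lambda m: THAI_NUMBER.get(m.group(1), m.group(1)) + "ถึง" + THAI_NUMBER.get(m.group(2), m.group(2)),
        text,
    )

print(expand_mai_yamok("เก่งๆ"))   # เก่ง เก่ง
print(dash_as_range("6-7"))        # หกถึงเจ็ด
```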
Step 3: General Thai Training Mix (~11,000 hours) 🍞 Hook: A strong athlete trains on many terrains—track, hills, and sand—to handle anything.
🥬 The Concept: Mix big diverse audio with targeted sets that enforce tricky rules. What happens:
- Gigaspeech2 provides most hours and acoustic variety.
- Internal curated media and Common Voice add conversational and read-speech robustness.
- A tiny but crucial internal TTS set drills numeric and formatting edge cases so the model stops hallucinating numbers. Why it exists: Breadth builds robustness; targeted drills fix systematic weak spots. Example: TTS clips like long account numbers teach the model to always say digits correctly.
🍞 Anchor: It’s like practicing both free throws and full games—mechanics plus game sense.
Step 4: Model Architecture and Fine-Tuning 🍞 Hook: If you shorten a marathon into fewer checkpoints, you finish faster with less energy.
🥬 The Concept: FastConformer-Transducer turns long audio into fewer, richer frames and decodes as it listens. What happens:
- Encoder: 8× depthwise conv subsampling (256 channels) with smaller kernels; local attention stabilizes long-form audio.
- Decoder: RNN-T emits characters/units in a streaming way (frame-synchronous), no 30s padding or lookahead.
- Initialize from a strong English FastConformer-Transducer and fine-tune all parameters on Thai for 1 epoch.
- Train efficiently on 2× H100 GPUs with AdamW and cosine LR (peak 0.001; 5k warmup), batch size 128; ~17 hours total. Why it exists: To achieve low latency, low compute, and stable streaming without sacrificing too much accuracy. Example: A 30-second clip processes with far fewer attention steps than a standard Conformer, slashing FLOPs.
🍞 Anchor: Like compressing a video before sending it—same content, less time and bandwidth.
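The reported optimization recipe (AdamW, cosine decay, peak learning rate 0.001, 5k warmup steps) can be sketched in PyTorch. The stand-in module, total step budget, and placeholder loss below are assumptions for illustration; the actual model and training recipe come from the authors' setup.

```python
# Sketch of the reported schedule: AdamW + linear warmup (5k steps) + cosine decay
# from a peak LR of 1e-3. The module and step budget are stand-ins for illustration.
import math
import torch

model = torch.nn.Linear(80, 256)                 # stand-in for the real encoder/decoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

WARMUP_STEPS, TOTAL_STEPS = 5_000, 100_000       # total budget is an assumed value

def lr_scale(step: int) -> float:
    if step < WARMUP_STEPS:                      # linear ramp up to the peak LR
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay toward zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)

for step in range(3):                            # a few dummy steps just to show usage
    optimizer.zero_grad()
    loss = model(torch.randn(4, 80)).pow(2).mean()   # placeholder loss, not RNN-T loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```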
Step 5: Two-Stage Curriculum for Isan Dialect 🍞 Hook: First tune your ears to a new accent, then learn its favorite phrases.
🥬 The Concept: A curriculum prevents forgetting Central Thai while learning Isan. What happens:
- Data: ~303 hours total, with gold-standard Isan plus Central Thai anchors (media, numeric TTS, repetition subsets).
- Stage 1 (Global/Acoustic): Fine-tune the full model with a tiny learning rate (1e-5) for 10 epochs to catch Isan tones and acoustics.
- Stage 2 (Linguistic): Freeze the encoder; fine-tune decoder + joint with higher LR (1e-3) for 15 epochs to learn Isan lexicon/particles (e.g., “บ่,” “เฮ็ด”). Why it exists: Separate sound adaptation from word adaptation; avoid catastrophic forgetting. Example: CER improves from 16.22% (Stage 1) to 10.65% (Stage 2), a 5.57% absolute gain.
🍞 Anchor: Like music training—first train your ear, then learn the lyrics.
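A minimal PyTorch sketch of the two stages follows, assuming the model exposes encoder, decoder, and joint submodules; those attribute names and the toy module are assumptions, while the learning rates (1e-5 then 1e-3) mirror the values quoted above.

```python
# Sketch of the two-stage curriculum. Attribute names and the toy module are
# assumptions; only the learning rates follow the values quoted in the text.
import torch

class ToyTransducer(torch.nn.Module):
    """Stand-in exposing the three blocks the curriculum treats differently."""
    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Linear(80, 256)
        self.decoder = torch.nn.Linear(256, 256)
        self.joint = torch.nn.Linear(256, 1000)

def stage1_optimizer(model):
    """Stage 1: whole-model acoustic adaptation at a very low learning rate."""
    for p in model.parameters():
        p.requires_grad = True
    return torch.optim.AdamW(model.parameters(), lr=1e-5)

def stage2_optimizer(model):
    """Stage 2: freeze the encoder; retune decoder + joint at a higher rate."""
    for p in model.encoder.parameters():
        p.requires_grad = False
    trainable = list(model.decoder.parameters()) + list(model.joint.parameters())
    return torch.optim.AdamW(trainable, lr=1e-3)

model = ToyTransducer()
opt_stage1 = stage1_optimizer(model)   # ~10 epochs on the Isan + anchor mix
opt_stage2 = stage2_optimizer(model)   # then ~15 epochs with the encoder frozen
```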
The Secret Sauce:
- Data quality as a first-class citizen: consensus labels + strict normalization.
- Architecture built for streaming: subsampling + local attention + RNN-T.
- Curriculum that respects how humans learn accents: ears first, words second.
- Standardized benchmarks so results are apples-to-apples.
04 Experiments & Results
🍞 Hook: If a class uses different grading rubrics, scores are messy. Use one fair rubric, and you see who learned best.
🥬 The Concept: They tested accuracy using Character Error Rate (CER), with two tracks: clean speech (Gigaspeech2-Typhoon) and tough real-world audio (TVSpeech). They also checked FLEURS to show how normalization mismatches can skew scores. How it works:
- Measure CER: how many character edits (insertions, deletions, substitutions) to match the reference.
- Compare against strong baselines: Pathumma-Whisper Large-v3, Biodatlab models, and Gemini.
- Note compute: GFLOPs per 30s of audio and parameter count. Why it matters: Without consistent tests, you can’t tell if gains come from better listening or lucky formatting.
🍞 Anchor: On the standardized sets, the streaming model lands near the accuracy of much larger offline systems while being far cheaper to run.
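For reference, CER can be sketched in a few lines as character-level edit distance divided by reference length; real evaluations first apply the canonical normalization to both reference and hypothesis. This is a generic implementation, not the paper's evaluation script.

```python
# Character error rate: Levenshtein distance over characters / reference length.
# In practice, both strings are run through the canonical normalizer first.
def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    dist = list(range(len(hyp) + 1))          # previous DP row
    for i, rc in enumerate(ref, start=1):
        prev, dist[0] = dist[0], i
        for j, hc in enumerate(hyp, start=1):
            cur = dist[j]                     # value carried from the previous row
            dist[j] = min(cur + 1,            # deletion
                          dist[j - 1] + 1,    # insertion
                          prev + (rc != hc))  # substitution (or match, cost 0)
            prev = cur
    return dist[len(hyp)] / max(1, len(ref))

print(cer("สิบหก", "สิบเอก"))   # 0.4: one substitution + one insertion over 5 characters
```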
The Tests:
- Standard Track (Gigaspeech2-Typhoon): 1,000 utterances, clean read speech, normalized to canonical Thai rules.
- Robustness Track (TVSpeech): 570 challenging clips (3.75 hours) from domains like Finance/Tech; high lexical density and noisy conditions.
- FLEURS: Included to reveal orthography mismatches—digits vs. spoken words.
The Competition:
- Offline open-source: Pathumma-Whisper Large-v3, Biodatlab Whisper Large and Distil.
- Proprietary: Gemini 3 Pro.
- Ours (for separation of concerns): streaming Typhoon ASR Realtime and Typhoon Isan Realtime; and offline Typhoon Whisper variants trained on the same Typhoon data.
Scoreboard with Context:
- Central Thai (clean): Typhoon ASR Realtime gets 6.81% CER on Gigaspeech2-Typhoon; Pathumma-Whisper Large-v3 gets 5.84%. That’s like getting a solid A- while the top offline student gets an A—but our model runs ~45× cheaper.
- Real-world robustness: On TVSpeech, Typhoon ASR Realtime gets 9.99% CER; Pathumma-Whisper Large-v3 gets 10.36%. That’s like beating the heavyweight on a noisy away game.
- FLEURS caveat: Our strict spoken-form outputs appear worse against digit-style references. When references are re-normalized to our rules, the Typhoon-trained Whisper Large-v3 achieves 5.69% CER, even better than Gemini 3 Pro’s 6.91%—showing the earlier gap was stylistic, not phonetic.
Isolating Data Quality from Architecture:
- When training the exact same Whisper Large-v3 architecture on the Typhoon data pipeline, CER drops from 5.84% (Pathumma baseline) to 4.69% on Gigaspeech2-Typhoon and from 10.36% to 6.32% on TVSpeech—a big win. Translation: better labels and normalization beat raw size.
Dialect Results (Isan):
- Best offline: Typhoon-Whisper-Medium-Isan at 8.85% CER.
- Best streaming: Typhoon Isan Realtime at 10.65% CER, much better than Whisper-Medium-Dialect (17.72%).
- Curriculum ablation: Stage 1 (acoustic-only) is 16.22%; adding Stage 2 (linguistic specialization) improves to 10.65%—a 5.57% absolute gain.
Surprising Findings:
- Smaller can be stronger: A 115M streaming model competes with 1.55B offline models on tough audio when trained on clean, consistent data.
- Normalization flips leaderboards: When you align reference styles, some “lower” scores become top scores.
- Foundation-model trade-off: Gemini reads more “pleasantly” but doesn’t always match verbatim normalization, which matters in strict ASR grading.
🍞 Anchor: Think of a spelling bee with one dictionary. Once everyone uses the same book, the real champs stand out—and the compact champ keeps up without needing a giant brain.
05 Discussion & Limitations
🍞 Hook: Even the best bike has gears it’s not built for. Knowing the limits helps you ride smarter.
🥬 The Concept: This system is optimized for fast, faithful Thai transcription under strict rules; it’s not a catch-all for every scenario. Limitations:
- Orthographic rigidity: Outputs prefer spoken-form Thai and transliterations (e.g., เว็บไซต์) over English script, which can look less user-friendly without post-processing.
- Code-switching: Heavy Thai–English mixes may be forced into Thai characters, not ideal for bilingual contexts.
- Semantics: With 115M parameters, the model has less world knowledge than giant foundation models; it can stumble on homophones when acoustics are unclear. Required Resources:
- Training uses modern GPUs (e.g., 2× H100), well-prepared data, and the normalization pipeline.
- For production, you need streaming deployment infra and optional human review for continued data curation. When NOT to Use:
- If you need glossy, mixed-script transcripts for public display without any post-processing.
- If your audio is dominated by English or heavy code-switching.
- If you need deep semantic enrichment (summaries, entity linking) beyond verbatim transcription. Open Questions:
- Inverse Text Normalization (ITN): Can we reliably convert spoken-form outputs back to preferred written formats (dates, currency, postal codes) using context?
- Contextual biasing: How to inject on-the-fly domain terms (names, SKUs) without full retraining?
- Multi-speaker overlap: How to robustly handle overlapping conversation and diarization in Thai?
- Broader dialects: Can the curriculum scale to Northern and Southern Thai and enable zero-shot dialect ID?
- On-device: How far can quantization and graph optimizations push toward mobile/IoT privacy-first use?
🍞 Anchor: It’s the right tool for fast, consistent Thai captions—add a polishing step when you need pretty formatting or English mixed in.
06 Conclusion & Future Work
Three-sentence summary: This paper introduces Typhoon ASR Realtime, a 115M-parameter FastConformer-Transducer for Thai that delivers low-latency transcription with accuracy close to giant offline models at about 45× less compute. The key is data quality: consensus pseudo-labeling plus strict text normalization, paired with a dialect curriculum that learns Isan without forgetting Central Thai. A standardized benchmark makes evaluations fair and reproducible for the community.
Main achievement: Proving that clean, consistent data and a streaming-native architecture can rival billion-parameter offline systems for Thai, while running much faster and cheaper.
Future directions: Build robust inverse text normalization to turn spoken-form Thai into user-friendly written formats; add contextual biasing for domain terms; strengthen multi-speaker handling; extend the curriculum to more dialects; and push toward on-device ASR with quantization and optimized runtimes.
Why remember this: It flips the script from “bigger is better” to “cleaner and smarter is better” for low-resource, morphologically complex languages—showing a practical path to inclusive, real-time voice tech across Thailand.
Practical Applications
- Live Thai subtitles for online classes, TV, and webinars with low latency.
- Call center transcription that reliably captures phone numbers, dates, and amounts without hallucinations.
- Voice-controlled assistants and kiosks in banks, hospitals, and government offices that respond in real time.
- Meeting notes that auto-transcribe Central Thai and Isan, ready for quick search and summarization.
- Customer service QA: scan transcripts for policy keywords and compliance hints with accurate number handling.
- Field workflows (delivery, maintenance) where workers dictate updates hands-free on mobile devices.
- Media monitoring of Thai broadcasts and podcasts with robust handling of noisy environments.
- Dictation for students and writers that respects Thai orthographic conventions for clean drafts.
- On-device ASR prototypes for privacy-preserving use in clinics or ATMs using quantized versions.
- Domain-adapted deployments that bias toward company names, product SKUs, and local place names.