MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models
Key Summary
- This paper builds a new audio tokenizer, called MOSS-Audio-Tokenizer, that turns sound into tiny tokens the way text tokenizers turn sentences into words.
- It uses one simple kind of brain (just Transformers) from start to finish and learns everything together end-to-end, which makes it easier to grow bigger and better.
- The model is trained on 3 million hours of speech, sounds, and music and has 1.6 billion parameters, so it knows many kinds of audio.
- It reconstructs audio with very high quality even at low bitrates, and keeps getting better as you allow more bits.
- Its tokens power the first fully autoregressive TTS system that beats popular non-autoregressive and multi-stage systems in voice similarity while keeping errors very low.
- A new training trick, Progressive Sequence Dropout, lets one TTS model work well across many bitrates without changing the model.
- The tokenizer also enables competitive ASR performance directly from tokens, without needing a separate audio encoder.
- End-to-end training (no frozen parts) scales predictably: bigger models, more data, and larger batches steadily improve quality.
- Using only causal (look-back-only) Transformers means it can work in streaming mode with low delay, matching how real-time systems operate.
Why This Research Matters
High-quality, low-bitrate audio means clearer calls, smoother video meetings, and faster voice assistants that respond in real time. A unified, scalable tokenizer lets one system handle speech, music, and sound effects without switching toolkits. Variable bitrate gives users control: quick, low-latency replies on the go or richer studio sound when bandwidth allows. End-to-end training creates tokens that carry both meaning and acoustics, which boosts both TTS naturalness and ASR accuracy. Streaming and strict causality make it practical for live scenarios like customer support, gaming chat, and accessibility tools. Predictable scaling ensures that more data and compute reliably translate into better audio experiences for everyone.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re building a LEGO city. Text is like neat bricks with labels, and big language models already know how to stack those. But audio? It’s like a chunky pile of mixed LEGO pieces of many shapes and sizes. Hard to pick up and stack quickly.
🥬 The Concept: Audio tokenization is turning wiggly sound waves into tidy, bite-sized pieces (tokens) so AI can handle them like it handles words.
- How it works (big picture):
- Chop a long audio wave into small time chunks.
- For each chunk, find a tiny code (a token) that represents its sound.
- String the tokens together so a language model can read or write them.
- Why it matters: Without good tokens, audio models are clumsy—either slow, low-quality, or hard to scale—like building a city from random scraps.
🍞 Anchor: When ChatGPT answers a voice question or speaks back, a tokenizer is the bridge between raw sound and smart language modeling.
The World Before:
- Text LLMs soared because subword tokenizers gave them a clean, compact, discrete interface.
- Audio lagged: it mixes tiny details (like s‑sounds, breaths, drum hits) and long patterns (sentences, songs). Early tokenizers often used CNNs, pretrained encoders, or multi-stage recipes. They worked, but came with fixed habits (inductive biases) that limited how far you could scale and how well you could reconstruct.
- Many systems weren’t fully causal or truly streaming, so training-time tricks didn’t always match real-time use.
The Problem:
- We need a single, unified tokenizer for all audio types (speech, sounds, music) that:
- Rebuilds audio with high fidelity.
- Is friendly to autoregressive sequence models (like LLMs) with low frame rates (shorter token sequences).
- Works in streaming and strictly causal fashion for real-time use.
- Scales smoothly with more data, bigger models, and deeper quantization (more bits) without getting stuck.
Failed Attempts:
- Pretrained encoders: borrow smarts from other models, but you inherit their limits and can’t fully retune for perfect reconstruction.
- Hybrid CNN–Transformer stacks: helpful inductive biases at small scale, but become bottlenecks as you grow; parts don’t always improve together.
- Stage-wise training: freezing some parts stops the whole system from improving as a team; quality plateaus early.
The Gap:
- Audio needed its own “SentencePiece moment”: a simple, homogeneous, causal, end-to-end architecture trained big on diverse data—so the tokenizer becomes a native, scalable interface for autoregressive audio language models.
🍞 Hook (new concept): You know how a relay team runs faster when they practice together, not one runner at a time?
🥬 The Concept: End-to-end optimization means the encoder, quantizer, decoder, and helpers learn together under one objective.
- How it works:
- Send raw audio through the encoder (learned from scratch).
- Discretize with a quantizer that’s also learnable.
- Rebuild with the decoder, and judge quality with multiple losses (reconstruction, adversarial, semantic).
- Update all parts at once so they co-adapt.
- Why it matters: If you freeze any teammate, the whole relay slows; with end-to-end learning, the team keeps getting faster together.
🍞 Anchor: When training a choir, you practice the whole song with everyone, not just the sopranos—so harmonies lock in. That’s end-to-end.
Real Stakes:
- Better calls, meetings, and captions: clearer speech at low data rates saves bandwidth and battery.
- Faster, smoother voice assistants: strict causality and streaming mean lower lag.
- Music and sound creativity: tokens that capture fine detail and long structure enable richer generation and editing.
- Accessibility: stronger TTS and ASR improve communication for many users. In short, if we want truly native, real-time audio AI, we need a tokenizer that scales like text tokenizers did for LLMs.
02 Core Idea
🍞 Hook: Picture replacing a toolbox full of mismatched gadgets with one sturdy, multipurpose tool that grows stronger the more you use it.
🥬 The Concept: The key insight is to build a fully end-to-end, purely Transformer, strictly causal audio tokenizer (CAT) trained at scale so its discrete tokens become a native interface for autoregressive audio models.
- How it works:
- Use only causal Transformer blocks for both encoder and decoder—no CNNs.
- Patchify raw 24 kHz audio and compress to a very low frame rate (12.5 Hz) for short token sequences.
- Discretize with 32-layer residual vector quantization (RVQ) for variable bitrate.
- Train everything together with reconstruction, adversarial, and semantic (audio-to-text) objectives.
- Why it matters: This simplicity removes fixed bottlenecks, aligns perfectly with AR modeling, and scales predictably with more data, compute, and bits.
🍞 Anchor: It’s like teaching one flexible robot to cook any recipe instead of juggling many single-use gadgets; as you train it on millions of meals, it just keeps getting better.
Multiple Analogies:
- Library analogy: Old systems were like books split across multiple libraries (CNNs, pretrained encoders); CAT is one big, well-organized library where every shelf (Transformer block) speaks the same language.
- Camera analogy: RVQ is like adding lens filters from coarse to fine; use 5 filters for a quick snap (low bitrate) or all 32 for studio quality (high bitrate).
- Highway analogy: Low frame rate is fewer cars on the road; AR models can plan far ahead without getting stuck in traffic.
Before vs After:
- Before: Hybrid parts, pretrained dependencies, stage-wise tuning; good but hard to scale uniformly; training/inference mismatch.
- After: One causal Transformer stack, end-to-end learning, variable bitrate, and streaming; steady gains as model size, data, and batch scale up.
Why It Works (intuition):
- Homogeneous architecture = no rigid choke points; every layer can co-adapt.
- Strict causality = training matches real-time generation; no cheating with future info.
- Low frame rate = shorter token sequences; AR models focus on long-range structure.
- RVQ depth = adjustable detail dial; the same tokens serve low to high bitrate uses.
- Joint semantic supervision = tokens carry meaning and acoustics, helpful for TTS and ASR.
Building Blocks (each as a mini ‘sandwich’):
🍞 Hook: You know how cutting bread into thick or thin slices changes how much fits in a lunchbox? 🥬 Concept: Frame rate is how often we take a token snapshot per second.
- How: CAT maps 24 kHz audio to 12.5 token frames/sec.
- Why: Fewer frames mean shorter sequences and easier AR modeling. 🍞 Anchor: Audiobooks ship faster if you send chapter summaries (low frame rate) that still keep the story.
🍞 Hook: Imagine stacking stickers layer by layer to refine a drawing. 🥬 Concept: Residual Vector Quantization (RVQ) adds coarse-to-fine discrete codes across 32 layers.
- How: Each later layer encodes the remaining error (residual) left by the previous.
- Why: Lets you choose bitrate by how many layers you keep. 🍞 Anchor: Quick doodle? Use 8 stickers. Masterpiece? Use all 32.
🍞 Hook: Think of baking bread while a food critic tastes and gives feedback. 🥬 Concept: Adversarial training uses discriminators to push for realistic sound.
- How: A judge network learns to tell real vs. reconstructed; the generator learns to fool it.
- Why: Better texture and crispness in audio, not just average correctness. 🍞 Anchor: It’s why cymbals shimmer and voices feel alive.
🍞 Hook: Like attaching subtitles to a video. 🥬 Concept: Semantic supervision adds an audio-to-text head so tokens carry meaning.
- How: A decoder-only LLM predicts text from token hidden states.
- Why: Helps TTS/ASR—tokens aren’t just sound; they’re sound-with-meaning. 🍞 Anchor: The token stream for “cat” sounds aligns with the word “cat,” not random noise.
03 Methodology
At a high level: Raw audio → Causal Transformer Encoder + Patchify → 32-layer RVQ Quantizer → Causal Transformer Decoder → Reconstructed audio. Side head: token hidden states → decoder-only LLM (ASR/caption) → text.
Step-by-step details:
- Input and Patchify
- What happens: the 24 kHz mono waveform is broken into patches; between encoder stages, patchify further reduces time resolution (e.g., a 240-sample patch followed by three ×2 reductions) until reaching 12.5 Hz token frames.
- Why this exists: Long audio is too many steps; patching compresses time so the model can see long contexts while staying causal and streamable.
- Example: A 10-second clip becomes about 125 frames (12.5 Hz), instead of thousands of steps.
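The time-compression arithmetic above can be checked directly (a minimal sketch; the 240-sample patch and the three ×2 stages follow the figures quoted in this section):

```python
SAMPLE_RATE = 24_000          # Hz, raw waveform
PATCH = 240                   # samples per initial patch
STAGE_STRIDES = [2, 2, 2]     # further x2 downsampling between encoder stages

def frame_rate():
    rate = SAMPLE_RATE / PATCH            # 24000 / 240 = 100 frames/sec
    for s in STAGE_STRIDES:
        rate /= s                         # 100 -> 50 -> 25 -> 12.5
    return rate

def frames_for(seconds):
    return int(seconds * frame_rate())

print(frame_rate())       # 12.5 token frames per second
print(frames_for(10))     # a 10-second clip becomes 125 frames
```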
- Causal Transformer Encoder
- What happens: A stack of causal self-attention blocks processes patches, only looking backward in time; no CNNs.
- Why: Causality matches AR generation and enables streaming (no peeking at the future). Homogeneity (all Transformer) avoids fixed biases and scales cleanly.
- Example: At t=5.2s, the encoder attends to frames ≤5.2s, never beyond.
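Strict causality can be illustrated with a lower-triangular attention mask (an illustrative NumPy sketch, not the paper's implementation):

```python
import numpy as np

# Strict-causality mask: True where attention is allowed, i.e. frame t may
# attend only to frames <= t, never to future frames.
def causal_mask(n_frames):
    return np.tril(np.ones((n_frames, n_frames), dtype=bool))

mask = causal_mask(4)
print(mask.astype(int))  # lower-triangular: each row sees itself and the past
```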
- Residual Vector Quantization (RVQ) with 32 layers
- What happens: The encoder’s continuous vectors are quantized layer by layer; each new codebook fixes what’s still missing from previous layers.
- Why: Makes bitrate controllable. Keep fewer layers for small files/fast generation; use more for higher fidelity.
- Example: Using 8 layers might target ~1 kbps; using 24–32 layers raises that to ~3–4 kbps with crisper highs and clearer consonants.
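The coarse-to-fine mechanism can be sketched with a toy RVQ. The codebooks here are random and each includes a zero code so that adding a layer never hurts; the paper learns real codebooks end-to-end, so treat this purely as an illustration of the residual idea:

```python
import numpy as np

# Toy residual vector quantization (illustrative codebooks, not learned ones).
rng = np.random.default_rng(0)
DIM, CODES, LAYERS = 8, 16, 32
codebooks = rng.normal(size=(LAYERS, CODES, DIM))
codebooks *= 0.5 ** np.arange(LAYERS)[:, None, None]  # finer codes deeper down
codebooks[:, 0] = 0.0                                 # "no change" escape code

def rvq_encode(x, n_layers):
    """Quantize x with the first n_layers codebooks; return codes + recon."""
    recon, codes = np.zeros_like(x), []
    for cb in codebooks[:n_layers]:
        residual = x - recon                               # what's still missing
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        recon = recon + cb[idx]                            # coarse-to-fine sum
    return codes, recon

x = rng.normal(size=DIM)
for n in (1, 8, 32):
    _, recon = rvq_encode(x, n)
    print(n, round(float(np.linalg.norm(x - recon)), 4))   # error never grows
```

Keeping more layers can only reduce (never increase) the reconstruction error here, which is exactly the "detail dial" the text describes.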
- Quantizer Dropout (training time)
- What happens: Randomly drop deeper RVQ layers during training so the decoder learns to reconstruct from partial stacks.
- Why: Builds robustness across bitrates; the decoder won’t panic if fewer layers are provided at inference.
- Example: Some batches keep only 10 layers; others keep all 32.
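One possible form of this trick, sketched below; only the prefix-keeping idea comes from the text, the sampling rule and shapes are assumptions:

```python
import numpy as np

# Quantizer dropout, sketched: each training example is reconstructed from a
# random prefix of its RVQ layers so the decoder stays robust at any bitrate.
rng = np.random.default_rng(0)
FRAMES, LAYERS = 125, 32
codes = rng.integers(0, 1024, size=(FRAMES, LAYERS))  # full 32-layer token grid

n_keep = rng.integers(1, LAYERS + 1)   # e.g. keep only the first k layers
partial = codes[:, :n_keep]            # deeper layers dropped for this batch
print(partial.shape)                   # (125, n_keep)
```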
- Causal Transformer Decoder (waveform reconstruction)
- What happens: The decoder mirrors the encoder in reverse to rebuild the time-domain waveform from the RVQ tokens, strictly causally.
- Why: Streaming-compatible and matches AR use cases (e.g., TTS decoding on the fly).
- Example: As tokens arrive, the decoder outputs audio frames with low latency.
- Multi-Objective Training (end-to-end)
- What happens: Optimize everything together with a mix of losses:
- Reconstruction: Multi-scale mel/STFT losses encourage spectral accuracy.
- Adversarial + feature matching: Discriminators push for realism.
- Quantizer code/commit losses: Keep codebooks stable and useful.
- Semantic (audio-to-text): A decoder-only LLM reads token hidden states and predicts text for ASR/caption tasks when labels are available.
- Why: Each loss fixes a different failure mode; together they create tokens that are faithful, natural-sounding, and meaningful.
- Example: If cymbals sound dull, adversarial feedback sharpens them; if words drift, the semantic loss realigns.
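As a toy illustration of one piece of this mix, a multi-scale STFT loss can be written in a few lines. The window sizes here are assumptions, and the paper's full objective also includes mel, adversarial, codebook, and semantic terms with weights not given in this summary:

```python
import numpy as np

# Toy multi-scale STFT magnitude loss: compare spectra at several window sizes.
def stft_mag(x, win):
    hop = win // 4
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_scale_stft_loss(ref, est, wins=(256, 512, 1024)):
    return sum(np.mean(np.abs(stft_mag(ref, w) - stft_mag(est, w)))
               for w in wins)

t = np.linspace(0, 1, 24_000, endpoint=False)
ref = np.sin(2 * np.pi * 440 * t)                                  # 440 Hz tone
good = ref + 0.01 * np.random.default_rng(0).normal(size=ref.shape)
bad = ref + 0.3 * np.random.default_rng(1).normal(size=ref.shape)
print(multi_scale_stft_loss(ref, good))   # small: close reconstruction
print(multi_scale_stft_loss(ref, bad))    # larger: noisy reconstruction
```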
- Strict Causality and Streaming Everywhere
- What happens: All components (encoder/quantizer/decoder/LLM) use causal attention and sliding windows; inference mirrors training.
- Why: Prevents train–test mismatch and enables real-time apps.
- Example: Your assistant can listen and respond with minimal delay.
- Variable-Bitrate AR TTS with Progressive Sequence Dropout
- What happens: A Temporal Transformer (over time) and a Depth Transformer (over RVQ layers) autoregressively predict tokens from text and a speaker prompt. Progressive Sequence Dropout randomly trains on shorter RVQ prefixes so one model handles many bitrates.
- Why: One TTS model, many quality/speed trade-offs, no extra parameters.
- Example: For a fast reply, generate the first 8 layers; for a studio read, generate 24–32 layers.
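A minimal sketch of the sampling rule; the exact schedule is an assumption, since the text only specifies that shorter RVQ prefixes are trained on with some probability p:

```python
import random

# Progressive Sequence Dropout, sketched: with probability p, train this
# example on a random RVQ-depth prefix; otherwise use the full 32 layers.
TOTAL_LAYERS = 32

def psd_depth(p, rng):
    if rng.random() < p:
        return rng.randint(1, TOTAL_LAYERS)  # shorter prefix -> lower bitrate
    return TOTAL_LAYERS

rng = random.Random(0)
depths = [psd_depth(0.5, rng) for _ in range(10)]
print(depths)  # sampled target depths for 10 training examples
```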
- ASR Directly from Tokens
- What happens: Feed token embeddings (summed across depth per frame) into a decoder-only LLM to predict text.
- Why: Tests whether tokens carry enough linguistic info without an extra encoder.
- Example: On LibriSpeech/AISHELL-2, the model achieves competitive WER/CER.
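The "summed across depth per frame" step can be sketched with toy shapes; the dimensions, codebook size, and embedding tables below are illustrative assumptions:

```python
import numpy as np

# One embedding table per RVQ layer; each frame's 32 codes are embedded and
# summed into a single vector that the decoder-only LLM consumes.
rng = np.random.default_rng(0)
FRAMES, DEPTH, CODES, DIM = 125, 32, 1024, 16
emb = rng.normal(size=(DEPTH, CODES, DIM))          # per-layer embedding tables
tokens = rng.integers(0, CODES, size=(FRAMES, DEPTH))

# Look up each layer's code embedding, then sum across the depth axis.
frame_inputs = emb[np.arange(DEPTH), tokens].sum(axis=1)
print(frame_inputs.shape)   # (125, 16): one vector per 80 ms frame
```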
The Secret Sauce:
- Homogeneous causal Transformers: simplicity that scales without architectural bottlenecks.
- End-to-end joint optimization: every part improves together; no early plateaus.
- Low frame rate + deep RVQ: short sequences for AR plus adjustable detail.
- Semantic supervision: tokens that blend acoustics and meaning, boosting TTS/ASR.
- Training at scale: 3M hours of diverse audio + large batch sizes = predictable gains.
Mini ‘sandwiches’ for new ideas introduced:
🍞 Hook: Like turning a fast-forward dial. 🥬 Concept: Bitrate is how many bits per second you spend on audio details.
- How: Choose how many RVQ layers to keep (e.g., 8 vs. 32).
- Why: Lower bitrate is smaller/faster; higher bitrate is richer/clearer. 🍞 Anchor: Phone call vs. studio recording—same song, different bit budgets.
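The bitrate arithmetic is simple; the 1024-entry codebook (10 bits per code) is an assumption, chosen because it makes the ~1 kbps at 8 layers and ~4 kbps at 32 layers figures in this document come out exactly:

```python
import math

# Bitrate = frame rate x kept RVQ layers x bits per code.
FRAME_RATE = 12.5                  # token frames per second
BITS_PER_CODE = math.log2(1024)    # 10 bits, assuming 1024-entry codebooks

def kbps(n_layers):
    return FRAME_RATE * n_layers * BITS_PER_CODE / 1000

for n in (6, 8, 24, 32):
    print(n, kbps(n))   # 6 -> 0.75, 8 -> 1.0, 24 -> 3.0, 32 -> 4.0 kbps
```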
🍞 Hook: Think of sliding a window across a book line by line. 🥬 Concept: Sliding-window causal attention limits how far back the model looks, enabling streaming.
- How: Attend inside a moving window (e.g., 10 s) instead of the entire past.
- Why: Saves memory and reduces delay. 🍞 Anchor: You read the current paragraph, not the whole book every time.
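A sliding-window causal mask combines both ideas (an illustrative sketch; the 10 s window quoted above would correspond to 125 frames at 12.5 Hz):

```python
import numpy as np

# Frame t attends only to frames in [t - window + 1, t]: causal AND bounded.
def sliding_causal_mask(n, window):
    t = np.arange(n)
    return (t[None, :] <= t[:, None]) & (t[:, None] - t[None, :] < window)

m = sliding_causal_mask(6, window=3)
print(m.astype(int))  # banded lower triangle: recent past only, never future
```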
🍞 Hook: Like judging bread by both recipe and taste test. 🥬 Concept: Multi-scale mel/STFT losses plus adversarial losses balance accuracy and realism.
- How: Match spectra at multiple window sizes; discriminators score naturalness.
- Why: Prevents audio that is numerically right but sounds wrong. 🍞 Anchor: The music looks right on paper and sounds right to your ears.
04 Experiments & Results
The Test:
- Reconstruction across speech, general audio, and music at multiple bitrates (≈0.75–4 kbps).
- Metrics: SIM (speaker similarity), STOI (intelligibility), PESQ (perceptual quality), mel loss and STFT distance (lower is better for audio/music).
- Streaming/causality maintained.
The Competition:
- Open-source tokenizers/codecs: Encodec, DAC, BigCodec, StableCodec, Mimi, XCodec2.0, XY-Tokenizer, SpeechTokenizer, Higgs-Audio-Tokenizer, MiMo-Audio-Tokenizer, Qwen3-TTS-Tokenizer.
- TTS systems: cascaded (e.g., MaskGCT, IndexTTS2), non-autoregressive (F5-TTS), and prior discrete AR models (Llasa, SparkTTS, etc.).
The Scoreboard (with context):
- Reconstruction: MOSS-Audio-Tokenizer is state-of-the-art across bitrates. For speech at higher bitrates (3–4 kbps), SIM≈0.96–0.97 and PESQ≈3.9+, which is like getting an A+ when others get A or B+. At low bitrates (~0.75–1 kbps), it stays strong where some baselines drop sharply.
- Audio/music: Lower mel/STFT distances than or competitive with others; quality predictably improves as bitrate and model capacity increase.
- Subjective listening (MUSHRA): Consistent high scores across bitrates; models tuned for a single bitrate can match near their sweet spot but fall off outside it, while MOSS stays robust.
TTS Results:
- The fully autoregressive CAT-TTS (Temporal + Depth Transformers) beats prior discrete AR models in speaker similarity and matches or exceeds top cascaded/NAR systems in low error rates (WER typically <2%).
- Progressive Sequence Dropout: With dropout p in {0.25, 0.5, 1.0}, quality at each bitrate is stable; the exact p doesn’t matter much, but higher p saves GPU memory—a pleasant surprise.
- Variable bitrate at inference: choose RVQ depth to trade speed for quality with a single model.
ASR Results:
- Directly feeding CAT tokens into a decoder-only LLM yields competitive WER/CER on LibriSpeech and AISHELL-2—even without a specialized audio encoder—showing tokens preserve strong linguistic content.
Scaling Studies:
- End-to-end vs. partial training: End-to-end keeps improving without early saturation; partial (freezing parts) plateaus, like a team that stops practicing together.
- Parameters: 319M → 1.17B+ encoder-decoder; bigger models help more at higher bitrates. At very low bitrates, bitrate (RVQ depth) becomes the main bottleneck, revealing co-dependence of model size and quantization depth.
- Batch size: Larger global batches consistently yield better reconstruction at the same step budget, and curves keep rising, indicating compute translates reliably into quality.
Surprising Findings:
- The dropout probability in Progressive Sequence Dropout has little effect on TTS quality across bitrates but greatly reduces training memory use—so you can train efficiently without quality loss.
- Co-scaling is essential: simply making the model larger without enough RVQ layers (bits) underutilizes capacity; adding bits without model capacity also underdelivers.
05 Discussion & Limitations
Limitations:
- Data hunger: Training on ~3M hours is costly; smaller organizations may not replicate scale.
- Compute demand: 1.6B parameters plus discriminators and a semantic head require serious training infrastructure.
- Ultra-low-bitrate edge cases: While robust, extreme compression can still blur sibilants or fast transients.
- Music nuance: Complex genres and dense mixes at very low bitrates remain challenging.
- Language/script diversity: Semantic alignment quality can vary with underrepresented languages in paired data.
Required Resources:
- Large-scale audio corpora (speech, sounds, music), some with text labels for semantic supervision.
- Multi-GPU/TPU training with bf16, memory-efficient attention, and adversarial training stability tricks.
- Streaming-aware data pipelines to sustain big batch sizes.
When NOT to Use:
- If you need a tiny on-device model with minimal memory/compute, simpler codecs may be preferable.
- If your task insists on handcrafted features (e.g., strict telephony standards), a flexible end-to-end approach might be overkill.
- If you cannot provide enough training data or compute, you won’t realize the scaling benefits.
Open Questions:
- How far can causal Transformers scale for even lower frame rates (<12.5 Hz) without losing fidelity?
- What’s the best co-scaling schedule between parameters, RVQ depth, and data size?
- Can we reduce reliance on adversarial training while keeping the same realism?
- How well do tokens transfer to non-speech audio understanding tasks (e.g., event detection) without extra heads?
- Can on-device distilled versions keep streaming benefits while fitting mobile constraints?
06 Conclusion & Future Work
Three-Sentence Summary:
- This paper introduces CAT, a fully end-to-end, purely Transformer, strictly causal audio tokenizer that serves as a native discrete interface for autoregressive audio models.
- Scaled to 1.6B parameters and trained on ~3M hours, MOSS-Audio-Tokenizer achieves state-of-the-art reconstruction across speech/sound/music and powers a fully autoregressive TTS that outperforms prior NAR and cascaded systems, while enabling competitive ASR without an auxiliary encoder.
- Its homogeneous design, low frame rate, deep RVQ, semantic supervision, and end-to-end optimization deliver predictable scaling with model size, data, and batch.
Main Achievement:
- Turning audio tokenization into a simple, scalable, end-to-end causal Transformer pipeline that yields robust, semantically rich tokens for compression, generation, and understanding—all in one interface.
Future Directions:
- Co-scaling recipes for parameters, RVQ depth, and data; compressed/distilled variants for edge devices; broader multilingual/long-form music training; exploring alternative realism objectives beyond adversarial losses.
Why Remember This:
- Just as subword tokenizers unlocked LLMs, CAT-style tokenization may be the standard interface that unlocks native, real-time, end-to-end audio foundation models—one simple architecture that keeps getting better as you scale.
Practical Applications
- Real-time voice assistants with lower latency and more natural voices.
- High-quality, low-bitrate conferencing to save bandwidth without losing clarity.
- Audio editing and remixing tools that work directly on discrete tokens for fine control.
- Accessible reading aids that produce clearer, more expressive TTS in many languages.
- Smart captioning and transcription services powered by token-based ASR.
- Game and VR sound engines that stream rich audio efficiently with adjustable bitrate.
- Music generation systems that balance long-range structure and fine detail at will.
- On-device speech features via distilled versions for mobile and embedded hardware.
- Broadcast and podcast production pipelines with token-aware compression and repair.
- Multimodal chatbots that natively understand and speak across speech, text, and sound.